A more cursed technique... =) #3

billywhizz opened this issue Sep 18, 2023 · 6 comments


billywhizz commented Sep 18, 2023

Thanks for this Bryan - I didn't know this was possible. Out of interest, I ran some benchmarks of this on a custom v8 runtime I am hacking on and compared it to another technique I have been playing with. Of course, this is very dangerous and not something I would expect to see in Node.js or Deno, but the numbers are interesting all the same.

The technique I use is:

  • use system calloc (via FFI or bindings) to allocate the memory and get back an address
  • wrap the allocated memory in a backing store with an empty deleter so it will never be freed by v8
  • use system free (via FFI or bindings) to free the memory when we are done; the wrapping ArrayBuffer should be collected at some point by GC

This proves to be ~30 times faster on my setup, but your detach technique does not seem to free the memory for the wrapping ArrayBuffer in the hot loop, so I see memory constantly growing.

[Screenshot from 2023-09-18 15-20-36]

This is what the JS code looks like. I had to set the --allow-natives-syntax flag on the command line, as the v8 build I am on barfs when I try to change the flags after initialising the v8 platform.

import { Bench } from 'lib/bench.js'
import { system } from 'lib/system.js'

const { wrapMemory } = spin

const handle = {
  buffer: new ArrayBuffer(0),
  address: 0
}

function allocCBuffer(size) {
  const address = system.calloc(1, size)
  handle.buffer = wrapMemory(address, address + size)
  handle.address = address
}

function makeDetach () {
  // wrap the natives syntax in a Function constructor so the main script
  // still parses even when --allow-natives-syntax is not set
  const internalDetach = new Function('buf', '%ArrayBufferDetach(buf)')
  return function detach (buf) {
    // accept a TypedArray/DataView, but only detach if it spans the
    // whole underlying ArrayBuffer
    if (buf.buffer) {
      if (buf.byteOffset !== 0 || buf.byteLength !== buf.buffer.byteLength) return
      buf = buf.buffer
    }
    internalDetach(buf)
  }
}
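As an aside, on setups where natives syntax is not available at all, a similar detach can be done in plain JS by transferring the buffer: structuredClone with a transfer list detaches the source buffer, and newer V8 versions also ship the standards-track ArrayBuffer.prototype.transfer(). A minimal sketch (portableDetach is a hypothetical name, not part of the code above):

```javascript
// Detach an ArrayBuffer without natives syntax by transferring it.
// structuredClone with a transfer list detaches the source buffer;
// the resulting clone is simply discarded.
function portableDetach (buf) {
  structuredClone(buf, { transfer: [buf] })
}

const buf = new ArrayBuffer(1024)
portableDetach(buf)
console.log(buf.byteLength) // 0 - the buffer is detached
```

This goes through the structured clone machinery, so it is unlikely to match a raw %ArrayBufferDetach call for speed, but it needs no flags.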

const detach = makeDetach()

const bench = new Bench()

while (1) {


for (let i = 0; i < 5; i++) {
  bench.start('buffers')
  for (let j = 0; j < 2500; j++) {
    const buf = new ArrayBuffer(100 * 1024 * 1024)
  }
  bench.end(2500)
}

for (let i = 0; i < 5; i++) {
  bench.start('buffers detach')
  for (let j = 0; j < 3000; j++) {
    const buf = new ArrayBuffer(100 * 1024 * 1024)
    detach(buf)
  }
  bench.end(3000)
}

for (let i = 0; i < 5; i++) {
  bench.start('c-buffers')
  for (let j = 0; j < 100000; j++) {
    allocCBuffer(100 * 1024 * 1024)
    system.free(handle.address)
    detach(handle.buffer)
  }
  bench.end(100000)
}

}

Will have a further look when I get a chance and hopefully I can share this code soon.

v8/C++ WrapMemory Function

void spin::WrapMemory(const FunctionCallbackInfo<Value> &args) {
  Isolate* isolate = args.GetIsolate();
  uint64_t start64 = (uint64_t)Local<Integer>::Cast(args[0])->Value();
  uint64_t end64 = (uint64_t)Local<Integer>::Cast(args[1])->Value();
  const uint64_t size = end64 - start64;
  void* start = reinterpret_cast<void*>(start64);
  // renamed from `free` to avoid shadowing the C library function
  int free_memory = 0;
  if (args.Length() > 2) free_memory = Local<Integer>::Cast(args[2])->Value();
  if (free_memory == 0) {
    std::unique_ptr<BackingStore> backing = ArrayBuffer::NewBackingStore(
        start, size, v8::BackingStore::EmptyDeleter, nullptr);
    // this line causes memory allocation that never seems to be collected
    Local<ArrayBuffer> ab = ArrayBuffer::New(isolate, std::move(backing));
    args.GetReturnValue().Set(ab);
    return;
  }
  std::unique_ptr<BackingStore> backing = ArrayBuffer::NewBackingStore(
      start, size, spin::FreeMemory, nullptr);
  Local<ArrayBuffer> ab = ArrayBuffer::New(isolate, std::move(backing));
  args.GetReturnValue().Set(ab);
}

This is all horribly dangerous of course, but it's fun to test the boundaries of what v8/JS can do.

@billywhizz
Author

I also tried this detach technique with the process pinned to a single core, and the rate is pretty much the same as the normal way of doing it - if anything, a tiny bit slower. So it's trading increased CPU usage (for GC, on another thread) up front against reduced memory usage, as far as I can see.


bengl commented Sep 18, 2023

This is some interesting work. Thanks for digging in!

> This proves to be ~30 times faster on my setup

Faster than not freeing them, and using regular ArrayBuffers, right?

I had explored putting this sort of approach together, with a new subclass of ArrayBuffer called "DisposableArrayBuffer", which would be allocated much like in your approach. I ultimately decided against it, since the ability to do this without a native addon (or modifying Node.js itself), and for any arbitrary ArrayBuffer, is very compelling. It also means that if you don't detach, the GC can still do its job later on as normal.
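For what it's worth, the subclass idea could be sketched in plain JS roughly like this (DisposableArrayBuffer here is a hypothetical illustration, not the actual design that was explored), using structuredClone's transfer list to detach on dispose:

```javascript
// Hypothetical sketch: an ArrayBuffer paired with an explicit dispose()
// that detaches it, so the backing memory can be released eagerly
// instead of waiting for GC.
class DisposableArrayBuffer {
  constructor (size) {
    this.buffer = new ArrayBuffer(size)
  }

  dispose () {
    // transferring detaches the source buffer; the clone is discarded
    structuredClone(this.buffer, { transfer: [this.buffer] })
  }
}

const dab = new DisposableArrayBuffer(1024)
console.log(dab.buffer.byteLength) // 1024
dab.dispose()
console.log(dab.buffer.byteLength) // 0 after detach
```

The downside, as noted above, is that it only works for buffers allocated through the wrapper, not for arbitrary ArrayBuffers.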

> your detach technique does not seem to work for me in freeing up the memory for the wrapping ArrayBuffer in the hot loop so I see memory constantly growing

Are you using a custom ArrayBufferAllocator? Or the default V8 one? Or something akin to what Node.js does? I wonder if that's what makes the difference here.

> I had to set the --allow-natives-syntax flag on the command line as v8 i am on barfs when i try to change the flags after initialising v8 platform.

If you're doing that, then you don't need to use the Function constructor - you can put the natives syntax in your code directly, even within your benchmarks. No need to wrap it at all.


billywhizz commented Sep 19, 2023

I've been fiddling around with this approach and it's broken in various ways. Trying to find an efficient (and safe) way to wrap external memory in v8.

FYI - I think the speed improvement is likely down to the fact that I am never writing to the memory, and calloc always seems to return the same block of memory if I free it directly afterwards and run in a tight loop.


billywhizz commented Oct 3, 2023

BTW - it turns out the memory leak I experienced was down to a current bug in v8 when pointer compression is enabled. Thanks to the Deno folks for documenting it!


billywhizz commented Oct 3, 2023

Another update: I built a new v8 static library on the latest v8 beta branch, which fixed the issue above but meant I had to turn off pointer compression. I have verified that your technique is indeed faster in a tight loop than leaving v8 to deal with de-allocation, but the results from doing a separate call to calloc and then wrapping the memory in a buffer with no dispose callback are pretty insane. 🤯 Over 30x faster.

I'll have to have a dig into the v8 source to try to understand why. We are not touching the memory we are allocating, so it may just be the fact that the memory does not have to be filled with zeros each time around.

[Screenshot from 2023-10-03 19-07-13]
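One data point in favour of the zero-fill theory: the ECMAScript spec requires a new ArrayBuffer to read as all zeros, so v8 has to produce zeroed memory on every allocation, while calloc backed by fresh mmap pages can leave the zeroing (and even the physical allocation) to the kernel's lazy page faulting - pages that are never touched may never be materialized at all. A quick check of the zero-fill guarantee:

```javascript
// Every new ArrayBuffer must read as all zeros per the ECMAScript spec,
// so the JS allocator pays for zeroed memory on each allocation.
const bytes = new Uint8Array(new ArrayBuffer(1024 * 1024))
console.log(bytes.every(b => b === 0)) // true - guaranteed zero-filled
```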

@billywhizz
Author

This is what the JS benchmark looks like:

import { Bench } from 'lib/bench.js'
import { system } from 'lib/system.js'

const { wrapMemory, unwrapMemory, assert } = spin

const bench = new Bench()

let runs = 0
const size = 100 * 1024 * 1024

while (1) {
  runs = 6000

  for (let i = 0; i < 5; i++) {
    bench.start(`new ArrayBuffer ${size}`)
    for (let j = 0; j < runs; j++) {
      const buf = new ArrayBuffer(size)
      assert(buf.byteLength === size)
    }
    bench.end(runs)
  }

  runs = 6000

  for (let i = 0; i < 5; i++) {
    bench.start(`new ArrayBuffer w/unwrap ${size}`)
    for (let j = 0; j < runs; j++) {
      const buf = new ArrayBuffer(size)
      assert(buf.byteLength === size)
      unwrapMemory(buf)
      assert(buf.byteLength === 0)
    }
    bench.end(runs)
  }

  runs = 180000

  for (let i = 0; i < 5; i++) {
    bench.start(`calloc/wrap external ${size}`)
    for (let j = 0; j < runs; j++) {
      const address = system.calloc(1, size)
      const buf = wrapMemory(address, size, 0)
      assert(buf.byteLength === size)
      system.free(address)
    }
    bench.end(runs)
  }

  runs = 180000

  for (let i = 0; i < 5; i++) {
    bench.start(`calloc/wrap external w/unwrap ${size}`)
    for (let j = 0; j < runs; j++) {
      const address = system.calloc(1, size)
      const buf = wrapMemory(address, size, 0)
      assert(buf.byteLength === size)
      system.free(address)
      unwrapMemory(buf)
      assert(buf.byteLength === 0)
    }
    bench.end(runs)
  }

  runs = 6000

  for (let i = 0; i < 5; i++) {
    bench.start(`calloc/wrap internal ${size}`)
    for (let j = 0; j < runs; j++) {
      const address = system.calloc(1, size)
      const buf = wrapMemory(address, size, 1)
      assert(buf.byteLength === size)
    }
    bench.end(runs)
  }

  runs = 6000

  for (let i = 0; i < 5; i++) {
    bench.start(`calloc/wrap internal w/unwrap ${size}`)
    for (let j = 0; j < runs; j++) {
      const address = system.calloc(1, size)
      const buf = wrapMemory(address, size, 1)
      assert(buf.byteLength === size)
      unwrapMemory(buf)
      assert(buf.byteLength === 0)
    }
    bench.end(runs)
  }

  runs = 6000000

  for (let i = 0; i < 5; i++) {
    const address = system.calloc(1, size)
    bench.start(`wrap existing external ${size}`)
    for (let j = 0; j < runs; j++) {
      const buf = wrapMemory(address, size, 0)
      assert(buf.byteLength === size)
    }
    bench.end(runs)
    system.free(address)
  }

  runs = 6000000

  for (let i = 0; i < 5; i++) {
    const address = system.calloc(1, size)
    bench.start(`wrap existing external w/unwrap ${size}`)
    for (let j = 0; j < runs; j++) {
      const buf = wrapMemory(address, size, 0)
      assert(buf.byteLength === size)
      unwrapMemory(buf)
      assert(buf.byteLength === 0)
    }
    bench.end(runs)
    system.free(address)
  }
}

And the wrapMemory and unwrapMemory implementations in C++:

void spin::WrapMemory(const FunctionCallbackInfo<Value> &args) {
  Isolate* isolate = args.GetIsolate();
  uint64_t start64 = (uint64_t)Local<Integer>::Cast(args[0])->Value();
  uint32_t size = (uint32_t)Local<Integer>::Cast(args[1])->Value();
  void* start = reinterpret_cast<void*>(start64);
  int32_t free_memory = 0;
  if (args.Length() > 2) {
    free_memory = (int32_t)Local<Integer>::Cast(args[2])->Value();
  }
  if (free_memory == 0) {
    std::unique_ptr<BackingStore> backing = ArrayBuffer::NewBackingStore(
        start, size, v8::BackingStore::EmptyDeleter, nullptr);
    Local<ArrayBuffer> ab = ArrayBuffer::New(isolate, std::move(backing));
    args.GetReturnValue().Set(ab);
    return;
  }
  std::unique_ptr<BackingStore> backing = ArrayBuffer::NewBackingStore(
      start, size, spin::FreeMemory, nullptr);
  Local<ArrayBuffer> ab = ArrayBuffer::New(isolate, std::move(backing));
  args.GetReturnValue().Set(ab);
}

void spin::UnWrapMemory(const FunctionCallbackInfo<Value> &args) {
  Local<ArrayBuffer> ab = args[0].As<ArrayBuffer>();
  // detach the buffer; the backing store's deleter (if any) decides
  // whether the underlying memory is actually freed
  ab->Detach();
}
