A more cursed technique... =) #3

billywhizz opened this issue Sep 18, 2023 · 6 comments


billywhizz commented Sep 18, 2023

Thanks for this Bryan - I didn't know this was possible. Out of interest, I ran some benchmarks of this on a custom v8 runtime I am hacking on and compared it to another technique I have been playing with. Of course, this is very dangerous and not something I would expect to see in Node.js or Deno, but the numbers are interesting all the same.

The technique I use is:

  • use system calloc (via FFI or bindings) to allocate the memory and get back an address
  • wrap the allocated memory in a backing store with an empty deleter so it will never be freed by v8
  • use system free (via FFI or bindings) to free the memory when we are done; the wrapping ArrayBuffer should be collected at some point by GC

This proves to be ~30 times faster on my setup, but your detach technique does not seem to free the memory for the wrapping ArrayBuffer in the hot loop, so I see memory constantly growing.

[Screenshot from 2023-09-18 15-20-36]

This is what the JS code looks like. I had to set the --allow-natives-syntax flag on the command line, as the v8 build I am on barfs when I try to change the flags after initialising the v8 platform.

import { Bench } from 'lib/bench.js'
import { system } from 'lib/system.js'

const { wrapMemory } = spin

const handle = {
  buffer: new ArrayBuffer(0),
  address: 0
}

function allocCBuffer(size) {
  const address = system.calloc(1, size)
  handle.buffer = wrapMemory(address, address + size)
  handle.address = address
}

function makeDetach () {
  // wrap the natives syntax in a Function constructor so the main script
  // still parses even when --allow-natives-syntax is not set
  const internalDetach = new Function('buf', '%ArrayBufferDetach(buf)')
  return function detach (buf) {
    // accept a TypedArray/DataView, but only detach if it spans the
    // whole underlying ArrayBuffer
    if (buf.buffer) {
      if (buf.byteOffset !== 0 || buf.byteLength !== buf.buffer.byteLength) return
      buf = buf.buffer
    }
    internalDetach(buf)
  }
}
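As an aside, on setups where natives syntax is not available at all, a similar detach can be done in plain JS by transferring the buffer: structuredClone with a transfer list detaches the source buffer, and newer V8 versions also ship the standards-track ArrayBuffer.prototype.transfer(). A minimal sketch (portableDetach is a hypothetical name, not part of the code above):

```javascript
// Detach an ArrayBuffer without natives syntax by transferring it.
// structuredClone with a transfer list detaches the source buffer;
// the resulting clone is simply discarded.
function portableDetach (buf) {
  structuredClone(buf, { transfer: [buf] })
}

const buf = new ArrayBuffer(1024)
portableDetach(buf)
console.log(buf.byteLength) // 0 - the buffer is detached
```

This goes through the structured clone machinery, so it is unlikely to match a raw %ArrayBufferDetach call for speed, but it needs no flags.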

const detach = makeDetach()

const bench = new Bench()

while (1) {


for (let i = 0; i < 5; i++) {
  bench.start('buffers')
  for (let j = 0; j < 2500; j++) {
    const buf = new ArrayBuffer(100 * 1024 * 1024)
  }
  bench.end(2500)
}

for (let i = 0; i < 5; i++) {
  bench.start('buffers detach')
  for (let j = 0; j < 3000; j++) {
    const buf = new ArrayBuffer(100 * 1024 * 1024)
    detach(buf)
  }
  bench.end(3000)
}

for (let i = 0; i < 5; i++) {
  bench.start('c-buffers')
  for (let j = 0; j < 100000; j++) {
    allocCBuffer(100 * 1024 * 1024)
    system.free(handle.address)
    detach(handle.buffer)
  }
  bench.end(100000)
}

}

Will have a further look when I get a chance and hopefully I can share this code soon.

v8/C++ WrapMemory Function

void spin::WrapMemory(const FunctionCallbackInfo<Value> &args) {
  Isolate* isolate = args.GetIsolate();
  uint64_t start64 = (uint64_t)Local<Integer>::Cast(args[0])->Value();
  uint64_t end64 = (uint64_t)Local<Integer>::Cast(args[1])->Value();
  const uint64_t size = end64 - start64;
  void* start = reinterpret_cast<void*>(start64);
  // renamed from `free` to avoid shadowing the C library function
  int free_memory = 0;
  if (args.Length() > 2) free_memory = Local<Integer>::Cast(args[2])->Value();
  if (free_memory == 0) {
    std::unique_ptr<BackingStore> backing = ArrayBuffer::NewBackingStore(
        start, size, v8::BackingStore::EmptyDeleter, nullptr);
    // this line causes memory allocation that never seems to be collected
    Local<ArrayBuffer> ab = ArrayBuffer::New(isolate, std::move(backing));
    args.GetReturnValue().Set(ab);
    return;
  }
  std::unique_ptr<BackingStore> backing = ArrayBuffer::NewBackingStore(
      start, size, spin::FreeMemory, nullptr);
  Local<ArrayBuffer> ab = ArrayBuffer::New(isolate, std::move(backing));
  args.GetReturnValue().Set(ab);
}

This is all horribly dangerous of course, but it's fun to test the boundaries of what v8/JS can do.

@billywhizz
Author

I also tried this detach technique with the process pinned to a single core, and the rate is pretty much the same as the normal way of doing it - if anything, a tiny bit slower. So it's trading increased CPU usage (for GC, on another thread) up front against reduced memory usage, as far as I can see.


bengl commented Sep 18, 2023

This is some interesting work. Thanks for digging in!

> This proves to be ~30 times faster on my setup

Faster than not freeing them, and using regular ArrayBuffers, right?

I had explored putting this sort of approach together, with a new subclass of ArrayBuffer called "DisposableArrayBuffer", which would be allocated much like in your approach. I ultimately decided against it, since the ability to do this without a native addon (or modifying Node.js itself), and for any arbitrary ArrayBuffer, is very compelling. It also means that if you don't detach, the GC can still do its job later on as normal.
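For what it's worth, the subclass idea could be sketched in plain JS roughly like this (DisposableArrayBuffer here is a hypothetical illustration, not the actual design that was explored), using structuredClone's transfer list to detach on dispose:

```javascript
// Hypothetical sketch: an ArrayBuffer paired with an explicit dispose()
// that detaches it, so the backing memory can be released eagerly
// instead of waiting for GC.
class DisposableArrayBuffer {
  constructor (size) {
    this.buffer = new ArrayBuffer(size)
  }

  dispose () {
    // transferring detaches the source buffer; the clone is discarded
    structuredClone(this.buffer, { transfer: [this.buffer] })
  }
}

const dab = new DisposableArrayBuffer(1024)
console.log(dab.buffer.byteLength) // 1024
dab.dispose()
console.log(dab.buffer.byteLength) // 0 after detach
```

The downside, as noted above, is that it only works for buffers allocated through the wrapper, not for arbitrary ArrayBuffers.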

> your detach technique does not seem to work for me in freeing up the memory for the wrapping ArrayBuffer in the hot loop so I see memory constantly growing

Are you using a custom ArrayBufferAllocator? Or the default V8 one? Or something akin to what Node.js does? I wonder if that's what makes the difference here.

> I had to set the --allow-natives-syntax flag on the command line as v8 i am on barfs when i try to change the flags after initialising v8 platform.

If you're doing that, then you don't need to use the Function constructor - you can put the natives syntax in your code directly, even within your benchmarks. No need to wrap it at all.


billywhizz commented Sep 19, 2023

I've been fiddling around with this approach and it's broken in various ways. Trying to find an efficient (and safe) way to wrap external memory in v8.

FYI - I think the speed improvement is likely down to the fact that I am never writing to the memory, and calloc always seems to return the same block of memory if I free it directly afterwards and run in a tight loop.


billywhizz commented Oct 3, 2023

BTW - it turns out the memory leak I experienced was down to a current bug in v8 when pointer compression is enabled. Thanks to the Deno folks for documenting it!


billywhizz commented Oct 3, 2023

Another update: I built a new v8 static library on the latest v8 beta branch, which fixed the issue above but meant I had to turn off pointer compression. I have verified that your technique is indeed faster in a tight loop than leaving v8 to deal with de-allocation, but the results from doing a separate call to calloc and then wrapping the memory in a buffer with no dispose callback are pretty insane. 🤯 Over 30x faster.

I'll have to have a dig into the v8 source to try to understand why. We are not touching the memory we are allocating, so it may just be the fact that the memory does not have to be filled with zeros each time around.

[Screenshot from 2023-10-03 19-07-13]
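One data point in favour of the zero-fill theory: the ECMAScript spec requires a new ArrayBuffer to read as all zeros, so v8 has to produce zeroed memory on every allocation, while calloc backed by fresh mmap pages can leave the zeroing (and even the physical allocation) to the kernel's lazy page faulting - pages that are never touched may never be materialized at all. A quick check of the zero-fill guarantee:

```javascript
// Every new ArrayBuffer must read as all zeros per the ECMAScript spec,
// so the JS allocator pays for zeroed memory on each allocation.
const bytes = new Uint8Array(new ArrayBuffer(1024 * 1024))
console.log(bytes.every(b => b === 0)) // true - guaranteed zero-filled
```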

@billywhizz
Author

This is what the JS benchmark looks like:

import { Bench } from 'lib/bench.js'
import { system } from 'lib/system.js'

const { wrapMemory, unwrapMemory, assert } = spin

const bench = new Bench()

let runs = 0
const size = 100 * 1024 * 1024

while (1) {
  runs = 6000

  for (let i = 0; i < 5; i++) {
    bench.start(`new ArrayBuffer ${size}`)
    for (let j = 0; j < runs; j++) {
      const buf = new ArrayBuffer(size)
      assert(buf.byteLength === size)
    }
    bench.end(runs)
  }

  runs = 6000

  for (let i = 0; i < 5; i++) {
    bench.start(`new ArrayBuffer w/unwrap ${size}`)
    for (let j = 0; j < runs; j++) {
      const buf = new ArrayBuffer(size)
      assert(buf.byteLength === size)
      unwrapMemory(buf)
      assert(buf.byteLength === 0)
    }
    bench.end(runs)
  }

  runs = 180000

  for (let i = 0; i < 5; i++) {
    bench.start(`calloc/wrap external ${size}`)
    for (let j = 0; j < runs; j++) {
      const address = system.calloc(1, size)
      const buf = wrapMemory(address, size, 0)
      assert(buf.byteLength === size)
      system.free(address)
    }
    bench.end(runs)
  }

  runs = 180000

  for (let i = 0; i < 5; i++) {
    bench.start(`calloc/wrap external w/unwrap ${size}`)
    for (let j = 0; j < runs; j++) {
      const address = system.calloc(1, size)
      const buf = wrapMemory(address, size, 0)
      assert(buf.byteLength === size)
      system.free(address)
      unwrapMemory(buf)
      assert(buf.byteLength === 0)
    }
    bench.end(runs)
  }

  runs = 6000

  for (let i = 0; i < 5; i++) {
    bench.start(`calloc/wrap internal ${size}`)
    for (let j = 0; j < runs; j++) {
      const address = system.calloc(1, size)
      const buf = wrapMemory(address, size, 1)
      assert(buf.byteLength === size)
    }
    bench.end(runs)
  }

  runs = 6000

  for (let i = 0; i < 5; i++) {
    bench.start(`calloc/wrap internal w/unwrap ${size}`)
    for (let j = 0; j < runs; j++) {
      const address = system.calloc(1, size)
      const buf = wrapMemory(address, size, 1)
      assert(buf.byteLength === size)
      unwrapMemory(buf)
      assert(buf.byteLength === 0)
    }
    bench.end(runs)
  }

  runs = 6000000

  for (let i = 0; i < 5; i++) {
    const address = system.calloc(1, size)
    bench.start(`wrap existing external ${size}`)
    for (let j = 0; j < runs; j++) {
      const buf = wrapMemory(address, size, 0)
      assert(buf.byteLength === size)
    }
    bench.end(runs)
    system.free(address)
  }

  runs = 6000000

  for (let i = 0; i < 5; i++) {
    const address = system.calloc(1, size)
    bench.start(`wrap existing external w/unwrap ${size}`)
    for (let j = 0; j < runs; j++) {
      const buf = wrapMemory(address, size, 0)
      assert(buf.byteLength === size)
      unwrapMemory(buf)
      assert(buf.byteLength === 0)
    }
    bench.end(runs)
    system.free(address)
  }
}

And the wrapMemory and unwrapMemory implementations in C++:

void spin::WrapMemory(const FunctionCallbackInfo<Value> &args) {
  Isolate* isolate = args.GetIsolate();
  uint64_t start64 = (uint64_t)Local<Integer>::Cast(args[0])->Value();
  uint32_t size = (uint32_t)Local<Integer>::Cast(args[1])->Value();
  void* start = reinterpret_cast<void*>(start64);
  int32_t free_memory = 0;
  if (args.Length() > 2) {
    free_memory = (int32_t)Local<Integer>::Cast(args[2])->Value();
  }
  if (free_memory == 0) {
    std::unique_ptr<BackingStore> backing = ArrayBuffer::NewBackingStore(
        start, size, v8::BackingStore::EmptyDeleter, nullptr);
    Local<ArrayBuffer> ab = ArrayBuffer::New(isolate, std::move(backing));
    args.GetReturnValue().Set(ab);
    return;
  }
  std::unique_ptr<BackingStore> backing = ArrayBuffer::NewBackingStore(
      start, size, spin::FreeMemory, nullptr);
  Local<ArrayBuffer> ab = ArrayBuffer::New(isolate, std::move(backing));
  args.GetReturnValue().Set(ab);
}

void spin::UnWrapMemory(const FunctionCallbackInfo<Value> &args) {
  Local<ArrayBuffer> ab = args[0].As<ArrayBuffer>();
  // detach the buffer; the backing store's deleter (if any) decides
  // whether the underlying memory is actually freed
  ab->Detach();
}
