optimize time wasted in passing argument and fetching return #137

muayyad-alsadi · 2023-03-21T09:39:10Z

While discussing #96 I was able to identify a bottleneck
there was needlessly large time wasted in passing the arguments and retrieving the return

basically I made a simple 32-bit hash function cdb_djp_hash.c

in pure python the time is proportional to string length
in wasm it was almost constant time ~40ms regardless of string length 13 bytes to 1300 bytes (100x)
which means that most of those 40ms are in passing the arguments not in the actual loop (if it was the loop it would be proportional to string length or a large fraction of that).

This was confirmed using

pr=Profile()
pr.enable()
for i in range(10000):
        cdb_djp_hash_wasm(large_a)
pr.disable()
#pr.print_stats()
#pr.print_stats('cumulative')
pr.print_stats('calls')

the first note that the actual WASM call is fast (169ms out of 800ms)
which means that we might able to squeeze ~650ms of 800ms making it 4x faster

and I was able to identify bottleneck

things that take too much cumulative time
things that are called more than 10k (number of iterations)

examples of such things

too many calls to isinstance which is 170k (in a 10k iteration benchmark), that is 17 times per call
converting the argument _value.py:129(_convert) too ~200ms out of 800ms
we got dynamically sized lists operations like {method 'append' of 'list' objects}

Are we sure that passing arguments / converting can be made faster?
Yes

I can confirm _value.py:129(_convert) that is too slow

I've moved the conversion outside the loop and confirmed it was the bottleneck

    i32_start = Val.i32(start)
    i32_size = Val.i32(size)
    for i in range(10000):
        np_mem[start:stop]=input
        hash_wasm = ctypes.c_uint32(_cdb_djp_hash_wasm(i32_start, i32_size)).value

also I can confirm that this time is just waste because passing parameter should do the following

convert little endian (which should not take time see below)
set memory just after the stack pointer (which is now fast using Investigate adding the Python Buffer Protocol on the memory #135 Setting a slice of memory without having to enumerate #81 FIXES #81: 400x faster slice assignment #134 )
move the stack pointer (which should not take time)

to proof that we can convert in almost no time

in_struct=Struct('<LL')
in_pack=in_struct.pack
for i in range(10000):
    b=pack(start, size)

The text was updated successfully, but these errors were encountered:

muayyad-alsadi · 2023-03-21T10:01:52Z

in _value._convert
we see a lot of isinstance() and _func.call do a lot of work on each call despite the fact that the signature of the function does not change at runtime

I believe in the constructor of class Func we need to set a property

in_fmt = ''.join([i.struct_code for i in ty])
self._in_struct=Struct('<'+in_fmt) # `<` because WASM is little endian
self._in_pack=self._in_struct.pack
self._out_struct=Struct('<'+out_fmt)
self._out_unpack=self._in_struct.unpack

where Val.struct_code is b, B, l, L, q, Q, f, d for i8, u8, i32, u32, i64, u64, f32, f64 respectively

and this should be done once on Func constructor not on every code
then when we want to encode the arguments

b=self._in_pack(...args)

muayyad-alsadi · 2023-03-21T21:57:36Z

I was able to make it 4x faster
running 10k iteration of this hash function typically took ~350ms and now ~80ms

old:  0.3466089940011443
new:  0.0788318810009514

here is the code

def cdb_djp_hash_wasm(input: bytearray):
    size = len(input)
    start = hb
    stop = start + size
    np_mem[start:stop]=input
    params = (start, size)
    raw=(wasmtime_val_raw_t*len(params))()
    for row_i, param in zip(raw, params): row_i.i32=param
    raw_ptr_casted = ctypes.cast(raw, ctypes.POINTER(wasmtime_val_raw_t))
    with enter_wasm(store) as trap:
            error = ffi.wasmtime_func_call_unchecked(
                store._context,
                ctypes.byref(_cdb_djp_hash_wasm_in._func),
                raw_ptr_casted,
                trap)
            if error:
                raise WasmtimeError._from_ptr(error)
    return ctypes.c_uint32(raw[0].i32).value

and there is still more room from improvement

muayyad-alsadi · 2023-03-21T22:31:38Z

now it's ~66ms instead of ~80ms (compared to old 350ms)

by doing things once outside of the loop (the function won't change it's prototype/signature inside the loop)

def func_init(func, store):
    ty = func.type(store)
    ty_params = ty.params
    ty_results = ty.results
    params_str = (str(i) for i in ty_params)
    params_n = len(ty_params)
    results_n = len(ty_results)
    n = max(params_n, results_n)
    raw_type = wasmtime_val_raw_t*n
    func.raw_type = raw_type
    def _create_raw(params):
        raw = raw_type()
        for i, param_str in enumerate(params_str):
            setattr(raw[i], param_str, params[i])
        return raw
    func._create_raw = _create_raw

func_init(_cdb_djp_hash_wasm_in, store)
#....
    raw = _cdb_djp_hash_wasm_in._create_raw(params)
    raw_ptr_casted = ctypes.cast(raw, ctypes.POINTER(wasmtime_val_raw_t))

muayyad-alsadi · 2023-03-21T22:46:23Z

using struct.Struct gave ~80ms that is no extra benefit

each wasmtime_val_raw_t is 16 bytes (that is each one is 4xi32)
since our function takes two i32 we need a struct of <LLLLLLLL

# outside the loop
st=Struct('<LLLLLLLL')
raw_ptr_type = ctypes.c_uint8*st.size
st_ret=Struct('<L')
# inside the loop
    raw = bytearray(st.pack(start,0,0,0, size, 0,0,0))
    raw_ptr = raw_ptr_type.from_buffer(raw)
    raw_ptr_casted = ctypes.cast(raw_ptr, ctypes.POINTER(wasmtime_val_raw_t))

I'll continue using the previous func_init with wasmtime_val_raw_t

muayyad-alsadi · 2023-03-21T23:00:35Z

when profiler is enabled old method took ~700ms
and new method takes ~110ms
of which the actual wasm took less than 20ms (that is _bindings.py:2485(wasmtime_func_call_unchecked) or _bindings.py:2479(wasmtime_func_call) )

the profile of new method is very clean and does not seem to be possible to optimize it further

here are the report

old:  0.3560021289958968
new:  0.07202652499836404
         180001 function calls in 0.112 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    0.011    0.000    0.011    0.000 __init__.py:509(cast)
    10000    0.016    0.000    0.016    0.000 _bindings.py:2485(wasmtime_func_call_unchecked)
    20000    0.007    0.000    0.009    0.000 _func.py:256(enter_wasm)
    10000    0.004    0.000    0.004    0.000 cdb_djp_hash.py:72(_create_raw)
    10000    0.039    0.000    0.112    0.000 cdb_djp_hash.py:85(cdb_djp_hash_wasm)
    10000    0.008    0.000    0.010    0.000 contextlib.py:102(__init__)
    10000    0.004    0.000    0.014    0.000 contextlib.py:130(__enter__)
    10000    0.007    0.000    0.011    0.000 contextlib.py:139(__exit__)
    10000    0.004    0.000    0.013    0.000 contextlib.py:279(helper)
    20000    0.002    0.000    0.002    0.000 {built-in method _ctypes.POINTER}
    20000    0.003    0.000    0.003    0.000 {built-in method _ctypes.byref}
    10000    0.002    0.000    0.002    0.000 {built-in method builtins.getattr}
    10000    0.001    0.000    0.001    0.000 {built-in method builtins.len}
    20000    0.005    0.000    0.014    0.000 {built-in method builtins.next}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


         180001 function calls in 0.112 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    0.039    0.000    0.112    0.000 cdb_djp_hash.py:85(cdb_djp_hash_wasm)
    10000    0.016    0.000    0.016    0.000 _bindings.py:2485(wasmtime_func_call_unchecked)
    20000    0.005    0.000    0.014    0.000 {built-in method builtins.next}
    10000    0.004    0.000    0.014    0.000 contextlib.py:130(__enter__)
    10000    0.004    0.000    0.013    0.000 contextlib.py:279(helper)
    10000    0.011    0.000    0.011    0.000 __init__.py:509(cast)
    10000    0.007    0.000    0.011    0.000 contextlib.py:139(__exit__)
    10000    0.008    0.000    0.010    0.000 contextlib.py:102(__init__)
    20000    0.007    0.000    0.009    0.000 _func.py:256(enter_wasm)
    10000    0.004    0.000    0.004    0.000 cdb_djp_hash.py:72(_create_raw)
    20000    0.003    0.000    0.003    0.000 {built-in method _ctypes.byref}
    20000    0.002    0.000    0.002    0.000 {built-in method _ctypes.POINTER}
    10000    0.002    0.000    0.002    0.000 {built-in method builtins.getattr}
    10000    0.001    0.000    0.001    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


         180001 function calls in 0.112 seconds

   Ordered by: call count

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    20000    0.007    0.000    0.009    0.000 _func.py:256(enter_wasm)
    20000    0.005    0.000    0.014    0.000 {built-in method builtins.next}
    20000    0.002    0.000    0.002    0.000 {built-in method _ctypes.POINTER}
    20000    0.003    0.000    0.003    0.000 {built-in method _ctypes.byref}
    10000    0.016    0.000    0.016    0.000 _bindings.py:2485(wasmtime_func_call_unchecked)
    10000    0.011    0.000    0.011    0.000 __init__.py:509(cast)
    10000    0.008    0.000    0.010    0.000 contextlib.py:102(__init__)
    10000    0.004    0.000    0.014    0.000 contextlib.py:130(__enter__)
    10000    0.007    0.000    0.011    0.000 contextlib.py:139(__exit__)
    10000    0.004    0.000    0.013    0.000 contextlib.py:279(helper)
    10000    0.039    0.000    0.112    0.000 cdb_djp_hash.py:85(cdb_djp_hash_wasm)
    10000    0.004    0.000    0.004    0.000 cdb_djp_hash.py:72(_create_raw)
    10000    0.002    0.000    0.002    0.000 {built-in method builtins.getattr}
    10000    0.001    0.000    0.001    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

I just wonder why enter_wasm was called twice per iteration

muayyad-alsadi · 2023-03-23T08:25:26Z

this is a great enhancement from 43 ms to 6 ms in @alexprengere question in #96
yes the pure python implementation beats that at 1ms
although this is a huge improvement we have long way to beat pure python (not to mention C)
but I believe according to "Profile" this is as fast as python binding can go, any further optimization should be in the wasmtime itself not in the binding

0.0002523580042179674 # math_gcd
0.0014094869984546676 # python_gcd
0.043804362998344004  #  wasm_gcd
0.005873051006346941  # wasm_gcd_alt <-------- my change

This comment was marked as outdated.

Sign in to view

muayyad-alsadi mentioned this issue Mar 22, 2023

Performance questions #96

Open

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Mar 25, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

c724ce1

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Mar 25, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

d0dcca2

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Mar 25, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

52a0f1d

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Mar 26, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

da545aa

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Mar 26, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

68fa2a8

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Mar 26, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

f00d32a

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Mar 27, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

7bee159

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Mar 27, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

d0913e8

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Mar 28, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

d383a05

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 2, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

383cad6

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 2, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

dc3d697

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 4, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

1446d79

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 4, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

9122bb4

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 4, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

1ca73e9

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 4, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

e6c054b

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 4, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

8bc29e6

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 4, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

37b6067

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 5, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

2b1d2c0

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 5, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

8417fba

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 5, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

5bb5c39

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 5, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

a67ecc0

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 5, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

0388b16

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 5, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

4cd48d3

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 5, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

9285e6c

muayyad-alsadi added a commit to muayyad-alsadi/wasmtime-py that referenced this issue Apr 5, 2023

FIXES bytecodealliance#137: make calling wasm from python 7x faster

818bb8d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

optimize time wasted in passing argument and fetching return #137

optimize time wasted in passing argument and fetching return #137

muayyad-alsadi commented Mar 21, 2023

muayyad-alsadi commented Mar 21, 2023

This comment was marked as outdated.

muayyad-alsadi commented Mar 21, 2023

muayyad-alsadi commented Mar 21, 2023

muayyad-alsadi commented Mar 21, 2023

muayyad-alsadi commented Mar 21, 2023

muayyad-alsadi commented Mar 23, 2023

optimize time wasted in passing argument and fetching return #137

optimize time wasted in passing argument and fetching return #137

Comments

muayyad-alsadi commented Mar 21, 2023

muayyad-alsadi commented Mar 21, 2023

This comment was marked as outdated.

muayyad-alsadi commented Mar 21, 2023

muayyad-alsadi commented Mar 21, 2023

muayyad-alsadi commented Mar 21, 2023

muayyad-alsadi commented Mar 21, 2023

muayyad-alsadi commented Mar 23, 2023