
Bug: Using llama_batch_init+add+free instead of llama_batch_get_one() permanently slows down llama_decode significantly #10322

Open · Nekotekina opened this issue on Nov 15, 2024 · 4 comments
Labels: bug-unconfirmed, high severity (used to report high severity bugs in llama.cpp, where a malfunction hinders an important workflow)

Nekotekina (Contributor) commented:

What happened?

I have the following code (roughly) executed at some point for prompt processing:
[screenshot: prompt-processing code using llama_batch_init, common_batch_add per token, llama_decode, and llama_batch_free]
Afterwards, llama_decode for token generation becomes significantly slower (roughly 14 t/s versus 36 t/s).
However, if this code is replaced by the llama_batch_get_one equivalent, performance remains high.
I'm not sure why this happens; maybe I'm using llama_batch incorrectly. A rough sketch of the batched path is shown below.
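
For reference, a minimal sketch of what the batched prompt-processing path described above might look like (the original code is only available as a screenshot, so `ctx`, `prompt_tokens`, and `n_prompt_tokens` are placeholder names; `common_batch_add` is the helper from `common/common.h`):

```cpp
// Rough reconstruction of the reported prompt-processing path (approximate, not the reporter's exact code).
llama_batch batch = llama_batch_init(n_prompt_tokens, 0, 1);

for (int32_t i = 0; i < n_prompt_tokens; ++i) {
    // logits requested for every token (last argument); see the follow-up
    // comment below, where this turns out to be the cause of the slowdown
    common_batch_add(batch, prompt_tokens[i], i, { 0 }, true);
}

if (llama_decode(ctx, batch) != 0) {
    fprintf(stderr, "llama_decode failed\n");
}

llama_batch_free(batch);
```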

Name and Version

~ 4083 (09ecbcb)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

Nekotekina added the bug-unconfirmed and high severity labels on Nov 15, 2024
Nekotekina (Contributor, Author) commented:

UPD: Actually, I "fixed" it by setting the logits argument to false in common_batch_add, but it still seems strange that it has a slowdown effect on the unrelated llama_decode calls.
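
A sketch of that workaround, under the same placeholder names as the snippet above (request logits only for the final prompt token):

```cpp
for (int32_t i = 0; i < n_prompt_tokens; ++i) {
    // request logits only for the last token of the prompt
    const bool need_logits = (i == n_prompt_tokens - 1);
    common_batch_add(batch, prompt_tokens[i], i, { 0 }, need_logits);
}
```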

slaren (Collaborator) commented on Nov 16, 2024:

That's weird. Enabling logits for all the tokens will cause a reallocation of the output buffer, which uses pinned memory if possible. I wonder if that's the reason.
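
For a rough sense of scale (the numbers below are assumptions for illustration, not measurements): the output buffer holds one row of logits per token that requests them, so enabling logits for the whole prompt multiplies its size by the prompt length.

```cpp
// Illustrative only: assumed vocabulary size and prompt length, not measured values.
const size_t n_vocab    = 128256;                 // assumed Llama-3-style vocabulary
const size_t n_prompt   = 512;                    // assumed prompt length
const size_t row_bytes  = n_vocab * sizeof(float);

const size_t last_only  = 1        * row_bytes;   // ~0.5 MiB when only the last token needs logits
const size_t all_tokens = n_prompt * row_bytes;   // ~250 MiB of (possibly pinned) output buffer
```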

ggerganov (Owner) commented on Nov 16, 2024:

For reference, with the patch in f7b0233 to enable logits for the entire batch, I get the following slow-down on M2 Ultra:

./scripts/compare-commits.sh master gg/logits-slowdown \
    -m ./models/llama-3.2-3b-instruct/ggml-model-q4_0.gguf \
    -m ./models/llama-3.2-3b-instruct/ggml-model-f16.gguf \
    -p 1,1,2,4,8,16,32,64 -n 0 -r 20
| Model | Test | t/s master | t/s gg/logits-slowdown | Speedup |
| ----- | ---- | ---------- | ---------------------- | ------- |
| llama 3B F16 | pp1 | 70.02 | 70.49 | 1.01 |
| llama 3B F16 | pp2 | 61.58 | 60.82 | 0.99 |
| llama 3B F16 | pp4 | 122.37 | 120.92 | 0.99 |
| llama 3B F16 | pp8 | 243.51 | 239.42 | 0.98 |
| llama 3B F16 | pp16 | 480.21 | 469.60 | 0.98 |
| llama 3B F16 | pp32 | 941.25 | 910.40 | 0.97 |
| llama 3B F16 | pp64 | 1653.07 | 1531.46 | 0.93 |
| llama 3B Q4_0 | pp1 | 144.75 | 146.01 | 1.01 |
| llama 3B Q4_0 | pp2 | 62.71 | 59.93 | 0.96 |
| llama 3B Q4_0 | pp4 | 123.88 | 117.78 | 0.95 |
| llama 3B Q4_0 | pp8 | 245.10 | 232.88 | 0.95 |
| llama 3B Q4_0 | pp16 | 482.62 | 456.03 | 0.94 |
| llama 3B Q4_0 | pp32 | 932.14 | 870.65 | 0.93 |
| llama 3B Q4_0 | pp64 | 1560.35 | 1398.42 | 0.90 |

@Nekotekina Can you run this test using your hardware and model and post the numbers?

Nekotekina (Contributor, Author) commented:

GGML_CUDA=1 ./scripts/compare-commits.sh master gg/logits-slowdown -m ~/Downloads/LLM-Models/aya-expanse-32b-Q4_K_M.gguf -p 1,1,2,4,8,16,32,64,512,1 -n 0 -r 20 -ngl 99
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes

| Model | Test | t/s master | t/s gg/logits-slowdown | Speedup |
| ----- | ---- | ---------- | ---------------------- | ------- |
| command-r 35B Q4_K_M | pp1 | 38.43 | 38.26 | 1.00 |
| command-r 35B Q4_K_M | pp2 | 68.44 | 67.46 | 0.99 |
| command-r 35B Q4_K_M | pp4 | 86.93 | 84.17 | 0.97 |
| command-r 35B Q4_K_M | pp8 | 108.62 | 102.61 | 0.94 |
| command-r 35B Q4_K_M | pp16 | 363.74 | 340.73 | 0.94 |
| command-r 35B Q4_K_M | pp32 | 663.11 | 602.33 | 0.91 |
| command-r 35B Q4_K_M | pp64 | 961.14 | 839.48 | 0.87 |
| command-r 35B Q4_K_M | pp512 | 1245.76 | 1025.11 | 0.82 |

I'm not sure that this test reproduces the problem, though. In my case, the slowdown was observed in token generation (measuring llama_decode with llama_batch_get_one and calling llama_synchronize immediately afterwards), roughly as in the sketch below.
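
For context, a hedged sketch of the kind of generation-side measurement meant here (sampling and setup are omitted; `ctx`, `new_token`, and `n_gen` are placeholders, and the llama_batch_get_one signature assumed is the two-argument one from this period of the API):

```cpp
#include <chrono>
#include <cstdio>

// Decode one token at a time and synchronize right away, so the timing
// reflects each individual llama_decode call during generation.
const auto t0 = std::chrono::steady_clock::now();

for (int i = 0; i < n_gen; ++i) {
    llama_batch batch = llama_batch_get_one(&new_token, 1);
    if (llama_decode(ctx, batch) != 0) {
        break;
    }
    llama_synchronize(ctx);
    // ... sample the next new_token, e.g. from llama_get_logits_ith(ctx, -1) ...
}

const std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
fprintf(stderr, "%.2f t/s\n", n_gen / dt.count());
```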
