
Bug: Using llama_batch_init+add+free instead of llama_batch_get_one() permanently slows down llama_decode significantly #10322

Open · Nekotekina opened this issue on Nov 15, 2024 · 4 comments
Labels: bug-unconfirmed, high severity (used to report high severity bugs in llama.cpp, where a malfunction hinders an important workflow)

Nekotekina (Contributor) commented:

What happened?

I have the following code (roughly) executed at some point for prompt processing:
[screenshot: prompt-processing code using llama_batch_init, common_batch_add per token, llama_decode, and llama_batch_free]
Afterwards, llama_decode for token generation becomes significantly slower (roughly 14 t/s versus 36 t/s).
However, if this code is replaced by the llama_batch_get_one equivalent, performance remains high.
I'm not sure why this happens; maybe I'm using llama_batch incorrectly. A rough sketch of the batched path is shown below.
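
For reference, a minimal sketch of what the batched prompt-processing path described above might look like (the original code is only available as a screenshot, so `ctx`, `prompt_tokens`, and `n_prompt_tokens` are placeholder names; `common_batch_add` is the helper from `common/common.h`):

```cpp
// Rough reconstruction of the reported prompt-processing path (approximate, not the reporter's exact code).
llama_batch batch = llama_batch_init(n_prompt_tokens, 0, 1);

for (int32_t i = 0; i < n_prompt_tokens; ++i) {
    // logits requested for every token (last argument); see the follow-up
    // comment below, where this turns out to be the cause of the slowdown
    common_batch_add(batch, prompt_tokens[i], i, { 0 }, true);
}

if (llama_decode(ctx, batch) != 0) {
    fprintf(stderr, "llama_decode failed\n");
}

llama_batch_free(batch);
```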

Name and Version

~ 4083 (09ecbcb)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

Nekotekina added the bug-unconfirmed and high severity labels on Nov 15, 2024
Nekotekina (Contributor, Author) commented:

UPD: Actually, I "fixed" it by setting the logits argument to false in common_batch_add, but it still seems strange that it has a slowdown effect on the unrelated llama_decode calls.
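
A sketch of that workaround, under the same placeholder names as the snippet above (request logits only for the final prompt token):

```cpp
for (int32_t i = 0; i < n_prompt_tokens; ++i) {
    // request logits only for the last token of the prompt
    const bool need_logits = (i == n_prompt_tokens - 1);
    common_batch_add(batch, prompt_tokens[i], i, { 0 }, need_logits);
}
```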

slaren (Collaborator) commented on Nov 16, 2024:

That's weird. Enabling logits for all the tokens will cause a reallocation of the output buffer, which uses pinned memory if possible. I wonder if that's the reason.
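
For a rough sense of scale (the numbers below are assumptions for illustration, not measurements): the output buffer holds one row of logits per token that requests them, so enabling logits for the whole prompt multiplies its size by the prompt length.

```cpp
// Illustrative only: assumed vocabulary size and prompt length, not measured values.
const size_t n_vocab    = 128256;                 // assumed Llama-3-style vocabulary
const size_t n_prompt   = 512;                    // assumed prompt length
const size_t row_bytes  = n_vocab * sizeof(float);

const size_t last_only  = 1        * row_bytes;   // ~0.5 MiB when only the last token needs logits
const size_t all_tokens = n_prompt * row_bytes;   // ~250 MiB of (possibly pinned) output buffer
```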

ggerganov (Owner) commented on Nov 16, 2024:

For reference, with the patch in f7b0233 to enable logits for the entire batch, I get the following slow-down on M2 Ultra:

./scripts/compare-commits.sh master gg/logits-slowdown \
    -m ./models/llama-3.2-3b-instruct/ggml-model-q4_0.gguf \
    -m ./models/llama-3.2-3b-instruct/ggml-model-f16.gguf \
    -p 1,1,2,4,8,16,32,64 -n 0 -r 20
| Model | Test | t/s master | t/s gg/logits-slowdown | Speedup |
| ----- | ---- | ---------- | ---------------------- | ------- |
| llama 3B F16 | pp1 | 70.02 | 70.49 | 1.01 |
| llama 3B F16 | pp2 | 61.58 | 60.82 | 0.99 |
| llama 3B F16 | pp4 | 122.37 | 120.92 | 0.99 |
| llama 3B F16 | pp8 | 243.51 | 239.42 | 0.98 |
| llama 3B F16 | pp16 | 480.21 | 469.60 | 0.98 |
| llama 3B F16 | pp32 | 941.25 | 910.40 | 0.97 |
| llama 3B F16 | pp64 | 1653.07 | 1531.46 | 0.93 |
| llama 3B Q4_0 | pp1 | 144.75 | 146.01 | 1.01 |
| llama 3B Q4_0 | pp2 | 62.71 | 59.93 | 0.96 |
| llama 3B Q4_0 | pp4 | 123.88 | 117.78 | 0.95 |
| llama 3B Q4_0 | pp8 | 245.10 | 232.88 | 0.95 |
| llama 3B Q4_0 | pp16 | 482.62 | 456.03 | 0.94 |
| llama 3B Q4_0 | pp32 | 932.14 | 870.65 | 0.93 |
| llama 3B Q4_0 | pp64 | 1560.35 | 1398.42 | 0.90 |

@Nekotekina Can you run this test using your hardware and model and post the numbers?

Nekotekina (Contributor, Author) commented:

GGML_CUDA=1 ./scripts/compare-commits.sh master gg/logits-slowdown -m ~/Downloads/LLM-Models/aya-expanse-32b-Q4_K_M.gguf -p 1,1,2,4,8,16,32,64,512,1 -n 0 -r 20 -ngl 99
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes

| Model | Test | t/s master | t/s gg/logits-slowdown | Speedup |
| ----- | ---- | ---------- | ---------------------- | ------- |
| command-r 35B Q4_K_M | pp1 | 38.43 | 38.26 | 1.00 |
| command-r 35B Q4_K_M | pp2 | 68.44 | 67.46 | 0.99 |
| command-r 35B Q4_K_M | pp4 | 86.93 | 84.17 | 0.97 |
| command-r 35B Q4_K_M | pp8 | 108.62 | 102.61 | 0.94 |
| command-r 35B Q4_K_M | pp16 | 363.74 | 340.73 | 0.94 |
| command-r 35B Q4_K_M | pp32 | 663.11 | 602.33 | 0.91 |
| command-r 35B Q4_K_M | pp64 | 961.14 | 839.48 | 0.87 |
| command-r 35B Q4_K_M | pp512 | 1245.76 | 1025.11 | 0.82 |

I'm not sure that this test reproduces the problem, though. In my case, the slowdown was observed in token generation (measuring llama_decode with llama_batch_get_one and calling llama_synchronize immediately afterwards), roughly as in the sketch below.
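
For context, a hedged sketch of the kind of generation-side measurement meant here (sampling and setup are omitted; `ctx`, `new_token`, and `n_gen` are placeholders, and the llama_batch_get_one signature assumed is the two-argument one from this period of the API):

```cpp
#include <chrono>
#include <cstdio>

// Decode one token at a time and synchronize right away, so the timing
// reflects each individual llama_decode call during generation.
const auto t0 = std::chrono::steady_clock::now();

for (int i = 0; i < n_gen; ++i) {
    llama_batch batch = llama_batch_get_one(&new_token, 1);
    if (llama_decode(ctx, batch) != 0) {
        break;
    }
    llama_synchronize(ctx);
    // ... sample the next new_token, e.g. from llama_get_logits_ith(ctx, -1) ...
}

const std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
fprintf(stderr, "%.2f t/s\n", n_gen / dt.count());
```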
