Prerequisites
Current Behavior
llama_decode takes 4x more time to complete for 2 tokens compared to 1 token. Specifically, when I feed a single token to llama_decode it takes ~12 ms on average, while for 2 or more tokens llama_decode takes ~50 ms to complete. Naturally, I would expect at most a 2x increase in time to process twice the number of tokens, but in fact processing 2 tokens takes 4x longer than processing 1 token.
Naively, one could assume that the llama.cpp CUDA code can be tweaked so that llama_decode for 2 tokens completes in at most twice the time it takes to decode 1 token. This would result in the following benefits:
- Up to 2x reduction in prompt eval time for single-sequence inference
- Up to 2x reduction in next-token prediction time for multi-sequence inference (see the sketch below)
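To illustrate the second point, here is a minimal sketch of batched multi-sequence decoding, assuming the llama_batch helpers from common.h; next_tokens and n_past are hypothetical per-sequence bookkeeping kept by the caller, not part of the API:

```cpp
#include <vector>

#include <llama.h>
#include <common.h>

// Sketch: advance n_seqs independent sequences by one token each with a
// single llama_decode call. next_tokens and n_past are hypothetical
// per-sequence state kept by the caller, not part of the llama.cpp API.
static void decode_one_token_per_sequence(llama_context * ctx,
                                          const std::vector<llama_token> & next_tokens,
                                          const std::vector<llama_pos>   & n_past) {
    const int32_t n_seqs = (int32_t) next_tokens.size();

    llama_batch batch = llama_batch_init(n_seqs, 0, 1);
    llama_batch_clear(batch);
    for (int32_t s = 0; s < n_seqs; ++s) {
        llama_batch_add(batch, next_tokens[s], n_past[s], { s }, true);
    }

    // one call for n_seqs tokens: if decode time scaled ~linearly with batch
    // size, this would be close to n_seqs times cheaper than n_seqs separate
    // single-token calls
    llama_decode(ctx, batch);
    llama_batch_free(batch);
}
```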
My question
So I was wondering whether these are sane considerations, and if so, whether one of the CUDA experts could pull off such an optimization?
Some additional notes
Here are the results of my measurements:
| n_tokens | llama_decode time, ms |
|---------:|----------------------:|
|        1 |                    12 |
|        2 |                    50 |
|        4 |                    51 |
|        8 |                    51 |
|       64 |                    56 |
Note that the time stays nearly flat from 2 to 64 tokens, so the jump looks like a fixed overhead of the multi-token path. Also, the problem appears to be GPU specific and does not affect the CPU backend.
Environment and Context
I am running an RTX 4070 under WSL2.
The model is LLaMA 7B quantized with Q4_0.
Steps to Reproduce
The code I used to collect stats:
```cpp
#include <algorithm>
#include <iostream>
#include <iomanip>
#include <string>
#include <vector>

#include <llama.h>
#include <common.h>

void exit_if_false(bool cond, const char * msg) {
    if (!cond) {
        std::cerr << msg << std::endl;
        exit(1);
    }
}

const int  BATCH_SIZE = 2;
const bool GPU        = true;

int main(int argc, char * argv[]) {
    std::cout << "Testing on " << (GPU ? "GPU" : "CPU") << '\n';

    llama_model_params model_params = llama_model_default_params();
    {
        model_params.n_gpu_layers = GPU ? 1000 : 0; // offload all layers when on GPU
    }

    llama_context_params context_params = llama_context_default_params();
    {
        context_params.n_ctx     = 1024;
        context_params.n_batch   = BATCH_SIZE;
        context_params.n_threads = GPU ? 1 : 10;
    }

    llama_model * model = llama_load_model_from_file(argv[1], model_params);
    exit_if_false(model, "Can not load model");

    llama_context * ctx = llama_new_context_with_model(model, context_params);
    exit_if_false(ctx, "Can not create context");

    std::string prompt = "In another moment down went Alice after it, never once considering how in the world she was to get out again.";
    std::vector<llama_token> tokens = llama_tokenize(ctx, prompt, true, false);
    std::cout << "Processing " << tokens.size() << " tokens\n";

    llama_batch batch = llama_batch_init(BATCH_SIZE, 0, 1);

    double total_dt_ms = 0;
    int    num_calls   = 0;

    // feed the prompt in chunks of BATCH_SIZE tokens and time each decode call
    for (size_t start = 0; start < tokens.size(); start += BATCH_SIZE) {
        size_t end = std::min(start + BATCH_SIZE, tokens.size());

        llama_batch_clear(batch);
        for (size_t i = start; i < end; ++i) {
            llama_batch_add(batch, tokens[i], (llama_pos) i, { 0 }, false);
        }

        const int64_t tstart = ggml_time_us();
        const int ret        = llama_decode(ctx, batch);
        const int64_t tend   = ggml_time_us();
        exit_if_false(ret == 0, "llama_decode failed");

        double dt_ms = (tend - tstart) / 1000.0;
        std::cout << "llama_decode: " << std::setw(7) << std::fixed << std::setprecision(3) << dt_ms
                  << " ms. for " << std::setw(3) << batch.n_tokens << " token(s)\n";

        total_dt_ms += dt_ms;
        num_calls   += 1;
    }

    llama_batch_free(batch);

    std::cout << "Average:\n"
              << (total_dt_ms / num_calls)     << " ms. per call\n"
              << (total_dt_ms / tokens.size()) << " ms. per token\n";

    llama_free(ctx);
    llama_free_model(model);
    return 0;
}
```
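The program takes the model path as its only argument, e.g. `./bench models/llama-7b-q4_0.gguf` (binary name and path are illustrative), and must be linked against llama.cpp's llama and common libraries with CUDA enabled.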