Improving Qwen3 Coder Next 80b performance on ik_llama vs llama.cpp
Hi, as discussed on Reddit,
I tried many configurations to get Qwen3-Coder-Next running faster on ik_llama than on llama.cpp, but without success.
This is the fastest command I've achieved so far with llama.cpp on my hardware configuration. I think with Q8 I was getting even faster prefills.
Configuration for Qwen3-Coder-Next on RTX 3090 + 5950X 64GB DDR4
Peak Performance: ~42 t/s Gen | ~1230 t/s Prompt
~/llama.cpp/build/bin/llama-server \
-m ~/models/Qwen3-Coder-Next-Q3_K_M.gguf \
--host 0.0.0.0 --port 5000 \
--ctx-size 64000 \
--parallel 1 \
-ngl 99 \
--n-cpu-moe 28 \
-t 16 \
-b 4096 \
-ub 2048 \
--flash-attn on \
--mlock \
--no-mmap \
--jinja \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--temp 0.8 \
--min-p 0.05 \
--presence-penalty 1.1 \
--dry-multiplier 0.5 \
--dry-base 1.75 \
--dry-allowed-length 2 \
--dry-penalty-last-n 4096
I also have a simpler configuration for Qwen 3.5 27B. Since it fits fully on the GPU, would there be any improvement with ik_llama at all?
This is the command I am using:
MODEL_PATH="$HOME/models/Qwen3.5_27B/Qwen3.5-27B-UD-Q5_K_XL.gguf"
VISION_PATH="$HOME/models/Qwen3.5_27B/mmproj-BF16.gguf"
# 4. Launch Server
~/llama.cpp/build/bin/llama-server \
-m "$MODEL_PATH" \
--mmproj "$VISION_PATH" \
--host 0.0.0.0 --port 5000 \
--ctx-size 64000 \
--parallel 1 \
--fit on \
--flash-attn on \
--no-mmap \
--jinja \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--presence-penalty 0.0 \
--repeat-penalty 1.0
Heya, super, thanks for dropping in! Let's take a look; there are a few questions in here that I see:
- Running Qwen3-Coder-Next ~3ish BPW quant hybrid CPU+GPU mainline vs ik
I'm downloading my Q4_0 44.355 GiB (4.782 BPW) to run tests locally on my similar rig, an AMD 9950X and a single 3090 Ti FE. I'll get back to you with some commands and llama-sweep-bench results comparing ik and mainline.
> I think on q8 I was getting even faster prefills
Yes, Q8_0, despite being larger, is often faster for PP. This is because PP is typically compute bound, whereas TG is typically memory-bandwidth bound. As such, Q8_0, being one of the first legacy quantization types, has a very fast kernel computation-wise, leading to faster PP performance (but it will give slower TG, as it is bigger). Tradeoffs.
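To make the memory-bandwidth side of that tradeoff concrete, here is a back-of-envelope sketch: TG has to stream (roughly) all active weights once per token, so tokens/s is capped near bandwidth divided by bytes read per token. All figures below are illustrative assumptions, not measurements from this thread.

```python
# Rough ceiling on token generation speed when purely memory-bandwidth
# bound. The numbers are hypothetical placeholders for a DDR4 rig and
# a MoE model with a few GB of active weights per token.

def tg_ceiling(bytes_per_token: float, mem_bw_bytes_s: float) -> float:
    """Upper bound on tokens/s = bandwidth / bytes streamed per token."""
    return mem_bw_bytes_s / bytes_per_token

ddr4_bw  = 50e9   # assumed ~50 GB/s effective dual-channel DDR4
q8_bytes = 3.2e9  # assumed ~3.2 GB of active weights per token at Q8_0
q3_bytes = 1.4e9  # assumed ~1.4 GB of active weights per token at Q3_K_M

print(f"Q8_0  TG ceiling: {tg_ceiling(q8_bytes, ddr4_bw):.1f} t/s")
print(f"Q3_K  TG ceiling: {tg_ceiling(q3_bytes, ddr4_bw):.1f} t/s")
```

The smaller quant roughly doubles the TG ceiling on the same memory system, which is exactly why Q8_0's faster PP kernels come with slower TG.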
> Qwen 3.5 27B. As it fits fully on the GPU would there be any improvement with ik_llama at all?
EDIT: Oops, you are talking about the 27B dense model; however, I ran the 35B MoE below. You can use these same examples to try it yourself now.
I'll run a llama-sweep-bench on my rig as well for this using my Q4_0 19.776 GiB (4.901 BPW) testing full offload.
I know that with 2x GPUs ik_llama.cpp would benefit due to `-sm graph`, but I'm not sure about a single GPU, and we will find out!
Okay, on my 3090 with a Qwen3.5-35B-A3B Q4_0 custom quant, it looks like ik_llama.cpp is faster for full offload in a single-GPU situation:
👈 Details
ik_llama.cpp main@277fc1d2
model=/mnt/astrodata/llm/models/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-Q4_0.gguf
./build/bin/llama-sweep-bench \
--model "$model" \
-ctk q8_0 -ctv q8_0 \
-c 69632 \
-ub 1024 -b 2048 \
--merge-qkv \
-ngl 99 \
--threads 1 \
--warmup-batch \
-n 128
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 1024 | 128 | 0 | 0.230 | 4456.53 | 0.951 | 134.55 |
| 1024 | 128 | 1024 | 0.233 | 4397.70 | 0.939 | 136.35 |
| 1024 | 128 | 2048 | 0.235 | 4361.99 | 0.944 | 135.57 |
| 1024 | 128 | 3072 | 0.237 | 4314.59 | 0.949 | 134.92 |
| 1024 | 128 | 4096 | 0.239 | 4283.01 | 0.952 | 134.45 |
| 1024 | 128 | 5120 | 0.243 | 4221.25 | 0.959 | 133.40 |
| 1024 | 128 | 6144 | 0.247 | 4147.44 | 0.965 | 132.65 |
| 1024 | 128 | 7168 | 0.249 | 4116.70 | 0.974 | 131.41 |
| 1024 | 128 | 8192 | 0.251 | 4084.81 | 0.981 | 130.47 |
| 1024 | 128 | 9216 | 0.254 | 4027.31 | 0.993 | 128.89 |
| 1024 | 128 | 10240 | 0.256 | 3999.14 | 1.002 | 127.69 |
| 1024 | 128 | 11264 | 0.258 | 3967.02 | 1.025 | 124.92 |
| 1024 | 128 | 12288 | 0.261 | 3917.65 | 1.032 | 124.01 |
| 1024 | 128 | 13312 | 0.263 | 3891.82 | 1.036 | 123.54 |
| 1024 | 128 | 14336 | 0.268 | 3825.69 | 1.040 | 123.10 |
| 1024 | 128 | 15360 | 0.270 | 3791.81 | 1.045 | 122.43 |
| 1024 | 128 | 16384 | 0.273 | 3750.57 | 1.051 | 121.80 |
| 1024 | 128 | 17408 | 0.274 | 3736.53 | 1.061 | 120.69 |
| 1024 | 128 | 18432 | 0.276 | 3704.71 | 1.067 | 120.00 |
| 1024 | 128 | 19456 | 0.280 | 3659.21 | 1.073 | 119.24 |
| 1024 | 128 | 20480 | 0.282 | 3634.29 | 1.080 | 118.47 |
| 1024 | 128 | 21504 | 0.283 | 3616.62 | 1.098 | 116.53 |
| 1024 | 128 | 22528 | 0.287 | 3573.82 | 1.107 | 115.61 |
| 1024 | 128 | 23552 | 0.290 | 3537.00 | 1.113 | 115.00 |
| 1024 | 128 | 24576 | 0.292 | 3503.15 | 1.119 | 114.41 |
| 1024 | 128 | 25600 | 0.295 | 3469.56 | 1.125 | 113.82 |
| 1024 | 128 | 26624 | 0.297 | 3450.91 | 1.131 | 113.18 |
| 1024 | 128 | 27648 | 0.301 | 3406.72 | 1.135 | 112.79 |
| 1024 | 128 | 28672 | 0.303 | 3384.64 | 1.142 | 112.08 |
| 1024 | 128 | 29696 | 0.305 | 3358.41 | 1.147 | 111.58 |
| 1024 | 128 | 30720 | 0.309 | 3312.55 | 1.154 | 110.93 |
| 1024 | 128 | 31744 | 0.310 | 3300.11 | 1.162 | 110.17 |
| 1024 | 128 | 32768 | 0.314 | 3264.49 | 1.183 | 108.24 |
| 1024 | 128 | 33792 | 0.316 | 3243.48 | 1.187 | 107.80 |
| 1024 | 128 | 34816 | 0.318 | 3219.86 | 1.191 | 107.44 |
| 1024 | 128 | 35840 | 0.322 | 3184.30 | 1.198 | 106.87 |
| 1024 | 128 | 36864 | 0.325 | 3152.46 | 1.202 | 106.47 |
| 1024 | 128 | 37888 | 0.326 | 3143.22 | 1.209 | 105.89 |
| 1024 | 128 | 38912 | 0.330 | 3107.39 | 1.214 | 105.47 |
| 1024 | 128 | 39936 | 0.333 | 3078.47 | 1.221 | 104.83 |
| 1024 | 128 | 40960 | 0.333 | 3074.35 | 1.224 | 104.59 |
| 1024 | 128 | 41984 | 0.337 | 3042.41 | 1.232 | 103.92 |
| 1024 | 128 | 43008 | 0.340 | 3013.83 | 1.247 | 102.68 |
| 1024 | 128 | 44032 | 0.341 | 3004.07 | 1.260 | 101.58 |
| 1024 | 128 | 45056 | 0.345 | 2968.48 | 1.263 | 101.31 |
| 1024 | 128 | 46080 | 0.346 | 2956.39 | 1.269 | 100.88 |
| 1024 | 128 | 47104 | 0.349 | 2933.70 | 1.275 | 100.42 |
| 1024 | 128 | 48128 | 0.353 | 2904.14 | 1.280 | 100.00 |
| 1024 | 128 | 49152 | 0.354 | 2894.14 | 1.283 | 99.80 |
| 1024 | 128 | 50176 | 0.356 | 2872.86 | 1.291 | 99.13 |
| 1024 | 128 | 51200 | 0.358 | 2859.93 | 1.297 | 98.66 |
| 1024 | 128 | 52224 | 0.363 | 2821.41 | 1.305 | 98.11 |
| 1024 | 128 | 53248 | 0.363 | 2817.62 | 1.309 | 97.75 |
| 1024 | 128 | 54272 | 0.367 | 2786.95 | 1.331 | 96.18 |
| 1024 | 128 | 55296 | 0.370 | 2767.88 | 1.337 | 95.76 |
| 1024 | 128 | 56320 | 0.372 | 2753.92 | 1.342 | 95.40 |
| 1024 | 128 | 57344 | 0.377 | 2719.77 | 1.347 | 95.05 |
| 1024 | 128 | 58368 | 0.376 | 2720.04 | 1.353 | 94.61 |
| 1024 | 128 | 59392 | 0.381 | 2688.69 | 1.358 | 94.29 |
| 1024 | 128 | 60416 | 0.383 | 2671.91 | 1.362 | 93.97 |
| 1024 | 128 | 61440 | 0.385 | 2662.26 | 1.369 | 93.52 |
| 1024 | 128 | 62464 | 0.391 | 2621.37 | 1.374 | 93.15 |
| 1024 | 128 | 63488 | 0.393 | 2605.50 | 1.381 | 92.65 |
| 1024 | 128 | 64512 | 0.394 | 2601.22 | 1.399 | 91.48 |
| 1024 | 128 | 65536 | 0.395 | 2595.04 | 1.407 | 90.94 |
| 1024 | 128 | 66560 | 0.398 | 2575.22 | 1.413 | 90.59 |
| 1024 | 128 | 67584 | 0.399 | 2566.49 | 1.420 | 90.13 |
| 1024 | 128 | 68608 | 0.404 | 2537.74 | 1.425 | 89.83 |
mainline llama.cpp master@e68f2fb8 + ug/port-sweep-bench
model=/mnt/astrodata/llm/models/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-Q4_0.gguf
./build/bin/llama-sweep-bench \
--model "$model" \
-ctk q8_0 -ctv q8_0 \
-c 69632 \
-ub 1024 -b 2048 \
-ngl 99 \
--threads 1 \
-n 128
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 1024 | 128 | 0 | 0.271 | 3782.91 | 1.044 | 122.57 |
| 1024 | 128 | 1024 | 0.275 | 3727.53 | 1.047 | 122.31 |
| 1024 | 128 | 2048 | 0.278 | 3689.74 | 1.053 | 121.59 |
| 1024 | 128 | 3072 | 0.281 | 3648.56 | 1.063 | 120.44 |
| 1024 | 128 | 4096 | 0.281 | 3639.17 | 1.070 | 119.64 |
| 1024 | 128 | 5120 | 0.286 | 3579.62 | 1.077 | 118.88 |
| 1024 | 128 | 6144 | 0.290 | 3525.14 | 1.087 | 117.76 |
| 1024 | 128 | 7168 | 0.294 | 3485.97 | 1.097 | 116.65 |
| 1024 | 128 | 8192 | 0.296 | 3464.54 | 1.107 | 115.59 |
| 1024 | 128 | 9216 | 0.300 | 3417.43 | 1.118 | 114.53 |
| 1024 | 128 | 10240 | 0.302 | 3391.42 | 1.123 | 113.97 |
| 1024 | 128 | 11264 | 0.304 | 3366.24 | 1.134 | 112.87 |
| 1024 | 128 | 12288 | 0.306 | 3343.82 | 1.144 | 111.90 |
| 1024 | 128 | 13312 | 0.310 | 3306.75 | 1.155 | 110.82 |
| 1024 | 128 | 14336 | 0.315 | 3251.99 | 1.162 | 110.11 |
| 1024 | 128 | 15360 | 0.317 | 3231.79 | 1.171 | 109.35 |
| 1024 | 128 | 16384 | 0.320 | 3202.35 | 1.174 | 108.99 |
| 1024 | 128 | 17408 | 0.322 | 3183.65 | 1.181 | 108.37 |
| 1024 | 128 | 18432 | 0.325 | 3152.32 | 1.190 | 107.57 |
| 1024 | 128 | 19456 | 0.329 | 3110.96 | 1.198 | 106.84 |
| 1024 | 128 | 20480 | 0.331 | 3090.42 | 1.203 | 106.39 |
| 1024 | 128 | 21504 | 0.333 | 3076.22 | 1.212 | 105.64 |
| 1024 | 128 | 22528 | 0.337 | 3040.19 | 1.221 | 104.81 |
| 1024 | 128 | 23552 | 0.339 | 3017.76 | 1.227 | 104.35 |
| 1024 | 128 | 24576 | 0.343 | 2983.70 | 1.234 | 103.74 |
| 1024 | 128 | 25600 | 0.346 | 2955.51 | 1.241 | 103.12 |
| 1024 | 128 | 26624 | 0.349 | 2935.42 | 1.247 | 102.67 |
| 1024 | 128 | 27648 | 0.352 | 2906.81 | 1.257 | 101.80 |
| 1024 | 128 | 28672 | 0.355 | 2884.84 | 1.265 | 101.20 |
| 1024 | 128 | 29696 | 0.358 | 2862.09 | 1.272 | 100.60 |
| 1024 | 128 | 30720 | 0.361 | 2833.43 | 1.278 | 100.19 |
| 1024 | 128 | 31744 | 0.364 | 2814.79 | 1.287 | 99.47 |
| 1024 | 128 | 32768 | 0.368 | 2785.09 | 1.294 | 98.88 |
| 1024 | 128 | 33792 | 0.370 | 2765.25 | 1.303 | 98.21 |
| 1024 | 128 | 34816 | 0.372 | 2749.23 | 1.311 | 97.67 |
| 1024 | 128 | 35840 | 0.376 | 2723.60 | 1.318 | 97.10 |
| 1024 | 128 | 36864 | 0.380 | 2694.01 | 1.325 | 96.60 |
| 1024 | 128 | 37888 | 0.383 | 2676.10 | 1.334 | 95.98 |
| 1024 | 128 | 38912 | 0.385 | 2657.59 | 1.341 | 95.48 |
| 1024 | 128 | 39936 | 0.389 | 2633.21 | 1.349 | 94.87 |
| 1024 | 128 | 40960 | 0.392 | 2615.03 | 1.356 | 94.38 |
| 1024 | 128 | 41984 | 0.395 | 2590.21 | 1.364 | 93.83 |
| 1024 | 128 | 43008 | 0.398 | 2575.27 | 1.371 | 93.37 |
| 1024 | 128 | 44032 | 0.400 | 2561.88 | 1.381 | 92.71 |
| 1024 | 128 | 45056 | 0.402 | 2544.60 | 1.387 | 92.29 |
| 1024 | 128 | 46080 | 0.407 | 2518.46 | 1.395 | 91.73 |
| 1024 | 128 | 47104 | 0.411 | 2494.47 | 1.403 | 91.26 |
| 1024 | 128 | 48128 | 0.413 | 2478.29 | 1.410 | 90.76 |
| 1024 | 128 | 49152 | 0.416 | 2459.37 | 1.418 | 90.29 |
| 1024 | 128 | 50176 | 0.419 | 2444.44 | 1.427 | 89.67 |
| 1024 | 128 | 51200 | 0.420 | 2440.13 | 1.433 | 89.31 |
| 1024 | 128 | 52224 | 0.425 | 2410.67 | 1.441 | 88.83 |
| 1024 | 128 | 53248 | 0.426 | 2401.83 | 1.449 | 88.34 |
| 1024 | 128 | 54272 | 0.432 | 2372.34 | 1.456 | 87.91 |
| 1024 | 128 | 55296 | 0.434 | 2360.03 | 1.463 | 87.52 |
| 1024 | 128 | 56320 | 0.437 | 2343.49 | 1.471 | 87.03 |
| 1024 | 128 | 57344 | 0.440 | 2329.36 | 1.478 | 86.60 |
| 1024 | 128 | 58368 | 0.442 | 2314.62 | 1.488 | 86.03 |
| 1024 | 128 | 59392 | 0.445 | 2299.02 | 1.494 | 85.67 |
| 1024 | 128 | 60416 | 0.449 | 2281.04 | 1.504 | 85.09 |
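To quantify the gap between the two tables above without eyeballing row by row, a small parser (a hypothetical helper, assuming the column order printed by llama-sweep-bench: PP, TG, N_KV, T_PP s, S_PP t/s, T_TG s, S_TG t/s) can average the speedups at matching depths:

```python
# Parse llama-sweep-bench markdown tables and average the PP/TG speedup
# of run A over run B at matching N_KV depths.

def parse_sweep(md: str) -> dict:
    """Map N_KV -> (S_PP t/s, S_TG t/s) from a sweep-bench markdown table."""
    rows = {}
    for line in md.strip().splitlines():
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) != 7 or not cells[0].isdigit():
            continue  # skip the header and separator lines
        rows[int(cells[2])] = (float(cells[4]), float(cells[6]))
    return rows

def mean_speedup(a: str, b: str) -> tuple:
    """Average S_PP and S_TG ratios of a over b at shared N_KV depths."""
    ra, rb = parse_sweep(a), parse_sweep(b)
    common = sorted(set(ra) & set(rb))
    pp = sum(ra[k][0] / rb[k][0] for k in common) / len(common)
    tg = sum(ra[k][1] / rb[k][1] for k in common) / len(common)
    return pp, tg

# First rows of the ik and mainline tables above, as a tiny example.
ik = "| 1024 | 128 | 0 | 0.230 | 4456.53 | 0.951 | 134.55 |"
ml = "| 1024 | 128 | 0 | 0.271 | 3782.91 | 1.044 | 122.57 |"
pp, tg = mean_speedup(ik, ml)
print(f"PP speedup {pp:.2f}x, TG speedup {tg:.2f}x")  # → PP speedup 1.18x, TG speedup 1.10x
```

Feed it the full tables and it averages over every shared depth instead of just the empty-cache row.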
These are the instructions for compiling:
ik_llama.cpp
# update
git clone https://github.com/ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
git pull
# compile
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_VULKAN=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_CUDA_F16=ON -DGGML_CCACHE=OFF
cmake --build build --config Release -j $(nproc)
# confirm
./build/bin/llama-server --version
version: 4264 (277fc1d2)
built with cc (GCC) 15.2.1 20251112 for x86_64-pc-linux-gnu
# benchmark
# use commands found in details
mainline llama.cpp
# this includes the llama-sweep-bench patch for mainline
# update
git clone --depth 1 --branch ug/port-sweep-bench https://github.com/ubergarm/llama.cpp.git
cd llama.cpp
# compile
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_VULKAN=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_CUDA_F16=ON -DGGML_CCACHE=OFF
cmake --build build --config Release -j $(nproc)
# confirm
git log --pretty=format:"%h%x09%an%x09%ad%x09%s" | head -n 9
$ ./build/bin/llama-server --version
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
version: 8233 (8eb8c6952)
built with GNU 15.2.1 for Linux x86_64
Okay, I'll do the coder test now that it has finished downloading.
Here are the Qwen3-Coder-Next hybrid CPU+GPU results and commands in the details below. Compiled the same as above. ik is winning over mainline by a small but noticeable margin. Once again mainline borks out at the end, either due to the larger CUDA buffer or possibly something with the sweep-bench port.
So given my rig has avx512_vnni, I did expect to see better PP for ik over mainline, which holds. Not sure about a Zen4 rig without it, but PP is likely still marginally better. mainline did pretty well here for TG, and I'm not 100% sure of the arch differences between Qwen3-Coder-Next and the newer qwen35moe models, but I will do one more test after this for Qwen3.5-35B-A3B CPU-only.
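If you want to check whether your own CPU advertises the avx512_vnni flag mentioned above, a quick Linux-only sketch (it just reads the feature flags from `/proc/cpuinfo`):

```python
import os

def has_cpu_flag(flag: str, cpuinfo_path: str = "/proc/cpuinfo") -> bool:
    """Return True if the given feature flag appears in the CPU flags line."""
    with open(cpuinfo_path) as f:
        for line in f:
            if line.startswith("flags"):
                return flag in line.split()
    return False

# Guarded so this only runs where /proc/cpuinfo exists (Linux).
if os.path.exists("/proc/cpuinfo"):
    print("avx512_vnni:", has_cpu_flag("avx512_vnni"))
```

On a 9950X this should report True; older Zen parts without AVX-512 report False.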
👈 Details
ik_llama.cpp main@277fc1d2
./build/bin/llama-sweep-bench \
--model "$model" \
-ctk q8_0 -ctv q8_0 \
-c 69632 \
-ub 4096 -b 4096 \
--merge-qkv \
-ngl 99 \
--n-cpu-moe 30 \
--threads 16 \
--warmup-batch \
-n 128
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 128 | 0 | 2.289 | 1789.36 | 2.000 | 63.99 |
| 4096 | 128 | 4096 | 2.321 | 1764.40 | 2.037 | 62.85 |
| 4096 | 128 | 8192 | 2.361 | 1735.09 | 2.076 | 61.65 |
| 4096 | 128 | 12288 | 2.394 | 1711.00 | 2.135 | 59.96 |
| 4096 | 128 | 16384 | 2.430 | 1685.57 | 2.158 | 59.31 |
| 4096 | 128 | 20480 | 2.472 | 1656.92 | 2.187 | 58.53 |
| 4096 | 128 | 24576 | 2.508 | 1633.27 | 2.227 | 57.46 |
| 4096 | 128 | 28672 | 2.552 | 1604.83 | 2.269 | 56.41 |
| 4096 | 128 | 32768 | 2.586 | 1583.93 | 2.307 | 55.47 |
| 4096 | 128 | 36864 | 2.635 | 1554.49 | 2.337 | 54.78 |
| 4096 | 128 | 40960 | 2.681 | 1528.06 | 2.368 | 54.07 |
| 4096 | 128 | 45056 | 2.716 | 1508.21 | 2.413 | 53.05 |
| 4096 | 128 | 49152 | 2.749 | 1489.91 | 2.433 | 52.61 |
| 4096 | 128 | 53248 | 2.792 | 1467.27 | 2.479 | 51.63 |
| 4096 | 128 | 57344 | 2.842 | 1441.40 | 2.515 | 50.90 |
| 4096 | 128 | 61440 | 2.884 | 1420.47 | 2.540 | 50.40 |
| 4096 | 128 | 65536 | 2.932 | 1397.11 | 2.590 | 49.43 |
mainline llama.cpp master@e68f2fb8 + ug/port-sweep-bench
model=/mnt/astrodata/llm/models/ubergarm/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-Q4_0.gguf
./build/bin/llama-sweep-bench \
--model "$model" \
-ctk q8_0 -ctv q8_0 \
-c 69632 \
-ub 4096 -b 4096 \
-ngl 99 \
--n-cpu-moe 30 \
--threads 16 \
-n 128
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 128 | 0 | 2.567 | 1595.93 | 2.212 | 57.86 |
| 4096 | 128 | 4096 | 2.602 | 1573.95 | 2.232 | 57.35 |
| 4096 | 128 | 8192 | 2.645 | 1548.56 | 2.264 | 56.53 |
| 4096 | 128 | 12288 | 2.686 | 1525.13 | 2.312 | 55.37 |
| 4096 | 128 | 16384 | 2.726 | 1502.79 | 2.346 | 54.56 |
| 4096 | 128 | 20480 | 2.768 | 1479.84 | 2.407 | 53.18 |
| 4096 | 128 | 24576 | 2.807 | 1459.18 | 2.434 | 52.58 |
| 4096 | 128 | 28672 | 2.860 | 1432.29 | 2.469 | 51.84 |
| 4096 | 128 | 32768 | 2.898 | 1413.55 | 2.505 | 51.11 |
| 4096 | 128 | 36864 | 2.945 | 1391.06 | 2.530 | 50.59 |
| 4096 | 128 | 40960 | 2.987 | 1371.13 | 2.564 | 49.93 |
| 4096 | 128 | 45056 | 3.029 | 1352.17 | 2.595 | 49.33 |
| 4096 | 128 | 49152 | 3.066 | 1336.07 | 2.626 | 48.75 |
| 4096 | 128 | 53248 | 3.109 | 1317.44 | 2.664 | 48.05 |
| 4096 | 128 | 57344 | 3.160 | 1296.19 | 2.704 | 47.34 |
| 4096 | 128 | 61440 | 3.199 | 1280.39 | 2.721 | 47.04 |
Okay, last one: CPU-only for Qwen3.5-35B-A3B, showing ik is doing a lot better with the gated delta net CPU implementation. Compiled the same as above but with CUDA off, so it is the CPU-only backend.
👈 Details
ik_llama.cpp main@277fc1d2
./build/bin/llama-sweep-bench \
--model "$model" \
-ctk q8_0 -ctv q8_0 \
-c 69632 \
-ub 1024 -b 2048 \
--merge-qkv \
--threads 16 \
--warmup-batch \
-n 128
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 1024 | 128 | 0 | 1.296 | 790.34 | 5.271 | 24.28 |
| 1024 | 128 | 1024 | 1.358 | 754.32 | 5.141 | 24.90 |
| 1024 | 128 | 2048 | 1.401 | 730.66 | 5.149 | 24.86 |
| 1024 | 128 | 3072 | 1.450 | 706.21 | 5.194 | 24.65 |
| 1024 | 128 | 4096 | 1.493 | 685.70 | 5.221 | 24.52 |
| 1024 | 128 | 5120 | 1.530 | 669.12 | 5.240 | 24.43 |
| 1024 | 128 | 6144 | 1.575 | 650.02 | 5.253 | 24.37 |
| 1024 | 128 | 7168 | 1.599 | 640.30 | 5.287 | 24.21 |
| 1024 | 128 | 8192 | 1.643 | 623.10 | 5.281 | 24.24 |
| 1024 | 128 | 9216 | 1.681 | 609.02 | 5.302 | 24.14 |
| 1024 | 128 | 10240 | 1.792 | 571.54 | 5.332 | 24.01 |
| 1024 | 128 | 11264 | 1.763 | 580.93 | 5.335 | 23.99 |
| 1024 | 128 | 12288 | 1.806 | 566.90 | 5.369 | 23.84 |
| 1024 | 128 | 13312 | 1.847 | 554.44 | 5.397 | 23.72 |
| 1024 | 128 | 14336 | 1.885 | 543.22 | 5.402 | 23.69 |
| 1024 | 128 | 15360 | 1.929 | 530.94 | 5.431 | 23.57 |
| 1024 | 128 | 16384 | 1.980 | 517.04 | 5.440 | 23.53 |
| 1024 | 128 | 17408 | 2.062 | 496.52 | 5.496 | 23.29 |
| 1024 | 128 | 18432 | 2.060 | 497.02 | 5.511 | 23.22 |
| 1024 | 128 | 19456 | 2.087 | 490.75 | 5.568 | 22.99 |
| 1024 | 128 | 20480 | 2.141 | 478.34 | 5.645 | 22.67 |
| 1024 | 128 | 21504 | 2.160 | 474.08 | 5.627 | 22.75 |
| 1024 | 128 | 22528 | 2.224 | 460.40 | 5.634 | 22.72 |
| 1024 | 128 | 23552 | 2.258 | 453.41 | 5.689 | 22.50 |
| 1024 | 128 | 24576 | 2.422 | 422.76 | 5.692 | 22.49 |
| 1024 | 128 | 25600 | 2.327 | 440.10 | 5.683 | 22.52 |
| 1024 | 128 | 26624 | 2.367 | 432.61 | 5.762 | 22.22 |
| 1024 | 128 | 27648 | 2.410 | 424.86 | 5.788 | 22.12 |
| 1024 | 128 | 28672 | 2.444 | 419.04 | 5.817 | 22.00 |
| 1024 | 128 | 29696 | 2.501 | 409.44 | 5.843 | 21.91 |
| 1024 | 128 | 30720 | 2.581 | 396.74 | 5.853 | 21.87 |
| 1024 | 128 | 31744 | 2.610 | 392.41 | 5.853 | 21.87 |
| 1024 | 128 | 32768 | 2.669 | 383.70 | 5.842 | 21.91 |
mainline llama.cpp master@e68f2fb8 + ug/port-sweep-bench
./build/bin/llama-sweep-bench \
--model "$model" \
-ctk q8_0 -ctv q8_0 \
-c 69632 \
-ub 1024 -b 2048 \
--threads 16 \
-n 128
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 1024 | 128 | 0 | 6.153 | 166.41 | 8.284 | 15.45 |
| 1024 | 128 | 1024 | 7.336 | 139.59 | 8.140 | 15.72 |
| 1024 | 128 | 2048 | 7.958 | 128.68 | 8.312 | 15.40 |
| 1024 | 128 | 3072 | 9.048 | 113.17 | 8.603 | 14.88 |
| 1024 | 128 | 4096 | 10.045 | 101.94 | 8.353 | 15.32 |
| 1024 | 128 | 5120 | 10.969 | 93.35 | 8.858 | 14.45 |
| 1024 | 128 | 6144 | 11.710 | 87.45 | 8.739 | 14.65 |
| 1024 | 128 | 7168 | 12.552 | 81.58 | 8.697 | 14.72 |
| 1024 | 128 | 8192 | 13.479 | 75.97 | 9.226 | 13.87 |
| 1024 | 128 | 9216 | 14.620 | 70.04 | 9.239 | 13.85 |
| 1024 | 128 | 10240 | 15.000 | 68.27 | 9.135 | 14.01 |
| 1024 | 128 | 11264 | 16.088 | 63.65 | 9.467 | 13.52 |
| 1024 | 128 | 12288 | 16.675 | 61.41 | 9.158 | 13.98 |
| 1024 | 128 | 13312 | 17.236 | 59.41 | 9.696 | 13.20 |
| 1024 | 128 | 14336 | 18.576 | 55.12 | 9.524 | 13.44 |
| 1024 | 128 | 15360 | 19.520 | 52.46 | 9.855 | 12.99 |
| 1024 | 128 | 16384 | 19.817 | 51.67 | 9.231 | 13.87 |
| 1024 | 128 | 17408 | 19.869 | 51.54 | 9.579 | 13.36 |
| 1024 | 128 | 18432 | 21.962 | 46.63 | 10.553 | 12.13 |
| 1024 | 128 | 19456 | 22.715 | 45.08 | 10.453 | 12.25 |
| 1024 | 128 | 20480 | 23.965 | 42.73 | 10.579 | 12.10 |
| 1024 | 128 | 21504 | 24.021 | 42.63 | 9.781 | 13.09 |
| 1024 | 128 | 22528 | 24.344 | 42.06 | 10.275 | 12.46 |
| 1024 | 128 | 23552 | 24.429 | 41.92 | 10.475 | 12.22 |
| 1024 | 128 | 24576 | 27.235 | 37.60 | 10.987 | 11.65 |
| 1024 | 128 | 25600 | 26.745 | 38.29 | 10.248 | 12.49 |
| 1024 | 128 | 26624 | 27.459 | 37.29 | 10.219 | 12.53 |
| 1024 | 128 | 27648 | 29.335 | 34.91 | 10.275 | 12.46 |
| 1024 | 128 | 28672 | 31.157 | 32.87 | 10.270 | 12.46 |
| 1024 | 128 | 29696 | 32.972 | 31.06 | 12.251 | 10.45 |
| 1024 | 128 | 30720 | 33.706 | 30.38 | 10.442 | 12.26 |
| 1024 | 128 | 31744 | 35.189 | 29.10 | 12.012 | 10.66 |
| 1024 | 128 | 32768 | 34.935 | 29.31 | 12.053 | 10.62 |
Okay, it seems like mainline llama.cpp has a PR improving the chunked delta net CPU implementation that is getting close to merged; I did a 3-way benchmark there if you're interested: https://github.com/ggml-org/llama.cpp/pull/19504#issuecomment-4013706238
Thank you for the detailed benchmarks. As I don't trust synthetic benches, I am using a simple haystack stress test: https://gist.github.com/sabotage3d/3f9f1fc544495e1c2c4ec1ee3eadebf5
I compiled the main branch of ik_llama.cpp; this is the output using the command below:
MODEL_PATH="$HOME/models/Qwen3.5_27B/Qwen3.5-27B-UD-Q5_K_XL.gguf"
VISION_PATH="$HOME/models/Qwen3.5_27B/mmproj-BF16.gguf"
# 4. Launch Server
~/ik_llama.cpp/build/bin/llama-server \
-m "$MODEL_PATH" \
--mmproj "$VISION_PATH" \
--host 0.0.0.0 --port 5000 \
--ctx-size 64000 \
--parallel 1 \
-ub 1024 -b 2048 \
--merge-qkv \
-ngl 99 \
--threads 16 \
--warmup-batch \
-n 128 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--presence-penalty 0.0 \
--repeat-penalty 1.0
======== Prompt cache: cache size: 0, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
INFO [ launch_slot_with_task] slot is processing task | tid="140690707640320" timestamp=1772834996 id_slot=0 id_task=0
======== Cache: cache_size = 0, n_past0 = 0, n_past1 = 0, n_past_prompt1 = 0, n_past2 = 0, n_past_prompt2 = 0
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772834996 id_slot=0 id_task=0 p0=0
slot create_check: id 0 | task 0 | created context checkpoint 1 of 8 (pos_min = 2047, pos_max = 2047, size = 149.643 MiB, took 1646.78 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772834998 id_slot=0 id_task=0 p0=2048
slot create_check: id 0 | task 0 | created context checkpoint 2 of 8 (pos_min = 4095, pos_max = 4095, size = 149.659 MiB, took 1555.92 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772834999 id_slot=0 id_task=0 p0=4096
slot create_check: id 0 | task 0 | created context checkpoint 3 of 8 (pos_min = 6143, pos_max = 6143, size = 149.674 MiB, took 1585.55 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835001 id_slot=0 id_task=0 p0=6144
slot create_check: id 0 | task 0 | created context checkpoint 4 of 8 (pos_min = 8191, pos_max = 8191, size = 149.690 MiB, took 633.89 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835003 id_slot=0 id_task=0 p0=8192
slot create_check: id 0 | task 0 | created context checkpoint 5 of 8 (pos_min = 10239, pos_max = 10239, size = 149.706 MiB, took 683.44 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835004 id_slot=0 id_task=0 p0=10240
slot create_check: id 0 | task 0 | created context checkpoint 6 of 8 (pos_min = 12287, pos_max = 12287, size = 149.721 MiB, took 688.13 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835006 id_slot=0 id_task=0 p0=12288
slot create_check: id 0 | task 0 | created context checkpoint 7 of 8 (pos_min = 14335, pos_max = 14335, size = 149.737 MiB, took 667.95 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835008 id_slot=0 id_task=0 p0=14336
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 16383, pos_max = 16383, size = 149.752 MiB, took 693.84 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835010 id_slot=0 id_task=0 p0=16384
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 2047, pos_max = 2047, size = 149.643 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 18431, pos_max = 18431, size = 149.768 MiB, took 661.93 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835012 id_slot=0 id_task=0 p0=18432
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 4095, pos_max = 4095, size = 149.659 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 20479, pos_max = 20479, size = 149.784 MiB, took 660.57 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835013 id_slot=0 id_task=0 p0=20480
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 6143, pos_max = 6143, size = 149.674 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 22527, pos_max = 22527, size = 149.799 MiB, took 671.89 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835015 id_slot=0 id_task=0 p0=22528
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 8191, pos_max = 8191, size = 149.690 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 24575, pos_max = 24575, size = 149.815 MiB, took 682.74 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835017 id_slot=0 id_task=0 p0=24576
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 10239, pos_max = 10239, size = 149.706 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 26623, pos_max = 26623, size = 149.831 MiB, took 696.47 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835019 id_slot=0 id_task=0 p0=26624
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 12287, pos_max = 12287, size = 149.721 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 28671, pos_max = 28671, size = 149.846 MiB, took 700.38 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835021 id_slot=0 id_task=0 p0=28672
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 14335, pos_max = 14335, size = 149.737 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 30719, pos_max = 30719, size = 149.862 MiB, took 786.73 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835023 id_slot=0 id_task=0 p0=30720
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 16383, pos_max = 16383, size = 149.752 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 32767, pos_max = 32767, size = 149.877 MiB, took 722.32 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835025 id_slot=0 id_task=0 p0=32768
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 18431, pos_max = 18431, size = 149.768 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 34815, pos_max = 34815, size = 149.893 MiB, took 737.62 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835027 id_slot=0 id_task=0 p0=34816
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 20479, pos_max = 20479, size = 149.784 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 36863, pos_max = 36863, size = 149.909 MiB, took 738.45 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835029 id_slot=0 id_task=0 p0=36864
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 22527, pos_max = 22527, size = 149.799 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 38911, pos_max = 38911, size = 149.924 MiB, took 747.90 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835031 id_slot=0 id_task=0 p0=38912
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 24575, pos_max = 24575, size = 149.815 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 40436, pos_max = 40436, size = 149.936 MiB, took 1520.62 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835032 id_slot=0 id_task=0 p0=40437
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 26623, pos_max = 26623, size = 149.831 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 40442, pos_max = 40442, size = 149.936 MiB, took 37.77 ms)
slot print_timing: id 0 | task 0 |
prompt eval time = 36534.43 ms / 40442 tokens ( 0.90 ms per token, 1106.96 tokens per second)
eval time = 5436.31 ms / 50 tokens ( 108.73 ms per token, 9.20 tokens per second)
total time = 41970.74 ms / 40492 tokens
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 28671, pos_max = 28671, size = 149.846 MiB)
INFO [ log_server_request] request | tid="140680000237568" timestamp=1772835038 remote_addr="127.0.0.1" remote_port=60706 status=200 method="POST" path="/v1/chat/completions" params={}
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 40490, pos_max = 40490, size = 149.936 MiB, took 29.89 ms)
INFO [ release_slots] slot released | tid="140690707640320" timestamp=1772835038 id_slot=0 id_task=0 n_ctx=64000 n_past=40491 n_system_tokens=0 n_cache_tokens=40491 truncated=false
INFO [ slots_idle] all slots are idle | tid="140690707640320" timestamp=1772835038
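As a sanity check, the t/s figures in the `print_timing` block above follow directly from tokens over elapsed time:

```python
# Reproduce the server's throughput figures from its raw timing fields.

def tokens_per_second(n_tokens: int, elapsed_ms: float) -> float:
    """Throughput in tokens/s given a token count and elapsed milliseconds."""
    return n_tokens / (elapsed_ms / 1000.0)

# Values copied from the prompt-eval and eval lines in the log above.
print(f"prompt eval: {tokens_per_second(40442, 36534.43):.2f} t/s")  # ≈ 1106.96
print(f"eval:        {tokens_per_second(50, 5436.31):.2f} t/s")      # ≈ 9.20
```

So the ~1107 t/s prefill here is in the same ballpark as the ~1230 t/s peak reported for the Coder setup at the top of the thread, while the 9.2 t/s generation is far below it; only 50 tokens were generated, so treat that figure cautiously.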
This is the same stress test on the llama.cpp main branch, with the command and log:
MODEL_PATH="$HOME/models/Qwen3.5_27B/Qwen3.5-27B-UD-Q5_K_XL.gguf"
VISION_PATH="$HOME/models/Qwen3.5_27B/mmproj-BF16.gguf"
# 4. Launch Server
~/llama.cpp/build/bin/llama-server \
-m "$MODEL_PATH" \
--mmproj "$VISION_PATH" \
--host 0.0.0.0 --port 5000 \
--ctx-size 64000 \
--parallel 1 \
--fit on \
--flash-attn on \
--no-mmap \
--jinja \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--presence-penalty 0.0 \
--repeat-penalty 1.0
main: model loaded
main: server is listening on http://0.0.0.0:5000
main: starting the main loop...
srv update_slots: all slots are idle
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 64000, n_keep = 0, task.n_tokens = 40444
slot update_slots: id 0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.050638
slot update_slots: id 0 | task 0 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 2048, progress = 0.101276
slot update_slots: id 0 | task 0 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 6144, batch.n_tokens = 2048, progress = 0.151914
slot update_slots: id 0 | task 0 | n_tokens = 6144, memory_seq_rm [6144, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 2048, progress = 0.202552
slot update_slots: id 0 | task 0 | n_tokens = 8192, memory_seq_rm [8192, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 10240, batch.n_tokens = 2048, progress = 0.253190
slot update_slots: id 0 | task 0 | n_tokens = 10240, memory_seq_rm [10240, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 12288, batch.n_tokens = 2048, progress = 0.303828
slot update_slots: id 0 | task 0 | n_tokens = 12288, memory_seq_rm [12288, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 14336, batch.n_tokens = 2048, progress = 0.354465
slot update_slots: id 0 | task 0 | n_tokens = 14336, memory_seq_rm [14336, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 16384, batch.n_tokens = 2048, progress = 0.405103
slot update_slots: id 0 | task 0 | n_tokens = 16384, memory_seq_rm [16384, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 18432, batch.n_tokens = 2048, progress = 0.455741
slot update_slots: id 0 | task 0 | n_tokens = 18432, memory_seq_rm [18432, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 20480, batch.n_tokens = 2048, progress = 0.506379
slot update_slots: id 0 | task 0 | n_tokens = 20480, memory_seq_rm [20480, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 22528, batch.n_tokens = 2048, progress = 0.557017
slot update_slots: id 0 | task 0 | n_tokens = 22528, memory_seq_rm [22528, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 24576, batch.n_tokens = 2048, progress = 0.607655
slot update_slots: id 0 | task 0 | n_tokens = 24576, memory_seq_rm [24576, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 26624, batch.n_tokens = 2048, progress = 0.658293
slot update_slots: id 0 | task 0 | n_tokens = 26624, memory_seq_rm [26624, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 28672, batch.n_tokens = 2048, progress = 0.708931
slot update_slots: id 0 | task 0 | n_tokens = 28672, memory_seq_rm [28672, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 30720, batch.n_tokens = 2048, progress = 0.759569
slot update_slots: id 0 | task 0 | n_tokens = 30720, memory_seq_rm [30720, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 32768, batch.n_tokens = 2048, progress = 0.810207
slot update_slots: id 0 | task 0 | n_tokens = 32768, memory_seq_rm [32768, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 34816, batch.n_tokens = 2048, progress = 0.860845
slot update_slots: id 0 | task 0 | n_tokens = 34816, memory_seq_rm [34816, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 36864, batch.n_tokens = 2048, progress = 0.911483
slot update_slots: id 0 | task 0 | n_tokens = 36864, memory_seq_rm [36864, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 38912, batch.n_tokens = 2048, progress = 0.962120
slot update_slots: id 0 | task 0 | n_tokens = 38912, memory_seq_rm [38912, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 39932, batch.n_tokens = 1020, progress = 0.987341
slot update_slots: id 0 | task 0 | n_tokens = 39932, memory_seq_rm [39932, end)
slot init_sampler: id 0 | task 0 | init sampler, took 3.66 ms, tokens: text = 40444, total = 40444
slot update_slots: id 0 | task 0 | created context checkpoint 1 of 8 (pos_min = 39931, pos_max = 39931, n_tokens = 39932, size = 149.626 MiB)
slot update_slots: id 0 | task 0 | prompt processing done, n_tokens = 40444, batch.n_tokens = 512
slot print_timing: id 0 | task 0 |
prompt eval time = 41380.08 ms / 40444 tokens ( 1.02 ms per token, 977.38 tokens per second)
eval time = 1862.34 ms / 50 tokens ( 37.25 ms per token, 26.85 tokens per second)
total time = 43242.42 ms / 40494 tokens
slot release: id 0 | task 0 | stop processing: n_tokens = 40493, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
This is ik_llama.cpp on Qwen 3 Coder Next 80b using my stress test.
~/ik_llama.cpp/build/bin/llama-server \
-m ~/models/Qwen3-Coder-Next-Q3_K_M.gguf \
--host 0.0.0.0 --port 5000 \
--ctx-size 64000 \
--parallel 1 \
-ngl 99 \
--n-cpu-moe 28 \
-t 16 \
-b 4096 \
-ub 4096 \
--merge-qkv \
--mlock \
--no-mmap \
--warmup-batch \
-n 128 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--temp 0.8 \
--min-p 0.05 \
--presence-penalty 1.1 \
--dry-multiplier 0.5 \
--dry-base 1.75 \
--dry-allowed-length 2 \
--dry-penalty-last-n 4096 \
======== Prompt cache: cache size: 0, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
INFO [ launch_slot_with_task] slot is processing task | tid="139703933071360" timestamp=1772836818 id_slot=0 id_task=0
======== Cache: cache_size = 0, n_past0 = 0, n_past1 = 0, n_past_prompt1 = 0, n_past2 = 0, n_past_prompt2 = 0
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836818 id_slot=0 id_task=0 p0=0
slot create_check: id 0 | task 0 | created context checkpoint 1 of 8 (pos_min = 4095, pos_max = 4095, size = 75.408 MiB, took 501.29 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836821 id_slot=0 id_task=0 p0=4096
slot create_check: id 0 | task 0 | created context checkpoint 2 of 8 (pos_min = 8191, pos_max = 8191, size = 75.439 MiB, took 521.09 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836823 id_slot=0 id_task=0 p0=8192
slot create_check: id 0 | task 0 | created context checkpoint 3 of 8 (pos_min = 12287, pos_max = 12287, size = 75.471 MiB, took 547.12 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836826 id_slot=0 id_task=0 p0=12288
slot create_check: id 0 | task 0 | created context checkpoint 4 of 8 (pos_min = 16383, pos_max = 16383, size = 75.502 MiB, took 567.33 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836828 id_slot=0 id_task=0 p0=16384
slot create_check: id 0 | task 0 | created context checkpoint 5 of 8 (pos_min = 20479, pos_max = 20479, size = 75.533 MiB, took 580.02 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836830 id_slot=0 id_task=0 p0=20480
slot create_check: id 0 | task 0 | created context checkpoint 6 of 8 (pos_min = 24575, pos_max = 24575, size = 75.564 MiB, took 609.27 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836833 id_slot=0 id_task=0 p0=24576
slot create_check: id 0 | task 0 | created context checkpoint 7 of 8 (pos_min = 28671, pos_max = 28671, size = 75.596 MiB, took 638.78 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836836 id_slot=0 id_task=0 p0=28672
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 32767, pos_max = 32767, size = 75.627 MiB, took 651.56 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836838 id_slot=0 id_task=0 p0=32768
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 4095, pos_max = 4095, size = 75.408 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 36863, pos_max = 36863, size = 75.658 MiB, took 662.48 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836841 id_slot=0 id_task=0 p0=36864
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 8191, pos_max = 8191, size = 75.439 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 40436, pos_max = 40436, size = 75.685 MiB, took 616.27 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836843 id_slot=0 id_task=0 p0=40437
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 12287, pos_max = 12287, size = 75.471 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 40442, pos_max = 40442, size = 75.685 MiB, took 31.51 ms)
slot print_timing: id 0 | task 0 |
prompt eval time = 25380.13 ms / 40442 tokens ( 0.63 ms per token, 1593.45 tokens per second)
eval time = 1015.06 ms / 50 tokens ( 20.30 ms per token, 49.26 tokens per second)
total time = 26395.19 ms / 40492 tokens
INFO [ log_server_request] request | tid="139658447953920" timestamp=1772836845 remote_addr="127.0.0.1" remote_port=38406 status=200 method="POST" path="/v1/chat/completions" params={}
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 16383, pos_max = 16383, size = 75.502 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 40490, pos_max = 40490, size = 75.686 MiB, took 30.76 ms)
INFO [ release_slots] slot released | tid="139703933071360" timestamp=1772836845 id_slot=0 id_task=0 n_ctx=64000 n_past=40491 n_system_tokens=0 n_cache_tokens=40491 truncated=false
INFO [ slots_idle] all slots are idle | tid="139703933071360" timestamp=1772836845
This is llama.cpp on Qwen 3 Coder Next 80b using my stress test.
~/llama.cpp/build/bin/llama-server \
-m ~/models/Qwen3-Coder-Next-Q3_K_M.gguf \
--host 0.0.0.0 --port 5000 \
--ctx-size 64000 \
--parallel 1 \
-ngl 99 \
--n-cpu-moe 28 \
-t 16 \
-b 4096 \
-ub 2048 \
--flash-attn on \
--mlock \
--no-mmap \
--jinja \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--temp 0.8 \
--min-p 0.05 \
--presence-penalty 1.1 \
--dry-multiplier 0.5 \
--dry-base 1.75 \
--dry-allowed-length 2 \
--dry-penalty-last-n 4096 \
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> penalties -> dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 64000, n_keep = 0, task.n_tokens = 40442
slot update_slots: id 0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 4096, progress = 0.101281
slot update_slots: id 0 | task 0 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 4096, progress = 0.202562
slot update_slots: id 0 | task 0 | n_tokens = 8192, memory_seq_rm [8192, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 12288, batch.n_tokens = 4096, progress = 0.303843
slot update_slots: id 0 | task 0 | n_tokens = 12288, memory_seq_rm [12288, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 16384, batch.n_tokens = 4096, progress = 0.405123
slot update_slots: id 0 | task 0 | n_tokens = 16384, memory_seq_rm [16384, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 20480, batch.n_tokens = 4096, progress = 0.506404
slot update_slots: id 0 | task 0 | n_tokens = 20480, memory_seq_rm [20480, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 24576, batch.n_tokens = 4096, progress = 0.607685
slot update_slots: id 0 | task 0 | n_tokens = 24576, memory_seq_rm [24576, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 28672, batch.n_tokens = 4096, progress = 0.708966
slot update_slots: id 0 | task 0 | n_tokens = 28672, memory_seq_rm [28672, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 32768, batch.n_tokens = 4096, progress = 0.810247
slot update_slots: id 0 | task 0 | n_tokens = 32768, memory_seq_rm [32768, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 36864, batch.n_tokens = 4096, progress = 0.911528
slot update_slots: id 0 | task 0 | n_tokens = 36864, memory_seq_rm [36864, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 39930, batch.n_tokens = 3066, progress = 0.987340
slot update_slots: id 0 | task 0 | n_tokens = 39930, memory_seq_rm [39930, end)
slot init_sampler: id 0 | task 0 | init sampler, took 4.46 ms, tokens: text = 40442, total = 40442
slot update_slots: id 0 | task 0 | created context checkpoint 1 of 8 (pos_min = 39929, pos_max = 39929, n_tokens = 39930, size = 75.376 MiB)
slot update_slots: id 0 | task 0 | prompt processing done, n_tokens = 40442, batch.n_tokens = 512
slot print_timing: id 0 | task 0 |
prompt eval time = 28284.04 ms / 40442 tokens ( 0.70 ms per token, 1429.85 tokens per second)
eval time = 1218.06 ms / 50 tokens ( 24.36 ms per token, 41.05 tokens per second)
total time = 29502.10 ms / 40492 tokens
slot release: id 0 | task 0 | stop processing: n_tokens = 40491, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
Based on my tests, Qwen 3 Coder Next gains about 10 t/s on ik_llama, but a dense model like Qwen 3.5 27B gets very bad performance on ik_llama. Is there anything missing in my flags that could improve that?
Thanks for sharing the full logs and your test script. Let me take a look; it seems like you are testing two quants, each on ik vs mainline:
Qwen3.5-27B-UD-Q5_K_XL
ik
prompt eval time = 36534.43 ms / 40442 tokens ( 0.90 ms per token, 1106.96 tokens per second)
eval time = 5436.31 ms / 50 tokens ( 108.73 ms per token, 9.20 tokens per second)
total time = 41970.74 ms / 40492 tokens
mainline
prompt eval time = 41380.08 ms / 40444 tokens ( 1.02 ms per token, 977.38 tokens per second)
eval time = 1862.34 ms / 50 tokens ( 37.25 ms per token, 26.85 tokens per second)
total time = 43242.42 ms / 40494 tokens
Qwen3-Coder-Next-Q3_K_M.gguf
ik
prompt eval time = 25380.13 ms / 40442 tokens ( 0.63 ms per token, 1593.45 tokens per second)
eval time = 1015.06 ms / 50 tokens ( 20.30 ms per token, 49.26 tokens per second)
total time = 26395.19 ms / 40492 tokens
mainline
prompt eval time = 28284.04 ms / 40442 tokens ( 0.70 ms per token, 1429.85 tokens per second)
eval time = 1218.06 ms / 50 tokens ( 24.36 ms per token, 41.05 tokens per second)
total time = 29502.10 ms / 40492 tokens
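As a sanity check on the numbers above, the tokens-per-second figures follow directly from the raw timings llama-server prints. A quick sketch (the function name is mine, just for illustration):

```python
def tokens_per_second(time_ms: float, n_tokens: int) -> float:
    """Convert a total elapsed time in milliseconds and a token count
    into the tokens/second figure llama-server prints."""
    return n_tokens / (time_ms / 1000.0)

# ik prompt eval: 25380.13 ms over 40442 tokens -> ~1593.45 t/s
print(round(tokens_per_second(25380.13, 40442), 2))
# mainline eval: 1218.06 ms over 50 tokens -> ~41.05 t/s
print(round(tokens_per_second(1218.06, 50), 2))
```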
Firstly, both ik and I recommend against using UD quants that downcast bf16 to f16, as described here: https://github.com/ikawrakow/ik_llama.cpp/?tab=readme-ov-file#tldr . Unless you check every value in every tensor, the downcast can clip values, despite the potential speed benefit. Honestly, I don't use 16-bit tensors in any of my quants: q8_0 is plenty good, doesn't have the clipping issue, and is half the size.
But specific to the tests, you say:
Qwen 3.5 27B very bad performance on ik_llama, anything that I am missing in my flags that can improve that?
From what I see, ik has faster prompt processing than mainline in both cases.
Given you are only generating 50 tokens of output, that is not enough, imo, to draw a clear distinction between the two. Perhaps you could update your test to generate a longer output?
The llama-sweep-bench, while "synthetic", still exercises the underlying kernel implementations and is representative of potential speeds.
But yes, all the other arguments, caching, and such do come into play in actual use.
I noticed someone opened an issue on ik_llama.cpp discussing potential speed differences between llama-sweep-bench and actual use with samplers etc. It might be of interest to you: https://github.com/ikawrakow/ik_llama.cpp/issues/1390
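If you want to compare the raw kernels yourself, a llama-sweep-bench run might look something like this. The exact flag set is an assumption on my part (it mostly mirrors the shared llama args your server commands already use), so verify against --help on your build:

```shell
# Hypothetical sweep over context depths, reusing the offload settings
# from the server commands above. Check flag names with --help first.
~/ik_llama.cpp/build/bin/llama-sweep-bench \
  -m ~/models/Qwen3-Coder-Next-Q3_K_M.gguf \
  -c 32768 \
  -ngl 99 \
  --n-cpu-moe 28 \
  -t 16 \
  -b 4096 -ub 4096 \
  -fa
```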
Hi, sorry for the delay. I’m still trying to figure out something very odd. I’m running a new test where the mainline llama.cpp consistently clocks around 29 t/s, while ik_llama sometimes runs at 29, 27, 25, or even 15 t/s on subsequent runs.
The funny thing is that the test itself is a bit broken, but the input and output remain consistent. This is the test if you run it a couple of times on ik_llama: https://gist.github.com/sabotage3d/3f9f1fc544495e1c2c4ec1ee3eadebf5#file-haystrack_stress_weird-py
Is the llama-server doing any prompt caching? I wonder if you have to explicitly disable that, e.g. -cram 0 --ctx-checkpoints 0 or something (I'm not 100% sure, just spitballing).
Also, a heads-up PSA: some of the newly re-uploaded Qwen3.5 quants that use mainline's pre-fused up|gate tensors are broken on ik. ik can do the fusion on the fly with -muge; there are some comparison benchmarks I just did here: https://github.com/ikawrakow/ik_llama.cpp/pull/1403
I had some more time on my hands today, so I recompiled both ik and mainline llama.cpp from latest and got some good new results. I also wrote a new randomized-token bench that feeds in and generates a fixed number of tokens, so we get a more accurate result. It's a big win for prefill speed, which is very impressive; I think I will use ik more in the future. I will also try the non-UD models next. I didn't see much difference with -muge.
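The randomized-token bench idea can be sketched like this. This is a hypothetical minimal version of the approach, not the actual gist linked above: build a seeded random prompt of a fixed word count so the input size stays constant across runs while each run defeats prompt caching, then send it with a fixed max_tokens.

```python
import random

def make_random_prompt(n_words: int, seed: int = 0) -> str:
    """Build a randomized filler prompt of exactly n_words words so the
    input size is constant across runs but the content varies by seed,
    defeating prompt caching."""
    rng = random.Random(seed)
    vocab = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot",
             "golf", "hotel", "india", "juliet"]
    return " ".join(rng.choice(vocab) for _ in range(n_words))

prompt = make_random_prompt(5000, seed=42)
print(len(prompt.split()))
```

In a real run you would POST this prompt to the server's /v1/chat/completions endpoint with a fixed max_tokens (e.g. 2000) and read the printed timings.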
model: Qwen3.5-27B-UD-Q5_K_XL.gguf
mainline
prompt eval time = 7002.15 ms / 7308 tokens ( 0.96 ms per token, 1043.68 tokens per second)
eval time = 65670.21 ms / 2000 tokens ( 32.84 ms per token, 30.46 tokens per second)
total time = 72672.36 ms / 9308 tokens
ik
prompt eval time = 6166.72 ms / 7308 tokens ( 0.84 ms per token, 1185.07 tokens per second)
eval time = 69941.98 ms / 2000 tokens ( 34.97 ms per token, 28.60 tokens per second)
total time = 76108.70 ms / 9308 tokens
Yeah, it's confusing, as many new mainline quants are "pre-fused" and so don't need -muge on ik. I recommend using --merge-qkv -muge with my quants; the fallback is sane if it isn't supported for any other reason.
Glad you're getting more throughput!


