Improving Qwen3 Coder Next 80b performance on ik_llama vs llama.cpp

#6
by sabotage3d - opened

Hi, as discussed on Reddit,
I tried many configurations to get Qwen3-Coder-Next running faster on ik_llama than on llama.cpp, but without success.
This is the fastest command I’ve achieved so far with llama.cpp on my hardware configuration. I think on q8 I was getting even faster prefills.

Configuration for Qwen3-Coder-Next on RTX 3090 + 5950X 64GB DDR4

Peak Performance: ~42 t/s Gen | ~1230 t/s Prompt

~/llama.cpp/build/bin/llama-server \
  -m ~/models/Qwen3-Coder-Next-Q3_K_M.gguf \
  --host 0.0.0.0 --port 5000 \
  --ctx-size 64000 \
  --parallel 1 \
  -ngl 99 \
  --n-cpu-moe 28 \
  -t 16 \
  -b 4096 \
  -ub 2048 \
  --flash-attn on \
  --mlock \
  --no-mmap \
  --jinja \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --temp 0.8 \
  --min-p 0.05 \
  --presence-penalty 1.1 \
  --dry-multiplier 0.5 \
  --dry-base 1.75 \
  --dry-allowed-length 2 \
  --dry-penalty-last-n 4096

I also have a simpler configuration for Qwen 3.5 27B. As it fits fully on the GPU, would there be any improvement with ik_llama at all?
This is the command I am using:

MODEL_PATH="$HOME/models/Qwen3.5_27B/Qwen3.5-27B-UD-Q5_K_XL.gguf"
VISION_PATH="$HOME/models/Qwen3.5_27B/mmproj-BF16.gguf"

# 4. Launch Server

~/llama.cpp/build/bin/llama-server \
  -m "$MODEL_PATH" \
  --mmproj "$VISION_PATH" \
  --host 0.0.0.0 --port 5000 \
  --ctx-size 64000 \
  --parallel 1 \
  --fit on \
  --flash-attn on \
  --no-mmap \
  --jinja \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.00 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0

Heya, thanks for dropping in! Let's take a look; I see a few questions in here:

  1. Running Qwen3-Coder-Next ~3ish BPW quant hybrid CPU+GPU mainline vs ik

I'm downloading my Q4_0, 44.355 GiB (4.782 BPW), to run tests locally on my similar rig: AMD 9950X and a single 3090 Ti FE. I'll get back to you with some commands and llama-sweep-bench results comparing ik and mainline.

> I think on q8 I was getting even faster prefills

Yes, Q8_0, despite being larger, is often faster for PP. This is because PP is typically compute bound, whereas TG is typically memory-bandwidth bound. As such, Q8_0, being one of the first legacy quantization types, has a very fast kernel computation-wise, leading to faster PP performance (but slower TG, as it is bigger). Tradeoffs.
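To make the tradeoff concrete, here is a back-of-envelope sketch of the two regimes. All numbers (model sizes, bandwidth, TFLOPS) are purely illustrative assumptions, not measurements from any rig in this thread:

```python
# TG is memory-bandwidth bound: each generated token streams the (active)
# weights from memory once, so token rate scales inversely with quant size.
def tg_tokens_per_sec(model_gib: float, mem_bw_gib_s: float) -> float:
    return mem_bw_gib_s / model_gib

# PP is compute bound: large batches amortize the weight reads, so the
# dequantize+matmul kernel throughput dominates; a simpler quant format
# like Q8_0 can sustain a higher effective TFLOPS.
def pp_tokens_per_sec(flops_per_token: float, effective_tflops: float) -> float:
    return effective_tflops * 1e12 / flops_per_token

# Hypothetical ~4 BPW vs ~8 BPW quants of the same model on the same GPU:
q4_tg = tg_tokens_per_sec(model_gib=44.4, mem_bw_gib_s=900.0)
q8_tg = tg_tokens_per_sec(model_gib=84.9, mem_bw_gib_s=900.0)
q4_pp = pp_tokens_per_sec(flops_per_token=6e9, effective_tflops=12.0)
q8_pp = pp_tokens_per_sec(flops_per_token=6e9, effective_tflops=15.0)
print(q8_pp > q4_pp, q8_tg < q4_tg)  # faster PP, slower TG
```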

> Qwen 3.5 27B. As it fits fully on the GPU would there be any improvement with ik_llama at all?

EDIT Ooops, you are talking about the 27B dense model; however, I ran the 35B MoE below. You can still use these same examples to try it yourself.

I'll run a llama-sweep-bench on my rig as well for this using my Q4_0 19.776 GiB (4.901 BPW) testing full offload.

I know that with 2x GPUs ik_llama.cpp would benefit due to -sm graph, but I'm not sure about single GPU, so we will find out!

@sabotage3d

Okay, on my 3090 with a Qwen3.5-35B-A3B Q4_0 custom quant, it looks like ik_llama.cpp is faster for full offload in a single-GPU situation:

sweep-bench-Qwen3.5-35B-A3B-ik-vs-mainline

👈 Details

ik_llama.cpp main@277fc1d2

model=/mnt/astrodata/llm/models/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-Q4_0.gguf
./build/bin/llama-sweep-bench \
  --model "$model" \
  -ctk q8_0 -ctv q8_0 \
  -c 69632 \
  -ub 1024 -b 2048 \
  --merge-qkv \
  -ngl 99 \
  --threads 1 \
  --warmup-batch \
  -n 128
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 128 0 0.230 4456.53 0.951 134.55
1024 128 1024 0.233 4397.70 0.939 136.35
1024 128 2048 0.235 4361.99 0.944 135.57
1024 128 3072 0.237 4314.59 0.949 134.92
1024 128 4096 0.239 4283.01 0.952 134.45
1024 128 5120 0.243 4221.25 0.959 133.40
1024 128 6144 0.247 4147.44 0.965 132.65
1024 128 7168 0.249 4116.70 0.974 131.41
1024 128 8192 0.251 4084.81 0.981 130.47
1024 128 9216 0.254 4027.31 0.993 128.89
1024 128 10240 0.256 3999.14 1.002 127.69
1024 128 11264 0.258 3967.02 1.025 124.92
1024 128 12288 0.261 3917.65 1.032 124.01
1024 128 13312 0.263 3891.82 1.036 123.54
1024 128 14336 0.268 3825.69 1.040 123.10
1024 128 15360 0.270 3791.81 1.045 122.43
1024 128 16384 0.273 3750.57 1.051 121.80
1024 128 17408 0.274 3736.53 1.061 120.69
1024 128 18432 0.276 3704.71 1.067 120.00
1024 128 19456 0.280 3659.21 1.073 119.24
1024 128 20480 0.282 3634.29 1.080 118.47
1024 128 21504 0.283 3616.62 1.098 116.53
1024 128 22528 0.287 3573.82 1.107 115.61
1024 128 23552 0.290 3537.00 1.113 115.00
1024 128 24576 0.292 3503.15 1.119 114.41
1024 128 25600 0.295 3469.56 1.125 113.82
1024 128 26624 0.297 3450.91 1.131 113.18
1024 128 27648 0.301 3406.72 1.135 112.79
1024 128 28672 0.303 3384.64 1.142 112.08
1024 128 29696 0.305 3358.41 1.147 111.58
1024 128 30720 0.309 3312.55 1.154 110.93
1024 128 31744 0.310 3300.11 1.162 110.17
1024 128 32768 0.314 3264.49 1.183 108.24
1024 128 33792 0.316 3243.48 1.187 107.80
1024 128 34816 0.318 3219.86 1.191 107.44
1024 128 35840 0.322 3184.30 1.198 106.87
1024 128 36864 0.325 3152.46 1.202 106.47
1024 128 37888 0.326 3143.22 1.209 105.89
1024 128 38912 0.330 3107.39 1.214 105.47
1024 128 39936 0.333 3078.47 1.221 104.83
1024 128 40960 0.333 3074.35 1.224 104.59
1024 128 41984 0.337 3042.41 1.232 103.92
1024 128 43008 0.340 3013.83 1.247 102.68
1024 128 44032 0.341 3004.07 1.260 101.58
1024 128 45056 0.345 2968.48 1.263 101.31
1024 128 46080 0.346 2956.39 1.269 100.88
1024 128 47104 0.349 2933.70 1.275 100.42
1024 128 48128 0.353 2904.14 1.280 100.00
1024 128 49152 0.354 2894.14 1.283 99.80
1024 128 50176 0.356 2872.86 1.291 99.13
1024 128 51200 0.358 2859.93 1.297 98.66
1024 128 52224 0.363 2821.41 1.305 98.11
1024 128 53248 0.363 2817.62 1.309 97.75
1024 128 54272 0.367 2786.95 1.331 96.18
1024 128 55296 0.370 2767.88 1.337 95.76
1024 128 56320 0.372 2753.92 1.342 95.40
1024 128 57344 0.377 2719.77 1.347 95.05
1024 128 58368 0.376 2720.04 1.353 94.61
1024 128 59392 0.381 2688.69 1.358 94.29
1024 128 60416 0.383 2671.91 1.362 93.97
1024 128 61440 0.385 2662.26 1.369 93.52
1024 128 62464 0.391 2621.37 1.374 93.15
1024 128 63488 0.393 2605.50 1.381 92.65
1024 128 64512 0.394 2601.22 1.399 91.48
1024 128 65536 0.395 2595.04 1.407 90.94
1024 128 66560 0.398 2575.22 1.413 90.59
1024 128 67584 0.399 2566.49 1.420 90.13
1024 128 68608 0.404 2537.74 1.425 89.83

mainline llama.cpp master@e68f2fb8 + ug/port-sweep-bench

model=/mnt/astrodata/llm/models/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-Q4_0.gguf
./build/bin/llama-sweep-bench \
  --model "$model" \
  -ctk q8_0 -ctv q8_0 \
  -c 69632 \
  -ub 1024 -b 2048 \
  -ngl 99 \
  --threads 1 \
  -n 128
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 128 0 0.271 3782.91 1.044 122.57
1024 128 1024 0.275 3727.53 1.047 122.31
1024 128 2048 0.278 3689.74 1.053 121.59
1024 128 3072 0.281 3648.56 1.063 120.44
1024 128 4096 0.281 3639.17 1.070 119.64
1024 128 5120 0.286 3579.62 1.077 118.88
1024 128 6144 0.290 3525.14 1.087 117.76
1024 128 7168 0.294 3485.97 1.097 116.65
1024 128 8192 0.296 3464.54 1.107 115.59
1024 128 9216 0.300 3417.43 1.118 114.53
1024 128 10240 0.302 3391.42 1.123 113.97
1024 128 11264 0.304 3366.24 1.134 112.87
1024 128 12288 0.306 3343.82 1.144 111.90
1024 128 13312 0.310 3306.75 1.155 110.82
1024 128 14336 0.315 3251.99 1.162 110.11
1024 128 15360 0.317 3231.79 1.171 109.35
1024 128 16384 0.320 3202.35 1.174 108.99
1024 128 17408 0.322 3183.65 1.181 108.37
1024 128 18432 0.325 3152.32 1.190 107.57
1024 128 19456 0.329 3110.96 1.198 106.84
1024 128 20480 0.331 3090.42 1.203 106.39
1024 128 21504 0.333 3076.22 1.212 105.64
1024 128 22528 0.337 3040.19 1.221 104.81
1024 128 23552 0.339 3017.76 1.227 104.35
1024 128 24576 0.343 2983.70 1.234 103.74
1024 128 25600 0.346 2955.51 1.241 103.12
1024 128 26624 0.349 2935.42 1.247 102.67
1024 128 27648 0.352 2906.81 1.257 101.80
1024 128 28672 0.355 2884.84 1.265 101.20
1024 128 29696 0.358 2862.09 1.272 100.60
1024 128 30720 0.361 2833.43 1.278 100.19
1024 128 31744 0.364 2814.79 1.287 99.47
1024 128 32768 0.368 2785.09 1.294 98.88
1024 128 33792 0.370 2765.25 1.303 98.21
1024 128 34816 0.372 2749.23 1.311 97.67
1024 128 35840 0.376 2723.60 1.318 97.10
1024 128 36864 0.380 2694.01 1.325 96.60
1024 128 37888 0.383 2676.10 1.334 95.98
1024 128 38912 0.385 2657.59 1.341 95.48
1024 128 39936 0.389 2633.21 1.349 94.87
1024 128 40960 0.392 2615.03 1.356 94.38
1024 128 41984 0.395 2590.21 1.364 93.83
1024 128 43008 0.398 2575.27 1.371 93.37
1024 128 44032 0.400 2561.88 1.381 92.71
1024 128 45056 0.402 2544.60 1.387 92.29
1024 128 46080 0.407 2518.46 1.395 91.73
1024 128 47104 0.411 2494.47 1.403 91.26
1024 128 48128 0.413 2478.29 1.410 90.76
1024 128 49152 0.416 2459.37 1.418 90.29
1024 128 50176 0.419 2444.44 1.427 89.67
1024 128 51200 0.420 2440.13 1.433 89.31
1024 128 52224 0.425 2410.67 1.441 88.83
1024 128 53248 0.426 2401.83 1.449 88.34
1024 128 54272 0.432 2372.34 1.456 87.91
1024 128 55296 0.434 2360.03 1.463 87.52
1024 128 56320 0.437 2343.49 1.471 87.03
1024 128 57344 0.440 2329.36 1.478 86.60
1024 128 58368 0.442 2314.62 1.488 86.03
1024 128 59392 0.445 2299.02 1.494 85.67
1024 128 60416 0.449 2281.04 1.504 85.09
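To quantify the gap, a quick sketch that parses the empty-cache (N_KV=0) rows copied from the two tables above, using the column order llama-sweep-bench prints:

```python
# Compare the N_KV=0 rows from the ik and mainline tables above.
# Columns: PP TG N_KV T_PP S_PP T_TG S_TG (speeds in tokens/sec).
ik_row       = "1024 128 0 0.230 4456.53 0.951 134.55"
mainline_row = "1024 128 0 0.271 3782.91 1.044 122.57"

def speeds(row: str) -> tuple[float, float]:
    cols = row.split()
    return float(cols[4]), float(cols[6])  # (S_PP, S_TG) in t/s

ik_pp, ik_tg = speeds(ik_row)
ml_pp, ml_tg = speeds(mainline_row)
print(f"PP {ik_pp / ml_pp:.2f}x, TG {ik_tg / ml_tg:.2f}x")  # PP 1.18x, TG 1.10x
```

The same helper works on any pair of rows if you want to see how the gap evolves as the KV cache fills.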

These are the instructions for compiling:

ik_llama.cpp

# update
git clone https://github.com/ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
git pull

# compile
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_VULKAN=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_CUDA_F16=ON -DGGML_CCACHE=OFF
cmake --build build --config Release -j $(nproc)

# confirm
./build/bin/llama-server --version
version: 4264 (277fc1d2)
built with cc (GCC) 15.2.1 20251112 for x86_64-pc-linux-gnu

# benchmark
# use commands found in details

mainline llama.cpp

# this includes the llama-sweep-bench patch for mainline
# update
git clone --depth 1 --branch ug/port-sweep-bench https://github.com/ubergarm/llama.cpp.git
cd llama.cpp

# compile
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_VULKAN=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_CUDA_F16=ON -DGGML_CCACHE=OFF
cmake --build build --config Release -j $(nproc)

# confirm
git log --pretty=format:"%h%x09%an%x09%ad%x09%s" | head -n 9
$ ./build/bin/llama-server --version
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
version: 8233 (8eb8c6952)
built with GNU 15.2.1 for Linux x86_64

Okay, I'll do the coder test now that it has finished downloading.

@sabotage3d

Here are the Qwen3-Coder-Next hybrid CPU+GPU results, with commands in the details below. Compiled same as above. ik is winning over mainline by a small but noticeable margin. Once again mainline borks out at the end, either due to the larger CUDA buffer or possibly something with the sweep-bench port.

So given my rig has avx512_vnni, I did expect to see better PP for ik over mainline, which held true. Not sure about a Zen4 rig without it, but PP is likely only marginally better. Mainline did pretty well here for TG, and I'm not 100% sure about the arch differences between Qwen3-Coder-Next and the newer qwen35moe models, but I will do one more test after this for Qwen3.5-35B-A3B CPU-only.

sweep-bench-Qwen3-Coder-Next-ik-vs-mainline

👈 Details

ik_llama.cpp main@277fc1d2

./build/bin/llama-sweep-bench \
  --model "$model" \
  -ctk q8_0 -ctv q8_0 \
  -c 69632 \
  -ub 4096 -b 4096 \
  --merge-qkv \
  -ngl 99 \
  --n-cpu-moe 30 \
  --threads 16 \
  --warmup-batch \
  -n 128
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 128 0 2.289 1789.36 2.000 63.99
4096 128 4096 2.321 1764.40 2.037 62.85
4096 128 8192 2.361 1735.09 2.076 61.65
4096 128 12288 2.394 1711.00 2.135 59.96
4096 128 16384 2.430 1685.57 2.158 59.31
4096 128 20480 2.472 1656.92 2.187 58.53
4096 128 24576 2.508 1633.27 2.227 57.46
4096 128 28672 2.552 1604.83 2.269 56.41
4096 128 32768 2.586 1583.93 2.307 55.47
4096 128 36864 2.635 1554.49 2.337 54.78
4096 128 40960 2.681 1528.06 2.368 54.07
4096 128 45056 2.716 1508.21 2.413 53.05
4096 128 49152 2.749 1489.91 2.433 52.61
4096 128 53248 2.792 1467.27 2.479 51.63
4096 128 57344 2.842 1441.40 2.515 50.90
4096 128 61440 2.884 1420.47 2.540 50.40
4096 128 65536 2.932 1397.11 2.590 49.43

mainline llama.cpp master@e68f2fb8 + ug/port-sweep-bench

model=/mnt/astrodata/llm/models/ubergarm/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-Q4_0.gguf
./build/bin/llama-sweep-bench \
  --model "$model" \
  -ctk q8_0 -ctv q8_0 \
  -c 69632 \
  -ub 4096 -b 4096 \
  -ngl 99 \
  --n-cpu-moe 30 \
  --threads 16 \
  -n 128
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 128 0 2.567 1595.93 2.212 57.86
4096 128 4096 2.602 1573.95 2.232 57.35
4096 128 8192 2.645 1548.56 2.264 56.53
4096 128 12288 2.686 1525.13 2.312 55.37
4096 128 16384 2.726 1502.79 2.346 54.56
4096 128 20480 2.768 1479.84 2.407 53.18
4096 128 24576 2.807 1459.18 2.434 52.58
4096 128 28672 2.860 1432.29 2.469 51.84
4096 128 32768 2.898 1413.55 2.505 51.11
4096 128 36864 2.945 1391.06 2.530 50.59
4096 128 40960 2.987 1371.13 2.564 49.93
4096 128 45056 3.029 1352.17 2.595 49.33
4096 128 49152 3.066 1336.07 2.626 48.75
4096 128 53248 3.109 1317.44 2.664 48.05
4096 128 57344 3.160 1296.19 2.704 47.34
4096 128 61440 3.199 1280.39 2.721 47.04
Owner

Okay, last one: CPU-only for Qwen3.5-35B-A3B, showing ik is doing a lot better with the gated delta net CPU implementation. Compiled same as above but with CUDA off, so it is the CPU-only backend.

sweep-bench-Qwen3.5-35B-A3B-ik-vs-mainline-CPU

👈 Details

ik_llama.cpp main@277fc1d2

./build/bin/llama-sweep-bench \
  --model "$model" \
  -ctk q8_0 -ctv q8_0 \
  -c 69632 \
  -ub 1024 -b 2048 \
  --merge-qkv \
  --threads 16 \
  --warmup-batch \
  -n 128
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 128 0 1.296 790.34 5.271 24.28
1024 128 1024 1.358 754.32 5.141 24.90
1024 128 2048 1.401 730.66 5.149 24.86
1024 128 3072 1.450 706.21 5.194 24.65
1024 128 4096 1.493 685.70 5.221 24.52
1024 128 5120 1.530 669.12 5.240 24.43
1024 128 6144 1.575 650.02 5.253 24.37
1024 128 7168 1.599 640.30 5.287 24.21
1024 128 8192 1.643 623.10 5.281 24.24
1024 128 9216 1.681 609.02 5.302 24.14
1024 128 10240 1.792 571.54 5.332 24.01
1024 128 11264 1.763 580.93 5.335 23.99
1024 128 12288 1.806 566.90 5.369 23.84
1024 128 13312 1.847 554.44 5.397 23.72
1024 128 14336 1.885 543.22 5.402 23.69
1024 128 15360 1.929 530.94 5.431 23.57
1024 128 16384 1.980 517.04 5.440 23.53
1024 128 17408 2.062 496.52 5.496 23.29
1024 128 18432 2.060 497.02 5.511 23.22
1024 128 19456 2.087 490.75 5.568 22.99
1024 128 20480 2.141 478.34 5.645 22.67
1024 128 21504 2.160 474.08 5.627 22.75
1024 128 22528 2.224 460.40 5.634 22.72
1024 128 23552 2.258 453.41 5.689 22.50
1024 128 24576 2.422 422.76 5.692 22.49
1024 128 25600 2.327 440.10 5.683 22.52
1024 128 26624 2.367 432.61 5.762 22.22
1024 128 27648 2.410 424.86 5.788 22.12
1024 128 28672 2.444 419.04 5.817 22.00
1024 128 29696 2.501 409.44 5.843 21.91
1024 128 30720 2.581 396.74 5.853 21.87
1024 128 31744 2.610 392.41 5.853 21.87
1024 128 32768 2.669 383.70 5.842 21.91

mainline llama.cpp master@e68f2fb8 + ug/port-sweep-bench

./build/bin/llama-sweep-bench \
  --model "$model" \
  -ctk q8_0 -ctv q8_0 \
  -c 69632 \
  -ub 1024 -b 2048 \
  --threads 16 \
  -n 128
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 128 0 6.153 166.41 8.284 15.45
1024 128 1024 7.336 139.59 8.140 15.72
1024 128 2048 7.958 128.68 8.312 15.40
1024 128 3072 9.048 113.17 8.603 14.88
1024 128 4096 10.045 101.94 8.353 15.32
1024 128 5120 10.969 93.35 8.858 14.45
1024 128 6144 11.710 87.45 8.739 14.65
1024 128 7168 12.552 81.58 8.697 14.72
1024 128 8192 13.479 75.97 9.226 13.87
1024 128 9216 14.620 70.04 9.239 13.85
1024 128 10240 15.000 68.27 9.135 14.01
1024 128 11264 16.088 63.65 9.467 13.52
1024 128 12288 16.675 61.41 9.158 13.98
1024 128 13312 17.236 59.41 9.696 13.20
1024 128 14336 18.576 55.12 9.524 13.44
1024 128 15360 19.520 52.46 9.855 12.99
1024 128 16384 19.817 51.67 9.231 13.87
1024 128 17408 19.869 51.54 9.579 13.36
1024 128 18432 21.962 46.63 10.553 12.13
1024 128 19456 22.715 45.08 10.453 12.25
1024 128 20480 23.965 42.73 10.579 12.10
1024 128 21504 24.021 42.63 9.781 13.09
1024 128 22528 24.344 42.06 10.275 12.46
1024 128 23552 24.429 41.92 10.475 12.22
1024 128 24576 27.235 37.60 10.987 11.65
1024 128 25600 26.745 38.29 10.248 12.49
1024 128 26624 27.459 37.29 10.219 12.53
1024 128 27648 29.335 34.91 10.275 12.46
1024 128 28672 31.157 32.87 10.270 12.46
1024 128 29696 32.972 31.06 12.251 10.45
1024 128 30720 33.706 30.38 10.442 12.26
1024 128 31744 35.189 29.10 12.012 10.66
1024 128 32768 34.935 29.31 12.053 10.62
Owner

Okay, it seems like mainline llama.cpp has a PR improving the chunked delta net CPU implementation that is getting close to merged. I did a 3-way benchmark there if you're interested: https://github.com/ggml-org/llama.cpp/pull/19504#issuecomment-4013706238

Thank you for the detailed benchmarks. As I don't trust synthetic benches, I am using a simple haystack stress test: https://gist.github.com/sabotage3d/3f9f1fc544495e1c2c4ec1ee3eadebf5
This is the output from the compiled main branch of ik_llama.cpp, using the command below:

MODEL_PATH="$HOME/models/Qwen3.5_27B/Qwen3.5-27B-UD-Q5_K_XL.gguf"
VISION_PATH="$HOME/models/Qwen3.5_27B/mmproj-BF16.gguf"

# 4. Launch Server
~/ik_llama.cpp/build/bin/llama-server \
  -m "$MODEL_PATH" \
  --mmproj "$VISION_PATH" \
  --host 0.0.0.0 --port 5000 \
  --ctx-size 64000 \
  --parallel 1 \
  -ub 1024 -b 2048 \
  --merge-qkv \
  -ngl 99 \
  --threads 16 \
  --warmup-batch \
  -n 128 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.00 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0
======== Prompt cache: cache size: 0, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
INFO [   launch_slot_with_task] slot is processing task | tid="140690707640320" timestamp=1772834996 id_slot=0 id_task=0
======== Cache: cache_size = 0, n_past0 =  0, n_past1 =  0, n_past_prompt1 = 0,  n_past2 =  0, n_past_prompt2 =  0
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772834996 id_slot=0 id_task=0 p0=0
slot create_check: id  0 | task 0 | created context checkpoint 1 of 8 (pos_min = 2047, pos_max = 2047, size = 149.643 MiB, took 1646.78 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772834998 id_slot=0 id_task=0 p0=2048
slot create_check: id  0 | task 0 | created context checkpoint 2 of 8 (pos_min = 4095, pos_max = 4095, size = 149.659 MiB, took 1555.92 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772834999 id_slot=0 id_task=0 p0=4096
slot create_check: id  0 | task 0 | created context checkpoint 3 of 8 (pos_min = 6143, pos_max = 6143, size = 149.674 MiB, took 1585.55 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835001 id_slot=0 id_task=0 p0=6144
slot create_check: id  0 | task 0 | created context checkpoint 4 of 8 (pos_min = 8191, pos_max = 8191, size = 149.690 MiB, took 633.89 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835003 id_slot=0 id_task=0 p0=8192
slot create_check: id  0 | task 0 | created context checkpoint 5 of 8 (pos_min = 10239, pos_max = 10239, size = 149.706 MiB, took 683.44 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835004 id_slot=0 id_task=0 p0=10240
slot create_check: id  0 | task 0 | created context checkpoint 6 of 8 (pos_min = 12287, pos_max = 12287, size = 149.721 MiB, took 688.13 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835006 id_slot=0 id_task=0 p0=12288
slot create_check: id  0 | task 0 | created context checkpoint 7 of 8 (pos_min = 14335, pos_max = 14335, size = 149.737 MiB, took 667.95 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835008 id_slot=0 id_task=0 p0=14336
slot create_check: id  0 | task 0 | created context checkpoint 8 of 8 (pos_min = 16383, pos_max = 16383, size = 149.752 MiB, took 693.84 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835010 id_slot=0 id_task=0 p0=16384
slot create_check: id  0 | task 0 | erasing old context checkpoint (pos_min = 2047, pos_max = 2047, size = 149.643 MiB)
slot create_check: id  0 | task 0 | created context checkpoint 8 of 8 (pos_min = 18431, pos_max = 18431, size = 149.768 MiB, took 661.93 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835012 id_slot=0 id_task=0 p0=18432
slot create_check: id  0 | task 0 | erasing old context checkpoint (pos_min = 4095, pos_max = 4095, size = 149.659 MiB)
slot create_check: id  0 | task 0 | created context checkpoint 8 of 8 (pos_min = 20479, pos_max = 20479, size = 149.784 MiB, took 660.57 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835013 id_slot=0 id_task=0 p0=20480
slot create_check: id  0 | task 0 | erasing old context checkpoint (pos_min = 6143, pos_max = 6143, size = 149.674 MiB)
slot create_check: id  0 | task 0 | created context checkpoint 8 of 8 (pos_min = 22527, pos_max = 22527, size = 149.799 MiB, took 671.89 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835015 id_slot=0 id_task=0 p0=22528
slot create_check: id  0 | task 0 | erasing old context checkpoint (pos_min = 8191, pos_max = 8191, size = 149.690 MiB)
slot create_check: id  0 | task 0 | created context checkpoint 8 of 8 (pos_min = 24575, pos_max = 24575, size = 149.815 MiB, took 682.74 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835017 id_slot=0 id_task=0 p0=24576
slot create_check: id  0 | task 0 | erasing old context checkpoint (pos_min = 10239, pos_max = 10239, size = 149.706 MiB)
slot create_check: id  0 | task 0 | created context checkpoint 8 of 8 (pos_min = 26623, pos_max = 26623, size = 149.831 MiB, took 696.47 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835019 id_slot=0 id_task=0 p0=26624
slot create_check: id  0 | task 0 | erasing old context checkpoint (pos_min = 12287, pos_max = 12287, size = 149.721 MiB)
slot create_check: id  0 | task 0 | created context checkpoint 8 of 8 (pos_min = 28671, pos_max = 28671, size = 149.846 MiB, took 700.38 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835021 id_slot=0 id_task=0 p0=28672
slot create_check: id  0 | task 0 | erasing old context checkpoint (pos_min = 14335, pos_max = 14335, size = 149.737 MiB)
slot create_check: id  0 | task 0 | created context checkpoint 8 of 8 (pos_min = 30719, pos_max = 30719, size = 149.862 MiB, took 786.73 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835023 id_slot=0 id_task=0 p0=30720
slot create_check: id  0 | task 0 | erasing old context checkpoint (pos_min = 16383, pos_max = 16383, size = 149.752 MiB)
slot create_check: id  0 | task 0 | created context checkpoint 8 of 8 (pos_min = 32767, pos_max = 32767, size = 149.877 MiB, took 722.32 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835025 id_slot=0 id_task=0 p0=32768
slot create_check: id  0 | task 0 | erasing old context checkpoint (pos_min = 18431, pos_max = 18431, size = 149.768 MiB)
slot create_check: id  0 | task 0 | created context checkpoint 8 of 8 (pos_min = 34815, pos_max = 34815, size = 149.893 MiB, took 737.62 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835027 id_slot=0 id_task=0 p0=34816
slot create_check: id  0 | task 0 | erasing old context checkpoint (pos_min = 20479, pos_max = 20479, size = 149.784 MiB)
slot create_check: id  0 | task 0 | created context checkpoint 8 of 8 (pos_min = 36863, pos_max = 36863, size = 149.909 MiB, took 738.45 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835029 id_slot=0 id_task=0 p0=36864
slot create_check: id  0 | task 0 | erasing old context checkpoint (pos_min = 22527, pos_max = 22527, size = 149.799 MiB)
slot create_check: id  0 | task 0 | created context checkpoint 8 of 8 (pos_min = 38911, pos_max = 38911, size = 149.924 MiB, took 747.90 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835031 id_slot=0 id_task=0 p0=38912
slot create_check: id  0 | task 0 | erasing old context checkpoint (pos_min = 24575, pos_max = 24575, size = 149.815 MiB)
slot create_check: id  0 | task 0 | created context checkpoint 8 of 8 (pos_min = 40436, pos_max = 40436, size = 149.936 MiB, took 1520.62 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835032 id_slot=0 id_task=0 p0=40437
slot create_check: id  0 | task 0 | erasing old context checkpoint (pos_min = 26623, pos_max = 26623, size = 149.831 MiB)
slot create_check: id  0 | task 0 | created context checkpoint 8 of 8 (pos_min = 40442, pos_max = 40442, size = 149.936 MiB, took 37.77 ms)
slot print_timing: id  0 | task 0 | 
prompt eval time =   36534.43 ms / 40442 tokens (    0.90 ms per token,  1106.96 tokens per second)
       eval time =    5436.31 ms /    50 tokens (  108.73 ms per token,     9.20 tokens per second)
      total time =   41970.74 ms / 40492 tokens
slot create_check: id  0 | task 0 | erasing old context checkpoint (pos_min = 28671, pos_max = 28671, size = 149.846 MiB)
INFO [      log_server_request] request | tid="140680000237568" timestamp=1772835038 remote_addr="127.0.0.1" remote_port=60706 status=200 method="POST" path="/v1/chat/completions" params={}
slot create_check: id  0 | task 0 | created context checkpoint 8 of 8 (pos_min = 40490, pos_max = 40490, size = 149.936 MiB, took 29.89 ms)
INFO [           release_slots] slot released | tid="140690707640320" timestamp=1772835038 id_slot=0 id_task=0 n_ctx=64000 n_past=40491 n_system_tokens=0 n_cache_tokens=40491 truncated=false
INFO [              slots_idle] all slots are idle | tid="140690707640320" timestamp=1772835038

This is the same stress test on the llama.cpp main branch; command and log below.

MODEL_PATH="$HOME/models/Qwen3.5_27B/Qwen3.5-27B-UD-Q5_K_XL.gguf"
VISION_PATH="$HOME/models/Qwen3.5_27B/mmproj-BF16.gguf"

# 4. Launch Server
~/llama.cpp/build/bin/llama-server \
  -m "$MODEL_PATH" \
  --mmproj "$VISION_PATH" \
  --host 0.0.0.0 --port 5000 \
  --ctx-size 64000 \
  --parallel 1 \
  --fit on \
  --flash-attn on \
  --no-mmap \
  --jinja \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.00 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0
main: model loaded
main: server is listening on http://0.0.0.0:5000
main: starting the main loop...
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-constructed
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist 
slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 64000, n_keep = 0, task.n_tokens = 40444
slot update_slots: id  0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.050638
slot update_slots: id  0 | task 0 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 2048, progress = 0.101276
slot update_slots: id  0 | task 0 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 6144, batch.n_tokens = 2048, progress = 0.151914
slot update_slots: id  0 | task 0 | n_tokens = 6144, memory_seq_rm [6144, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 2048, progress = 0.202552
slot update_slots: id  0 | task 0 | n_tokens = 8192, memory_seq_rm [8192, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 10240, batch.n_tokens = 2048, progress = 0.253190
slot update_slots: id  0 | task 0 | n_tokens = 10240, memory_seq_rm [10240, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 12288, batch.n_tokens = 2048, progress = 0.303828
slot update_slots: id  0 | task 0 | n_tokens = 12288, memory_seq_rm [12288, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 14336, batch.n_tokens = 2048, progress = 0.354465
slot update_slots: id  0 | task 0 | n_tokens = 14336, memory_seq_rm [14336, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 16384, batch.n_tokens = 2048, progress = 0.405103
slot update_slots: id  0 | task 0 | n_tokens = 16384, memory_seq_rm [16384, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 18432, batch.n_tokens = 2048, progress = 0.455741
slot update_slots: id  0 | task 0 | n_tokens = 18432, memory_seq_rm [18432, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 20480, batch.n_tokens = 2048, progress = 0.506379
slot update_slots: id  0 | task 0 | n_tokens = 20480, memory_seq_rm [20480, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 22528, batch.n_tokens = 2048, progress = 0.557017
slot update_slots: id  0 | task 0 | n_tokens = 22528, memory_seq_rm [22528, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 24576, batch.n_tokens = 2048, progress = 0.607655
slot update_slots: id  0 | task 0 | n_tokens = 24576, memory_seq_rm [24576, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 26624, batch.n_tokens = 2048, progress = 0.658293
slot update_slots: id  0 | task 0 | n_tokens = 26624, memory_seq_rm [26624, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 28672, batch.n_tokens = 2048, progress = 0.708931
slot update_slots: id  0 | task 0 | n_tokens = 28672, memory_seq_rm [28672, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 30720, batch.n_tokens = 2048, progress = 0.759569
slot update_slots: id  0 | task 0 | n_tokens = 30720, memory_seq_rm [30720, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 32768, batch.n_tokens = 2048, progress = 0.810207
slot update_slots: id  0 | task 0 | n_tokens = 32768, memory_seq_rm [32768, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 34816, batch.n_tokens = 2048, progress = 0.860845
slot update_slots: id  0 | task 0 | n_tokens = 34816, memory_seq_rm [34816, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 36864, batch.n_tokens = 2048, progress = 0.911483
slot update_slots: id  0 | task 0 | n_tokens = 36864, memory_seq_rm [36864, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 38912, batch.n_tokens = 2048, progress = 0.962120
slot update_slots: id  0 | task 0 | n_tokens = 38912, memory_seq_rm [38912, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 39932, batch.n_tokens = 1020, progress = 0.987341
slot update_slots: id  0 | task 0 | n_tokens = 39932, memory_seq_rm [39932, end)
slot init_sampler: id  0 | task 0 | init sampler, took 3.66 ms, tokens: text = 40444, total = 40444
slot update_slots: id  0 | task 0 | created context checkpoint 1 of 8 (pos_min = 39931, pos_max = 39931, n_tokens = 39932, size = 149.626 MiB)
slot update_slots: id  0 | task 0 | prompt processing done, n_tokens = 40444, batch.n_tokens = 512
slot print_timing: id  0 | task 0 | 
prompt eval time =   41380.08 ms / 40444 tokens (    1.02 ms per token,   977.38 tokens per second)
       eval time =    1862.34 ms /    50 tokens (   37.25 ms per token,    26.85 tokens per second)
      total time =   43242.42 ms / 40494 tokens
slot      release: id  0 | task 0 | stop processing: n_tokens = 40493, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200

This is ik_llama.cpp on Qwen 3 Coder Next 80b using my stress test.

~/ik_llama.cpp/build/bin/llama-server \
  -m ~/models/Qwen3-Coder-Next-Q3_K_M.gguf \
  --host 0.0.0.0 --port 5000 \
  --ctx-size 64000 \
  --parallel 1 \
  -ngl 99 \
  --n-cpu-moe 28 \
  -t 16 \
  -b 4096 \
  -ub 4096 \
  --merge-qkv \
  --mlock \
  --no-mmap \
  --warmup-batch \
  -n 128 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --temp 0.8 \
  --min-p 0.05 \
  --presence-penalty 1.1 \
  --dry-multiplier 0.5 \
  --dry-base 1.75 \
  --dry-allowed-length 2 \
  --dry-penalty-last-n 4096 \
======== Prompt cache: cache size: 0, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
INFO [   launch_slot_with_task] slot is processing task | tid="139703933071360" timestamp=1772836818 id_slot=0 id_task=0
======== Cache: cache_size = 0, n_past0 =  0, n_past1 =  0, n_past_prompt1 = 0,  n_past2 =  0, n_past_prompt2 =  0
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836818 id_slot=0 id_task=0 p0=0
slot create_check: id  0 | task 0 | created context checkpoint 1 of 8 (pos_min = 4095, pos_max = 4095, size = 75.408 MiB, took 501.29 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836821 id_slot=0 id_task=0 p0=4096
slot create_check: id  0 | task 0 | created context checkpoint 2 of 8 (pos_min = 8191, pos_max = 8191, size = 75.439 MiB, took 521.09 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836823 id_slot=0 id_task=0 p0=8192
slot create_check: id  0 | task 0 | created context checkpoint 3 of 8 (pos_min = 12287, pos_max = 12287, size = 75.471 MiB, took 547.12 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836826 id_slot=0 id_task=0 p0=12288
slot create_check: id  0 | task 0 | created context checkpoint 4 of 8 (pos_min = 16383, pos_max = 16383, size = 75.502 MiB, took 567.33 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836828 id_slot=0 id_task=0 p0=16384
slot create_check: id  0 | task 0 | created context checkpoint 5 of 8 (pos_min = 20479, pos_max = 20479, size = 75.533 MiB, took 580.02 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836830 id_slot=0 id_task=0 p0=20480
slot create_check: id  0 | task 0 | created context checkpoint 6 of 8 (pos_min = 24575, pos_max = 24575, size = 75.564 MiB, took 609.27 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836833 id_slot=0 id_task=0 p0=24576
slot create_check: id  0 | task 0 | created context checkpoint 7 of 8 (pos_min = 28671, pos_max = 28671, size = 75.596 MiB, took 638.78 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836836 id_slot=0 id_task=0 p0=28672
slot create_check: id  0 | task 0 | created context checkpoint 8 of 8 (pos_min = 32767, pos_max = 32767, size = 75.627 MiB, took 651.56 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836838 id_slot=0 id_task=0 p0=32768
slot create_check: id  0 | task 0 | erasing old context checkpoint (pos_min = 4095, pos_max = 4095, size = 75.408 MiB)
slot create_check: id  0 | task 0 | created context checkpoint 8 of 8 (pos_min = 36863, pos_max = 36863, size = 75.658 MiB, took 662.48 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836841 id_slot=0 id_task=0 p0=36864
slot create_check: id  0 | task 0 | erasing old context checkpoint (pos_min = 8191, pos_max = 8191, size = 75.439 MiB)
slot create_check: id  0 | task 0 | created context checkpoint 8 of 8 (pos_min = 40436, pos_max = 40436, size = 75.685 MiB, took 616.27 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836843 id_slot=0 id_task=0 p0=40437
slot create_check: id  0 | task 0 | erasing old context checkpoint (pos_min = 12287, pos_max = 12287, size = 75.471 MiB)
slot create_check: id  0 | task 0 | created context checkpoint 8 of 8 (pos_min = 40442, pos_max = 40442, size = 75.685 MiB, took 31.51 ms)
slot print_timing: id  0 | task 0 | 
prompt eval time =   25380.13 ms / 40442 tokens (    0.63 ms per token,  1593.45 tokens per second)
       eval time =    1015.06 ms /    50 tokens (   20.30 ms per token,    49.26 tokens per second)
      total time =   26395.19 ms / 40492 tokens
INFO [      log_server_request] request | tid="139658447953920" timestamp=1772836845 remote_addr="127.0.0.1" remote_port=38406 status=200 method="POST" path="/v1/chat/completions" params={}
slot create_check: id  0 | task 0 | erasing old context checkpoint (pos_min = 16383, pos_max = 16383, size = 75.502 MiB)
slot create_check: id  0 | task 0 | created context checkpoint 8 of 8 (pos_min = 40490, pos_max = 40490, size = 75.686 MiB, took 30.76 ms)
INFO [           release_slots] slot released | tid="139703933071360" timestamp=1772836845 id_slot=0 id_task=0 n_ctx=64000 n_past=40491 n_system_tokens=0 n_cache_tokens=40491 truncated=false
INFO [              slots_idle] all slots are idle | tid="139703933071360" timestamp=1772836845

This is llama.cpp on Qwen 3 Coder Next 80b using my stress test.

~/llama.cpp/build/bin/llama-server \
  -m ~/models/Qwen3-Coder-Next-Q3_K_M.gguf \
  --host 0.0.0.0 --port 5000 \
  --ctx-size 64000 \
  --parallel 1 \
  -ngl 99 \
  --n-cpu-moe 28 \
  -t 16 \
  -b 4096 \
  -ub 2048 \
  --flash-attn on \
  --mlock \
  --no-mmap \
  --jinja \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --temp 0.8 \
  --min-p 0.05 \
  --presence-penalty 1.1 \
  --dry-multiplier 0.5 \
  --dry-base 1.75 \
  --dry-allowed-length 2 \
  --dry-penalty-last-n 4096 \
srv  params_from_: Chat format: peg-constructed
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist 
slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 64000, n_keep = 0, task.n_tokens = 40442
slot update_slots: id  0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 4096, progress = 0.101281
slot update_slots: id  0 | task 0 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 4096, progress = 0.202562
slot update_slots: id  0 | task 0 | n_tokens = 8192, memory_seq_rm [8192, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 12288, batch.n_tokens = 4096, progress = 0.303843
slot update_slots: id  0 | task 0 | n_tokens = 12288, memory_seq_rm [12288, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 16384, batch.n_tokens = 4096, progress = 0.405123
slot update_slots: id  0 | task 0 | n_tokens = 16384, memory_seq_rm [16384, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 20480, batch.n_tokens = 4096, progress = 0.506404
slot update_slots: id  0 | task 0 | n_tokens = 20480, memory_seq_rm [20480, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 24576, batch.n_tokens = 4096, progress = 0.607685
slot update_slots: id  0 | task 0 | n_tokens = 24576, memory_seq_rm [24576, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 28672, batch.n_tokens = 4096, progress = 0.708966
slot update_slots: id  0 | task 0 | n_tokens = 28672, memory_seq_rm [28672, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 32768, batch.n_tokens = 4096, progress = 0.810247
slot update_slots: id  0 | task 0 | n_tokens = 32768, memory_seq_rm [32768, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 36864, batch.n_tokens = 4096, progress = 0.911528
slot update_slots: id  0 | task 0 | n_tokens = 36864, memory_seq_rm [36864, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 39930, batch.n_tokens = 3066, progress = 0.987340
slot update_slots: id  0 | task 0 | n_tokens = 39930, memory_seq_rm [39930, end)
slot init_sampler: id  0 | task 0 | init sampler, took 4.46 ms, tokens: text = 40442, total = 40442
slot update_slots: id  0 | task 0 | created context checkpoint 1 of 8 (pos_min = 39929, pos_max = 39929, n_tokens = 39930, size = 75.376 MiB)
slot update_slots: id  0 | task 0 | prompt processing done, n_tokens = 40442, batch.n_tokens = 512
slot print_timing: id  0 | task 0 | 
prompt eval time =   28284.04 ms / 40442 tokens (    0.70 ms per token,  1429.85 tokens per second)
       eval time =    1218.06 ms /    50 tokens (   24.36 ms per token,    41.05 tokens per second)
      total time =   29502.10 ms / 40492 tokens
slot      release: id  0 | task 0 | stop processing: n_tokens = 40491, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200

Based on my tests, Qwen 3 Coder Next shows about a 10 t/s improvement on ik_llama, but a dense model like Qwen 3.5 27B gets very bad performance on ik_llama. Is there anything I am missing in my flags that could improve that?

Owner

@sabotage3d

Thanks for sharing full logs and your test script. Let me take a look; it seems like you are testing two quants, each on ik vs mainline:

Qwen3.5-27B-UD-Q5_K_XL

ik

prompt eval time =   36534.43 ms / 40442 tokens (    0.90 ms per token,  1106.96 tokens per second)
       eval time =    5436.31 ms /    50 tokens (  108.73 ms per token,     9.20 tokens per second)
      total time =   41970.74 ms / 40492 tokens

mainline

prompt eval time =   41380.08 ms / 40444 tokens (    1.02 ms per token,   977.38 tokens per second)
       eval time =    1862.34 ms /    50 tokens (   37.25 ms per token,    26.85 tokens per second)
      total time =   43242.42 ms / 40494 tokens

Qwen3-Coder-Next-Q3_K_M.gguf

ik

prompt eval time =   25380.13 ms / 40442 tokens (    0.63 ms per token,  1593.45 tokens per second)
       eval time =    1015.06 ms /    50 tokens (   20.30 ms per token,    49.26 tokens per second)
      total time =   26395.19 ms / 40492 tokens

mainline

prompt eval time =   28284.04 ms / 40442 tokens (    0.70 ms per token,  1429.85 tokens per second)
       eval time =    1218.06 ms /    50 tokens (   24.36 ms per token,    41.05 tokens per second)
      total time =   29502.10 ms / 40492 tokens
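
As a quick sanity check, the t/s figures in these logs are just token counts divided by elapsed time, which also gives the relative prompt-processing speedup directly:

```python
# Sanity-check the throughput numbers quoted above: tokens per second is
# simply tokens processed divided by elapsed seconds.
def tok_per_sec(n_tokens, elapsed_ms):
    return n_tokens / (elapsed_ms / 1000.0)

# ik prompt eval: 40442 tokens in 25380.13 ms -> ~1593.45 t/s
ik_pp = tok_per_sec(40442, 25380.13)
# mainline prompt eval: 40442 tokens in 28284.04 ms -> ~1429.85 t/s
main_pp = tok_per_sec(40442, 28284.04)
# relative PP speedup of ik over mainline for this quant: ~11%
speedup = ik_pp / main_pp - 1.0
```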

Firstly, both ik and I recommend against using UD quants that downcast bf16 to f16, as described here: https://github.com/ikawrakow/ik_llama.cpp/?tab=readme-ov-file#tldr . Unless you check all values in all tensors, there can be clipping during the downcast despite the potential speed benefits. Honestly, I don't use 16-bit tensors in any of my quants, as q8_0 is plenty good, doesn't have the clipping issue, and is half the size.
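
The clipping concern is easy to demonstrate with the stdlib alone: bf16 shares float32's exponent range, while f16 tops out at 65504, so a bf16 value beyond that overflows when downcast. A minimal sketch (the `to_bf16` helper here simulates bf16 by bit truncation and is illustrative, not the actual conversion code in either project):

```python
import struct

def to_bf16(x):
    """Truncate a float to bfloat16 by keeping the top 16 bits of its
    float32 representation (bf16 = f32 with a shortened mantissa)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

x = 70000.0          # representable in bf16, out of range for float16
bf16_x = to_bf16(x)  # -> 69632.0: small mantissa error, but no clipping

try:
    struct.pack("<e", x)  # "e" = IEEE half precision; 70000 > 65504
    clipped = False
except OverflowError:     # the value cannot be represented in f16
    clipped = True
```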

But specific to the tests, you say:

Qwen 3.5 27B very bad performance on ik_llama, anything that I am missing in my flags that can improve that?

From what I see, ik has faster prompt processing than mainline in both cases.

Given you are only generating 50 tokens of output, that is not enough, IMO, to make a clear distinction between the two. Perhaps you could update your test to generate a longer output?
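
For reference, one way to force a longer generation is to set `max_tokens` in the body of the request sent to the `/v1/chat/completions` endpoint both servers expose (a hedged sketch; the model name is a placeholder and the port matches the `--port 5000` flag from the commands above):

```python
import json

# Request a 2000-token generation from the OpenAI-compatible endpoint seen
# in the logs above (POST /v1/chat/completions). llama-server serves
# whatever model it was launched with, so the "model" field is nominal.
payload = {
    "model": "local",
    "messages": [{"role": "user", "content": "Write a long summary."}],
    "max_tokens": 2000,  # instead of the ~50 tokens the earlier test produced
}
body = json.dumps(payload).encode("utf-8")
# Send with e.g. urllib.request.Request(
#     "http://127.0.0.1:5000/v1/chat/completions", data=body,
#     headers={"Content-Type": "application/json"}, method="POST")
```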

llama-sweep-bench, while "synthetic", still exercises the underlying kernel implementations and is representative of potential speeds.

But yes all the other arguments, caching, and stuff do come into play for actual use.

@sabotage3d

I noticed someone opened an issue on ik_llama.cpp discussing a potential speed difference between llama-sweep-bench and actual use with samplers etc. Might be of interest to you: https://github.com/ikawrakow/ik_llama.cpp/issues/1390

Hi, sorry for the delay. I’m still trying to figure out something very odd. I’m running a new test where the mainline llama.cpp consistently clocks around 29 t/s, while ik_llama sometimes runs at 29, 27, 25, or even 15 t/s on subsequent runs.

The funny thing is that the test itself is a bit broken, but the input and output remain consistent. This is the test if you run it a couple of times on ik_llama: https://gist.github.com/sabotage3d/3f9f1fc544495e1c2c4ec1ee3eadebf5#file-haystrack_stress_weird-py

@sabotage3d

Is llama-server doing any prompt caching? I wonder if you have to explicitly disable that, e.g. -cram 0 --ctx-checkpoints 0 or something (I'm not 100% sure, just spitballing).

Also, heads-up PSA: some of the newly re-uploaded Qwen3.5 quants using the mainline pre-fused up|gate tensor stuff are broken on ik. ik can do the fusion on the fly with -muge; some comparison benchmarks I just did are here: https://github.com/ikawrakow/ik_llama.cpp/pull/1403

I had some more time on my hands today, so I recompiled both ik and mainline llama.cpp from latest and got some good new results. I also wrote a new token-randomized bench that just inputs and outputs a set amount of tokens, so that we get a more accurate result. It's a big win for prefill speed, which is very impressive. I think I will use ik more in the future. I will also try the non-UD models next. I didn't see much difference with -muge.
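
The core idea of a token-randomized bench can be sketched as follows (a minimal illustration, not the actual gist; word count only approximates token count):

```python
import random
import string

# Minimal sketch of a cache-defeating benchmark prompt: every run sends a
# fresh random "haystack", so neither server can reuse a cached prefix and
# prompt-processing timings stay comparable across runs.
def random_haystack(n_words, seed=None):
    rng = random.Random(seed)
    words = [
        "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 8)))
        for _ in range(n_words)
    ]
    return " ".join(words)

prompt = random_haystack(5000, seed=1)
# Two runs with different seeds share no prefix, so no prompt-cache hits.
assert random_haystack(100, seed=1) != random_haystack(100, seed=2)
```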

model: Qwen3.5-27B-UD-Q5_K_XL.gguf

mainline

prompt eval time =    7002.15 ms /  7308 tokens (    0.96 ms per token,  1043.68 tokens per second)
       eval time =   65670.21 ms /  2000 tokens (   32.84 ms per token,    30.46 tokens per second)
      total time =   72672.36 ms /  9308 tokens

ik

prompt eval time =    6166.72 ms /  7308 tokens (    0.84 ms per token,  1185.07 tokens per second)
       eval time =   69941.98 ms /  2000 tokens (   34.97 ms per token,    28.60 tokens per second)
      total time =   76108.70 ms /  9308 tokens

@sabotage3d

Yeah, it's confusing, as many new mainline quants are "pre-fused" and so don't need -muge on ik. I recommend using --merge-qkv -muge with my quants; the fallback is sane if either isn't supported for any other reason.

Glad you're getting more throughput!
