Improving Qwen3 Coder Next 80b performance on ik_llama vs llama.cpp
Hi, as discussed on Reddit,
I tried many configurations to get Qwen3-Coder-Next running faster on ik_llama than on llama.cpp, but without success.
This is the fastest command I've achieved so far with llama.cpp on my hardware configuration. I think with Q8 I was getting even faster prefills.
Configuration for Qwen3-Coder-Next on RTX 3090 + 5950X 64GB DDR4
Peak Performance: ~42 t/s Gen | ~1230 t/s Prompt
~/llama.cpp/build/bin/llama-server \
-m ~/models/Qwen3-Coder-Next-Q3_K_M.gguf \
--host 0.0.0.0 --port 5000 \
--ctx-size 64000 \
--parallel 1 \
-ngl 99 \
--n-cpu-moe 28 \
-t 16 \
-b 4096 \
-ub 2048 \
--flash-attn on \
--mlock \
--no-mmap \
--jinja \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--temp 0.8 \
--min-p 0.05 \
--presence-penalty 1.1 \
--dry-multiplier 0.5 \
--dry-base 1.75 \
--dry-allowed-length 2 \
--dry-penalty-last-n 4096
I also have a simpler configuration for Qwen 3.5 27B. Since it fits fully on the GPU, would there be any improvement with ik_llama at all?
This is the command I am using:
MODEL_PATH="$HOME/models/Qwen3.5_27B/Qwen3.5-27B-UD-Q5_K_XL.gguf"
VISION_PATH="$HOME/models/Qwen3.5_27B/mmproj-BF16.gguf"
# 4. Launch Server
~/llama.cpp/build/bin/llama-server \
-m "$MODEL_PATH" \
--mmproj "$VISION_PATH" \
--host 0.0.0.0 --port 5000 \
--ctx-size 64000 \
--parallel 1 \
--fit on \
--flash-attn on \
--no-mmap \
--jinja \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--presence-penalty 0.0 \
--repeat-penalty 1.0
Heya, super, thanks for dropping in! Let's take a look; there are a few questions in here that I see:
- Running Qwen3-Coder-Next ~3ish BPW quant hybrid CPU+GPU mainline vs ik
I'm downloading my Q4_0 44.355 GiB (4.782 BPW) to run tests locally on my similar rig, an AMD 9950X and a single 3090 Ti FE. I'll get back to you with some commands and llama-sweep-bench results comparing ik and mainline.
> I think on q8 I was getting even faster prefills
Yes, Q8_0, despite being larger, is often faster for PP. This is because PP is typically compute bound, whereas TG is typically memory-bandwidth bound. As such, Q8_0, being one of the first legacy quantization types, has a very fast kernel computation-wise, leading to faster PP performance (but it will give slower TG, as it is bigger). Tradeoffs.
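To make the memory-bandwidth side of that tradeoff concrete, here is a back-of-envelope sketch: TG has to stream (roughly) all active weights once per token, so tokens/s is capped near bandwidth divided by bytes read per token. All figures below are illustrative assumptions, not measurements from this thread.

```python
# Rough ceiling on token generation speed when purely memory-bandwidth
# bound. The numbers are hypothetical placeholders for a DDR4 rig and
# a MoE model with a few GB of active weights per token.

def tg_ceiling(bytes_per_token: float, mem_bw_bytes_s: float) -> float:
    """Upper bound on tokens/s = bandwidth / bytes streamed per token."""
    return mem_bw_bytes_s / bytes_per_token

ddr4_bw  = 50e9   # assumed ~50 GB/s effective dual-channel DDR4
q8_bytes = 3.2e9  # assumed ~3.2 GB of active weights per token at Q8_0
q3_bytes = 1.4e9  # assumed ~1.4 GB of active weights per token at Q3_K_M

print(f"Q8_0  TG ceiling: {tg_ceiling(q8_bytes, ddr4_bw):.1f} t/s")
print(f"Q3_K  TG ceiling: {tg_ceiling(q3_bytes, ddr4_bw):.1f} t/s")
```

The smaller quant roughly doubles the TG ceiling on the same memory system, which is exactly why Q8_0's faster PP kernels come with slower TG.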
> Qwen 3.5 27B. As it fits fully on the GPU would there be any improvement with ik_llama at all?
EDIT: Oops, you are talking about the 27B dense model; however, I ran the 35B MoE below. You can use these same examples to try it yourself now.
I'll run a llama-sweep-bench on my rig as well for this using my Q4_0 19.776 GiB (4.901 BPW) testing full offload.
I know that with 2x GPUs ik_llama.cpp would benefit due to `-sm graph`, but I'm not sure about a single GPU, and we will find out!
Okay, on my 3090 with a Qwen3.5-35B-A3B Q4_0 custom quant, it looks like ik_llama.cpp is faster for full offload in a single-GPU situation:
👈 Details
ik_llama.cpp main@277fc1d2
model=/mnt/astrodata/llm/models/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-Q4_0.gguf
./build/bin/llama-sweep-bench \
--model "$model" \
-ctk q8_0 -ctv q8_0 \
-c 69632 \
-ub 1024 -b 2048 \
--merge-qkv \
-ngl 99 \
--threads 1 \
--warmup-batch \
-n 128
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 1024 | 128 | 0 | 0.230 | 4456.53 | 0.951 | 134.55 |
| 1024 | 128 | 1024 | 0.233 | 4397.70 | 0.939 | 136.35 |
| 1024 | 128 | 2048 | 0.235 | 4361.99 | 0.944 | 135.57 |
| 1024 | 128 | 3072 | 0.237 | 4314.59 | 0.949 | 134.92 |
| 1024 | 128 | 4096 | 0.239 | 4283.01 | 0.952 | 134.45 |
| 1024 | 128 | 5120 | 0.243 | 4221.25 | 0.959 | 133.40 |
| 1024 | 128 | 6144 | 0.247 | 4147.44 | 0.965 | 132.65 |
| 1024 | 128 | 7168 | 0.249 | 4116.70 | 0.974 | 131.41 |
| 1024 | 128 | 8192 | 0.251 | 4084.81 | 0.981 | 130.47 |
| 1024 | 128 | 9216 | 0.254 | 4027.31 | 0.993 | 128.89 |
| 1024 | 128 | 10240 | 0.256 | 3999.14 | 1.002 | 127.69 |
| 1024 | 128 | 11264 | 0.258 | 3967.02 | 1.025 | 124.92 |
| 1024 | 128 | 12288 | 0.261 | 3917.65 | 1.032 | 124.01 |
| 1024 | 128 | 13312 | 0.263 | 3891.82 | 1.036 | 123.54 |
| 1024 | 128 | 14336 | 0.268 | 3825.69 | 1.040 | 123.10 |
| 1024 | 128 | 15360 | 0.270 | 3791.81 | 1.045 | 122.43 |
| 1024 | 128 | 16384 | 0.273 | 3750.57 | 1.051 | 121.80 |
| 1024 | 128 | 17408 | 0.274 | 3736.53 | 1.061 | 120.69 |
| 1024 | 128 | 18432 | 0.276 | 3704.71 | 1.067 | 120.00 |
| 1024 | 128 | 19456 | 0.280 | 3659.21 | 1.073 | 119.24 |
| 1024 | 128 | 20480 | 0.282 | 3634.29 | 1.080 | 118.47 |
| 1024 | 128 | 21504 | 0.283 | 3616.62 | 1.098 | 116.53 |
| 1024 | 128 | 22528 | 0.287 | 3573.82 | 1.107 | 115.61 |
| 1024 | 128 | 23552 | 0.290 | 3537.00 | 1.113 | 115.00 |
| 1024 | 128 | 24576 | 0.292 | 3503.15 | 1.119 | 114.41 |
| 1024 | 128 | 25600 | 0.295 | 3469.56 | 1.125 | 113.82 |
| 1024 | 128 | 26624 | 0.297 | 3450.91 | 1.131 | 113.18 |
| 1024 | 128 | 27648 | 0.301 | 3406.72 | 1.135 | 112.79 |
| 1024 | 128 | 28672 | 0.303 | 3384.64 | 1.142 | 112.08 |
| 1024 | 128 | 29696 | 0.305 | 3358.41 | 1.147 | 111.58 |
| 1024 | 128 | 30720 | 0.309 | 3312.55 | 1.154 | 110.93 |
| 1024 | 128 | 31744 | 0.310 | 3300.11 | 1.162 | 110.17 |
| 1024 | 128 | 32768 | 0.314 | 3264.49 | 1.183 | 108.24 |
| 1024 | 128 | 33792 | 0.316 | 3243.48 | 1.187 | 107.80 |
| 1024 | 128 | 34816 | 0.318 | 3219.86 | 1.191 | 107.44 |
| 1024 | 128 | 35840 | 0.322 | 3184.30 | 1.198 | 106.87 |
| 1024 | 128 | 36864 | 0.325 | 3152.46 | 1.202 | 106.47 |
| 1024 | 128 | 37888 | 0.326 | 3143.22 | 1.209 | 105.89 |
| 1024 | 128 | 38912 | 0.330 | 3107.39 | 1.214 | 105.47 |
| 1024 | 128 | 39936 | 0.333 | 3078.47 | 1.221 | 104.83 |
| 1024 | 128 | 40960 | 0.333 | 3074.35 | 1.224 | 104.59 |
| 1024 | 128 | 41984 | 0.337 | 3042.41 | 1.232 | 103.92 |
| 1024 | 128 | 43008 | 0.340 | 3013.83 | 1.247 | 102.68 |
| 1024 | 128 | 44032 | 0.341 | 3004.07 | 1.260 | 101.58 |
| 1024 | 128 | 45056 | 0.345 | 2968.48 | 1.263 | 101.31 |
| 1024 | 128 | 46080 | 0.346 | 2956.39 | 1.269 | 100.88 |
| 1024 | 128 | 47104 | 0.349 | 2933.70 | 1.275 | 100.42 |
| 1024 | 128 | 48128 | 0.353 | 2904.14 | 1.280 | 100.00 |
| 1024 | 128 | 49152 | 0.354 | 2894.14 | 1.283 | 99.80 |
| 1024 | 128 | 50176 | 0.356 | 2872.86 | 1.291 | 99.13 |
| 1024 | 128 | 51200 | 0.358 | 2859.93 | 1.297 | 98.66 |
| 1024 | 128 | 52224 | 0.363 | 2821.41 | 1.305 | 98.11 |
| 1024 | 128 | 53248 | 0.363 | 2817.62 | 1.309 | 97.75 |
| 1024 | 128 | 54272 | 0.367 | 2786.95 | 1.331 | 96.18 |
| 1024 | 128 | 55296 | 0.370 | 2767.88 | 1.337 | 95.76 |
| 1024 | 128 | 56320 | 0.372 | 2753.92 | 1.342 | 95.40 |
| 1024 | 128 | 57344 | 0.377 | 2719.77 | 1.347 | 95.05 |
| 1024 | 128 | 58368 | 0.376 | 2720.04 | 1.353 | 94.61 |
| 1024 | 128 | 59392 | 0.381 | 2688.69 | 1.358 | 94.29 |
| 1024 | 128 | 60416 | 0.383 | 2671.91 | 1.362 | 93.97 |
| 1024 | 128 | 61440 | 0.385 | 2662.26 | 1.369 | 93.52 |
| 1024 | 128 | 62464 | 0.391 | 2621.37 | 1.374 | 93.15 |
| 1024 | 128 | 63488 | 0.393 | 2605.50 | 1.381 | 92.65 |
| 1024 | 128 | 64512 | 0.394 | 2601.22 | 1.399 | 91.48 |
| 1024 | 128 | 65536 | 0.395 | 2595.04 | 1.407 | 90.94 |
| 1024 | 128 | 66560 | 0.398 | 2575.22 | 1.413 | 90.59 |
| 1024 | 128 | 67584 | 0.399 | 2566.49 | 1.420 | 90.13 |
| 1024 | 128 | 68608 | 0.404 | 2537.74 | 1.425 | 89.83 |
mainline llama.cpp master@e68f2fb8 + ug/port-sweep-bench
model=/mnt/astrodata/llm/models/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-Q4_0.gguf
./build/bin/llama-sweep-bench \
--model "$model" \
-ctk q8_0 -ctv q8_0 \
-c 69632 \
-ub 1024 -b 2048 \
-ngl 99 \
--threads 1 \
-n 128
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 1024 | 128 | 0 | 0.271 | 3782.91 | 1.044 | 122.57 |
| 1024 | 128 | 1024 | 0.275 | 3727.53 | 1.047 | 122.31 |
| 1024 | 128 | 2048 | 0.278 | 3689.74 | 1.053 | 121.59 |
| 1024 | 128 | 3072 | 0.281 | 3648.56 | 1.063 | 120.44 |
| 1024 | 128 | 4096 | 0.281 | 3639.17 | 1.070 | 119.64 |
| 1024 | 128 | 5120 | 0.286 | 3579.62 | 1.077 | 118.88 |
| 1024 | 128 | 6144 | 0.290 | 3525.14 | 1.087 | 117.76 |
| 1024 | 128 | 7168 | 0.294 | 3485.97 | 1.097 | 116.65 |
| 1024 | 128 | 8192 | 0.296 | 3464.54 | 1.107 | 115.59 |
| 1024 | 128 | 9216 | 0.300 | 3417.43 | 1.118 | 114.53 |
| 1024 | 128 | 10240 | 0.302 | 3391.42 | 1.123 | 113.97 |
| 1024 | 128 | 11264 | 0.304 | 3366.24 | 1.134 | 112.87 |
| 1024 | 128 | 12288 | 0.306 | 3343.82 | 1.144 | 111.90 |
| 1024 | 128 | 13312 | 0.310 | 3306.75 | 1.155 | 110.82 |
| 1024 | 128 | 14336 | 0.315 | 3251.99 | 1.162 | 110.11 |
| 1024 | 128 | 15360 | 0.317 | 3231.79 | 1.171 | 109.35 |
| 1024 | 128 | 16384 | 0.320 | 3202.35 | 1.174 | 108.99 |
| 1024 | 128 | 17408 | 0.322 | 3183.65 | 1.181 | 108.37 |
| 1024 | 128 | 18432 | 0.325 | 3152.32 | 1.190 | 107.57 |
| 1024 | 128 | 19456 | 0.329 | 3110.96 | 1.198 | 106.84 |
| 1024 | 128 | 20480 | 0.331 | 3090.42 | 1.203 | 106.39 |
| 1024 | 128 | 21504 | 0.333 | 3076.22 | 1.212 | 105.64 |
| 1024 | 128 | 22528 | 0.337 | 3040.19 | 1.221 | 104.81 |
| 1024 | 128 | 23552 | 0.339 | 3017.76 | 1.227 | 104.35 |
| 1024 | 128 | 24576 | 0.343 | 2983.70 | 1.234 | 103.74 |
| 1024 | 128 | 25600 | 0.346 | 2955.51 | 1.241 | 103.12 |
| 1024 | 128 | 26624 | 0.349 | 2935.42 | 1.247 | 102.67 |
| 1024 | 128 | 27648 | 0.352 | 2906.81 | 1.257 | 101.80 |
| 1024 | 128 | 28672 | 0.355 | 2884.84 | 1.265 | 101.20 |
| 1024 | 128 | 29696 | 0.358 | 2862.09 | 1.272 | 100.60 |
| 1024 | 128 | 30720 | 0.361 | 2833.43 | 1.278 | 100.19 |
| 1024 | 128 | 31744 | 0.364 | 2814.79 | 1.287 | 99.47 |
| 1024 | 128 | 32768 | 0.368 | 2785.09 | 1.294 | 98.88 |
| 1024 | 128 | 33792 | 0.370 | 2765.25 | 1.303 | 98.21 |
| 1024 | 128 | 34816 | 0.372 | 2749.23 | 1.311 | 97.67 |
| 1024 | 128 | 35840 | 0.376 | 2723.60 | 1.318 | 97.10 |
| 1024 | 128 | 36864 | 0.380 | 2694.01 | 1.325 | 96.60 |
| 1024 | 128 | 37888 | 0.383 | 2676.10 | 1.334 | 95.98 |
| 1024 | 128 | 38912 | 0.385 | 2657.59 | 1.341 | 95.48 |
| 1024 | 128 | 39936 | 0.389 | 2633.21 | 1.349 | 94.87 |
| 1024 | 128 | 40960 | 0.392 | 2615.03 | 1.356 | 94.38 |
| 1024 | 128 | 41984 | 0.395 | 2590.21 | 1.364 | 93.83 |
| 1024 | 128 | 43008 | 0.398 | 2575.27 | 1.371 | 93.37 |
| 1024 | 128 | 44032 | 0.400 | 2561.88 | 1.381 | 92.71 |
| 1024 | 128 | 45056 | 0.402 | 2544.60 | 1.387 | 92.29 |
| 1024 | 128 | 46080 | 0.407 | 2518.46 | 1.395 | 91.73 |
| 1024 | 128 | 47104 | 0.411 | 2494.47 | 1.403 | 91.26 |
| 1024 | 128 | 48128 | 0.413 | 2478.29 | 1.410 | 90.76 |
| 1024 | 128 | 49152 | 0.416 | 2459.37 | 1.418 | 90.29 |
| 1024 | 128 | 50176 | 0.419 | 2444.44 | 1.427 | 89.67 |
| 1024 | 128 | 51200 | 0.420 | 2440.13 | 1.433 | 89.31 |
| 1024 | 128 | 52224 | 0.425 | 2410.67 | 1.441 | 88.83 |
| 1024 | 128 | 53248 | 0.426 | 2401.83 | 1.449 | 88.34 |
| 1024 | 128 | 54272 | 0.432 | 2372.34 | 1.456 | 87.91 |
| 1024 | 128 | 55296 | 0.434 | 2360.03 | 1.463 | 87.52 |
| 1024 | 128 | 56320 | 0.437 | 2343.49 | 1.471 | 87.03 |
| 1024 | 128 | 57344 | 0.440 | 2329.36 | 1.478 | 86.60 |
| 1024 | 128 | 58368 | 0.442 | 2314.62 | 1.488 | 86.03 |
| 1024 | 128 | 59392 | 0.445 | 2299.02 | 1.494 | 85.67 |
| 1024 | 128 | 60416 | 0.449 | 2281.04 | 1.504 | 85.09 |
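To quantify the gap between the two tables above without eyeballing row by row, a small parser (a hypothetical helper, assuming the column order printed by llama-sweep-bench: PP, TG, N_KV, T_PP s, S_PP t/s, T_TG s, S_TG t/s) can average the speedups at matching depths:

```python
# Parse llama-sweep-bench markdown tables and average the PP/TG speedup
# of run A over run B at matching N_KV depths.

def parse_sweep(md: str) -> dict:
    """Map N_KV -> (S_PP t/s, S_TG t/s) from a sweep-bench markdown table."""
    rows = {}
    for line in md.strip().splitlines():
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) != 7 or not cells[0].isdigit():
            continue  # skip the header and separator lines
        rows[int(cells[2])] = (float(cells[4]), float(cells[6]))
    return rows

def mean_speedup(a: str, b: str) -> tuple:
    """Average S_PP and S_TG ratios of a over b at shared N_KV depths."""
    ra, rb = parse_sweep(a), parse_sweep(b)
    common = sorted(set(ra) & set(rb))
    pp = sum(ra[k][0] / rb[k][0] for k in common) / len(common)
    tg = sum(ra[k][1] / rb[k][1] for k in common) / len(common)
    return pp, tg

# First rows of the ik and mainline tables above, as a tiny example.
ik = "| 1024 | 128 | 0 | 0.230 | 4456.53 | 0.951 | 134.55 |"
ml = "| 1024 | 128 | 0 | 0.271 | 3782.91 | 1.044 | 122.57 |"
pp, tg = mean_speedup(ik, ml)
print(f"PP speedup {pp:.2f}x, TG speedup {tg:.2f}x")  # → PP speedup 1.18x, TG speedup 1.10x
```

Feed it the full tables and it averages over every shared depth instead of just the empty-cache row.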
These are the instructions for compiling:
ik_llama.cpp
# update
git clone https://github.com/ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
git pull
# compile
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_VULKAN=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_CUDA_F16=ON -DGGML_CCACHE=OFF
cmake --build build --config Release -j $(nproc)
# confirm
./build/bin/llama-server --version
version: 4264 (277fc1d2)
built with cc (GCC) 15.2.1 20251112 for x86_64-pc-linux-gnu
# benchmark
# use commands found in details
mainline llama.cpp
# this includes the llama-sweep-bench patch for mainline
# update
git clone --depth 1 --branch ug/port-sweep-bench https://github.com/ubergarm/llama.cpp.git
cd llama.cpp
# compile
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_VULKAN=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_CUDA_F16=ON -DGGML_CCACHE=OFF
cmake --build build --config Release -j $(nproc)
# confirm
git log --pretty=format:"%h%x09%an%x09%ad%x09%s" | head -n 9
$ ./build/bin/llama-server --version
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
version: 8233 (8eb8c6952)
built with GNU 15.2.1 for Linux x86_64
Okay, I'll do the coder test now that it has finished downloading.
Here are the Qwen3-Coder-Next hybrid CPU+GPU results and commands in the details below. Compiled the same as above. ik is winning over mainline by a small but noticeable margin. Once again mainline borks out at the end, either due to the larger CUDA buffer or possibly something with the sweep-bench port.
So given my rig has avx512_vnni, I did expect to see better PP for ik over mainline, which holds. Not sure about a Zen4 rig without it, but PP is likely still marginally better. mainline did pretty well here for TG, and I'm not 100% sure of the arch differences between Qwen3-Coder-Next and the newer qwen35moe models, but I will do one more test after this for Qwen3.5-35B-A3B CPU-only.
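If you want to check whether your own CPU advertises the avx512_vnni flag mentioned above, a quick Linux-only sketch (it just reads the feature flags from `/proc/cpuinfo`):

```python
import os

def has_cpu_flag(flag: str, cpuinfo_path: str = "/proc/cpuinfo") -> bool:
    """Return True if the given feature flag appears in the CPU flags line."""
    with open(cpuinfo_path) as f:
        for line in f:
            if line.startswith("flags"):
                return flag in line.split()
    return False

# Guarded so this only runs where /proc/cpuinfo exists (Linux).
if os.path.exists("/proc/cpuinfo"):
    print("avx512_vnni:", has_cpu_flag("avx512_vnni"))
```

On a 9950X this should report True; older Zen parts without AVX-512 report False.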
👈 Details
ik_llama.cpp main@277fc1d2
./build/bin/llama-sweep-bench \
--model "$model" \
-ctk q8_0 -ctv q8_0 \
-c 69632 \
-ub 4096 -b 4096 \
--merge-qkv \
-ngl 99 \
--n-cpu-moe 30 \
--threads 16 \
--warmup-batch \
-n 128
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 128 | 0 | 2.289 | 1789.36 | 2.000 | 63.99 |
| 4096 | 128 | 4096 | 2.321 | 1764.40 | 2.037 | 62.85 |
| 4096 | 128 | 8192 | 2.361 | 1735.09 | 2.076 | 61.65 |
| 4096 | 128 | 12288 | 2.394 | 1711.00 | 2.135 | 59.96 |
| 4096 | 128 | 16384 | 2.430 | 1685.57 | 2.158 | 59.31 |
| 4096 | 128 | 20480 | 2.472 | 1656.92 | 2.187 | 58.53 |
| 4096 | 128 | 24576 | 2.508 | 1633.27 | 2.227 | 57.46 |
| 4096 | 128 | 28672 | 2.552 | 1604.83 | 2.269 | 56.41 |
| 4096 | 128 | 32768 | 2.586 | 1583.93 | 2.307 | 55.47 |
| 4096 | 128 | 36864 | 2.635 | 1554.49 | 2.337 | 54.78 |
| 4096 | 128 | 40960 | 2.681 | 1528.06 | 2.368 | 54.07 |
| 4096 | 128 | 45056 | 2.716 | 1508.21 | 2.413 | 53.05 |
| 4096 | 128 | 49152 | 2.749 | 1489.91 | 2.433 | 52.61 |
| 4096 | 128 | 53248 | 2.792 | 1467.27 | 2.479 | 51.63 |
| 4096 | 128 | 57344 | 2.842 | 1441.40 | 2.515 | 50.90 |
| 4096 | 128 | 61440 | 2.884 | 1420.47 | 2.540 | 50.40 |
| 4096 | 128 | 65536 | 2.932 | 1397.11 | 2.590 | 49.43 |
mainline llama.cpp master@e68f2fb8 + ug/port-sweep-bench
model=/mnt/astrodata/llm/models/ubergarm/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-Q4_0.gguf
./build/bin/llama-sweep-bench \
--model "$model" \
-ctk q8_0 -ctv q8_0 \
-c 69632 \
-ub 4096 -b 4096 \
-ngl 99 \
--n-cpu-moe 30 \
--threads 16 \
-n 128
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 128 | 0 | 2.567 | 1595.93 | 2.212 | 57.86 |
| 4096 | 128 | 4096 | 2.602 | 1573.95 | 2.232 | 57.35 |
| 4096 | 128 | 8192 | 2.645 | 1548.56 | 2.264 | 56.53 |
| 4096 | 128 | 12288 | 2.686 | 1525.13 | 2.312 | 55.37 |
| 4096 | 128 | 16384 | 2.726 | 1502.79 | 2.346 | 54.56 |
| 4096 | 128 | 20480 | 2.768 | 1479.84 | 2.407 | 53.18 |
| 4096 | 128 | 24576 | 2.807 | 1459.18 | 2.434 | 52.58 |
| 4096 | 128 | 28672 | 2.860 | 1432.29 | 2.469 | 51.84 |
| 4096 | 128 | 32768 | 2.898 | 1413.55 | 2.505 | 51.11 |
| 4096 | 128 | 36864 | 2.945 | 1391.06 | 2.530 | 50.59 |
| 4096 | 128 | 40960 | 2.987 | 1371.13 | 2.564 | 49.93 |
| 4096 | 128 | 45056 | 3.029 | 1352.17 | 2.595 | 49.33 |
| 4096 | 128 | 49152 | 3.066 | 1336.07 | 2.626 | 48.75 |
| 4096 | 128 | 53248 | 3.109 | 1317.44 | 2.664 | 48.05 |
| 4096 | 128 | 57344 | 3.160 | 1296.19 | 2.704 | 47.34 |
| 4096 | 128 | 61440 | 3.199 | 1280.39 | 2.721 | 47.04 |
Okay, last one: CPU-only for Qwen3.5-35B-A3B, showing ik is doing a lot better with the gated delta net CPU implementation. Compiled the same as above but with CUDA off, so it is the CPU-only backend.
👈 Details
ik_llama.cpp main@277fc1d2
./build/bin/llama-sweep-bench \
--model "$model" \
-ctk q8_0 -ctv q8_0 \
-c 69632 \
-ub 1024 -b 2048 \
--merge-qkv \
--threads 16 \
--warmup-batch \
-n 128
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 1024 | 128 | 0 | 1.296 | 790.34 | 5.271 | 24.28 |
| 1024 | 128 | 1024 | 1.358 | 754.32 | 5.141 | 24.90 |
| 1024 | 128 | 2048 | 1.401 | 730.66 | 5.149 | 24.86 |
| 1024 | 128 | 3072 | 1.450 | 706.21 | 5.194 | 24.65 |
| 1024 | 128 | 4096 | 1.493 | 685.70 | 5.221 | 24.52 |
| 1024 | 128 | 5120 | 1.530 | 669.12 | 5.240 | 24.43 |
| 1024 | 128 | 6144 | 1.575 | 650.02 | 5.253 | 24.37 |
| 1024 | 128 | 7168 | 1.599 | 640.30 | 5.287 | 24.21 |
| 1024 | 128 | 8192 | 1.643 | 623.10 | 5.281 | 24.24 |
| 1024 | 128 | 9216 | 1.681 | 609.02 | 5.302 | 24.14 |
| 1024 | 128 | 10240 | 1.792 | 571.54 | 5.332 | 24.01 |
| 1024 | 128 | 11264 | 1.763 | 580.93 | 5.335 | 23.99 |
| 1024 | 128 | 12288 | 1.806 | 566.90 | 5.369 | 23.84 |
| 1024 | 128 | 13312 | 1.847 | 554.44 | 5.397 | 23.72 |
| 1024 | 128 | 14336 | 1.885 | 543.22 | 5.402 | 23.69 |
| 1024 | 128 | 15360 | 1.929 | 530.94 | 5.431 | 23.57 |
| 1024 | 128 | 16384 | 1.980 | 517.04 | 5.440 | 23.53 |
| 1024 | 128 | 17408 | 2.062 | 496.52 | 5.496 | 23.29 |
| 1024 | 128 | 18432 | 2.060 | 497.02 | 5.511 | 23.22 |
| 1024 | 128 | 19456 | 2.087 | 490.75 | 5.568 | 22.99 |
| 1024 | 128 | 20480 | 2.141 | 478.34 | 5.645 | 22.67 |
| 1024 | 128 | 21504 | 2.160 | 474.08 | 5.627 | 22.75 |
| 1024 | 128 | 22528 | 2.224 | 460.40 | 5.634 | 22.72 |
| 1024 | 128 | 23552 | 2.258 | 453.41 | 5.689 | 22.50 |
| 1024 | 128 | 24576 | 2.422 | 422.76 | 5.692 | 22.49 |
| 1024 | 128 | 25600 | 2.327 | 440.10 | 5.683 | 22.52 |
| 1024 | 128 | 26624 | 2.367 | 432.61 | 5.762 | 22.22 |
| 1024 | 128 | 27648 | 2.410 | 424.86 | 5.788 | 22.12 |
| 1024 | 128 | 28672 | 2.444 | 419.04 | 5.817 | 22.00 |
| 1024 | 128 | 29696 | 2.501 | 409.44 | 5.843 | 21.91 |
| 1024 | 128 | 30720 | 2.581 | 396.74 | 5.853 | 21.87 |
| 1024 | 128 | 31744 | 2.610 | 392.41 | 5.853 | 21.87 |
| 1024 | 128 | 32768 | 2.669 | 383.70 | 5.842 | 21.91 |
mainline llama.cpp master@e68f2fb8 + ug/port-sweep-bench
./build/bin/llama-sweep-bench \
--model "$model" \
-ctk q8_0 -ctv q8_0 \
-c 69632 \
-ub 1024 -b 2048 \
--threads 16 \
-n 128
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 1024 | 128 | 0 | 6.153 | 166.41 | 8.284 | 15.45 |
| 1024 | 128 | 1024 | 7.336 | 139.59 | 8.140 | 15.72 |
| 1024 | 128 | 2048 | 7.958 | 128.68 | 8.312 | 15.40 |
| 1024 | 128 | 3072 | 9.048 | 113.17 | 8.603 | 14.88 |
| 1024 | 128 | 4096 | 10.045 | 101.94 | 8.353 | 15.32 |
| 1024 | 128 | 5120 | 10.969 | 93.35 | 8.858 | 14.45 |
| 1024 | 128 | 6144 | 11.710 | 87.45 | 8.739 | 14.65 |
| 1024 | 128 | 7168 | 12.552 | 81.58 | 8.697 | 14.72 |
| 1024 | 128 | 8192 | 13.479 | 75.97 | 9.226 | 13.87 |
| 1024 | 128 | 9216 | 14.620 | 70.04 | 9.239 | 13.85 |
| 1024 | 128 | 10240 | 15.000 | 68.27 | 9.135 | 14.01 |
| 1024 | 128 | 11264 | 16.088 | 63.65 | 9.467 | 13.52 |
| 1024 | 128 | 12288 | 16.675 | 61.41 | 9.158 | 13.98 |
| 1024 | 128 | 13312 | 17.236 | 59.41 | 9.696 | 13.20 |
| 1024 | 128 | 14336 | 18.576 | 55.12 | 9.524 | 13.44 |
| 1024 | 128 | 15360 | 19.520 | 52.46 | 9.855 | 12.99 |
| 1024 | 128 | 16384 | 19.817 | 51.67 | 9.231 | 13.87 |
| 1024 | 128 | 17408 | 19.869 | 51.54 | 9.579 | 13.36 |
| 1024 | 128 | 18432 | 21.962 | 46.63 | 10.553 | 12.13 |
| 1024 | 128 | 19456 | 22.715 | 45.08 | 10.453 | 12.25 |
| 1024 | 128 | 20480 | 23.965 | 42.73 | 10.579 | 12.10 |
| 1024 | 128 | 21504 | 24.021 | 42.63 | 9.781 | 13.09 |
| 1024 | 128 | 22528 | 24.344 | 42.06 | 10.275 | 12.46 |
| 1024 | 128 | 23552 | 24.429 | 41.92 | 10.475 | 12.22 |
| 1024 | 128 | 24576 | 27.235 | 37.60 | 10.987 | 11.65 |
| 1024 | 128 | 25600 | 26.745 | 38.29 | 10.248 | 12.49 |
| 1024 | 128 | 26624 | 27.459 | 37.29 | 10.219 | 12.53 |
| 1024 | 128 | 27648 | 29.335 | 34.91 | 10.275 | 12.46 |
| 1024 | 128 | 28672 | 31.157 | 32.87 | 10.270 | 12.46 |
| 1024 | 128 | 29696 | 32.972 | 31.06 | 12.251 | 10.45 |
| 1024 | 128 | 30720 | 33.706 | 30.38 | 10.442 | 12.26 |
| 1024 | 128 | 31744 | 35.189 | 29.10 | 12.012 | 10.66 |
| 1024 | 128 | 32768 | 34.935 | 29.31 | 12.053 | 10.62 |
Okay, it seems like mainline llama.cpp has a PR improving the chunked delta net CPU implementation that is getting close to merged; I did a 3-way benchmark there if you're interested: https://github.com/ggml-org/llama.cpp/pull/19504#issuecomment-4013706238
Thank you for the detailed benchmarks. As I don't trust synthetic benches, I am using a simple haystack stress test: https://gist.github.com/sabotage3d/3f9f1fc544495e1c2c4ec1ee3eadebf5
I compiled the main branch of ik_llama.cpp; this is the output using the command below:
MODEL_PATH="$HOME/models/Qwen3.5_27B/Qwen3.5-27B-UD-Q5_K_XL.gguf"
VISION_PATH="$HOME/models/Qwen3.5_27B/mmproj-BF16.gguf"
# 4. Launch Server
~/ik_llama.cpp/build/bin/llama-server \
-m "$MODEL_PATH" \
--mmproj "$VISION_PATH" \
--host 0.0.0.0 --port 5000 \
--ctx-size 64000 \
--parallel 1 \
-ub 1024 -b 2048 \
--merge-qkv \
-ngl 99 \
--threads 16 \
--warmup-batch \
-n 128 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--presence-penalty 0.0 \
--repeat-penalty 1.0
======== Prompt cache: cache size: 0, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
INFO [ launch_slot_with_task] slot is processing task | tid="140690707640320" timestamp=1772834996 id_slot=0 id_task=0
======== Cache: cache_size = 0, n_past0 = 0, n_past1 = 0, n_past_prompt1 = 0, n_past2 = 0, n_past_prompt2 = 0
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772834996 id_slot=0 id_task=0 p0=0
slot create_check: id 0 | task 0 | created context checkpoint 1 of 8 (pos_min = 2047, pos_max = 2047, size = 149.643 MiB, took 1646.78 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772834998 id_slot=0 id_task=0 p0=2048
slot create_check: id 0 | task 0 | created context checkpoint 2 of 8 (pos_min = 4095, pos_max = 4095, size = 149.659 MiB, took 1555.92 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772834999 id_slot=0 id_task=0 p0=4096
slot create_check: id 0 | task 0 | created context checkpoint 3 of 8 (pos_min = 6143, pos_max = 6143, size = 149.674 MiB, took 1585.55 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835001 id_slot=0 id_task=0 p0=6144
slot create_check: id 0 | task 0 | created context checkpoint 4 of 8 (pos_min = 8191, pos_max = 8191, size = 149.690 MiB, took 633.89 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835003 id_slot=0 id_task=0 p0=8192
slot create_check: id 0 | task 0 | created context checkpoint 5 of 8 (pos_min = 10239, pos_max = 10239, size = 149.706 MiB, took 683.44 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835004 id_slot=0 id_task=0 p0=10240
slot create_check: id 0 | task 0 | created context checkpoint 6 of 8 (pos_min = 12287, pos_max = 12287, size = 149.721 MiB, took 688.13 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835006 id_slot=0 id_task=0 p0=12288
slot create_check: id 0 | task 0 | created context checkpoint 7 of 8 (pos_min = 14335, pos_max = 14335, size = 149.737 MiB, took 667.95 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835008 id_slot=0 id_task=0 p0=14336
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 16383, pos_max = 16383, size = 149.752 MiB, took 693.84 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835010 id_slot=0 id_task=0 p0=16384
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 2047, pos_max = 2047, size = 149.643 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 18431, pos_max = 18431, size = 149.768 MiB, took 661.93 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835012 id_slot=0 id_task=0 p0=18432
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 4095, pos_max = 4095, size = 149.659 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 20479, pos_max = 20479, size = 149.784 MiB, took 660.57 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835013 id_slot=0 id_task=0 p0=20480
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 6143, pos_max = 6143, size = 149.674 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 22527, pos_max = 22527, size = 149.799 MiB, took 671.89 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835015 id_slot=0 id_task=0 p0=22528
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 8191, pos_max = 8191, size = 149.690 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 24575, pos_max = 24575, size = 149.815 MiB, took 682.74 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835017 id_slot=0 id_task=0 p0=24576
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 10239, pos_max = 10239, size = 149.706 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 26623, pos_max = 26623, size = 149.831 MiB, took 696.47 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835019 id_slot=0 id_task=0 p0=26624
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 12287, pos_max = 12287, size = 149.721 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 28671, pos_max = 28671, size = 149.846 MiB, took 700.38 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835021 id_slot=0 id_task=0 p0=28672
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 14335, pos_max = 14335, size = 149.737 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 30719, pos_max = 30719, size = 149.862 MiB, took 786.73 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835023 id_slot=0 id_task=0 p0=30720
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 16383, pos_max = 16383, size = 149.752 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 32767, pos_max = 32767, size = 149.877 MiB, took 722.32 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835025 id_slot=0 id_task=0 p0=32768
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 18431, pos_max = 18431, size = 149.768 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 34815, pos_max = 34815, size = 149.893 MiB, took 737.62 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835027 id_slot=0 id_task=0 p0=34816
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 20479, pos_max = 20479, size = 149.784 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 36863, pos_max = 36863, size = 149.909 MiB, took 738.45 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835029 id_slot=0 id_task=0 p0=36864
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 22527, pos_max = 22527, size = 149.799 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 38911, pos_max = 38911, size = 149.924 MiB, took 747.90 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835031 id_slot=0 id_task=0 p0=38912
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 24575, pos_max = 24575, size = 149.815 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 40436, pos_max = 40436, size = 149.936 MiB, took 1520.62 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140690707640320" timestamp=1772835032 id_slot=0 id_task=0 p0=40437
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 26623, pos_max = 26623, size = 149.831 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 40442, pos_max = 40442, size = 149.936 MiB, took 37.77 ms)
slot print_timing: id 0 | task 0 |
prompt eval time = 36534.43 ms / 40442 tokens ( 0.90 ms per token, 1106.96 tokens per second)
eval time = 5436.31 ms / 50 tokens ( 108.73 ms per token, 9.20 tokens per second)
total time = 41970.74 ms / 40492 tokens
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 28671, pos_max = 28671, size = 149.846 MiB)
INFO [ log_server_request] request | tid="140680000237568" timestamp=1772835038 remote_addr="127.0.0.1" remote_port=60706 status=200 method="POST" path="/v1/chat/completions" params={}
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 40490, pos_max = 40490, size = 149.936 MiB, took 29.89 ms)
INFO [ release_slots] slot released | tid="140690707640320" timestamp=1772835038 id_slot=0 id_task=0 n_ctx=64000 n_past=40491 n_system_tokens=0 n_cache_tokens=40491 truncated=false
INFO [ slots_idle] all slots are idle | tid="140690707640320" timestamp=1772835038
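As a sanity check, the t/s figures in the `print_timing` block above follow directly from tokens over elapsed time:

```python
# Reproduce the server's throughput figures from its raw timing fields.

def tokens_per_second(n_tokens: int, elapsed_ms: float) -> float:
    """Throughput in tokens/s given a token count and elapsed milliseconds."""
    return n_tokens / (elapsed_ms / 1000.0)

# Values copied from the prompt-eval and eval lines in the log above.
print(f"prompt eval: {tokens_per_second(40442, 36534.43):.2f} t/s")  # ≈ 1106.96
print(f"eval:        {tokens_per_second(50, 5436.31):.2f} t/s")      # ≈ 9.20
```

So the ~1107 t/s prefill here is in the same ballpark as the ~1230 t/s peak reported for the Coder setup at the top of the thread, while the 9.2 t/s generation is far below it; only 50 tokens were generated, so treat that figure cautiously.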
This is the same stress test on the llama.cpp main branch, with the command and log:
MODEL_PATH="$HOME/models/Qwen3.5_27B/Qwen3.5-27B-UD-Q5_K_XL.gguf"
VISION_PATH="$HOME/models/Qwen3.5_27B/mmproj-BF16.gguf"
# 4. Launch Server
~/llama.cpp/build/bin/llama-server \
-m "$MODEL_PATH" \
--mmproj "$VISION_PATH" \
--host 0.0.0.0 --port 5000 \
--ctx-size 64000 \
--parallel 1 \
--fit on \
--flash-attn on \
--no-mmap \
--jinja \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--presence-penalty 0.0 \
--repeat-penalty 1.0
main: model loaded
main: server is listening on http://0.0.0.0:5000
main: starting the main loop...
srv update_slots: all slots are idle
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 64000, n_keep = 0, task.n_tokens = 40444
slot update_slots: id 0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.050638
slot update_slots: id 0 | task 0 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 2048, progress = 0.101276
slot update_slots: id 0 | task 0 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 6144, batch.n_tokens = 2048, progress = 0.151914
slot update_slots: id 0 | task 0 | n_tokens = 6144, memory_seq_rm [6144, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 2048, progress = 0.202552
slot update_slots: id 0 | task 0 | n_tokens = 8192, memory_seq_rm [8192, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 10240, batch.n_tokens = 2048, progress = 0.253190
slot update_slots: id 0 | task 0 | n_tokens = 10240, memory_seq_rm [10240, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 12288, batch.n_tokens = 2048, progress = 0.303828
slot update_slots: id 0 | task 0 | n_tokens = 12288, memory_seq_rm [12288, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 14336, batch.n_tokens = 2048, progress = 0.354465
slot update_slots: id 0 | task 0 | n_tokens = 14336, memory_seq_rm [14336, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 16384, batch.n_tokens = 2048, progress = 0.405103
slot update_slots: id 0 | task 0 | n_tokens = 16384, memory_seq_rm [16384, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 18432, batch.n_tokens = 2048, progress = 0.455741
slot update_slots: id 0 | task 0 | n_tokens = 18432, memory_seq_rm [18432, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 20480, batch.n_tokens = 2048, progress = 0.506379
slot update_slots: id 0 | task 0 | n_tokens = 20480, memory_seq_rm [20480, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 22528, batch.n_tokens = 2048, progress = 0.557017
slot update_slots: id 0 | task 0 | n_tokens = 22528, memory_seq_rm [22528, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 24576, batch.n_tokens = 2048, progress = 0.607655
slot update_slots: id 0 | task 0 | n_tokens = 24576, memory_seq_rm [24576, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 26624, batch.n_tokens = 2048, progress = 0.658293
slot update_slots: id 0 | task 0 | n_tokens = 26624, memory_seq_rm [26624, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 28672, batch.n_tokens = 2048, progress = 0.708931
slot update_slots: id 0 | task 0 | n_tokens = 28672, memory_seq_rm [28672, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 30720, batch.n_tokens = 2048, progress = 0.759569
slot update_slots: id 0 | task 0 | n_tokens = 30720, memory_seq_rm [30720, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 32768, batch.n_tokens = 2048, progress = 0.810207
slot update_slots: id 0 | task 0 | n_tokens = 32768, memory_seq_rm [32768, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 34816, batch.n_tokens = 2048, progress = 0.860845
slot update_slots: id 0 | task 0 | n_tokens = 34816, memory_seq_rm [34816, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 36864, batch.n_tokens = 2048, progress = 0.911483
slot update_slots: id 0 | task 0 | n_tokens = 36864, memory_seq_rm [36864, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 38912, batch.n_tokens = 2048, progress = 0.962120
slot update_slots: id 0 | task 0 | n_tokens = 38912, memory_seq_rm [38912, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 39932, batch.n_tokens = 1020, progress = 0.987341
slot update_slots: id 0 | task 0 | n_tokens = 39932, memory_seq_rm [39932, end)
slot init_sampler: id 0 | task 0 | init sampler, took 3.66 ms, tokens: text = 40444, total = 40444
slot update_slots: id 0 | task 0 | created context checkpoint 1 of 8 (pos_min = 39931, pos_max = 39931, n_tokens = 39932, size = 149.626 MiB)
slot update_slots: id 0 | task 0 | prompt processing done, n_tokens = 40444, batch.n_tokens = 512
slot print_timing: id 0 | task 0 |
prompt eval time = 41380.08 ms / 40444 tokens ( 1.02 ms per token, 977.38 tokens per second)
eval time = 1862.34 ms / 50 tokens ( 37.25 ms per token, 26.85 tokens per second)
total time = 43242.42 ms / 40494 tokens
slot release: id 0 | task 0 | stop processing: n_tokens = 40493, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
This is ik_llama.cpp on Qwen 3 Coder Next 80b using my stress test.
~/ik_llama.cpp/build/bin/llama-server \
-m ~/models/Qwen3-Coder-Next-Q3_K_M.gguf \
--host 0.0.0.0 --port 5000 \
--ctx-size 64000 \
--parallel 1 \
-ngl 99 \
--n-cpu-moe 28 \
-t 16 \
-b 4096 \
-ub 4096 \
--merge-qkv \
--mlock \
--no-mmap \
--warmup-batch \
-n 128 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--temp 0.8 \
--min-p 0.05 \
--presence-penalty 1.1 \
--dry-multiplier 0.5 \
--dry-base 1.75 \
--dry-allowed-length 2 \
--dry-penalty-last-n 4096 \
======== Prompt cache: cache size: 0, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
INFO [ launch_slot_with_task] slot is processing task | tid="139703933071360" timestamp=1772836818 id_slot=0 id_task=0
======== Cache: cache_size = 0, n_past0 = 0, n_past1 = 0, n_past_prompt1 = 0, n_past2 = 0, n_past_prompt2 = 0
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836818 id_slot=0 id_task=0 p0=0
slot create_check: id 0 | task 0 | created context checkpoint 1 of 8 (pos_min = 4095, pos_max = 4095, size = 75.408 MiB, took 501.29 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836821 id_slot=0 id_task=0 p0=4096
slot create_check: id 0 | task 0 | created context checkpoint 2 of 8 (pos_min = 8191, pos_max = 8191, size = 75.439 MiB, took 521.09 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836823 id_slot=0 id_task=0 p0=8192
slot create_check: id 0 | task 0 | created context checkpoint 3 of 8 (pos_min = 12287, pos_max = 12287, size = 75.471 MiB, took 547.12 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836826 id_slot=0 id_task=0 p0=12288
slot create_check: id 0 | task 0 | created context checkpoint 4 of 8 (pos_min = 16383, pos_max = 16383, size = 75.502 MiB, took 567.33 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836828 id_slot=0 id_task=0 p0=16384
slot create_check: id 0 | task 0 | created context checkpoint 5 of 8 (pos_min = 20479, pos_max = 20479, size = 75.533 MiB, took 580.02 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836830 id_slot=0 id_task=0 p0=20480
slot create_check: id 0 | task 0 | created context checkpoint 6 of 8 (pos_min = 24575, pos_max = 24575, size = 75.564 MiB, took 609.27 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836833 id_slot=0 id_task=0 p0=24576
slot create_check: id 0 | task 0 | created context checkpoint 7 of 8 (pos_min = 28671, pos_max = 28671, size = 75.596 MiB, took 638.78 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836836 id_slot=0 id_task=0 p0=28672
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 32767, pos_max = 32767, size = 75.627 MiB, took 651.56 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836838 id_slot=0 id_task=0 p0=32768
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 4095, pos_max = 4095, size = 75.408 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 36863, pos_max = 36863, size = 75.658 MiB, took 662.48 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836841 id_slot=0 id_task=0 p0=36864
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 8191, pos_max = 8191, size = 75.439 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 40436, pos_max = 40436, size = 75.685 MiB, took 616.27 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139703933071360" timestamp=1772836843 id_slot=0 id_task=0 p0=40437
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 12287, pos_max = 12287, size = 75.471 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 40442, pos_max = 40442, size = 75.685 MiB, took 31.51 ms)
slot print_timing: id 0 | task 0 |
prompt eval time = 25380.13 ms / 40442 tokens ( 0.63 ms per token, 1593.45 tokens per second)
eval time = 1015.06 ms / 50 tokens ( 20.30 ms per token, 49.26 tokens per second)
total time = 26395.19 ms / 40492 tokens
INFO [ log_server_request] request | tid="139658447953920" timestamp=1772836845 remote_addr="127.0.0.1" remote_port=38406 status=200 method="POST" path="/v1/chat/completions" params={}
slot create_check: id 0 | task 0 | erasing old context checkpoint (pos_min = 16383, pos_max = 16383, size = 75.502 MiB)
slot create_check: id 0 | task 0 | created context checkpoint 8 of 8 (pos_min = 40490, pos_max = 40490, size = 75.686 MiB, took 30.76 ms)
INFO [ release_slots] slot released | tid="139703933071360" timestamp=1772836845 id_slot=0 id_task=0 n_ctx=64000 n_past=40491 n_system_tokens=0 n_cache_tokens=40491 truncated=false
INFO [ slots_idle] all slots are idle | tid="139703933071360" timestamp=1772836845
This is llama.cpp on Qwen 3 Coder Next 80b using my stress test.
~/llama.cpp/build/bin/llama-server \
-m ~/models/Qwen3-Coder-Next-Q3_K_M.gguf \
--host 0.0.0.0 --port 5000 \
--ctx-size 64000 \
--parallel 1 \
-ngl 99 \
--n-cpu-moe 28 \
-t 16 \
-b 4096 \
-ub 2048 \
--flash-attn on \
--mlock \
--no-mmap \
--jinja \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--temp 0.8 \
--min-p 0.05 \
--presence-penalty 1.1 \
--dry-multiplier 0.5 \
--dry-base 1.75 \
--dry-allowed-length 2 \
--dry-penalty-last-n 4096 \
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> penalties -> dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 64000, n_keep = 0, task.n_tokens = 40442
slot update_slots: id 0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 4096, progress = 0.101281
slot update_slots: id 0 | task 0 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 4096, progress = 0.202562
slot update_slots: id 0 | task 0 | n_tokens = 8192, memory_seq_rm [8192, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 12288, batch.n_tokens = 4096, progress = 0.303843
slot update_slots: id 0 | task 0 | n_tokens = 12288, memory_seq_rm [12288, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 16384, batch.n_tokens = 4096, progress = 0.405123
slot update_slots: id 0 | task 0 | n_tokens = 16384, memory_seq_rm [16384, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 20480, batch.n_tokens = 4096, progress = 0.506404
slot update_slots: id 0 | task 0 | n_tokens = 20480, memory_seq_rm [20480, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 24576, batch.n_tokens = 4096, progress = 0.607685
slot update_slots: id 0 | task 0 | n_tokens = 24576, memory_seq_rm [24576, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 28672, batch.n_tokens = 4096, progress = 0.708966
slot update_slots: id 0 | task 0 | n_tokens = 28672, memory_seq_rm [28672, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 32768, batch.n_tokens = 4096, progress = 0.810247
slot update_slots: id 0 | task 0 | n_tokens = 32768, memory_seq_rm [32768, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 36864, batch.n_tokens = 4096, progress = 0.911528
slot update_slots: id 0 | task 0 | n_tokens = 36864, memory_seq_rm [36864, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 39930, batch.n_tokens = 3066, progress = 0.987340
slot update_slots: id 0 | task 0 | n_tokens = 39930, memory_seq_rm [39930, end)
slot init_sampler: id 0 | task 0 | init sampler, took 4.46 ms, tokens: text = 40442, total = 40442
slot update_slots: id 0 | task 0 | created context checkpoint 1 of 8 (pos_min = 39929, pos_max = 39929, n_tokens = 39930, size = 75.376 MiB)
slot update_slots: id 0 | task 0 | prompt processing done, n_tokens = 40442, batch.n_tokens = 512
slot print_timing: id 0 | task 0 |
prompt eval time = 28284.04 ms / 40442 tokens ( 0.70 ms per token, 1429.85 tokens per second)
eval time = 1218.06 ms / 50 tokens ( 24.36 ms per token, 41.05 tokens per second)
total time = 29502.10 ms / 40492 tokens
slot release: id 0 | task 0 | stop processing: n_tokens = 40491, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
Based on my tests, Qwen 3 Coder Next gains about 10 t/s on ik_llama, but a dense model like Qwen 3.5 27B gets very bad performance on ik_llama. Is there anything missing in my flags that could improve that?
Thanks for sharing the full logs and your test script. Let me take a look; it seems like you are testing two quants, each on ik vs mainline:
Qwen3.5-27B-UD-Q5_K_XL
ik
prompt eval time = 36534.43 ms / 40442 tokens ( 0.90 ms per token, 1106.96 tokens per second)
eval time = 5436.31 ms / 50 tokens ( 108.73 ms per token, 9.20 tokens per second)
total time = 41970.74 ms / 40492 tokens
mainline
prompt eval time = 41380.08 ms / 40444 tokens ( 1.02 ms per token, 977.38 tokens per second)
eval time = 1862.34 ms / 50 tokens ( 37.25 ms per token, 26.85 tokens per second)
total time = 43242.42 ms / 40494 tokens
Qwen3-Coder-Next-Q3_K_M.gguf
ik
prompt eval time = 25380.13 ms / 40442 tokens ( 0.63 ms per token, 1593.45 tokens per second)
eval time = 1015.06 ms / 50 tokens ( 20.30 ms per token, 49.26 tokens per second)
total time = 26395.19 ms / 40492 tokens
mainline
prompt eval time = 28284.04 ms / 40442 tokens ( 0.70 ms per token, 1429.85 tokens per second)
eval time = 1218.06 ms / 50 tokens ( 24.36 ms per token, 41.05 tokens per second)
total time = 29502.10 ms / 40492 tokens
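As a sanity check on the numbers above, the tokens-per-second figures follow directly from the raw timings llama-server prints. A quick sketch (the function name is mine, just for illustration):

```python
def tokens_per_second(time_ms: float, n_tokens: int) -> float:
    """Convert a total elapsed time in milliseconds and a token count
    into the tokens/second figure llama-server prints."""
    return n_tokens / (time_ms / 1000.0)

# ik prompt eval: 25380.13 ms over 40442 tokens -> ~1593.45 t/s
print(round(tokens_per_second(25380.13, 40442), 2))
# mainline eval: 1218.06 ms over 50 tokens -> ~41.05 t/s
print(round(tokens_per_second(1218.06, 50), 2))
```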
Firstly, both ik and I recommend against using UD quants that downcast bf16 to f16, as described here: https://github.com/ikawrakow/ik_llama.cpp/?tab=readme-ov-file#tldr . Unless you check every value in every tensor, the downcast can clip values, despite the potential speed benefit. Honestly, I don't use 16-bit tensors in any of my quants: q8_0 is plenty good, doesn't have the clipping issue, and is half the size.
But specific to the tests, you say:
Qwen 3.5 27B very bad performance on ik_llama, anything that I am missing in my flags that can improve that?
From what I see, ik has faster prompt processing than mainline in both cases.
Given you are only generating 50 tokens of output, that is not enough, imo, to draw a clear distinction between the two. Perhaps you could update your test to generate a longer output?
The llama-sweep-bench, while "synthetic", still exercises the underlying kernel implementations and is representative of potential speeds.
But yes, all the other arguments, caching, and such do come into play in actual use.
I noticed someone opened an issue on ik_llama.cpp discussing potential speed differences between llama-sweep-bench and actual use with samplers etc. It might be of interest to you: https://github.com/ikawrakow/ik_llama.cpp/issues/1390
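If you want to compare the raw kernels yourself, a llama-sweep-bench run might look something like this. The exact flag set is an assumption on my part (it mostly mirrors the shared llama args your server commands already use), so verify against --help on your build:

```shell
# Hypothetical sweep over context depths, reusing the offload settings
# from the server commands above. Check flag names with --help first.
~/ik_llama.cpp/build/bin/llama-sweep-bench \
  -m ~/models/Qwen3-Coder-Next-Q3_K_M.gguf \
  -c 32768 \
  -ngl 99 \
  --n-cpu-moe 28 \
  -t 16 \
  -b 4096 -ub 4096 \
  -fa
```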
Hi, sorry for the delay. I’m still trying to figure out something very odd. I’m running a new test where the mainline llama.cpp consistently clocks around 29 t/s, while ik_llama sometimes runs at 29, 27, 25, or even 15 t/s on subsequent runs.
The funny thing is that the test itself is a bit broken, but the input and output remain consistent. This is the test if you run it a couple of times on ik_llama: https://gist.github.com/sabotage3d/3f9f1fc544495e1c2c4ec1ee3eadebf5#file-haystrack_stress_weird-py
Is the llama-server doing any prompt caching? I wonder if you have to explicitly disable that, e.g. -cram 0 --ctx-checkpoints 0 or something (I'm not 100% sure, just spitballing).
Also, a heads-up PSA: some of the newly re-uploaded Qwen3.5 quants that use mainline's pre-fused up|gate tensors are broken on ik. ik can do the fusion on the fly with -muge; there are some comparison benchmarks I just did here: https://github.com/ikawrakow/ik_llama.cpp/pull/1403
I had some more time on my hands today, so I recompiled both ik and mainline llama.cpp from latest and got some good new results. I also wrote a new randomized-token bench that feeds in and generates a fixed number of tokens, so we get a more accurate result. It's a big win for prefill speed, which is very impressive; I think I will use ik more in the future. I will also try the non-UD models next. I didn't see much difference with -muge.
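The randomized-token bench idea can be sketched like this. This is a hypothetical minimal version of the approach, not the actual gist linked above: build a seeded random prompt of a fixed word count so the input size stays constant across runs while each run defeats prompt caching, then send it with a fixed max_tokens.

```python
import random

def make_random_prompt(n_words: int, seed: int = 0) -> str:
    """Build a randomized filler prompt of exactly n_words words so the
    input size is constant across runs but the content varies by seed,
    defeating prompt caching."""
    rng = random.Random(seed)
    vocab = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot",
             "golf", "hotel", "india", "juliet"]
    return " ".join(rng.choice(vocab) for _ in range(n_words))

prompt = make_random_prompt(5000, seed=42)
print(len(prompt.split()))
```

In a real run you would POST this prompt to the server's /v1/chat/completions endpoint with a fixed max_tokens (e.g. 2000) and read the printed timings.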
model: Qwen3.5-27B-UD-Q5_K_XL.gguf
mainline
prompt eval time = 7002.15 ms / 7308 tokens ( 0.96 ms per token, 1043.68 tokens per second)
eval time = 65670.21 ms / 2000 tokens ( 32.84 ms per token, 30.46 tokens per second)
total time = 72672.36 ms / 9308 tokens
ik
prompt eval time = 6166.72 ms / 7308 tokens ( 0.84 ms per token, 1185.07 tokens per second)
eval time = 69941.98 ms / 2000 tokens ( 34.97 ms per token, 28.60 tokens per second)
total time = 76108.70 ms / 9308 tokens
Yeah, it's confusing, as many new mainline quants are "pre-fused" and so don't need -muge on ik. I recommend using --merge-qkv -muge with my quants; the fallback is sane if it isn't supported for any other reason.
Glad you're getting more throughput!


