

<p align="center">
<img src="https://huggingface.co/proxy/cdn-uploads.huggingface.co/production/uploads/685e122d50df66f41587d406/kzCYol3JJ0nC3x3RvBZWd.png" alt="语调">
</p>

Language: 中文 | [English](https://huggingface.co/YCWTG/Qwen3-Coder-Next-int2-mixed-AutoRound)

## Model Details

This model is a mixed-bits INT2 quantized model with group_size 512 and symmetric quantization of [Qwen/Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next), generated by [intel/auto-round](https://github.com/intel/auto-round). Please follow the license of the original model.
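
To make "symmetric quantization" and "group_size" concrete, here is a minimal sketch of group-wise symmetric 2-bit quantization using plain round-to-nearest. It is an illustration only: AutoRound learns its rounding via signed gradient descent, and its exact scale convention may differ. The function names and the toy `group_size=4` are assumptions made for this example.

```python
def quantize_group_sym(weights, bits=2):
    """Round-to-nearest symmetric quantization of one group (illustrative only)."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # -2..1 for 2-bit
    max_abs = max(abs(w) for w in weights)
    # One shared scale per group; no zero-point because the scheme is symmetric.
    scale = max_abs / (2 ** (bits - 1)) if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


def quantize_row(row, group_size, bits=2):
    """Split a weight row into groups and quantize each with its own scale."""
    groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    return [quantize_group_sym(g, bits) for g in groups]


# Toy row with group_size=4 (hypothetical values).
row = [0.4, -0.4, 0.12, -0.2, 0.03, 0.01, -0.02, 0.0]
quantized = quantize_row(row, group_size=4)
for codes, scale in quantized:
    print(codes, scale, dequantize_group(codes, scale))
```

A larger group size means fewer scales to store, at the cost of a coarser fit for each group.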

### Quantization Strategy (Intel MoE Recipe)

| Layer type | Bits | Notes |
| ---------------------------------- | ------ | ------------------- |
| Expert layers (512 experts) | 2-bit | MoE expert MLPs |
| Non-expert layers (attention, gate) | 16-bit | Higher precision to preserve quality |
| shared_expert_gate | 16-bit | Skipped (shape not divisible by 32) |
| lm_head | Original precision | Excluded by AutoRound |
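
The table can be read as a rule that maps a module name to a bit width. The sketch below is a hypothetical helper, not part of AutoRound; the example module names assume the `model.layers.{i}.…` naming used by the generation script in this README.

```python
def bits_for_module(name):
    """Return the bit width the table assigns to a linear module.

    Returns None for lm_head, which keeps its original precision.
    """
    if name == "lm_head":
        return None
    # Attention projections (full and linear attention) stay at 16-bit.
    if ".self_attn." in name or ".linear_attn." in name:
        return 16
    # The MoE router gate and the shared-expert gate also stay at 16-bit.
    if name.endswith(".mlp.gate") or name.endswith(".mlp.shared_expert_gate"):
        return 16
    # Everything else falls back to the 2-bit default (expert MLPs).
    return 2


examples = [
    "model.layers.3.self_attn.q_proj",
    "model.layers.0.linear_attn.out_proj",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.shared_expert_gate",
    "model.layers.0.mlp.experts.17.gate_proj",  # assumed expert naming
    "lm_head",
]
for name in examples:
    print(f"{name}: {bits_for_module(name)}")
```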

### Model Size

* Original BF16: ~160GB
* Mixed INT2: ~25GB (about 84% smaller)
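
As a quick check of the figures above, the following sketch recomputes the size reduction and a rough average bits-per-weight. It assumes both sizes are approximate and that the BF16 checkpoint stores 16 bits per weight; the average therefore includes the 16-bit layers and quantization metadata.

```python
bf16_gb = 160  # approximate size of the original BF16 checkpoint
int2_gb = 25   # approximate size of the mixed INT2 checkpoint

reduction = 1 - int2_gb / bf16_gb      # fraction of storage saved
avg_bits = 16 * int2_gb / bf16_gb      # rough average bits per weight

print(f"Size reduction: {reduction:.0%}")
print(f"Average bits per weight: {avg_bits}")
```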


## Quick Start

### Usage with Transformers

```python
import math
import os

os.environ.setdefault(
    "PYTORCH_ALLOC_CONF",
    # Keep a safer allocator default and transparently migrate the deprecated env var
    os.environ.pop("PYTORCH_CUDA_ALLOC_CONF", None) or "expandable_segments:True",
)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "YCWTG/Qwen3-Coder-Next-int2-mixed-AutoRound"
AUTO_MAX_TOKENS = True
MANUAL_MAX_NEW_TOKENS = 128
AUTO_MAX_TOKENS_RATIO = 1.5

HAS_CUDA = torch.cuda.is_available()
# Read total VRAM once and use it to pick the default loading mode
GPU_TOTAL_MIB = torch.cuda.get_device_properties(0).total_memory // (1024 ** 2) if HAS_CUDA else 0
# Defaults to False on 32GB-class GPUs and to True on GPUs with less VRAM
ENABLE_CPU_OFFLOAD = HAS_CUDA and GPU_TOTAL_MIB < 32000
MAX_MEMORY = {0: "18GiB", "cpu": "64GiB"} if ENABLE_CPU_OFFLOAD else {0: "22GiB", "cpu": "16GiB"}


def get_input_device(model):
    # With device_map="auto", the first usable device may not be model.device
    device_map = getattr(model, "hf_device_map", None)
    cpu_device = None
    if isinstance(device_map, dict):
        for loc in device_map.values():
            if isinstance(loc, int):
                return torch.device(f"cuda:{loc}")
            if isinstance(loc, str):
                if loc.startswith("cuda"):
                    return torch.device(loc)
                if loc.startswith("cpu"):
                    cpu_device = torch.device("cpu")
    return cpu_device or next(model.parameters()).device


def load_model():
    print("Loading model...")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True, use_fast=True)
    # Use EOS as PAD to avoid warnings during chat generation on models without a pad token
    tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
    tokenizer.padding_side = "left"

    model_kwargs = {
        "pretrained_model_name_or_path": MODEL_NAME,
        "dtype": torch.bfloat16,
        "trust_remote_code": True,
        "low_cpu_mem_usage": True,
        "device_map": "auto" if HAS_CUDA else "cpu",
    }
    if HAS_CUDA:
        print(f"Total GPU memory: {GPU_TOTAL_MIB} MiB")
        model_kwargs["max_memory"] = MAX_MEMORY
        if ENABLE_CPU_OFFLOAD:
            model_kwargs["offload_buffers"] = True
            print("CPU offload: on")
        else:
            print("CPU offload: off (GPU first, small CPU spillover allowed)")
    else:
        print("CUDA is not available, running on CPU")

    try:
        model = AutoModelForCausalLM.from_pretrained(**model_kwargs)
    except RuntimeError as e:
        # Give a beginner-friendly hint instead of only the raw traceback
        if "out of memory" in str(e).lower():
            print("\nCUDA OOM while loading the model")
            print("Close other GPU programs, or set ENABLE_CPU_OFFLOAD = True and retry")
        raise
    model.eval()
    return model, tokenizer


def multiline_input():
    print('User (enter "END" on its own line to send, "exit" to quit):')
    lines = []
    while True:
        line = input()
        text = line.strip()
        if text.lower() in {"exit", "quit"}:
            return None
        if text == "END":
            break
        lines.append(line)
    return "\n".join(lines)


def build_input_ids(tokenizer, messages, device):
    if getattr(tokenizer, "chat_template", None):
        # Preferred path for chat models: let the tokenizer build the prompt format
        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    else:
        # Generic fallback when there is no built-in chat template
        prompt = "\n".join(
            [f"{'User' if m['role'] == 'user' else 'Assistant'}: {m['content']}" for m in messages]
            + ["Assistant:"]
        )
    return tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)


def chat_loop(model, tokenizer):
    print("\n===== Chat started =====\n")
    print(f"Auto max_tokens: {'on' if AUTO_MAX_TOKENS else 'off'}")
    if not AUTO_MAX_TOKENS:
        print(f"Manual max_new_tokens: {MANUAL_MAX_NEW_TOKENS}")
    print(
        "Hint: set ENABLE_CPU_OFFLOAD = False to try the faster all-GPU mode"
        if ENABLE_CPU_OFFLOAD
        else "Hint: if a large max_tokens causes CUDA OOM, set ENABLE_CPU_OFFLOAD = True"
    )

    messages = []
    device = get_input_device(model)
    print(f"Input device: {device}")

    while True:
        user_text = multiline_input()
        if user_text is None:
            break

        messages.append({"role": "user", "content": user_text})
        input_ids = build_input_ids(tokenizer, messages, device)
        prompt_tokens = int(input_ids.shape[-1])
        # Auto mode: output length scales with prompt length (1.5x by default)
        max_new_tokens = max(1, math.ceil(prompt_tokens * AUTO_MAX_TOKENS_RATIO)) if AUTO_MAX_TOKENS else int(MANUAL_MAX_NEW_TOKENS)

        print(f"Prompt tokens: {prompt_tokens}")
        print(f"max_new_tokens: {max_new_tokens}")

        try:
            with torch.inference_mode():
                output_ids = model.generate(
                    input_ids=input_ids,
                    max_new_tokens=max_new_tokens,
                    do_sample=True,
                    temperature=1.0,
                    top_p=0.95,
                    top_k=40,
                    use_cache=False,
                    pad_token_id=tokenizer.pad_token_id,
                    eos_token_id=tokenizer.eos_token_id,
                )
        except RuntimeError as e:
            error_text = str(e).lower()
            if HAS_CUDA and ("cublas_status_alloc_failed" in error_text or "out of memory" in error_text):
                # Free cached CUDA blocks so the next attempt starts from a cleaner CUDA state
                torch.cuda.empty_cache()
                print("\nCUDA OOM during generation")
                print("Set ENABLE_CPU_OFFLOAD = True, or turn off AUTO_MAX_TOKENS and lower MANUAL_MAX_NEW_TOKENS")
                # Drop the user turn that failed so the history stays consistent
                messages.pop()
                continue
            raise

        reply_text = tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)
        print(f"\nAssistant:\n{reply_text}\n")
        messages.append({"role": "assistant", "content": reply_text})


if __name__ == "__main__":
    model, tokenizer = load_model()
    chat_loop(model, tokenizer)
```
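
The two sizing rules in the script above (the `max_new_tokens` budget and the memory mode) can be isolated into pure functions to check what a given setup would choose without loading the model. This is a sketch: the function names are hypothetical, and the thresholds and limits are copied from the script.

```python
import math


def pick_max_new_tokens(prompt_tokens, auto=True, ratio=1.5, manual=128):
    """Mirror the script's rule: scale with prompt length, or use the manual value."""
    if auto:
        return max(1, math.ceil(prompt_tokens * ratio))
    return int(manual)


def pick_max_memory(gpu_total_mib):
    """Mirror the script's rule: enable CPU offload below roughly 32GB of VRAM."""
    offload = gpu_total_mib < 32000
    limits = {0: "18GiB", "cpu": "64GiB"} if offload else {0: "22GiB", "cpu": "16GiB"}
    return offload, limits


print(pick_max_new_tokens(10))              # auto mode, 10-token prompt
print(pick_max_new_tokens(500, auto=False)) # manual mode ignores prompt length
print(pick_max_memory(24576))               # a 24GiB GPU
print(pick_max_memory(32607))               # a 32GB-class GPU
```

Because the auto budget is proportional to the prompt, short prompts get short replies; switching `AUTO_MAX_TOKENS` off and raising `MANUAL_MAX_NEW_TOKENS` is the way to get longer output for brief questions.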

## Generate the Model

```python
import os

from auto_round import AutoRound

model_name = "Qwen/Qwen3-Coder-Next"

# Build the layer config for mixed bits (Intel recipe)
layer_config = {}
for i in range(48):  # 48 layers
    prefix = f"model.layers.{i}"

    # Attention layers -> 16-bit
    if i in [3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47]:  # self_attn layers
        for proj in ["q_proj", "k_proj", "v_proj", "o_proj"]:
            layer_config[f"{prefix}.self_attn.{proj}"] = {"bits": 16}
    else:  # linear_attn layers -> 16-bit
        for proj in ["in_proj_qkvz", "in_proj_ba", "out_proj"]:
            layer_config[f"{prefix}.linear_attn.{proj}"] = {"bits": 16}

    # MLP gate -> 16-bit
    layer_config[f"{prefix}.mlp.gate"] = {"bits": 16}

    # shared_expert_gate -> 16-bit (skipped)
    layer_config[f"{prefix}.mlp.shared_expert_gate"] = {"bits": 16}

autoround = AutoRound(
    model_name,
    bits=2,  # experts default to 2-bit
    group_size=128,
    sym=True,
    iters=1000,
    nsamples=512,
    lr=2e-3,
    layer_config=layer_config,
    low_gpu_mem_usage=True,
    enable_alg_ext=True,
)
# Python does not expand "~" on its own, so resolve it explicitly
output_dir = os.path.expanduser("~/model/YCWTG--Qwen3-Coder-Next-int2-mixed-AutoRound")
autoround.quantize_and_save(output_dir, format="auto_round")
```
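
The hard-coded list of `self_attn` layer indices in the script above follows a regular pattern: every fourth layer, starting at index 3. The sketch below rebuilds the same 16-bit overrides from that pattern and counts them, which is a cheap sanity check before starting a long quantization run. The helper name and its `full_attn_every` parameter are hypothetical.

```python
FULL_ATTENTION_LAYERS = [3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47]


def build_layer_config(num_layers=48, full_attn_every=4):
    """Rebuild the 16-bit overrides, deriving self_attn layers from i % 4 == 3."""
    config = {}
    for i in range(num_layers):
        prefix = f"model.layers.{i}"
        if i % full_attn_every == full_attn_every - 1:
            names = [f"self_attn.{p}" for p in ("q_proj", "k_proj", "v_proj", "o_proj")]
        else:
            names = [f"linear_attn.{p}" for p in ("in_proj_qkvz", "in_proj_ba", "out_proj")]
        # Router gate and shared-expert gate stay at 16-bit in every layer.
        names += ["mlp.gate", "mlp.shared_expert_gate"]
        for name in names:
            config[f"{prefix}.{name}"] = {"bits": 16}
    return config


layer_config = build_layer_config()
derived = [i for i in range(48) if i % 4 == 3]
print(derived == FULL_ATTENTION_LAYERS)
print(len(layer_config))  # 12 * (4 + 2) + 36 * (3 + 2) overrides
```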

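## Ethical Considerations and Limitations

The model can produce factually incorrect output and should not be relied on for factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it may generate lewd, biased, or otherwise offensive content.
Developers should therefore perform safety testing before deploying any application based on this model.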



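## Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.
Here are some useful links to learn more about Intel's AI software:

* [Intel Neural Compressor](https://github.com/intel/neural-compressor)
* [AutoRound](https://github.com/intel/auto-round)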



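## Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.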



## Cite

@article{cheng2023optimize, title={Optimize weight rounding via signed gradient descent for the quantization of llms}, author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi}, journal={arXiv preprint arXiv:2309.05516}, year={2023} }

[arxiv](https://arxiv.org/abs/2309.05516)