<p align="center">
<img src="https://huggingface.co/proxy/cdn-uploads.huggingface.co/production/uploads/685e122d50df66f41587d406/kzCYol3JJ0nC3x3RvBZWd.png" alt="语调">
</p>

Language: 中文 | [English](https://huggingface.co/YCWTG/Qwen3-Coder-Next-int2-mixed-AutoRound)

## Model Details

This model is a **mixed-bits INT2 quantized** version of [Qwen/Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next), produced with [intel/auto-round](https://github.com/intel/auto-round) using symmetric quantization and a group_size of 512. Please follow the license of the original model.

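To make "symmetric quantization with a group size" concrete, here is a minimal NumPy sketch. It uses plain round-to-nearest, not AutoRound's learned rounding, and the function name `quantize_symmetric_groups`, the signed level range of -2 to 1, and the tiny group size in the usage lines are illustrative assumptions rather than the library's implementation or storage format.

```python
import numpy as np


def quantize_symmetric_groups(weights, bits=2, group_size=4):
    """Round-to-nearest symmetric quantization with one scale per group.

    Illustrative only: AutoRound tunes the rounding with signed gradient
    descent; this sketch just rounds to the nearest level.
    """
    w = np.asarray(weights, dtype=np.float32)
    if w.size % group_size != 0:
        raise ValueError("number of weights must be divisible by group_size")
    groups = w.reshape(-1, group_size)

    # Signed integer levels: [-2, 1] for 2 bits.
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

    # Symmetric: the zero point is fixed at 0, so each group needs only a scale.
    max_abs = np.abs(groups).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / abs(qmin), 1.0)

    q = np.clip(np.round(groups / scale), qmin, qmax).astype(np.int8)
    dequant = (q * scale).reshape(w.shape)
    return q, scale, dequant


q, scale, dequant = quantize_symmetric_groups([0.4, -0.8, 0.1, 0.0])
print(q.tolist())  # [[1, -2, 0, 0]]
```

A larger group size means fewer scales to store, at the cost of each scale having to cover a wider spread of weights.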
### Quantization Strategy (Intel MoE Recipe)

| Layer type | Bits | Notes |
| ----------------------------------- | ------------------ | -------------------------------------------- |
| Expert layers (512 experts) | 2-bit | MoE expert MLPs |
| Non-expert layers (attention, gate) | 16-bit | Kept at higher precision to preserve quality |
| shared_expert_gate | 16-bit | Skipped (shape not divisible by 32) |
| lm_head | Original precision | Excluded by AutoRound |

### Model Size

* **Original BF16**: ~160 GB
* **Mixed INT2**: ~25 GB (**about 84% smaller**)

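The reduction figure follows directly from the two sizes above. The short sketch below redoes the arithmetic and also derives a rough average number of bits per stored weight. The helper names are hypothetical, and because both sizes are approximate the results are estimates only.

```python
def size_reduction_percent(original_gb, quantized_gb):
    """Percentage by which the checkpoint shrinks."""
    return (1.0 - quantized_gb / original_gb) * 100.0


def average_bits_per_weight(original_gb, quantized_gb, original_bits=16):
    """Rough average bits per weight, counting scales and the 16-bit layers."""
    return original_bits * quantized_gb / original_gb


print(f"{size_reduction_percent(160, 25):.1f}% smaller")
print(f"{average_bits_per_weight(160, 25):.2f} bits per weight on average")
```

The average lands above 2 bits because the attention layers, gates, and lm_head stay at higher precision and each quantized group also stores a scale.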
## Quick Start

### Usage with Transformers

```python
import math
import os

# Keep a safer allocator default and transparently migrate the deprecated
# PYTORCH_CUDA_ALLOC_CONF environment variable to PYTORCH_ALLOC_CONF.
os.environ.setdefault(
    "PYTORCH_ALLOC_CONF",
    os.environ.pop("PYTORCH_CUDA_ALLOC_CONF", None) or "expandable_segments:True",
)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "YCWTG/Qwen3-Coder-Next-int2-mixed-AutoRound"
AUTO_MAX_TOKENS = True
MANUAL_MAX_NEW_TOKENS = 128
AUTO_MAX_TOKENS_RATIO = 1.5

HAS_CUDA = torch.cuda.is_available()
# Read the total VRAM once and use it to pick the default loading mode.
GPU_TOTAL_MIB = torch.cuda.get_device_properties(0).total_memory // (1024 ** 2) if HAS_CUDA else 0
# Defaults to False on 32 GB-class GPUs and to True on GPUs with less VRAM.
ENABLE_CPU_OFFLOAD = HAS_CUDA and GPU_TOTAL_MIB < 32000
MAX_MEMORY = {0: "18GiB", "cpu": "64GiB"} if ENABLE_CPU_OFFLOAD else {0: "22GiB", "cpu": "16GiB"}


def get_input_device(model):
    # With device_map="auto", the first usable device may differ from model.device.
    device_map = getattr(model, "hf_device_map", None)
    cpu_device = None
    if isinstance(device_map, dict):
        for loc in device_map.values():
            if isinstance(loc, int):
                return torch.device(f"cuda:{loc}")
            if isinstance(loc, str):
                if loc.startswith("cuda"):
                    return torch.device(loc)
                if loc.startswith("cpu"):
                    cpu_device = torch.device("cpu")
    return cpu_device or next(model.parameters()).device


def load_model():
    print("Loading model...")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True, use_fast=True)
    # Use EOS as PAD to avoid warnings during chat generation on models without a pad token.
    tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
    tokenizer.padding_side = "left"

    model_kwargs = {
        "pretrained_model_name_or_path": MODEL_NAME,
        "dtype": torch.bfloat16,
        "trust_remote_code": True,
        "low_cpu_mem_usage": True,
        "device_map": "auto" if HAS_CUDA else "cpu",
    }
    if HAS_CUDA:
        print(f"Total GPU memory: {GPU_TOTAL_MIB} MiB")
        model_kwargs["max_memory"] = MAX_MEMORY
        if ENABLE_CPU_OFFLOAD:
            model_kwargs["offload_buffers"] = True
            print("CPU offload: on")
        else:
            print("CPU offload: off (GPU preferred, small CPU spillover allowed)")
    else:
        print("CUDA is not available, running on CPU")

    try:
        model = AutoModelForCausalLM.from_pretrained(**model_kwargs)
    except RuntimeError as e:
        # Print a beginner-friendly hint before re-raising the original error.
        if "out of memory" in str(e).lower():
            print("\nCUDA OOM while loading the model")
            print("Close other GPU programs, or set ENABLE_CPU_OFFLOAD = True and retry")
        raise

    model.eval()
    return model, tokenizer


def multiline_input():
    print('User (type "END" on its own line to send, "exit" to quit):')
    lines = []
    while True:
        line = input()
        text = line.strip()
        if text.lower() in {"exit", "quit"}:
            return None
        if text == "END":
            break
        lines.append(line)
    return "\n".join(lines)


def build_input_ids(tokenizer, messages, device):
    if getattr(tokenizer, "chat_template", None):
        # Preferred path for chat models: let the tokenizer build the prompt format.
        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    else:
        # Generic fallback when the tokenizer has no built-in chat template.
        prompt = "\n".join(
            [f"{'User' if m['role'] == 'user' else 'Assistant'}: {m['content']}" for m in messages]
            + ["Assistant:"]
        )
    return tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)


def chat_loop(model, tokenizer):
    print("\n===== Chat started =====\n")
    print(f"Auto max_tokens: {'on' if AUTO_MAX_TOKENS else 'off'}")
    if not AUTO_MAX_TOKENS:
        print(f"Manual max_new_tokens: {MANUAL_MAX_NEW_TOKENS}")
    print(
        "Tip: set ENABLE_CPU_OFFLOAD = False to try the faster full-GPU mode"
        if ENABLE_CPU_OFFLOAD
        else "Tip: if a large max_tokens causes CUDA OOM, set ENABLE_CPU_OFFLOAD = True"
    )

    messages = []
    device = get_input_device(model)
    print(f"Input device: {device}")

    while True:
        user_text = multiline_input()
        if user_text is None:
            break
        messages.append({"role": "user", "content": user_text})

        input_ids = build_input_ids(tokenizer, messages, device)
        prompt_tokens = int(input_ids.shape[-1])
        # Auto mode: the output budget scales with the prompt length (1.5x by default).
        if AUTO_MAX_TOKENS:
            max_new_tokens = max(1, math.ceil(prompt_tokens * AUTO_MAX_TOKENS_RATIO))
        else:
            max_new_tokens = int(MANUAL_MAX_NEW_TOKENS)
        print(f"Prompt tokens: {prompt_tokens}")
        print(f"max_new_tokens: {max_new_tokens}")

        try:
            with torch.inference_mode():
                output_ids = model.generate(
                    input_ids=input_ids,
                    max_new_tokens=max_new_tokens,
                    do_sample=True,
                    temperature=1.0,
                    top_p=0.95,
                    top_k=40,
                    use_cache=False,  # no KV cache: lower memory use, slower generation
                    pad_token_id=tokenizer.pad_token_id,
                    eos_token_id=tokenizer.eos_token_id,
                )
        except RuntimeError as e:
            error_text = str(e).lower()
            if HAS_CUDA and ("cublas_status_alloc_failed" in error_text or "out of memory" in error_text):
                # Free cached CUDA blocks so the next attempt starts from a cleaner state.
                torch.cuda.empty_cache()
                print("\nCUDA OOM during generation")
                print("Set ENABLE_CPU_OFFLOAD = True, or disable AUTO_MAX_TOKENS and lower MANUAL_MAX_NEW_TOKENS")
                # Drop the failed user turn so the history stays consistent.
                messages.pop()
                continue
            raise

        reply_text = tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)
        print(f"\nAssistant:\n{reply_text}\n")
        messages.append({"role": "assistant", "content": reply_text})


if __name__ == "__main__":
    model, tokenizer = load_model()
    chat_loop(model, tokenizer)
```
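The two sizing decisions in the script above (the memory budget chosen from total VRAM, and the token budget scaled from the prompt length) can be pulled out as pure functions and checked without a GPU. This is a sketch only: `pick_max_memory` and `pick_max_new_tokens` are hypothetical helper names, and the thresholds and budgets are copied from the script's defaults.

```python
import math


def pick_max_memory(gpu_total_mib, has_cuda=True, threshold_mib=32000):
    """Mirror the script's default: offload to CPU when VRAM is below the threshold."""
    offload = has_cuda and gpu_total_mib < threshold_mib
    if offload:
        budget = {0: "18GiB", "cpu": "64GiB"}
    else:
        budget = {0: "22GiB", "cpu": "16GiB"}
    return offload, budget


def pick_max_new_tokens(prompt_tokens, auto=True, ratio=1.5, manual=128):
    """Auto mode scales the output budget with the prompt; manual mode uses a fixed cap."""
    if auto:
        return max(1, math.ceil(prompt_tokens * ratio))
    return int(manual)


print(pick_max_memory(24576))   # a 24 GiB card falls below the threshold
print(pick_max_memory(32768))   # a card reporting 32 GiB does not
print(pick_max_new_tokens(10), pick_max_new_tokens(10, auto=False))
```

Because auto mode ties the output budget to the prompt length, short prompts get short replies. Switching `AUTO_MAX_TOKENS` off and raising `MANUAL_MAX_NEW_TOKENS` is the way to allow long answers to short questions.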
## Generate the Model

```python
import os

from auto_round import AutoRound

model_name = "Qwen/Qwen3-Coder-Next"

# Build the layer config for mixed bits (Intel recipe).
layer_config = {}
for i in range(48):  # 48 layers
    prefix = f"model.layers.{i}"

    # Attention layers -> 16-bit
    if i in [3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47]:  # self_attn layers
        for proj in ["q_proj", "k_proj", "v_proj", "o_proj"]:
            layer_config[f"{prefix}.self_attn.{proj}"] = {"bits": 16}
    else:  # linear_attn layers -> 16-bit
        for proj in ["in_proj_qkvz", "in_proj_ba", "out_proj"]:
            layer_config[f"{prefix}.linear_attn.{proj}"] = {"bits": 16}

    # MLP gate -> 16-bit
    layer_config[f"{prefix}.mlp.gate"] = {"bits": 16}

    # shared_expert_gate -> 16-bit (skipped)
    layer_config[f"{prefix}.mlp.shared_expert_gate"] = {"bits": 16}

autoround = AutoRound(
    model_name,
    bits=2,  # experts default to 2-bit
    group_size=128,
    sym=True,
    iters=1000,
    nsamples=512,
    lr=2e-3,
    layer_config=layer_config,
    low_gpu_mem_usage=True,
    enable_alg_ext=True,
)

# Expand "~" explicitly; Python does not expand it in plain path strings.
output_dir = os.path.expanduser("~/model/YCWTG--Qwen3-Coder-Next-int2-mixed-AutoRound")
autoround.quantize_and_save(output_dir, format="auto_round")
```
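Before launching a long quantization run, it can help to sanity-check which layers the config pins to 16-bit. The sketch below rebuilds the same mapping as a standalone function that needs neither AutoRound nor the model weights. The function name `build_layer_config` and the `full_attn_every` parameter are hypothetical, and the sketch assumes the full-attention layers sit at every fourth index (3, 7, ..., 47), which matches the explicit list in the script.

```python
def build_layer_config(num_layers=48, full_attn_every=4):
    """Rebuild the 16-bit overrides from the script as a plain dict."""
    layer_config = {}
    for i in range(num_layers):
        prefix = f"model.layers.{i}"
        if (i + 1) % full_attn_every == 0:
            # Full-attention layers: indices 3, 7, ..., 47 with the defaults.
            names = [f"self_attn.{p}" for p in ("q_proj", "k_proj", "v_proj", "o_proj")]
        else:
            # Linear-attention layers.
            names = [f"linear_attn.{p}" for p in ("in_proj_qkvz", "in_proj_ba", "out_proj")]
        # Router gate and shared-expert gate stay at 16-bit in every layer.
        names += ["mlp.gate", "mlp.shared_expert_gate"]
        for name in names:
            layer_config[f"{prefix}.{name}"] = {"bits": 16}
    return layer_config


config = build_layer_config()
full_attn_layers = sorted({int(key.split(".")[2]) for key in config if ".self_attn." in key})
print(len(config), "modules kept at 16-bit")
print("full-attention layers:", full_attn_layers)
```

Every module not listed in the dict falls back to the global `bits=2` setting passed to `AutoRound`, which is how the expert MLPs end up at 2-bit.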
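## Ethical Considerations and Limitations

The model can produce factually incorrect output and should not be relied on for accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it may generate lewd, biased, or otherwise offensive content.

Developers should therefore perform safety testing before deploying any application built on this model.
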
## Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

Here are some useful links to learn more about Intel's AI software:

* [Intel Neural Compressor](https://github.com/intel/neural-compressor)
* [AutoRound](https://github.com/intel/auto-round)

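## Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.
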
## Citation

@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of llms},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}

[arxiv](https://arxiv.org/abs/2309.05516)