Can I run this with 36GB VRAM + 192GB RAM?

#1
by nimishchaudhari - opened

Hello there

I'm looking to run an AWQ quant on my 36GB of VRAM (3090 + 3060 12GB) with 192GB of DDR5 RAM. I've never managed to run any AWQ quants; could I get some help running this on my setup?

Thanks in advance

Try it

Hello, thanks for responding. Could you share the parameters you use to run the model?
For example, your setup and vLLM parameters?

Thanks in advance

Stop being lazy, go read the model card, vLLM documentation, etc and then figure it out with trial-and-error - if you get stuck, Google it or ask ChatGPT

I just spent the evening debugging this on a dual-GPU setup (2x RTX 3090, Ryzen 9950X, 96GB RAM). This model requires a specific environment because standard pip packages do not support the GLM-4.7 architecture yet.

Here are my findings regarding MTP, the exact Dockerfile you need, and the memory math for your specific 3090 + 3060 setup.

1. Findings: MTP (Speculative Decoding) is Broken on AWQ

The "Flash" feature (MTP) is currently not viable for the AWQ version of this model in vLLM.

  • The Issue: I tested Speculative Decoding extensively. Even on coding tasks, the draft acceptance rate was 0% to 1% in the logs.
  • The Impact: MTP consumes roughly 5GB of extra VRAM to run the draft model. Since the main model rejects 99% of the draft tokens, you are burning 5GB of VRAM for no speed gain.
  • Performance: By disabling MTP, I achieved ~80-90 tokens/second using the native Marlin 4-bit kernels on 2x3090s. This is extremely fast and stable.
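To see why a near-zero acceptance rate kills speculative decoding, the expected yield can be sketched with the standard speculative-decoding formula (generic math, not vLLM-specific; the draft length of 4 is illustrative):

```python
# Expected tokens generated per decoding step with speculative decoding,
# assuming a per-token draft acceptance rate `alpha` and a draft length
# of `k` tokens: E = (1 - alpha**(k + 1)) / (1 - alpha).
def expected_tokens_per_step(alpha: float, k: int) -> float:
    if alpha >= 1.0:
        return k + 1.0
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# At the ~1% acceptance rate seen in the logs, the yield is barely above
# 1 token per step, i.e. no speedup over plain decoding:
print(round(expected_tokens_per_step(0.01, 4), 3))  # ~1.01
# A healthy draft model (say 80% acceptance) would yield ~3.4 tokens/step:
print(round(expected_tokens_per_step(0.80, 4), 3))
```

So at 0-1% acceptance you pay the 5GB draft-model cost for essentially one token per step, which is exactly what plain decoding gives you for free.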

2. Findings: Memory & Limits (For your 3060)

I analyzed the memory usage inside the container via nvidia-smi.

  • Model Weights: 18.5 GB Total (9.25 GB per GPU).
  • PyTorch Overhead: ~0.8 GB per GPU.
  • Total Static Load: ~10.1 GB per card.

For your specific setup (3090 + 3060 12GB):
Since vLLM splits the model evenly, your 12GB card is the hard limit.

  • You have ~1.9 GB free on the 3060 for the Context (KV Cache).
  • CRITICAL: You must limit --max-model-len to 4096 (safe) or 8192 (risky). If you try to run standard 32k context, your 3060 will crash immediately.
  • CRITICAL: You must limit --max-num-seqs to 16 or 32 to reduce the sampler memory overhead.
  • You may also want to look into llama.cpp or other inference servers that support uneven layer splits across mismatched GPUs like yours.

3. The Dockerfile (Required)

You cannot run this with standard pip libraries. The GLM-4.7 architecture definition only exists in the Nightly builds right now.

Save as Dockerfile:

# Use a recent CUDA base image
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04

# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1

# Install system dependencies
RUN apt-get update && apt-get install -y \
    python3-pip \
    python3-dev \
    git \
    ninja-build \
    && rm -rf /var/lib/apt/lists/*

# Upgrade pip
RUN pip3 install --upgrade pip

# 1. Install vLLM Nightly (Required for GLM-4.7 architecture support)
RUN pip3 install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly

# 2. Install Transformers from source (CRITICAL: Fixes "Model Class not found" errors)
RUN pip3 install git+https://github.com/huggingface/transformers.git

# 3. Install AutoAWQ/Kernels
RUN pip3 install autoawq setuptools

# Set the working directory
WORKDIR /app

# Entrypoint
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]

Build command:
docker build -t glm-4.7-custom .

4. The Optimized Run Command

This command disables the broken MTP, maximizes memory usage, and sets a safe context limit for your 12GB card.

docker run -it --rm \
    --gpus all \
    --ipc=host \
    -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    glm-4.7-custom \
    --model cyankiwi/GLM-4.7-Flash-AWQ-4bit \
    --tensor-parallel-size 2 \
    --trust-remote-code \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --served-model-name glm-4.7-flash \
    --max-model-len 4096 \
    --max-num-seqs 16 \
    --gpu-memory-utilization 0.95

Note: I set max-model-len to 4096 to ensure it fits your 12GB card. If it loads, you can try bumping it to 8192.
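Once the server is up, a quick sanity check is to hit the OpenAI-compatible endpoint. A minimal chat-completions payload looks like this (the model name must match --served-model-name from the run command above; it would be POSTed to http://localhost:8000/v1/chat/completions):

```python
import json

# Minimal chat-completions request body for the server started above.
# Keep max_tokens well under --max-model-len (4096 here), since prompt
# plus completion must fit in the context window.
payload = {
    "model": "glm-4.7-flash",  # must match --served-model-name
    "messages": [
        {"role": "user", "content": "Write a hello-world in Python."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
}
print(json.dumps(payload, indent=2))
```

Any OpenAI-compatible client (the openai Python package, curl, etc.) pointed at port 8000 with this shape should work.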

@Pa3kx this is a detailed, high-quality write-up. Thank you so much.

@Pa3kx thanks for the information. Do you have any idea why the model uses so much VRAM for the KV cache? I was expecting it to be similar to Qwen3 Coder 30B at Q4, but with TP I can only fit a maximum of ~16k context on my dual 3090 setup (for reference, I can run 128k context with Qwen3 Coder 30B).
Is this expected, or do I have some misconfiguration?

For running it, I'm using the following Dockerfile, by the way:

FROM vllm/vllm-openai:nightly
RUN pip install -U --pre "transformers>=5.0.0rc3"

to serve it (maybe a little easier than yours)

Has anyone managed to get this running on WSL? I keep getting ValueError: There is no module or parameter named 'model.layers.1.mlp.gate.e_score_correction_bias' in TransformersMoEForCausalLM

cyankiwi org

Thanks for using the model and helping me support others, everyone.

@ldd19 could you try installing vLLM from source instead of the nightly? Using:

git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install -e .

Information for anyone regarding the massive KV cache VRAM requirements:
https://huggingface.co/zai-org/GLM-4.7-Flash/discussions/3
It seems MLA isn't active yet in vllm for GLM-4.7-Flash
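The missing MLA path explains the gap: with standard (grouped-query) attention every layer stores full K and V tensors per token, while MLA stores a single compressed latent per layer. A rough sketch with ILLUSTRATIVE config numbers (not GLM-4.7-Flash's actual values):

```python
# Per-token KV cache size, standard GQA vs MLA. The layer/head/dim
# numbers below are made up for illustration only.
def kv_bytes_per_token_gqa(layers, kv_heads, head_dim, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V

def kv_bytes_per_token_mla(layers, latent_dim, dtype_bytes=2):
    return layers * latent_dim * dtype_bytes  # one latent per layer

gqa = kv_bytes_per_token_gqa(layers=48, kv_heads=8, head_dim=128)
mla = kv_bytes_per_token_mla(layers=48, latent_dim=512)
ctx = 128_000
print(f"GQA: {gqa * ctx / 2**30:.1f} GiB for {ctx} tokens")
print(f"MLA: {mla * ctx / 2**30:.1f} GiB for {ctx} tokens")
```

Even with these toy numbers the GQA path needs several times the VRAM of the MLA path at long context, which matches the 16k-vs-128k symptom above.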

EDIT:
I have a Dockerfile that adds glm4_moe_lite to the MLA support list, as suggested in the discussion above.
I don't know how well it will work, but for anyone who wants to try:

FROM vllm/vllm-openai:nightly
RUN pip install -U --pre "transformers>=5.0.0rc3"
RUN sed -i 's/^\([[:space:]]*\)"pangu_ultra_moe_mtp",/\1"pangu_ultra_moe_mtp",\n\1"glm4_moe_lite",/' /usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/model_arch_config_convertor.py

With this, it at least fits 128k context on dual 3090s with 0.75 memory utilization and -tp 2.
I can't say anything about quality yet.
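For anyone squinting at the sed line in that Dockerfile: it inserts a "glm4_moe_lite" entry right after "pangu_ultra_moe_mtp" in vLLM's arch-conversion list, preserving indentation. A small sketch replicating the same transform on a stand-in snippet:

```python
import re

# Replicates what the sed command in the Dockerfile above does: after the
# line containing "pangu_ultra_moe_mtp", insert a "glm4_moe_lite" entry
# with matching indentation. `snippet` stands in for the real
# model_arch_config_convertor.py contents.
snippet = '        "pangu_ultra_moe_mtp",\n'
patched = re.sub(
    r'^(\s*)"pangu_ultra_moe_mtp",$',
    r'\1"pangu_ultra_moe_mtp",\n\1"glm4_moe_lite",',
    snippet,
    flags=re.MULTILINE,
)
print(patched)
```

The net effect is that GLM-4.7-Flash configs get routed through the MLA-aware conversion path the linked discussion describes.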


It worked natively on WSL2 with a bit of tweaking; make sure to update to routing_method_type: int = int(RoutingMethodType.DeepSeekV3) in flashinfer_trtllm_moe.py (see PR: https://github.com/vllm-project/vllm/pull/31558).

The model runs fine on Docker.
