Gemma 3 12B: 4-bit Quantization failing/ignored in Transformers v5.1.0 (Gemma3ForConditionalGeneration)

Hi everyone,

I’m reporting a significant regression where 4-bit quantization is ignored for Gemma 3 12B after upgrading to Transformers v5.1.0. The model fails to load into VRAM and spills into Shared GPU Memory (System RAM), slowing inference from 7s to 50s.

The Evidence

- Device Map is Empty: Despite device_map="auto", model.hf_device_map returns None.
- Memory Footprint: model.get_memory_footprint() reports 7.62 GB (suggesting it thinks it is quantized), but Windows Task Manager shows 24.2 GB in use.
- Init Error: Using load_in_4bit=True directly results in:
	TypeError: Gemma3ForConditionalGeneration.__init__() got an unexpected keyword argument 'load_in_4bit'

Setup & Code

Hardware: RTX 3060 12GB (Windows 11)
Env A (Working): Transformers v4.57.3, bnb v0.48.2
Env B (Broken): Transformers v5.1.0, bnb v0.49.1

Identical loading code used in both:

Python

import torch
from transformers import BitsAndBytesConfig, Gemma3ForConditionalGeneration

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-12b-it",
    quantization_config=quant_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

Question

It seems Gemma3ForConditionalGeneration in v5.x is no longer passing the quantization_config to the underlying layers correctly on Windows. Has the initialization flow for Gemma 3 changed, or is this a known issue with the new v5 “quantization as a first-class citizen” refactor?


Probably a known behavior change plus a bug…?


What changed in Transformers v5 (relevant to your symptoms)

Transformers v5 introduced a new weight-loading pipeline (“dynamic weight loading / converter”) and explicitly moved toward quantization being a first-class loading path, not an afterthought applied once a full-precision model is already in memory. (Hugging Face)

That is the correct direction, but it also means that the order of operations during from_pretrained() matters much more: where tensors are materialized (CPU vs GPU), when a quantization conversion runs, and when Accelerate dispatch hooks are attached.

Why your numbers look like “4-bit configured, BF16 actually loaded”

1) The 24.2 GB footprint matches BF16-ish weight residency

Gemma 3 12B is a multimodal model (Gemma3ForConditionalGeneration). Its BF16/FP16 weights are far above 12 GB, so on a 12 GB card the Windows driver will often spill into Shared GPU Memory (system RAM) instead of hard failing.
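
As a back-of-envelope check: roughly 12B parameters × 2 bytes per BF16 weight ≈ 24 GB of weights alone, which lines up with the 24.2 GB figure you see in Task Manager.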

2) get_memory_footprint() can look “4-bit sized” even if peak / resident memory was full precision

model.get_memory_footprint() is not a reliable indicator of peak allocation during load (or of full-precision copies lingering due to allocator behavior / offload behavior). It’s common to see a “small” footprint while the OS-level counters reflect what actually got materialized and kept resident.

This exact mismatch is consistent with a v5 regression where tensors that are supposed to be quantized are materialized on the target device first and only then converted, which is “too late” to prevent the VRAM spike / spill.
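
A minimal cross-check sketch (standard PyTorch/Transformers calls; assumes `model` is the object you just loaded): compare the model's own accounting against the CUDA allocator's view, then compare both against Task Manager.

import torch

print(f"get_memory_footprint : {model.get_memory_footprint() / 1e9:.2f} GB")
print(f"cuda memory_allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"cuda memory_reserved : {torch.cuda.memory_reserved() / 1e9:.2f} GB")
print(f"cuda peak allocated  : {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")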

The closest known regression: v5 materializes before quantizing (bitsandbytes 4-bit)

There is a highly relevant Transformers issue reporting a v5 regression: bitsandbytes 4-bit is scheduled, but the loader still materializes tensors on GPU before the quantization op runs, causing OOM or severe memory spikes. (GitHub)

The proposed fix in that issue is effectively:

  • If a parameter will be quantized (mapping.quantization_operation is not None), materialize it to CPU first, then quantize, then place it on GPU.

That is exactly the kind of ordering bug that would look like “quantization ignored” on Windows (because Windows can spill into shared memory rather than throwing OOM). (GitHub)

Why model.hf_device_map is None is a big red flag

With device_map="auto", Accelerate’s big-model dispatch normally computes a device map and stores it in model.hf_device_map. (Hugging Face)

If hf_device_map is None, it usually means one of these happened:

  1. Accelerate dispatch didn’t run (missing/incompatible Accelerate, or a code path that bypasses dispatch).
  2. The model was instantiated/loaded without the dispatch wrapper being attached (so no map is recorded).
  3. A nonstandard load path bypassed the “big model inference” integration.

Gemma’s own model card explicitly notes installing Accelerate and demonstrates device_map="auto" usage. (Hugging Face)

So your two “signals” line up with the same underlying theme: in v5, the load/dispatch/convert ordering and integration points changed, and your path appears to bypass or break part of that chain.

About the load_in_4bit=True TypeError

This is expected behavior: load_in_4bit is not a model __init__ argument; it’s handled by the from_pretrained() quantization integration via BitsAndBytesConfig/quantization_config. The docs show the supported pattern is passing a config object to from_pretrained(). (Hugging Face)

So treat that error as “wrong API surface”, not the core regression.
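
For reference, a minimal contrast (the commented-out form reproduces the failing call; the second form is the supported surface):

from transformers import BitsAndBytesConfig, Gemma3ForConditionalGeneration

# Passing load_in_4bit as a loose keyword appears to get forwarded to the model
# constructor in v5, which matches the TypeError quoted above.
# model = Gemma3ForConditionalGeneration.from_pretrained("google/gemma-3-12b-it", load_in_4bit=True)

# Supported surface: wrap the flag in a BitsAndBytesConfig and pass it as quantization_config.
model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-12b-it",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)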

Most likely explanation for your exact case

Putting it together:

  • Transformers v5.1.0 uses the new dynamic loader more aggressively. (Hugging Face)
  • In the bnb 4-bit path, there is at least one reported regression where tensors are materialized on GPU before quantization, causing a full-precision-sized residency spike. (GitHub)
  • On Windows + 12 GB VRAM, that manifests as Shared GPU Memory spill and a 7s → 50s slowdown.
  • Separately (but consistent with the same theme), your hf_device_map=None suggests Accelerate dispatch didn’t successfully run/attach, which further increases the chance that the loader isn’t following the expected “meta → dispatch → quantize → place” flow. (Hugging Face)

High-signal checks that will confirm which part is broken

Run these in the broken env (v5.1.0); a short code sketch covering the first two checks follows the list:

  1. Confirm Accelerate is actually engaged

    • If device_map="auto" is being honored, hf_device_map should be a dict. (Hugging Face)
  2. Prove whether full-precision tensors are being materialized

    • Print top-5 largest parameters by numel() and their dtype/device.
    • If you see huge BF16 tensors on GPU/CPU alongside 4-bit wrappers, you’re seeing “quantize too late” or “full-precision copy retained”.
  3. Force failure instead of silent Windows spill

    • Pass max_memory={0: "11GiB", "cpu": "0GiB"}.
    • If it still “loads” by spilling today, this should instead fail early; if it fails, it supports the “full precision got materialized” hypothesis.
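
A minimal sketch for checks 1 and 2 above (assumes `model` is the object returned by the from_pretrained() call in the original post):

import torch

# 1) Is an Accelerate dispatch map recorded?
print("hf_device_map:", getattr(model, "hf_device_map", None))

# 2) Largest parameters by numel(): dtype/device tell you whether 4-bit wrappers
#    or full-precision BF16 tensors actually ended up in memory.
params = sorted(model.named_parameters(), key=lambda kv: kv[1].numel(), reverse=True)
for name, p in params[:5]:
    print(f"{name:60s} {str(p.dtype):15s} {str(p.device):10s} {p.numel():,}")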

Practical workarounds while waiting for an upstream fix

  1. Try a v5 nightly / a commit including the #43032 fix (or apply that patch locally)

    • The fix is small and directly targets your failure mode. (GitHub)
  2. Remove trust_remote_code=True

    • Gemma 3 is supported directly; the official usage snippets don’t require remote code. (Hugging Face)
  3. Use TorchAO int4 instead of bitsandbytes (v5 direction)

    • v5 explicitly integrates TorchAO as part of the “first-class quantization” effort. (Hugging Face)
  4. Pin back to your known-good combo

    • Transformers 4.57.3 + bnb 0.48.2 is a reasonable operational pin until the v5 loader/quantization ordering is confirmed fixed.

What to say in your report (framing that matches current upstream work)

  • “This looks like a v5 dynamic weight loading / converter ordering issue: 4-bit quantization is configured, but full-precision tensors appear to be materialized first (Windows then spills into shared memory). This matches the regression described in Transformers #43032 (materialize-to-GPU before quantize). Also, device_map="auto" doesn’t populate hf_device_map, suggesting Accelerate dispatch isn’t attaching or is bypassed in this path.”

That ties your symptoms to the specific v5 refactor points and a concrete upstream issue/patch. (Hugging Face)

Hi John, thanks for the reply. I will work through all the details in your reply and provide feedback.


Hi John,

I’ve run the diagnostics on v5.2.0.dev0 to check why the memory behavior has shifted compared to v4. I’ve focused on the device mapping and the materialization logs you were interested in.

  1. Confirming Accelerate Engagement

It appears that device_map="auto" is not being honored in the same way as previous versions.

  • v5 Observation: hasattr(model, "hf_device_map") returns False. The model seems to be loading without a dispatch dictionary.

  • v4 Comparison: The same setup in v4.x properly returns hf_device_map: {'': 0}.

  • Result: Without the device map, the model appears to be “homeless” during the load, materializing in Shared System Memory (RAM) rather than being streamed directly to the GPU.

  2. Proof of Full-Precision Materialization

The logs show a fundamental shift in how weights are handled.

  • The v4 Log (Streaming): Loading checkpoint shards: 100%|██████████| 5/5 [01:10<00:00, 14.10s/it]
    In v4, weights are streamed as shards, and Shared GPU memory remains flat at 0.2 GB.

  • The v5 Log (Materializing): Loading weights: 100%|██████████| 1065/1065 [01:03<00:00, 16.78it/s, Materializing param=…]
    In v5, the library logs “Materializing” for every individual parameter. During this phase, System RAM and Shared GPU Memory spike to 24GB+.

  3. “Quantize Too Late” Evidence

Even though model.get_memory_footprint() eventually reports 7.62 GB, the physical impact on the system during and immediately after the load suggests a “full-precision copy retained” scenario:

  • The Mismatch: VRAM/Shared commitment sits at ~24GB until a manual “nudge” (attribute access + gc.collect()) is performed.

  • Tensor Inspection: Printing the top parameters by numel() shows large bfloat16 tensors present in memory during the materialization phase, rather than the intended 4-bit weights.

Conclusion:
It seems that in the v5 refactor for Gemma 3, the quantization handshake is being bypassed or delayed. Instead of “Quantize-on-load,” the model is performing a full 16-bit materialization in System RAM before attempting to shrink the weights.

I’ve attached the Task Manager captures showing the v5 spike for your review.


The v4 capture


Yeah. I also ran a few experiments on Colab.

The current GitHub main branch now works significantly better than 5.1.0 (at least RAM/VRAM spikes can be avoided). Achieving fully v4-like loading still seems difficult.
However, at this rate, quantization compatibility should gradually improve over time…


What you’re seeing is very consistent with a real behavioral change in the Transformers v5 loading path (especially when 4-bit bitsandbytes quantization is involved), and it explains both the “RAM/shared GPU memory spike” and the feeling that quantization is happening “too late”.

What changed from v4 to v5 (why it looks like “quantize too late”)

v4 mental model (why it felt “streamed” and stable)

In v4, the common fast path for big models + device_map="auto" + quantization was effectively:

  1. Accelerate computes a dispatch plan (device map / offload plan).
  2. Weights are read in chunks (shards) and placed according to that plan.
  3. If quantization is requested, the loader cooperates with the quantizer so tensors don’t sit in full precision on the “wrong” device for long.

That’s why you typically saw “Loading checkpoint shards …” and relatively flat system/shared memory.

v5 mental model (why you see “Materializing param=…” and spikes)

Transformers v5 introduced a new loader (core_model_loading.py) that often loads parameter-by-parameter and explicitly “materializes” tensors as it goes.

The key regression (and it matches your description exactly) is:

  • The loader correctly detects that a parameter will be quantized, but still materializes it directly onto the target device first (often the GPU), in full precision, and only afterwards runs the conversion/quantization step.

That means:

  • you can transiently allocate full-precision weights even though the final model is 4-bit,
  • and on Windows you may also see “Shared GPU memory” balloon because of WDDM + unified/managed allocations behavior when memory pressure rises.

This exact bug is described in the upstream Transformers issue #43032: the v5 loader sets up quantization ops, but then “ignores this and materializes tensors directly to GPU at full precision” before quantization runs. (GitHub)
(Your observations align with that sequence.)


Why hf_device_map can be missing (and why that doesn’t always mean “no device_map”)

hf_device_map is a dispatch artifact. If the model ends up entirely on one device (or the loader doesn’t go through the Accelerate dispatch path you expect), hf_device_map may not be attached even though you passed device_map="auto".

So: hasattr(model, "hf_device_map") == False is a useful symptom, but not a perfect proof of “no mapping happened”. It can also mean “dispatch wasn’t used / wasn’t needed / didn’t persist”.
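
A quick check sketch to tell these cases apart (plain attribute and parameter inspection):

# None (or a missing attribute) means no dispatch map was recorded on the model object.
device_map = getattr(model, "hf_device_map", None)
print("hf_device_map:", device_map)

# Where the parameters actually live, regardless of whether a map was recorded.
print("parameter devices:", {str(p.device) for p in model.parameters()})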


Best practices to get v4-like memory efficiency on v5 + bitsandbytes (without pre-quantizing and without editing library files)

1) Prefer an “offload-first” loading posture so quantization can happen before GPU placement

The core problem is “materialize on GPU in fp16/bf16 → OOM/spike → quantize”.
So you want the opposite: materialize on CPU (or disk offload), quantize, then place.

In v5, the knobs that matter most are the standard from_pretrained loading controls:

  • device_map and max_memory to constrain placement decisions (Hugging Face)
  • offload_state_dict=True and offload_folder=... to keep the CPU peak down when weights must temporarily live outside VRAM (Hugging Face)
  • low_cpu_mem_usage=True to reduce CPU duplication during load (Hugging Face)

Even if device_map="auto" is imperfect in this regression window, pairing it with explicit budgets + offload makes the loader far less likely to “go wild” during the materialization phase.

(This does not “fix” the regression, but it often keeps the transient peak below the cliff.)
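
A hedged sketch of that posture for this thread's model, combining the knobs above (the budget values are illustrative for a 12 GB card, not recommendations):

import torch
from transformers import BitsAndBytesConfig, Gemma3ForConditionalGeneration

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-12b-it",
    quantization_config=quant_config,
    device_map="auto",
    max_memory={0: "11GiB", "cpu": "16GiB"},  # explicit budgets instead of "whatever fits"
    offload_state_dict=True,                  # keep the CPU peak down during load
    offload_folder="offload_temp",            # spill to disk if the RAM budget is exceeded
    low_cpu_mem_usage=True,
)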


2) Use Accelerate’s “big model” loading utilities when you need determinism

If you want something closest to the old “plan first, then stream according to plan” model, the most robust approach is to lean more on Accelerate big-model APIs:

  • init_empty_weights() to instantiate the module structure without allocating real parameter storage
  • load_checkpoint_and_dispatch() to load weights according to a device/offload plan (with disk offload options)

These are documented in Accelerate’s big-modeling reference. (Hugging Face)
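
A minimal sketch of that flow (no quantization shown, so it isolates the dispatch behavior; the no_split_module_classes entry is an assumed block name to verify against your Transformers version):

import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from huggingface_hub import snapshot_download
from transformers import AutoConfig, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-12b-it"
checkpoint_dir = snapshot_download(model_id)  # local safetensors shards + index

config = AutoConfig.from_pretrained(model_id)
with init_empty_weights():
    # Build the module tree on the meta device: no real parameter storage yet.
    model = Gemma3ForConditionalGeneration(config)

model = load_checkpoint_and_dispatch(
    model,
    checkpoint=checkpoint_dir,
    device_map="auto",
    no_split_module_classes=["Gemma3DecoderLayer"],  # assumed name; check your version
    offload_folder="offload_temp",
    dtype=torch.bfloat16,
)
print(getattr(model, "hf_device_map", None))  # dispatch plan recorded by dispatch_model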

Why this helps:

  • It makes the dispatch step explicit and earlier.
  • It reduces the chances that the Transformers v5 per-parameter materializer temporarily places full-precision tensors onto the GPU before quantization.

Caveat:

  • For some architectures + quantizers, you may need to ensure the quantized modules are created in a way that’s compatible with dispatch (e.g., avoid splitting specific blocks; set no_split_module_classes appropriately). The Accelerate docs describe the intended workflow. (Hugging Face)

3) Treat Windows “Shared GPU memory” spikes as a symptom, not the real target

On Windows (WDDM), “Shared GPU memory” is not the same as “pure VRAM use”. When the driver/runtime can’t satisfy allocations cleanly, it can spill/commit pageable system memory that shows up as “shared”.

So if the v5 loader is temporarily allocating full-precision tensors on GPU, Windows can make it look even worse than it is.

The actionable implication:

  • Focus on preventing the full-precision-on-GPU transient, via (1) offload/budgets and/or (2) accelerate big-model dispatch.
  • Don’t over-interpret the exact shared number as “the model is truly that big”; it can be a transient commitment artifact.

Is the “Materializing param=…” progress line intended?

The existence of “Materializing param=…” is expected in v5 because that’s how the new loader reports per-parameter work. (GitHub)
What is not desirable (and is the regression) is materializing quantize-bound parameters directly on GPU at full precision before the quantizer runs. That’s the bug described in #43032. (GitHub)

So:

  • “Materializing param=…”: normal v5 logging
  • “Materializing quantize-target tensors onto GPU first”: regression behavior (v5) (GitHub)

Practical decision guide (what to do today)

Given the current upstream state described in #43032 (GitHub), the most reliable non-prequant options tend to be:

  1. First choice (most v4-like): Accelerate big-model loading (init_empty_weights + load_checkpoint_and_dispatch) (Hugging Face)
  2. Second choice (simple drop-in): from_pretrained(..., device_map="auto", max_memory=..., offload_state_dict=True, offload_folder=..., low_cpu_mem_usage=True) (Hugging Face)
  3. If you still hit peaks: tighten max_memory, and ensure offload folder is on a fast disk.

Hi John,

Thanks for your replies. I’ve done tests on Transformers 5.2.0 and 5.3.0 dev, but the memory issues with Gemma 3 12B persist. Here’s where I’ve narrowed things down:

  1. max_memory is still not working
    Even in the latest versions, the max_memory dictionary is effectively being ignored during the load. I’ve set my limits (11GiB for GPU, 16GiB for CPU), but the loader doesn’t honor these boundaries. It seems the new materialization logic is still bypassing the standard Accelerate constraints.

  2. CPU Loading is not an option
    I tried forcing the model to load on the CPU first (device_map={"": "cpu"}) to avoid the VRAM spike, but this leads to a system crash. Because max_memory isn’t working, the loader tries to materialize the full-precision weights in my 32GB RAM all at once. It hits 100% usage immediately, and the process is killed by the OS.

  3. Current “Working” Setup (device_map="auto")
    The only way I can successfully get the model running is to use device_map="auto" and then immediately run a manual cleanup to clear the “Ghost” bloat left behind by the materialization phase.

Python

import gc
import torch
from transformers import BitsAndBytesConfig, Gemma3ForConditionalGeneration

MODEL_KWARGS = {
    "gemma3-12b": {
        "dtype": torch.bfloat16,
        "attn_implementation": "sdpa",
        "trust_remote_code": True,
        "low_cpu_mem_usage": True,
    }
}

QUANT_CONFIGS = {
    "gemma3-12b": BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
}

model_id = "google/gemma-3-12b-it"
quant_config = QUANT_CONFIGS["gemma3-12b"]
max_memory = {0: "11GiB", "cpu": "16GiB"}

base_model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quant_config,
    **MODEL_KWARGS["gemma3-12b"],
    device_map="auto",
    max_memory=max_memory,
    offload_folder=r"e:\AI\offload_temp",  # raw string so the backslashes are not treated as escapes
)

# Manual cleanup to release the loader's transient allocations.
gc.collect()
torch.cuda.empty_cache()

print(f"VRAM after manual cleanup: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

I’m attaching a screenshot of my Task Manager showing the final state. While it’s running now, it’s clear the loader is still over-allocating during the “Materializing” phase before shrinking down to the final size. Even after the cleanup, the VRAM usage is still higher than what I used to see back on v4.


Hmm… For plain Gemma 3 12B alone, using pre-quantized weights is the fastest and most reliable approach, but it can’t be applied to other cases…

It seems environment variables can sometimes reduce spikes:


Why this still happens in Transformers v5 (and why max_memory doesn’t save you)

1) v5 “dynamic weight loading” can legitimately spike peak memory

Transformers v5 loads weights via a conversion/materialization pipeline that can:

  • load parameters asynchronously (thread pool),
  • run conversion ops that require temporary copies (e.g., concat/merge-style ops can briefly need ~2× for the tensors involved),
  • and do extra copies if the requested dtype differs from the serialized dtype. (Hugging Face)

Even when the steady-state fits, these transient steps can push peak CPU RAM / VRAM over the edge.

2) BitsAndBytes 4-bit has a v5 regression path that materializes on GPU before quantization

There is a documented v5 regression where the loader identifies that a tensor should be quantized, but still materializes it directly to the target device (often GPU) at higher precision first, which can OOM before the quantization op runs. (GitHub)

That behavior also explains why max_memory appears “ignored”: it can’t prevent a transient allocation that happens before the quantizer reduces the tensor.

3) max_memory + CPU/disk offload is not a real escape hatch for BnB 4-bit

With BitsAndBytes 4-bit, dispatching some modules to CPU/disk is explicitly blocked unless you’re using the int8 “FP32 CPU offload” mode; otherwise you’ll hit the “Some modules are dispatched on CPU/disk…” constraint. (GitHub)
Also, the 4-bit quantizer reduces the provided max_memory (multiplies by 0.90) to leave headroom for quantization buffers. (GitHub)
So even in best cases, max_memory is more about device-map planning than enforcing a hard peak bound during materialization.
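
As a rough illustration of that headroom adjustment: a requested max_memory={0: "11GiB"} budget is planned as roughly 11 × 0.90 ≈ 9.9 GiB.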

4) Windows adds another layer of confusion around “VRAM usage”

On Windows (WDDM), GPU allocations are managed through GPU virtual addressing; page tables can map to local device memory or system memory, and residency/migration can make “usage” look worse (or different) depending on which tool you’re reading (Task Manager vs NVML vs PyTorch). (Microsoft Learn)


Practical workarounds (ranked)

A) Best “no patch” path in v5: switch from BnB 4-bit to TorchAO int4 (weight-only)

For Gemma 3, the Transformers docs explicitly show TorchAO int4 weight-only quantization. This avoids the BitsAndBytes-specific regression path while keeping you in Transformers v5. (Hugging Face)

# pip install -U torchao
import torch
from transformers import Gemma3ForConditionalGeneration, AutoProcessor, TorchAoConfig

model_id = "google/gemma-3-12b-it"

quantization_config = TorchAoConfig("int4_weight_only", group_size=128)

model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    dtype=torch.bfloat16,          # or dtype="auto"
    device_map="auto",
    quantization_config=quantization_config,
    low_cpu_mem_usage=True,
    attn_implementation="sdpa",
).eval()

processor = AutoProcessor.from_pretrained(model_id, padding_side="left")

Notes:

  • This is the closest “v5-native” option to get low steady-state memory without the BnB materialization spike (in practice, this is often the deciding factor on 12GB-class GPUs).
  • For generation memory, Gemma 3 docs also show cache_implementation="static" as an option. (Hugging Face)
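
A hedged usage sketch continuing from the model/processor above, with a text-only prompt and the static cache option from the note (the message structure follows the Gemma 3 chat template):

messages = [
    {"role": "user", "content": [{"type": "text", "text": "Explain KV caches in two sentences."}]},
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128, cache_implementation="static")

print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))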

B) If you must stay on BitsAndBytes 4-bit: reduce peak variance (but you may not eliminate spikes)

1) Force sequential loading and avoid parallel shard loading

  • Disable async parameter loading (reduces concurrency/variance): HF_DEACTIVATE_ASYNC_LOAD=1. (Hugging Face)
  • Keep parallel file loading off (HF_ENABLE_PARALLEL_LOADING=false), because parallelism can increase concurrent in-flight buffers. (Hugging Face)
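
A minimal sketch, assuming the variable names quoted above: set them before transformers is imported so the loader sees them at load time.

import os
os.environ["HF_DEACTIVATE_ASYNC_LOAD"] = "1"        # disable async per-parameter loading
os.environ["HF_ENABLE_PARALLEL_LOADING"] = "false"  # keep parallel shard/file loading off

import torch
from transformers import BitsAndBytesConfig, Gemma3ForConditionalGeneration
# ... then load exactly as before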

2) Ensure dtype matches the checkpoint to avoid extra copies

If you request a dtype that differs from the serialized dtype, v5 may perform an extra copy during loading. (Hugging Face)
For google/gemma-3-12b-it, the model card examples use torch.bfloat16. (Hugging Face)
So dtype=torch.bfloat16 or dtype="auto" is typically preferable to avoid conversion copies.

3) Treat “ghost VRAM” as allocator caching, not necessarily a leak

torch.cuda.empty_cache() releases unused cached blocks back to other apps, but does not free memory held by live tensors. (PyTorch Docs)
So your “manual cleanup” step is consistent with how PyTorch’s caching allocator behaves.

4) Last-resort allocator tuning (helps borderline OOM due to fragmentation, not root cause)

If you’re failing on relatively small allocations near the end of load, tuning PYTORCH_ALLOC_CONF / PYTORCH_CUDA_ALLOC_CONF can sometimes help. (PyTorch Docs)
Example:

set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
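
A cross-platform alternative sketch: set the same allocator option from Python before any CUDA allocation happens.

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # must be set before the first CUDA allocation

import torch  # import after setting the variable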

C) The only way to truly fix BnB 4-bit peak spikes today: apply the targeted loader change (patch)

There is a minimal fix described in the Transformers issue: if a parameter will be quantized, materialize it on CPU first, then quantize, then move to GPU. (GitHub)
This directly addresses the “materialize-to-GPU-before-quantize” failure mode.

If you previously tried broader edits and hit fragility, this specific change is much narrower (a small conditional around where param_device is chosen). It is still a patch, but it aligns with the root cause described upstream. (GitHub)


Small but important Gemma 3-specific considerations

  • Gemma 3 is multimodal and includes a SigLIP vision encoder; even if you only do text prompts, you’re loading a larger graph than a text-only LM. (Hugging Face)
  • The official model card shows standard usage without trust_remote_code=True; removing it avoids accidental code-path differences. (Hugging Face)