Yeah. I also ran a few experiments on Colab.
The current GitHub main branch now works significantly better than 5.1.0 (at least RAM/VRAM spikes can be avoided). Achieving fully v4-like loading still seems difficult.
However, at this rate, quantization compatibility should gradually improve over time…
What you're seeing is very consistent with a real behavioral change in the Transformers v5 loading path (especially when 4-bit bitsandbytes quantization is involved), and it explains both the "RAM/shared GPU memory spike" and the feeling that quantization is happening "too late".
What changed from v4 to v5 (why it looks like "quantize too late")
v4 mental model (why it felt "streamed" and stable)
In v4, the common fast path for big models + device_map="auto" + quantization was effectively:
- Accelerate computes a dispatch plan (device map / offload plan).
- Weights are read in chunks (shards) and placed according to that plan.
- If quantization is requested, the loader cooperates with the quantizer so tensors don't sit in full precision on the "wrong" device for long.
That's why you typically saw "Loading checkpoint shards …" and relatively flat system/shared memory.
v5 mental model (why you see "Materializing param=…" and spikes)
Transformers v5 introduced a new loader (core_model_loading.py) that often loads parameter-by-parameter and explicitly "materializes" tensors as it goes.
The key regression (and it matches your description exactly) is:
- The loader correctly detects that a parameter will be quantized, but still materializes it directly onto the target device first (often the GPU), in full precision, and only afterwards runs the conversion/quantization step.
That means:
- you can transiently allocate full-precision weights even though the final model is 4-bit,
- and on Windows you may also see "Shared GPU memory" balloon because of WDDM + unified/managed allocation behavior when memory pressure rises.
This exact bug is described in the upstream Transformers issue #43032: the v5 loader sets up quantization ops, but then "ignores this and materializes tensors directly to GPU at full precision" before quantization runs. (GitHub)
(Your observations align with that sequence.)
Why `hf_device_map` can be missing (and why that doesn't always mean "no device_map")
hf_device_map is a dispatch artifact. If the model ends up entirely on one device (or the loader doesn't go through the Accelerate dispatch path you expect), hf_device_map may not be attached even though you passed device_map="auto".
So: hasattr(model, "hf_device_map") == False is a useful symptom, but not a perfect proof of "no mapping happened". It can also mean "dispatch wasn't used / wasn't needed / didn't persist".
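Because of that, it's safer to inspect the attribute defensively than to assert on it. A minimal sketch (the helper name describe_device_map is just illustrative):

```python
def describe_device_map(model) -> None:
    """Print where each module landed, if a dispatch plan was attached at all."""
    device_map = getattr(model, "hf_device_map", None)
    if device_map is None:
        # Absence is a symptom, not proof: dispatch may not have been
        # used, needed, or persisted on this load path.
        print("no hf_device_map attached")
        return
    for module_name, device in device_map.items():
        print(f"{module_name} -> {device}")
```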
Best practices to get v4-like memory efficiency on v5 + bitsandbytes (without pre-quantizing and without editing library files)
1) Prefer an "offload-first" loading posture so quantization can happen before GPU placement
The core problem is "materialize on GPU in fp16/bf16 → OOM/spike → quantize".
So you want the opposite: materialize on CPU (or disk offload), quantize, then place.
In v5, the knobs that matter most are the standard from_pretrained loading controls:
- device_map and max_memory to constrain placement decisions (Hugging Face)
- offload_state_dict=True and offload_folder=... to keep the CPU peak down when weights must temporarily live outside VRAM (Hugging Face)
- low_cpu_mem_usage=True to reduce CPU duplication during load (Hugging Face)
Even if device_map="auto" is imperfect in this regression window, pairing it with explicit budgets + offload makes the loader far less likely to "go wild" during the materialization phase.
(This does not "fix" the regression, but it often keeps the transient peak below the cliff.)
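A minimal sketch of that "budgets + offload" posture for a 4-bit bitsandbytes load; the model id, memory limits, and offload folder are placeholders you'd adapt to your hardware:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder; any large causal LM

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    # Explicit budgets: leave VRAM headroom so the per-parameter materializer
    # has a hard ceiling during the transient full-precision phase.
    max_memory={0: "7GiB", "cpu": "24GiB"},  # placeholder budgets
    # Keep the CPU-side peak down and spill temporarily to disk if needed.
    offload_state_dict=True,
    offload_folder="offload",
    low_cpu_mem_usage=True,
)
```

The important part is that the budgets are explicit, so the loader discovers its limits from max_memory rather than by hitting the memory cliff.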
2) Use Accelerate's "big model" loading utilities when you need determinism
If you want something closest to the old "plan first, then stream according to plan" model, the most robust approach is to lean more on the Accelerate big-model APIs:
- init_empty_weights() to instantiate the module structure without allocating real parameter storage
- load_checkpoint_and_dispatch() to load weights according to a device/offload plan (with disk offload options)
These are documented in Accelerate's big-modeling reference. (Hugging Face)
Why this helps:
- It makes the dispatch step explicit and earlier.
- It reduces the chances that the Transformers v5 per-parameter materializer temporarily places full-precision tensors onto the GPU before quantization.
Caveat:
- For some architectures + quantizers, you may need to ensure the quantized modules are created in a way that's compatible with dispatch (e.g., avoid splitting specific blocks; set no_split_module_classes appropriately). The Accelerate docs describe the intended workflow. (Hugging Face)
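For reference, a sketch of the two-step flow (the model id, local checkpoint path, and no_split class name are placeholders; this shows only the dispatch mechanics, and wiring a quantizer into it is exactly what the caveat above is about):

```python
import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder
checkpoint_path = "/path/to/local/checkpoint"  # local dir containing the shards

# 1) Build the module structure only, with no real parameter storage.
config = AutoConfig.from_pretrained(model_id)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# 2) Load weights according to an explicit device/offload plan.
model = load_checkpoint_and_dispatch(
    model,
    checkpoint=checkpoint_path,
    device_map="auto",
    # Keep whole decoder blocks together so the plan never splits them.
    no_split_module_classes=["LlamaDecoderLayer"],  # placeholder class name
    offload_folder="offload",
    dtype=torch.float16,
)
```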
3) Treat Windows "Shared GPU memory" spikes as a symptom, not the real target
On Windows (WDDM), "Shared GPU memory" is not the same as "pure VRAM use". When the driver/runtime can't satisfy allocations cleanly, it can spill/commit pageable system memory that shows up as "shared".
So if the v5 loader is temporarily allocating full-precision tensors on GPU, Windows can make it look even worse than it is.
The actionable implication:
- Focus on preventing the full-precision-on-GPU transient, via (1) offload/budgets and/or (2) accelerate big-model dispatch.
- Don't over-interpret the exact shared number as "the model is truly that big"; it can be a transient commitment artifact.
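If you want numbers you can trust more than Task Manager's "Shared GPU memory" column, PyTorch's own allocator counters report what was actually allocated on the device. A small sketch:

```python
import torch

def report_cuda_memory(tag: str, device: int = 0) -> None:
    """Print current vs. peak allocator usage so transient spikes are visible."""
    allocated = torch.cuda.memory_allocated(device) / 2**30
    peak = torch.cuda.max_memory_allocated(device) / 2**30
    reserved = torch.cuda.memory_reserved(device) / 2**30
    print(f"[{tag}] allocated={allocated:.2f} GiB  peak={peak:.2f} GiB  reserved={reserved:.2f} GiB")

# Usage sketch: reset the peak counter, run your loading call, then report.
torch.cuda.reset_peak_memory_stats(0)
# model = AutoModelForCausalLM.from_pretrained(...)  # your loading call here
report_cuda_memory("after load")
```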
Is the "Materializing param=…" progress line intended?
The existence of "Materializing param=…" is expected in v5 because that's how the new loader reports per-parameter work. (GitHub)
What is not desirable (and is the regression) is materializing quantize-bound parameters directly on GPU at full precision before the quantizer runs. That's the bug described in #43032. (GitHub)
So:
- "Materializing param=…": normal v5 logging
- "Materializing quantize-target tensors onto GPU first": regression behavior (v5) (GitHub)
Practical decision guide (what to do today)
Given the current upstream state described in #43032 (GitHub), the most reliable non-prequant options tend to be:
- First choice (most v4-like): Accelerate big-model loading (init_empty_weights + load_checkpoint_and_dispatch) (Hugging Face)
- Second choice (simple drop-in): from_pretrained(..., device_map="auto", max_memory=..., offload_state_dict=True, offload_folder=..., low_cpu_mem_usage=True) (Hugging Face)
- If you still hit peaks: tighten max_memory, and ensure the offload folder is on a fast disk.