majentik committed on
Commit e0a58a5 · verified · 1 Parent(s): 7149f51

chore(card): add hardware compatibility section

Files changed (1)
  1. README.md +11 -8
README.md CHANGED
@@ -2,16 +2,13 @@
  license: apache-2.0
  base_model: mistralai/Mistral-Small-4-119B-2603
  tags:
- - rotorquant
- - kv-cache-quantization
- - mistral
- - moe
- - quantized
+ - rotorquant
+ - kv-cache-quantization
+ - mistral
+ - moe
+ - quantized
  library_name: transformers
  pipeline_tag: image-text-to-text
- language:
- - en
- inference: false
  ---
 
  # Mistral-Small-4-119B-RotorQuant
@@ -20,6 +17,12 @@ inference: false
 
  This is a **documentation repository** that explains how to combine Mistral-Small-4-119B's weights with RotorQuant inference-time KV cache compression. No weights are stored here — use the base model directly and apply RotorQuant via the Python package or llama.cpp fork.
 
+ ## Hardware compatibility
+
+ | Device | VRAM / RAM | Recommendation |
+ | --- | --- | --- |
+ | Any host that runs the base model | baseline + runtime savings | RotorQuant/TurboQuant is a KV-cache runtime modifier; pair with any weight variant |
+
  ## What is this?
 
  KV cache compression reduces the memory used by the attention cache during inference. Unlike weight quantization (which is baked into the GGUF/MLX file), KV cache compression is applied at runtime — so the same base weights can be used with or without compression.
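
To make the "baseline + runtime savings" idea concrete, here is a rough back-of-the-envelope sketch of how KV cache size scales with the number of bits stored per cached value. It assumes a standard attention layout with separate key and value tensors per layer. The layer count, head count, head dimension, context length, and the 4-bit setting are all hypothetical illustration values, not the actual Mistral-Small-4-119B configuration or a documented RotorQuant setting, and the estimate ignores any per-block metadata (scales, rotations) a real quantizer would store.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value, batch_size=1):
    """Approximate KV cache size in bytes for a standard attention layout.

    Counts one key tensor and one value tensor per layer and ignores
    quantizer metadata such as scales or zero points.
    """
    # Factor of 2 covers keys and values.
    num_values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return num_values * bits_per_value / 8


# Hypothetical dimensions chosen only for illustration.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768)

fp16_bytes = kv_cache_bytes(bits_per_value=16, **cfg)
q4_bytes = kv_cache_bytes(bits_per_value=4, **cfg)  # assumed 4-bit cache

gib = 1024 ** 3
print(f"16-bit cache: {fp16_bytes / gib:.2f} GiB")
print(f" 4-bit cache: {q4_bytes / gib:.2f} GiB")
print(f"reduction:    {fp16_bytes / q4_bytes:.1f}x")
```

Because the weights are untouched, the saving applies only to the cache term, so it matters most at long context lengths or large batch sizes.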