---
tags:
- 35b
- android
- apple-silicon
- attested
- chain-of-custody
- consumer-gpu
- cryptographically-verified
- edge-inference
- efficient
- english
- expert-pruning
- forge-alloy
- general
- gguf
- instruct
- llama-cpp
- lm-studio
- local-inference
- macbook
- mixture-of-experts
- mlx
- mobile
- moe
- moe-compaction
- multilingual
- ollama
- on-device
- optimized
- pruned
- q4_k_m
- quantized
- raspberry-pi
- reproducible
- sparse-moe
- text-generation
- versatile
- calibration-aware-pruning
- mixtral
base_model: mistralai/Mixtral-8x7B-Instruct-v0.1
pipeline_tag: text-generation
license: apache-2.0
---
# 25% Experts Pruned, PPL 8.97 (base 8.14)
**Mixtral-8x7B-Instruct-v0.1** compacted via calibration-aware MoE expert pruning (§4.1.3.4) against the unmodified source.
- **Perplexity**: 8.97 (base 8.14, Δ +10.2%)
- **Compression**: 93.4 GB → 20.4 GB Q4_K_M (**4.6×**)
- **Throughput**: 142 tok/s generation, 437 tok/s prompt on RTX 5090
<p align="center">
<a href="https://cambriantech.github.io/forge-alloy/verify/#hf.co/continuum-ai/mixtral-8x7b-instruct-compacted-conservative/resolve/main/mixtral-8x7b-instruct-compacted-conservative.alloy.json@b26fd7adf36b7c8c">
<img src="alloy-qr.png" alt="Verify Chain of Custody" width="160"/>
</a>
</p>
<p align="center">
<a href="https://cambriantech.github.io/forge-alloy/verify/#hf.co/continuum-ai/mixtral-8x7b-instruct-compacted-conservative/resolve/main/mixtral-8x7b-instruct-compacted-conservative.alloy.json@b26fd7adf36b7c8c"><b>Every claim on this card is verified</b></a><br>
<b>Trust: self-attested</b> · 1 benchmark · 1 device tested<br>
<a href="https://github.com/CambrianTech/forge-alloy">ForgeAlloy</a> chain of custody · <a href="mixtral-8x7b-instruct-compacted-conservative.alloy.json">Download alloy</a> · Merkle-chained
</p>
---
**A 93 GB datacenter MoE compressed to run on a MacBook Air.** Forged from `mistralai/Mixtral-8x7B-Instruct-v0.1` by removing the 2 least-activated experts per layer (8→6) via **calibration-aware activation-frequency ranking** on a held-out code corpus (300 examples, 148,945 tokens). Quantized to GGUF Q4_K_M for llama.cpp / Ollama / LM Studio. Apache-2.0. **PPL 8.97** against the source's **8.14** (Δ +10.2%), evaluated via llama.cpp on wikitext-2-raw. This model is the second row of the [cross-family anchor table](#cross-family-anchor-table). Cryptographic provenance is recorded via [ForgeAlloy](https://github.com/CambrianTech/forge-alloy).
## Benchmarks
| Benchmark | Score | Base | Δ | Verified |
|---|---|---|---|---|
| **wikitext-2-raw PPL** | **8.97** | 8.14 | +10.2% | ✅ Result hash |
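
For readers unfamiliar with the metric: perplexity is the exponential of the mean negative log-likelihood per token, so lower is better. The sketch below shows only that definition. The `perplexity` helper and the sample log-probabilities are illustrative assumptions, not the llama.cpp evaluation code or data behind the scores above.

```python
import math


def perplexity(token_logprobs):
    """Return exp(mean negative log-likelihood) over natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical per-token probabilities (NOT from the actual evaluation run).
sample = [math.log(0.25), math.log(0.5), math.log(0.125)]
print(round(perplexity(sample), 3))
```

A perplexity of 8.97 versus 8.14 therefore means the forged model is, on average, slightly less certain about each next token of the evaluation text than the base model.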
## What Changed (Base → Forged)
| | Base | Forged | Delta |
|---|---|---|---|
| **Perplexity** | 8.14 | 8.97 | +10.2% |
| **Experts / layer** | 8 | 6 | −25% (2 removed per layer) |
| **Total params** | 46.7B | ~35B | −25% |
| **Active params** | 12.9B | 12.9B | Unchanged |
| **Size (fp16)** | 93.4 GB | 70.9 GB | −24% |
| **Size (Q4_K_M)** | — | 20.4 GB | **4.6× compression** |
| **Pipeline** | | expert-activation-profile → expert-prune → quant → eval | 1 cycle |
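
The deltas in the table can be re-derived from the reported figures. The following is a small arithmetic check using only numbers stated on this card; the helper names `percent_change` and `compression_ratio` are illustrative.

```python
def percent_change(base, new):
    """Relative change from base to new, in percent."""
    return (new - base) / base * 100.0


def compression_ratio(original_gb, compressed_gb):
    """How many times smaller the compressed artifact is."""
    return original_gb / compressed_gb


ppl_delta = percent_change(8.14, 8.97)     # perplexity, base vs forged
fp16_delta = percent_change(93.4, 70.9)    # fp16 size in GB, base vs forged
experts_delta = percent_change(8, 6)       # experts per layer
ratio = compression_ratio(93.4, 20.4)      # base fp16 vs forged Q4_K_M

print(f"PPL delta:      {ppl_delta:+.1f}%")
print(f"fp16 size:      {fp16_delta:+.1f}%")
print(f"Experts/layer:  {experts_delta:+.1f}%")
print(f"Compression:    {ratio:.1f}x")
```

Note that the 4.6× figure compares the base model at fp16 with the forged model at Q4_K_M, so it combines the effect of pruning with the effect of quantization.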
## Runs On
| Device | Format | Size | Speed / status |
|--------|--------|------|-------|
| **NVIDIA GeForce RTX 5090** | Q4_K_M | 20.4 GB | **142 tok/s generation** ✅ Verified |
| MacBook Pro 32GB | Q4_K_M | 20.4 GB | Expected |
| MacBook Air 24GB | Q4_K_M | 20.4 GB | Expected |
| RTX 3060 12GB+ | Q4_K_M | 20.4 GB | Expected (partial offload) |
| RTX 4090 24GB | Q4_K_M | 20.4 GB | Expected |
| RTX 4090 24GB | fp16 | 70.9 GB | Expected (with offload) |
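
For the "partial offload" rows, a rough way to pick a starting value for llama.cpp's `-ngl` flag is to scale the layer count by the share of the model that fits in VRAM. The sketch below is a back-of-the-envelope heuristic, not a measured result: `estimate_gpu_layers` is a hypothetical helper, the 1.5 GB reserve for KV cache and buffers is an assumption, and real layers are not perfectly uniform in size.

```python
import math


def estimate_gpu_layers(model_gb, vram_gb, n_layers=32, reserve_gb=1.5):
    """Estimate how many transformer layers fit on the GPU.

    Assumes equally sized layers and a fixed reserve (hypothetical value)
    for KV cache and scratch buffers. Treat the result as a starting point.
    """
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    fraction = min(usable_gb / model_gb, 1.0)
    return math.floor(n_layers * fraction)


# 20.4 GB Q4_K_M model, 32 layers (as listed on this card).
for name, vram in [("RTX 3060 12GB", 12), ("RTX 4090 24GB", 24)]:
    print(name, "->", estimate_gpu_layers(20.4, vram), "layers on GPU")
```

If generation fails with an out-of-memory error, lowering `-ngl` from the estimate is the usual remedy; context length also affects how much headroom is needed.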
## Quick Start
```bash
# llama.cpp (any platform)
./llama-cli -m mixtral-8x7b-compacted-Q4_K_M.gguf \
-p "Write a Python function that finds the longest palindromic substring." \
-n 512 -ngl 99
# Ollama
ollama run continuum-ai/mixtral-8x7b-instruct-compacted-conservative
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "continuum-ai/mixtral-8x7b-instruct-compacted-conservative"

# Load the weights, letting transformers pick dtype and device placement.
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Tokenize a prompt, generate a completion, and decode it back to text.
inputs = tokenizer("def merge_sort(arr):", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Methodology
Produced via §4.1.3.4 calibration-aware MoE expert pruning, ranked by activation count. Routing was profiled on 300 held-out code examples (148,945 tokens) across all 32 layers × 8 experts, and the 2 least-activated experts in each layer were removed. The surviving 6 experts per layer are the ones the router selects most often on the calibration domain.
**Activation profile (sample layers):**
| Layer | Top 5 kept experts (activations) | Removed experts (activations) |
|---|---|---|
| Layer 0 | 5, 2, 3, 4, 0 (35K-49K) | 1, 6 (~20K) |
| Layer 16 | 6, 2, 1, 5, 4 (37K-46K) | 0, 3 (~20K) |
| Layer 31 | 3, 6, 5, 7, 0 (35K-54K) | 1, 2 (~20K) |
Full methodology in [the sentinel-ai repository](https://github.com/CambrianTech/sentinel-ai). The pipeline ran as `expert-activation-profile → expert-prune → quant → eval` on an NVIDIA GeForce RTX 5090.
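
The profiling and selection steps described above can be sketched in a few lines. This is a minimal illustration of activation-count ranking for a single layer, not the sentinel-ai implementation: `profile_activations`, `select_experts`, and the toy routing trace are all hypothetical, and the sketch omits the weight surgery (dropping expert weights and the matching router outputs) that an actual prune step would need.

```python
from collections import Counter


def profile_activations(routing_trace, n_experts):
    """Count how often each expert is selected in one layer.

    routing_trace: iterable of tuples, each holding the expert indices
    the router picked for one token (two per token for top-2 routing).
    """
    counts = Counter({expert: 0 for expert in range(n_experts)})
    for chosen in routing_trace:
        counts.update(chosen)
    return counts


def select_experts(counts, n_remove):
    """Split experts into (kept, removed), removing the least-activated ones.

    Ties are broken by expert index so the result is deterministic.
    """
    ranked = sorted(counts, key=lambda expert: (counts[expert], expert))
    removed = sorted(ranked[:n_remove])
    kept = sorted(ranked[n_remove:])
    return kept, removed


# Toy trace: 6 tokens, top-2 routing over 4 experts (made-up data).
trace = [(0, 1), (0, 2), (3, 0), (2, 3), (1, 0), (2, 0)]
counts = profile_activations(trace, n_experts=4)
kept, removed = select_experts(counts, n_remove=2)
print("kept:", kept, "removed:", removed)
```

In the configuration reported on this card, the same ranking would run once per layer with 8 experts and `n_remove=2`, which is why different layers drop different expert indices in the sample table.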
<a id="cross-family-anchor-table"></a>
## Cross-Family Anchor Table
The same §4.1.3.4 methodology, applied across independently trained model families.
| Row | Model | Family | Experts | Kept | PPL | Status |
|---|---|---|---|---|---|---|
| 1 | [qwen3-coder-30b-a3b](https://huggingface.co/continuum-ai/qwen3-coder-30b-a3b-compacted-19b-256k) | Qwen3 MoE | 128 | 80 | — | ✅ Published |
| **2** | **Mixtral 8x7B** | **Mixtral** | **8** | **6** | **8.97** | **✅ This model** |
| 3 | Mixtral 8x22B | Mixtral | 8 | 4 | — | 🔄 Forging now |
| 4 | Qwen3.5-35B-A3B | Qwen3.5 | TBD | TBD | — | ⬜ Planned |
| 5 | DeepSeek-V2-Lite | DeepSeek | 64 | 32 | — | ⬜ Planned |
## Chain of Custody
Scan the QR or [verify online](https://cambriantech.github.io/forge-alloy/verify/#hf.co/continuum-ai/mixtral-8x7b-instruct-compacted-conservative/resolve/main/mixtral-8x7b-instruct-compacted-conservative.alloy.json@b26fd7adf36b7c8c). Download the [alloy file](mixtral-8x7b-instruct-compacted-conservative.alloy.json) to verify independently.
| What | Proof |
|------|-------|
| Model weights | `sha256:d7f65e31667d9b9bcfd8ca05e796df87bf8b6e59336a34f4703c9d3904e54bd8` |
| Alloy hash | `sha256:b26fd7adf36b7c8c` |
| Forged on | NVIDIA GeForce RTX 5090, 2026-04-10 |
| Trust level | [`self-attested`](https://github.com/CambrianTech/forge-alloy/blob/main/docs/ATTESTATION.md) |
| Spec | [ForgeAlloy](https://github.com/CambrianTech/forge-alloy) — Rust/Python/TypeScript |
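
To check a digest from the table without the online verifier, a downloaded file can be hashed locally and compared. The sketch below uses only Python's standard `hashlib`. It assumes the listed SHA-256 is the plain digest of a single file, which this card does not state explicitly, so consult the ForgeAlloy spec for exactly which bytes each hash covers. `sha256_file` and `digest_matches` are illustrative helpers, not part of ForgeAlloy.

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large model files never load fully into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_matches(path, expected):
    """Compare a file's digest with an expected 'sha256:<hex>' string.

    A shortened expected value (such as a 16-character prefix) is treated
    as a prefix match. This prefix handling is an assumption of the sketch.
    """
    expected_hex = expected.removeprefix("sha256:").strip().lower()
    if not expected_hex:
        raise ValueError("expected digest is empty")
    return sha256_file(path).startswith(expected_hex)
```

Usage would look like `digest_matches("model.gguf", "sha256:d7f65e31...")`, where the file name is a placeholder and the expected value is copied in full from the table. A prefix match is weaker evidence than a full-digest match.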
## Make Your Own
Forged with [Continuum](https://github.com/CambrianTech/continuum) — a distributed AI world that runs on your hardware.
<p align="center">
<a href="https://github.com/CambrianTech/continuum"><img src="https://raw.githubusercontent.com/CambrianTech/continuum/main/docs/images/factory.png" alt="Continuum Factory" width="600"/></a>
</p>
<p align="center">
<a href="https://github.com/CambrianTech/continuum"><b>Continuum</b></a> · <a href="https://github.com/CambrianTech/forge-alloy"><b>Forge-Alloy</b></a> · <a href="https://github.com/CambrianTech/sentinel-ai"><b>Sentinel-AI</b></a> · <a href="https://github.com/CambrianTech/open-eyes"><b>Open-Eyes</b></a> · <a href="https://discord.gg/arfbCV2H"><b>Discord</b></a> · <a href="https://www.moltbook.com/u/continuum"><b>Moltbook</b></a>
</p>
---
<div align="center">
*Intelligence for everyone. Exploitation for no one.*
</div>