RuvLTRA Small


📱 Compact Model Optimized for Edge Devices

Quick Start • Use Cases • Integration


Overview

RuvLTRA Small is a compact 0.5B parameter model designed for edge deployment. Perfect for mobile apps, IoT devices, and resource-constrained environments.

Model Card

| Property | Value |
|----------|-------|
| Parameters | 0.5 Billion |
| Quantization | Q4_K_M |
| Context | 4,096 tokens |
| Size | ~398 MB |
| Min RAM | 1 GB |
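The 1 GB minimum can be sanity-checked with a back-of-envelope estimate: quantized weights (~398 MB) plus the fp16 KV cache at full context. The attention hyperparameters below are assumptions taken from the upstream Qwen2-0.5B config (24 layers, 2 KV heads under GQA, head_dim 64); verify them against the GGUF metadata.

```python
# Back-of-envelope RAM estimate for RuvLTRA Small at full 4,096-token context.
# Hyperparameters are ASSUMED from the Qwen2-0.5B config; check the GGUF
# metadata for the exact values.
LAYERS, KV_HEADS, HEAD_DIM = 24, 2, 64
BYTES_FP16 = 2

def kv_cache_bytes(context: int) -> int:
    # K and V tensors per layer: context x kv_heads x head_dim, fp16 each.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * context * BYTES_FP16

weights_mb = 398                        # quantized Q4_K_M file size from the table
kv_mb = kv_cache_bytes(4096) / 2**20
print(f"KV cache: {kv_mb:.0f} MiB, total ~ {weights_mb + kv_mb:.0f} MB")
```

Under these assumptions the KV cache is 48 MiB, so the working set is roughly 446 MB, leaving headroom within the 1 GB minimum for runtime buffers.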

🚀 Quick Start

# Download
wget https://huggingface.co/ruv/ruvltra-small/resolve/main/ruvltra-0.5b-q4_k_m.gguf

# Run with llama.cpp
./llama-cli -m ruvltra-0.5b-q4_k_m.gguf -p "Hello, I am" -n 64

💡 Use Cases

  • Mobile Apps: On-device AI assistant
  • IoT: Smart home device intelligence
  • Edge Computing: Local inference without cloud
  • Prototyping: Quick model experimentation

🔧 Integration

Rust (RuvLLM)

use ruvllm::hub::ModelDownloader;

let path = ModelDownloader::new()
    .download("ruv/ruvltra-small", None)
    .await?;

Python

from huggingface_hub import hf_hub_download

model = hf_hub_download("ruv/ruvltra-small", "ruvltra-0.5b-q4_k_m.gguf")
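An interrupted download can leave a truncated file, so a cheap integrity check before loading is to inspect the GGUF header: the format starts with the 4-byte magic `GGUF` followed by a little-endian uint32 version. A minimal helper (the function names are hypothetical) might look like:

```python
import struct

def looks_like_gguf(header: bytes) -> bool:
    """Check the 4-byte GGUF magic ('GGUF') at the start of a file header."""
    return header[:4] == b"GGUF"

def gguf_version(header: bytes) -> int:
    """GGUF stores a little-endian uint32 format version right after the magic."""
    return struct.unpack_from("<I", header, 4)[0]

# Usage against a downloaded file:
# with open(model, "rb") as f:
#     head = f.read(8)
# if not looks_like_gguf(head):
#     raise ValueError("not a GGUF file (truncated or wrong download?)")
```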

Hardware Support

  • ✅ Apple Silicon (M1/M2/M3)
  • ✅ NVIDIA CUDA
  • ✅ CPU (x86/ARM)
  • ✅ Raspberry Pi 4/5

License: Apache 2.0 | GitHub: ruvnet/ruvector


⚡ TurboQuant KV-Cache Compression

RuvLTRA models are fully compatible with TurboQuant, a 2-4 bit KV-cache quantization scheme that reduces inference memory by 8-32x with as little as <0.5% quality loss.

| Quantization | Compression | Quality Loss | Best For |
|--------------|-------------|--------------|----------|
| 3-bit | 10.7x | <1% | Recommended: best balance |
| 4-bit | 8x | <0.5% | High quality, long context |
| 2-bit | 32x | ~2% | Edge devices, max savings |
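As a cross-check, the 3-bit and 4-bit ratios follow directly from bit width against an assumed 32-bit float baseline (32/3 ≈ 10.7, 32/4 = 8). The 2-bit row's 32x exceeds the raw 16x bit-width ratio, presumably reflecting additional savings such as QJL dimension reduction (an assumption, not stated in the table):

```python
# Per-element compression from bit width alone, assuming an fp32 (32-bit)
# KV-cache baseline. This matches the 3-bit and 4-bit rows; the 2-bit row's
# 32x implies extra savings beyond bit width (assumed, not computed here).
def compression_ratio(bits: float, baseline_bits: int = 32) -> float:
    return baseline_bits / bits

for b in (4, 3, 2):
    print(f"{b}-bit: {compression_ratio(b):.1f}x from bit width alone")
```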

Usage with RuvLLM

# Add the library
cargo add ruvllm                 # Rust
npm install @ruvector/ruvllm     # Node.js

use ruvllm::quantize::turbo_quant::{TurboQuantCompressor, TurboQuantConfig, TurboQuantBits};

let config = TurboQuantConfig {
    bits: TurboQuantBits::Bit3_5, // 10.7x compression
    use_qjl: true,
    ..Default::default()
};
let compressor = TurboQuantCompressor::new(config)?;
let compressed = compressor.compress_batch(&kv_vectors)?;
let scores = compressor.inner_product_batch_optimized(&query, &compressed)?;
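For intuition about what the compressor above is doing, here is a minimal Python sketch of symmetric low-bit scalar quantization of a single KV vector. This is an illustration only, not TurboQuant's actual scheme (which adds QJL sketching and optimized inner-product kernels); all names here are hypothetical.

```python
import numpy as np

def quantize(v: np.ndarray, bits: int):
    """Symmetric per-vector scalar quantization to signed `bits`-bit codes."""
    qmax = 2 ** (bits - 1) - 1                    # e.g. 3 for 3-bit codes
    scale = float(np.abs(v).max()) / qmax or 1.0  # avoid divide-by-zero
    codes = np.round(v / scale).astype(np.int8)
    return codes, scale

def dequantize(codes: np.ndarray, scale: float) -> np.ndarray:
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal(64).astype(np.float32)   # stand-in for one KV vector
codes, scale = quantize(kv, bits=3)               # codes lie in [-3, 3]
err = float(np.abs(dequantize(codes, scale) - kv).mean())
```

A 3-bit code per element against an fp32 baseline is where the ~10.7x figure comes from; a real compressor also packs codes below byte granularity, which this sketch skips by storing int8.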

v2.1.0 Ecosystem

  • Hybrid Search – Sparse + dense vectors with RRF fusion (20-49% better retrieval)
  • Graph RAG – Knowledge graph + community detection for multi-hop queries
  • DiskANN – Billion-scale SSD-backed ANN with <10ms latency
  • FlashAttention-3 – IO-aware tiled attention, O(N) memory
  • MLA – Multi-Head Latent Attention (~93% KV-cache compression)
  • Mamba SSM – Linear-time selective state space models
  • Speculative Decoding – 2-3x generation speedup

RuVector GitHub | ruvllm crate | @ruvector/ruvllm npm


Benchmarks (L4 GPU, 24GB VRAM)

| Metric | Result |
|--------|--------|
| Inference Speed | 75.4 tok/s |
| Model Load Time | 1.44 s |
| Parameters | 0.5B |
| TurboQuant KV (3-bit) | 10.7x compression, <1% PPL loss |
| TurboQuant KV (4-bit) | 8x compression, <0.5% PPL loss |

Benchmarked on Google Cloud L4 GPU via ruvltra-calibration Cloud Run Job (2026-03-28)
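Reading the table as wall-clock time: at 75.4 tok/s with a 1.44 s load, a cold-start generation of n tokens takes roughly 1.44 + n / 75.4 seconds. A tiny helper using only the reported numbers:

```python
# Translating the benchmark table into wall-clock time, using the reported
# figures: 1.44 s model load + 75.4 tok/s steady-state generation (L4 GPU).
LOAD_S, TOKS_PER_S = 1.44, 75.4

def wall_clock(n_tokens: int, cold_start: bool = True) -> float:
    return (LOAD_S if cold_start else 0.0) + n_tokens / TOKS_PER_S

print(f"256 tokens, cold start: {wall_clock(256):.1f} s")
```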

Architecture: qwen2