
Jeeves-Small-75M

A compact 75M-parameter language model built on Looped Transformer and Value Residual Learning architectures, with native support for tool calling (function calling).

Jeeves is designed to punch above its weight class by reusing a small set of transformer layers iteratively (looping), giving it an effective depth far beyond what its parameter count suggests.


Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Anurich/Jeeves-Small-75M", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Anurich/Jeeves-Small-75M", trust_remote_code=True)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Note: trust_remote_code=True is required because the model ships custom architecture code.


Tool Calling (Function Calling)

Jeeves supports structured tool/function calling out of the box. Below is an example:

tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    }
]

messages = [
    {"role": "user", "content": "What's the weather like in London?"}
]

# Format prompt with tools using the chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
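The exact tool-call output format depends on the chat template. Assuming the model emits the call as a JSON object such as {"name": ..., "arguments": {...}}, it can be parsed with a small helper (the `extract_tool_call` function below is a hypothetical illustration, not part of the model):

```python
import json
import re

def extract_tool_call(text: str):
    """Find the first JSON object in the model output and parse it.

    Assumes the model emits a tool call as a JSON object, e.g.
    {"name": "get_weather", "arguments": {"location": "London"}}.
    Returns None for plain-text replies or malformed JSON.
    """
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None  # no JSON object found: plain-text reply
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

call = extract_tool_call(
    '{"name": "get_weather", "arguments": {"location": "London"}}'
)
print(call["name"])       # get_weather
print(call["arguments"])  # {'location': 'London'}
```

The parsed arguments can then be passed to your actual tool implementation, and the tool's result appended to `messages` for a follow-up generation.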

Architecture

| Component       | Value                |
|-----------------|----------------------|
| Parameters      | 74.9M                |
| Unique layers   | 8                    |
| Effective depth | 15                   |
| Loop block      | [4] × 8              |
| Value residual  | ✅                   |
| Hidden dim      | 768                  |
| FFN dim         | 2,048                |
| Attention heads | 12 (Q) / 4 (KV), GQA |
| Vocab size      | 32,000               |
| Max seq length  | 512                  |
| Training steps  | 1,100                |
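The depth arithmetic above can be illustrated with a toy sketch (illustrative only; the real forward pass lives in the model's remote code): with 8 unique layers and the layer at index 4 applied 8 times, the effective depth is 7 + 8 = 15.

```python
def looped_forward(x, layers, loop_index=4, n_loops=8):
    """Apply `layers` in order, repeating the layer at `loop_index`.

    Toy illustration of the looped-transformer depth arithmetic:
    8 unique layers with layer 4 applied 8 times gives 7 + 8 = 15
    effective layers. Not the model's actual implementation.
    """
    depth = 0
    for i, layer in enumerate(layers):
        reps = n_loops if i == loop_index else 1
        for _ in range(reps):
            x = layer(x)
            depth += 1
    return x, depth

# Identity functions stand in for transformer blocks.
layers = [lambda v: v for _ in range(8)]
_, depth = looped_forward(0, layers)
print(depth)  # 15
```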

Key Innovations

  • Looped Transformer (arXiv:2311.12424): A single transformer block is applied repeatedly in a loop, dramatically increasing effective depth while keeping the parameter count small. This allows Jeeves to reason iteratively rather than in a single pass.
  • Value Residual Learning (arXiv:2410.17897): Residual connections applied at the value projection level alleviate attention concentration in deep/looped networks, improving gradient flow and stability.
  • Input Injection: The original input is re-injected at each loop iteration to prevent representational drift across loops, a critical stabilization technique for looped architectures.
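Input injection can be sketched in a few lines of toy code (illustrative only; `looped_block`, `alpha`, and the stand-in `block` are assumptions, not the model's actual code). The same mixing idea, applied to attention value projections instead of hidden states, is the essence of value residual learning:

```python
def looped_block(x0, block, n_loops, alpha=0.5):
    """Loop a block while re-injecting the original input x0.

    Adding x0 back at every iteration anchors the representation and
    prevents it from drifting across loops. Toy sketch: `block` here
    is a scalar function standing in for a transformer block.
    """
    h = x0
    for _ in range(n_loops):
        h = block(h) + alpha * x0  # input injection
    return h

# A contracting toy block: the injected input keeps h at a fixed point.
out = looped_block(1.0, lambda v: 0.5 * v, n_loops=8)
print(out)  # 1.0
```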

Benchmark Results

Evaluated with the EleutherAI lm-evaluation-harness.

| Benchmark     | Accuracy | Correct | Total  |
|---------------|----------|---------|--------|
| HellaSwag     | 30.9%    | 3,100   | 10,042 |
| ARC-Easy      | 47.1%    | 1,118   | 2,376  |
| ARC-Challenge | 24.9%    | 292     | 1,172  |
| ARC (Average) | 36.0%    | –       | –      |
| PIQA          | 63.9%    | 1,174   | 1,838  |
| WinoGrande    | 52.4%    | 664     | 1,267  |
| MMLU          | 25.2%    | 3,536   | 14,042 |
| TruthfulQA    | 24.8%    | 203     | 817    |
| GSM8K         | 1.4%     | 18      | 1,319  |
| IFEval        | 40.0%    | 4       | 10     |
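The ARC (Average) row appears to be the unweighted mean of the two ARC splits, which a quick sanity check confirms:

```python
# Unweighted mean of the two ARC splits reported above.
arc_easy = 47.1
arc_challenge = 24.9
arc_average = (arc_easy + arc_challenge) / 2
print(arc_average)  # 36.0
```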

Notes on Results

  • PIQA (63.9%) and WinoGrande (52.4%) are the strongest results, indicating reasonable physical commonsense and pronoun-resolution reasoning for the model's size.
  • MMLU (25.2%) is close to random (25% for 4-way MCQ), which is expected given the model's size and early training stage (1,100 steps). More training is needed for knowledge-heavy tasks.
  • GSM8K (1.4%) reflects a known limitation: multi-step mathematical reasoning is very demanding and typically requires much larger models or specialized fine-tuning.
  • IFEval (40.0%) is promising for a 75M model and reflects the tool-calling and instruction-following training signal, though with only 10 examples evaluated the result should be read with caution.

Limitations

  • Short context (512 tokens): Jeeves currently supports a maximum of 512 tokens. Long documents, multi-turn conversations, and complex tool chains may be truncated.
  • Early training stage: At 1,100 training steps, this is an early checkpoint. Knowledge-heavy and math benchmarks (MMLU, GSM8K) will improve significantly with more training.
  • Not suitable for factual retrieval: Like all small language models, Jeeves may hallucinate facts. It is best used with grounding via tool calls or RAG pipelines.
  • English-centric: Trained primarily on English data. Performance on other languages is not guaranteed.

Intended Use

Jeeves is designed for:

  • On-device / edge inference where a small footprint is critical
  • Tool-augmented agents that rely on function calling rather than parametric knowledge
  • Research into efficient architectures (looped transformers, value residual)
  • Fine-tuning on domain-specific tasks where a small, fast base model is preferred

Citation

If you use Jeeves in your work, please also cite the papers that inspired its architecture:

@article{looped_transformer_2023,
  title={Looped Transformers are Better at Learning Learning Algorithms},
  author={...},
  journal={arXiv:2311.12424},
  year={2023}
}

@article{value_residual_2024,
  title={Value Residual Learning For Alleviating Attention Concentration In Transformers},
  author={...},
  journal={arXiv:2410.17897},
  year={2024}
}

License

Apache 2.0 β€” see LICENSE for details.
