Based on the paper: *Value Residual Learning For Alleviating Attention Concentration In Transformers* (arXiv:2410.17897).
A compact 75M-parameter language model built on Looped Transformer and Value Residual Learning architectures, with native support for tool calling / function calling.
Jeeves is designed to punch above its weight class by reusing a small set of transformer layers iteratively (looping), giving it an effective depth well beyond what its parameter count suggests.
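The looping idea can be sketched in a few lines of NumPy. This is an illustrative toy, not Jeeves's actual block (the real model uses attention, GQA, and value residuals); it only shows how applying the *same* weights repeatedly yields depth without adding parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_block(dim, ffn_dim):
    # One shared set of weights: a tiny MLP block with a residual connection.
    w1 = rng.standard_normal((dim, ffn_dim)) * 0.02
    w2 = rng.standard_normal((ffn_dim, dim)) * 0.02
    def block(x):
        h = np.maximum(x @ w1, 0.0)  # ReLU feed-forward
        return x + h @ w2            # residual update
    return block

def looped_forward(block, x, n_loops):
    # Weight tying across iterations: the same block (same parameters)
    # is applied n_loops times, so effective depth grows for free.
    for _ in range(n_loops):
        x = block(x)
    return x

block = make_block(dim=768, ffn_dim=2048)
x = rng.standard_normal((4, 768))
y = looped_forward(block, x, n_loops=8)
print(y.shape)  # (4, 768)
```

The dimensions here (768 hidden, 2048 FFN, 8 loops) match the configuration table below, but the block itself is a stand-in.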
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Anurich/Jeeves-Small-75M", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Anurich/Jeeves-Small-75M", trust_remote_code=True)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Note: `trust_remote_code=True` is required because the model uses custom architecture code.
Jeeves supports structured tool/function calling out of the box. Below is an example:
```python
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    }
]

messages = [
    {"role": "user", "content": "What's the weather like in London?"}
]

# Format the prompt with tool definitions using the chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
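After generation, the tool call has to be extracted from the decoded text. The exact output format depends on Jeeves's chat template and is not documented here, so the parser below assumes a common convention: the model emits a JSON object with `name` and `arguments` fields somewhere in its response.

```python
import json
import re

def extract_tool_call(text: str):
    """Pull the first JSON object out of generated text.

    Assumes the model emits a tool call as an inline JSON object;
    returns None if no parseable object is found.
    """
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

# Hypothetical model output, for illustration only
sample = 'Checking now. {"name": "get_weather", "arguments": {"location": "London"}}'
call = extract_tool_call(sample)
print(call["name"])  # get_weather
```

In practice you would dispatch `call["name"]` to the matching function, then feed the result back as a `tool` role message for the model to summarize.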
| Component | Value |
|---|---|
| Parameters | 74.9M |
| Unique layers | 8 |
| Effective depth | 15 |
| Loop | block[4] × 8 |
| Value residual | Yes |
| Hidden dim | 768 |
| FFN dim | 2,048 |
| Attention heads | 12 (Q) / 4 (KV) β GQA |
| Vocab size | 32,000 |
| Max seq length | 512 |
| Training steps | 1,100 |
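The value-residual row in the table refers to the mechanism from arXiv:2410.17897: at each attention layer, the value vectors are mixed with the first layer's values, which alleviates attention concentration in deep (or looped) stacks. A minimal sketch, using a fixed mixing weight `lam` where the paper learns per-layer coefficients:

```python
import numpy as np

def value_residual(v_first, v_current, lam=0.5):
    # Mix the current layer's value vectors with the first layer's
    # before attention. lam is fixed here for illustration; the paper
    # learns the mixing coefficients per layer.
    return lam * v_current + (1.0 - lam) * v_first

v1 = np.ones((4, 64))        # first layer's values
v5 = np.full((4, 64), 3.0)   # a deeper layer's values
mixed = value_residual(v1, v5, lam=0.5)
print(mixed[0, 0])  # 2.0
```

Because every layer retains a direct path to the first layer's values, later layers are less prone to collapsing their attention onto a few tokens.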
Evaluated using EleutherAI lm-evaluation-harness.
| Benchmark | Accuracy | Correct | Total |
|---|---|---|---|
| HellaSwag | 30.9% | 3,100 | 10,042 |
| ARC-Easy | 47.1% | 1,118 | 2,376 |
| ARC-Challenge | 24.9% | 292 | 1,172 |
| ARC (Average) | 36.0% | β | β |
| PIQA | 63.9% | 1,174 | 1,838 |
| WinoGrande | 52.4% | 664 | 1,267 |
| MMLU | 25.2% | 3,536 | 14,042 |
| TruthfulQA | 24.8% | 203 | 817 |
| GSM8K | 1.4% | 18 | 1,319 |
| IFEval | 40.0% | 4 | 10 |
Jeeves is designed for lightweight, resource-constrained settings where a small model with structured tool calling is sufficient.
If you use Jeeves in your work, please also cite the papers that inspired its architecture:
```bibtex
@article{looped_transformer_2023,
  title={Looped Transformers are Better at Learning Learning Algorithms},
  author={...},
  journal={arXiv:2311.12424},
  year={2023}
}

@article{value_residual_2024,
  title={Value Residual Learning For Alleviating Attention Concentration In Transformers},
  author={...},
  journal={arXiv:2410.17897},
  year={2024}
}
```
Apache 2.0. See LICENSE for details.