SLM-RL-Agents: Model Checkpoints
Paper: Towards Robust Reinforcement Learning for Small-Scale Language Model Agents
Authors: Md Rezwanul Haque, Md. Milon Islam, Fakhri Karray
Code: github.com/rezwanh001/slm-rl-agents
This repository hosts 30 trained checkpoints (15 SFT + 15 PPO) from the SLM-RL-Agents framework, a stabilised RLHF pipeline for training small language model agents in the 70M–410M parameter regime, plus an agentic-SFT warm-up checkpoint released as forward-compatibility scaffolding for the multi-turn agentic extension.
All numerical claims below match the corresponding tables of the paper and are reproducible from results/all_results.json in the code repository.
Models
| Family | Model | Params | Layers |
|---|---|---|---|
| Pythia | Pythia-70M-deduped | 70M | 6 |
| Pythia | Pythia-160M-deduped | 162M | 12 |
| Pythia | Pythia-410M-deduped | 405M | 24 |
| SmolLM2 | SmolLM2-135M | 135M | 30 |
| SmolLM2 | SmolLM2-360M | 361M | 32 |
Corpora
- TinyStories: simple narrative fiction
- CNN/DailyMail: news articles
- Wikitext-103: encyclopaedic text
Repository Layout
SLM-RL-Agents/
├── sft/                                    # 15 LoRA adapters
│   ├── pythia-70m/{tinystories, cnn_dailymail, wikitext}/
│   ├── pythia-160m/...
│   ├── pythia-410m/...
│   ├── smollm2-135m/...
│   └── smollm2-360m/...
├── ppo/                                    # 15 fully merged models
│   ├── pythia-70m/{tinystories, cnn_dailymail, wikitext}/
│   └── ...
└── agentic_sft/                            # forward-compatibility tool-use warm-up
    └── pythia-410m/task_a_tinystories/     # full FT, tool-call acc 1.000
Agentic-SFT Warm-up (forward-compatibility)
A single tool-use SFT checkpoint released alongside the multi-turn agentic
scaffolding in src/agentic/ and the corresponding dataset
(agentic/task_a_tinystories/ in SLM-RL-Agents-Data).
This checkpoint teaches a base SLM the action-sentinel grammar
(<tool name="X">{...}</tool>, <ask>...</ask>, <finish>...</finish>)
required by src.agentic.environment.AgenticEnvironment.
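To make the three sentinel forms above concrete, here is a minimal parsing sketch. It is not the parser used by `src.agentic.environment.AgenticEnvironment`, whose implementation is not shown here. It assumes the body of a `<tool>` element is a JSON object, and the names `parse_action` and `lookup` are hypothetical, chosen only for illustration.

```python
import json
import re
from typing import Any, Dict, Optional

# Patterns mirror the three sentinel forms listed above.
_TOOL = re.compile(r'<tool name="([^"]+)">(.*?)</tool>', re.DOTALL)
_ASK = re.compile(r"<ask>(.*?)</ask>", re.DOTALL)
_FINISH = re.compile(r"<finish>(.*?)</finish>", re.DOTALL)


def parse_action(text: str) -> Optional[Dict[str, Any]]:
    """Return the earliest recognised action in `text`, or None.

    Only the first occurrence of each sentinel form is considered.
    Assumption: a tool body is a JSON object; a body that fails to
    parse as JSON is treated as if no tool call were present.
    """
    found = []

    m = _TOOL.search(text)
    if m:
        try:
            args = json.loads(m.group(2))
            found.append((m.start(), {"type": "tool", "name": m.group(1), "args": args}))
        except json.JSONDecodeError:
            pass  # malformed JSON body: not a valid tool call

    for kind, pattern in (("ask", _ASK), ("finish", _FINISH)):
        m = pattern.search(text)
        if m:
            found.append((m.start(), {"type": kind, "text": m.group(1).strip()}))

    if not found:
        return None
    # Pick the action that starts first in the generated text.
    return min(found, key=lambda item: item[0])[1]


# "lookup" is a made-up tool name used only for this example.
print(parse_action('<tool name="lookup">{"query": "dragon"}</tool>'))
print(parse_action("Sure. <finish>The end.</finish>"))
```

A check of this kind is one plausible way to score whether a generated turn contains a well-formed action, though the tool-call accuracy reported for this checkpoint may be computed differently.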
| Item | Value |
|---|---|
| Base model | EleutherAI/pythia-410m-deduped |
| Training | Full fine-tune, 3 epochs on 5,000 demos |
| Path | agentic_sft/pythia-410m/task_a_tinystories/ |
| Tool-call accuracy (n=50) | 1.000 (precondition ≥ 0.5 met) |
| Final train loss | 0.927 |
Status: Released as scaffolding only; not evaluated empirically in the paper. Smaller variants (pythia-70m, pythia-160m) with LoRA failed the ≥ 0.5 precondition; pythia-410m full FT was required to clear it.
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download only the agentic-SFT checkpoint folder from the repository.
root = snapshot_download(
    repo_id="mr3haque/SLM-RL-Agents",
    allow_patterns="agentic_sft/pythia-410m/task_a_tinystories/**",
)
path = f"{root}/agentic_sft/pythia-410m/task_a_tinystories"
model = AutoModelForCausalLM.from_pretrained(path)
tok = AutoTokenizer.from_pretrained(path)
Key Results: Five Configurations Where PPO Helps
The paper's central finding (capacity-headroom hypothesis) is that PPO yields a positive reward delta only where the SFT prior is fluent (PPL < 20) and the reward signal is informative. Across the 15 configurations, exactly five rows clear that bar:
| Configuration | SFT PPL | PPO PPL | SFT Reward | PPO Reward | Δ Reward | Win Rate |
|---|---|---|---|---|---|---|
| Pythia-410M / TinyStories | 6.5 | 7.3 | −4.28 ± 4.14 | −2.92 ± 3.48 | +1.355 | 59.9% |
| SmolLM2-360M / TinyStories | 5.3 | 5.3 | +1.69 ± 2.25 | +2.41 ± 1.89 | +0.724 | 59.7% |
| SmolLM2-360M / Wikitext-103 | 16.7 | 16.9 | +2.71 ± 1.28 | +2.98 ± 1.06 | +0.272 | 56.5% |
| Pythia-160M / TinyStories | 13.5 | 13.5 | −8.52 ± 2.39 | −8.28 ± 2.46 | +0.238 | 52.8% |
| SmolLM2-135M / TinyStories | 7.0 | 7.4 | −0.92 ± 2.26 | −0.69 ± 1.96 | +0.226 | 53.0% |
Reward scores use per-configuration scales (each reward model is trained from the matching SFT checkpoint, so absolute magnitudes are not comparable across rows). The remaining 10 configurations train stably without divergence but show near-zero or negative deltas, consistent with the capacity-headroom prediction. The full 15-row table (including all PPL, reward, ROUGE, BLEU, and Distinct-N values) is in Table II/V of the paper and in results/all_results.json in the code repository.
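As a quick sanity check on the five rows above, the sketch below recomputes each reward delta and tests the fluency condition. It assumes that Δ Reward is the PPO mean reward minus the SFT mean reward, and it uses only the rounded values printed in the table, so small rounding differences are expected. The name `check_rows` is hypothetical, and nothing here reads results/all_results.json, whose schema is not described in this card.

```python
# (configuration, SFT PPL, SFT reward mean, PPO reward mean, reported delta)
# Values are copied from the five-row table above.
ROWS = [
    ("Pythia-410M / TinyStories", 6.5, -4.28, -2.92, 1.355),
    ("SmolLM2-360M / TinyStories", 5.3, 1.69, 2.41, 0.724),
    ("SmolLM2-360M / Wikitext-103", 16.7, 2.71, 2.98, 0.272),
    ("Pythia-160M / TinyStories", 13.5, -8.52, -8.28, 0.238),
    ("SmolLM2-135M / TinyStories", 7.0, -0.92, -0.69, 0.226),
]


def check_rows(rows, tol=0.01, ppl_limit=20.0):
    """Recompute PPO - SFT for each row and compare with the reported delta.

    The table's means are rounded to two decimals, so `tol` allows
    about 0.01 of slack. `ppl_limit` is the PPL < 20 fluency threshold
    stated in the text.
    """
    report = []
    for name, sft_ppl, sft_reward, ppo_reward, reported in rows:
        recomputed = ppo_reward - sft_reward
        report.append({
            "config": name,
            "recomputed_delta": round(recomputed, 3),
            "delta_consistent": abs(recomputed - reported) <= tol,
            "fluent_prior": sft_ppl < ppl_limit,
        })
    return report


for entry in check_rows(ROWS):
    print(entry)
```

Because each reward model has its own scale, the recomputed deltas are meaningful only within a row; the check confirms internal consistency, not a ranking across configurations.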
Quick Start
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer
root = snapshot_download(repo_id="mr3haque/SLM-RL-Agents", allow_patterns="ppo/smollm2-360m/tinystories/**")
model = AutoModelForCausalLM.from_pretrained(f"{root}/ppo/smollm2-360m/tinystories")
tokenizer = AutoTokenizer.from_pretrained(f"{root}/ppo/smollm2-360m/tinystories")
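The Quick Start above loads a fully merged PPO model. The checkpoints under sft/ are described as LoRA adapters, so they would need PEFT to attach them to a base model. The following is a minimal sketch under two assumptions that have not been verified against the repository: each adapter folder contains a standard PEFT `adapter_config.json` that records its base model, and the tokenizer can be loaded from that base model. The helper names `sft_subfolder` and `load_sft_adapter` are hypothetical.

```python
def sft_subfolder(model: str, corpus: str) -> str:
    """Build the in-repo path of an SFT adapter, e.g. sft/pythia-70m/tinystories."""
    return f"sft/{model}/{corpus}"


def load_sft_adapter(model: str = "pythia-70m", corpus: str = "tinystories"):
    """Download one LoRA adapter and attach it to its base model.

    Assumes the folder holds a standard PEFT adapter
    (adapter_config.json plus adapter weights).
    """
    # Imported lazily so the path helper above works without these packages.
    from huggingface_hub import snapshot_download
    from peft import AutoPeftModelForCausalLM, PeftConfig
    from transformers import AutoTokenizer

    sub = sft_subfolder(model, corpus)
    root = snapshot_download(
        repo_id="mr3haque/SLM-RL-Agents",
        allow_patterns=f"{sub}/**",
    )
    path = f"{root}/{sub}"

    # Reads adapter_config.json, loads the base model named there,
    # and wraps it with the adapter weights.
    peft_model = AutoPeftModelForCausalLM.from_pretrained(path)

    # Assumption: the tokenizer comes from the base model recorded in the config.
    base_name = PeftConfig.from_pretrained(path).base_model_name_or_path
    tokenizer = AutoTokenizer.from_pretrained(base_name)
    return peft_model, tokenizer
```

Calling `load_sft_adapter("pythia-70m", "tinystories")` would return a PEFT-wrapped model and a tokenizer. If a plain transformers model is preferred, `merge_and_unload()` on the returned LoRA model folds the adapter weights into the base weights.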
Datasets
Citation
@inproceedings{haque2026slmrlagents,
  title     = {Towards Robust Reinforcement Learning for Small-Scale
               Language Model Agents},
  author    = {Haque, Md Rezwanul and Islam, Md. Milon and Karray, Fakhri},
  booktitle = {Proceedings of the IEEE International Conference on
               Systems, Man, and Cybernetics (SMC)},
  year      = {2026}
}