SLM-RL-Agents — Model Checkpoints

Paper: Towards Robust Reinforcement Learning for Small-Scale Language Model Agents

Authors: Md Rezwanul Haque, Md. Milon Islam, Fakhri Karray

Code: github.com/rezwanh001/slm-rl-agents

This repository hosts 30 trained checkpoints (15 SFT + 15 PPO, one of each per model and corpus pairing) from the SLM-RL-Agents framework, a stabilised RLHF pipeline for training small language model agents in the 70M–410M parameter regime. It also includes one agentic-SFT warm-up checkpoint, released as forward-compatibility scaffolding for the multi-turn agentic extension.

All numerical claims below match the corresponding tables of the paper and are reproducible from results/all_results.json in the code repository.

Models

Family	Model	Params	Layers
Pythia	Pythia-70M-deduped	70M	6
Pythia	Pythia-160M-deduped	162M	12
Pythia	Pythia-410M-deduped	405M	24
SmolLM2	SmolLM2-135M	135M	30
SmolLM2	SmolLM2-360M	361M	32

Corpora

TinyStories — simple narrative fiction
CNN/DailyMail — news articles
Wikitext-103 — encyclopaedic text

Repository Layout

SLM-RL-Agents/
├── sft/                    # 15 LoRA adapters
│   ├── pythia-70m/{tinystories, cnn_dailymail, wikitext}/
│   ├── pythia-160m/...
│   ├── pythia-410m/...
│   ├── smollm2-135m/...
│   └── smollm2-360m/...
├── ppo/                    # 15 fully merged models
│   ├── pythia-70m/{tinystories, cnn_dailymail, wikitext}/
│   └── ...
└── agentic_sft/            # forward-compatibility tool-use warm-up
    └── pythia-410m/task_a_tinystories/   # full FT, tool-call acc 1.000

Agentic-SFT Warm-up (forward-compatibility)

This is a single tool-use SFT checkpoint, released alongside the multi-turn agentic scaffolding in src/agentic/ and the corresponding dataset (agentic/task_a_tinystories/ in SLM-RL-Agents-Data). It teaches a base SLM the action-sentinel grammar (<tool name="X">{...}</tool>, <ask>...</ask>, <finish>...</finish>) required by src.agentic.environment.AgenticEnvironment.

Item	Value
Base model	`EleutherAI/pythia-410m-deduped`
Training	Full fine-tune, 3 epochs on 5,000 demos
Path	`agentic_sft/pythia-410m/task_a_tinystories/`
Tool-call accuracy (n=50)	1.000 (precondition ≥ 0.5 met)
Final train loss	0.927

Status: Released as scaffolding only; not evaluated empirically in the paper. Smaller variants (pythia-70m, pythia-160m) trained with LoRA failed the tool-call accuracy ≥ 0.5 precondition, and a full fine-tune of pythia-410m was required to clear it.

from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

root = snapshot_download(
    repo_id="mr3haque/SLM-RL-Agents",
    allow_patterns="agentic_sft/pythia-410m/task_a_tinystories/**",
)
path = f"{root}/agentic_sft/pythia-410m/task_a_tinystories"
model = AutoModelForCausalLM.from_pretrained(path)
tok = AutoTokenizer.from_pretrained(path)
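
Text generated by this checkpoint is expected to contain one of the three sentinel forms listed above. The following is a minimal sketch of how such output could be parsed, assuming the tool payload is a JSON object. The function `parse_action` and the tool name `search` are hypothetical illustrations; this is not the parser used by src.agentic.environment.AgenticEnvironment.

```python
import json
import re
from typing import Optional

# Hypothetical patterns for the three sentinel forms named in this card.
_TOOL = re.compile(r'<tool name="([^"]+)">(.*?)</tool>', re.DOTALL)
_ASK = re.compile(r"<ask>(.*?)</ask>", re.DOTALL)
_FINISH = re.compile(r"<finish>(.*?)</finish>", re.DOTALL)


def parse_action(text: str) -> Optional[dict]:
    """Parse one action out of generated text.

    Considers the first match of each sentinel kind and returns whichever
    starts earliest. A tool tag whose payload is not a JSON object is
    ignored. Returns None if nothing parses.
    """
    candidates = []

    m = _TOOL.search(text)
    if m:
        try:
            args = json.loads(m.group(2))
        except json.JSONDecodeError:
            args = None  # malformed payload: not a valid tool call
        if isinstance(args, dict):
            action = {"type": "tool", "name": m.group(1), "args": args}
            candidates.append((m.start(), action))

    for kind, pattern in (("ask", _ASK), ("finish", _FINISH)):
        m = pattern.search(text)
        if m:
            action = {"type": kind, "content": m.group(1).strip()}
            candidates.append((m.start(), action))

    if not candidates:
        return None
    # Pick the action that appears first in the text.
    return min(candidates, key=lambda c: c[0])[1]


example = 'Sure. <tool name="search">{"query": "dragon"}</tool>'
action = parse_action(example)
```

A stricter parser might reject any text outside the sentinel tags; this sketch tolerates surrounding text so that partially well-formed generations can still be inspected.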

Key Results — Five Configurations Where PPO Helps

The paper's central finding (capacity-headroom hypothesis) is that PPO yields a positive reward delta only where the SFT prior is fluent (PPL < 20) and the reward signal is informative. Across the 15 configurations, exactly five rows clear that bar:

Configuration	SFT PPL	PPO PPL	SFT Reward	PPO Reward	Δ Reward	Win Rate
Pythia-410M / TinyStories	6.5	7.3	−4.28 ± 4.14	−2.92 ± 3.48	+1.355	59.9%
SmolLM2-360M / TinyStories	5.3	5.3	+1.69 ± 2.25	+2.41 ± 1.89	+0.724	59.7%
SmolLM2-360M / Wikitext-103	16.7	16.9	+2.71 ± 1.28	+2.98 ± 1.06	+0.272	56.5%
Pythia-160M / TinyStories	13.5	13.5	−8.52 ± 2.39	−8.28 ± 2.46	+0.238	52.8%
SmolLM2-135M / TinyStories	7.0	7.4	−0.92 ± 2.26	−0.69 ± 1.96	+0.226	53.0%

Reward scores use per-configuration scales (each reward model is trained from the matching SFT checkpoint, so absolute magnitudes are not comparable across rows). The remaining 10 configurations train stably without divergence but show near-zero or negative deltas — consistent with the capacity-headroom prediction. The full 15-row table (including all PPL, reward, ROUGE, BLEU, and Distinct-N values) is in Table II/V of the paper and in results/all_results.json in the code repository.
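
As a quick sanity check on the table above, the sketch below recomputes each reward delta from the rounded mean rewards and confirms that every listed row has an SFT perplexity below 20. It assumes that Δ Reward is the PPO mean minus the SFT mean; the card does not state this explicitly, but the reported values agree with it to within rounding. The names `ROWS` and `check_row` are illustrative only.

```python
# (configuration, SFT PPL, SFT mean reward, PPO mean reward, reported delta)
# Values copied from the five-row table in this card.
ROWS = [
    ("Pythia-410M / TinyStories", 6.5, -4.28, -2.92, 1.355),
    ("SmolLM2-360M / TinyStories", 5.3, 1.69, 2.41, 0.724),
    ("SmolLM2-360M / Wikitext-103", 16.7, 2.71, 2.98, 0.272),
    ("Pythia-160M / TinyStories", 13.5, -8.52, -8.28, 0.238),
    ("SmolLM2-135M / TinyStories", 7.0, -0.92, -0.69, 0.226),
]

PPL_THRESHOLD = 20.0  # fluency bar quoted in the card (PPL < 20)
# Each mean is rounded to two decimals, so the recomputed difference can be
# off by up to 0.01; the small epsilon absorbs floating-point noise.
ROUNDING_TOL = 0.01 + 1e-9


def check_row(row):
    """Recompute the delta (assumed: PPO mean - SFT mean) for one row."""
    name, sft_ppl, sft_reward, ppo_reward, reported = row
    recomputed = ppo_reward - sft_reward
    return {
        "name": name,
        "recomputed_delta": round(recomputed, 2),
        "matches_reported": abs(recomputed - reported) <= ROUNDING_TOL,
        "fluent_prior": sft_ppl < PPL_THRESHOLD,
    }


results = [check_row(r) for r in ROWS]
for r in results:
    print(r)
```

The same check could be run against results/all_results.json, but its schema is not described here, so the values are hard-coded from the table instead.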

Quick Start

from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download only the PPO checkpoint for SmolLM2-360M on TinyStories.
root = snapshot_download(
    repo_id="mr3haque/SLM-RL-Agents",
    allow_patterns="ppo/smollm2-360m/tinystories/**",
)
path = f"{root}/ppo/smollm2-360m/tinystories"
model = AutoModelForCausalLM.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path)
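
The snippet above loads a fully merged PPO model. The checkpoints under sft/ are LoRA adapters, so they need a base model underneath. The following is a hedged sketch of one way to load them with the `peft` library. It assumes the adapters are stored in standard PEFT format, that the directory names follow the repository layout shown earlier, and that the base models are the Hub IDs listed in `BASE_MODELS`. Only the pythia-160m and pythia-410m IDs appear on this card; the other three are assumptions. The helper names are illustrative.

```python
# Assumed mapping from directory name to base model on the Hugging Face Hub.
BASE_MODELS = {
    "pythia-70m": "EleutherAI/pythia-70m-deduped",
    "pythia-160m": "EleutherAI/pythia-160m-deduped",
    "pythia-410m": "EleutherAI/pythia-410m-deduped",
    "smollm2-135m": "HuggingFaceTB/SmolLM2-135M",
    "smollm2-360m": "HuggingFaceTB/SmolLM2-360M",
}
CORPORA = ("tinystories", "cnn_dailymail", "wikitext")


def checkpoint_subpath(stage: str, model: str, corpus: str) -> str:
    """Build the in-repo path, e.g. 'sft/pythia-160m/wikitext'."""
    if stage not in ("sft", "ppo"):
        raise ValueError(f"unknown stage: {stage!r}")
    if model not in BASE_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if corpus not in CORPORA:
        raise ValueError(f"unknown corpus: {corpus!r}")
    return f"{stage}/{model}/{corpus}"


def load_sft_adapter(model: str, corpus: str,
                     repo_id: str = "mr3haque/SLM-RL-Agents"):
    """Download one SFT adapter and attach it to its (assumed) base model."""
    # Imported lazily so the path helper works without the heavy dependencies.
    from huggingface_hub import snapshot_download
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    sub = checkpoint_subpath("sft", model, corpus)
    root = snapshot_download(repo_id=repo_id, allow_patterns=f"{sub}/**")
    base = AutoModelForCausalLM.from_pretrained(BASE_MODELS[model])
    # Assumption: the tokenizer is taken from the base model.
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODELS[model])
    peft_model = PeftModel.from_pretrained(base, f"{root}/{sub}")
    return peft_model, tokenizer
```

For example, `load_sft_adapter("smollm2-360m", "tinystories")` would return the SFT counterpart of the PPO model loaded above, provided the assumptions hold.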

Datasets

mr3haque/SLM-RL-Agents-Data

Citation

@inproceedings{haque2026slmrlagents,
  title     = {Towards Robust Reinforcement Learning for Small-Scale
               Language Model Agents},
  author    = {Haque, Md Rezwanul and Islam, Md. Milon and Karray, Fakhri},
  booktitle = {Proceedings of the IEEE International Conference on
               Systems, Man, and Cybernetics (SMC)},
  year      = {2026}
}
