SLM-RL-Agents: Model Checkpoints

Paper: Towards Robust Reinforcement Learning for Small-Scale Language Model Agents

Authors: Md Rezwanul Haque, Md. Milon Islam, Fakhri Karray

Code: github.com/rezwanh001/slm-rl-agents

This repository hosts 30 trained checkpoints (15 SFT + 15 PPO) from the SLM-RL-Agents framework, a stabilised RLHF pipeline for training small language model agents in the 70M–410M parameter regime. It also hosts an agentic-SFT warm-up checkpoint, released as forward-compatibility scaffolding for the multi-turn agentic extension.

All numerical claims below match the corresponding tables of the paper and are reproducible from results/all_results.json in the code repository.

Models

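| Family  | Model               | Params | Layers |
|---------|---------------------|--------|--------|
| Pythia  | Pythia-70M-deduped  | 70M    | 6      |
| Pythia  | Pythia-160M-deduped | 162M   | 12     |
| Pythia  | Pythia-410M-deduped | 405M   | 24     |
| SmolLM2 | SmolLM2-135M        | 135M   | 30     |
| SmolLM2 | SmolLM2-360M        | 361M   | 32     |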

Corpora

  • TinyStories: simple narrative fiction
  • CNN/DailyMail: news articles
  • Wikitext-103: encyclopaedic text

Repository Layout

SLM-RL-Agents/
├── sft/                    # 15 LoRA adapters
│   ├── pythia-70m/{tinystories, cnn_dailymail, wikitext}/
│   ├── pythia-160m/...
│   ├── pythia-410m/...
│   ├── smollm2-135m/...
│   └── smollm2-360m/...
├── ppo/                    # 15 fully merged models
│   ├── pythia-70m/{tinystories, cnn_dailymail, wikitext}/
│   └── ...
└── agentic_sft/            # forward-compatibility tool-use warm-up
    └── pythia-410m/task_a_tinystories/   # full FT, tool-call acc 1.000
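The layout above maps each (stage, model, corpus) triple to one sub-directory. The following sketch enumerates those paths and shows one way to load an SFT LoRA adapter with PEFT. It assumes that every model directory mirrors the three corpus sub-directories shown for pythia-70m (the tree elides them) and that each adapter directory is a standard PEFT adapter. The helper names and the `base_id` argument are hypothetical and not part of the repository.

```python
REPO_ID = "mr3haque/SLM-RL-Agents"
STAGES = ("sft", "ppo")
MODELS = ("pythia-70m", "pythia-160m", "pythia-410m", "smollm2-135m", "smollm2-360m")
# Assumption: every model uses the same corpus directory names as pythia-70m.
CORPORA = ("tinystories", "cnn_dailymail", "wikitext")


def checkpoint_subdir(stage: str, model: str, corpus: str) -> str:
    """Return the repository-relative directory for one checkpoint."""
    if stage not in STAGES or model not in MODELS or corpus not in CORPORA:
        raise ValueError(f"unknown checkpoint: {stage}/{model}/{corpus}")
    return f"{stage}/{model}/{corpus}"


def all_checkpoint_subdirs() -> list[str]:
    """List all 2 stages x 5 models x 3 corpora = 30 checkpoint directories."""
    return [checkpoint_subdir(s, m, c) for s in STAGES for m in MODELS for c in CORPORA]


def load_sft_adapter(model: str, corpus: str, base_id: str):
    """Download one SFT adapter and attach it to a base model.

    `base_id` is supplied by the caller; which Hub ID each adapter was
    trained from is an assumption the caller must verify.
    """
    # Imported lazily so the path helpers work without these packages.
    from huggingface_hub import snapshot_download
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    subdir = checkpoint_subdir("sft", model, corpus)
    root = snapshot_download(repo_id=REPO_ID, allow_patterns=f"{subdir}/**")
    base = AutoModelForCausalLM.from_pretrained(base_id)
    return PeftModel.from_pretrained(base, f"{root}/{subdir}")
```

The PPO directories hold fully merged models, so they need no adapter step and load directly with `AutoModelForCausalLM.from_pretrained`.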

Agentic-SFT Warm-up (forward-compatibility)

A single tool-use SFT checkpoint is released alongside the multi-turn agentic scaffolding in src/agentic/ and the corresponding dataset (agentic/task_a_tinystories/ in SLM-RL-Agents-Data). It teaches a base SLM the action-sentinel grammar (<tool name="X">{...}</tool>, <ask>...</ask>, <finish>...</finish>) required by src.agentic.environment.AgenticEnvironment.

| Item                      | Value                                       |
|---------------------------|---------------------------------------------|
| Base model                | EleutherAI/pythia-410m-deduped              |
| Training                  | Full fine-tune, 3 epochs on 5,000 demos     |
| Path                      | agentic_sft/pythia-410m/task_a_tinystories/ |
| Tool-call accuracy (n=50) | 1.000 (precondition ≥ 0.5 met)              |
| Final train loss          | 0.927                                       |

Status: Released as scaffolding only and not evaluated empirically in the paper. Smaller variants (pythia-70m, pythia-160m) with LoRA failed the ≥0.5 precondition; pythia-410m full FT was required to clear it.

from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download only the agentic-SFT checkpoint rather than the whole repository.
root = snapshot_download(
    repo_id="mr3haque/SLM-RL-Agents",
    allow_patterns="agentic_sft/pythia-410m/task_a_tinystories/**",
)
# The checkpoint is a full fine-tune, so it loads without an adapter step.
path = f"{root}/agentic_sft/pythia-410m/task_a_tinystories"
model = AutoModelForCausalLM.from_pretrained(path)
tok = AutoTokenizer.from_pretrained(path)
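The action-sentinel grammar described in this section can be recognised with a few regular expressions. The following is a minimal sketch, not the parser used by src.agentic.environment.AgenticEnvironment. It assumes that the tool body is a JSON object (the card shows only `{...}`), and the function name `parse_action` and the returned dictionary shape are hypothetical.

```python
import json
import re
from typing import Optional

_TOOL = re.compile(r'<tool name="([^"]+)">(.*?)</tool>', re.DOTALL)
_ASK = re.compile(r"<ask>(.*?)</ask>", re.DOTALL)
_FINISH = re.compile(r"<finish>(.*?)</finish>", re.DOTALL)


def parse_action(text: str) -> Optional[dict]:
    """Return the earliest well-formed action found in `text`, or None."""
    candidates = []

    # Only the first <tool> tag is considered; it counts as well-formed
    # when its body parses as a JSON object (an assumption of this sketch).
    match = _TOOL.search(text)
    if match:
        try:
            args = json.loads(match.group(2))
        except json.JSONDecodeError:
            args = None
        if isinstance(args, dict):
            action = {"type": "tool", "name": match.group(1), "args": args}
            candidates.append((match.start(), action))

    for kind, pattern in (("ask", _ASK), ("finish", _FINISH)):
        match = pattern.search(text)
        if match:
            action = {"type": kind, "text": match.group(1).strip()}
            candidates.append((match.start(), action))

    if not candidates:
        return None
    # Pick the action whose opening tag appears first in the text.
    return min(candidates, key=lambda item: item[0])[1]
```

A generation such as `<tool name="search">{"q": "cat"}</tool>` parses to a tool action with name `search`, while free text with no sentinel yields `None`.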

Key Results: Five Configurations Where PPO Helps

The paper's central finding (capacity-headroom hypothesis) is that PPO yields a positive reward delta only where the SFT prior is fluent (PPL < 20) and the reward signal is informative. Across the 15 configurations, exactly five rows clear that bar:

| Configuration               | SFT PPL | PPO PPL | SFT Reward   | PPO Reward   | Δ Reward | Win Rate |
|-----------------------------|---------|---------|--------------|--------------|----------|----------|
| Pythia-410M / TinyStories   | 6.5     | 7.3     | −4.28 ± 4.14 | −2.92 ± 3.48 | +1.355   | 59.9%    |
| SmolLM2-360M / TinyStories  | 5.3     | 5.3     | +1.69 ± 2.25 | +2.41 ± 1.89 | +0.724   | 59.7%    |
| SmolLM2-360M / Wikitext-103 | 16.7    | 16.9    | +2.71 ± 1.28 | +2.98 ± 1.06 | +0.272   | 56.5%    |
| Pythia-160M / TinyStories   | 13.5    | 13.5    | −8.52 ± 2.39 | −8.28 ± 2.46 | +0.238   | 52.8%    |
| SmolLM2-135M / TinyStories  | 7.0     | 7.4     | −0.92 ± 2.26 | −0.69 ± 1.96 | +0.226   | 53.0%    |

Reward scores use per-configuration scales (each reward model is trained from the matching SFT checkpoint, so absolute magnitudes are not comparable across rows). The remaining 10 configurations train stably without divergence but show near-zero or negative deltas, consistent with the capacity-headroom prediction. The full 15-row table (including all PPL, reward, ROUGE, BLEU, and Distinct-N values) is in Table II/V of the paper and in results/all_results.json in the code repository.
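The five rows reported in this section can be sanity-checked with a few lines of arithmetic. The sketch below assumes that Δ Reward is the mean PPO reward minus the mean SFT reward (the card does not define it, but the printed values agree with that reading to within rounding). The values are copied from the table; the names `ROWS` and `check_row` are hypothetical.

```python
# (configuration, SFT PPL, SFT mean reward, PPO mean reward, reported delta)
ROWS = [
    ("Pythia-410M / TinyStories", 6.5, -4.28, -2.92, 1.355),
    ("SmolLM2-360M / TinyStories", 5.3, 1.69, 2.41, 0.724),
    ("SmolLM2-360M / Wikitext-103", 16.7, 2.71, 2.98, 0.272),
    ("Pythia-160M / TinyStories", 13.5, -8.52, -8.28, 0.238),
    ("SmolLM2-135M / TinyStories", 7.0, -0.92, -0.69, 0.226),
]

PPL_THRESHOLD = 20.0  # fluency bar stated in the card (PPL < 20)
ROUNDING_TOL = 0.01   # both means are printed to two decimals


def check_row(row):
    """Check one row against the fluency bar and the delta arithmetic."""
    name, sft_ppl, sft_reward, ppo_reward, delta = row
    return {
        "name": name,
        "fluent": sft_ppl < PPL_THRESHOLD,
        # Assumption: delta = mean PPO reward - mean SFT reward.
        "delta_consistent": abs((ppo_reward - sft_reward) - delta) <= ROUNDING_TOL,
        "positive": delta > 0,
    }


report = [check_row(row) for row in ROWS]
for entry in report:
    print(entry)
```

Because each reward model is trained per configuration, this check compares SFT and PPO only within a row and never across rows.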

Quick Start

from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download only the PPO checkpoint for SmolLM2-360M / TinyStories.
subdir = "ppo/smollm2-360m/tinystories"
root = snapshot_download(repo_id="mr3haque/SLM-RL-Agents", allow_patterns=f"{subdir}/**")

# PPO checkpoints are fully merged, so they load without an adapter step.
model = AutoModelForCausalLM.from_pretrained(f"{root}/{subdir}")
tokenizer = AutoTokenizer.from_pretrained(f"{root}/{subdir}")
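Once a model and tokenizer are loaded, text can be sampled with the standard `generate` API. The following is a minimal sketch: the helper name `generate_text` is hypothetical, and the sampling settings (`top_p`, `max_new_tokens`) are illustrative assumptions, not the decoding configuration used in the paper.

```python
def generate_text(model, tokenizer, prompt: str, max_new_tokens: int = 64) -> str:
    """Sample one continuation of `prompt` from a loaded causal LM."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=0.9,  # assumption: illustrative value, not from the paper
        # Reuse EOS as the padding token so generate() has a pad id to work with.
        pad_token_id=tokenizer.eos_token_id,
    )
    # The decoded string contains the prompt followed by the continuation.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, `print(generate_text(model, tokenizer, "Once upon a time"))` would print the prompt followed by a sampled TinyStories-style continuation.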

Datasets

mr3haque/SLM-RL-Agents-Data

Citation

@inproceedings{haque2026slmrlagents,
  title     = {Towards Robust Reinforcement Learning for Small-Scale
               Language Model Agents},
  author    = {Haque, Md Rezwanul and Islam, Md. Milon and Karray, Fakhri},
  booktitle = {Proceedings of the IEEE International Conference on
               Systems, Man, and Cybernetics (SMC)},
  year      = {2026}
}