# OpenSOC Defender: GRPO-trained LoRA adapter
A Qwen2.5-3B-Instruct LoRA adapter (rank 16) trained via GRPO to triage Security Operations Center (SOC) alerts. Built for the OpenEnv Hackathon, April 2026.
## Model Description
- Developed by: Shivam Sharma
- Model type: LoRA adapter (PEFT) for causal language model
- Language: English
- License: BSD-3-Clause
- Finetuned from: `unsloth/Qwen2.5-3B-Instruct`
## What it does
Given a SIEM alert and a window of structured log events, the model chooses one of five SOC triage actions:
| Action | Meaning |
|---|---|
| `dismiss` | Benign noise, no action needed |
| `monitor` | Suspicious but not actionable yet |
| `quarantine_host` | Isolate the endpoint |
| `block_ip` | Block the external IP |
| `escalate` | Wake a human: blast-radius event |
The model also cites the specific `log_id` that drove its decision, which is verified against the environment's ground truth for a +0.1 bonus reward.
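For illustration, a format-compliant response might look like the dictionary below. The exact schema is defined by the OpenSOC environment; the field names here are assumptions, not the verified format.

```python
# Hypothetical response shape (field names are assumptions; see the
# OpenSOC environment for the schema the verifier actually parses).
example_response = {
    "action": "block_ip",    # one of the five triage actions
    "citation": "log_0042",  # log_id claimed to have driven the decision
    "rationale": "Repeated failed SSH logins from a single external IP.",
}
```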
## Training
### Training Data
- SFT warm-start: 600 (alert, log_window → action + citation + rationale) gold examples generated by the OpenSOC environment's deterministic generator across all 4 curriculum stages.
- GRPO curriculum: Online rollouts against the OpenSOC environment using verifier-grounded rewards.
### Training Procedure
- SFT warm-start (~12 min on L4): Pushes P(format-compliant response) from ~0% to ~95%.
- GRPO curriculum (4 stages × 200 steps, ~3h on L4):
  - `stage1_basic`: single-event, unambiguous templates
  - `stage2_multi`: multi-event log windows, 1 decoy
  - `stage3_mixed`: benign decoys interleaved with malicious events, 2 decoys
  - `stage4_adversarial`: attacker-controlled distribution, 3 decoys
### Training Hyperparameters
- LoRA rank: 16
- Learning rate (SFT): 2e-4
- Learning rate (GRPO): 5e-6
- GRPO group size (`num_generations`): 8
- Batch size: 2 (with grad_accum=4)
- Steps per stage: 200
- Framework: Unsloth + HuggingFace TRL
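As a rough sketch, one GRPO stage maps onto a TRL configuration like the one below. The argument names come from TRL's `GRPOConfig`; the dataset, reward function, and Unsloth model loading are omitted, so treat this as illustrative rather than the exact training script.

```python
from trl import GRPOConfig

# Per-stage GRPO settings using the hyperparameters listed above.
config = GRPOConfig(
    output_dir="opensoc-defender-grpo-stage1_basic",  # hypothetical path
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_generations=8,  # GRPO group size
    max_steps=200,      # steps per curriculum stage
)
```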
## Reward Design (RLVR)
The reward is computed by a deterministic verifier: the ground-truth triage action is derived purely from the structured event parameters, never from any free text. This makes the reward verifiable and reproducible.
Defender reward components:
- +1.0 for matching the verifier's ground-truth action
- −1.0 for dismiss-on-malicious (the cardinal SOC failure mode)
- −0.3 for over-reacting on benign (containment on noise)
- −0.05 for unnecessary escalation
- +0.1 bonus for citing the correct triggering `log_id`
Full rubric: `rubric.py`
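For reference, the components above compose roughly as in this sketch. The function and field names are assumptions, and the handling of overlapping penalties is one plausible reading; the authoritative logic lives in `rubric.py`.

```python
def defender_reward(pred_action: str, pred_log_id: str,
                    true_action: str, true_log_id: str,
                    is_malicious: bool) -> float:
    """Illustrative recombination of the reward components above."""
    over_reactions = {"quarantine_host", "block_ip", "escalate"}
    reward = 0.0
    if pred_action == true_action:
        reward += 1.0    # matched the verifier's ground-truth action
    elif pred_action == "dismiss" and is_malicious:
        reward -= 1.0    # the cardinal SOC failure mode
    elif not is_malicious and pred_action in over_reactions:
        reward -= 0.3    # containment on benign noise
    elif pred_action == "escalate":
        reward -= 0.05   # unnecessary escalation
    if pred_log_id == true_log_id:
        reward += 0.1    # cited the correct triggering log_id
    return reward
```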
## Stage Adapters
Each curriculum stage's adapter is published separately:
| Stage | Repo |
|---|---|
| SFT warm-start | opensoc-defender-grpo-sft |
| Stage 1 (easy) | opensoc-defender-grpo-stage1_basic |
| Stage 2 (medium) | opensoc-defender-grpo-stage2_multi |
| Stage 3 (hard) | opensoc-defender-grpo-stage3_mixed |
| Stage 4 (adversarial) | opensoc-defender-grpo-stage4_adversarial |
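Any stage adapter can be loaded the same way as the final one by swapping the repo id. The snippet below assumes the stage repos live under the same `shivam2k3/` namespace and reuses `base` from the "How to Use" snippet further down.

```python
from peft import PeftModel

# Load the adversarial-stage adapter instead of the final adapter
# (namespace assumed; `base` is the model from "How to Use" below).
stage4 = PeftModel.from_pretrained(
    base, "shivam2k3/opensoc-defender-grpo-stage4_adversarial"
)
```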
## Model Sources
- Environment: `shivam2k3/opensoc-env` (HF Space, running)
- Training notebook: `train_grpo.ipynb`
- Verifier source: `verifier.py`
- Rubric source: `rubric.py`
- Live demo: `/demo`
## How to Use
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then apply the GRPO-trained LoRA adapter on top.
base = AutoModelForCausalLM.from_pretrained("unsloth/Qwen2.5-3B-Instruct")
model = PeftModel.from_pretrained(base, "shivam2k3/opensoc-defender-grpo")
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-3B-Instruct")
```
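A minimal generation call might look like the following; the alert text is a made-up example, not a sample drawn from the environment.

```python
# Made-up alert for illustration; real prompts come from the OpenSOC env.
messages = [{
    "role": "user",
    "content": "Alert: repeated failed SSH logins from 203.0.113.7\n"
               "log_0042: sshd authentication failure, user=root\n"
               "Choose one action and cite the triggering log_id.",
}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```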
## Compute Infrastructure
- Hardware: NVIDIA L4 (24GB) via HuggingFace Jupyter Notebooks
- Training time: ~3.5 hours total (SFT + GRPO + eval)
- Cost: ~$3 of HF compute credits
## Framework Versions
- PEFT 0.19.1
- Transformers (latest)
- TRL (latest)
- Unsloth (latest)