OpenSOC Defender: GRPO-trained LoRA adapter

A Qwen2.5-3B-Instruct LoRA adapter (rank 16) trained via GRPO to triage Security Operations Center (SOC) alerts. Built for the OpenEnv Hackathon, April 2026.

Model Description

  • Developed by: Shivam Sharma
  • Model type: LoRA adapter (PEFT) for causal language model
  • Language: English
  • License: BSD-3-Clause
  • Finetuned from: unsloth/Qwen2.5-3B-Instruct

What it does

Given a SIEM alert and a window of structured log events, the model chooses one of five SOC triage actions:

Action            Meaning
dismiss           Benign noise, no action needed
monitor           Suspicious but not actionable yet
quarantine_host   Isolate the endpoint
block_ip          Block the external IP
escalate          Wake a human (blast-radius event)

The model also cites the specific log_id that drove its decision, which is verified against the environment's ground truth for a +0.1 bonus reward.
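
A hypothetical format-compliant response (the field names here are illustrative; the verified schema is defined by the environment's format check):

{"action": "block_ip", "log_id": "log-0042", "rationale": "Repeated outbound connections to the flagged external IP."}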

Training

Training Data

  • SFT warm-start: 600 gold examples (alert + log window → action, citation, rationale) generated by the OpenSOC environment's deterministic generator across all 4 curriculum stages.
  • GRPO curriculum: Online rollouts against the OpenSOC environment using verifier-grounded rewards.

Training Procedure

  1. SFT warm-start (~12 min on L4): Pushes P(format-compliant response) from ~0% to ~95%.
  2. GRPO curriculum (4 stages × 200 steps, ~3h on L4):
    • stage1_basic: single-event, unambiguous templates
    • stage2_multi: multi-event log windows, 1 decoy
    • stage3_mixed: benign decoys interleaved with malicious events, 2 decoys
    • stage4_adversarial: attacker-controlled distribution, 3 decoys

Training Hyperparameters

  • LoRA rank: 16
  • Learning rate (SFT): 2e-4
  • Learning rate (GRPO): 5e-6
  • GRPO group size (num_generations): 8
  • Batch size: 2 (with grad_accum=4)
  • Steps per stage: 200
  • Framework: Unsloth + HuggingFace TRL
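
As a rough sketch, these settings map onto TRL's GRPOConfig as follows (assuming a recent TRL release; only the numeric values come from this run):

from trl import GRPOConfig

# Sketch of the GRPO-stage configuration; field names follow TRL's
# GRPOConfig, values are the hyperparameters listed above.
grpo_config = GRPOConfig(
    learning_rate=5e-6,
    num_generations=8,              # GRPO group size
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    max_steps=200,                  # one curriculum stage
)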

Reward Design (RLVR)

The reward is computed by a deterministic verifier: the ground-truth triage action is derived purely from the structured event parameters, never from any free text. This makes the reward verifiable and reproducible.

Defender reward components:

  • +1.0 for matching the verifier's ground-truth action
  • -1.0 for dismiss-on-malicious (the cardinal SOC failure mode)
  • -0.3 for over-reacting on benign (containment on noise)
  • -0.05 for unnecessary escalation
  • +0.1 bonus for citing the correct triggering log_id

Full rubric: rubric.py
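
A minimal sketch of how these components combine (rubric.py is the authoritative implementation; the function signature and ground-truth fields below are illustrative):

def defender_reward(pred_action, cited_log_id, truth):
    # `truth` carries the verifier's ground-truth action, malicious flag,
    # and triggering log_id, all derived from structured event parameters.
    if pred_action == truth["action"]:
        reward = 1.0
    elif pred_action == "dismiss" and truth["is_malicious"]:
        reward = -1.0   # cardinal failure: dismissing a real attack
    elif pred_action in ("quarantine_host", "block_ip") and not truth["is_malicious"]:
        reward = -0.3   # containment fired on benign noise
    elif pred_action == "escalate":
        reward = -0.05  # escalation the verifier did not call for
    else:
        reward = 0.0    # assumption: other mismatches score zero
    if cited_log_id == truth["log_id"]:
        reward += 0.1   # bonus for citing the correct triggering log_id
    return reward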

Stage Adapters

Each curriculum stage's adapter (stage1_basic through stage4_adversarial) is published separately.

How to Use

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the frozen base model, then attach the GRPO-trained LoRA adapter.
base = AutoModelForCausalLM.from_pretrained("unsloth/Qwen2.5-3B-Instruct")
model = PeftModel.from_pretrained(base, "shivam2k3/opensoc-defender-grpo")
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-3B-Instruct")
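
The prompt schema (how the alert and log window are serialized) is defined by the OpenSOC environment; the call below is an illustrative chat-template invocation, not the environment's exact format:

# Illustrative only: the real alert/log-window serialization comes from
# the OpenSOC environment's observation format.
messages = [{"role": "user", "content": "Alert: ...\nLog window: ..."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))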

Compute Infrastructure

  • Hardware: NVIDIA L4 (24GB) via HuggingFace Jupyter Notebooks
  • Training time: ~3.5 hours total (SFT + GRPO + eval)
  • Cost: ~$3 of HF compute credits

Framework Versions

  • PEFT 0.19.1
  • Transformers (latest at time of training)
  • TRL (latest at time of training)
  • Unsloth (latest at time of training)