OpenEnv · RL Environment
v4.2.1

HallucinationGuard‑Env

Train AI models to answer only from verified context — with a 9-component reward system that penalizes fabrication and rewards factual grounding, citation accuracy, and calibrated confidence.

1M+ Examples
38 Datasets
3 Task Tiers
9 Reward Components

How it works

Three primitives. Nine reward signals. One goal: no hallucinations.

01
🔄

reset()

Sample a question + context document from one of 38 curated datasets, stratified by difficulty tier.

02
📤

step(answer)

Submit your answer with confidence and a source quote. Receive a dense reward signal across all 9 components.

03
📊

grade()

Aggregate episode rewards into a task score. Track accuracy, hallucination rate, and skill rating over time.
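The three primitives above can be sketched as a toy in-memory loop. The method names (reset, step, grade) and the reward/is_hallucination fields mirror the docs; the internals here are illustrative stand-ins, not the real environment's logic.

```python
class ToyHallucinationEnv:
    """Toy stand-in for the reset/step/grade loop; not the real environment."""

    def __init__(self):
        self.rewards = []
        self.current = None

    def reset(self):
        # Real env: sample a question + context from 38 datasets by tier.
        # Here: a fixed toy pair.
        self.current = {
            "question": "What is the capital of France?",
            "context": "Paris is the capital of France.",
        }
        self.rewards = []
        return self.current

    def step(self, answer, confidence, source_quote):
        # Toy grounding check: the answer must appear verbatim in the context.
        grounded = answer in self.current["context"]
        reward = confidence if grounded else 0.1 * (1.0 - confidence)
        self.rewards.append(reward)
        return {"reward": reward, "is_hallucination": not grounded}

    def grade(self):
        # Aggregate episode rewards into a single task score.
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0
```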

9-Component Reward System

Every answer is graded on factual correctness, source grounding, citation accuracy, confidence calibration, semantic consistency, hallucination detection, ROUGE-L, BERTScore, and AlignScore. Each component is weighted and combined into a single scalar reward in [0, 1]. Confident wrong answers are penalized harder than uncertain ones.
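A minimal sketch of how nine weighted components could collapse into one scalar, with a calibration penalty so confident wrong answers score worse than uncertain ones. The component names come from the docs; the weights and the exact penalty rule are illustrative assumptions, not the environment's actual values.

```python
# Assumed weights (sum to 1.0); the real environment's weights may differ.
COMPONENT_WEIGHTS = {
    "factual_correctness":     0.20,
    "source_grounding":        0.15,
    "citation_accuracy":       0.10,
    "confidence_calibration":  0.15,
    "semantic_consistency":    0.10,
    "hallucination_detection": 0.10,
    "rouge_l":                 0.07,
    "bert_score":              0.07,
    "align_score":             0.06,
}

def combine_reward(components: dict, confidence: float, is_correct: bool) -> float:
    """Weighted sum of per-component scores in [0, 1], then a calibration
    penalty: the more confident a wrong answer, the harder it is punished."""
    base = sum(COMPONENT_WEIGHTS[k] * components.get(k, 0.0)
               for k in COMPONENT_WEIGHTS)
    if not is_correct:
        base *= (1.0 - confidence)  # confident + wrong -> near-zero reward
    return max(0.0, min(1.0, base))
```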

Curriculum Progression

Episodes advance from Beginner (single-hop factual QA with unambiguous ground truth) through Intermediate (multi-hop synthesis across multiple context sentences) to Advanced (adversarial prompts where well-calibrated refusals score best). The environment tracks a live skill rating and adjusts difficulty sampling accordingly.
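One way to picture skill-tracked difficulty sampling: a running skill rating shifts the tier distribution toward harder episodes as the agent improves. The update rule and tier weights below are assumptions for illustration, not the environment's actual curriculum logic.

```python
TIERS = ["beginner", "intermediate", "advanced"]

def tier_probs(skill: float) -> list:
    """Map a skill rating in [0, 1] to per-tier sampling probabilities."""
    weights = [
        max(0.05, 1.0 - skill),            # beginner: favored at low skill
        0.55 - abs(skill - 0.5),           # intermediate: peaks at mid skill
        max(0.05, skill),                  # advanced: favored at high skill
    ]
    total = sum(weights)
    return [w / total for w in weights]

def update_skill(skill: float, episode_reward: float, lr: float = 0.1) -> float:
    """Exponential moving average of episode rewards as the skill rating."""
    return (1.0 - lr) * skill + lr * episode_reward
```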

Task Tiers

Three progressively harder tasks drawn from 38 datasets with 1M+ examples.

🟢

Factual Grounding

Beginner · ~450K examples

Answer straightforward factual questions from a short context passage. Single-hop retrieval with unambiguous ground truth. The grader rewards precise citation and heavily penalizes adding information not found in the context.

SQuAD · BoolQ · OpenBookQA · ARC · TriviaQA · +8 more
🔵

Multi-Hop Synthesis

Intermediate · ~380K examples

Synthesize evidence from multiple context sentences to reach an answer. Requires connecting disparate facts without fabricating bridge claims. AlignScore and BERTScore are weighted more heavily at this tier.

HotpotQA · CoQA · NQ-Open · MS-MARCO · MuSiQue · +7 more
🔴

Adversarial Resistance

Advanced · ~210K examples

Resist adversarial prompts designed to elicit hallucinations. Many questions are deliberately unanswerable: refusals submitted with low confidence score better than fabricated, plausible-sounding answers.

HaluEval · TruthfulQA · FEVER · Climate-FEVER · WittyQA · +6 more
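At this tier, the best action for an unanswerable question is a refusal with low confidence. A hypothetical helper for building that /step payload is sketched below; the field names follow the quick-start example, while the refusal wording itself is an assumption.

```python
def refusal_action(session_id: str) -> dict:
    """Build a /step payload that declines to answer instead of fabricating."""
    return {
        "answer": "The context does not contain enough information to answer.",
        "confidence": 0.1,   # low confidence: calibrated uncertainty
        "source_quote": "",  # nothing to cite when refusing
        "session_id": session_id,
    }
```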

API Reference

RESTful JSON API. All endpoints accept and return application/json. No auth required.

Method  Endpoint             Description
POST    /reset               Start episode — returns question, context, difficulty, episode_id
POST    /step                Submit answer with confidence + source_quote, receive reward breakdown
GET     /state               Current episode metadata — accuracy, hallucination_rate, skill_rating
GET     /tasks               List all 3 tasks with action schema
POST    /grader              Score a completed episode (0.0 – 1.0) from rewards + infos
POST    /baseline            Run heuristic baseline across all 3 tasks
GET     /metadata            Environment name, version, license
GET     /schema              Full JSON schemas for action, observation, state
GET     /health              Health check — returns {"status":"healthy"}
POST    /mcp                 JSON-RPC 2.0 tool discovery for MCP clients
GET     /leaderboard         Ranked leaderboard by avg_reward
POST    /leaderboard/submit  Submit model results for ranking
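For the /mcp endpoint, a JSON-RPC 2.0 discovery request body might look like the sketch below. "tools/list" is the standard MCP tool-discovery method; whether this server expects exactly that method name is an assumption.

```python
import json

# Minimal JSON-RPC 2.0 tool-discovery request for the /mcp endpoint.
discovery = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}
body = json.dumps(discovery)  # POST this as the request body
```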

Quick Start

Three commands to run your first episode.

bash
# Install and launch
pip install -e .
uvicorn server.app:app --port 7860

# Run heuristic baseline
python inference.py --heuristic --env-url http://localhost:7860
python
import requests

BASE = "http://localhost:7860"

# 1. Reset — get a question + context
obs = requests.post(f"{BASE}/reset", json={"difficulty": "beginner"}).json()
session_id = obs["session_id"]
print(obs["question"])

# 2. Step — submit your answer
result = requests.post(f"{BASE}/step", json={
    "answer": "Based on the context, ...",
    "confidence": 0.85,
    "source_quote": "verbatim text from context",
    "session_id": session_id,
}).json()
print(result["reward"])            # scalar in [0, 1]
print(result["is_hallucination"])  # bool

Interactive Playground

Reset an episode, read the context, craft your answer, and see the live reward breakdown.
