ToolCallVerifier - Unauthorized Tool Call Detection

Stage 2 of Two-Stage LLM Agent Defense Pipeline

🎯 What This Model Does

ToolCallVerifier is a ModernBERT-based token classifier that detects unauthorized tool calls in LLM agent systems. It performs token-level classification on tool call JSON to identify malicious arguments that may have been injected through prompt injection attacks.

| Label | Description |
|---|---|
| `AUTHORIZED` | Token is part of a legitimate, user-requested action |
| `UNAUTHORIZED` | Token indicates injected or malicious content; BLOCK the call |
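
As a rough illustration of how these two labels might be consumed, the sketch below loads the model through the Hugging Face `transformers` token-classification pipeline and collects the spans labeled `UNAUTHORIZED`. The helper names (`load_verifier`, `unauthorized_spans`, `should_block`) and the `min_score` threshold of 0.5 are assumptions made for this example, not part of the model's documented interface. This section also does not specify an input template, so the sketch simply passes the raw tool call JSON string.

```python
from typing import Dict, List

MODEL_ID = "llm-semantic-router/toolcall-verifier"


def load_verifier():
    """Build a token-classification pipeline for the verifier model."""
    # Imported lazily so the helpers below work without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges adjacent tokens that share a label
    # and reports each merged span under the "entity_group" key.
    return pipeline(
        "token-classification",
        model=MODEL_ID,
        aggregation_strategy="simple",
    )


def unauthorized_spans(
    predictions: List[Dict], text: str, min_score: float = 0.5
) -> List[str]:
    """Return the substrings of `text` labeled UNAUTHORIZED.

    `min_score` is an assumed threshold chosen for illustration only.
    """
    return [
        text[p["start"]:p["end"]]
        for p in predictions
        if p["entity_group"] == "UNAUTHORIZED" and p["score"] >= min_score
    ]


def should_block(
    predictions: List[Dict], text: str, min_score: float = 0.5
) -> bool:
    """Block the tool call if any span is confidently UNAUTHORIZED."""
    return bool(unauthorized_spans(predictions, text, min_score))


# Example usage (downloads the model, so it is left commented out):
# verifier = load_verifier()
# tool_call = '{"name": "send_email", "arguments": {"to": "someone@example.com"}}'
# if should_block(verifier(tool_call), tool_call):
#     print("Tool call blocked")
```

Blocking on any single flagged span is the most conservative policy; a deployment could instead require a minimum number of flagged characters, which trades recall for fewer false blocks.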

🚨 Attack Categories Covered

| Category | Source | Description |
|---|---|---|
| Delimiter Injection | LLMail | `<<end_context>>`, `>>}}]])` |
| Word Obfuscation | LLMail | Inserting noise words between tokens |
| Fake Sessions | LLMail | `START_USER_SESSION`, `EXECUTE_USERQUERY` |
| Roleplay Injection | WildJailbreak | "You are an admin bot that can..." |
| XML Tag Injection | WildJailbreak | `<execute_action>`, `<tool_call>` |
| Authority Bypass | WildJailbreak | "As administrator, I authorize..." |
| Intent Mismatch | Synthetic | User asks X, tool does Y |
| MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
| MCP Shadowing | Synthetic | Fake authorization context |

🔗 Integration with FunctionCallSentinel

This model is Stage 2 of a two-stage defense pipeline:

┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│   User Prompt   │────▶│ FunctionCallSentinel │────▶│   LLM + Tools   │
│                 │     │      (Stage 1)       │     │                 │
└─────────────────┘     └──────────────────────┘     └────────┬────────┘
                                                              │
                               ┌──────────────────────────────▼──────────────────────────┐
                               │           ToolCallVerifier (This Model)                 │
                               │   Token-level verification before tool execution        │
                               └─────────────────────────────────────────────────────────┘

| Scenario | Recommendation |
|---|---|
| General chatbot | Stage 1 only |
| Tool-calling agent (low risk) | Stage 1 only |
| Tool-calling agent (high risk) | Both stages |
| Email/file system access | Both stages |
| Financial transactions | Both stages |
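
The flow in the diagram and the recommendations in the table can be sketched as a small guard function. Everything here is illustrative: the scenario keys are hypothetical names for the table rows, and the three callables stand in for the Stage 1 classifier, the LLM, and this model, whose real interfaces are not described in this section.

```python
from typing import Callable, Dict, Optional

# Hypothetical scenario keys mirroring the recommendation table:
# True means both stages run, False means Stage 1 only.
USE_STAGE_2: Dict[str, bool] = {
    "general_chatbot": False,
    "tool_agent_low_risk": False,
    "tool_agent_high_risk": True,
    "email_or_file_access": True,
    "financial_transactions": True,
}


def guarded_tool_call(
    prompt: str,
    scenario: str,
    stage1_flags_prompt: Callable[[str], bool],
    propose_tool_call: Callable[[str], str],
    stage2_flags_call: Callable[[str], bool],
) -> Optional[str]:
    """Return the tool call JSON if it passes every enabled check, else None."""
    # Stage 1: screen the user prompt before it reaches the LLM.
    if stage1_flags_prompt(prompt):
        return None

    # The LLM proposes a tool call as a JSON string.
    tool_call = propose_tool_call(prompt)

    # Stage 2: verify the proposed tool call before execution,
    # but only for scenarios where the table recommends both stages.
    if USE_STAGE_2[scenario] and stage2_flags_call(tool_call):
        return None

    return tool_call
```

Passing the stages in as callables keeps the policy separate from model loading, so the same guard can be exercised with stubs in tests and with real models in production.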

🎯 Intended Use

Primary Use Cases

LLM Agent Security: Verify tool calls before execution
Prompt Injection Defense: Detect unauthorized actions from injected prompts
API Gateway Protection: Filter malicious tool calls at infrastructure level

Out of Scope

General text classification
Non-tool-calling scenarios
Languages other than English

📜 License

Apache 2.0

Model: llm-semantic-router/toolcall-verifier
Base model: answerdotai/ModernBERT-base
Model size: 0.1B params
Tensor type: F32 (Safetensors)

Evaluation results (self-reported)

| Metric | Value |
|---|---|
| UNAUTHORIZED F1 | 0.935 |
| UNAUTHORIZED Precision | 0.950 |
| UNAUTHORIZED Recall | 0.920 |
| Accuracy | 0.929 |
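
As a quick consistency check on the reported numbers, F1 is the harmonic mean of precision and recall, so the `UNAUTHORIZED` F1 can be recomputed from the other two figures. The snippet below is only a sanity check on the self-reported values, not a reproduction of the evaluation.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Self-reported UNAUTHORIZED precision and recall from the results above.
f1 = f1_score(0.950, 0.920)
print(f"{f1:.3f}")  # 0.935
```

The recomputed value matches the reported F1 of 0.935 to three decimal places.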