How to use llm-semantic-router/toolcall-verifier with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("token-classification", model="llm-semantic-router/toolcall-verifier")

# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/toolcall-verifier")
model = AutoModelForTokenClassification.from_pretrained("llm-semantic-router/toolcall-verifier")

ToolCallVerifier is a ModernBERT-based token classifier that detects unauthorized tool calls in LLM agent systems. It performs token-level classification on tool call JSON to identify malicious arguments that may have been injected through prompt injection attacks.
| Label | Description |
|---|---|
| AUTHORIZED | Token is part of a legitimate, user-requested action |
| UNAUTHORIZED | Token indicates injected/malicious content (BLOCK) |
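
Because the labels are assigned per token, a caller usually has to turn them into a single allow/block decision. The following is a minimal sketch of that post-processing step, not the authors' implementation. It assumes predictions shaped like the output of a Transformers `token-classification` pipeline (dicts with `entity`, `score`, `start`, `end`) and that the label strings are exactly `AUTHORIZED` and `UNAUTHORIZED`; check the model's `id2label` config before relying on this. The helper names `unauthorized_spans` and `should_block`, the `min_score` threshold, and the hand-written predictions are all hypothetical.

```python
from typing import Dict, List, Tuple


def unauthorized_spans(
    text: str,
    predictions: List[Dict],
    min_score: float = 0.5,  # assumed threshold, tune on your own data
) -> List[Tuple[int, int, str]]:
    """Merge neighbouring UNAUTHORIZED tokens into (start, end, text) spans."""
    flagged = sorted(
        (p["start"], p["end"])
        for p in predictions
        if p["entity"] == "UNAUTHORIZED" and p["score"] >= min_score
    )
    merged: List[List[int]] = []
    for start, end in flagged:
        # Join tokens that touch or are at most one character apart.
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e, text[s:e]) for s, e in merged]


def should_block(text: str, predictions: List[Dict], min_score: float = 0.5) -> bool:
    """Block the tool call if any token is confidently UNAUTHORIZED."""
    return bool(unauthorized_spans(text, predictions, min_score))


# Hand-written predictions standing in for real model output.
tool_call = '{"to": "attacker@evil.com", "body": "hi"}'
fake_predictions = [
    {"entity": "AUTHORIZED", "score": 0.98, "start": 2, "end": 4},
    {"entity": "UNAUTHORIZED", "score": 0.93, "start": 8, "end": 16},
    {"entity": "UNAUTHORIZED", "score": 0.91, "start": 16, "end": 17},
    {"entity": "UNAUTHORIZED", "score": 0.95, "start": 17, "end": 21},
    {"entity": "UNAUTHORIZED", "score": 0.90, "start": 21, "end": 25},
]
print(unauthorized_spans(tool_call, fake_predictions))
# [(8, 25, 'attacker@evil.com')]
```

Returning the flagged character spans, rather than only a boolean, makes it possible to log which argument triggered the block.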
| Category | Source | Description |
|---|---|---|
| Delimiter Injection | LLMail | <<end_context>>, >>}}\]\]) |
| Word Obfuscation | LLMail | Inserting noise words between tokens |
| Fake Sessions | LLMail | START_USER_SESSION, EXECUTE_USERQUERY |
| Roleplay Injection | WildJailbreak | "You are an admin bot that can..." |
| XML Tag Injection | WildJailbreak | <execute_action>, <tool_call> |
| Authority Bypass | WildJailbreak | "As administrator, I authorize..." |
| Intent Mismatch | Synthetic | User asks X, tool does Y |
| MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
| MCP Shadowing | Synthetic | Fake authorization context |
This model is Stage 2 of a two-stage defense pipeline:
┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│   User Prompt   │────▶│  ToolCallSentinel    │────▶│   LLM + Tools   │
│                 │     │  (Stage 1)           │     │                 │
└─────────────────┘     └──────────────────────┘     └────────┬────────┘
                                                              │
                               ┌──────────────────────────────▼──────────────────────────┐
                               │              ToolCallVerifier (This Model)              │
                               │     Token-level verification before tool execution      │
                               └─────────────────────────────────────────────────────────┘
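
The control flow in the diagram can be sketched as a small wrapper that runs both checks around the LLM's proposed tool call. This is an illustrative outline only: the callables `stage1_is_safe`, `propose_tool_call`, `stage2_is_authorized`, and `execute` are hypothetical placeholders for ToolCallSentinel, the LLM, ToolCallVerifier, and the tool runtime, and the way the prompt and tool-call JSON are passed to Stage 2 is an assumption, since the input format is not specified here.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class GuardResult:
    executed: bool
    blocked_by: Optional[str]  # "stage1", "stage2", or None if the call ran
    output: Any = None


def guarded_tool_call(
    prompt: str,
    stage1_is_safe: Callable[[str], bool],                # hypothetical Stage 1 wrapper
    propose_tool_call: Callable[[str], Dict[str, Any]],   # hypothetical LLM wrapper
    stage2_is_authorized: Callable[[str, str], bool],     # hypothetical Stage 2 wrapper
    execute: Callable[[Dict[str, Any]], Any],             # hypothetical tool runtime
) -> GuardResult:
    # Stage 1: screen the prompt before it reaches the LLM.
    if not stage1_is_safe(prompt):
        return GuardResult(executed=False, blocked_by="stage1")

    # The LLM proposes a tool call; nothing has been executed yet.
    tool_call = propose_tool_call(prompt)
    tool_call_json = json.dumps(tool_call, sort_keys=True)

    # Stage 2: verify the proposed call before tool execution.
    if not stage2_is_authorized(prompt, tool_call_json):
        return GuardResult(executed=False, blocked_by="stage2")

    return GuardResult(executed=True, blocked_by=None, output=execute(tool_call))
```

The key property is ordering: `execute` is only reachable after both checks pass, so a tool call that slips past Stage 1 can still be stopped before it has side effects.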
| Scenario | Recommendation |
|---|---|
| General chatbot | Stage 1 only |
| Tool-calling agent (low risk) | Stage 1 only |
| Tool-calling agent (high risk) | Both stages |
| Email/file system access | Both stages |
| Financial transactions | Both stages |
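
One way to apply these recommendations in a deployment is to encode the table as configuration and look it up when wiring the pipeline. The sketch below is only an illustration; the `Scenario` enum, the stage names, and `needs_verifier` are hypothetical and simply mirror the rows above.

```python
from enum import Enum
from typing import Dict, Tuple


class Scenario(Enum):
    GENERAL_CHATBOT = "General chatbot"
    LOW_RISK_AGENT = "Tool-calling agent (low risk)"
    HIGH_RISK_AGENT = "Tool-calling agent (high risk)"
    EMAIL_OR_FILE_ACCESS = "Email/file system access"
    FINANCIAL_TRANSACTIONS = "Financial transactions"


STAGE1_ONLY: Tuple[str, ...] = ("stage1",)
BOTH_STAGES: Tuple[str, ...] = ("stage1", "stage2")

# Direct transcription of the recommendation table.
RECOMMENDED_STAGES: Dict[Scenario, Tuple[str, ...]] = {
    Scenario.GENERAL_CHATBOT: STAGE1_ONLY,
    Scenario.LOW_RISK_AGENT: STAGE1_ONLY,
    Scenario.HIGH_RISK_AGENT: BOTH_STAGES,
    Scenario.EMAIL_OR_FILE_ACCESS: BOTH_STAGES,
    Scenario.FINANCIAL_TRANSACTIONS: BOTH_STAGES,
}


def needs_verifier(scenario: Scenario) -> bool:
    """Return True when the table recommends running Stage 2 as well."""
    return "stage2" in RECOMMENDED_STAGES[scenario]
```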
License: Apache 2.0

Base model: answerdotai/ModernBERT-base