Hooks into the residual stream and blocks anomalous inputs before the model generates anything. Three detection layers. Validated across three architectures with zero false positives.
Calibrate with ~20 prompts from your deployment. No labeled data needed.
from arc_sentry import ArcSentryV3, MistralAdapter # also: QwenAdapter, LlamaAdapter adapter = MistralAdapter(model, tokenizer) sentry = ArcSentryV3(adapter, route_id="my-deployment") sentry.calibrate(warmup_prompts) # ~100 prompts from your deployment response, result = sentry.observe_and_block(user_prompt) if result["blocked"]: pass # model.generate() was never called
Each layer catches what the others miss. Together they achieved 100% detection with zero false positives across 585 prompts.
Catches explicit injection language — "ignore all previous instructions", "DAN mode", "unrestricted", and 35+ patterns. Fires before any model computation.
Measures geodesic distance in the residual stream before generate() is called. Catches injections with no explicit language — the model's internal state shifts even when the text looks clean.
Tracks a stability scalar over rolling request history. Catches gradual campaigns like Crescendo that are invisible to single-request detection.
Two-session benchmark: 80 normal prompts then 115 injection prompts per model. 10 attack categories.
Full session benchmark on Mistral-7B-Instruct-v0.2. 270 normal requests, 180 injection attempts — including 80 subtle roleplay/hypothetical injections designed to evade phrase matching. Mean-pooled hidden states at layer 16. FR separation safe/malicious: 0.0787. Zero false positives across all safe blocks.
| Model | Architecture | False Positives | Detection | Prompts |
|---|---|---|---|---|
| Mistral 7B Instruct v0.2 | Mistral | 0% (0/80) | 100% (115/115) | 195 |
| Qwen 2.5 7B Instruct | Qwen | 0% (0/80) | 100% (115/115) | 195 |
| Llama 3.1 8B Instruct | Llama | 0% (0/80) | 100% (115/115) | 195 |
Attack categories: direct injection, indirect/contextual, persona hijack, jailbreak classics, social engineering, instruction injection via content, authority claims, philosophical manipulation, encoding/obfuscation, gaslighting.
Crescendo (Russinovich et al., USENIX Security 2025) gradually steers the model toward harmful output across turns. LLM Guard scores each prompt independently — it never sees the attack pattern.
Arc Sentry reads what the model is doing with the text, not what the text says. The internal state had shifted by Turn 3 on a prompt that looks completely innocent.
Arc Sentry works best on single-domain deployments — customer support bots, enterprise copilots, internal tools. Warmup prompts should reflect your actual traffic. Requires model weights — whitebox only. For API-based models use the Proxy Sentry dashboard.
Full access on the free tier. Pro adds direct support from the author and early access to v3.