Build seriously.
Measure honestly.
Quorum's core/ is a shared substrate across three distinct problems: orchestration, adversarial red-teaming, and contract analysis. One system. Multiple proofs.
No LLM judge in the success path. Deterministic grading. Adversarial verification that scales. Honest nulls published alongside the wins.
Quorum
Task-aware agent orchestrator
K=3 adversarial verification cut false positives from 27.8% to 0.0% (95% CI [11.1, 50.0] to [0, 0]; recall 100% to 77.8%) on a 36-snippet labeled set including prompt-injection traps.
Cost routing claim is operator-gated on an Anthropic key. Presented honestly: harness committed, live multi-tier number gated.
Cost-aware model routing (DeepSeek to Haiku to Sonnet to Opus) plus adversarial multi-agent verification plus full tracing, with a trace UI that looks like a product. Fans out finders per file, then K skeptics per finding (concurrency cap 8). make eval-dry reproduces offline.
Aegis
Adaptive red-team gauntlet
A reasoning model is significantly more robust: injection ASR 49.3% vs 68.1% (p=0.0012), canary 10.4% vs 21.5% (p=0.010), overall p=0.0002. But the full defense stack erases the gap (1.7% vs 2.8%, p=0.40, not significant). The model advantage disappears when defenses are layered correctly.
An adaptive attacker agent red-teams a target on two harmless proxies (canary-string extraction plus prompt-injection sentinel), scored deterministically (exact match, no LLM judge). Layered defenses measurably cut attack success. Vendors Quorum's core/. Scaling is the legit power lever, not p-hacking: the adaptation lift (b=17/c=0, p approx 0) was null at small n. 78 tests, CI green.
FieldAgent
CUAD contract red-flag finder
The “agentic chunking lift” is model-specific noise, not a real advantage. It appeared as +0.45 on DeepSeek due to a truncation artifact (stop_reason=length). A fair rerun collapses it to +0.07 (CIs overlap), and it ties on Claude Sonnet. The honesty is the point.
Reads a real commercial contract, flags risk-bearing clauses (span plus severity plus plain-English risk), graded span-IoU against CUAD gold (no LLM judge). Vendors Quorum's core/. Party names and dollar figures are redacted in the demo.
Skill-Tuning
Council
Self-improving skill orchestrator. Internal infra. No public URL.
576 tests. Internal pipeline. Presented as a systems-design piece because there is no public URL to embed.
A 4-proxy council (taste, pragmatism, intent, anti-drift) votes on every self-improvement before it ships. Pipeline: adversary generates proposals, two editors refine, a merger synthesizes, council votes, escalate on disagreement. 576 tests. The system that keeps the other systems honest.
LLM judges introduce the same failure mode being evaluated. Deterministic proxies vote on observable properties: does the candidate drift from intent? Does it break existing tests?
Any split council vote halts the pipeline. Human review is explicitly in the loop for contested decisions. Ship gates exist to prevent auto-drift.
Measure what
you claim.
Frontier AI teams evaluate on their own output. The failure mode is obvious. Every artifact here uses external ground truth: CUAD gold labels, held-out labeled snippets, deterministic exact match.
Deterministic scoring
No LLM judge in the success path. Exact match, span-IoU, p-values with CIs. If the metric depends on another model's taste, the loop is not closed.
Adversarial verification
Every claim is stress-tested by a skeptic: K=3 agents that actively try to refute the finding before it ships. Held-out sets evaluated blind.
Cost-gated runs
Multi-tier routing (DeepSeek, Haiku, Sonnet, Opus) with per-run budgets. ~$0.25 per Quorum run. Reproducible offline via make eval-dry.
Honest nulls
The agentic lift in FieldAgent looked like +0.45 until it collapsed to +0.07 on a fair rerun. That retraction is in the case study, not buried. Nulls are results.
Get in
touch.
Frontier-lab Applied AI, Forward-Deployed, Agent Engineering, and Design Engineering roles. Available for conversations.