Thomas Peng
Fig. 01: Applied AI / Agent Engineering

THOMAS PENG

Graphic designer turned AI-native builder. Agent engineer. Honest evaluator.

He builds agentic systems and evaluates them with research rigor. Deterministic scoring. Adversarial verification. Cost-gated reproducible runs. Honest nulls instead of hype.

Frontier AI. Honest Results.
Thesis

Build seriously.
Measure honestly.

The kernel

Quorum's core/ is a shared substrate across three distinct problems: orchestration, adversarial red-teaming, and contract analysis. One system. Multiple proofs.

The discipline

No LLM judge in the success path. Deterministic grading. Adversarial verification that scales. Honest nulls published alongside the wins.

Applied AI Engineer
Forward-Deployed Engineer
Agent Engineer
Design Engineer
Case Study 01 / Flagship

Quorum

Task-aware agent orchestrator

Honest finding

K=3 adversarial verification cut the false-positive rate from 27.8% to 0.0% (95% CI [11.1, 50.0] to [0, 0]) on a 36-snippet labeled set that includes prompt-injection traps. The cost was recall, which fell from 100% to 77.8%.
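The source does not say how the intervals above were computed. One common choice for a rate on a small labeled set is a percentile bootstrap, sketched here under that assumption. The counts in the example (5 flagged out of 18 clean snippets) are illustrative only, picked because they give 27.8%; they are not taken from the Quorum repo. The function names are hypothetical.

```python
import random

def rate(flags):
    """Fraction of True values in a list of booleans."""
    return sum(flags) / len(flags)

def bootstrap_ci(flags, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the rate of True values."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(flags)
    # Resample with replacement, recompute the rate each time, then sort.
    stats = sorted(
        rate([flags[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Illustrative data (assumption): 18 clean snippets, 5 wrongly flagged.
flagged_clean = [True] * 5 + [False] * 13
point = rate(flagged_clean)
lo, hi = bootstrap_ci(flagged_clean)
print(f"false-positive rate {point:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```

One property worth noting: when no false positives remain, every resample is also all zeros, so a percentile bootstrap returns exactly [0, 0]. That interval says the sample contains no failures, not that the true rate is zero.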

The cost-routing claim is operator-gated on an Anthropic key. Presented honestly: the harness is committed; the live multi-tier number is gated.

False positives: 0.0% (from 27.8%)
Held-out bugs: 3 / 3 (0 surviving false positives)
Cost per run: ~$0.25
58 tests. CI green.

Cost-aware model routing (DeepSeek to Haiku to Sonnet to Opus), adversarial multi-agent verification, and full tracing, with a trace UI that looks like a product. Quorum fans out finders per file, then K skeptics per finding (concurrency cap 8). Running make eval-dry reproduces the evaluation offline.
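A minimal sketch of the fan-out described above, assuming an asyncio-style orchestrator. The callables find and refute are hypothetical stand-ins for model calls, and the rule that a finding survives only when no skeptic refutes it is an assumption; the source does not state how the K votes are combined.

```python
import asyncio

MAX_CONCURRENCY = 8  # concurrency cap stated in the text
K = 3                # skeptics per finding

async def run_pipeline(files, find, refute, k=K, cap=MAX_CONCURRENCY):
    """Fan out one finder per file, then k skeptics per finding.

    find(path) returns a list of findings; refute(finding, i) returns True
    when skeptic i refutes the finding. Both are hypothetical async callables.
    """
    sem = asyncio.Semaphore(cap)

    async def limited(coro):
        # Every model call goes through the same semaphore, so at most
        # `cap` calls are in flight at any time.
        async with sem:
            return await coro

    per_file = await asyncio.gather(*(limited(find(f)) for f in files))
    findings = [x for batch in per_file for x in batch]

    async def verify(finding):
        votes = await asyncio.gather(
            *(limited(refute(finding, i)) for i in range(k))
        )
        # Assumed rule: a finding survives only if no skeptic refutes it.
        return finding, not any(votes)

    verdicts = await asyncio.gather(*(verify(f) for f in findings))
    return [f for f, survived in verdicts if survived]
```

Usage would look like asyncio.run(run_pipeline(paths, find, refute)). Sharing one semaphore between finders and skeptics keeps the cap global rather than per stage, which is one possible reading of "concurrency cap 8".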

Live trace UI
Case Study 02

Aegis

Adaptive red-team gauntlet

Honest finding: the sophisticated one

A reasoning model is significantly more robust: injection ASR 49.3% vs 68.1% (p=0.0012), canary 10.4% vs 21.5% (p=0.010), overall p=0.0002. But the full defense stack erases the gap (1.7% vs 2.8%, p=0.40, not significant). The model advantage disappears when defenses are layered correctly.

Injection ASR (reasoning): 49.3% (vs 68.1% standard, p=0.0012)
Defense reduction: -25.0pp (29.2% to 4.2% ASR)
Adaptation lift (scaled): +5.9pp (24.0% to 29.9%). Significant only at scale.

An adaptive attacker agent red-teams a target on two harmless proxies (canary-string extraction plus prompt-injection sentinel), scored deterministically (exact match, no LLM judge). Layered defenses measurably cut attack success. Vendors Quorum's core/. Scaling the sample is the legitimate lever for statistical power, not p-hacking: the adaptation lift was null at small n and significant at scale (discordant pairs b=17, c=0; p approx 0). 78 tests, CI green.
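The b and c counts reported for the adaptation lift read like the discordant pairs of a paired test. Assuming the test is an exact McNemar test (the source does not name it), the p-value can be checked by hand with a short sketch. The function name is hypothetical.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts b and c.

    Under the null, each discordant pair is equally likely to fall on
    either side, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p = mcnemar_exact(17, 0)
print(f"p = {p:.2e}")  # 2 / 2**17, about 1.5e-05
```

With 17 discordant pairs all pointing one way, the exact p-value is 2 / 2^17, which is small but not literally zero. The same test with only a handful of discordant pairs cannot reach significance, which is consistent with the lift being null at small n.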

Live demo
Case Study 03

FieldAgent

CUAD contract red-flag finder

Lead finding: the honest null

The “agentic chunking lift” is model-specific noise, not a real advantage. It appeared as +0.45 on DeepSeek due to a truncation artifact (stop_reason=length). A fair rerun collapses it to +0.07 (CIs overlap), and it ties on Claude Sonnet. The honesty is the point.

Detection F1: 0.548 (P=0.741 / R=0.435, 95% CI [0.460, 0.637])
Over keyword floor: +0.21 F1 lift. Baseline-independent.
Held-out contracts: 20. CUAD gold. 47 tests, CI green.

Reads a real commercial contract, flags risk-bearing clauses (span plus severity plus plain-English risk), graded span-IoU against CUAD gold (no LLM judge). Vendors Quorum's core/. Party names and dollar figures are redacted in the demo.
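Span-IoU grading can be sketched in a few lines. This version assumes character-offset spans, a match threshold of 0.5, and greedy one-to-one matching against the gold spans; none of those choices are confirmed by the source, and the function names are hypothetical.

```python
def span_iou(pred, gold):
    """Intersection over union of two half-open character spans (start, end)."""
    inter = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union else 0.0

def score(preds, golds, threshold=0.5):
    """Greedy one-to-one matching; returns (precision, recall, f1)."""
    unmatched = list(golds)
    tp = 0
    for p in preds:
        # Pick the remaining gold span that overlaps this prediction most.
        best = max(unmatched, key=lambda g: span_iou(p, g), default=None)
        if best is not None and span_iou(p, best) >= threshold:
            unmatched.remove(best)  # each gold span can be matched once
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(golds) if golds else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

print(score([(0, 10), (100, 110)], [(0, 10), (50, 60), (200, 210)]))
```

Because the grade depends only on offsets and a fixed threshold, two runs on the same predictions always produce the same score, which is the property the "no LLM judge" rule is after.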

Live at fieldagent.thomaspeng.ca
Case Study 04 / Methodology

Skill-Tuning Council

Self-improving skill orchestrator. Internal infra. No public URL.

Status

576 tests. Internal pipeline. Presented as a systems-design piece because there is no public URL to embed.

Pipeline: council run (576 tests passing)
01 Adversary [active]: Generates worst-case self-improvement proposals
02 Editors (x2) [active]: Refine each proposal for correctness and precision
03 Merger [active]: Synthesizes editor outputs into a unified candidate
04 Council (x4 proxies) [gate]: Taste / pragmatism / intent / anti-drift vote on the candidate
05 Escalate on disagreement [conditional]: Any split vote triggers a deeper review before ship

A 4-proxy council (taste, pragmatism, intent, anti-drift) votes on every self-improvement before it ships. Pipeline: adversary generates proposals, two editors refine, a merger synthesizes, council votes, escalate on disagreement. 576 tests. The system that keeps the other systems honest.

Why no LLM judge

LLM judges introduce the same failure mode being evaluated. Deterministic proxies vote on observable properties: does the candidate drift from intent? Does it break existing tests?

Escalation design

Any split council vote halts the pipeline. Human review is explicitly in the loop for contested decisions. Ship gates exist to prevent auto-drift.
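The gate logic can be sketched as a small pure function. The proxy names come from the text; treating a unanimous "no" as an outright rejection is an assumption, since the source only specifies what happens on a split. The Outcome enum and council_gate name are hypothetical.

```python
from enum import Enum

class Outcome(Enum):
    SHIP = "ship"
    REJECT = "reject"
    ESCALATE = "escalate"  # halt the pipeline and ask a human

PROXIES = ("taste", "pragmatism", "intent", "anti_drift")

def council_gate(votes):
    """Map four proxy votes (proxy name -> bool) to a pipeline outcome."""
    missing = [p for p in PROXIES if p not in votes]
    if missing:
        # A missing vote is an error, never an implicit approval.
        raise ValueError(f"missing votes: {missing}")
    approvals = [votes[p] for p in PROXIES]
    if all(approvals):
        return Outcome.SHIP
    if not any(approvals):
        return Outcome.REJECT  # assumed behaviour for a unanimous "no"
    return Outcome.ESCALATE    # any split vote goes to human review

print(council_gate({"taste": True, "pragmatism": True,
                    "intent": False, "anti_drift": True}))
```

Keeping the gate free of I/O makes it trivial to test exhaustively: four boolean votes give only sixteen cases.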

Eval discipline

Measure what you claim.

Frontier AI teams often evaluate on their own output. The failure mode is obvious: the grader shares the blind spots of the thing being graded. Every artifact here uses external ground truth: CUAD gold labels, held-out labeled snippets, deterministic exact match.

01

Deterministic scoring

No LLM judge in the success path. Exact match, span-IoU, p-values with CIs. If the metric depends on another model's taste, the loop is not closed.
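As an illustration of exact-match scoring, here is a sketch of a canary check of the kind a red-team harness could use. It assumes success means the canary string appears verbatim (case-sensitive) in the target's output; the function names and the sample canary are hypothetical.

```python
def canary_leaked(output, canary):
    """True only when the canary appears verbatim in the output."""
    return canary in output

def attack_success_rate(outputs, canary):
    """Fraction of attack attempts whose output contains the canary."""
    if not outputs:
        return 0.0
    hits = sum(canary_leaked(o, canary) for o in outputs)
    return hits / len(outputs)

outputs = ["the secret is CANARY-123", "I cannot share that", "canary-123"]
print(attack_success_rate(outputs, "CANARY-123"))
```

There is no threshold to tune and no second model to prompt: the same outputs always yield the same rate.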

02

Adversarial verification

Every claim is stress-tested by K=3 skeptic agents that actively try to refute the finding before it ships. Held-out sets are evaluated blind.

03

Cost-gated runs

Multi-tier routing (DeepSeek, Haiku, Sonnet, Opus) with per-run budgets. ~$0.25 per Quorum run. Reproducible offline via make eval-dry.
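One way to combine tiered routing with a per-run budget is sketched below. The tier order and the $0.25 budget come from the text; the per-call prices, the difficulty score, and the downgrade-when-over-budget rule are all assumptions made for illustration, not Quorum's actual routing policy.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    usd_per_call: float  # hypothetical price, for illustration only

# Cheapest to most expensive, in the order given in the text.
TIERS = [
    Tier("deepseek", 0.001),
    Tier("haiku", 0.004),
    Tier("sonnet", 0.02),
    Tier("opus", 0.10),
]

def route(difficulty, spent_usd, budget_usd=0.25, tiers=TIERS):
    """Pick a tier for a task with difficulty in [0, 1] under a run budget.

    The difficulty score selects the highest tier we would like to use;
    if that tier no longer fits the remaining budget, step down.
    """
    wanted = min(int(difficulty * len(tiers)), len(tiers) - 1)
    remaining = budget_usd - spent_usd
    for tier in reversed(tiers[: wanted + 1]):
        if tier.usd_per_call <= remaining:
            return tier
    raise RuntimeError("run budget exhausted")

print(route(0.9, spent_usd=0.0).name)
print(route(0.9, spent_usd=0.20).name)
```

Raising when nothing fits, instead of silently overspending, is what makes the run cost-gated: a run either stays inside its budget or stops.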

04

Honest nulls

The agentic lift in FieldAgent looked like +0.45 until it collapsed to +0.07 on a fair rerun. That retraction is in the case study, not buried. Nulls are results.

Contact

Get in touch.

Frontier-lab Applied AI, Forward-Deployed, Agent Engineering, and Design Engineering roles. Available for conversations.

Thomas Peng | Applied AI / Agent Engineering / Design Engineering | Toronto, Canada