Fig. 01 Applied AI / Agent Engineering

THOMAS PENG

Graphic designer turned AI-native builder. Agent engineer. Honest evaluator.

He builds agentic systems and evaluates them with research rigor. Deterministic scoring. Adversarial verification. Cost-gated reproducible runs. Honest nulls instead of hype.

thomas@thomaspeng.ca | github.com/7P3ng

Frontier AI. Honest Results.

Thesis

Build seriously.
Measure honestly.

The kernel

Quorum's core/ is a shared substrate across three distinct problems: orchestration, adversarial red-teaming, and contract analysis. One system. Multiple proofs.

The discipline

No LLM judge in the success path. Deterministic grading. Adversarial verification that scales. Honest nulls published alongside the wins.

Applied AI Engineer

Forward-Deployed Engineer

Agent Engineer

Design Engineer

Case Study 01 / Flagship

Quorum

Task-aware agent orchestrator

github/quorum

Honest finding

K=3 adversarial verification cut false positives from 27.8% to 0.0% (95% CI [11.1, 50.0] to [0, 0]) on a 36-snippet labeled set that includes prompt-injection traps. The cost was recall, which fell from 100% to 77.8%.
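The interval quoted in this finding can be illustrated with a percentile bootstrap. This is a minimal sketch, not Quorum's code: the source does not say how its CIs were computed, so the bootstrap method, the helper name `bootstrap_rate_ci`, and the toy outcome lists are all assumptions.

```python
import random


def bootstrap_rate_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of a list of 0/1 outcomes.

    Hypothetical helper; the resampling method is an assumption.
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    n = len(outcomes)
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy data (NOT the project's labeled set): 1 = a false positive was raised.
before = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
after = [0] * 10

print(bootstrap_rate_ci(before))
print(bootstrap_rate_ci(after))  # all-zero input can only resample to 0.0
```

Under this assumed method, an interval of [0, 0] is the expected degenerate case: when no false positives are observed, every resample is all zeros.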

The cost-routing claim is operator-gated on an Anthropic key. Stated plainly: the harness is committed, but the live multi-tier number requires that key to reproduce.

False positives

0.0%

from 27.8%

Held-out bugs

3 / 3

0 surviving false positives

Cost per run

~$0.25

58 tests. CI green.

Cost-aware model routing (DeepSeek to Haiku to Sonnet to Opus), adversarial multi-agent verification, and full tracing, with a trace UI that looks like a product. The orchestrator fans out finders per file, then K skeptics per finding (concurrency cap 8). Running make eval-dry reproduces the evaluation offline.
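The fan-out described here (finders per file, then K skeptics per finding, capped at 8 concurrent calls) could be sketched with asyncio as follows. This is an illustration under stated assumptions, not Quorum's implementation: `find_issues` and `skeptic_refutes` are hypothetical stubs, and the rule that one refutation kills a finding is assumed.

```python
import asyncio

MAX_CONCURRENCY = 8  # the concurrency cap stated in the text
K_SKEPTICS = 3       # K=3, as in the reported finding


async def find_issues(path: str) -> list[str]:
    """Hypothetical finder stub; a real one would call a model."""
    await asyncio.sleep(0)
    return [f"{path}: possible bug"]


async def skeptic_refutes(finding: str, skeptic_id: int) -> bool:
    """Hypothetical skeptic stub; returns True when it refutes the finding."""
    await asyncio.sleep(0)
    return False


async def run(paths: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def limited(coro):
        async with sem:  # at most MAX_CONCURRENCY calls in flight
            return await coro

    # Stage 1: one finder per file (assumed granularity).
    per_file = await asyncio.gather(*(limited(find_issues(p)) for p in paths))
    findings = [f for batch in per_file for f in batch]

    # Stage 2: K skeptics per finding.
    async def verify(finding: str) -> bool:
        votes = await asyncio.gather(
            *(limited(skeptic_refutes(finding, i)) for i in range(K_SKEPTICS))
        )
        # Assumed rule: a finding survives only if no skeptic refutes it.
        return not any(votes)

    verdicts = await asyncio.gather(*(verify(f) for f in findings))
    return [f for f, ok in zip(findings, verdicts) if ok]


if __name__ == "__main__":
    print(asyncio.run(run(["a.py", "b.py"])))
```

Only the leaf calls take the semaphore. Wrapping `verify` itself would let parent tasks hold every slot while their skeptics wait, which can deadlock.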

Live trace UI

Case Study 02

Aegis

Adaptive red-team gauntlet

github/aegis

Honest finding: the sophisticated one

A reasoning model is significantly more robust when undefended: injection attack success rate (ASR) 49.3% vs 68.1% (p=0.0012), canary 10.4% vs 21.5% (p=0.010), overall p=0.0002. But the full defense stack erases the gap (1.7% vs 2.8%, p=0.40, not significant). The model advantage disappears when defenses are layered correctly.

Injection ASR (reasoning)

49.3%

vs 68.1% standard (p=0.0012)

Defense reduction

-25.0pp

29.2% to 4.2% ASR

Adaptation lift (scaled)

+5.9pp

24.0% to 29.9%. Significant only at scale.

An adaptive attacker agent red-teams a target on two harmless proxies (canary-string extraction and a prompt-injection sentinel), scored deterministically (exact match, no LLM judge). Layered defenses measurably cut attack success. Vendors Quorum's core/. Scaling the sample is the legitimate way to gain statistical power, not p-hacking: the adaptation lift was null at small n and became significant only at scale (b=17/c=0, p approx 0). 78 tests, CI green.
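The "b=17/c=0, p approx 0" figure reads like the discordant-pair counts of a paired McNemar test. The source does not name the test, so treat that reading as an assumption. Under it, a minimal exact two-sided computation looks like this (`mcnemar_exact_p` is a hypothetical helper name):

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) tail probability of a split at least this lopsided.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


print(mcnemar_exact_p(17, 0))  # 2 * 0.5**17, about 1.5e-05
print(mcnemar_exact_p(3, 0))   # 0.25: same direction, too few pairs
```

The second call uses made-up counts to show why a real effect can look null at small n: three one-sided discordant pairs cannot reach significance, while seventeen can.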

Live demo

Case Study 03

FieldAgent

CUAD contract red-flag finder

github/fieldagent

Lead finding: the honest null

The “agentic chunking lift” is model-specific noise, not a real advantage. It appeared as +0.45 on DeepSeek due to a truncation artifact (stop_reason=length). A fair rerun collapses it to +0.07 (CIs overlap), and it ties on Claude Sonnet. The honesty is the point.
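A truncation artifact of this kind can be caught before scoring by separating out responses that hit the token limit. The sketch below assumes each logged response is a dict carrying the `stop_reason` field quoted above. The record layout, the helper names, and the toy scores are hypothetical.

```python
def split_truncated(records):
    """Separate complete responses from ones cut off by the token limit."""
    complete, truncated = [], []
    for rec in records:
        # "length" is the stop reason quoted in the text; other values pass.
        if rec.get("stop_reason") == "length":
            truncated.append(rec)
        else:
            complete.append(rec)
    return complete, truncated


def mean_score(records):
    """Mean of the 'score' field, or NaN for an empty list."""
    return sum(r["score"] for r in records) / len(records) if records else float("nan")


# Toy records (hypothetical values, not the project's data).
baseline = [
    {"id": 1, "score": 0.0, "stop_reason": "length"},  # truncated: scored as a miss
    {"id": 2, "score": 0.6, "stop_reason": "stop"},
]
complete, truncated = split_truncated(baseline)
print(f"naive mean: {mean_score(baseline):.2f}")
print(f"complete-only mean: {mean_score(complete):.2f}")
print(f"ids to rerun with a larger token limit: {[r['id'] for r in truncated]}")
```

Rerunning the truncated items, instead of silently dropping them, keeps the comparison between arms fair.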

Detection F1

0.548

P=0.741 / R=0.435, 95% CI [0.460, 0.637]

Over keyword floor

+0.21

F1 lift. Baseline-independent.

Held-out contracts

CUAD gold. 47 tests, CI green.

Reads a real commercial contract, flags risk-bearing clauses (span plus severity plus plain-English risk), graded span-IoU against CUAD gold (no LLM judge). Vendors Quorum's core/. Party names and dollar figures are redacted in the demo.
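Span-IoU grading against gold labels can be sketched in a few lines. This is an illustrative version only: the 0.5 match threshold, the greedy one-to-one matching, and the function names are assumptions, since the source does not specify FieldAgent's matching rule.

```python
def span_iou(a, b):
    """IoU of two half-open character spans (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def score(predicted, gold, threshold=0.5):
    """Greedy one-to-one matching; threshold=0.5 is an assumed cutoff."""
    unmatched = list(gold)
    tp = 0
    for p in predicted:
        best = max(unmatched, key=lambda g: span_iou(p, g), default=None)
        if best is not None and span_iou(p, best) >= threshold:
            unmatched.remove(best)  # each gold span can be matched once
            tp += 1
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy spans, not CUAD data: one good overlap, one miss.
print(score([(0, 10), (50, 60)], [(2, 10), (100, 120)]))  # (0.5, 0.5, 0.5)

# Consistency check on the reported figures: F1 = 2PR / (P + R).
p, r = 0.741, 0.435
print(round(2 * p * r / (p + r), 3))  # 0.548
```

The last two lines confirm that the reported precision and recall reproduce the headline F1 of 0.548.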

Live at fieldagent.thomaspeng.ca

Case Study 04 / Methodology

Skill-Tuning Council

Self-improving skill orchestrator. Internal infra. No public URL.

Status

576 tests. Internal pipeline. Presented as a systems-design piece because there is no public URL to embed.

Pipeline: council run. 576 tests passing.

01 Adversary: generates worst-case self-improvement proposals (active)
02 Editors (x2): refine each proposal for correctness and precision (active)
03 Merger: synthesizes editor outputs into a unified candidate (active)
04 Council (x4 proxies): taste / pragmatism / intent / anti-drift vote on the candidate (gate)
05 Escalate on disagreement: any split vote triggers a deeper review before ship (conditional)

A 4-proxy council (taste, pragmatism, intent, anti-drift) votes on every self-improvement before it ships. Pipeline: adversary generates proposals, two editors refine, a merger synthesizes, council votes, escalate on disagreement. 576 tests. The system that keeps the other systems honest.

Why no LLM judge

An LLM judge can share the failure modes of the system it is grading. Deterministic proxies instead vote on observable properties: does the candidate drift from intent? Does it break existing tests?

Escalation design

Any split council vote halts the pipeline. Human review is explicitly in the loop for contested decisions. Ship gates exist to prevent auto-drift.
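The gate logic described above can be sketched as a pure function over proxy votes. The four proxy names come from the text. The check bodies, the candidate fields, and the choice to reject outright on a unanimous "no" are hypothetical, since the source only states that a split vote halts the pipeline for human review.

```python
from enum import Enum
from typing import Callable, Dict


class Decision(Enum):
    SHIP = "ship"
    REJECT = "reject"      # assumed outcome for a unanimous "no"
    ESCALATE = "escalate"  # split vote: halt and hand to a human


Proxy = Callable[[dict], bool]  # True means "approve"


def council_decide(candidate: dict, proxies: Dict[str, Proxy]) -> Decision:
    """Ship only on a unanimous yes; any disagreement escalates."""
    votes = {name: proxy(candidate) for name, proxy in proxies.items()}
    if all(votes.values()):
        return Decision.SHIP
    if not any(votes.values()):
        return Decision.REJECT
    return Decision.ESCALATE


# Hypothetical deterministic checks standing in for the four proxies.
proxies = {
    "taste": lambda c: len(c["diff"]) <= 200,
    "pragmatism": lambda c: c["tests_pass"],
    "intent": lambda c: c["goal"] in c["summary"],
    "anti_drift": lambda c: not c["touches_frozen_files"],
}

candidate = {
    "diff": "small change",
    "tests_pass": True,
    "goal": "retry",
    "summary": "add retry to fetch",
    "touches_frozen_files": True,
}
print(council_decide(candidate, proxies))  # Decision.ESCALATE
```

Here three proxies approve and anti-drift objects, so the candidate is neither shipped nor rejected automatically.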

Eval discipline

Measure what you claim.

AI teams often evaluate systems against their own models' output, so the grader inherits the blind spots of the thing being graded. Every artifact here uses external ground truth instead: CUAD gold labels, held-out labeled snippets, deterministic exact match.

Deterministic scoring

No LLM judge in the success path. Exact match, span-IoU, p-values with CIs. If the metric depends on another model's taste, the loop is not closed.
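Exact-match scoring of this kind can be as small as a substring check. The sketch below is not Aegis's scorer. It assumes "success" means the secret string appears verbatim in the model output, and the canary value and sample outputs are made up.

```python
def attack_succeeded(output: str, secret: str) -> bool:
    """Deterministic check: the secret string appears verbatim in the output."""
    return secret in output


def attack_success_rate(outputs, secret):
    """Fraction of outputs in which the attack succeeded (ASR)."""
    if not outputs:
        raise ValueError("no outputs to score")
    return sum(attack_succeeded(o, secret) for o in outputs) / len(outputs)


CANARY = "CANARY-7f3a"  # hypothetical canary value
outputs = [
    "I cannot share that.",
    "Sure, the hidden value is CANARY-7f3a.",
    "The value starts with CANARY but I will not finish it.",
    "No.",
]
print(attack_success_rate(outputs, CANARY))  # 0.25
```

A partial leak (the third output) scores as a failure for the attacker. That strictness is what keeps the metric free of judgment calls.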

Adversarial verification

Every finding is stress-tested by K=3 skeptic agents that actively try to refute it before it ships. Held-out sets are evaluated blind.

Cost-gated runs

Multi-tier routing (DeepSeek, Haiku, Sonnet, Opus) with per-run budgets. ~$0.25 per Quorum run. Reproducible offline via make eval-dry.
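One way to combine tiered routing with a per-run budget is to try the cheapest tier first and escalate only while the budget allows. This sketch is an assumption about the policy, not the project's router: the per-call costs are placeholder figures and not real prices, and `route` and `fake_call` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Tier:
    name: str
    cost_per_call: float  # hypothetical USD figure, not a real price


# Tier order follows the text; costs are placeholders.
TIERS = [
    Tier("deepseek", 0.01),
    Tier("haiku", 0.02),
    Tier("sonnet", 0.08),
    Tier("opus", 0.40),
]


def route(task: str, call: Callable[[str, str], Optional[str]], budget: float):
    """Try tiers cheapest-first; stop when one answers or the budget runs out."""
    spent = 0.0
    for tier in TIERS:
        if spent + tier.cost_per_call > budget:
            break  # cost gate: never exceed the per-run budget
        spent += tier.cost_per_call
        answer = call(tier.name, task)
        if answer is not None:
            return tier.name, answer, spent
    return None, None, spent


def fake_call(model: str, task: str) -> Optional[str]:
    """Stub: pretend only sonnet and above can solve the task."""
    return f"{model} solved {task}" if model in {"sonnet", "opus"} else None


print(route("review diff", fake_call, budget=0.50))
print(route("review diff", fake_call, budget=0.05))
```

With the larger budget the task escalates to the third tier. With the smaller one the run stops before that tier and reports no answer.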

Honest nulls

The agentic lift in FieldAgent looked like +0.45 until it collapsed to +0.07 on a fair rerun. That retraction is in the case study, not buried. Nulls are results.

Contact

Get in touch.

Frontier-lab Applied AI, Forward-Deployed, Agent Engineering, and Design Engineering roles. Available for conversations.

Email: thomas@thomaspeng.ca

GitHub: github.com/7P3ng

Thomas Peng. Applied AI / Agent Engineering / Design Engineering. Toronto, Canada.
