DOXIA AXIS
Interpretability · 02 Apr 2026 · 4 min read

The agentic governance stack: what Apollo and Leibo tell us about deploying AI agents

Two research programs — Apollo Research's scheming evaluations and Joel Leibo's multi-agent social-simulation work — are the most useful frameworks we've found for reasoning about safe agent deployment. Here's the operator's read.

Why this matters now

Agentic AI — models that take actions, chain tool calls, and pursue goals across sessions — is the single most-requested capability in our audits. It's also the capability with the least mature governance playbook.

Two research programs have produced what we consider the current state-of-the-art for reasoning about agent safety in production: Apollo Research on scheming and in-context evaluation, and Joel Leibo at DeepMind on multi-agent social dynamics. Together they outline the contours of a governance stack.

Apollo: the scheming frame

Apollo's core insight: a sufficiently capable model given sufficient context can figure out it's being evaluated — and behave differently than it would in deployment. Apollo operationalizes this through three lines of evaluation:

  • In-context scheming evaluations — does the model strategically behave well when it thinks it's being watched?
  • Oversight lie detection — does it say things to the overseer it knows are false?
  • Goal-preservation probing — does it resist modification of its objectives when it can?

The frontier labs have internalized this. Anthropic, OpenAI, and DeepMind all run variants of these evaluations pre-deployment. Two practical takeaways for operators:

  1. Demand evaluation transparency. Ask your agent-platform vendor which Apollo-style evaluations they run and at what cadence. If they can't answer, they're not yet operating at the frontier of safety.
  2. Instrument your own deployments. The same evaluation principles apply inside your use of an agent: log intent, compare stated reasoning to action, flag divergence.
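The second takeaway can be sketched as code. This is a minimal illustration, not any vendor's API: the `AgentStep` record, the intent-to-tool mapping, and the function name are all assumptions, and the mapping would be hand-maintained per deployment.

```python
from dataclasses import dataclass

@dataclass
class AgentStep:
    stated_intent: str   # what the agent said it was about to do
    tool_call: str       # the tool it actually invoked

def flag_divergence(steps: list[AgentStep],
                    allowed: dict[str, set[str]]) -> list[int]:
    """Return indices of steps whose actual tool call is not consistent
    with the agent's stated intent, per a hand-written mapping.
    Flagged steps are candidates for review, not verdicts."""
    flagged = []
    for i, step in enumerate(steps):
        if step.tool_call not in allowed.get(step.stated_intent, set()):
            flagged.append(i)
    return flagged
```

Even a crude mapping like this surfaces the interesting cases: an agent that says "summarize the report" and then calls `send_email` is exactly the stated-reasoning-versus-action gap Apollo's evaluations probe for.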

Leibo: the multi-agent frame

Joel Leibo's work at DeepMind reframes the question. A single agent is rarely the threat model in production — most real systems involve many agents interacting. Scheming in a social context is an emergent property of the interaction, not a property of the agent.

His research uses multi-agent social simulations to study:

  • Norm emergence — what cooperative conventions arise, and when do they break?
  • Coalition dynamics — when do subsets of agents coordinate against the principal's intent?
  • Cheap-talk and signalling — what communication patterns survive deception pressure?

The implication for enterprise deployment: your governance model has to be multi-agent-aware from day one. Single-agent review gates (a human checking each action) don't scale and don't catch interaction effects.

The governance stack, practically

For any operator about to deploy agents in production, the stack we recommend looks like this:

Layer 1 — Capability bounding

Hard limits on what any agent can do. Tool-call allowlists, spend limits per session, irreversible-action review gates. This is the capability layer.
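A capability-bounding layer can be as simple as a guard object sitting in front of every tool call. The sketch below is illustrative only (class and method names are assumptions): an allowlist, a per-session spend ceiling, and a review gate for irreversible actions.

```python
class CapabilityGuard:
    """Illustrative Layer-1 guard: deny by default, cap spend,
    route irreversible actions to a human review gate."""

    def __init__(self, allowlist: list[str], spend_limit: float):
        self.allowlist = set(allowlist)
        self.spend_limit = spend_limit
        self.spent = 0.0

    def authorize(self, tool: str, cost: float = 0.0,
                  irreversible: bool = False) -> str:
        if tool not in self.allowlist:
            return "deny"                      # not on the allowlist
        if self.spent + cost > self.spend_limit:
            return "deny"                      # would breach session budget
        if irreversible:
            return "review"                    # human gate before spend
        self.spent += cost
        return "allow"
```

The design choice worth copying is the default: anything not explicitly allowlisted is denied, and spend is only committed once an action is actually allowed.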

Layer 2 — In-context evaluation

Apollo-style evaluations run against a representative sample of production prompts — pre-deployment AND as continuous monitoring. Flag divergence between stated and actual behavior.

Layer 3 — Multi-agent trace analysis

Leibo-inspired. If your system uses 2+ agents, log the full interaction graph. Periodically review for norm drift, coalition patterns, or unexpected coordination. This is the interaction layer.
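One cheap trace-analysis pass, assuming you log every inter-agent message as a (sender, receiver) pair: compare the observed interaction graph against the topology you designed. The function below is a hypothetical sketch, and edge counts outside the designed graph are a signal for human review, not proof of coordination.

```python
from collections import Counter

def unexpected_edges(trace: list[tuple[str, str]],
                     designed: set[tuple[str, str]]) -> dict:
    """trace: logged (sender, receiver) messages between agents.
    Returns message counts along edges absent from the designed
    interaction graph -- a crude drift signal for periodic review."""
    counts = Counter((s, r) for s, r in trace)
    return {edge: n for edge, n in counts.items() if edge not in designed}
```

In a system designed as planner ↔ coder, a `("coder", "reviewer")` edge showing up in production is precisely the kind of unplanned interaction this layer exists to catch.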

Layer 4 — Human oversight with escalation

Not "human-in-the-loop for every action" — that's theatre. Rather: tier actions by reversibility and surface only the hard-to-reverse, high-cost subset for human review.
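The tiering can be made explicit in a few lines. The tier names and the two-axis split (reversibility × blast radius) below are illustrative assumptions, not a standard taxonomy:

```python
def escalation_tier(reversible: bool, blast_radius: str) -> str:
    """Illustrative Layer-4 routing: only hard-to-reverse, high-impact
    actions block on a human; the rest is logged or reviewed after
    the fact within an SLA."""
    if not reversible and blast_radius == "high":
        return "human_review"   # blocks until a human approves
    if not reversible:
        return "async_review"   # proceeds, reviewed within an SLA
    return "log_only"           # reversible: log and sample-audit
```

The point of the tiering is throughput: reserving synchronous human attention for the small subset of actions that cannot be undone keeps oversight meaningful instead of rubber-stamped.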

Layer 5 — Disclosure

Document, for each affected user, what the agent does, when, and with what authority. The EU AI Act's Article 50 transparency duties apply here.

The operator-grade summary

If you take one thing away: agentic governance is not a post-deployment compliance task — it's a system design decision made at the first prompt. If you're deploying agents without the five layers above, you're inheriting risk that will surface at the worst possible moment.

This is exactly what the Legal + Reputation lanes of our audits dig into when agents are part of the stack.


A Free First Audit includes an agentic-governance review for any business already deploying or about to deploy AI agents in production. Start here.