
Empirical Tests and Validation


Experiment A: Red-Team Attack Success Rates (Hypothetical)

Illustrative setup: 100 attack attempts against each architecture.

| Architecture | Successful Attacks | Attack Success Rate |
| --- | --- | --- |
| Monolithic GPT-4 agent | 34/100 | 34% |
| Monolithic + guardrails | 21/100 | 21% |
| Decomposed (3 components) | 12/100 | 12% |
| Decomposed + verification | 9/100 | 9% |

What this would show: By creating information bottlenecks and stateless components, decomposition could plausibly reduce attack success, but the specific numbers are invented to illustrate the concept.

Attack types expected to be reduced:

  • Prompt injection via context manipulation (blocked by information bottlenecks)
  • Multi-turn social engineering (blocked by stateless components)
  • Capability elicitation through roleplay (blocked by narrow component scope)
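With only 100 attempts per architecture, sampling noise matters when reading a table like the one above. The following is a minimal sketch, not part of the original experiment design, that attaches Wilson score intervals to the invented counts. The function name `wilson_interval` and the 95% level are assumptions chosen for illustration.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

# Illustrative counts from the table above (invented numbers, not measurements).
results = {
    "Monolithic GPT-4 agent": 34,
    "Monolithic + guardrails": 21,
    "Decomposed (3 components)": 12,
    "Decomposed + verification": 9,
}

intervals = {name: wilson_interval(k, 100) for name, k in results.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: {low:.1%} to {high:.1%}")
```

Under these assumptions the intervals for the two decomposed rows overlap, while the interval for the decomposed architecture sits entirely below the monolithic one. A real study that wants to separate 12% from 9% would likely need more than 100 attempts per arm.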

Experiment B: Performance Overhead (Hypothetical)

Illustrative setup: Same task (research summarization) executed by both architectures.

| Metric | Monolithic | Decomposed | Overhead |
| --- | --- | --- | --- |
| Median latency | 2.3s | 4.1s | 1.8x |
| P99 latency | 8.2s | 12.1s | 1.5x |
| API cost per task | $0.12 | $0.18 | 1.5x |
| Total cost (incl. verification) | $0.12 | $0.22 | 1.8x |

What this would show: Overhead is plausibly within an acceptable range for many applications, but actual overheads depend heavily on implementation choices.

Why overhead might be lower than feared:

  • Narrow components can use smaller/cheaper models
  • Parallel execution where dependencies allow
  • Verification layers are mostly code (fast, cheap)
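The overhead column is simply the decomposed value divided by the monolithic value. A small sketch of that arithmetic, using the invented figures from the table (the dictionary keys and the `overhead` helper are illustrative names, not part of any existing tool):

```python
# Illustrative values from the table above (invented, not measured):
# metric -> (monolithic, decomposed)
metrics = {
    "median_latency_s": (2.3, 4.1),
    "p99_latency_s": (8.2, 12.1),
    "api_cost_usd": (0.12, 0.18),
    "total_cost_usd": (0.12, 0.22),
}

def overhead(monolithic: float, decomposed: float) -> float:
    """Overhead multiplier of the decomposed architecture over the monolithic one."""
    if monolithic <= 0:
        raise ValueError("monolithic baseline must be positive")
    return decomposed / monolithic

ratios = {name: round(overhead(m, d), 1) for name, (m, d) in metrics.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```

The same helper could be pointed at real benchmark numbers once a prototype exists.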

Experiment C: Delegation Risk Prediction Accuracy (Hypothetical)

Illustrative setup: Estimated Delegation Risk for 20 components, then tracked incident rates over simulated executions.

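| Component Type | Predicted Delegation Risk | Actual Incidents | Prediction Error |
| --- | --- | --- | --- |
| Search/Retrieval | $15 | $12 | -20% |
| Summarization | $45 | $62 | +38% |
| Code generation | $180 | $156 | -13% |
| Coordination | $85 | $91 | +7% |
| Average | | | ±19% |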

What this would show: Plausible prediction error range; summarization may be harder to bound due to hallucination variance.
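The prediction-error column can be reproduced as the signed relative error, (actual − predicted) / predicted, and the average row appears to be the mean of the absolute errors. A short sketch of that calculation on the invented figures (the helper name `signed_error_pct` is an assumption for illustration):

```python
# Illustrative values from the table above (invented, not measured):
# component type -> (predicted Delegation Risk in $, actual incident cost in $)
rows = {
    "Search/Retrieval": (15, 12),
    "Summarization": (45, 62),
    "Code generation": (180, 156),
    "Coordination": (85, 91),
}

def signed_error_pct(predicted: float, actual: float) -> float:
    """Signed prediction error as a percentage of the predicted value."""
    return 100 * (actual - predicted) / predicted

errors = {name: signed_error_pct(p, a) for name, (p, a) in rows.items()}
mean_abs_error = sum(abs(e) for e in errors.values()) / len(errors)
mean_signed_error = sum(errors.values()) / len(errors)

for name, err in errors.items():
    print(f"{name}: {err:+.0f}%")
print(f"mean absolute error: {mean_abs_error:.1f}%")
print(f"mean signed error: {mean_signed_error:+.1f}%")
```

The mean absolute error is what measures accuracy here. The signed mean is much smaller because over- and under-predictions partly cancel, so it is better read as a check for systematic bias.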

Experiment D: Scheming Detection (Hypothetical)

Illustrative setup: Artificially induced scheming model deployed in both architectures with monitoring.

| Architecture | Scheming Detected | Time to Detection | Scheme Succeeded |
| --- | --- | --- | --- |
| Monolithic | 3/10 trials | 847 steps avg | 7/10 |
| Decomposed + monitoring | 9/10 trials | 23 steps avg | 1/10 |

What this would show: Decomposition + monitoring could plausibly accelerate detection—but the specific ratios are invented to illustrate the expected direction of effect.

Why decomposition is expected to help detection:

  • Each component has narrow behavioral baseline (anomalies more visible)
  • Information bottlenecks force compression (harder to hide coordination)
  • Stateless operation prevents gradual resource accumulation
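The first point, that a narrow behavioral baseline makes anomalies more visible, can be illustrated with a simple z-score monitor. This is a toy sketch under stated assumptions: the monitored metric (output tokens per call), the sample values, and the function name `anomaly_score` are all hypothetical.

```python
import statistics

def anomaly_score(baseline: list[float], observed: float) -> float:
    """Distance of an observation from the baseline mean, in standard deviations."""
    mean = statistics.fmean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        return 0.0 if observed == mean else float("inf")
    return abs(observed - mean) / stdev

# Hypothetical metric: output tokens per call.
narrow_baseline = [98, 102, 100, 101, 99, 100]   # narrow component, tight distribution
broad_baseline = [40, 900, 150, 2000, 75, 600]   # general-purpose agent, wide distribution
observed = 160                                   # the same unusual call in both settings

narrow_score = anomaly_score(narrow_baseline, observed)
broad_score = anomaly_score(broad_baseline, observed)
print(f"narrow component: {narrow_score:.1f} sigma")
print(f"broad agent: {broad_score:.1f} sigma")
```

The same observation is far outside the narrow component's baseline but unremarkable against the broad agent's baseline. A real monitor would track many signals, not a single scalar.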

If this approach is promising, we should see concrete, measurable results across four validation stages:

  1. Empirical Tests: Validate core assumptions (scheming reduction, forecasting calibration)
  2. Prototypes: Build working systems (research assistant, code deployment)
  3. Red Team Testing: Adversarial evaluation of architectural patterns
  4. Low-Stakes Deployment: Real-world validation in bounded environments

Results that would count as evidence:

  • Narrow components show less scheming than monolithic models
  • Factored cognition improves monitoring effectiveness (already shown in recent work)
  • Frontier models can forecast their own behavior with reasonable calibration
  • Modular architectures have an acceptable performance penalty (under 2-3x) for many tasks
  • Prototype systems (code generation, research assistant) demonstrate the approach
  • Red-team adversarial testing finds vulnerabilities that inform better decomposition
  • Early deployment in low-stakes domains proves value

Test 1: Scheming in Narrow vs Broad Context

  1. Train frontier model, measure scheming propensity
  2. Decompose same capability into narrow components, measure again
  3. Hypothesis: Modular version schemes less (quantify difference)

Test 2: Temperature and Covert Signaling

  1. Run same LLM at temperature=0 vs 1.0 with repeated queries
  2. Measure output variance
  3. Hypothesis: High temperature enables covert signaling
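One way to quantify "output variance" for repeated identical queries is the Shannon entropy of the observed outputs. The sketch below assumes exact-match comparison of responses and uses made-up sample outputs. It treats the entropy as a rough upper bound on how many bits per response could be carried by the choice among outputs, which is an interpretation offered here, not a result from the source.

```python
import math
from collections import Counter

def output_entropy_bits(outputs: list[str]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of outputs."""
    if not outputs:
        raise ValueError("need at least one output")
    total = len(outputs)
    counts = Counter(outputs)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical responses to the same query, repeated 8 times.
temp_0_outputs = ["A"] * 8
temp_1_outputs = ["A", "B", "A", "C", "B", "D", "A", "B"]

entropy_t0 = output_entropy_bits(temp_0_outputs)
entropy_t1 = output_entropy_bits(temp_1_outputs)
print(f"temperature=0: {entropy_t0:.2f} bits")
print(f"temperature=1.0: {entropy_t1:.2f} bits")
```

Exact-match comparison is a simplification. Free-form text would need clustering or a semantic similarity measure before counting.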

Test 3: Coordinator Resistance to Induced Scheming

  1. Build coordinator as: (a) monolithic, (b) decomposed with older models, (c) decomposed with redundancy
  2. Red-team each variant with attacks that try to induce scheming
  3. Hypothesis: (c) most resistant, (a) least resistant

Test 4: Forecasting-Based Navigation

  1. Implement forecasting-based navigation on toy problem
  2. Measure: Does system improve over time? Do forecasts remain calibrated?
  3. Hypothesis: Navigation works if forecasts are honest
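Whether forecasts "remain calibrated" needs a concrete score. The Brier score (mean squared difference between forecast probability and outcome) is one standard choice. The sketch below applies it to made-up forecasts and is only an assumption about how the measurement step might be done.

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes (lower is better)."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and the same length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical forecasts of "this step will succeed" and what actually happened.
forecasts = [0.9, 0.8, 0.7, 0.2]
outcomes = [1, 1, 0, 0]

score = brier_score(forecasts, outcomes)
uninformed = brier_score([0.5] * len(outcomes), outcomes)  # always-50% baseline
print(f"system: {score:.3f}, always-0.5 baseline: {uninformed:.3f}")
```

Tracking this score per iteration of the toy problem would show whether calibration drifts as the system changes.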

Test 5: Fine-Tuned vs Frontier Model Predictability

  1. Compare behavior predictability of fine-tuned narrow model vs frontier model on same task
  2. Measure: variance in outputs, ease of verification, susceptibility to prompt injection
  3. Hypothesis: Fine-tuned models more predictable and verifiable

Test 6: Byzantine Fault Tolerance for Coordinators

  1. Deploy N coordinators, of which 0 to f are adversarial
  2. Measure: At what f does system fail?
  3. Hypothesis: 3f+1 rule holds for AI coordinators
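The 3f+1 rule comes from classical Byzantine fault tolerance: tolerating f arbitrarily faulty participants requires at least N = 3f + 1 participants. The sketch below only encodes that arithmetic so that experimental results can be compared against it. The function names are illustrative, and whether the bound transfers to AI coordinators is exactly what the test is meant to find out.

```python
def max_faulty_tolerated(n: int) -> int:
    """Largest f satisfying n >= 3f + 1 (classical Byzantine fault tolerance bound)."""
    if n < 1:
        raise ValueError("need at least one coordinator")
    return (n - 1) // 3

def min_coordinators(f: int) -> int:
    """Smallest n that tolerates f adversarial coordinators under the 3f+1 rule."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1

def predicted_to_fail(n: int, adversarial: int) -> bool:
    """True if the rule predicts failure with this many adversarial coordinators."""
    return adversarial > max_faulty_tolerated(n)

for n in (4, 6, 7, 10):
    print(f"N={n}: tolerates up to f={max_faulty_tolerated(n)}")
```

The classical bound assumes the non-adversarial participants behave correctly. Coordinators built on the same base model may fail in correlated ways, so the measured failure threshold could fall below this prediction.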

Question: How good does forecasting need to be? How does uncertainty compound over 10-step iterations?

Current state: LLMs are approaching superforecaster-level performance on many prediction tasks, but self-prediction (forecasting their own behavior after modification) is less tested.

What’s needed: Benchmarks specifically for “predict your behavior N steps ahead” with various modifications in between. Characterize how calibration degrades with horizon length.
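As a first-order intuition for how uncertainty compounds over a 10-step horizon: if each step's forecast holds with probability p and steps are treated as independent (a strong simplifying assumption made here for illustration), the whole chain holds with probability p^n. The helper names below are hypothetical.

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """Probability that every step's forecast holds, assuming independent steps."""
    if not 0.0 <= per_step <= 1.0 or steps < 1:
        raise ValueError("per_step must be in [0, 1] and steps must be >= 1")
    return per_step ** steps

def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to reach a target reliability over the whole chain."""
    if not 0.0 < target <= 1.0 or steps < 1:
        raise ValueError("target must be in (0, 1] and steps must be >= 1")
    return target ** (1 / steps)

ten_step_95 = chain_reliability(0.95, 10)
ten_step_99 = chain_reliability(0.99, 10)
needed_for_90 = required_per_step(0.90, 10)

print(f"95% per step over 10 steps: {ten_step_95:.3f}")
print(f"99% per step over 10 steps: {ten_step_99:.3f}")
print(f"per-step reliability needed for 90% overall: {needed_for_90:.4f}")
```

Under this model, 95% per-step reliability leaves roughly a 60% chance that a 10-step chain holds, and reaching 90% overall needs about 99% per step. Real errors are unlikely to be independent, so a benchmark would need to measure the degradation curve directly.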

Challenge addressed: The original concern was capability concentration in “the coordinator.”

Solution proposed: Coordination isn’t one component—it’s an architecture. Decompose into:

  • Strategy (older models, heavily constrained)
  • Forecasting delegation (separate modules)
  • Tactical execution (narrow context)
  • Progress monitoring (mostly code)
  • Verification layers (simple checkers)

Remaining questions:

  • What’s the optimal granularity of coordination decomposition?
  • How much redundancy is needed (multiple strategy modules)?
  • At what capability level does even decomposed coordination become risky?
  • Can we formally verify coordination logic?

Question: What architectural features actually reduce scheming propensity?

Hypothesis: Limited context, component isolation, independent forecasting/implementation separation, multiple adversarial checks—these should make strategic deception harder.

What’s needed: Empirical validation. Test scheming propensity in:

  • Monolithic models
  • Models with factored cognition
  • Models with architectural separation between planning and execution

Recent work (Factor(T,U) paper) shows factored cognition helps monitoring. Specific tests for deception are still needed.

Question: How fine-grained should decomposition be?

Tradeoff: Finer → safer but less efficient. Coarser → more efficient but larger blast radius per component.

What’s needed: Theory of optimal decomposition boundaries. Where do you split? How do you decide?

Question: What’s the performance penalty vs monolithic models? Is it acceptable?

What’s needed: Benchmarks. Prototype a modular system, compare performance to monolithic baseline. Quantify the safety-capability tradeoff.

Question: Can forecasting reliably catch emergent properties from component composition?

Risk: Two safe components might enable dangerous capability together. Can multi-step forecasting detect this?

What’s needed: Adversarial testing. Design components that seem safe individually but enable harm together. See if forecasting catches it.
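An adversarial test of this kind needs ground-truth labels: which component pairs are individually safe but jointly dangerous. The sketch below shows one possible way to generate such labels from capability tags. The tags, the dangerous-combination list, and the function name `emergent_risk_pairs` are all hypothetical, and the static check stands in for the ground truth that a forecasting system would be scored against, not for the forecasting itself.

```python
from itertools import combinations

# Hypothetical list of capability combinations considered dangerous together.
DANGEROUS_COMBINATIONS = [
    frozenset({"read_private_data", "send_external_message"}),
]

def emergent_risk_pairs(components: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Pairs whose combined capabilities cover a dangerous combination
    that neither component covers on its own."""
    flagged = []
    for (name_a, caps_a), (name_b, caps_b) in combinations(components.items(), 2):
        combined = caps_a | caps_b
        for combo in DANGEROUS_COMBINATIONS:
            individually_safe = not (combo <= caps_a or combo <= caps_b)
            if combo <= combined and individually_safe:
                flagged.append((name_a, name_b))
                break
    return flagged

# Hypothetical components with hypothetical capability tags.
components = {
    "retriever": {"read_private_data"},
    "summarizer": {"transform_text"},
    "notifier": {"send_external_message"},
}

pairs = emergent_risk_pairs(components)
print(pairs)
```

The forecasting system under test would then be asked about each pair without seeing the tags, and its flags compared against this list.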

Question: How do you test that this works without deploying potentially dangerous systems?

Sandbox problem: Testing in restricted environments might not reveal problems that emerge in deployment.

What’s needed: Evaluation methodology that gives confidence without catastrophic-failure-as-test. Possibly: progressive deployment with increasing stakes, extensive red-teaming, formal verification of critical components.