Empirical Tests and Validation
Illustrative Experiment Results
Experiment A: Red-Team Attack Success Rates (Hypothetical)
Illustrative setup: 100 attack attempts against each architecture.
| Architecture | Successful Attacks | Attack Success Rate |
|---|---|---|
| Monolithic GPT-4 agent | 34/100 | 34% |
| Monolithic + guardrails | 21/100 | 21% |
| Decomposed (3 components) | 12/100 | 12% |
| Decomposed + verification | 9/100 | 9% |
What this would show: By creating information bottlenecks and stateless components, decomposition could plausibly reduce attack success, but the specific numbers are invented to illustrate the concept.
Attack types expected to be reduced:
- Prompt injection via context manipulation (blocked by information bottlenecks)
- Multi-turn social engineering (blocked by stateless components)
- Capability elicitation through roleplay (blocked by narrow component scope)
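With only 100 attempts per architecture, sampling noise matters when comparing success rates. The sketch below attaches a Wilson score interval to each illustrative count from the table; the 95% level (`z = 1.96`) and the choice of the Wilson method are assumptions made here, not part of the original setup.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

# Illustrative counts copied from the hypothetical table above.
results = {
    "Monolithic GPT-4 agent": 34,
    "Monolithic + guardrails": 21,
    "Decomposed (3 components)": 12,
    "Decomposed + verification": 9,
}

for name, successes in results.items():
    low, high = wilson_interval(successes, 100)
    print(f"{name}: {successes}/100 (95% CI {low:.1%} to {high:.1%})")
```

At this sample size the intervals for 12/100 and 9/100 overlap, so a real experiment would likely need more attempts to separate the two decomposed variants, while the gap between 34/100 and 9/100 is wide enough to show up.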
Experiment B: Performance Overhead (Hypothetical)
Illustrative setup: Same task (research summarization) executed by both architectures.
| Metric | Monolithic | Decomposed | Overhead |
|---|---|---|---|
| Median latency | 2.3s | 4.1s | 1.8x |
| P99 latency | 8.2s | 12.1s | 1.5x |
| API cost per task | $0.12 | $0.18 | 1.5x |
| Total cost (incl. verification) | $0.12 | $0.22 | 1.8x |
What this would show: Overhead plausibly within acceptable range for many applications—but actual overheads depend heavily on implementation choices.
Why overhead might be lower than feared:
- Narrow components can use smaller/cheaper models
- Parallel execution where dependencies allow
- Verification layers are mostly code (fast, cheap)
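The overhead column is the decomposed value divided by the monolithic value. As a minimal sketch, the code below recomputes those ratios from the illustrative table and shows one way to derive median and P99 latency from raw samples; the helper names `latency_summary` and `overhead` are hypothetical.

```python
import statistics

def latency_summary(samples_s):
    """Return (median, p99) for a list of latency samples in seconds."""
    cut_points = statistics.quantiles(samples_s, n=100)  # 99 cut points
    return statistics.median(samples_s), cut_points[98]  # index 98 is the 99th percentile

def overhead(monolithic, decomposed):
    """Ratio of decomposed to monolithic; 1.0 means no overhead."""
    if monolithic <= 0:
        raise ValueError("monolithic baseline must be positive")
    return decomposed / monolithic

# Illustrative values copied from the hypothetical table above.
table = {
    "Median latency (s)": (2.3, 4.1),
    "P99 latency (s)": (8.2, 12.1),
    "API cost per task ($)": (0.12, 0.18),
    "Total cost incl. verification ($)": (0.12, 0.22),
}

for metric, (mono, dec) in table.items():
    print(f"{metric}: {overhead(mono, dec):.1f}x")
```

Rounded to one decimal place, the ratios match the table: 1.8x, 1.5x, 1.5x, and 1.8x.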
Experiment C: Delegation Risk Prediction Accuracy (Hypothetical)
Illustrative setup: Estimated Delegation Risk for 20 components, then tracked incident rates over simulated executions.
| Component Type | Predicted Delegation Risk | Actual Incident Cost | Prediction Error |
|---|---|---|---|
| Search/Retrieval | $15 | $12 | -20% |
| Summarization | $45 | $62 | +38% |
| Code generation | $180 | $156 | -13% |
| Coordination | $85 | $91 | +7% |
| Average | | | ±19% |
What this would show: Plausible prediction error range; summarization may be harder to bound due to hallucination variance.
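The prediction errors in the table are consistent with a signed relative error, (actual − predicted) / predicted, averaged by absolute value. That convention is inferred from the numbers rather than stated in the text, so treat the sketch below as one plausible reading.

```python
def signed_error(predicted: float, actual: float) -> float:
    """Relative error; positive means the risk was underestimated."""
    return (actual - predicted) / predicted

# Illustrative (predicted, actual) dollar values from the hypothetical table above.
rows = {
    "Search/Retrieval": (15, 12),
    "Summarization": (45, 62),
    "Code generation": (180, 156),
    "Coordination": (85, 91),
}

errors = {name: signed_error(p, a) for name, (p, a) in rows.items()}
mean_abs_error = sum(abs(e) for e in errors.values()) / len(errors)

for name, err in errors.items():
    print(f"{name}: {err:+.0%}")
print(f"Mean absolute error: {mean_abs_error:.1%}")
```

Under this reading the mean absolute error comes out at roughly 19.5%, close to the ±19% average shown in the table.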
Experiment D: Scheming Detection (Hypothetical)
Illustrative setup: Artificially induced scheming model deployed in both architectures with monitoring.
| Architecture | Scheming Detected | Time to Detection | Scheme Succeeded |
|---|---|---|---|
| Monolithic | 3/10 trials | 847 steps avg | 7/10 |
| Decomposed + monitoring | 9/10 trials | 23 steps avg | 1/10 |
What this would show: Decomposition + monitoring could plausibly accelerate detection—but the specific ratios are invented to illustrate the expected direction of effect.
Why decomposition is expected to help detection:
- Each component has narrow behavioral baseline (anomalies more visible)
- Information bottlenecks force compression (harder to hide coordination)
- Stateless operation prevents gradual resource accumulation
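The first point, that a narrow behavioral baseline makes anomalies more visible, can be illustrated with a simple z-score. In this sketch the monitored metric (tool calls per step), the sample values, and the function name are all hypothetical; a real monitor would likely track several signals.

```python
import statistics

def anomaly_z_score(baseline, observation):
    """How many baseline standard deviations the observation sits from the mean."""
    mean = statistics.mean(baseline)
    spread = statistics.pstdev(baseline)
    if spread == 0:
        raise ValueError("baseline has no variance; z-score undefined")
    return (observation - mean) / spread

# Hypothetical tool-calls-per-step histories.
narrow_baseline = [3, 3, 4, 3, 4, 3]   # narrow component: tightly clustered behavior
broad_baseline = [1, 9, 3, 12, 2, 7]   # monolithic agent: widely varying behavior
observation = 8                        # the same suspicious step in both settings

print(f"narrow component z = {anomaly_z_score(narrow_baseline, observation):.1f}")
print(f"monolithic agent z = {anomaly_z_score(broad_baseline, observation):.1f}")
```

The same observation stands far outside the narrow component's baseline but sits within normal variation for the broad one, which is the intuition behind the faster detection the table illustrates.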
Caveats on the Illustrations Above
Testable Predictions
If this approach is promising, we should see concrete, measurable results across four validation stages:
- Empirical Tests — Validate core assumptions (scheming reduction, forecasting calibration)
- Prototypes — Build working systems (research assistant, code deployment)
- Red Team Testing — Adversarial evaluation of architectural patterns
- Low-Stakes Deployment — Real-world validation in bounded environments
Empirically Testable Now
- Narrow components show less scheming than monolithic models
- Factored cognition improves monitoring effectiveness (already shown in recent work)
- Frontier models can forecast their own behavior with reasonable calibration
- Modular architectures have acceptable performance penalty (<2-3x) for many tasks
Development Validation
- Prototype systems (code generation, research assistant) demonstrate the approach
- Red team adversarial testing finds vulnerabilities → informs better decomposition
- Early deployment in low-stakes domains proves value
Specific Empirical Tests
Test 1: Scheming in Narrow vs Broad Context
- Train frontier model, measure scheming propensity
- Decompose same capability into narrow components, measure again
- Hypothesis: Modular version schemes less (quantify difference)
Test 2: Covert Signaling via Randomness
- Run same LLM at temperature=0 vs 1.0 with repeated queries
- Measure output variance
- Hypothesis: High temperature enables covert signaling
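One way to quantify "output variance" for Test 2 is the Shannon entropy of the responses to a fixed prompt: for a fixed prompt, a component cannot encode more bits per response through its sampling choices than the entropy of its output distribution. The sketch below assumes the responses have already been collected into lists; the sample data is invented, and an empirical estimate from a small sample tends to understate the true entropy.

```python
import math
from collections import Counter

def output_entropy_bits(outputs):
    """Empirical Shannon entropy (bits) of repeated responses to one prompt."""
    if not outputs:
        raise ValueError("need at least one output")
    counts = Counter(outputs)
    total = len(outputs)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical responses to the same query, repeated eight times.
temp_0_outputs = ["Paris"] * 8
temp_1_outputs = ["Paris", "Paris.", "It is Paris", "Paris", "paris", "Paris.", "It is Paris", "Paris"]

print(f"temperature=0: {output_entropy_bits(temp_0_outputs):.2f} bits per response")
print(f"temperature=1: {output_entropy_bits(temp_1_outputs):.2f} bits per response")
```

A deterministic component measures zero bits, leaving no sampling freedom to signal with for that prompt; any positive value bounds the per-response channel, though it does not show that the channel is actually being used.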
Test 3: Coordinator Robustness
- Build coordinator as: (a) monolithic, (b) decomposed with older models, (c) decomposed with redundancy
- Red-team each variant, attempting to induce scheming
- Hypothesis: (c) most resistant, (a) least resistant
Test 4: Forecasting-Based Navigation
- Implement forecasting-based navigation on toy problem
- Measure: Does system improve over time? Do forecasts remain calibrated?
- Hypothesis: Navigation works if forecasts honest
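Whether forecasts "remain calibrated" needs a concrete score. A common choice is the Brier score, the mean squared difference between forecast probabilities and binary outcomes; it rewards calibration and sharpness together rather than calibration alone, so it is offered here as one assumed metric, not the method the test prescribes. The forecast and outcome values are invented.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes (lower is better)."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and the same length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical: probability the system assigned to "this step will succeed",
# paired with what actually happened (1 = succeeded, 0 = failed).
forecasts = [0.9, 0.8, 0.2, 0.6]
outcomes = [1, 1, 0, 0]

print(f"Brier score: {brier_score(forecasts, outcomes):.4f}")
```

Tracking this score per iteration would show whether forecast quality drifts as the system navigates; a forecaster that always answers 0.5 scores 0.25, which gives a useful no-skill reference point.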
Test 5: Fine-Tuned vs Frontier Model Predictability
- Compare behavior predictability of fine-tuned narrow model vs frontier model on same task
- Measure: variance in outputs, ease of verification, susceptibility to prompt injection
- Hypothesis: Fine-tuned models more predictable and verifiable
Test 6: Byzantine Fault Tolerance
- Deploy N coordinators, of which between 0 and f are adversarial
- Measure: At what f does the system fail?
- Hypothesis: The 3f+1 rule (tolerating f Byzantine faults requires at least 3f+1 nodes) holds for AI coordinators
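In classical Byzantine fault tolerance, agreement among N nodes can tolerate at most f arbitrarily faulty nodes when N ≥ 3f + 1. The sketch below turns that bound into the prediction Test 6 would check; whether AI coordinators actually satisfy the assumptions behind the classical result is precisely the open question, and the function names are hypothetical.

```python
def max_tolerated_faults(n_coordinators: int) -> int:
    """Largest f satisfying n >= 3f + 1."""
    if n_coordinators < 1:
        raise ValueError("need at least one coordinator")
    return (n_coordinators - 1) // 3

def min_coordinators(f_adversarial: int) -> int:
    """Smallest n that tolerates f adversarial coordinators under the 3f+1 rule."""
    if f_adversarial < 0:
        raise ValueError("f must be non-negative")
    return 3 * f_adversarial + 1

for n in (3, 4, 7, 10):
    print(f"N={n}: predicted to tolerate f <= {max_tolerated_faults(n)}")
```

An experiment would sweep f upward for each N and compare the observed failure point against this prediction.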
Open Questions Requiring Research
1. Forecasting Accuracy Bounds
Question: How good does forecasting need to be? How does uncertainty compound over 10-step iterations?
Current state: LLMs are approaching superforecaster-level on many prediction tasks. But self-prediction (forecasting own behavior after modification) is less tested.
What’s needed: Benchmarks specifically for “predict your behavior N steps ahead” with various modifications in between. Characterize how calibration degrades with horizon length.
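To get a feel for how uncertainty compounds, consider the simplest possible model: each step's forecast is right with the same probability, independently of the others. Both the independence assumption and the example probabilities are simplifications introduced here; real forecast errors are likely correlated across steps.

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """Probability that every forecast in a chain is right, assuming independence."""
    if not 0.0 <= per_step <= 1.0 or steps < 1:
        raise ValueError("per_step must be in [0, 1] and steps >= 1")
    return per_step ** steps

def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to reach a target chain reliability."""
    if not 0.0 < target <= 1.0 or steps < 1:
        raise ValueError("target must be in (0, 1] and steps >= 1")
    return target ** (1.0 / steps)

print(f"95% per step over 10 steps: {chain_reliability(0.95, 10):.1%}")
print(f"Per-step accuracy for 90% over 10 steps: {required_per_step(0.90, 10):.2%}")
```

Under this toy model, 95% per-step accuracy leaves only about a 60% chance that a 10-step chain is right throughout, and reaching 90% end-to-end needs roughly 99% per step.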
2. Decomposing Coordination Logic
Challenge addressed: The original concern was capability concentration in “the coordinator.”
Solution proposed: Coordination isn’t one component—it’s an architecture. Decompose into:
- Strategy (older models, heavily constrained)
- Forecasting delegation (separate modules)
- Tactical execution (narrow context)
- Progress monitoring (mostly code)
- Verification layers (simple checkers)
Remaining questions:
- What’s the optimal granularity of coordination decomposition?
- How much redundancy is needed (multiple strategy modules)?
- At what capability level does even decomposed coordination become risky?
- Can we formally verify coordination logic?
3. Anti-Scheming Architecture
Question: What architectural features actually reduce scheming propensity?
Hypothesis: Limited context, component isolation, independent forecasting/implementation separation, multiple adversarial checks—these should make strategic deception harder.
What’s needed: Empirical validation. Test scheming propensity in:
- Monolithic models
- Models with factored cognition
- Models with architectural separation between planning and execution
Recent work (Factor(T,U) paper) shows factored cognition helps monitoring. Need specific tests for deception.
4. Decomposition Granularity
Question: How fine-grained should decomposition be?
Tradeoff: Finer → safer but less efficient. Coarser → more efficient but larger blast radius per component.
What’s needed: Theory of optimal decomposition boundaries. Where do you split? How do you decide?
5. Competitive Dynamics
Question: What’s the performance penalty vs monolithic models? Is it acceptable?
What’s needed: Benchmarks. Prototype a modular system, compare performance to monolithic baseline. Quantify the safety-capability tradeoff.
6. Emergence and Composition
Question: Can forecasting reliably catch emergent properties from component composition?
Risk: Two safe components might enable dangerous capability together. Can multi-step forecasting detect this?
What’s needed: Adversarial testing. Design components that seem safe individually but enable harm together. See if forecasting catches it.
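For adversarial testing of this kind, it helps to have ground truth about which component pairs are dangerous only in combination. The sketch below is a static capability check, not the forecasting-based detection the question asks about; it could be used to generate labeled test cases against which a forecaster is scored. The capability names, component names, and forbidden combination are all hypothetical.

```python
from itertools import combinations

# Hypothetical capability combinations considered harmful when held together.
FORBIDDEN = [frozenset({"read_credentials", "send_network"})]

# Hypothetical components, each individually free of any forbidden combination.
components = {
    "retriever": {"read_files", "read_credentials"},
    "notifier": {"send_network"},
    "summarizer": {"read_files"},
}

def emergent_risks(components, forbidden):
    """Pairs whose combined capabilities cover a forbidden set that neither covers alone."""
    flagged = []
    for (a, caps_a), (b, caps_b) in combinations(components.items(), 2):
        combined = caps_a | caps_b
        for combo in forbidden:
            covered_jointly = combo <= combined
            covered_alone = combo <= caps_a or combo <= caps_b
            if covered_jointly and not covered_alone:
                flagged.append((a, b, combo))
    return flagged

for a, b, combo in emergent_risks(components, FORBIDDEN):
    print(f"{a} + {b} jointly enable {sorted(combo)}")
```

A forecasting module that fails to flag pairs this check identifies would be missing exactly the compositional risk described above.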
7. Evaluation Before Deployment
Question: How do you test this works without deploying potentially dangerous systems?
Sandbox problem: Testing in restricted environments might not reveal problems that emerge in deployment.
What’s needed: Evaluation methodology that gives confidence without catastrophic-failure-as-test. Possibly: progressive deployment with increasing stakes, extensive red-teaming, formal verification of critical components.