A decision theory for AI alignment that replaces scalar utility maximization with robust Pareto admissibility and structured deference. The core move: drop the completeness axiom, eliminate only provably dominated actions, and defer to human judgment on the rest. Corrigibility emerges as a structural feature rather than an imposed constraint.
Start here, in order of depth:
| Document | What it is | Length |
|---|---|---|
| The Scalarization Trap | The motivation — why scalar utility theory is structurally incompatible with alignment | ~25 pages |
| MOADT-core.pdf | The core argument — why dropping completeness dissolves corrigibility | 8 pages |
| MOADT-complete.pdf | Full technical report with proofs, worked-out protocol, and all 9 worked examples | 192 pages |
Source markdown: MOADT-core.md | R-MOADT.md | Worked Examples 1–9
The Scalarization Trap identifies the problem. The core document presents the solution. The full report defends it.
Building the PDFs requires pandoc and texlive-xetex:

```bash
bash paper/build-pdf.sh  # full report + appendices
```

A pip-installable Python implementation of the MOADT decision engine.
```bash
pip install -e .
```

```python
import numpy as np
from moadt import MOADTProblem, run_moadt_protocol

problem = MOADTProblem(
    actions=["a", "b", "c"],
    states=["s1", "s2"],
    objectives=["safety", "helpfulness", "honesty"],
    credal_probs=[  # credal set over states
        np.array([0.5, 0.5]),
        np.array([0.3, 0.7]),
    ],
    outcomes={  # (action, state) → objective scores
        ("a", "s1"): np.array([0.9, 0.6, 0.7]),
        ("a", "s2"): np.array([0.8, 0.5, 0.6]),
        ("b", "s1"): np.array([0.4, 0.9, 0.8]),
        ("b", "s2"): np.array([0.3, 0.8, 0.9]),
        ("c", "s1"): np.array([0.7, 0.7, 0.5]),
        ("c", "s2"): np.array([0.6, 0.6, 0.4]),
    },
    constraints={0: 0.3},  # safety (index 0) >= 0.3
    reference_point=np.array([0.5, 0.5, 0.5]),  # Layer 2 aspirations
)

result = run_moadt_protocol(problem)
```

- `moadt/_engine.py`: core engine (outcome sets, robust dominance, four-layer protocol)
- `moadt/__init__.py`: public API (14 names)
- `examples/`: 9 executable scripts matching the worked examples
- `tests/test_engine.py`: 25 regression tests
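To give a feel for what "robust dominance" over a credal set can mean, here is a minimal NumPy-only sketch on the same toy data. It does not call or reproduce the `moadt` engine. The helper names (`expected_vectors`, `robustly_dominates`, `admissible`) are hypothetical, and the dominance rule used here (at least as good on every objective under every distribution in the credal set, strictly better somewhere) is one simple reading that may differ from the definition in the report.

```python
import numpy as np

# Same toy data as the quick-start example, regrouped per action.
credal_probs = [np.array([0.5, 0.5]), np.array([0.3, 0.7])]
outcomes = {
    # rows: states s1, s2; columns: safety, helpfulness, honesty
    "a": np.array([[0.9, 0.6, 0.7], [0.8, 0.5, 0.6]]),
    "b": np.array([[0.4, 0.9, 0.8], [0.3, 0.8, 0.9]]),
    "c": np.array([[0.7, 0.7, 0.5], [0.6, 0.6, 0.4]]),
}


def expected_vectors(scores):
    """One expected objective vector per distribution in the credal set."""
    return np.array([p @ scores for p in credal_probs])


def robustly_dominates(x, y):
    """Assumed rule: x is never worse than y and is strictly better somewhere."""
    ex, ey = expected_vectors(x), expected_vectors(y)
    return bool(np.all(ex >= ey) and np.any(ex > ey))


def admissible(table):
    """Keep every action that no other action robustly dominates."""
    return [
        name
        for name, scores in table.items()
        if not any(
            robustly_dominates(other, scores)
            for other_name, other in table.items()
            if other_name != name
        )
    ]


print(admissible(outcomes))
```

Under this reading none of the three actions dominates another, since each is best on at least one objective, so all three survive and the remaining choice is the kind of tradeoff that would be deferred to a human.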
Run tests:
```bash
python3 -m pytest tests/ -q
```

```
├── paper/                          # All paper source files
│   ├── the-scalarization-trap.md   # Motivation (~25 pages)
│   ├── MOADT-core.md               # Compact argument (~8 pages)
│   ├── R-MOADT.md                  # Full technical report
│   ├── MOADT-worked-example-{1..9}.md
│   ├── build-pdf.sh                # PDF build script
│   ├── MOADT-core.pdf              # Built PDF
│   └── MOADT-complete.pdf          # Built PDF (report + appendices)
│
├── moadt/                          # Python library
│   ├── _engine.py
│   └── __init__.py
│
├── examples/                       # Executable example scripts
│   ├── paper1_resource_allocation.py
│   ├── ...
│   └── classic_stpetersburg.py
│
├── tests/                          # Regression tests
│   └── test_engine.py
│
└── pyproject.toml                  # Package configuration
```
Standard decision theory says: rational agents maximize a scalar utility function.
MOADT says: rational agents eliminate the provably bad and defer on the rest.
The first produces agents that must resist correction (the current utility function ranks itself as optimal), must trade safety for performance at some rate (completeness demands commensurability), and must pretend all human values fit on a single number line.
The second produces agents that have no reason to resist correction (Theorem 2), maintain hard safety floors that cannot be traded away (Layer 1 constraints), and ask humans to resolve the tradeoffs that humans should resolve (deference under incomparability).
The price is giving up decisiveness in cases of genuine value conflict. The payoff is corrigibility by construction.
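The contrast can be made concrete with a small sketch. It is an illustration only, not part of the `moadt` package: the expected objective vectors are those of the quick-start actions under the `[0.5, 0.5]` distribution, and the two weight vectors and the helper names (`scalar_choice`, `pareto_set`) are hypothetical.

```python
import numpy as np

# Expected (safety, helpfulness, honesty) per action under P = [0.5, 0.5].
expected = {
    "a": np.array([0.85, 0.55, 0.65]),
    "b": np.array([0.35, 0.85, 0.85]),
    "c": np.array([0.65, 0.65, 0.45]),
}


def scalar_choice(weights):
    """Scalar agent: collapse objectives with fixed weights, pick the max."""
    return max(expected, key=lambda name: float(weights @ expected[name]))


def pareto_set():
    """Incomplete-preference agent: drop only Pareto-dominated actions."""
    def dominated(y):
        return any(
            np.all(x >= y) and np.any(x > y) for x in expected.values()
        )
    return [name for name, y in expected.items() if not dominated(y)]


safety_heavy = np.array([0.6, 0.2, 0.2])   # hypothetical exchange rate
helpful_heavy = np.array([0.2, 0.4, 0.4])  # hypothetical exchange rate

print(scalar_choice(safety_heavy))   # choice under the first weighting
print(scalar_choice(helpful_heavy))  # choice under the second weighting
print(pareto_set())                  # what is left for humans to resolve
```

The scalar agent's answer flips with the weights, so whoever sets the weights has already fixed the exchange rate between safety and helpfulness. The Pareto filter commits to no exchange rate and returns every undominated action.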