The Box You Can't Open Twice: Newcomb's Paradox Through Four Mathematical Lenses
TL;DR: Newcomb’s paradox — should you take one box or two? — splits rational decision-makers almost evenly. There are four natural mathematical frameworks (causal inference, algorithmic self-reference, statistical counterfactuals, and online learning) that give different answers, and the disagreement reveals deep structural tensions in what it means to choose rationally.
In 1960, physicist William Newcomb invented a thought experiment so divisive that when Robert Nozick published it in 1969, he noted that “to almost everyone, it is perfectly clear and obvious what should be done. The difficulty is that these people seem to divide almost evenly on the problem, with large numbers thinking that the opposing half is just being silly.”
Sixty-six years later, they still do. The 2020 PhilPapers survey of ~2,000 professional philosophers found 39% favor two-boxing and 31% favor one-boxing, with the rest undecided. Among decision theory specialists, the lean is stronger: 61% two-box, 26% one-box. The general public and the AI alignment community tilt the other way.
The puzzle is simple. You face two boxes. Box A is transparent and holds \$1,000. Box B is opaque and holds either \$1,000,000 or nothing. A highly reliable predictor, right about 99% of the time, has already examined you and made a prediction. If it predicted you’d take only Box B, it put the million in. If it predicted you’d take both, it left Box B empty. The predictor is done. The money is placed. You choose.
The paradox is that two ironclad principles of rational choice give opposite answers. And the real interest isn’t in which answer is right. It’s in why we can’t tell.
The payoff matrix is simple:
| | Box B full (\$1,000,000) | Box B empty (\$0) |
|---|---|---|
| One-box (B only) | \$1,000,000 | \$0 |
| Two-box (A + B) | \$1,001,000 | \$1,000 |
Two-boxing strictly dominates: it pays \$1,000 more in every cell. But the predictor ties the column you’re in to the row you choose. Write $M = 1{,}000{,}000$ for the big prize, $K = 1{,}000$ for the small one, and $q = 0.99$ for the predictor’s accuracy. Four frameworks handle this coupling differently:
EDT (Evidential Decision Theory) conditions on the action:
\[E_{\text{EDT}}[a] = P(\text{B full} \mid a) \cdot (M + K \cdot \mathbf{1}_{a = \text{two-box}}) + P(\text{B empty} \mid a) \cdot K \cdot \mathbf{1}_{a = \text{two-box}}\]

With our numbers:

\[E_{\text{EDT}}[\text{one-box}] = 0.99 \times 1{,}000{,}000 = 990{,}000\]
\[E_{\text{EDT}}[\text{two-box}] = 0.01 \times 1{,}001{,}000 + 0.99 \times 1{,}000 = 11{,}000\]

EDT one-boxes. Your action is evidence about which column you’re in.
CDT (Causal Decision Theory) intervenes on the action:
\[E_{\text{CDT}}[a] = P(\text{B full}) \cdot V(a, \text{full}) + P(\text{B empty}) \cdot V(a, \text{empty})\]

Since $V(\text{two-box}, s) = V(\text{one-box}, s) + K$ for every box state $s$:

\[E_{\text{CDT}}[\text{two-box}] > E_{\text{CDT}}[\text{one-box}]\]

for every prior over the box state. CDT two-boxes. Your action can’t cause the column to change.
FDT (Functional Decision Theory) optimizes over algorithm outputs:
\[a^* = \arg\max_{a} \; U(\text{world where all instances of my algorithm output } a)\]

If your algorithm outputs “one-box”, the predictor fills B: payoff $M = 1{,}000{,}000$. If your algorithm outputs “two-box”, the predictor empties B: payoff $K = 1{,}000$. FDT one-boxes. Your algorithm is the column selector.
Regret-based (online learning) asks what kind of adversary the predictor is:
- If oblivious (box contents fixed before you act): swap regret says two-box.
- If adaptive (box contents respond to your policy): policy regret says one-box.
The answer depends on how you classify the predictor.
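The EDT and CDT verdicts are easy to check numerically. A minimal sketch in plain Python (no dependencies; $M$, $K$, $q$ as defined above):

```python
# Newcomb's problem: M = big prize, K = small prize, q = predictor accuracy.
M, K, q = 1_000_000, 1_000, 0.99

# EDT conditions on the action: P(B full | one-box) = q, P(B full | two-box) = 1 - q.
edt_one = q * M                        # one-box: M if full, 0 if empty
edt_two = (1 - q) * (M + K) + q * K    # two-box: M + K if full, K if empty

# CDT fixes a prior r = P(B full) that the action cannot change.
def cdt_gap(r):
    """Two-box minus one-box expectation under a fixed prior r."""
    return (r * (M + K) + (1 - r) * K) - r * M

print(round(edt_one), round(edt_two))   # 990000 11000 -> EDT one-boxes
print(cdt_gap(0.3))                     # ~ +1000 for every prior -> CDT two-boxes
```

The CDT gap is exactly $+K$ for every prior $r$, which is the dominance argument in one line.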
The rest of the post takes each of these frameworks in turn, one lens at a time.
Lens 1: Observation vs. Intervention
The cleanest formalization comes from Judea Pearl’s causal framework. Pearl distinguishes between observing that something is the case and intervening to make it the case. In his notation, $P(Y \mid X = x)$ is the probability of $Y$ given that we observe $X = x$. But $P(Y \mid do(X = x))$ is the probability of $Y$ given that we set $X$ to $x$, surgically, from outside the system. These are not the same thing.
In Newcomb’s problem, the underlying causal graph is a classic confounding structure: your disposition $\theta$, the kind of decision-maker you are, is a common cause of both your action and the prediction ($\theta \to \text{action}$, $\theta \to \text{prediction}$). And the entire paradox lives in one inequality:
\[P(\text{B full} \mid \text{one-box}) \neq P(\text{B full} \mid do(\text{one-box}))\]

The left side is observational. If we observe you one-boxing, that’s strong evidence you’re the kind of person the predictor expected to one-box, so Box B is probably full. This gives:

\[E[U \mid a = 1] = 0.99 \times 1{,}000{,}000 = 990{,}000\]

The right side is interventional. If we intervene on your action, surgically setting it to one-box regardless of your disposition, the arrow from $\theta$ to your action is severed. The intervention carries no information about the prediction. Whatever is in Box B is already there. So taking both boxes gets you an extra \$1,000 regardless:

\[U(a = 2, p) > U(a = 1, p) \quad \text{for every fixed } p\]

Evidential decision theory (EDT) uses the left side and says: one-box. Causal decision theory (CDT) uses the right side and says: two-box. Pearl himself sided firmly with the interventionists, arguing that treating actions as mere observations, as EDT does, is precisely what generates the paradox. “The confusion between actions and acts,” he wrote in Causality, “has led to Newcomb’s paradox and other oddities.”
Consider a concrete instance: a workforce-scheduling optimizer trained on a company’s historical data. Can it score a new schedule against the logged outcomes? Not at all. The data had been generated under the old schedule. Demand patterns, no-show rates, even which shifts workers preferred were all shaped by the very scheduling policy the company wanted to replace. The optimizer was computing $P(\text{outcome} \mid \text{old schedule observed})$ when what it needed was $P(\text{outcome} \mid do(\text{new schedule}))$. The confounding structure is identical to Newcomb's problem: the policy that generated the data is a common cause of both the "action" (schedule) and the "outcome" (observed performance), and naively conditioning on the data mistakes correlation for causation.
But Pearl’s resolution has a cost. To see it, compare the expected payoffs of the two choices as the predictor’s accuracy $q$ varies:
The crossover is at $q \approx 0.5005$, essentially a coin flip. For any predictor even marginally better than chance, one-boxing yields a higher expected payoff. At $q = 0.99$, one-boxers average \$990,000 while two-boxers average \$11,000. Being “causally rational” makes you poorer. That’s uncomfortable to say the least.
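The crossover point follows in closed form by setting the two expected payoffs from the EDT computation equal:

\[qM = (1-q)(M+K) + qK \quad\Longleftrightarrow\quad 2qM = M + K \quad\Longleftrightarrow\quad q^{*} = \frac{1}{2} + \frac{K}{2M} = 0.5005,\]

so one-boxing wins in expectation as soon as $q > 0.5005$.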
Lens 2: Algorithmic Correlation and Self-Reference
The second lens comes from an unexpected direction: the logic of self-reference.
If the predictor is accurate because it modeled your decision process, say by running a simulation, analyzing your algorithm, or examining a sufficiently detailed model of your brain, then there are two instantiations of your decision-making procedure: the one in your head and the one the predictor evaluated. They aren’t causally connected (the predictor is done), but they are logically connected: they produce the same output because they implement the same function.
This is the insight behind Functional Decision Theory (FDT), proposed by Eliezer Yudkowsky and Nate Soares in 2017. FDT says: don’t ask what your action causes (CDT) or what it provides evidence for (EDT). Ask what the best output of your decision algorithm would be, across all instances where that algorithm is evaluated.
\[\text{FDT: } \arg\max_a \sum_{\text{instances } i} U_i(a)\]Your algorithm is evaluated in two places: your head (where the output determines your action) and the predictor’s model (where the output determined the prediction). If your algorithm outputs “one-box,” you get \$1,000,000. If it outputs “two-box,” you get \$1,000. FDT one-boxes.
This connects to the Self-Sampling Assumption (SSA) in anthropic reasoning. Bostrom’s SSA says you should reason as if you’re randomly selected from your reference class of observers. In Newcomb’s problem, the analogous move is reasoning as if you’re randomly selected from the set of all instantiations of your algorithm. The predictor’s model of you and the actual you are both running the same code: different instantiations, same function.
The structural parallel is direct:
| Anthropic reasoning | Newcomb’s problem |
|---|---|
| Which observer am I? | Which instance of my algorithm is this? |
| Reference class of observers | Reference class of algorithm evaluations |
| My observations are evidence about the world | My decision is evidence about the prediction |
This framing dissolves the paradox, but at the price of accepting a notion of “logical causation” that most decision theorists find metaphysically suspect. The predictor’s model and your brain don’t share any causal arrows. The correlation comes from mathematical identity, not physical interaction. Whether that should count as a reason to act is itself an open question.
The same pull toward one-boxing shows up in a simple Monte Carlo simulation. We initialize a population of 500 agents, each carrying a single parameter: its one-box probability $p \in [0,1]$, drawn uniformly at random. Each generation proceeds in three steps. First, every agent plays 20 rounds of Newcomb’s game against a predictor with accuracy $q$; the agent one-boxes with probability $p$ in each round, and its fitness is its average payoff. Second, we form the next generation by sampling 500 agents with replacement, proportional to fitness (roulette-wheel selection). Third, each offspring’s $p$ is perturbed by Gaussian noise with standard deviation 0.03, clipped to $[0,1]$. This is a textbook evolutionary algorithm: selection amplifies high-payoff strategies, mutation explores nearby variants. The key question is whether the population converges to one-boxing or two-boxing, and how fast.
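A compact implementation of this loop (a sketch with the parameters described above; the predictor’s accuracy is applied independently per round, and we run it here at $q = 0.9$ for 30 generations):

```python
import random

random.seed(1)
M, K = 1_000_000, 1_000

def play(p, q):
    """One round of Newcomb: agent one-boxes w.p. p; predictor is right w.p. q."""
    one = random.random() < p
    correct = random.random() < q
    full = one == correct              # B is full iff the predictor foresaw one-boxing
    return (M if full else 0) + (0 if one else K)

def generation(pop, q, rounds=20, sigma=0.03):
    fit = [sum(play(p, q) for _ in range(rounds)) / rounds for p in pop]
    # Roulette-wheel selection proportional to fitness, then Gaussian mutation.
    parents = random.choices(pop, weights=[f + 1e-9 for f in fit], k=len(pop))
    return [min(1.0, max(0.0, p + random.gauss(0, sigma))) for p in parents]

pop = [random.random() for _ in range(500)]
for _ in range(30):
    pop = generation(pop, q=0.9)
print(sum(pop) / len(pop))   # mean one-box probability drifts toward 1
```

The `1e-9` in the selection weights is just a guard against a degenerate all-zero-fitness generation; it does not affect the dynamics.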
When the predictor is 70% accurate or better, populations evolve toward near-pure one-boxing within ~20 generations ($p > 0.9$). Even at $q = 0.51$, barely better than a coin, there’s a slow drift upward. Evolution doesn’t care about causal philosophy; it follows the payoff gradient.
Lens 3: Data vs. Counterfactuals
The third lens is the most empirical. Suppose you’re not a philosopher but a statistician. You observe 10,000 people play Newcomb’s game. The data is unambiguous: one-boxers average \$990,000. Two-boxers average \$11,000. The observational conditional $E[\text{payoff} \mid \text{action} = a]$, the average payoff grouped by observed action, overwhelmingly favors one-boxing.
The two-boxer’s defense is a counterfactual: “Those one-boxers would have gotten \$1,001,000 if they had two-boxed.” This claim may be true. But it is unobservable. You never see the same person both one-box and two-box under the same prediction. This is the fundamental problem of causal inference: the impossibility of observing both potential outcomes for the same unit. In fact, the structure is identical to treatment effect estimation, where the same gap between observed and counterfactual outcomes makes naively comparing treated and untreated groups unreliable:
| Causal inference | Newcomb’s problem |
|---|---|
| Treatment assignment | Your action |
| Outcome under treatment | Payoff if one-box |
| Outcome under control | Payoff if two-box |
| Confounder | Disposition $\theta$ |
| Fundamental problem | Same person can’t do both |
The expected payoffs can be computed in closed form as functions of the predictor’s accuracy, and doing so shows exactly where the gap hides. The two-boxer’s counterfactual claim is that a one-boxer facing a full Box B would have earned \$1,001,000 by grabbing both. That claim is real, and invisible. At $q = 0.99$, the gap between the observed one-box average (\$990,000) and the counterfactual two-box average under the same predictions (\$991,000) is exactly \$1,000. You’d need to observe the same person doing both to see it. You can’t.
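Simulation makes the point vivid, because a simulation, unlike reality, can record both potential outcomes for every unit. A sketch, assuming deterministic dispositions and per-unit predictor accuracy $q$:

```python
import random

random.seed(2)
M, K, q, N = 1_000_000, 1_000, 0.99, 100_000

obs = {'one': [], 'two': []}
gap = []
for _ in range(N):
    theta = random.choice(['one', 'two'])             # disposition determines the action
    full = (theta == 'one') == (random.random() < q)  # predictor reads theta w.p. q
    pay_one = M if full else 0                        # potential outcome: one-box
    pay_two = pay_one + K                             # potential outcome: two-box
    obs[theta].append(pay_one if theta == 'one' else pay_two)
    gap.append(pay_two - pay_one)                     # the unobservable unit-level effect

print(sum(obs['one']) / len(obs['one']))  # ~ 990,000: observed one-boxer average
print(sum(obs['two']) / len(obs['two']))  # ~ 11,000: observed two-boxer average
print(sum(gap) / len(gap))                # exactly 1000 per unit, invisible in the data
```

Each unit’s causal effect of two-boxing is exactly $+K$, yet no dataset that records only the chosen action can ever exhibit it.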
If you’re a frequentist who follows the data, you one-box. If you’re a structural modeler who trusts your causal graph over the observed conditional, you two-box. Neither is obviously wrong. They’re optimizing different things: one follows the joint distribution the agent is embedded in, the other follows the causal structure the agent can manipulate.
The same choice appears throughout applied statistics:
- A hospital observes that patients who receive a new drug have better outcomes, but sicker patients were more likely to be prescribed it; the raw conditional favors the drug, while the causal effect might be zero.
- A company sees that employees who attend a leadership program get promoted faster, but the same ambition that drives attendance also drives promotion; conditioning on attendance overstates the program's value.
- A spam filter flags emails based on features that spammers also select for; blocking those emails changes what spammers send next, invalidating the very distribution the filter was trained on.
In each case the question is the same: do you trust the pattern in the data, or the causal story behind it?
Lens 4: Oblivious vs. Adaptive Adversaries
The fourth lens comes from online learning and the theory of multi-armed bandits. In this setting, a learner repeatedly chooses actions, and an adversary determines the losses. The central question is: what kind of adversary are you facing?
An oblivious adversary commits to the entire loss sequence before the game begins. It doesn’t see or react to the learner’s actions. Against an oblivious adversary, the natural performance measure is swap regret: “Holding the loss sequence fixed, could I have earned more by switching my action?” If yes, you have regret. The optimal response is straightforward: play the dominant action, since the environment won’t change.
An adaptive adversary observes the learner’s policy (or past behavior) and adjusts losses accordingly. Against an adaptive adversary, swap regret is misleading: the losses would have been different under a different policy. The right measure becomes policy regret: “Would a different policy have produced better outcomes, accounting for how the adversary would have responded to that policy?”
Linking this back to Newcomb’s problem:
| Online learning | Newcomb’s problem |
|---|---|
| Learner’s policy $\pi$ | Your decision algorithm |
| Adversary’s loss sequence | Box contents |
| Oblivious adversary | CDT: contents are fixed |
| Adaptive adversary | EDT/FDT: contents respond to your policy |
| Swap regret | “I’d get \$1,000 more by switching to two-box” |
| Policy regret | “A two-box policy leads to an empty Box B” |
CDT treats the predictor as oblivious. The money is placed, the game state is fixed, and two-boxing is the dominant action, exactly the swap-regret argument. The two-boxer says: “Holding Box B fixed, I always get \$1,000 more.”
EDT and FDT treat the predictor as adaptive. The predictor responded to your algorithm, so the box contents are a function of your policy. Switching from a one-box policy to a two-box policy doesn’t just change your action; it changes the adversary’s response. The one-boxer says: “A two-box policy faces an empty Box B.”
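In this single-shot reading, the two regret computations can be sketched in a few lines of expected-value arithmetic ($M$, $K$, $q$ as before; the adaptive adversary is modeled by making $P(\text{B full})$ a function of the policy):

```python
M, K, q = 1_000_000, 1_000, 0.99

def payoff(action, full):
    return (M if full else 0) + (K if action == 'two' else 0)

def contents(policy):
    """Adaptive adversary: the chance that B is full responds to the policy."""
    return q if policy == 'one' else 1 - q   # P(B full)

def expected(action, p_full):
    return p_full * payoff(action, True) + (1 - p_full) * payoff(action, False)

p_full = contents('one')                     # you played the one-box policy
swap_regret = expected('two', p_full) - expected('one', p_full)
policy_regret = expected('two', contents('two')) - expected('one', contents('one'))

print(swap_regret)     # +1000: holding the contents fixed, two-boxing looks better
print(policy_regret)   # ~ -979,000: once the adversary co-adapts, it is far worse
```

Swap regret holds the adversary’s move fixed; policy regret recomputes it under the alternative policy. The sign flip between the two numbers is the entire Newcomb debate in miniature.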
The precise role of anticipation, i.e., the adversary’s ability to foresee not just the learner’s policy but also its realized randomness, has been studied in detail in the online learning setting (see e.g. Pokutta and Xu, 2021, where we studied this in the context of robust optimization). When the adversary can anticipate the learner’s random coin flips, even randomized strategies lose their hedging value. The Newcomb predictor, with 99% accuracy, sits squarely in this regime: it anticipates not just your policy but your execution.
This reframing is clarifying because the online learning community has precise theorems about when each regret notion applies. Against a truly oblivious adversary, swap regret is tight and achievable. Against an adaptive adversary, minimizing swap regret can be catastrophically wrong; you need policy regret, which accounts for how the environment co-adapts. The entire Newcomb debate reduces to a classification question: is the predictor oblivious or adaptive?
The answer, of course, depends on your ontology. If you believe the box contents are a physical fact determined before your choice (oblivious), CDT follows. If you believe the predictor adapted to your decision algorithm and the contents are therefore policy-dependent (adaptive), one-boxing follows. The bandit framework doesn’t resolve the paradox, but it reveals its skeleton: the same structural ambiguity that separates oblivious from adaptive adversaries in online learning separates two-boxers from one-boxers in Newcomb’s problem.
The Mixed Strategy: Flipping a Coin
The first three lenses assume a deterministic chooser (the bandit lens already allows randomized policies, but there the focus was on regret notions, not on what randomization does to the predictor). So what if you flip a biased coin: one-box with probability $p$, two-box with probability $1-p$?
This breaks the predictor. A predictor that’s 99% accurate against deterministic strategies can’t beat $\max(p, 1-p)$ against a genuinely random (but biased) coin (i.e., private randomness as the boxes have been set up already). The predictor’s best response is simple: predict “one-box” (and fill Box B) if $p \geq 0.5$, predict “two-box” (and leave Box B empty) if $p < 0.5$. Under this best response, the predictor’s effective accuracy is
\[q^*(p) = \max(p, 1-p)\]

which hits its minimum of $50\%$ at $p = 0.5$: a fair coin reduces the “99% accurate” predictor to a coin flip. The expected payoff works out to:

\[E[\text{payoff}] = \begin{cases} M + (1-p) \cdot K & \text{if } p \geq 0.5 \text{ (predictor fills B)} \\ (1-p) \cdot K & \text{if } p < 0.5 \text{ (predictor empties B)} \end{cases}\]

This creates a phase transition, a million-dollar cliff:
At $p = 0.49$, the predictor leaves Box B empty and you average \$510. At $p = 0.51$, the predictor fills Box B and you average \$1,000,490. A 2% shift in coin bias produces a \$999,980 jump in expected payoff.
The optimal mixed strategy is $p$ just above $0.5$: one-box slightly more than half the time. In the limit $p \to 0.5^{+}$ this earns $M + K/2 \approx \$1{,}000{,}500$, beating pure one-boxing (\$1,000,000) by about \$500; at $p = 0.51$ the expected payoff is \$1,000,490. You get the million (the predictor fills Box B because you’re majority one-box) while occasionally grabbing the extra \$1,000 when the coin lands on two-box; it is a bit like reaping the extra \$1,000 from an occasional counterfactual switch.
| Strategy | Expected payoff |
|---|---|
| Pure two-box ($p = 0$) | \$1,000 |
| Pure one-box ($p = 1$) | \$1,000,000 |
| Optimal mixed ($p \approx 0.51$) | \$1,000,490 |
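The table’s numbers follow directly from the predictor’s best response (a sketch; ties at exactly $p = 0.5$ go to filling Box B, as in the text):

```python
M, K = 1_000_000, 1_000

def expected_payoff(p):
    """Predictor best-responds to the coin bias: fills Box B iff p >= 0.5."""
    full = p >= 0.5
    return (M if full else 0) + (1 - p) * K   # collect K on every two-box round

for p in (0.0, 0.49, 0.51, 1.0):
    print(p, expected_payoff(p))
# 0.0 -> 1000, 0.49 -> 510, 0.51 -> ~1,000,490, 1.0 -> 1,000,000
```

The discontinuity at $p = 0.5$ is the million-dollar cliff: the payoff function is piecewise linear in $p$, with a jump of almost exactly $M$ at the threshold.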
But the optimal mixed strategy is fragile in a way that pure one-boxing is not. It depends on the predictor having a sharp threshold at $p = 0.5$ and not being able to anticipate the randomization itself. A predictor that models coin-flipping agents might demand $p > 0.9$ (i.e., the predictor’s stated accuracy also determines which $p$ are admissible) before filling Box B, and then the optimal response shifts to $p$ just above 0.9, recovering less of the two-boxing bonus. In the limit where the predictor demands certainty (i.e., full anticipation of the random outcome of the coin flip), you’re back to pure one-boxing.
In some sense, mixed strategies reveal that Newcomb’s problem is really a game between the chooser and the predictor, and the payoff landscape has the structure of a game-theoretic discontinuity. In particular, the predictor isn’t just a feature of the environment; it’s a player.
The Fixed Point
There’s a final mathematical observation that cuts across all four lenses. The predictor’s accuracy creates a self-referential loop: the prediction depends on your reasoning, which depends on what you expect the predictor predicted, and so on… At equilibrium, this must be a fixed point: your strategy $\sigma$ and the predictor’s model $\hat{\sigma}$ must satisfy $\hat{\sigma} = \sigma$.
At any such fixed point, the “I’ll trick the predictor” intuition behind two-boxing is unstable. If your strategy is to two-box, the predictor knows, and Box B is empty. If your strategy is to one-box, the predictor knows, and Box B is full. You can’t deviate profitably because the prediction already accounts for your reasoning about the prediction. The fixed point is self-enforcing.
There is a useful game-theoretic way to see this. In the original setup, the move order is: predictor fills boxes (move 1), then you choose (move 2). Two-boxing exploits last-mover advantage: you observe a fixed game state and pick the dominant action. But as the predictor’s accuracy $q \to 1$, the effective move order inverts. A near-perfect predictor reacts to your strategy as if it moved after you, not before. The temporal sequence stays the same (boxes first, choice second), but the strategic sequence flips: the predictor’s move is now essentially a best response to yours. At $q = 1$ the game is equivalent to one where you commit to a strategy first and the predictor fills the boxes second. In that game, one-boxing is obviously correct and two-boxing is obviously foolish.
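At $q = 1$ the commitment game is small enough to enumerate outright (a sketch of the fixed-point argument; the predictor’s move is modeled as a best response to the committed strategy):

```python
M, K = 1_000_000, 1_000

def commitment_payoff(strategy):
    """q = 1: the predictor's move is effectively a best response to your commitment."""
    full = (strategy == 'one')            # a perfect predictor fills B iff you one-box
    return (M if full else 0) + (K if strategy == 'two' else 0)

for s in ('one', 'two'):
    print(s, commitment_payoff(s))        # one -> 1,000,000   two -> 1,000
```

Both strategies are self-enforcing fixed points (the prediction matches the strategy in each case), but only one of them pays.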
This is why Newcomb’s problem feels so different from ordinary strategic interaction. In a standard game, you choose against a fixed opponent. In Newcomb’s problem, you choose against a mirror. The four lenses, causal, algorithmic, empirical, adversarial, are four ways of formalizing what it means to make a decision when the universe has already priced in the fact that you’re going to make it.
References
[N] Nozick, R. (1969). Newcomb’s Problem and Two Principles of Choice. In N. Rescher et al. (eds.), Essays in Honor of Carl G. Hempel, pp. 114-146. D. Reidel, Dordrecht.
[P] Pearl, J. (2009). Causality: Models, Reasoning, and Inference. 2nd ed. Cambridge University Press.
[YS] Yudkowsky, E. & Soares, N. (2017). Functional Decision Theory: A New Theory of Instrumental Rationality. arXiv:1710.05060
[B] Bostrom, N. (2002). Anthropic Bias: Observation Selection Effects in Science and Philosophy. Routledge.
[CBL] Cesa-Bianchi, N. & Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press.
[PX] Pokutta, S. & Xu, H. (2021). Adversaries in Online Learning Revisited: with applications in Robust Optimization and Adversarial training. arXiv:2101.11443
[GH] Gibbard, A. & Harper, W. (1978). Counterfactuals and Two Kinds of Expected Utility. In C.A. Hooker, J.J. Leach & E.F. McClennen (eds.), Foundations and Applications of Decision Theory, Vol. II, pp. 125-162. D. Reidel, Dordrecht.
[SEP] Stanford Encyclopedia of Philosophy. Causal Decision Theory.
[PP] PhilPeople. 2020 PhilPapers Survey: Newcomb’s Problem.
[O] Oesterheld, C. (2017). A Survey of Polls on Newcomb’s Problem.
[W] Wikipedia. Newcomb’s Problem.