<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="http://www.pokutta.com/blog/feed.xml" rel="self" type="application/atom+xml" /><link href="http://www.pokutta.com/blog/" rel="alternate" type="text/html" /><updated>2026-03-31T09:39:40+02:00</updated><id>http://www.pokutta.com/blog/feed.xml</id><title type="html">One trivial observation at a time</title><subtitle>Everything Mathematics, Optimization, Machine Learning, and Artificial Intelligence</subtitle><entry><title type="html">The Agentic Researcher</title><link href="http://www.pokutta.com/blog/agentic-researcher/" rel="alternate" type="text/html" title="The Agentic Researcher" /><published>2026-03-18T00:00:00+01:00</published><updated>2026-03-18T00:00:00+01:00</updated><id>http://www.pokutta.com/blog/agentic-researcher</id><content type="html" xml:base="http://www.pokutta.com/blog/agentic-researcher/"><![CDATA[<p><em>TL;DR: This is a summary of our recent paper <a href="https://arxiv.org/abs/2603.15914">The Agentic Researcher: A Practical Guide to AI-Assisted Research in Mathematics and Machine Learning</a> by <a href="https://maxzimmer.org/">Max Zimmer</a>, <a href="https://www.pelleriti.org/">Nico Pelleriti</a>, <a href="https://christopheroux.de/">Christophe Roux</a>, and <a href="https://www.pokutta.com/">Sebastian Pokutta</a> augmented with some personal perspectives and thoughts. The main point is not that AI can now “do research” in some vague science-fiction sense. The more useful point is that, with the right workflow, general CLI coding agents can already act like research associates: they can write proofs, formulate conjectures, implement ideas, run experiments, document failures, verify intermediate claims, and keep going for hours, while the researcher remains responsible for idea generation, creativity, direction, judgment, and final verification.</em></p>

<!--more-->

<style>
.callout{border-left:4px solid #2563eb;border-right:4px solid #2563eb;background:#f5f7ff;padding:12px 16px;margin:1em 0}
</style>

<h2 id="introduction">Introduction</h2>

<p>Over the last year, the discussion around AI and research has become increasingly confused. We jump between olympiad medals, benchmark wins, flashy demos, and vague claims about autonomous science and “fully-automatic end-to-end research pipelines”. Many of these claims do not hold up or do not generalize; and, most importantly, in my book this should not even be the point. The useful question for me is a different, more practical one:</p>

<blockquote>
  <p>If I am an actual researcher in, say, Math or ML, how can I use these systems in my daily work?</p>
</blockquote>

<p>That is the question our paper tries to (partially) answer. In some sense it is an amalgamation of research efforts, best practices, and approaches that we have collected over roughly the last 1.5 years in the MATH+ project “Agentic AI in Mathematics”, although what we learned goes significantly beyond mathematics.</p>

<p>The backdrop is, of course, quite remarkable with several high-profile achievements. Systems such as <a href="https://www.nature.com/articles/s41586-023-06747-5">AlphaGeometry</a> [AG], <a href="https://www.nature.com/articles/s41586-025-09833-y">AlphaProof</a> [AP], <a href="https://arxiv.org/abs/2506.13131v1">AlphaEvolve</a> [AE], and more recently <a href="https://arxiv.org/abs/2602.10177">Aletheia</a> [ALET] have pushed AI much further into mathematical reasoning and discovery than most people expected only a short while ago. The picture is broader than just these headline systems and includes AlphaGeometry2 [AG2], Aristotle [ARI], Mathematical Exploration and Discovery at Scale [MEDS], and the autonomous Aletheia follow-up on First Proof [ALET2]. On the benchmark side, there is now a growing ecosystem around research-level evaluation, e.g., <a href="https://arxiv.org/abs/2602.05192">First Proof</a> [FP], <a href="https://arxiv.org/abs/2411.04872">FrontierMath</a> [FM], and <a href="https://arxiv.org/abs/2505.12575">RealMath</a> [RM]. At the same time, there is a parallel line of work around agentic experimentation and end-to-end “scientific” pipelines, such as <a href="https://arxiv.org/abs/2408.06292">The AI Scientist</a> [AIS], <a href="https://arxiv.org/abs/2410.05080">ScienceAgentBench</a> [SAB], Karpathy’s <a href="https://github.com/karpathy/autoresearch">autoresearch</a> [AR], and, in a somewhat different flavor, FunSearch [FS]. There is also now a small but growing literature on mathematicians and researchers using these systems in practice, e.g., Avigad [AV], Henkel [H], Dobriban [D], Liu et al. [LCP], and Carbone [C].</p>

<p>All of this is very interesting and inspiring, but the main question remains: How do we <em>unlock</em> “AI for research” in a <em>practical</em> way that supports day-to-day research activities? Most researchers do not want to build a giant bespoke<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> discovery system from scratch. They want to know which tools are already useful <em>today</em>, where the actual leverage is, and which guardrails are needed so that the system does not quietly produce nonsense while sounding very confident. Moreover, they need support in <em>open-ended</em> research endeavors and not merely in IMO-style problems, which have “one correct answer”. In some sense, the core claim of our paper is simple:</p>

<div class="callout">
<strong>Takeaway.</strong> The real bottleneck is not “having an AI”, but having a disciplined and repeatable workflow that turns a strong reasoning LLM into something closer to a research associate.
</div>

<p><strong>NB.</strong> I would like to stress one point here, as this is often confused. We neither claim nor aim for fully-automatic end-to-end research. On the contrary, we look at the use of AI as augmentation. Not oracle. Not autopilot. Not replacement. A research associate. Moreover, we are interested in AI systems for <em>open-ended</em> research questions rather than benchmark problems.</p>

<h2 id="refutation-and-verification">Refutation and verification</h2>

<p>Our approach operationalizes (or better: tries to operationalize) a very old and, in my view, still very fundamental scientific principle: most attempts are wrong. That sounds banal, but AI systems used “straight-up” tend to produce enough wrong things fast enough (i.e., the failure rate went down but the speed went up, so that the inter-arrival rate of blow-ups is roughly constant) that trust in the system erodes. Moreover, there is a total lack of calibration: the AI typically does not know where it is unsure and where it could be wrong; in an optimal world the system would attach a well-calibrated probability of correctness to its output, but that seems out of reach, at least at the moment.</p>

<p>This has real consequences for how one should use these systems. If an agent only writes polished explanations, tidy derivations, or plausible-looking code, then it is mostly helping you produce surface area: it looks good but has little substance. What actually matters in research is the ability to rule out bad ideas, broken proof strategies, buggy implementations, and misleading empirical wins quickly and cleanly: we need conjectures, refutation, and falsifiability, which brings us straight to Popper (see the <a href="https://plato.stanford.edu/entries/popper/">Stanford Encyclopedia of Philosophy entry on Karl Popper</a> [POP]) and his philosophy (see Britannica’s discussion of <a href="https://www.britannica.com/topic/philosophy-of-science/Eliminativism-and-falsification">eliminativism and falsification</a> [FAL]). No worries, we will skip the philosophy discourse today. The key point is that this is not philosophical indulgence; it naturally induces a working research habit: propose something, expose it to failure, and only keep it around provisionally if it survives. What survives might still be wrong, but much less so, and that is where the Human-AI Co-Creativity [HP] takes place.</p>

<p>In practice, this means the agent needs tools for both <em>refutation</em> and <em>verification</em>. That can be a numerical sanity check, a brute-force search for a counterexample, a symbolic derivation, a randomized stress test, a Julia script, a Python script, or some small utility wired into a simulator, etc. Today this is more straightforward than ever: there is a wealth of strong, project-local package managers, such as <code class="language-plaintext highlighter-rouge">bun</code>, <code class="language-plaintext highlighter-rouge">npm</code>, <code class="language-plaintext highlighter-rouge">uv</code>, <code class="language-plaintext highlighter-rouge">cargo</code>, and <code class="language-plaintext highlighter-rouge">Pkg</code>, which make it easy for CLI agents to write small refutation and verification scripts against their own reasoning in order to harden it. Once the agent can actually run those checks, we are no longer asking it merely to <em>talk</em> about a hypothesis. We are asking it to try to break it.</p>
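<p>To make this concrete, here is the kind of throwaway refutation script an agent might write. The conjecture and the script are purely illustrative (they are not from the paper): a brute-force search for a counterexample to the classical claim that Euler's polynomial $n^2 + n + 41$ is prime for all $n \geq 0$.</p>

```python
# Refutation by brute force: Euler's polynomial n^2 + n + 41 yields primes
# for n = 0..39, but the claim "prime for all n" is false. An agent-style
# refutation script simply searches for the first counterexample.

def is_prime(m: int) -> bool:
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

def find_counterexample(limit: int = 1000):
    for n in range(limit):
        value = n * n + n + 41
        if not is_prime(value):
            return n, value  # conjecture refuted
    return None  # survived the stress test (only provisionally!)

print(find_counterexample())  # → (40, 1681), i.e., 40^2 + 40 + 41 = 41^2
```

<p>The point is not the number theory; it is that a twenty-line script settles in milliseconds what a purely verbal argument would only make plausible.</p>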

<p>This makes a fundamental difference: the consistency and correctness of the output change dramatically. Verification catches arithmetic mistakes, implementation bugs, and overclaimed results. Refutation-oriented checks kill wrong turns early and force the system to document why an idea failed. In my experience, this is exactly where these agents start becoming useful for real research: not when they sound convincing, but when they can help eliminate what is false.</p>

<h2 id="five-levels-of-ai-integration">Five levels of AI integration</h2>

<p>To put things into perspective, inspired by the taxonomy in [HP], we distinguish five levels of AI integration, starting at zero, “research without AI”, which is still the default mode, probably one of the most robust modes out there, and the all-important baseline.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: right">Level</th>
      <th>Name</th>
      <th>What it looks like</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">0</td>
      <td>Classical</td>
      <td>Normal research without AI: LaTeX, code, math software, papers, whiteboards, coffee, despair, and hopefully eventually progress.</td>
    </tr>
    <tr>
      <td style="text-align: right">1</td>
      <td>Consultant</td>
      <td>The chatbot regime: explanations, brainstorming, literature pointers, debugging, and other targeted questions. Useful, but episodic and reactive.</td>
    </tr>
    <tr>
      <td style="text-align: right">2</td>
      <td>Typist</td>
      <td>The AI writes code or text for you, but does not really execute or iterate. Think completion, drafting, or small prompt-based generation.</td>
    </tr>
    <tr>
      <td style="text-align: right">3</td>
      <td>Collaborator</td>
      <td>CLI coding agents such as <a href="https://code.claude.com/docs/en/overview">Claude Code</a> [CC], <a href="https://openai.com/codex/">Codex CLI</a> [CX], <a href="https://github.com/anomalyco/opencode">OpenCode</a> [OC], or <a href="https://geminicli.com/">Gemini CLI</a> [GC] can read files, edit them, run code, inspect outputs, and iterate inside a project context. At this level, the human says <em>what</em> to do and the agent handles much of <em>how</em> to do it.</td>
    </tr>
    <tr>
      <td style="text-align: right">4</td>
      <td>Research associate</td>
      <td>The researcher provides the problem, context, constraints, codebase, prior attempts, and evaluation criteria. The agent then runs an actual research loop: <code class="language-plaintext highlighter-rouge">explore -&gt; plan -&gt; implement -&gt; evaluate -&gt; analyze -&gt; record -&gt; commit -&gt; iterate</code>, with the human stepping in periodically for steering, review, and correction.</td>
    </tr>
  </tbody>
</table>

<p>The key difference from Level 3 is that the agent does not stop after every experiment to ask what to do next. It continues autonomously within a carefully bounded workflow, and the human intervenes periodically for steering, review, and correction. This is exactly the boundary point where “AI as tool” becomes “AI as research associate”.</p>
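<p>The loop itself need not be complicated. A deliberately minimal, illustrative skeleton of such a bounded loop might look as follows; everything here, including the file name and the stub experiment, is a stand-in, since the real loop is driven by the CLI agent and its persistent instructions:</p>

```python
import json
import pathlib

REPORT = pathlib.Path("report.log")   # stand-in for report.tex / TODO.md

def run_experiment(params):
    # Stub: in a real session the agent edits code, launches jobs,
    # and parses their output here.
    return {"params": params, "metric": sum(params.values())}

def record(entry):
    # External memory: every attempt, including failures, is appended.
    with REPORT.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def research_loop(candidates, budget=3):
    best = None
    for step, params in enumerate(candidates[:budget]):
        result = run_experiment(params)      # implement + evaluate
        verified = result["metric"] >= 0     # analyze / sanity check
        record({"step": step, **result, "verified": verified})
        # A real loop would also `git commit` the artifacts here.
        if verified and (best is None or result["metric"] > best["metric"]):
            best = result
    return best                               # the human reviews at the end

best = research_loop([{"lr": 1}, {"lr": 2}, {"lr": 3}])
```

<p>The essential features are the explicit budget, the append-only record of every attempt, and the fact that the human only enters the loop at review time rather than after every step.</p>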

<div class="center">
  <img src="http://www.pokutta.com/blog/assets/agentic-researcher/agentic-inputs-overview.svg" alt="Overview of the main inputs to an agentic research setup" style="width:99%" />
  <p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 1.</strong> A useful way to think about the research-associate setup: the agent gets a concrete research question, the relevant tools and data, and the prior work or domain knowledge needed to operate in context.</p>

<h2 id="the-actual-contribution-is-the-workflow">The actual contribution is the workflow</h2>

<p>While the taxonomy is useful, the important part of the paper is the actual workflow. The paper’s main claim is that one does <em>not</em> need a highly specialized custom-built system to get meaningful agentic research behavior. Instead, one can take strong existing CLI agents and wrap them in a disciplined environment:</p>

<ul>
  <li>persistent instructions</li>
  <li>sandboxed execution</li>
  <li>a structured report</li>
  <li>a live <code class="language-plaintext highlighter-rouge">TODO.md</code></li>
  <li>Git-based experiment tracking</li>
  <li>a set of explicit methodological rules</li>
</ul>

<p>The result is a system that is much less magical than the hype suggests, but also much more useful in practice. It just works. And it is highly customizable to the research needs of the individual researcher. It is <em>your</em> tool.</p>

<p>The workflow in the paper runs inside a sandboxed container, keeps all progress in inspectable artifacts such as <code class="language-plaintext highlighter-rouge">report.tex</code> (which I, for example, often have open simultaneously in VS Code so I can regularly compile the PDF and preview it) and <code class="language-plaintext highlighter-rouge">TODO.md</code>, and can scale from a laptop to multi-node Slurm experiments. The longest autonomous session reported in the paper ran for more than 20 hours; we have had much longer sessions with repeated calls to downstream tools for subproblems. That sounds dramatic, but the important part is not the wall-clock number. The important part is that the agent keeps a disciplined experimental loop alive over long horizons without silently discarding context, changing metrics, or forgetting what it already tried. I would go one step further: this kind of “external memory” is one of the main reasons long sessions are usable at all. In other words: the secret sauce is not some mystical “agent architecture”. It is mostly methodology (e.g., refutation and verification) and workflow (e.g., persistence); granted, for high-profile applications we do rely on SOTA LLMs (current favorites being Claude Opus 4.6 and GPT-5.4).</p>

<div class="center">
  <img src="http://www.pokutta.com/blog/assets/agentic-researcher/agentic-workflow-overview.png" alt="Overview of the agentic research workflow" style="width:99%" />
  <p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 2.</strong> The workflow view from the paper: persistent instructions, a CLI agent running inside a sandbox, and an explicit experiment loop with reporting, verification, and Git-based memory.</p>

<h2 id="the-commandments">The commandments</h2>

<p>We have encoded the “rules” that guide the system’s behavior as “ten commandments”, which sounds a bit theatrical but follows the <em>language escalation paradigm</em> to harden rule-following; yes, it does make a difference. The underlying rule of thumb is: if a behavior matters, make it explicit. Do not rely on vague hopes like “the model will probably know what I mean”. I will not list all ten in full here; instead, I group them by overarching goal. In the following, the term <em>experiment</em> is used broadly to mean one agentic-loop iteration, e.g., one proof attempt, one numerical experiment, one design attempt, etc.</p>

<p><strong>1. Integrity and trust.</strong></p>

<p>The first group is basic research hygiene:</p>

<ul>
  <li>do not promise work and then quietly skip it</li>
  <li>do not change the evaluation because the original setup looks inconvenient</li>
  <li>do not invent bibliography</li>
</ul>

<p>This sounds trivial until you actually let these systems run for hours. Then it stops sounding trivial very quickly.</p>

<p><strong>2. Autonomy and efficiency.</strong></p>

<p>The second group is about making the agent actually useful rather than merely interactive:</p>

<ul>
  <li>finish autonomous work before reporting back</li>
  <li>treat crashes as bugs unless proven otherwise</li>
</ul>

<p>This addresses one of the most common failure modes: the agent stops too early, asks for permission too often, or mistakes implementation bugs for failed research ideas. “Dangerously skipping all permissions” does not solve your problem here; you need to “unleash” the agent.</p>

<p><strong>3. Scientific rigor.</strong></p>

<p>The third group is the most important “research-specific” one:</p>

<ul>
  <li>change one variable per experiment</li>
  <li>evaluate in tiers</li>
  <li>bound the best-case improvement before celebrating a heuristic</li>
</ul>

<p>The “one variable per experiment” rule is particularly powerful. It sounds almost embarrassingly obvious, but it prevents a huge amount of pseudo-progress and conflation of factors. If two things change and the metric improves, you often do not know what actually helped.</p>

<p>The staged evaluation rule is equally important. A fast sanity check is for catching bugs, not for drawing conclusions. This is exactly the kind of thing researchers understand intuitively and agents do not unless told very explicitly.</p>
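<p>A tiered evaluation harness can be as simple as the following sketch. The tier names, thresholds, and the toy “experiment” are all illustrative, not from the paper: the cheap smoke tier exists to catch bugs, and only the final tier is allowed to support a scientific claim.</p>

```python
# Tiered evaluation: an experiment is promoted to the next (more
# expensive) tier only if it passes the current one, and only
# final-tier results may be used to draw conclusions.

def smoke(run):    # seconds: does it run at all, are outputs finite?
    out = run(n=10)
    return all(abs(x) < 1e9 for x in out)

def small(run):    # minutes: does the trend point the right way?
    out = run(n=1000)
    return out[-1] < out[0]

def full(run):     # hours: the only tier that supports a claim
    out = run(n=100_000)
    return out[-1] < 0.1 * out[0]

TIERS = [("smoke", smoke), ("small", small), ("full", full)]

def evaluate(run):
    passed = []
    for name, check in TIERS:
        if not check(run):
            return {"passed": passed, "failed_at": name, "claim_allowed": False}
        passed.append(name)
    return {"passed": passed, "failed_at": None, "claim_allowed": True}

# Toy "experiments": a loss curve decaying like 1/t, and a buggy run
# producing NaNs, which should die in the smoke tier.
decaying = lambda n: [1.0 / (t + 1) for t in range(n)]
buggy = lambda n: [float("nan")] * n
```

<p>The design choice worth copying is that the harness, not the agent's prose, decides whether a claim is allowed.</p>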

<p><strong>4. Documentation and reproducibility.</strong></p>

<p>The last group is what makes the whole system robust over long sessions:</p>

<ul>
  <li>record everything</li>
  <li>verify before claiming</li>
</ul>

<p>This means that failed experiments go into the report as well, not only the good-looking ones. It also means that a claim is not “explained” into existence but should be stress-tested, checked numerically, or challenged through a verification script whenever possible. Everything not fully “verified” is considered “unverified” and “unproven”.</p>

<p>The three meta-principles behind this go back to our refutation discussion at the beginning:</p>

<ol>
  <li>explicit over implicit</li>
  <li>falsifiable over aspirational</li>
  <li>failure-driven over theory-driven</li>
</ol>

<p>“Be rigorous” is not a usable instruction. “Change exactly one variable per experiment” is. Similarly, “failure-driven over theory-driven” ensures that things are constantly stress-tested and hardened, rather than producing page after page of proof attempts where the mistake sits in the assumptions or right at the beginning.</p>

<h2 id="three-sample-case-studies">Three sample case studies</h2>

<p>The paper contains six case studies across ML and mathematics. I will only discuss three here, because they already show most of the pattern; for the others, check the paper. All figures below marked with <em>[sic]</em> are <em>verbatim, unaltered output</em> from the agent.</p>

<h3 id="1-optimizer-exploration-for-llm-pretraining">1. Optimizer exploration for LLM pretraining</h3>

<p>The first case study starts from a concrete and very ML-style question. <a href="https://arxiv.org/abs/1711.05101">AdamW</a> uses two extra buffers per parameter, while <a href="https://kellerjordan.github.io/posts/muon">Muon</a> only uses one. The natural question is whether the spare memory budget can be used to make Muon better.</p>

<p>This is exactly the kind of problem where our agentic workflow shines: there is a nontrivial design space, the experiments are expensive, and the researcher mostly wants disciplined exploration rather than one-shot “creativity”. For this case study, the agent ran more than 40 experiments, changing one thing at a time, and discovered two largely independent improvements:</p>

<ul>
  <li>a normalization before orthogonalization</li>
  <li>weight decay for Muon’s matrix parameters</li>
</ul>

<p>The combined result is about a <em>5% improvement in validation perplexity over Muon</em> and about <em>8% over AdamW</em> at the same memory budget, i.e., basically for free. Even more interestingly, the agent also found a nearly matching zero-overhead variant. This application demonstrates what Level 4 autonomy looks like in a compute-heavy setting: multiple GPUs, long runs, careful ablations, literature checks, and a report that keeps all of this coherent over a session lasting more than 20 hours.</p>
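<p>To give a flavor of the design space (this is <em>not</em> the paper's actual recipe; the exact variants are in the paper), here is a sketch of a Muon-style matrix update with the two kinds of knobs the agent explored: a normalization of the gradient before Newton-Schulz orthogonalization, and decoupled weight decay on the matrix parameters. The quintic Newton-Schulz coefficients are the standard ones from the public Muon implementation; everything else, including function names, is illustrative.</p>

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximate orthogonalization of G via the quintic Newton-Schulz
    iteration used by Muon (coefficients from the reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # pre-normalization so NS converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                           # iterate on the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_style_step(W, G, lr=0.02, weight_decay=0.01):
    """One illustrative update: decoupled weight decay plus an
    orthogonalized gradient direction (momentum omitted for brevity)."""
    O = newton_schulz_orth(G)
    return W * (1.0 - lr * weight_decay) - lr * O
```

<p>The point of the sketch is how small these knobs are as code changes, which is exactly why disciplined one-variable-at-a-time exploration over many runs is the right tool here.</p>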

<div class="center">
  <img src="http://www.pokutta.com/blog/assets/agentic-researcher/agentic-cli-session.png" alt="Terminal view of a long-running agentic research session" style="width:99%" />
  <p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 3.</strong> A terminal view from a long-running session. Multiple training jobs, timed checks, and verification tasks stay alive in parallel while the agent keeps an inspectable record of what is running.</p>

<div class="center">
  <img src="http://www.pokutta.com/blog/assets/agentic-researcher/paper_final_perplexity.svg" alt="Validation perplexity for Muon variants and baselines" style="width:99%" />
  <p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 4.</strong> [sic] Final validation perplexity for the optimizer exploration case study. Lower is better; the best learned variant improves over both Muon and AdamW at the same memory budget.</p>

<h3 id="2-weight-reconstruction-in-llm-pruning">2. Weight reconstruction in LLM pruning</h3>

<p>The second case study, while still in the LLM space, highlights a different aspect that we encountered quite regularly: <em>serendipity</em>. The original task was to fix a broken pruning-mask idea. The agent eventually concluded that the original approach was mathematically flawed. But instead of just discarding the project and moving on, it analyzed <em>why</em> the approach failed. During that analysis, it observed a strong imbalance in post-layer activation distortion after pruning and proposed a very simple reconstruction step to compensate for it.</p>

<p>The resulting method is almost comically lightweight:</p>

<ul>
  <li>about 10 lines of code</li>
  <li>less than 1% computational overhead</li>
  <li>no hyperparameter tuning</li>
</ul>

<p>Yet it reduces perplexity by <em>18–50%</em> across multiple model scales, architectures, and pruning methods, and captures about <em>92% of the gain of a least-squares oracle reconstruction</em>.</p>
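<p>The actual method is in the paper. Purely to illustrate what a lightweight activation-driven reconstruction <em>can</em> look like, here is a sketch that rescales each output channel of a pruned linear layer by the scalar that best matches the dense layer's activations on calibration data in a least-squares sense; per-channel scalars only, no hyperparameters, and all names are illustrative:</p>

```python
import numpy as np

def rescale_pruned(W_dense, W_pruned, X_calib, eps=1e-12):
    """Per-output-channel least-squares rescaling of a pruned weight matrix.

    W_dense, W_pruned: (d_out, d_in); X_calib: (n_samples, d_in).
    Each row of W_pruned is scaled by the scalar s minimizing
    ||X @ w_dense - s * X @ w_pruned||^2.
    """
    A_dense = X_calib @ W_dense.T     # (n, d_out) dense activations
    A_pruned = X_calib @ W_pruned.T   # (n, d_out) pruned activations
    scales = (A_dense * A_pruned).sum(0) / ((A_pruned ** 2).sum(0) + eps)
    return W_pruned * scales[:, None]

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 32))
mask = rng.random(W.shape) > 0.5          # 50% unstructured pruning
Wp = W * mask
X = rng.standard_normal((256, 32))
Wr = rescale_pruned(W, Wp, X)

err = lambda Wh: ((X @ W.T - X @ Wh.T) ** 2).mean()
# s = 1 is always feasible, so the rescaled error can never be worse:
assert err(Wr) <= err(Wp) + 1e-9
```

<p>A nice property of this family of fixes: because the identity rescaling is always a feasible choice, the reconstruction provably cannot hurt the calibration objective.</p>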

<div style="text-align:center; margin-bottom: 20px;">
  <img src="http://www.pokutta.com/blog/assets/agentic-researcher/scaling_model_size.svg" alt="Relative pruning reconstruction improvement across model sizes" style="width:49%;" />
  <img src="http://www.pokutta.com/blog/assets/agentic-researcher/scaling_absolute_ppl.svg" alt="Absolute perplexity comparison for pruning reconstruction" style="width:49%;" />
  <p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 5.</strong> [sic] The pruning reconstruction idea transfers across model scales. Left: relative improvement versus model size. Right: absolute perplexity comparison against the baseline.</p>

<h3 id="3-frank-wolfe-lower-bounds-on-uniformly-convex-sets">3. Frank-Wolfe lower bounds on uniformly convex sets</h3>

<p>The third case study comes from optimization, more precisely (as you might have guessed) from the Frank-Wolfe and conditional gradients domain. The problem <del>is</del> was an open lower-bound question for vanilla Frank-Wolfe on uniformly convex sets. For $p$-uniformly convex sets, an upper bound of $\mathcal{O}(1/T^{p/(p-1)})$ was known due to [KDP21], but no matching lower bound was available in the relevant regime; this became particularly striking after the recent lower bounds for strongly convex sets (see [HDZRSP26] and [GL26]).</p>

<p>The agent first tried to generalize an existing high-dimensional lower-bound construction and failed. Importantly, this failure was documented rather than hidden. It then pivoted to a more direct dynamical analysis of the Frank-Wolfe iterates on $\ell_p$-balls, used numerical exploration to identify the right pattern, and eventually assembled a proof showing a lower bound of</p>

\[\Omega(1/T^{p/(p-1)})\]

<p>for $p \geq 3$, matching the known upper bound in that regime. This particular application touched on each step of the research loop for mathematics:</p>

<ol>
  <li>try a proof strategy</li>
  <li>fail</li>
  <li>document the obstruction</li>
  <li>switch to computation-guided exploration</li>
  <li>identify structure</li>
  <li>derive the proof</li>
  <li>verify numerically along the way</li>
  <li>verify once more symbolically</li>
</ol>
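<p>Step 4, computation-guided exploration, is easy to reproduce in spirit. Here is a minimal sketch (illustrative, not the paper's construction): vanilla Frank-Wolfe on an $\ell_3$-ball with the closed-form linear minimization oracle, tracking the objective so the decay rate can be read off a log-log plot, which is exactly the kind of numerics used to identify the right pattern.</p>

```python
import numpy as np

def lmo_lp_ball(g, p=3.0):
    """argmin over the unit l_p-ball of <g, v>, via the Hölder/dual-norm formula."""
    q = p / (p - 1.0)
    a = np.abs(g) ** (q - 1.0)
    denom = np.sum(np.abs(g) ** q) ** ((q - 1.0) / q) + 1e-15
    return -np.sign(g) * a / denom

def frank_wolfe(x_star, p=3.0, T=500, d=5):
    """Vanilla FW with step size 2/(k+2) on f(x) = 0.5 ||x - x_star||^2."""
    x = np.zeros(d)
    x[1] = 1.0                                # feasible start on the ball
    f = lambda z: 0.5 * np.sum((z - x_star) ** 2)
    values = [f(x)]
    for k in range(T):
        g = x - x_star                        # gradient of f
        v = lmo_lp_ball(g, p)
        x = x + 2.0 / (k + 2.0) * (v - x)     # convex step keeps x feasible
        values.append(f(x))
    return x, values

x_star = np.zeros(5)
x_star[0] = 2.0                               # optimum sits on the boundary
x, values = frank_wolfe(x_star)
```

<p>Plotting <code class="language-plaintext highlighter-rouge">values</code> minus the optimal value on log-log axes and fitting the slope is then a one-liner, and that slope is precisely the empirical rate the analysis has to explain.</p>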

<div class="center">
  <img src="http://www.pokutta.com/blog/assets/agentic-researcher/explicit-slow-init-convergence.svg" alt="Frank-Wolfe convergence plot on uniformly convex sets" style="width:99%" />
  <p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 6.</strong> [sic] Log-log convergence for the Frank-Wolfe lower-bound case study. The empirical behavior matches the rate suggested by the analysis.</p>

<p><strong>There are three more case studies.</strong></p>

<p>I am not going through the remaining three in detail here, but they are also worth reading:</p>

<ul>
  <li><a href="https://arxiv.org/abs/2210.17323">GPTQ</a> column ordering in LLM quantization</li>
  <li>multi-variable dual tightening for Boscia and mixed-integer convex optimization</li>
  <li>extremal search for maximal real solutions in $K_7$ power networks</li>
</ul>

<p>They essentially make the same basic point: the agent is most useful when it can combine implementation, experimentation, proof attempts, verification, and structured reporting inside a single inspectable loop.</p>

<h2 id="limitations-what-this-does-not-solve">Limitations: What this does not solve</h2>

<p><strong>Verification remains the central problem.</strong> Natural-language proofs still require human inspection; we think of this more as a feature than a bug. Code is easier to check (syntax, types, compile checks, unit tests, integration tests, etc.), but subtle bugs remain dangerous. Citations remain a known weak point; tools such as OpenAlex can help a lot but are not a silver bullet. There are formal verification tools such as Lean 4 available, but in our experience this often does not scale (yet?!) to actual research workflows; in particular when paradigm changes are present (e.g., numerics vs. symbolics vs. computational exploration vs. writing a SAT program as proof). So yes, the researcher still owns final verification (and responsibility).</p>

<p><strong>Novelty is not automatically solved.</strong> An agent can search the literature and reduce the burden, but it cannot guarantee novelty. This is especially important now that many groups are exploring similar design spaces with similar tools. The researcher still has to do the serious prior-art work.</p>

<p><strong>Context remains fragile.</strong> Long sessions eventually hit context-window limits. This is why the persistent external memory in <code class="language-plaintext highlighter-rouge">report.tex</code> and <code class="language-plaintext highlighter-rouge">TODO.md</code> matters so much. Without it, the system will forget, repeat itself, or silently lose important information. Nonetheless, this does not fix all cases of context churn. Also, simply having more context, like Claude Opus 4.6’s 1M-token window, is not an immediate fix either: the context needs to be actively managed with <em>meaningful</em> information, and a larger context window also increases the risk of context pollution. As with (human) researchers, focus and concentration are key.</p>

<p><strong>Cost is real, but often not the main issue.</strong> Long frontier-model sessions are not free. But in many Level 4 setups, much of the wall-clock time is actually spent waiting for code, experiments, or training runs, not generating tokens. So the real bottleneck is often compute time and workflow quality, not just API cost.</p>

<h2 id="a-couple-of-final-thoughts">A couple of final thoughts</h2>

<p>My overall view is quite positive. Used in the right way, these systems are already good enough to impact how research is done, potentially opening up new avenues. They are particularly strong on the wide, messy, implementation-heavy part of research: exploring various proof directions, setting up experiments, checking boundary cases, writing and revising code, keeping notes, running ablations, and simply pushing several lines of attack in parallel for much longer than a human would do manually. This is not the same as “fully autonomous science”, but it is absolutely real leverage.</p>

<p>At the same time, I do not believe in unbounded speedups either. Research is not one isolated task; it is a workflow. And workflows obey Amdahl’s law [AL26]. Even if the agent becomes extremely fast at some parts of the loop, the remaining serial parts still cap the total gain: choosing the problem, deciding what actually matters, judging novelty, interpreting ambiguous evidence, and performing final verification. These pieces do not go away just because code generation or argument generation got faster. So no, I do not expect 100x researcher productivity. In many realistic settings the true ceiling is much lower.</p>
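<p>The arithmetic here is sobering. Amdahl's law bounds the overall speedup by the serial fraction of the workflow; a quick back-of-the-envelope calculation (the fractions below are made up purely for illustration):</p>

```python
def amdahl_speedup(accelerated_fraction, tool_speedup):
    """Overall workflow speedup when only a fraction of the work
    benefits from a tool that is `tool_speedup` times faster (Amdahl's law)."""
    serial = 1.0 - accelerated_fraction
    return 1.0 / (serial + accelerated_fraction / tool_speedup)

# Even if the agent makes 60% of the workflow 100x faster, the
# remaining 40% (problem choice, judgment, final verification)
# caps the total gain below 2.5x:
print(round(amdahl_speedup(0.6, 100.0), 2))  # → 2.46
```

<p>No matter how large <code class="language-plaintext highlighter-rouge">tool_speedup</code> gets, the bound converges to one over the serial fraction.</p>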

<p>Then there is the rework problem, which in practice is at least as important. Every wrong derivation, flaky script, plausible-but-false citation, or misleading empirical win creates a tax that has to be paid later by the researcher. Local speed does not automatically turn into global speed. If the system saves you two hours and then costs you ninety minutes of checking, cleanup, and reinterpretation, the headline gain is mostly fiction. This is one reason why disciplined workflows matter so much: they reduce the rework tax before it compounds.</p>

<p>There is also a more subtle downside risk. If you use the agent to outsource the core thinking, you may get more text, more code, and less understanding. In that sense there is indeed a “thinking tax” if the tool is used lazily [TT26]. But if you use it to harden your own reasoning, by searching for counterexamples, stress-testing claims, implementing verification scripts, and documenting failed attempts, then the system becomes genuinely valuable. That distinction is crucial.</p>

<p>So my expectation is neither “AI changes nothing” nor “AI gives us fully autonomous science”. The realistic middle ground, however, is already quite useful: better exploration, broader search, faster implementation, more systematic verification, and the ability to sustain open-ended research loops for longer. Even a fairly mundane 2x on the right parts of research would be enormous over the course of a few years. But to get there, we have to be honest about the downside risks as well: overtrust, cognitive atrophy, workflow bloat, and rework. The researcher remains the one who has to decide what is important and what is worth believing.</p>

<h2 id="references">References</h2>

<p>(for a more complete list, see our paper [ZPRP26] and the references contained therein)</p>

<p><strong>Main paper.</strong></p>

<p>[ZPRP26] Zimmer, M., Pelleriti, N., Roux, C., Pokutta, S.: <a href="https://arxiv.org/abs/2603.15914">The Agentic Researcher: A Practical Guide to AI-Assisted Research in Mathematics and Machine Learning</a> (2026)</p>

<p><strong>AI for mathematics and mathematical discovery.</strong></p>

<p>[AG] Trinh, T. et al.: <a href="https://www.nature.com/articles/s41586-023-06747-5">Solving olympiad geometry without human demonstrations</a> (2024)</p>

<p>[AP] Hubert, T. et al.: <a href="https://www.nature.com/articles/s41586-025-09833-y">Olympiad-level formal mathematical reasoning with reinforcement learning</a> (2025)</p>

<p>[AG2] Chervonyi, Y. et al.: <a href="https://arxiv.org/abs/2502.03544">Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2</a> (2025)</p>

<p>[ARI] Achim, T. et al.: <a href="https://arxiv.org/abs/2510.01346v2">Aristotle: IMO-level Automated Theorem Proving</a> (2025)</p>

<p>[AE] Novikov, A. et al.: <a href="https://arxiv.org/abs/2506.13131v1">AlphaEvolve: A coding agent for scientific and algorithmic discovery</a> (2025)</p>

<p>[MEDS] Georgiev, A., Gómez-Serrano, J., Tao, T., Wagner, S.: <a href="https://arxiv.org/abs/2511.02864v3">Mathematical exploration and discovery at scale</a> (2025)</p>

<p>[ALET] Feng, Y. et al.: <a href="https://arxiv.org/abs/2602.10177">Towards Autonomous Mathematics Research</a> (2026)</p>

<p>[ALET2] Feng, Y. et al.: <a href="https://arxiv.org/abs/2602.21201">Aletheia tackles FirstProof autonomously</a> (2026)</p>

<p>[FP] Abouzaid, M. et al.: <a href="https://arxiv.org/abs/2602.05192">First Proof</a> (2026)</p>

<p>[FM] Glazer, D. et al.: <a href="https://arxiv.org/abs/2411.04872">FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI</a> (2024)</p>

<p>[RM] Zhang, Y. et al.: <a href="https://arxiv.org/abs/2505.12575">RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics</a> (2025)</p>

<p><strong>AI-assisted mathematical research and human-AI collaboration.</strong></p>

<p>[AV] Avigad, J.: <a href="https://arxiv.org/abs/2603.03684">Mathematicians in the age of AI</a> (2026)</p>

<p>[H] Henkel, C.: <a href="https://arxiv.org/abs/2508.20236">The Mathematician’s Assistant: Integrating AI into Research Practice</a> (2025)</p>

<p>[D] Dobriban, E.: <a href="https://arxiv.org/abs/2511.18828">Solving a Research Problem in Mathematical Statistics with AI Assistance</a> (2025)</p>

<p>[LCP] Liu, Z. et al.: <a href="https://arxiv.org/abs/2510.26380">AI Mathematician as a Partner in Advancing Mathematical Discovery – A Case Study in Homogenization Theory</a> (2025)</p>

<p>[C] Carbone, A.: <a href="https://arxiv.org/abs/2511.07420">Advancing mathematics research with generative AI</a> (2025)</p>

<p>[HP] Haase, J., Pokutta, S.: <a href="https://arxiv.org/abs/2411.12527">Human-AI Co-Creativity: Exploring Synergies Across Levels of Creative Collaboration</a> (2024)</p>

<p><strong>Optimization and conditional gradients.</strong></p>

<p>[KDP21] Kerdreux, T., d’Aspremont, A., Pokutta, S.: <a href="https://proceedings.mlr.press/v130/kerdreux21a.html">Projection-Free Optimization on Uniformly Convex Sets</a>. <em>Proceedings of AISTATS</em> (2021)</p>

<p>[HDZRSP26] Halbey, J., Deza, D., Zimmer, M., Roux, C., Stellato, B., Pokutta, S.: <a href="https://arxiv.org/abs/2602.04378">Lower Bounds for Frank-Wolfe on Strongly Convex Sets</a>. <em>arXiv preprint arXiv:2602.04378</em> (2026)</p>

<p>[GL26] Grimmer, B., Liu, N.: <a href="https://arxiv.org/abs/2602.22608">Lower Bounds for Linear Minimization Oracle Methods Optimizing over Strongly Convex Sets</a>. <em>arXiv preprint arXiv:2602.22608</em> (2026)</p>

<p><strong>Agentic workflows and evaluation.</strong></p>

<p>[AIS] Lu, C. et al.: <a href="https://arxiv.org/abs/2408.06292">The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery</a> (2024)</p>

<p>[SAB] Chen, X. et al.: <a href="https://arxiv.org/abs/2410.05080">ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery</a> (2024)</p>

<p>[AR] Karpathy, A.: <a href="https://github.com/karpathy/autoresearch">autoresearch</a></p>

<p>[FS] Romera-Paredes, B. et al.: <a href="https://www.nature.com/articles/s41586-023-06924-6">Mathematical discoveries from program search with large language models</a> (2023)</p>

<p><strong>Philosophy of science.</strong></p>

<p>[POP] Stanford Encyclopedia of Philosophy: <a href="https://plato.stanford.edu/entries/popper/">Karl Popper</a></p>

<p>[FAL] Encyclopaedia Britannica: <a href="https://www.britannica.com/topic/philosophy-of-science/Eliminativism-and-falsification">Philosophy of science - Eliminativism, Falsification, Theory</a></p>

<p><strong>AI productivity and cognition.</strong></p>

<p>[AL26] just a tourist: <a href="https://just-a-tourist.bearblog.dev/amdahl-law-ai-speedup/">The 20x Ceiling: Amdahl’s Law and the Limits of AI Speedup</a> (2026)</p>

<p>[TT26] just a tourist: <a href="https://just-a-tourist.bearblog.dev/thinking-tax/">The Thinking Tax: When AI Tools Cost More Than They Save</a> (2026)</p>

<p><strong>Tools.</strong></p>

<p>[CC] Anthropic: <a href="https://code.claude.com/docs/en/overview">Claude Code overview</a></p>

<p>[CX] OpenAI: <a href="https://openai.com/codex/">Codex</a></p>

<p>[OC] Anomaly: <a href="https://github.com/anomalyco/opencode">OpenCode / The open source coding agent</a></p>

<p>[GC] Google: <a href="https://geminicli.com/">Gemini CLI</a></p>

<h2 id="footnotes">Footnotes</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p>This is purely for internal amusement: yes I did use the word. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Sebastian Pokutta</name></author><category term="research" /><category term="ai" /><category term="agents" /><category term="mathematics" /><category term="machine-learning" /><category term="workflow" /><summary type="html"><![CDATA[TL;DR: This is a summary of our recent paper The Agentic Researcher: A Practical Guide to AI-Assisted Research in Mathematics and Machine Learning by Max Zimmer, Nico Pelleriti, Christophe Roux, and Sebastian Pokutta augmented with some personal perspectives and thoughts. The main point is not that AI can now “do research” in some vague science-fiction sense. The more useful point is that, with the right workflow, general CLI coding agents can already act like research associates: they can write proofs, formulate conjectures, implement ideas, run experiments, document failures, verify intermediate claims, and keep going for hours, while the researcher remains responsible for idea generation, creativity, direction, judgment, and final verification.]]></summary></entry><entry><title type="html">The Box You Can’t Open Twice: Newcomb’s Paradox Through Four Mathematical Lenses</title><link href="http://www.pokutta.com/blog/newcomb-four-lenses/" rel="alternate" type="text/html" title="The Box You Can’t Open Twice: Newcomb’s Paradox Through Four Mathematical Lenses" /><published>2026-03-10T00:00:00+01:00</published><updated>2026-03-10T00:00:00+01:00</updated><id>http://www.pokutta.com/blog/newcomb-four-lenses</id><content type="html" xml:base="http://www.pokutta.com/blog/newcomb-four-lenses/"><![CDATA[<p><em>TL;DR: Newcomb’s paradox — should you take one box or two? — splits rational decision-makers almost evenly. There are four natural mathematical frameworks (causal inference, algorithmic self-reference, statistical counterfactuals, and online learning) that give different answers, and the disagreement reveals deep structural tensions in what it means to choose rationally.</em></p>

<!--more-->

<p>In 1960, physicist William Newcomb invented a thought experiment so divisive that when Robert Nozick published it in 1969, he noted that “to almost everyone, it is perfectly clear and obvious what should be done. The difficulty is that these people seem to divide almost evenly on the problem, with large numbers thinking that the opposing half is just being silly.”</p>

<p>Sixty-six years later, they still do. The 2020 PhilPapers survey of ~2,000 professional philosophers found 39% favor two-boxing and 31% favor one-boxing, with the rest undecided. Among decision theory specialists, the lean is stronger: 61% two-box, 26% one-box. The general public and the AI alignment community tilt the other way.</p>

<p>The puzzle is simple. You face two boxes. Box A is transparent and holds \$1,000. Box B is opaque and holds either \$1,000,000 or nothing. A highly reliable predictor, right about 99% of the time, has already examined you and made a prediction. If it predicted you’d take only Box B, it put the million in. If it predicted you’d take both, it left Box B empty. The predictor is done. The money is placed. You choose.</p>

<p>The paradox is that two ironclad principles of rational choice give opposite answers. And the real interest isn’t in which answer is right. It’s in <em>why</em> we can’t tell.</p>

<style>
.ncb-widget{margin:2em 0;padding:1.5em;border:1px solid #e5e7eb;border-radius:12px;background:#f9fafb}
.ncb-controls{display:flex;flex-wrap:wrap;gap:16px;margin-top:1em;align-items:flex-end}
.ncb-control{display:flex;flex-direction:column;gap:4px}
.ncb-control label{font-size:.85em;color:#4b5563;font-weight:500}
.ncb-control input[type="range"]{width:180px}
.ncb-stats{display:flex;flex-wrap:wrap;gap:10px;margin-top:.75em;font-size:.9em}
.ncb-stat{padding:5px 12px;background:#fff;border:1px solid #e5e7eb;border-radius:8px}
.ncb-val{font-weight:600;color:#111827}
.ncb-btn{padding:6px 16px;border:1px solid #d1d5db;border-radius:6px;background:#fff;cursor:pointer;font-size:.85em;font-family:inherit}
.ncb-btn:hover{background:#f3f4f6}
.ncb-readout{margin-top:.5em;padding:10px 14px;background:#fff;border:1px solid #e5e7eb;border-radius:8px;font-size:.9em;line-height:1.6}
.ncb-chart-wrap{position:relative;width:100%}
.ncb-evo-grid{display:flex;flex-wrap:wrap;gap:16px}
.ncb-evo-grid>div{flex:1;min-width:260px}
@media(max-width:640px){.ncb-control input[type="range"]{width:140px}.ncb-controls{gap:12px}}
.callout{border-left:4px solid #2563eb;border-right:4px solid #2563eb;background:#f5f7ff;padding:12px 16px;margin:1em 0}
.callout-anecdote{border-left:4px solid #ea580c;border-right:4px solid #ea580c;background:#fff7ed;padding:12px 16px;margin:1em 0}
</style>

<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>

<p>The payoff matrix is simple:</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Box B full (\$1,000,000)</th>
      <th>Box B empty (\$0)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>One-box</strong> (B only)</td>
      <td>\$1,000,000</td>
      <td>\$0</td>
    </tr>
    <tr>
      <td><strong>Two-box</strong> (A + B)</td>
      <td>\$1,001,000</td>
      <td>\$1,000</td>
    </tr>
  </tbody>
</table>

<p>Two-boxing strictly dominates: it pays \$1,000 more in every cell. But the predictor ties the column you’re in to the row you choose. Write $M = 1{,}000{,}000$ for the big prize, $K = 1{,}000$ for the small one, and $q = 0.99$ for the predictor’s accuracy. Four frameworks handle this coupling differently:</p>

<p><strong>EDT</strong> (Evidential Decision Theory) conditions on the action:</p>

\[E_{\text{EDT}}[a] = P(\text{B full} \mid a) \cdot (M + K \cdot \mathbf{1}_{a = \text{two-box}}) + P(\text{B empty} \mid a) \cdot K \cdot \mathbf{1}_{a = \text{two-box}}\]

<p>With our numbers:</p>

\[E_{\text{EDT}}[\text{one-box}] = 0.99 \times 1{,}000{,}000 = 990{,}000\]

\[E_{\text{EDT}}[\text{two-box}] = 0.01 \times 1{,}001{,}000 + 0.99 \times 1{,}000 = 11{,}000\]

<p>EDT one-boxes. Your action is evidence about which column you’re in.</p>

<p><strong>CDT</strong> (Causal Decision Theory) intervenes on the action:</p>

\[E_{\text{CDT}}[a] = P(B\text{ full}) \cdot V(a, \text{full}) + P(B\text{ empty}) \cdot V(a, \text{empty})\]

<p>Since $V(\text{two-box}, s) = V(\text{one-box}, s) + K$ for every box state $s$:</p>

\[E_{\text{CDT}}[\text{two-box}] &gt; E_{\text{CDT}}[\text{one-box}]\]

<p>for every prior over the box state. CDT two-boxes. Your action can’t <em>cause</em> the column to change.</p>

<p><strong>FDT</strong> (Functional Decision Theory) optimizes over algorithm outputs:</p>

\[a^* = \arg\max_{a} \; U(\text{world where all instances of my algorithm output } a)\]

<p>If your algorithm outputs “one-box”, the predictor fills B: payoff $M = 1{,}000{,}000$.
If your algorithm outputs “two-box”, the predictor empties B: payoff $K = 1{,}000$.
FDT one-boxes. Your algorithm <em>is</em> the column selector.</p>

<p><strong>Regret-based</strong> (online learning) asks what kind of adversary the predictor is:</p>

<ul>
  <li>If <em>oblivious</em> (box contents fixed before you act): swap regret says two-box.</li>
  <li>If <em>adaptive</em> (box contents respond to your policy): policy regret says one-box.</li>
</ul>

<p>The answer depends on how you classify the predictor.</p>
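<p>As a sanity check, the EDT and CDT computations above fit in a few lines. This is a minimal sketch using the numbers $M$, $K$, $q$ from the text; the function and variable names are ours:</p>

```python
# EDT vs. CDT expected payoffs for Newcomb's problem (numbers from the text).
M, K, q = 1_000_000, 1_000, 0.99

# EDT conditions on the action: the action is evidence about the box state.
edt_one = q * M                       # predictor right -> B full, take B only
edt_two = (1 - q) * (M + K) + q * K   # predictor wrong -> B full anyway; else just A

# CDT averages over a fixed prior on the box state; two-boxing adds K in every state.
def cdt(p_full, two_box):
    return p_full * M + (K if two_box else 0)

print(edt_one, edt_two)               # ~990,000 vs ~11,000
for p_full in (0.0, 0.25, 0.5, 1.0):  # dominance holds for any prior over the box state
    assert cdt(p_full, True) == cdt(p_full, False) + K
```

<p>At $q = 0.99$ this reproduces the 90 : 1 ratio between the two strategies.</p>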

<p>We now examine the problem through each of these four lenses in turn.</p>

<h2 id="lens-1-observation-vs-intervention">Lens 1: Observation vs. Intervention</h2>

<p>The cleanest formalization comes from Judea Pearl’s causal framework. Pearl distinguishes between <em>observing</em> that something is the case and <em>intervening</em> to make it the case. In his notation, $P(Y \mid X = x)$ is the probability of $Y$ given that we <em>observe</em> $X = x$. But $P(Y \mid do(X = x))$ is the probability of $Y$ given that we <em>set</em> $X$ to $x$, surgically, from outside the system. These are not the same thing.</p>

<p>In Newcomb’s problem, the underlying causal graph looks like this:</p>

<div style="text-align:center;margin:1.5em 0;">
<svg viewBox="0 0 540 170" style="max-width:540px;width:100%;height:auto;" xmlns="http://www.w3.org/2000/svg">
<defs><marker id="ncb-arr" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="7" markerHeight="7" orient="auto"><path d="M0 0L10 5L0 10z" fill="#4b5563" /></marker></defs>
<rect x="160" y="8" width="220" height="38" rx="10" fill="#f3f4f6" stroke="#9ca3af" stroke-width="1.5" />
<text x="270" y="33" text-anchor="middle" font-size="15" fill="#1f2937" font-family="inherit">Your disposition (θ)</text>
<line x1="215" y1="46" x2="105" y2="108" stroke="#4b5563" stroke-width="1.5" marker-end="url(#ncb-arr)" />
<line x1="325" y1="46" x2="355" y2="108" stroke="#4b5563" stroke-width="1.5" marker-end="url(#ncb-arr)" />
<rect x="10" y="112" width="180" height="38" rx="10" fill="#dbeafe" stroke="#93c5fd" stroke-width="1.5" />
<text x="100" y="137" text-anchor="middle" font-size="15" fill="#1e40af" font-family="inherit">Your action</text>
<rect x="250" y="112" width="180" height="38" rx="10" fill="#fef3c7" stroke="#fcd34d" stroke-width="1.5" />
<text x="340" y="137" text-anchor="middle" font-size="14" fill="#92400e" font-family="inherit">Predictor's prediction</text>
<line x1="430" y1="131" x2="450" y2="131" stroke="#4b5563" stroke-width="1.5" marker-end="url(#ncb-arr)" />
<rect x="455" y="112" width="80" height="38" rx="10" fill="#fecaca" stroke="#f87171" stroke-width="1.5" />
<text x="495" y="130" text-anchor="middle" font-size="12" fill="#991b1b" font-family="inherit">Box</text>
<text x="495" y="145" text-anchor="middle" font-size="12" fill="#991b1b" font-family="inherit">contents</text>
</svg>
</div>

<p>Your disposition $\theta$, the kind of decision-maker you are, is a common cause of both your action and the prediction. This is a classic confounding structure. And the entire paradox lives in one inequality:</p>

\[P(\text{B full} \mid \text{one-box}) \neq P(\text{B full} \mid do(\text{one-box}))\]

<p>The left side is observational. If we <em>observe</em> you one-boxing, that’s strong evidence you’re the kind of person the predictor expected to one-box, so Box B is probably full. This gives:</p>

\[E[U \mid a = 1] = 0.99 \times 1{,}000{,}000 = 990{,}000\]

<p>The right side is interventional. If we <em>intervene</em> on your action, surgically setting it to one-box regardless of your disposition, the arrow from $\theta$ to your action is severed. The intervention carries no information about the prediction. Whatever is in Box B is already there. So taking both boxes gets you an extra \$1,000 regardless:</p>

\[U(a = 2, p) &gt; U(a = 1, p) \quad \text{for every fixed } p\]

<p>Evidential decision theory (EDT) uses the left side and says: one-box. Causal decision theory (CDT) uses the right side and says: two-box. Pearl himself sided firmly with the interventionists, arguing that treating actions as mere observations, as EDT does, is precisely what generates the paradox. “The confusion between actions and acts,” he wrote in <em>Causality</em>, “has led to Newcomb’s paradox and other oddities.”</p>
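<p>The gap between the two conditionals is easy to see in simulation. The following sketch uses an illustrative generative model that is our own assumption (a uniform binary disposition driving both the action and the prediction), not part of the original problem statement:</p>

```python
import random

# Monte Carlo illustration of observing vs. intervening in the confounded graph.
random.seed(0)
q, N = 0.99, 200_000

def trial(intervene=None):
    disposition = random.random() < 0.5   # True = one-boxer type (common cause)
    prediction = disposition if random.random() < q else not disposition
    box_full = prediction                 # B is filled iff "one-box" was predicted
    action = disposition if intervene is None else intervene
    return action, box_full

# Observational: among agents SEEN to one-box, how often is B full?
obs = [trial() for _ in range(N)]
p_obs = sum(full for act, full in obs if act) / sum(act for act, _ in obs)

# Interventional: SET the action to one-box, cutting the disposition -> action arrow.
# The box contents now depend only on the disposition, not on the (forced) action.
do = [trial(intervene=True) for _ in range(N)]
p_do = sum(full for _, full in do) / N

print(round(p_obs, 2), round(p_do, 2))  # conditioning gives ~q, intervening gives ~0.5
```

<p>Conditioning recovers roughly $q = 0.99$, while intervening recovers the prior $0.5$: exactly the gap between $P(\text{B full} \mid \text{one-box})$ and $P(\text{B full} \mid do(\text{one-box}))$.</p>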

<div class="callout-anecdote">
<strong>Anecdote.</strong> This confusion is not limited to philosophy. I once had a project with a company that wanted to optimize their workforce scheduling. They had years of data: staffing levels, demand, productivity. When they ran their optimizer against this data, it concluded that the existing schedule was already near-optimal. Problem solved? <br /> <br />

Not at all. The data had been <em>generated under the old schedule</em>. Demand patterns, no-show rates, even which shifts workers preferred were all shaped by the very scheduling policy the company wanted to replace. The optimizer was computing $P(\text{outcome} \mid \text{old schedule observed})$ when what it needed was $P(\text{outcome} \mid do(\text{new schedule}))$. The confounding structure is identical to Newcomb's problem: the policy that generated the data is a common cause of both the "action" (schedule) and the "outcome" (observed performance), and naively conditioning on the data mistakes correlation for causation.
</div>

<p>But Pearl’s resolution has a cost. To see it, we plot the expected payoff as a function of the predictor’s accuracy $q$:</p>

<div class="ncb-widget">
<div class="ncb-chart-wrap" style="height:360px;">
<canvas id="ncb-payoff-chart"></canvas>
</div>
<div class="ncb-controls">
<div class="ncb-control">
<label>Highlight accuracy: <strong><span id="ncb-q1-val">0.99</span></strong></label>
<input type="range" id="ncb-q1" min="50" max="100" step="1" value="99" />
</div>
</div>
<div class="ncb-readout" id="ncb-payoff-readout">
At <em>q</em> = 0.99: One-box earns <strong><span>$</span>990,000</strong> · Two-box earns <strong><span>$</span>11,000</strong> · Ratio: <strong>90 : 1</strong>
</div>
</div>

<script>
(function(){
var M=1000000,K=1000;
function oneBox(q){return q*M}
function twoBox(q){return(1-q)*(M+K)+q*K}
function fmt(n){
  if(n>=1e6)return'<span>$</span>'+(n/1e6).toFixed(n%1e6===0?0:1)+'M';
  if(n>=1e3)return'<span>$</span>'+(n/1e3).toFixed(n%1e3===0?0:1)+'K';
  return'<span>$</span>'+n.toFixed(0);
}
function fmtD(n){return'<span>$</span>'+n.toLocaleString('en-US',{maximumFractionDigits:0})}
var pts=[];for(var i=0;i<=100;i++){var q=0.5+i*0.005;pts.push(q)}
var d1=pts.map(function(q){return{x:q,y:oneBox(q)/1000}});
var d2=pts.map(function(q){return{x:q,y:twoBox(q)/1000}});
var currentQ=0.99;
var chart=new Chart(document.getElementById('ncb-payoff-chart'),{
  type:'scatter',
  data:{datasets:[
    {label:'One-box',data:d1,showLine:true,borderColor:'#2563eb',backgroundColor:'rgba(37,99,235,0.08)',borderWidth:2.5,pointRadius:0,fill:true,tension:0},
    {label:'Two-box',data:d2,showLine:true,borderColor:'#dc2626',backgroundColor:'rgba(220,38,38,0.08)',borderWidth:2.5,pointRadius:0,fill:true,tension:0},
    {label:'Current',data:[{x:currentQ,y:oneBox(currentQ)/1000},{x:currentQ,y:twoBox(currentQ)/1000}],
     pointRadius:7,pointBackgroundColor:['#2563eb','#dc2626'],pointBorderColor:'#fff',pointBorderWidth:2,showLine:false}
  ]},
  options:{
    responsive:true,maintainAspectRatio:false,animation:false,
    scales:{
      x:{type:'linear',min:0.5,max:1.0,title:{display:true,text:'Predictor accuracy (q)',font:{size:13}},
         ticks:{callback:function(v){return v.toFixed(2)}}},
      y:{title:{display:true,text:'Expected payoff ($K)',font:{size:13}},min:0,max:1050}
    },
    plugins:{
      legend:{position:'top',labels:{usePointStyle:true,pointStyle:'line',font:{size:12},
        filter:function(item){return item.datasetIndex<2}}},
      tooltip:{enabled:false}
    }
  },
  plugins:[{
    id:'crossover',
    afterDraw:function(ch){
      var xA=ch.scales.x,yA=ch.scales.y,ctx=ch.ctx;
      var cx=xA.getPixelForValue(1001000/2000000);
      ctx.save();ctx.strokeStyle='#9ca3af';ctx.lineWidth=1;ctx.setLineDash([5,5]);
      ctx.beginPath();ctx.moveTo(cx,yA.top);ctx.lineTo(cx,yA.bottom);ctx.stroke();
      ctx.fillStyle='#6b7280';ctx.font='11px sans-serif';ctx.textAlign='left';
      ctx.fillText('q ≈ 0.5005',cx+4,yA.top+14);
      var qx=xA.getPixelForValue(currentQ);
      ctx.strokeStyle='#f59e0b';ctx.lineWidth=1.5;ctx.setLineDash([4,4]);
      ctx.beginPath();ctx.moveTo(qx,yA.top);ctx.lineTo(qx,yA.bottom);ctx.stroke();
      ctx.restore();
    }
  }]
});
function update(){
  var q=parseInt(document.getElementById('ncb-q1').value)/100;
  currentQ=q;
  document.getElementById('ncb-q1-val').textContent=q.toFixed(2);
  chart.data.datasets[2].data=[{x:q,y:oneBox(q)/1000},{x:q,y:twoBox(q)/1000}];
  chart.update('none');
  var o=oneBox(q),t=twoBox(q);
  var ratio=o>t?Math.round(o/t)+' : 1':'1 : '+Math.round(t/o);
  if(Math.abs(o-t)<100)ratio='≈ 1 : 1';
  document.getElementById('ncb-payoff-readout').innerHTML=
    'At <em>q</em> = '+q.toFixed(2)+': One-box earns <strong>'+fmtD(o)+'</strong> · Two-box earns <strong>'+fmtD(t)+'</strong> · Ratio: <strong>'+ratio+'</strong>';
}
document.getElementById('ncb-q1').addEventListener('input',update);
})();
</script>

<p>The crossover is at $q \approx 0.5005$, essentially a coin flip. For <em>any</em> predictor even marginally better than chance, one-boxing yields a higher expected payoff. At $q = 0.99$, one-boxers average \$990,000 while two-boxers average \$11,000. Being “causally rational” makes you poorer. That’s uncomfortable to say the least.</p>
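<p>The crossover value itself follows from equating the two expected payoffs, $qM = (1-q)(M+K) + qK$, and solving for $q$:</p>

\[2qM = M + K \;\Longrightarrow\; q^* = \frac{M+K}{2M} = \frac{1{,}001{,}000}{2{,}000{,}000} = 0.5005.\]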

<h2 id="lens-2-algorithmic-correlation-and-self-reference">Lens 2: Algorithmic Correlation and Self-Reference</h2>

<p>The second lens comes from an unexpected direction: the logic of self-reference.</p>

<p>If the predictor is accurate because it <em>modeled</em> your decision process, say by running a simulation, analyzing your algorithm, or examining a sufficiently detailed model of your brain, then there are two instantiations of your decision-making procedure: the one in your head and the one the predictor evaluated. They aren’t causally connected (the predictor is done), but they are <em>logically</em> connected: they produce the same output because they implement the same function.</p>

<p>This is the insight behind Functional Decision Theory (FDT), proposed by Eliezer Yudkowsky and Nate Soares in 2017. FDT says: don’t ask what your action <em>causes</em> (CDT) or what it <em>provides evidence for</em> (EDT). Ask what the best <em>output of your decision algorithm</em> would be, across all instances where that algorithm is evaluated.</p>

\[\text{FDT: } \arg\max_a \sum_{\text{instances } i} U_i(a)\]

<p>Your algorithm is evaluated in two places: your head (where the output determines your action) and the predictor’s model (where the output determined the prediction). If your algorithm outputs “one-box,” you get \$1,000,000. If it outputs “two-box,” you get \$1,000. FDT one-boxes.</p>

<p>This connects to the <strong>Self-Sampling Assumption (SSA)</strong> in anthropic reasoning. Bostrom’s SSA says you should reason as if you’re randomly selected from your reference class of observers. In Newcomb’s problem, the analogous move is reasoning as if you’re randomly selected from the set of all instantiations of your algorithm. The predictor’s model of you and the actual you are both running the same code: different instantiations, same function.</p>

<p>The structural parallel is straightforward:</p>

<table>
  <thead>
    <tr>
      <th>Anthropic reasoning</th>
      <th>Newcomb’s problem</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Which observer am I?</td>
      <td>Which instance of my algorithm is this?</td>
    </tr>
    <tr>
      <td>Reference class of observers</td>
      <td>Reference class of algorithm evaluations</td>
    </tr>
    <tr>
      <td>My observations are evidence about the world</td>
      <td>My decision is evidence about the prediction</td>
    </tr>
  </tbody>
</table>

<p>This framing dissolves the paradox, but at the price of accepting a notion of “logical causation” that most decision theorists find metaphysically suspect. The predictor’s model and your brain don’t share any causal arrows. The correlation comes from mathematical identity, not physical interaction. Whether that should count as a reason to act is itself an open question.</p>

<p>This can also be seen with a simple Monte Carlo-style simulation: we initialize a population of 500 agents, each carrying a single parameter, its one-box probability $p \in [0,1]$, drawn uniformly at random. Each generation proceeds in three steps. First, every agent plays 20 rounds of Newcomb’s game against a predictor with accuracy $q$; the agent one-boxes with probability $p$ in each round, and its fitness is its average payoff. Second, we form the next generation by sampling 500 agents with replacement, proportional to fitness (roulette-wheel selection). Third, each offspring’s $p$ is perturbed by Gaussian noise with standard deviation 0.03, clipped to $[0,1]$. This is a textbook evolutionary algorithm: selection amplifies high-payoff strategies, mutation explores nearby variants. The key question is whether the population converges to one-boxing or two-boxing, and how fast.</p>
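<p>For readers who prefer code, the loop described above fits in a short script. This is a minimal sketch, not the exact code driving the interactive widget below; the constants mirror the description (500 agents, 20 rounds per generation, mutation 0.03):</p>

```python
import random

# Evolutionary simulation of the Newcomb population described in the text.
M, K = 1_000_000, 1_000
POP, ROUNDS, GENS, MUT, q = 500, 20, 80, 0.03, 0.95

def fitness(p):
    """Average payoff over ROUNDS games for an agent that one-boxes w.p. p."""
    total = 0
    for _ in range(ROUNDS):
        one_box = random.random() < p
        predicted_one = one_box if random.random() < q else not one_box
        box_b = M if predicted_one else 0
        total += box_b if one_box else K + box_b
    return total / ROUNDS

random.seed(1)
pop = [random.random() for _ in range(POP)]   # one-box probabilities, uniform init
for gen in range(GENS):
    fit = [fitness(p) for p in pop]
    # Roulette-wheel selection proportional to fitness, then Gaussian mutation.
    parents = random.choices(pop, weights=fit, k=POP)
    pop = [min(1.0, max(0.0, p + random.gauss(0, MUT))) for p in parents]

print(sum(pop) / POP)   # average p drifts toward 1 when q is well above 0.5
```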

<div class="ncb-widget">
<div class="ncb-evo-grid">
<div>
<canvas id="ncb-evo-hist" style="width:100%;height:300px;display:block;"></canvas>
</div>
<div>
<div style="height:300px;">
<canvas id="ncb-evo-line"></canvas>
</div>
</div>
</div>
<div class="ncb-stats">
<div class="ncb-stat">Generation: <span class="ncb-val" id="ncb-evo-gen">0</span> / 80</div>
<div class="ncb-stat">Avg <em>p</em>: <span class="ncb-val" id="ncb-evo-avgp">0.500</span></div>
<div class="ncb-stat">Avg payoff: <span class="ncb-val" id="ncb-evo-pay">—</span></div>
</div>
<div class="ncb-controls">
<div class="ncb-control">
<label>Predictor accuracy: <strong><span id="ncb-evo-q-val">0.95</span></strong></label>
<input type="range" id="ncb-evo-q" min="51" max="99" step="1" value="95" />
</div>
<div class="ncb-control">
<label>Speed: <strong><span id="ncb-evo-speed-val">120</span></strong> ms/gen</label>
<input type="range" id="ncb-evo-speed" min="30" max="500" step="10" value="120" />
</div>
<div class="ncb-control" style="flex-direction:row;gap:8px;align-items:flex-end;">
<button class="ncb-btn" id="ncb-evo-play">▶ Play</button>
<button class="ncb-btn" id="ncb-evo-step">Step</button>
<button class="ncb-btn" id="ncb-evo-reset">Reset</button>
</div>
</div>
</div>

<script>
(function(){
var POP=500,ROUNDS=20,MAX_GEN=80,MUT=0.03,BINS=20,M=1000000,K=1000;
var pop,gen,hist,q,running,timer,speed;
var histCvs=document.getElementById('ncb-evo-hist');
var histCtx;

function setupCanvas(){
  var dpr=window.devicePixelRatio||1;
  var rect=histCvs.getBoundingClientRect();
  histCvs.width=rect.width*dpr;histCvs.height=rect.height*dpr;
  histCtx=histCvs.getContext('2d');histCtx.scale(dpr,dpr);
  histCvs._w=rect.width;histCvs._h=rect.height;
}

function gauss(){var u=0,v=0;while(!u)u=Math.random();while(!v)v=Math.random();return Math.sqrt(-2*Math.log(u))*Math.cos(2*Math.PI*v)}
function avg(a){var s=0;for(var i=0;i<a.length;i++)s+=a[i];return s/a.length}

function playN(ob,qq){
  var correct=Math.random()<qq;
  var predOB=correct?ob:!ob;
  var boxB=predOB?M:0;
  return ob?boxB:K+boxB;
}

function init(){
  q=parseInt(document.getElementById('ncb-evo-q').value)/100;
  speed=parseInt(document.getElementById('ncb-evo-speed').value);
  pop=new Array(POP);for(var i=0;i<POP;i++)pop[i]=Math.random();
  gen=0;hist=[{g:0,p:avg(pop)}];
  stop();
  setupCanvas();
  updateDisplay();
  resetLine();
}

function runGen(){
  if(gen>=MAX_GEN){stop();return}
  var fit=new Float64Array(POP);
  for(var i=0;i<POP;i++){var t=0;for(var r=0;r<ROUNDS;r++){var ob=Math.random()<pop[i];t+=playN(ob,q)}fit[i]=t/ROUNDS}
  var tF=0;for(var i=0;i<POP;i++)tF+=fit[i];
  var cum=new Float64Array(POP);cum[0]=fit[0]/tF;
  for(var i=1;i<POP;i++)cum[i]=cum[i-1]+fit[i]/tF;
  var np=new Array(POP);
  for(var i=0;i<POP;i++){
    var rv=Math.random(),lo=0,hi=POP-1;
    while(lo<hi){var mid=(lo+hi)>>1;if(cum[mid]<rv)lo=mid+1;else hi=mid}
    var c=pop[lo]+gauss()*MUT;np[i]=Math.max(0,Math.min(1,c));
  }
  pop=np;gen++;
  var ap=avg(pop);
  var avgPay=0;for(var i=0;i<POP;i++){var ob=Math.random()<pop[i];avgPay+=playN(ob,q)}avgPay/=POP;
  hist.push({g:gen,p:ap,pay:avgPay});
  updateDisplay();
}

function drawHist(){
  var w=histCvs._w,h=histCvs._h,ctx=histCtx;
  ctx.clearRect(0,0,w,h);
  var bins=new Array(BINS).fill(0);
  for(var i=0;i<POP;i++){var b=Math.floor(pop[i]*BINS);if(b>=BINS)b=BINS-1;bins[b]++}
  var mx=0;for(var i=0;i<BINS;i++)if(bins[i]>mx)mx=bins[i];
  if(mx<1)mx=1;
  var mg={t:28,r:16,b:38,l:40};
  var pw=w-mg.l-mg.r,ph=h-mg.t-mg.b,bw=pw/BINS;
  ctx.fillStyle='#374151';ctx.font='bold 12px sans-serif';ctx.textAlign='center';
  ctx.fillText('Population Distribution, Generation '+gen,w/2,16);
  ctx.strokeStyle='#e5e7eb';ctx.lineWidth=1;
  for(var i=1;i<=4;i++){var y=mg.t+ph-ph*i/4;ctx.beginPath();ctx.moveTo(mg.l,y);ctx.lineTo(mg.l+pw,y);ctx.stroke()}
  for(var i=0;i<BINS;i++){
    var bh=bins[i]/mx*ph,x=mg.l+i*bw,y=mg.t+ph-bh;
    var t=(i+.5)/BINS;
    var cr=Math.round(220*(1-t)+37*t),cg=Math.round(38*(1-t)+99*t),cb=Math.round(38*(1-t)+235*t);
    ctx.fillStyle='rgb('+cr+','+cg+','+cb+')';
    ctx.beginPath();
    var rad=Math.min(3,bh/2);
    if(bh>0){
      ctx.moveTo(x+1,mg.t+ph);ctx.lineTo(x+1,y+rad);
      ctx.quadraticCurveTo(x+1,y,x+1+rad,y);
      ctx.lineTo(x+bw-1-rad,y);
      ctx.quadraticCurveTo(x+bw-1,y,x+bw-1,y+rad);
      ctx.lineTo(x+bw-1,mg.t+ph);
    }
    ctx.fill();
  }
  ctx.strokeStyle='#9ca3af';ctx.lineWidth=1;
  ctx.beginPath();ctx.moveTo(mg.l,mg.t);ctx.lineTo(mg.l,mg.t+ph);ctx.lineTo(mg.l+pw,mg.t+ph);ctx.stroke();
  ctx.fillStyle='#6b7280';ctx.font='11px sans-serif';ctx.textAlign='center';
  for(var i=0;i<=4;i++){var v=(i*.25).toFixed(2);var x=mg.l+i*.25*pw;ctx.fillText(v,x,mg.t+ph+14)}
  ctx.fillText('One-box probability (p)',w/2,mg.t+ph+30);
  ctx.textAlign='right';
  for(var i=0;i<=4;i++){var v=Math.round(mx*i/4);var y=mg.t+ph-ph*i/4;ctx.fillText(v,mg.l-5,y+4)}
  var ap=avg(pop);
  ctx.strokeStyle='#f59e0b';ctx.lineWidth=2;ctx.setLineDash([4,4]);
  var ax=mg.l+ap*pw;
  ctx.beginPath();ctx.moveTo(ax,mg.t);ctx.lineTo(ax,mg.t+ph);ctx.stroke();ctx.setLineDash([]);
  ctx.fillStyle='#f59e0b';ctx.font='bold 11px sans-serif';ctx.textAlign=ap>0.5?'right':'left';
  ctx.fillText('avg='+ap.toFixed(3),ax+(ap>0.5?-4:4),mg.t+10);
}

var lineChart;
function resetLine(){
  if(lineChart)lineChart.destroy();
  lineChart=new Chart(document.getElementById('ncb-evo-line'),{
    type:'scatter',
    data:{datasets:[{data:[{x:0,y:hist[0].p}],showLine:true,borderColor:'#2563eb',borderWidth:2,pointRadius:0,tension:.3,fill:false}]},
    options:{responsive:true,maintainAspectRatio:false,animation:false,
      scales:{x:{type:'linear',min:0,max:MAX_GEN,title:{display:true,text:'Generation',font:{size:13}},ticks:{stepSize:10}},
              y:{min:0,max:1,title:{display:true,text:'Avg one-box probability',font:{size:13}}}},
      plugins:{legend:{display:false},tooltip:{enabled:false}}
    },
    plugins:[{id:'refline',afterDraw:function(ch){
      var yA=ch.scales.y,ctx=ch.ctx,py=yA.getPixelForValue(0.5);
      ctx.save();ctx.strokeStyle='#d1d5db';ctx.lineWidth=1;ctx.setLineDash([4,4]);
      ctx.beginPath();ctx.moveTo(ch.scales.x.left,py);ctx.lineTo(ch.scales.x.right,py);ctx.stroke();ctx.restore();
    }}]
  });
}

function updateDisplay(){
  drawHist();
  if(lineChart){
    lineChart.data.datasets[0].data=hist.map(function(h){return{x:h.g,y:h.p}});
    lineChart.update('none');
  }
  document.getElementById('ncb-evo-gen').textContent=gen;
  document.getElementById('ncb-evo-avgp').textContent=avg(pop).toFixed(3);
  var lastPay=hist.length>1?hist[hist.length-1].pay:null;
  document.getElementById('ncb-evo-pay').innerHTML=lastPay?'<span>$</span>'+(lastPay/1000).toFixed(0)+'K':'—';
}

function start(){
  if(running)return;running=true;
  document.getElementById('ncb-evo-play').textContent='⏸ Pause';
  timer=setInterval(function(){runGen();if(gen>=MAX_GEN)stop()},speed);
}
function stop(){
  running=false;document.getElementById('ncb-evo-play').textContent='▶ Play';
  if(timer){clearInterval(timer);timer=null}
}

document.getElementById('ncb-evo-play').addEventListener('click',function(){running?stop():start()});
document.getElementById('ncb-evo-step').addEventListener('click',function(){stop();runGen()});
document.getElementById('ncb-evo-reset').addEventListener('click',init);
document.getElementById('ncb-evo-q').addEventListener('input',function(){
  q=parseInt(this.value)/100;document.getElementById('ncb-evo-q-val').textContent=q.toFixed(2);
});
document.getElementById('ncb-evo-speed').addEventListener('input',function(){
  speed=parseInt(this.value);document.getElementById('ncb-evo-speed-val').textContent=speed;
  if(running){clearInterval(timer);timer=setInterval(function(){runGen();if(gen>=MAX_GEN)stop()},speed)}
});
window.addEventListener('resize',function(){if(!running){setupCanvas();drawHist()}});
init();
})();
</script>

<p>When the predictor is 70% accurate or better, populations evolve toward near-pure one-boxing within ~20 generations ($p &gt; 0.9$). Even at $q = 0.51$, barely better than a coin, there’s a slow drift upward. Evolution doesn’t care about causal philosophy; it follows the payoff gradient.</p>
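The dynamics behind the chart can be sketched in a few lines. The following stand-alone simulation is a simplified variant (truncation selection with uniform mutation; the widget's exact update rule may differ): each agent carries a one-box probability $p$, fitness is the expected payoff against a predictor that guesses the realized action correctly with probability $q$, and the fitter half of the population reproduces with small mutations.

```javascript
// Simplified evolutionary sketch (assumption: truncation selection plus
// small uniform mutation; not necessarily the widget's exact scheme).
var M = 1000000, K = 1000;

// Expected payoff of an agent that one-boxes with probability p, facing a
// predictor that guesses the realized action correctly with probability q.
function fitness(p, q) {
  var oneBox = q * M;                      // predicted right -> Box B filled
  var twoBox = (1 - q) * (M + K) + q * K;  // predicted right -> Box B empty
  return p * oneBox + (1 - p) * twoBox;
}

// Evolve a population of one-box probabilities; returns the final average.
function evolve(q, generations, popSize) {
  var pop = [];
  for (var i = 0; i < popSize; i++) pop.push(Math.random()); // uniform init
  for (var g = 0; g < generations; g++) {
    pop.sort(function (a, b) { return fitness(b, q) - fitness(a, q); });
    var parents = pop.slice(0, popSize / 2);   // fitter half reproduces
    pop = [];
    for (var j = 0; j < popSize; j++) {
      var child = parents[j % parents.length] + 0.02 * (Math.random() - 0.5);
      pop.push(Math.min(1, Math.max(0, child))); // clip to [0, 1]
    }
  }
  return pop.reduce(function (s, p) { return s + p; }, 0) / popSize;
}
```

With $q = 0.9$ the average one-box probability climbs close to 1 within a few dozen generations; near $q = 0.5$ the fitness gradient flattens and the drift slows, matching the behavior described above.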

<h2 id="lens-3-data-vs-counterfactuals">Lens 3: Data vs. Counterfactuals</h2>

<p>The third lens is the most empirical. Suppose you’re not a philosopher but a statistician. You observe 10,000 people play Newcomb’s game. The data is unambiguous: one-boxers average \$990,000. Two-boxers average \$11,000. The observational conditional $E[\text{payoff} \mid \text{action} = a]$, the average payoff grouped by observed action, overwhelmingly favors one-boxing.</p>

<p>The two-boxer’s defense is a counterfactual: “Those one-boxers <em>would have</em> gotten \$1,001,000 if they had two-boxed.” This claim may be true. But it is unobservable. You never see the same person both one-box and two-box under the same prediction. This is the <strong>fundamental problem of causal inference</strong>: the impossibility of observing both potential outcomes for the same unit. In fact, the structure is identical to treatment effect estimation, where the same gap between observed and counterfactual outcomes makes naively comparing treated and untreated groups unreliable:</p>

<table>
  <thead>
    <tr>
      <th>Causal inference</th>
      <th>Newcomb’s problem</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Treatment assignment</td>
      <td>Your action</td>
    </tr>
    <tr>
      <td>Outcome under treatment</td>
      <td>Payoff if one-box</td>
    </tr>
    <tr>
      <td>Outcome under control</td>
      <td>Payoff if two-box</td>
    </tr>
    <tr>
      <td>Confounder</td>
      <td>Disposition $\theta$</td>
    </tr>
    <tr>
      <td>Fundamental problem</td>
      <td>Same person can’t do both</td>
    </tr>
  </tbody>
</table>

<p>The expected payoffs can be computed in closed form. Plotting them as a function of the predictor’s accuracy makes the gap visible, and shows exactly where it hides:</p>
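For reference, here are those closed forms as plain functions (they mirror what the interactive chart below plots, with $M$ = \$1,000,000 and $K$ = \$1,000):

```javascript
var M = 1000000, K = 1000; // Box B prize and Box A prize

// Observed average payoff of one-boxers: Box B is filled with probability q.
function oneBoxObserved(q) { return q * M; }

// Observed average payoff of two-boxers: Box B is (mistakenly) filled with
// probability 1 - q.
function twoBoxObserved(q) { return (1 - q) * (M + K) + q * K; }

// Counterfactual: what one-boxers would have earned had they also grabbed
// Box A, holding the prediction fixed.
function oneBoxCounterfactual(q) { return q * M + K; }
```

At $q = 0.99$ these evaluate to \$990,000, \$11,000, and \$991,000 respectively, and the unobservable gap, oneBoxCounterfactual(q) minus oneBoxObserved(q), equals $K$ = \$1,000 for <em>every</em> $q$.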

<div class="ncb-widget">
<div class="ncb-chart-wrap" style="height:360px;">
<canvas id="ncb-gap-chart"></canvas>
</div>
<div class="ncb-controls">
<div class="ncb-control">
<label>Highlight accuracy: <strong><span id="ncb-q3-val">0.99</span></strong></label>
<input type="range" id="ncb-q3" min="95" max="99" step="1" value="99" />
</div>
</div>
<div class="ncb-readout" id="ncb-gap-readout"></div>
</div>

<script>
(function(){
var M=1000000,K=1000;
function oneObs(q){return q*M}
function twoObs(q){return(1-q)*(M+K)+q*K}
function counter(q){return q*M+K}
function fmtD(n){return'<span>$</span>'+n.toLocaleString('en-US',{maximumFractionDigits:0})}

var qs=[];for(var i=95;i<=100;i+=0.5)qs.push(i/100);
var d1=qs.map(function(q){return{x:q,y:oneObs(q)/1000}});
var d3=qs.map(function(q){return{x:q,y:counter(q)/1000}});
var d2=qs.map(function(q){return{x:q,y:twoObs(q)/1000}});
var currentQ=0.99;

var chart=new Chart(document.getElementById('ncb-gap-chart'),{
  type:'scatter',
  data:{datasets:[
    {label:'One-box (observed)',data:d1,yAxisID:'y',showLine:true,borderColor:'#2563eb',borderWidth:2.5,pointRadius:0,tension:0,order:2},
    {label:'One-box (counterfactual)',data:d3,yAxisID:'y',showLine:true,borderColor:'#16a34a',borderWidth:2,borderDash:[6,4],pointRadius:0,tension:0,
     fill:{target:0,above:'rgba(22,163,74,0.18)'},order:1},
    {label:'Two-box (observed)',data:d2,yAxisID:'y1',showLine:true,borderColor:'#dc2626',borderWidth:2.5,pointRadius:0,tension:0,order:3},
    {label:'_hl1',data:[{x:currentQ,y:oneObs(currentQ)/1000},{x:currentQ,y:counter(currentQ)/1000}],
     yAxisID:'y',pointRadius:7,pointBackgroundColor:['#2563eb','#16a34a'],pointBorderColor:'#fff',pointBorderWidth:2,showLine:false,order:0},
    {label:'_hl2',data:[{x:currentQ,y:twoObs(currentQ)/1000}],
     yAxisID:'y1',pointRadius:7,pointBackgroundColor:['#dc2626'],pointBorderColor:'#fff',pointBorderWidth:2,showLine:false,order:0}
  ]},
  options:{
    responsive:true,maintainAspectRatio:false,animation:false,
    scales:{
      x:{type:'linear',min:0.95,max:1.00,title:{display:true,text:'Predictor accuracy (q)',font:{size:13}},
         ticks:{callback:function(v){return v.toFixed(2)},stepSize:0.01}},
      y:{type:'linear',position:'left',min:945,max:1005,
         title:{display:true,text:'One-box payoff ($K)',font:{size:13},color:'#2563eb'},
         ticks:{color:'#2563eb',callback:function(v){return'$'+v+'K'}},
         grid:{drawOnChartArea:true}},
      y1:{type:'linear',position:'right',min:0,max:60,
          title:{display:true,text:'Two-box payoff ($K)',font:{size:13},color:'#dc2626'},
          ticks:{color:'#dc2626',callback:function(v){return'$'+v+'K'}},
          grid:{drawOnChartArea:false}}
    },
    plugins:{
      legend:{position:'top',labels:{usePointStyle:true,pointStyle:'line',font:{size:12},
        filter:function(item){return item.text.charAt(0)!=='_'}}},
      tooltip:{enabled:false},
      filler:{propagate:true}
    }
  },
  plugins:[{
    id:'gapLabel',
    afterDraw:function(ch){
      var xA=ch.scales.x,yA=ch.scales.y,ctx=ch.ctx;
      var px=xA.getPixelForValue(currentQ);
      ctx.save();ctx.strokeStyle='#f59e0b';ctx.lineWidth=1.5;ctx.setLineDash([4,4]);
      ctx.beginPath();ctx.moveTo(px,yA.top);ctx.lineTo(px,yA.bottom);ctx.stroke();
      var y1=yA.getPixelForValue(oneObs(currentQ)/1000);
      var y2=yA.getPixelForValue(counter(currentQ)/1000);
      if(Math.abs(y1-y2)>2){
        ctx.strokeStyle='#16a34a';ctx.lineWidth=2;ctx.setLineDash([]);
        ctx.beginPath();ctx.moveTo(px+12,y1);ctx.lineTo(px+12,y2);ctx.stroke();
        ctx.beginPath();ctx.moveTo(px+9,y1);ctx.lineTo(px+15,y1);ctx.stroke();
        ctx.beginPath();ctx.moveTo(px+9,y2);ctx.lineTo(px+15,y2);ctx.stroke();
        ctx.fillStyle='#16a34a';ctx.font='bold 11px sans-serif';ctx.textAlign='left';
        ctx.fillText('$1K gap',px+18,(y1+y2)/2+4);
      }
      ctx.restore();
    }
  }]
});

function update(){
  var q=parseInt(document.getElementById('ncb-q3').value)/100;
  currentQ=q;
  document.getElementById('ncb-q3-val').textContent=q.toFixed(2);
  chart.data.datasets[3].data=[{x:q,y:oneObs(q)/1000},{x:q,y:counter(q)/1000}];
  chart.data.datasets[4].data=[{x:q,y:twoObs(q)/1000}];
  chart.update('none');
  var gap=counter(q)-oneObs(q);
  document.getElementById('ncb-gap-readout').innerHTML=
    'At <em>q</em> = '+q.toFixed(2)+': One-boxers earn <strong>'+fmtD(oneObs(q))+
    '</strong> · Two-boxers earn <strong>'+fmtD(twoObs(q))+
    '</strong> · Counterfactual: <strong>'+fmtD(counter(q))+
    '</strong><br>The <span style="color:#16a34a;font-weight:600">green band</span> is the unobservable gap: <strong>'+fmtD(gap)+
    '</strong>. You\'d need to observe the same person doing both to see it.';
}
document.getElementById('ncb-q3').addEventListener('input',update);
update();
})();
</script>

<p>The top dashed line is the two-boxer’s counterfactual claim: one-boxers <em>would have</em> earned \$1,001,000 if they had grabbed both boxes with the same prediction. That line is real, and invisible. At $q = 0.99$, the gap between the observed one-box payoff (\$990,000) and the counterfactual (\$991,000) is exactly \$1,000. You’d need to observe the same person doing both to see it. You can’t.</p>

<p>If you’re a frequentist who follows the data, you one-box. If you’re a structural modeler who trusts your causal graph over the observed conditional, you two-box. Neither is obviously wrong. They’re optimizing different things: one follows the joint distribution the agent is <em>embedded</em> in, the other follows the causal structure the agent can <em>manipulate</em>.</p>

<div class="callout-anecdote">
<strong>Remark.</strong> This tension shows up everywhere, not just in philosophical musing. <br /><br />

A hospital observes that patients who receive a new drug have better outcomes, but sicker patients were more likely to be prescribed it; the raw conditional favors the drug, but the causal effect might be zero. <br /><br />

A company sees that employees who attend a leadership program get promoted faster, but the same ambition that drives attendance also drives promotion; conditioning on attendance overstates the program's value. <br /><br />

A spam filter flags emails based on features that spammers also select for; blocking those emails changes what spammers send next, invalidating the very distribution the filter was trained on. <br /><br />

In each case the question is the same: do you trust the pattern in the data, or the causal story behind it?
</div>

<h2 id="lens-4-oblivious-vs-adaptive-adversaries">Lens 4: Oblivious vs. Adaptive Adversaries</h2>

<p>The fourth lens comes from online learning and the theory of multi-armed bandits. In this setting, a learner repeatedly chooses actions, and an adversary determines the losses. The central question is: <em>what kind of adversary are you facing?</em></p>

<p>An <strong>oblivious</strong> adversary commits to the entire loss sequence before the game begins. It doesn’t see or react to the learner’s actions. Against an oblivious adversary, the natural performance measure is <strong>swap regret</strong>: “Holding the loss sequence fixed, could I have earned more by switching my action?” If yes, you have regret. The optimal response is straightforward: play the dominant action, since the environment won’t change.</p>

<p>An <strong>adaptive</strong> adversary observes the learner’s policy (or past behavior) and adjusts losses accordingly. Against an adaptive adversary, swap regret is misleading: the losses <em>would have been different</em> under a different policy. The right measure becomes <strong>policy regret</strong>: “Would a different <em>policy</em> have produced better outcomes, accounting for how the adversary would have responded to that policy?”</p>

<p>Linking this back to Newcomb’s problem:</p>

<table>
  <thead>
    <tr>
      <th>Online learning</th>
      <th>Newcomb’s problem</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Learner’s policy $\pi$</td>
      <td>Your decision algorithm</td>
    </tr>
    <tr>
      <td>Adversary’s loss sequence</td>
      <td>Box contents</td>
    </tr>
    <tr>
      <td>Oblivious adversary</td>
      <td>CDT: contents are fixed</td>
    </tr>
    <tr>
      <td>Adaptive adversary</td>
      <td>EDT/FDT: contents respond to your policy</td>
    </tr>
    <tr>
      <td>Swap regret</td>
      <td>“I’d get \$1,000 more by switching to two-box”</td>
    </tr>
    <tr>
      <td>Policy regret</td>
      <td>“A two-box <em>policy</em> leads to an empty Box B”</td>
    </tr>
  </tbody>
</table>

<p>CDT treats the predictor as oblivious. The money is placed, the game state is fixed, and two-boxing is the dominant action: exactly the swap-regret argument. The two-boxer says: “Holding Box B fixed, I always get \$1,000 more.”</p>

<p>EDT and FDT treat the predictor as adaptive. The predictor responded to your algorithm, so the box contents are a <em>function</em> of your policy. Switching from a one-box policy to a two-box policy doesn’t just change your action; it changes the adversary’s response. The one-boxer says: “A two-box <em>policy</em> faces an empty Box B.”</p>

<p>The precise role of anticipation, i.e., the adversary’s ability to foresee not just the learner’s policy but also their realization of randomness, has been studied in detail in the online learning setting (see e.g., <a href="https://arxiv.org/abs/2101.11443">Pokutta and Xu, 2021</a>, where we looked at this in particular in the context of robust optimization). When the adversary can anticipate the learner’s random coin flips, even randomized strategies lose their hedging value. The Newcomb predictor, with 99% accuracy, sits squarely in this regime: it anticipates not just your policy but your execution.</p>

<p>This reframing is clarifying because the online learning community has precise theorems about when each regret notion applies. Against a truly oblivious adversary, swap regret is tight and achievable. Against an adaptive adversary, minimizing swap regret can be catastrophically wrong; you need policy regret, which accounts for how the environment co-adapts. The entire Newcomb debate reduces to a classification question: <strong>is the predictor oblivious or adaptive?</strong></p>

<p>The answer, of course, depends on your ontology. If you believe the box contents are a physical fact determined before your choice (oblivious), CDT follows. If you believe the predictor adapted to your decision algorithm and the contents are therefore policy-dependent (adaptive), one-boxing follows. The bandit framework doesn’t resolve the paradox, but it reveals its skeleton: the same structural ambiguity that separates oblivious from adaptive adversaries in online learning separates two-boxers from one-boxers in Newcomb’s problem.</p>
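Putting numbers on the two regret notions makes the gap concrete. This is a one-shot caricature using the observed-payoff formulas from Lens 3, not a theorem from the bandit literature:

```javascript
var M = 1000000, K = 1000, q = 0.99;

// Swap-style (action-level) regret: hold the box contents fixed and ask what
// switching the single action would have earned. A one-boxer who found Box B
// filled forgoes Box A's $1,000:
var swapStyleRegret = (M + K) - M; // = $1,000

// Policy regret: let the predictor respond to the policy itself.
var oneBoxPolicyValue = q * M;                      // predictor fills Box B
var twoBoxPolicyValue = (1 - q) * (M + K) + q * K;  // predictor empties Box B
var policyRegretOfTwoBoxing = oneBoxPolicyValue - twoBoxPolicyValue;
```

The one-boxer’s swap-style regret is a constant \$1,000, while the policy regret of committing to two-boxing is \$979,000 at $q = 0.99$: three orders of magnitude apart, which is the whole dispute in two lines of arithmetic.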

<h2 id="the-mixed-strategy-flipping-a-coin">The Mixed Strategy: Flipping a Coin</h2>

<p>The first three lenses assume a deterministic chooser (the bandit lens already allows randomized policies, but there the focus was on regret notions, not on what randomization does to the predictor). So what if you flip a biased coin: one-box with probability $p$, two-box with probability $1-p$?</p>

<p>This breaks the predictor. A predictor that’s 99% accurate against deterministic strategies can’t do better than $\max(p, 1-p)$ against a genuinely random (but biased) coin: the flip is private randomness, since the boxes are already set up when it lands. The predictor’s best response is simple: predict “one-box” (and fill Box B) if $p \geq 0.5$, predict “two-box” (and leave Box B empty) if $p &lt; 0.5$. Under this best response, the predictor’s effective accuracy is</p>

\[q^*(p) = \max(p, 1-p)\]

<p>which hits its minimum of $50\%$ at $p = 0.5$: a fair coin reduces the “99% accurate” predictor to a coin flip. The expected payoff works out to:</p>

\[E[\text{payoff}] = \begin{cases} M + (1-p) \cdot K &amp; \text{if } p &gt; 0.5 \text{ (predictor fills B)} \\ (1-p) \cdot K &amp; \text{if } p &lt; 0.5 \text{ (predictor empties B)} \end{cases}\]

<p>This creates a phase transition, a million-dollar cliff:</p>
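The piecewise expectation above can be written directly as code (this mirrors the formula driving the chart below; the tie at $p = 0.5$ is broken in the chooser’s favor, an assumption the analysis leaves open):

```javascript
var M = 1000000, K = 1000;

// Expected payoff against a best-responding predictor: it fills Box B
// iff it believes the chooser is majority one-box.
function payoff(p) {
  if (p > 0.5) return M + (1 - p) * K; // B filled; coin sometimes adds $1K
  if (p < 0.5) return (1 - p) * K;     // B empty; only Box A money
  return M + 0.5 * K;                  // tie broken toward filling B
}
```

payoff(0.49) ≈ \$510 while payoff(0.51) ≈ \$1,000,490: a two-point change in coin bias moves the expectation by \$999,980.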

<div class="ncb-widget">
<div class="ncb-chart-wrap" style="height:380px;">
<canvas id="ncb-mixed-chart"></canvas>
</div>
<div class="ncb-controls">
<div class="ncb-control">
<label>One-box probability: <strong><span id="ncb-p4-val">0.51</span></strong></label>
<input type="range" id="ncb-p4" min="0" max="100" step="1" value="51" />
</div>
</div>
<div class="ncb-readout" id="ncb-mixed-readout"></div>
</div>

<script>
(function(){
var M=1000000,K=1000;
function payoff(p){return p>0.5?M+(1-p)*K:p<0.5?(1-p)*K:M+0.5*K}
function predAcc(p){return Math.max(p,1-p)}
function fmtD(n){return'<span>$</span>'+n.toLocaleString('en-US',{maximumFractionDigits:0})}

var lowD=[],highD=[],cliff=[],accD=[];
for(var i=0;i<=98;i++){var p=i/200;lowD.push({x:p,y:(1-p)*K/1000})}
lowD.push({x:0.499,y:0.501*K/1000});
cliff.push({x:0.499,y:0.501*K/1000});
cliff.push({x:0.501,y:(M+0.499*K)/1000});
highD.push({x:0.501,y:(M+0.499*K)/1000});
for(var i=102;i<=200;i++){var p=i/200;highD.push({x:p,y:(M+(1-p)*K)/1000})}
for(var i=0;i<=200;i++){var p=i/200;accD.push({x:p,y:Math.max(p,1-p)*100})}

var currentP=0.51;

var chart=new Chart(document.getElementById('ncb-mixed-chart'),{
  type:'scatter',
  data:{datasets:[
    {label:'Predictor empties Box B',data:lowD,showLine:true,borderColor:'#dc2626',borderWidth:2.5,pointRadius:0,tension:0,fill:false},
    {label:'Predictor fills Box B',data:highD,showLine:true,borderColor:'#2563eb',borderWidth:2.5,pointRadius:0,tension:0,fill:false},
    {label:'Phase transition',data:cliff,showLine:true,borderColor:'#9ca3af',borderWidth:1.5,borderDash:[5,4],pointRadius:0,tension:0},
    {label:'Predictor accuracy',data:accD,yAxisID:'y1',showLine:true,borderColor:'#9333ea',borderWidth:1.5,borderDash:[4,3],pointRadius:0,tension:0},
    {label:'_pos',data:[{x:currentP,y:payoff(currentP)/1000}],pointRadius:8,pointBackgroundColor:'#f59e0b',pointBorderColor:'#fff',pointBorderWidth:2,showLine:false}
  ]},
  options:{
    responsive:true,maintainAspectRatio:false,animation:false,
    scales:{
      x:{type:'linear',min:0,max:1,title:{display:true,text:'One-box probability (p)',font:{size:13}},
         ticks:{callback:function(v){return v.toFixed(1)}}},
      y:{type:'linear',position:'left',title:{display:true,text:'Expected payoff ($K)',font:{size:13}},min:0,max:1100,
         ticks:{callback:function(v){return v>=1000?'$'+v/1000+'M':'$'+v+'K'}},
         grid:{drawOnChartArea:true}},
      y1:{type:'linear',position:'right',min:50,max:100,
          title:{display:true,text:'Predictor accuracy (%)',font:{size:13},color:'#9333ea'},
          ticks:{color:'#9333ea',callback:function(v){return v+'%'}},
          grid:{drawOnChartArea:false}}
    },
    plugins:{
      legend:{position:'top',labels:{usePointStyle:true,pointStyle:'line',font:{size:12},
        filter:function(item){return item.text.charAt(0)!=='_'&&item.text!=='Phase transition'}}},
      tooltip:{enabled:false}
    }
  },
  plugins:[{
    id:'pLine',
    afterDraw:function(ch){
      var xA=ch.scales.x,yA=ch.scales.y,ctx=ch.ctx;
      var px=xA.getPixelForValue(currentP);
      ctx.save();ctx.strokeStyle='#f59e0b';ctx.lineWidth=1.5;ctx.setLineDash([4,4]);
      ctx.beginPath();ctx.moveTo(px,yA.top);ctx.lineTo(px,yA.bottom);ctx.stroke();
      var tx=xA.getPixelForValue(0.5);
      ctx.strokeStyle='#d1d5db';ctx.lineWidth=1;ctx.setLineDash([3,3]);
      ctx.beginPath();ctx.moveTo(tx,yA.top);ctx.lineTo(tx,yA.bottom);ctx.stroke();
      ctx.fillStyle='#9ca3af';ctx.font='11px sans-serif';ctx.textAlign='center';
      ctx.fillText('p = 0.5',tx,yA.bottom+30);
      ctx.restore();
    }
  }]
});

function update(){
  var p=parseInt(document.getElementById('ncb-p4').value)/100;
  currentP=p;
  document.getElementById('ncb-p4-val').textContent=p.toFixed(2);
  chart.data.datasets[4].data=[{x:p,y:payoff(p)/1000}];
  chart.update('none');
  var pay=payoff(p),acc=predAcc(p);
  var resp=p>=0.5?'Fill Box B (predicts one-box)':'Empty Box B (predicts two-box)';
  var vs1=pay-M,vs0=pay-K;
  document.getElementById('ncb-mixed-readout').innerHTML=
    '<strong>Strategy:</strong> One-box with probability '+p.toFixed(2)+', two-box with probability '+(1-p).toFixed(2)+
    '<br><strong>Predictor best response:</strong> '+resp+' · Effective accuracy: '+(acc*100).toFixed(1)+'%'+
    '<br><strong>Expected payoff:</strong> '+fmtD(pay)+
    (p>=0.5?' (vs. pure one-box '+fmtD(M)+': '+(vs1>=0?'+':'')+fmtD(vs1)+')':
            ' (vs. pure two-box '+fmtD(K)+': '+(vs0>=0?'+':'')+fmtD(vs0)+')');
}
document.getElementById('ncb-p4').addEventListener('input',update);
update();
})();
</script>

<p>At $p = 0.49$, the predictor leaves Box B empty and you average \$510. At $p = 0.51$, the predictor fills Box B and you average \$1,000,490. A 2% shift in coin bias produces a \$999,980 jump in expected payoff.</p>

<p>The optimal mixed strategy is $p$ just above $0.5$: one-box slightly more than half the time. This earns approximately \$1,000,500, which <em>beats</em> pure one-boxing (\$1,000,000) by \$500. You get the million (the predictor fills Box B because you’re majority one-box) while occasionally grabbing the extra \$1,000 when the coin lands on two-box; it is a bit like reaping the extra \$1,000 from an occasional counterfactual switch.</p>

<table>
  <thead>
    <tr>
      <th>Strategy</th>
      <th>Expected payoff</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Pure two-box ($p = 0$)</td>
      <td>\$1,000</td>
    </tr>
    <tr>
      <td>Pure one-box ($p = 1$)</td>
      <td>\$1,000,000</td>
    </tr>
    <tr>
      <td>Optimal mixed ($p \approx 0.51$)</td>
      <td>\$1,000,490</td>
    </tr>
  </tbody>
</table>

<p>But the optimal mixed strategy is fragile in a way that pure one-boxing is not. It depends on the predictor having a sharp threshold at $p = 0.5$ and not being able to anticipate the randomization itself. A predictor that models coin-flipping agents might demand $p &gt; 0.9$ before filling Box B (its stated accuracy then constrains which $p$ it will count as one-boxing), and then the optimal response shifts to $p$ just above 0.9, recovering less of the two-boxing bonus. In the limit where the predictor demands certainty, i.e., fully anticipates the outcome of the coin flip, you’re back to pure one-boxing.</p>
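This fragility is easy to quantify under a hypothetical thresholded predictor that fills Box B only when $p$ exceeds some threshold $t$ (the $t = 0.5$ case is the one analyzed above): the best achievable value sits just above the threshold and shrinks toward pure one-boxing’s \$1,000,000 as $t$ approaches 1.

```javascript
var M = 1000000, K = 1000;

// Hypothetical thresholded predictor: fills Box B only if p > t.
function payoffWithThreshold(p, t) {
  return p > t ? M + (1 - p) * K : (1 - p) * K;
}

// Best achievable expected payoff: play p just above the threshold.
function bestMixed(t) {
  var eps = 1e-9; // "just above t"
  return payoffWithThreshold(t + eps, t);
}
```

bestMixed(0.5) ≈ \$1,000,500 and bestMixed(0.9) ≈ \$1,000,100; as $t \to 1$, bestMixed(t) approaches \$1,000,000, exactly the pure one-boxing payoff.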

<p>In some sense, mixed strategies reveal that Newcomb’s problem is really a <em>game</em> between the chooser and the predictor, and the payoff landscape has the structure of a game-theoretic discontinuity. In particular, the predictor isn’t just a feature of the environment; it’s a player.</p>

<h2 id="the-fixed-point">The Fixed Point</h2>

<p>There’s a final mathematical observation that cuts across all four lenses. The predictor’s accuracy creates a self-referential loop: the prediction depends on your reasoning, which depends on what you expect the predictor predicted, and so on… At equilibrium, this must be a <strong>fixed point</strong>: your strategy $\sigma$ and the predictor’s model $\hat{\sigma}$ must satisfy $\hat{\sigma} = \sigma$.</p>

<p>At any such fixed point, the “I’ll trick the predictor” intuition behind two-boxing is unstable. If your strategy is to two-box, the predictor knows, and Box B is empty. If your strategy is to one-box, the predictor knows, and Box B is full. You can’t deviate profitably <em>because the prediction already accounts for your reasoning about the prediction</em>. The fixed point is self-enforcing.</p>

<p>There is a useful game-theoretic way to see this. In the original setup, the move order is: predictor fills boxes (move 1), then you choose (move 2). Two-boxing exploits last-mover advantage: you observe a fixed game state and pick the dominant action. But as the predictor’s accuracy $q \to 1$, the effective move order <em>inverts</em>. A near-perfect predictor reacts to your strategy as if it moved <em>after</em> you, not before. The temporal sequence stays the same (boxes first, choice second), but the strategic sequence flips: the predictor’s move is now essentially a best response to yours. At $q = 1$ the game is equivalent to one where you commit to a strategy first and the predictor fills the boxes second. In that game, one-boxing is obviously correct and two-boxing is obviously foolish.</p>

<p>This is why Newcomb’s problem feels so different from ordinary strategic interaction. In a standard game, you choose against a fixed opponent. In Newcomb’s problem, you choose against a mirror. The four lenses (causal, algorithmic, empirical, adversarial) are four ways of formalizing what it means to make a decision when the universe has already priced in the fact that you’re going to make it.</p>

<div class="callout">
<strong>Takeaway.</strong> The real disagreement between CDT, EDT, and FDT is not about utility calculations; it is about whether the predictor is <em>anticipatory</em>. If the prediction is fixed before you deliberate, two-boxing strictly dominates. If the predictor anticipates your decision procedure, one-boxing is the only stable strategy. Newcomb's problem persists because each decision theory hard-codes a different answer to this question, and the problem statement is carefully silent on which one is right.
</div>

<h2 id="references">References</h2>

<p>[N] Nozick, R. (1969). Newcomb’s Problem and Two Principles of Choice. In N. Rescher et al. (eds.), <em>Essays in Honor of Carl G. Hempel</em>, pp. 114-146. D. Reidel, Dordrecht. <a href="https://danielhoek.com/wp-content/uploads/2020/02/Nozick-Newcombs-Problem-and-Two-Principles-of-Choice.pdf">PDF</a></p>

<p>[P] Pearl, J. (2009). <em>Causality: Models, Reasoning, and Inference</em>. 2nd ed. Cambridge University Press.</p>

<p>[YS] Yudkowsky, E. &amp; Soares, N. (2017). Functional Decision Theory: A New Theory of Instrumental Rationality. <a href="https://arxiv.org/abs/1710.05060">arXiv:1710.05060</a></p>

<p>[B] Bostrom, N. (2002). <em>Anthropic Bias: Observation Selection Effects in Science and Philosophy</em>. Routledge.</p>

<p>[CBL] Cesa-Bianchi, N. &amp; Lugosi, G. (2006). <em>Prediction, Learning, and Games</em>. Cambridge University Press.</p>

<p>[PX] Pokutta, S. &amp; Xu, H. (2021). Adversaries in Online Learning Revisited: with applications in Robust Optimization and Adversarial training. <a href="https://arxiv.org/abs/2101.11443">arXiv:2101.11443</a></p>

<p>[GH] Gibbard, A. &amp; Harper, W. (1978). Counterfactuals and Two Kinds of Expected Utility. In C.A. Hooker, J.J. Leach &amp; E.F. McClennen (eds.), <em>Foundations and Applications of Decision Theory, Vol. II</em>, pp. 125-162. D. Reidel, Dordrecht.</p>

<p>[SEP] Stanford Encyclopedia of Philosophy. <a href="https://plato.stanford.edu/entries/decision-causal/">Causal Decision Theory</a>.</p>

<p>[PP] PhilPeople. <a href="https://survey2020.philpeople.org/survey/results/4886">2020 PhilPapers Survey: Newcomb’s Problem</a>.</p>

<p>[O] Oesterheld, C. (2017). <a href="https://casparoesterheld.com/2017/06/27/a-survey-of-polls-on-newcombs-problem/">A Survey of Polls on Newcomb’s Problem</a>.</p>

<p>[W] Wikipedia. <a href="https://en.wikipedia.org/wiki/Newcomb%27s_problem">Newcomb’s Problem</a>.</p>]]></content><author><name>Sebastian Pokutta</name></author><category term="random" /><category term="decision-theory" /><category term="game-theory" /><category term="philosophy" /><category term="causality" /><category term="online-learning" /><summary type="html"><![CDATA[TL;DR: Newcomb’s paradox — should you take one box or two? — splits rational decision-makers almost evenly. There are four natural mathematical frameworks (causal inference, algorithmic self-reference, statistical counterfactuals, and online learning) that give different answers, and the disagreement reveals deep structural tensions in what it means to choose rationally.]]></summary></entry><entry><title type="html">Do LLM Outputs Mirror Their Internal Semantic Maps? A Large-Scale Behavioral Probing Study</title><link href="http://www.pokutta.com/blog/research/2026/03/06/neural-semantic-geometry.html" rel="alternate" type="text/html" title="Do LLM Outputs Mirror Their Internal Semantic Maps? A Large-Scale Behavioral Probing Study" /><published>2026-03-06T00:00:00+01:00</published><updated>2026-03-06T00:00:00+01:00</updated><id>http://www.pokutta.com/blog/research/2026/03/06/neural-semantic-geometry</id><content type="html" xml:base="http://www.pokutta.com/blog/research/2026/03/06/neural-semantic-geometry.html"><![CDATA[<p><em>TL;DR: How faithfully does an LLM’s text output reflect the semantic geometry encoded in its hidden states? Forced-choice behavioral probing recovers substantially more internal similarity structure than open-ended generation, and behavioral features improve prediction of unseen hidden-state similarities above lexical and cross-model baselines.</em></p>

<!--more-->

<p><em>Written by <a href="https://schiekiera.github.io/">Louis Schiekiera</a>.</em></p>

<h2 id="the-core-question">The core question</h2>

<p>Cognitive scientists have long inferred semantic structure from observable behavior: show someone the word <em>dog</em>, record what they associate with it (<em>cat</em>, <em>leash</em>, <em>bark</em>), repeat across many cues, and the resulting response patterns sketch an approximate map of an otherwise hidden meaning system (De Deyne et al., 2019). LLMs offer a model system to test this logic. Unlike human participants, a language model’s internal representations are directly accessible alongside its behavioral output. So we can ask a sharper question: <strong>when we probe an LLM with word-association tasks, how much of its hidden-state semantic geometry actually shows up in the responses it produces?</strong></p>

<p>That is the focus of our recent preprint, <a href="https://arxiv.org/abs/2602.00628"><em>From Associations to Activations</em></a> led by <a href="https://schiekiera.github.io/">Louis Schiekiera</a>. Rather than comparing model behavior to human norms, we compare each model’s behavior to its <em>own</em> layerwise hidden states—treating the model as both the subject and the ground truth.</p>

<div style="text-align:center; margin-bottom: 20px;">
    <img src="http://www.pokutta.com/blog/assets/neural-semantic-geometry/conceptual.svg" alt="Conceptual overview of the framework" style="width:80%;" />
    <p style="font-size: small; font-style: italic;">Figure 1: Overview of the approach. A shared vocabulary feeds two pipelines: (i) layerwise hidden-state extraction produces a hidden-state similarity matrix, and (ii) behavioral association tasks (forced choice or free association) yield a behavioral similarity matrix. Representational similarity analysis (RSA) then quantifies how well the two geometries match.</p>
</div>

<h2 id="experimental-setup-at-a-glance">Experimental setup at a glance</h2>

<h3 id="models-under-study">Models under study</h3>

<p>We tested eight instruction-tuned decoder-only transformers spanning 7B to 14B parameters: Falcon3, Gemma-2, Llama-3.1, Mistral-7B, Mistral-Nemo, Phi-4, Qwen2.5, and rnj-1. All experiments share a single 5,000-noun vocabulary drawn from the SUBTLEX-US frequency list (Brysbaert et al., 2012).</p>

<h3 id="two-ways-to-elicit-semantic-behavior">Two ways to elicit semantic behavior</h3>

<p>We borrowed two classic paradigms from psycholinguistics.</p>

<div style="text-align:center; margin-bottom: 20px;">
    <img src="http://www.pokutta.com/blog/assets/neural-semantic-geometry/both_paradigms.svg" alt="Forced choice and free association paradigms" style="width:99%;" />
    <p style="font-size: small; font-style: italic;">Figure 2: Illustration of the two behavioral paradigms. In forced choice (left), a cue word is paired with a candidate set and the model picks the most related items. In free association (right), the model generates associates from scratch. Both produce cue–response count matrices whose row-wise cosine similarities define behavioral semantic geometries.</p>
</div>

<p><strong>Forced choice (FC).</strong> Each cue appears with 16 candidate words; the model selects exactly two that are most semantically related. A deterministic shuffle of the remaining vocabulary produces 313 unique candidate sets per cue.</p>

<p><strong>Free association (FA).</strong> Each cue is presented alone and the model generates five single-word associates. We repeat this across 126 stochastic runs per cue to accumulate stable response distributions.</p>

<p>Responses from both paradigms are aggregated into sparse cue–response count matrices. We apply positive pointwise mutual information (PPMI) reweighting to down-weight globally frequent responses, then compute cue–cue similarity via cosine between PPMI-weighted row vectors. Altogether, the dataset spans more than <strong>17.5 million trials</strong> across both paradigms and all eight models.</p>
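<p>As a rough illustration (not the paper’s code; the toy count matrix and function names are made up), the PPMI reweighting and row-wise cosine step can be sketched as:</p>

```python
import numpy as np

def ppmi(counts, eps=1e-12):
    """Positive pointwise mutual information reweighting of a
    cue-by-response count matrix; down-weights globally frequent responses."""
    total = counts.sum()
    p_cr = counts / total                      # joint probabilities
    p_c = p_cr.sum(axis=1, keepdims=True)      # cue marginals
    p_r = p_cr.sum(axis=0, keepdims=True)      # response marginals
    pmi = np.log(p_cr / (p_c * p_r + eps) + eps)
    return np.maximum(pmi, 0.0)                # clip negative PMI to zero

def cosine_similarity_rows(M):
    """Cue-cue similarity: cosine between row vectors of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                    # guard against all-zero rows
    unit = M / norms
    return unit @ unit.T

# Toy 3-cue, 3-response count matrix (illustrative only).
counts = np.array([[4., 0., 1.],
                   [0., 3., 2.],
                   [1., 1., 0.]])
S_behavioral = cosine_similarity_rows(ppmi(counts))
```

<p>At the paper’s scale the count matrices are large and sparse, so a sparse-matrix implementation would be preferable, but the arithmetic is the same.</p>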

<h3 id="extracting-hidden-state-geometry">Extracting hidden-state geometry</h3>

<p>For every model and every word in the vocabulary, we pulled layerwise hidden states under four contextual embedding strategies:</p>

<ul>
  <li><strong>Averaged</strong> — the word embedded in 50 naturally occurring C4 sentences (Raffel et al., 2020), hidden states averaged across contexts (Bommasani et al., 2020).</li>
  <li><strong>Meaning</strong> — a fixed definitional prompt (<em>“What is the meaning of the word {w}?”</em>).</li>
  <li><strong>Task (FC)</strong> — the word embedded in the forced-choice instruction prompt, minus the candidate list.</li>
  <li><strong>Task (FA)</strong> — the word embedded in the free-association instruction prompt.</li>
</ul>

<p>Cosine similarity between mean-centered layerwise vectors (Ethayarajh, 2019) yields a hidden-state similarity matrix for each model, layer, and extraction strategy.</p>
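<p>A minimal sketch of this step, assuming the hidden states for one model, layer, and extraction strategy have already been collected into a words-by-dimensions array (the random data is purely illustrative):</p>

```python
import numpy as np

def hidden_state_similarity(H):
    """H: (n_words, d) hidden states for one layer.
    Mean-center across the vocabulary (mitigating the anisotropic
    mean direction of transformer representations), then take
    pairwise cosine similarity between words."""
    Hc = H - H.mean(axis=0, keepdims=True)
    Hc = Hc / np.linalg.norm(Hc, axis=1, keepdims=True)
    return Hc @ Hc.T

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 16))      # 5 toy words, 16-dim states
S_hidden = hidden_state_similarity(H)
```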

<h3 id="reference-baselines">Reference baselines</h3>

<p>Three external baselines anchor the comparison: <strong>FastText</strong> static word vectors (Bojanowski et al., 2017), <strong>BERT</strong> contextual embeddings (Devlin et al., 2019), and a <strong>cross-model consensus</strong> geometry that averages hidden-state similarities from all <em>other</em> models—motivated by evidence for a shared semantic subspace across architectures (Huh et al., 2024).</p>
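<p>The consensus baseline is, in essence, a leave-one-out average over the other models’ similarity matrices; a minimal sketch (toy matrices, not actual model similarities):</p>

```python
import numpy as np

def consensus_similarity(sim_matrices, target_idx):
    """Leave-one-out consensus geometry: average the hidden-state
    similarity matrices of every model except the target model."""
    others = [S for i, S in enumerate(sim_matrices) if i != target_idx]
    return np.mean(others, axis=0)

# Three toy "model" similarity matrices with constant entries 0, 1, 2.
mats = [np.full((3, 3), v) for v in (0.0, 1.0, 2.0)]
C = consensus_similarity(mats, target_idx=0)   # averages the 1.0 and 2.0 matrices
```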

<h3 id="how-we-measure-alignment">How we measure alignment</h3>

<p>We employed three complementary metrics:</p>

<ol>
  <li>
    <p><strong>RSA</strong> (Kriegeskorte et al., 2008; Nili et al., 2014) — Pearson correlation between vectorized upper-triangular entries of the hidden-state and reference similarity matrices, computed per layer.</p>
  </li>
  <li>
    <p><strong>Nearest-neighbor overlap</strong> ($\mathrm{NN@}k$) — fraction of shared $k$-nearest neighbors between hidden-state and reference similarity spaces.</p>
  </li>
  <li>
    <p><strong>Held-out-words ridge regression</strong> — can behavioral similarity predict hidden-state similarities for words that were held out when fitting the regression? This tests generalization beyond lexical baselines and cross-model consensus.</p>
  </li>
</ol>
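<p>The first two metrics can be sketched in a few lines of NumPy (illustrative only; the paper’s exact tie-breaking and aggregation choices may differ):</p>

```python
import numpy as np

def rsa(S1, S2):
    """RSA score: Pearson r between the vectorized upper-triangular
    entries (diagonal excluded) of two similarity matrices."""
    iu = np.triu_indices_from(S1, k=1)
    return float(np.corrcoef(S1[iu], S2[iu])[0, 1])

def nn_overlap(S1, S2, k):
    """NN@k: average fraction of shared k-nearest neighbors per word
    (self excluded) between two similarity spaces."""
    n = S1.shape[0]
    shared = []
    for i in range(n):
        nbrs1 = [j for j in np.argsort(-S1[i]) if j != i][:k]
        nbrs2 = [j for j in np.argsort(-S2[i]) if j != i][:k]
        shared.append(len(set(nbrs1) & set(nbrs2)) / k)
    return float(np.mean(shared))

# Toy symmetric "similarity" matrix for a sanity check.
rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
S = (A + A.T) / 2
```

<p>By construction both scores reach their maximum when a matrix is compared with itself, which makes for an easy sanity check before running the real comparison.</p>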

<h2 id="key-findings">Key findings</h2>

<h3 id="constrained-tasks-recover-far-more-internal-structure">Constrained tasks recover far more internal structure</h3>

<p>The gap between paradigms is substantial. Forced-choice behavior aligns with hidden-state geometry far more strongly than free association—consistently, across every model and evaluation metric.</p>

<p>Under the best extraction strategy (Task FC), mean RSA reaches $r = .463$ for forced choice versus only $r = .199$ for free association. Even the weakest FC condition (Averaged extraction, $r = .346$) outperforms the strongest FA condition.</p>

<div style="text-align:center; margin-bottom: 20px;">
    <img src="http://www.pokutta.com/blog/assets/neural-semantic-geometry/rsa_nn_grid_1x2.svg" alt="Summary RSA and nearest-neighbor overlap results" style="width:99%;" />
    <p style="font-size: small; font-style: italic;">Figure 3: Aggregate alignment across models. Left: RSA correlation by layer. Right: nearest-neighbor overlap as a function of neighborhood size $k$ (log scale). Forced-choice behavior (green) tracks hidden-state structure far more closely than free association (red). Cross-model consensus (black) sets the ceiling.</p>
</div>

<p>Why does FC win so decisively? Its controlled candidate sets force every response to emerge from an explicit comparison, concentrating observations onto shared supports and producing a denser, less noisy cue–response matrix (Roads &amp; Love, 2021). Free association, by contrast, disperses probability mass across a long tail of idiosyncratic responses, yielding sparser vectors with lower signal-to-noise for recovering geometric structure.</p>

<h3 id="extraction-context-shifts-where-alignment-peaks">Extraction context shifts where alignment peaks</h3>

<p>The choice of how hidden states are extracted determines <em>which layers</em> show the strongest match.</p>

<div style="text-align:center; margin-bottom: 20px;">
    <img src="http://www.pokutta.com/blog/assets/neural-semantic-geometry/rsa_line_plot_1x2_grid_fc_fa.svg" alt="Layerwise RSA for FC and FA under different extraction strategies" style="width:99%;" />
    <p style="font-size: small; font-style: italic;">Figure 4: Layerwise RSA profiles under different extraction strategies. Task-aligned and meaning-focused prompts peak at earlier, mid-depth layers. Averaging over natural contexts shifts the peak to later layers.</p>
</div>

<p>Task-aligned and meaning-based prompts push the model into a comparable semantically focused processing mode, and peak alignment appears at earlier to mid-depth layers—consistent with evidence that core lexical-semantic representations crystallize in intermediate transformer blocks. Averaging over diverse natural contexts, by contrast, mixes senses and topics, diluting the word-level signal and shifting alignment peaks toward the final layers.</p>

<h3 id="the-paradigm-advantage-holds-across-all-eight-models">The paradigm advantage holds across all eight models</h3>

<p>Model-by-model heatmaps confirm that the FC superiority is universal, though its magnitude varies with architecture:</p>

<div style="text-align:center; margin-bottom: 20px;">
    <img src="http://www.pokutta.com/blog/assets/neural-semantic-geometry/rsa_fc_fa_2x4_grid.svg" alt="RSA heatmap across models" style="width:99%;" />
    <p style="font-size: small; font-style: italic;">Figure 5: Per-model RSA heatmaps. Each panel contrasts forced-choice (left sub-panel) and free-association (right sub-panel) behavioral similarity against hidden states, broken down by extraction strategy and summarized across layers.</p>
</div>

<h3 id="behavior-predicts-hidden-structure-on-unseen-words">Behavior predicts hidden structure on unseen words</h3>

<p>The held-out regression provides the most stringent test. After controlling for FastText, BERT, and cross-model consensus, adding FC behavioral similarity still improves mean test $R^2$ by $+.022$; FA adds a marginal $+.002$. The full model achieves mean $R^2 = .587$ (baseline: $.569$), peaking at $R^2 = .844$ for Llama-3.1-8B-Instruct.</p>

<div style="text-align:center; margin-bottom: 20px;">
    <img src="http://www.pokutta.com/blog/assets/neural-semantic-geometry/rr_model_performance_grid_2x4.svg" alt="Ridge regression performance across models" style="width:99%;" />
    <p style="font-size: small; font-style: italic;">Figure 6: Held-out ridge regression results for all eight models. Bold values indicate $R^2$ for the full predictor set (behavioral + baselines); parenthetical values show the baseline without behavioral features.</p>
</div>

<p>This means that behavioral probing captures something about a model’s internal semantic organization that lexical vectors and cross-model structure alone do not—especially when the behavioral measurement is carefully constrained.</p>
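<p>The held-out regression logic can be sketched as follows; the function name, the closed-form ridge solver, and the toy data are our illustrative assumptions, not the paper’s implementation:</p>

```python
import numpy as np

def heldout_r2(features, target, train_idx, test_idx, lam=1.0):
    """Fit ridge regression on the training words and report R^2 on the
    held-out words. `features` would stack behavioral + baseline similarity
    columns; `target` holds hidden-state similarities. An intercept is
    added as a constant column (and is mildly penalized, which is fine
    for roughly centered targets)."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    Xtr, ytr = X[train_idx], target[train_idx]
    w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(X.shape[1]), Xtr.T @ ytr)
    resid = target[test_idx] - X[test_idx] @ w
    ss_tot = np.sum((target[test_idx] - target[test_idx].mean()) ** 2)
    return 1.0 - np.sum(resid ** 2) / ss_tot

# Toy check: targets generated from the features should be predictable
# on held-out rows.
rng = np.random.default_rng(2)
F = rng.normal(size=(40, 3))
y = F @ np.array([0.5, -1.0, 2.0]) + 0.01 * rng.normal(size=40)
train, test = np.arange(30), np.arange(30, 40)
r2 = heldout_r2(F, y, train, test, lam=0.1)
```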

<h2 id="broader-implications">Broader implications</h2>

<h3 id="black-box-interpretability">Black-box interpretability</h3>

<p>When logits and activations are unavailable, behavioral probing remains an important path to interpretability. Forced-choice paradigms are especially promising: their constrained response sets act as structured measurement instruments that concentrate informative signal.</p>

<h3 id="lessons-for-cognitive-science">Lessons for cognitive science</h3>

<p>Our fully transparent LLM setup lets us rigorously test a foundational cognitive-science assumption—that structured behavior is constrained by, and therefore partially reveals, internal states. The sharp FC–FA divergence demonstrates that <em>whether</em> behavior reveals internal structure depends critically on the measurement protocol. Open-ended tasks are not inherently less informative; they simply distribute responses too thinly for cosine-based geometry recovery. Protocol design is itself a variable.</p>

<h3 id="a-shared-semantic-substrate">A shared semantic substrate</h3>

<p>One of the most important observations is the strength of cross-model consensus. Similarity structure aggregated from the other seven models explains a large share of variance in any target model’s hidden-state geometry, lending further support to the hypothesis of a common, low-dimensional semantic subspace across diverse LLM architectures (Huh et al., 2024).</p>

<h2 id="references">References</h2>

<ul>
  <li>
    <p>Bojanowski, P., Grave, E., Joulin, A., &amp; Mikolov, T. (2017). Enriching word vectors with subword information. <em>Transactions of the Association for Computational Linguistics, 5</em>, 135–146. <a href="https://doi.org/10.1162/tacl_a_00051">doi:10.1162/tacl_a_00051</a></p>
  </li>
  <li>
    <p>Bommasani, R., Davis, K., &amp; Cardie, C. (2020). Interpreting pretrained contextualized representations via reductions to static embeddings. In <em>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</em> (pp. 4758–4781). <a href="https://doi.org/10.18653/v1/2020.acl-main.431">doi:10.18653/v1/2020.acl-main.431</a></p>
  </li>
  <li>
    <p>Brysbaert, M., New, B., &amp; Keuleers, E. (2012). Adding part-of-speech information to the SUBTLEX-US word frequencies. <em>Behavior Research Methods, 44</em>(4), 991–997. <a href="https://doi.org/10.3758/s13428-012-0190-4">doi:10.3758/s13428-012-0190-4</a></p>
  </li>
  <li>
    <p>De Deyne, S., Navarro, D. J., Perfors, A., Brysbaert, M., &amp; Storms, G. (2019). The Small World of Words: English word association norms for over 12,000 cue words. <em>Behavior Research Methods, 51</em>(3), 987–1006. <a href="https://doi.org/10.3758/s13428-018-1115-7">doi:10.3758/s13428-018-1115-7</a></p>
  </li>
  <li>
    <p>Devlin, J., Chang, M.-W., Lee, K., &amp; Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In <em>Proceedings of NAACL-HLT 2019</em> (pp. 4171–4186). <a href="https://doi.org/10.18653/v1/N19-1423">doi:10.18653/v1/N19-1423</a></p>
  </li>
  <li>
    <p>Ethayarajh, K. (2019). How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. <em>arXiv preprint</em>. <a href="https://arxiv.org/abs/1909.00512">arxiv:1909.00512</a></p>
  </li>
  <li>
    <p>Huh, M., Cheung, B., Wang, T., &amp; Isola, P. (2024). The platonic representation hypothesis. <em>arXiv preprint</em>. <a href="https://arxiv.org/abs/2405.07987">arxiv:2405.07987</a></p>
  </li>
  <li>
    <p>Kriegeskorte, N., Mur, M., &amp; Bandettini, P. A. (2008). Representational similarity analysis—connecting the branches of systems neuroscience. <em>Frontiers in Systems Neuroscience, 2</em>, 4. <a href="https://doi.org/10.3389/neuro.06.004.2008">doi:10.3389/neuro.06.004.2008</a></p>
  </li>
  <li>
    <p>Nili, H., Wingfield, C., Walther, A., Su, L., Marslen-Wilson, W., &amp; Kriegeskorte, N. (2014). A toolbox for representational similarity analysis. <em>PLoS Computational Biology, 10</em>(4), e1003553. <a href="https://doi.org/10.1371/journal.pcbi.1003553">doi:10.1371/journal.pcbi.1003553</a></p>
  </li>
  <li>
    <p>Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., &amp; Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. <em>Journal of Machine Learning Research, 21</em>(140), 1–67. <a href="http://jmlr.org/papers/v21/20-074.html">jmlr.org</a></p>
  </li>
  <li>
    <p>Roads, B. D., &amp; Love, B. C. (2021). Enriching ImageNet with human similarity judgments and psychological embeddings. In <em>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em> (pp. 3547–3557). <a href="https://doi.org/10.1109/CVPR46437.2021.00355">doi:10.1109/CVPR46437.2021.00355</a></p>
  </li>
</ul>]]></content><author><name>Louis Schiekiera</name></author><category term="research" /><category term="interpretability" /><category term="representation-learning" /><category term="llm-behavior" /><category term="semantic-geometry" /><summary type="html"><![CDATA[TL;DR: How faithfully does an LLM’s text output reflect the semantic geometry encoded in its hidden states? Forced-choice behavioral probing recovers substantially more internal similarity structure than open-ended generation, and behavioral features improve prediction of unseen hidden-state similarities above lexical and cross-model baselines.]]></summary></entry><entry><title type="html">Between Theory and Reality: How Schools Grapple with Heterogeneity and Where AI Fits</title><link href="http://www.pokutta.com/blog/research/2025/12/15/school-ai.html" rel="alternate" type="text/html" title="Between Theory and Reality: How Schools Grapple with Heterogeneity and Where AI Fits" /><published>2025-12-15T00:00:00+01:00</published><updated>2025-12-15T00:00:00+01:00</updated><id>http://www.pokutta.com/blog/research/2025/12/15/school-ai</id><content type="html" xml:base="http://www.pokutta.com/blog/research/2025/12/15/school-ai.html"><![CDATA[<p><em>TL;DR: Rising classroom heterogeneity and workload make AI in schools inevitable; with FACET we explore how evidence-based, teacher-centered AI can support meaningful differentiation and AI literacy without replacing human judgment.</em></p>

<!--more-->

<p><em>Written by Jana Gonnermann-Müller.</em></p>

<h2 id="recognizing-the-need-for-ai-in-schools">Recognizing the Need for AI in Schools</h2>

<p>On November 26, school principals from primary and secondary schools gathered with regional authorities for the annual <em>KI-Fachtag der Schulen</em>. The theme this year: <strong>Artificial Intelligence in Education</strong>.</p>

<p>The spotlight was firmly on concrete use cases: how artificial intelligence (AI) can support schools today, but also on the fear that AI will be used without verification or cross-checking, potentially leading to a loss of skills and knowledge. At the center of the discussion was a dual question. On the one hand, how do we ensure that AI use is meaningful, augmenting rather than replacing core competencies? This includes fostering AI literacy, critical thinking, and verification skills, so that students learn not simply to consume AI outputs but to evaluate, challenge, and integrate them in ways that support long-term skill development. On the other hand, how can AI help schools address structural pressures such as the rising workload caused by classroom heterogeneity and teacher shortages?</p>

<p>As part of the event, our team from the <a href="https://www.zib.de/iol/">IOL Lab</a> at the <a href="https://www.zib.de/">Zuse Institute Berlin</a> was invited to give a talk and to lead two hands-on workshops. In the keynote, we outlined the structural pressures that make AI integration in schools necessary: demographic change, Germany’s lag in technological adoption, and the growing information overload that increasingly requires students to critically evaluate and interpret information sources. In such a context, AI literacy and critical thinking become foundational competences. The workshop discussions quickly revealed just how acute these pressures have become. School leaders described rising heterogeneity in their classrooms, the growing need for differentiated instruction, and the challenges of ensuring meaningful and responsible AI use in everyday teaching. Their questions and concerns underscored the urgency of developing approaches that are not only technically feasible but also pedagogically sound. Yet meaningful AI integration depends on supporting teachers with practical, research-grounded tools that address, for example, diverse learning needs without adding to their workload.</p>

<p>This post situates these discussions within a broader research context, where we see our work as one puzzle piece in a much larger effort: working directly with schools to understand their needs, translate them into researchable questions, and offer evidence-based opportunities to address them. We therefore collaborate with schools to empirically examine where AI can support teaching, such as in differentiation under increasing heterogeneity, where it cannot, and how it must be designed to strengthen rather than undermine students’ skill acquisition. By co-developing and rigorously evaluating an AI-supported tool with practitioners, we aim to move the conversation away from emotion-driven expectations and fears and toward an evidence-informed understanding of what works in real classroom conditions.</p>

<h2 id="the-bigger-picture-what-ai-and-school-encompasses">The Bigger Picture: What ‘AI and School’ Encompasses</h2>

<p>Schools constitute the first structured environment in which young people interact with broader social and technological systems. As such, they are expected to cultivate foundational competencies, such as critical thinking, judgment, and collaboration, while simultaneously preparing students for rapidly evolving technological conditions. Contemporary education systems therefore face a multi-layered mandate: to enable engagement with AI, support cognitive and socio-emotional development, and contribute to the reduction rather than the reproduction of socio-economic disparities. Within this broader mandate, AI emerges not as an optional add-on but as an integral part of the technological landscape and the decision-making contexts that students will have to navigate. The central question is thus not whether AI should be present in schools, but <strong>how</strong> it can be integrated in ways that reinforce, rather than erode, core cognitive and analytical skills.</p>

<p>We distinguish two interdependent domains within ‘AI in schools’. The first is <em>education about AI</em>, encompassing digital literacy, data literacy, and increasingly AI literacy. These competencies enable students to interpret uncertainty, understand model behavior, and critically evaluate algorithmic outputs—skills that underpin agency in AI-mediated environments.</p>

<p>The second is <em>education with AI</em>, referring to the use of AI tools within teaching and learning processes. In this domain, AI can help address structural challenges such as teacher shortages, the need for differentiated instruction, and unequal access to tutoring. Practical examples include teacher-facing support for generating differentiated materials, student-facing tutoring systems that may mitigate socio-economic disparities, and in-class assistants that scaffold reasoning without displacing human pedagogical judgment.</p>

<p>From this perspective, the objective is not AI adoption per se, but competence-oriented integration, ensuring that students develop AI literacy, critical analysis, and robust domain skills, while teachers receive effective, research-grounded support to manage rising workload and heterogeneity without compromising didactical quality.</p>

<p>While public discussions about AI in education often remain abstract, empirical research highlights several concrete structural challenges. One emerging issue is the need to handle the massive increase of (AI-generated) information, which requires students to learn verification and critical evaluation of information. A second, persistent challenge is rising heterogeneity within classrooms <a href="https://doi.org/10.1080/13670050.2021.1981821">[Siepmann et al. (2023). Attention to diversity in German CLIL classrooms: multi-perspective research on students’ and teachers’ perceptions. International Journal of Bilingual Education and Bilingualism]</a>. Students in schools differ substantially in prior knowledge, linguistic background, cognitive profiles, motivational orientations, and emotional needs, patterns documented widely in international research and in German data from the <a href="https://deutsches-schulportal.de/bildungswesen/iqb-bildungstrend-die-wichtigsten-ergebnisse/#die-wichtigsten-ergebnisse-zum-iqb-bildungstrend-2024">IQB Bildungstrend 2024</a>. Many classrooms include both high-achieving students and learners requiring significant support, including those with reading and spelling difficulties or ADHD, whose prevalence has increased in recent years <a href="https://www.nature.com/articles/d41586-025-03855-2">[Pearson (2025). ADHD diagnoses are growing. What’s going on?. Nature]</a>. Educational theory has long shown that addressing such diversity requires differentiated instruction <a href="https://www.scirp.org/reference/referencespapers?referenceid=2055060">[Tomlinson (2014). The Differentiated Classroom: Responding to the Needs of All Learners. 2nd Edition, ASCD, Alexandria]</a>. This entails providing tasks at varying levels of complexity, offering scaffolded support, giving individualized hints and stepwise explanations, and supplying feedback aligned with learners’ needs. 
Motivational research further demonstrates that effective instruction must integrate cognitive challenge with emotional support, self-efficacy building, and relevance cues, as cognitive and affective processes are tightly intertwined <a href="https://doi.org/10.1007/s11618-010-0113-z">[Pietsch (2010). Evaluation von Unterrichtsstandards. Zeitschrift für Erziehungswissenschaften]</a>.</p>

<p>Yet teachers face what research describes as an <em>implementation gap</em>: the discrepancy between pedagogical requirements and what is feasible given limited time, class sizes, and workload <a href="https://www.bosch-stiftung.de/de/publikation/deutsches-schulbarometer-lehrkraefte-2025">[Jude (2025). Deutsches Schulbarometer Lehrkräfte 2025]</a>. Although differentiated instruction is theoretically well understood, its practical implementation is difficult. Creating multiple versions of tasks, adjusting scaffolds, and providing targeted feedback for diverse learner profiles is time-intensive, and most available materials still assume an ‘average learner’, a construct increasingly disconnected from classroom reality.</p>

<p>Other countries are already responding to these pressures in structured and systematic ways. In the United States, <a href="https://de.khanacademy.org/"><em>Khan Academy’s Khanmigo</em></a> uses large language models (LLMs) to provide individualized guidance, task variation, and adaptive hints. In China, <a href="https://squirrelai.com/"><em>Squirrel AI</em></a> employs diagnostic engines that map knowledge gaps and generate highly personalized learning paths. Singapore integrates adaptive learning systems directly into its national Student Learning Space under the <a href="https://www.moe.gov.sg/education-in-sg/educational-technology-journey/edtech-masterplan"><em>EdTech Masterplan 2030</em></a>, enabling teachers to deliver levelled tasks and automated feedback aligned with curriculum structures.</p>

<h2 id="our-approach-the-facet-framework">Our approach: The FACET Framework</h2>

<p>Against the backdrop of rising heterogeneity, motivational disparities, and persistent teacher shortages, our work on FACET aims to contribute an evidence-based component to the debate about AI integration in German schools. <a href="https://arxiv.org/abs/2508.11401">FACET is a research framework</a> designed to systematically examine how AI can support differentiation under real classroom constraints. Its overarching goal is to help teachers create differentiated teaching materials for diverse learner groups and, at the same time, to generate empirical insights into when AI meaningfully supports teaching and learning, where it falls short, and how it must be designed so that it strengthens rather than undermines skill acquisition.</p>

<p>FACET is implemented as a multi-agent system with four interconnected layers:</p>

<ol>
  <li>
    <p><strong>Learner agents</strong> simulate student behavior based on profiles that teachers themselves can define and instantiate—reflecting varying prior knowledge, low motivation, reading and writing difficulties, ADHD-related challenges, or other characteristics observed in their classes. These agents attempt tasks, reveal misconceptions, and produce reasoning traces and emotional cues. As tasks, teachers can upload their own materials or rely on tasks predefined by the curriculum.</p>
  </li>
  <li>
    <p>The <strong>assessment agent</strong> analyzes how these simulated learners interact with instructional materials—whether uploaded by the teacher or prescribed by the curriculum. It evaluates both their reasoning processes and their affective responses to provide the basis for adapting the materials.</p>
  </li>
  <li>
    <p>The <strong>generator agent</strong> creates differentiated teaching materials based on these diagnostics. This includes levelled tasks aligned with the curriculum’s ‘areas of competence’, scaffolded steps, hints, and motivational feedback tailored to each simulated learner profile. This layer integrates curriculum structures as well as diagnostic and didactical concepts, allowing the system to identify where learners with different profiles are likely to experience cognitive or motivational difficulties.</p>
  </li>
  <li>
    <p>The <strong>evaluator agent</strong> reviews the generated output along dimensions such as didactical coherence, clarity, creativity, and suitability for the specified learners. Teachers can then inspect, adjust, or reject the materials as they see fit. They can also download the finalized materials as Word, PDF, or LaTeX documents.</p>
  </li>
</ol>

<p>The FACET architecture is not intended to replace teachers. Instead, it provides structured starting points for differentiation, aiming to reduce workload while preserving pedagogical control.</p>

<div style="text-align:center; margin-bottom: 20px;">
    <img src="http://www.pokutta.com/blog/assets/ai-school/Facet_LandingPage.png" alt="FACET 1" style="width:49%;" />
    <img src="http://www.pokutta.com/blog/assets/ai-school/Facet_1.png" alt="FACET 2" style="width:49%;" />
    <p style="font-size: small; font-style: italic;">Figure 1: FACET's landing page and a screenshot of the worksheet generator</p>
</div>

<h2 id="facet-meets-reality-insights-from-practice">FACET Meets Reality: Insights From Practice</h2>

<p>During the ‘KI-Fachtag der Schulen’, parts of our team — Konstantin Fackeldey, Jana Gonnermann-Müller, and Nicolas Leins — conducted two workshops with around 30 school principals to test FACET under real-world conditions. The discussions provided a clear picture of the pressures schools face and offered feedback that will directly inform the next stages of FACET’s development.</p>

<p>Principals consistently emphasized the urgency of supporting differentiation in increasingly heterogeneous classrooms. Many reported rising numbers of students with reading and spelling difficulties, varying language proficiencies, and large performance gaps within the same class. One way teachers try to address differences in learning pace, they reported, is by allowing faster-learning students to move on to new topics while slower-learning students remain with the current one. However, this forced form of differentiation, driven by the lack of time to create differentiated materials for a shared topic, makes working as a unified class group difficult: the class ends up working on different topics, with some students becoming bored while others still need significantly more time to complete their tasks.</p>

<p>Inclusion schools in particular expressed strong interest, noting that current staffing conditions make meaningful differentiation nearly impossible. As one principal described: ‘We have so many different children in our schools … we are labeled an inclusion school, yet we only have one teacher for an entire class. We don’t know how we’re supposed to meet all children’s needs.’ The possibility of generating differentiated materials tailored to specific learner profiles resonated strongly. Principals acknowledged the quality of FACET’s outputs — ‘much more thoughtfully constructed than what we can produce ourselves under time pressure’ — while also highlighting important requirements for classroom use. In particular, they stressed that differentiated materials must be aligned with the curriculum and that teachers need to integrate FACET’s outputs into the broader workflow of lesson planning and classroom management.</p>

<p>This real-world feedback is crucial for ensuring that FACET evolves in line with the actual needs of teachers. It underscored that any AI-supported tool must be tightly coordinated with curricular structures and flexible enough to fit into existing teaching practices. Principals also contributed new use cases we had not previously considered, such as using FACET as an in-class AI assistant that scaffolds materials for diverse students. These insights allow us to refine FACET not as an abstract technological experiment but as a research framework developed with schools and oriented toward the real demands of everyday teaching. Many principals expressed interest in long-term testing, and we look forward to continuing this collaborative process as FACET evolves.</p>

<div style="text-align:center; margin-bottom: 20px;">
    <img src="http://www.pokutta.com/blog/assets/ai-school/Fachtag_Talk.jpg" alt="KI Fachtag" style="width:99%;" />
    <p style="font-size: small; font-style: italic;">Figure 2: Sebastian Pokutta delivering a keynote at the KI Fachtag on AI in Schools (<a href="https://www.ki-fachtag-schulen.de/2025/de/programm/KI-Fachtag-2025/" target="_blank" rel="noopener">more information</a>).</p>
</div>

<h2 id="the-weizenbaum-debate-on-ai-in-schools">The Weizenbaum Debate on AI in Schools</h2>

<p>Just days earlier, on November 18, our team took part in the <a href="https://www.weizenbaum-institut.de/news/detail/welche-ki-gehoert-ins-klassenzimmer-rueckblick-auf-die-weizenbaum-debate/">4th Weizenbaum Debate</a>, a packed and lively evening at the Quatsch Comedy Club that brought together researchers, teachers, and students to explore what AI in the school of the future should look like. We were invited to join the debate, sharing insights from research on the FACET and discussing how AI can be meaningfully integrated into everyday teaching. Again, we discussed pressing questions, such as when AI genuinely supports learning and when it slips into mere ‘cognitive offloading’, how generative AI must be designed so that it strengthens understanding rather than undermining it, and what competencies students need to use AI in a self-determined, responsible way. Teachers and students challenged long-standing assumptions about exams, resources, and the role of human educators, grounding the debate in lived reality.</p>

<div style="text-align:center; margin-bottom: 20px;">
    <img src="http://www.pokutta.com/blog/assets/ai-school/WI_Debate.jpeg" alt="Weizenbaum Debate" style="width:99%;" />
    <p style="font-size: small; font-style: italic;">Figure 3: Jana Gonnermann-Müller on stage at the Weizenbaum Debate on AI in Schools (<a href="https://www.weizenbaum-institut.de/events/weizenbaum-debate-ki-gehoert-ins-klassenzimmer/" target="_blank" rel="noopener">more information)</a>.</p>
</div>

<h2 id="the-bigger-picture---why-all-of-this-matters">The Bigger Picture - Why All of This Matters</h2>

<p>The underlying core challenge is structural: without scalable support for differentiation, rising learner heterogeneity will continue to outstrip schools’ capacity and deepen educational inequality. Recent assessments already show widening gaps in competencies, motivation, and socio-economic background—pressures intensified by persistent teacher shortages, where <em>education with AI</em> can offer targeted relief. At the same time, students must learn to navigate environments saturated with information, misinformation, and rapidly changing knowledge, underscoring the need for <em>education about AI</em>. Frameworks like FACET cannot solve these systemic issues, but they can offer research solutions that ease critical bottlenecks and free teachers to focus on what cannot be automated: cultivating critical thinking, guiding inquiry, and preparing students to participate responsibly in an AI-shaped society.</p>

<p>Interested in FACET? Read our paper at <a href="https://arxiv.org/abs/2508.11401">arxiv.org/abs/2508.11401</a> or reach out to us anytime.
This research by Konstantin Fackeldey, Jana Gonnermann-Müller, Jennifer Haase, Nicolas Leins, and Sebastian Pokutta is part of our ongoing work of the <a href="https://iol.zib.de/research/iol-human.html/">Humans and AI</a> research thrust, which is part of the IOL Lab at the Zuse Institute Berlin.</p>]]></content><author><name>Jana Gonnermann-Müller</name></author><category term="research" /><category term="ai" /><category term="education" /><summary type="html"><![CDATA[TL;DR: Rising classroom heterogeneity and workload make AI in schools inevitable; with FACET we explore how evidence-based, teacher-centered AI can support meaningful differentiation and AI literacy without replacing human judgment.]]></summary></entry><entry><title type="html">SCIP Optimization Suite 10.0: Exact Solving, Better Decompositions, and a More Productive Ecosystem</title><link href="http://www.pokutta.com/blog/research/2025/12/02/scip-10.html" rel="alternate" type="text/html" title="SCIP Optimization Suite 10.0: Exact Solving, Better Decompositions, and a More Productive Ecosystem" /><published>2025-12-02T00:00:00+01:00</published><updated>2025-12-02T00:00:00+01:00</updated><id>http://www.pokutta.com/blog/research/2025/12/02/scip-10</id><content type="html" xml:base="http://www.pokutta.com/blog/research/2025/12/02/scip-10.html"><![CDATA[<p><em>TL;DR: SCIP Optimization Suite 10.0 brings a numerically exact solving mode for rational MILPs, noticeable performance gains for MILP/MINLP, stronger presolving and symmetry handling, better heuristics and conflict analysis, IIS detection, and major updates to GCG, PaPILO, PySCIPOpt, and MIP-DD.</em></p>

<!--more-->

<p><em>Written by <a href="https://iol.zib.de/team/dominik-kamp.html">Dominik Kamp</a>, <a href="https://gionimexi.com/">Gioni Mexi</a>, and <a href="https://www.pokutta.com">Sebastian Pokutta</a>.</em></p>

<style>
  .callout {
    border-left: 4px solid #2563eb;
    border-right: 4px solid #2563eb;
    background: #f5f7ff;
    padding: 12px 16px;
    margin: 1em 0;
  }
</style>

<h2 id="introduction">Introduction</h2>

<p>The SCIP Optimization Suite 10.0 is now available. The new release updates the entire stack:</p>

<ul>
  <li><strong>SCIP</strong> 10.0 (core solver)</li>
  <li><strong>SoPlex</strong> 8.0 (LP solver)</li>
  <li><strong>PaPILO</strong> 3.0 (presolving library)</li>
  <li><strong>GCG</strong> 4.0 (automatic decomposition solver)</li>
  <li><strong>Zimpl</strong> 3.7 (modeling language)</li>
  <li><strong>UG</strong> 1.0 (parallel framework)</li>
  <li>plus <strong>SCIP-SDP</strong>, <strong>PySCIPOpt</strong>, <strong>MIP-DD</strong>, and the new <strong>PBSolver</strong> application.</li>
</ul>

<p>At a high level:</p>

<ul>
  <li>SCIP 10.0 is <strong>faster and more robust</strong> on both MILPs and MINLPs than 9.x, with the largest gains on harder instances.</li>
  <li>A new <strong>exact solving mode</strong> can solve rational MILPs without numerical tolerances and produce verifiable certificates.</li>
  <li><strong>IIS detection</strong> for MIPs is now integrated directly into SCIP, enabling users to extract irreducible infeasible subsystems for debugging and model analysis.</li>
  <li>Decomposition, presolving, symmetry handling, and conflict analysis all received substantial upgrades.</li>
  <li>The ecosystem around SCIP (GCG, PaPILO, SCIP-SDP, interfaces, and tooling) has matured further.</li>
</ul>

<p>For full technical details see the <a href="https://optimization-online.org/2025/11/the-scip-optimization-suite-10-0/">SCIP 10.0 release report</a>.</p>

<h2 id="faster-and-more-reliable-core-solver">Faster and More Reliable Core Solver</h2>

<h3 id="performance-what-changes-in-practice">Performance: What Changes in Practice?</h3>

<p>The team benchmarked SCIP 10.0 against SCIP 9.0 and 9.2.4 on large MILP/MINLP test sets (MIPLIB, COR@L, MINLPLib).</p>

<ul>
  <li>On MILPs, SCIP 10.0 is <strong>about 4% faster</strong> overall than 9.2.4, with <strong>up to ~10% speed-ups</strong> on harder instances (≥ 100–1000 seconds).</li>
  <li>On MINLPs, the gains are larger: <strong>≈9% faster on average</strong> and <strong>20%+ speed-ups</strong> on the hardest problems, while also solving more instances within the time limit.</li>
</ul>

<p>This may sound incremental, but for real workloads (e.g., nightly planning runs, large-scale research experiments) a 5–20% speed-up with improved robustness delivers tangible, hassle-free gains simply by upgrading.</p>

<h3 id="numerically-exact-solving-mode-for-milps">Numerically Exact Solving Mode for MILPs</h3>

<p>The headline feature of SCIP 10.0 is a <strong>numerically exact solving mode</strong> for rational MILPs:</p>

<ul>
  <li>MILP data (MPS/LP/CIP/OPB/ZIMPL) can be read <strong>in exact rational arithmetic</strong>.</li>
  <li>SCIP maintains floating-point and rational views and uses a hybrid strategy:
    <ul>
      <li>Safe dual bounds and cut generation via directed rounding.</li>
      <li>Exact LP solves via SoPlex or QSopt_ex only when necessary.</li>
    </ul>
  </li>
  <li>You can log a <strong>VIPR certificate</strong> that captures the full branch-and-bound proof. Its correctness can be verified with a C++ proof checker included in the SCIP Optimization Suite VIPR repository, or with a formally verified checker built on CakeML/HOL4 for maximum rigor.</li>
</ul>

<p>Why this matters:</p>

<ul>
  <li>For <strong>safety-critical</strong> or <strong>audited</strong> applications, you can now <em>prove</em> optimality of MILPs with rational data instead of trusting floating-point tolerances.</li>
  <li>For <strong>research</strong>, this is a playground for exact algorithms, certified MIP technology, and proof logging workflows.</li>
</ul>

<p>The trade-off is performance: exact mode is currently roughly <strong>3–4× slower</strong> than a comparable floating-point configuration and ~<strong>7–10×</strong> slower than the default configuration, depending on the test set. The fundamental benefit is that the exact objective bounds are actually guaranteed.</p>
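<p>The difference between tolerance-based and exact arithmetic is easy to illustrate outside of SCIP. The following minimal Python sketch (standard library only, independent of the solver) shows the kind of floating-point drift that the exact mode’s rational arithmetic rules out by construction:</p>

```python
from fractions import Fraction

# Sum the value 1/10 ten times, once in binary floating point
# and once in exact rational arithmetic.
fp = sum(0.1 for _ in range(10))            # IEEE doubles accumulate error
ex = sum(Fraction(1, 10) for _ in range(10))  # rationals stay exact

print(fp == 1.0)  # False: the float sum is 0.9999999999999999
print(ex == 1)    # True: the rational sum is exactly 1
```

<p>Tolerances paper over exactly this kind of drift in a floating-point solver; in exact mode, bounds and objective values carry no such error in the first place.</p>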

<h2 id="extended-presolving-symmetry-handling-and-cuts">Extended Presolving, Symmetry Handling, and Cuts</h2>

<h3 id="smarter-detection-of-implied-integrality">Smarter Detection of Implied Integrality</h3>

<p>SCIP 10.0 adds a new <strong>Total Unimodularity-based implied integrality detector</strong> that uses network submatrices to infer when some variables can be seen as integers even if they are declared continuous.</p>

<ul>
  <li>On MIPLIB 2017, it detects implied integrality for <strong>~19% of variables</strong> on average, compared to ~3% before.</li>
</ul>

<p>This extra structure can feed into branching, cutting, and propagation, and is particularly relevant for models with network-like structure. It’s not yet enabled by default, but it’s an important building block for future performance gains (and fun for people who like to tinker with advanced settings).</p>

<h3 id="better-symmetry-handling-including-reflections">Better Symmetry Handling (Including Reflections)</h3>

<p>SCIP’s symmetry machinery has taken a major step forward:</p>

<ul>
  <li><strong>Reflection symmetries</strong> (e.g., flipping binary variables 0↔1 or reflecting coordinates) are better detected and exploited.</li>
  <li>Schreier-Sims cuts, orbitopes, double-lex matrices, and small but effective symmetry handling inequalities have all been extended to handle these reflections.</li>
</ul>

<p>For highly structured models (graph coloring, packing, disk packing, many combinatorial designs), this can significantly reduce search by eliminating symmetric parts of the tree.</p>

<h3 id="cut-based-conflict-analysis-and-flower-cuts">Cut-Based Conflict Analysis and Flower Cuts</h3>

<p>Two new pieces in the cut/propagation story:</p>

<ul>
  <li><strong>Cut-based conflict analysis</strong>: instead of analyzing conflicts purely via implication graphs, SCIP 10.0 can operate directly on linear inequalities. In theory this is exponentially stronger than classical SAT-style conflict analysis, which relies on the resolution proof system and CNF encodings.</li>
  <li><strong>Flower inequalities</strong>: a new separator for products of nonnegative variables (e.g., logical ANDs) generates so-called <em>k-flower</em> inequalities over a multilinear hypergraph representation. For efficiency, the implementation focuses on k = 1 or 2 neighboring edges; it is near-neutral when no multilinear structure is present, but can be a noticeable win on affected instances.</li>
</ul>

<p>In practice, both features help SCIP <strong>learn more from infeasibilities and nonlinear structure</strong>, translating into fewer nodes and more solved instances.</p>

<h2 id="heuristics-branching-benders-and-explainability">Heuristics, Branching, Benders, and Explainability</h2>

<p>Several “everyday” parts of the solver got smarter:</p>

<ul>
  <li><strong>New decomposition-aware heuristics</strong>: two primal heuristics (including a kernel search variant) exploit user-provided decompositions to search promising subspaces first. This is particularly useful when you already know the structure of your problem (e.g., time periods, locations, or scenario blocks).</li>
  <li><strong>Improved branching strategies</strong>: reliability pseudocost branching was made safe for exact mode and further tuned for floating-point runs. This mainly shows up as more stable performance on hard instances.</li>
  <li><strong>Enhanced Benders’ decomposition framework</strong>: more flexible ways to define master and subproblems and better automatic detection of linking variables. This lowers the barrier to using Benders’ decomposition in real models.</li>
  <li><strong>Infeasibility explanations (IIS)</strong>: a new tool can compute irreducible infeasible subsystems, giving more interpretable explanations of why a model is infeasible. This is useful for model debugging and communicating with non-optimization stakeholders.</li>
  <li><strong>CONOPT interface</strong>: you can now use CONOPT as the NLP solver. It’s not uniformly faster than Ipopt, but on the hardest MINLPs it yields substantial reductions in runtime and node count, and can solve some instances that stall with Ipopt.</li>
</ul>

<p>Taken together, these features make SCIP 10.0 not just faster, but also <strong>better at telling you what’s going on when things fail</strong>.</p>

<h2 id="ecosystem-updates-gcg-papilo-scip-sdp-soplex-ug-zimpl">Ecosystem Updates: GCG, PaPILO, SCIP-SDP, SoPlex, UG, Zimpl</h2>

<h3 id="gcg-40-decomposition-solver">GCG 4.0: Decomposition Solver</h3>

<p>GCG’s new release focuses on usability and performance:</p>

<ul>
  <li>Harmonized <strong>Apache 2.0</strong> licensing with SCIP.</li>
  <li>New <strong>GCG object</strong> to simplify the C API and unify access to original/master models.</li>
  <li>Easier integration of external pricing solvers (HiGHS, Cliquer-based specialized pricing).</li>
  <li><strong>Parallel pricing</strong> enabled by default (with a parameter to control thread count).</li>
  <li>New <strong>IPColGen</strong> primal matheuristic for set covering/packing/partitioning master problems.</li>
  <li>Decomposition scores refactored into plugins for easier experimentation.</li>
</ul>

<p>In short: if you’re doing branch-cut-and-price/Dantzig-Wolfe, GCG 4.0 is a more pleasant and capable environment.</p>

<h3 id="papilo-30-faster-presolving-with-less-memory">PaPILO 3.0: Faster Presolving with Less Memory</h3>

<p>PaPILO 3.0 brings:</p>

<ul>
  <li>Significant improvements in performance and memory usage for the <strong>dominated columns</strong> presolver.</li>
  <li>A new <strong>parallel clique-merging</strong> presolver that extends and cleans up clique structures more efficiently.</li>
</ul>

<p>These low-level improvements are easy to overlook, but they directly affect large instances where presolving used to dominate runtime or exceed memory.</p>

<h3 id="scip-sdp-soplex-80-ug-and-zimpl">SCIP-SDP, SoPlex 8.0, UG, and Zimpl</h3>

<ul>
  <li><strong>SCIP-SDP 4.4.0</strong> updates to SCIP 10.0 and adds the ability to export original and transformed MISDPs in <strong>CBF</strong> format for use with the Conic Benchmark Library.</li>
  <li><strong>SoPlex 8.0</strong> is a major version bump reflecting build system and API changes; it remains the default LP solver and is central for exact LP solving in the new exact mode.</li>
  <li><strong>UG framework</strong> now includes an application for parallel Pseudo-Boolean solving with FiberSCIP, featuring 2024 PB competition–tuned settings and optional DIMACS-style log output.</li>
  <li><strong>Zimpl 3.7.0</strong> adds support for permutations (<code class="language-plaintext highlighter-rouge">permutate(A)</code>) and better handling of implied integral variables, which are now recognized by SCIP.</li>
</ul>

<h2 id="interfaces-developer-tooling-and-pbsolver">Interfaces, Developer Tooling, and PBSolver</h2>

<h3 id="interfaces-pyscipopt-rust-and-more">Interfaces: PySCIPOpt, Rust, and More</h3>

<p>Interface highlights:</p>

<ul>
  <li><strong>PySCIPOpt</strong> now supports <strong>matrix variables</strong> based on <code class="language-plaintext highlighter-rouge">numpy.ndarray</code>, making it much more natural to write matrix-centric models (think: SDP-like structures, network flows, control). The documentation and tutorials have also been significantly improved, including exercises and “recipes” for common modeling tasks.</li>
  <li>The <strong>Rust interface (russcip)</strong> has a more ergonomic and type-safe API with builder-style variable/constraint creation and safe access to separators and constraint handlers.</li>
  <li>Other interfaces (Matlab, AMPL, SCIPpp C++, PySoPlex, PyGCGOpt, PaPILO’s Julia interface) were updated and documented.</li>
</ul>

<p>From a user perspective, this makes it easy to access the full solver functionality in your favorite language.</p>

<h3 id="mip-dd-20-delta-debugging-for-mip-solvers">MIP-DD 2.0: Delta Debugging for MIP Solvers</h3>

<p><strong>MIP-DD</strong> is the first open-source, solver-independent delta debugger for MIP solvers.</p>

<ul>
  <li>It automatically shrinks a failing instance while preserving the bug, often down to models with just a handful of variables and constraints.</li>
  <li>Version 2.0 adapts modification batch sizes automatically, limits solving effort intelligently, and supports both real and exact solving modes of SCIP and SoPlex.</li>
</ul>

<p>If you develop solvers or serious extension plugins, MIP-DD is a powerful way to turn pathological “customer instances” into minimal test cases, while also removing sensitive information.</p>

<h3 id="pbsolver-dedicated-pseudo-boolean-application">PBSolver: Dedicated Pseudo-Boolean Application</h3>

<p>SCIP 10.0 introduces <strong>PBSolver</strong>, a SCIP-based application tailored to Pseudo-Boolean optimization and the Pseudo-Boolean Competition format.</p>

<ul>
  <li>Emits competition-compliant DIMACS-style logs and solution lines.</li>
  <li>Handles OPB/WBO instances out of the box, with parameters to control input limits.</li>
  <li>SCIP/FiberSCIP-based solvers using these features won several categories of the 2024 PB competition.</li>
</ul>

<p>If you work with Pseudo-Boolean benchmarks or SAT+PB hybrids, PBSolver gives you a supported, competition-ready entry point.</p>

<h2 id="availability-and-getting-started">Availability and Getting Started</h2>

<h3 id="core-distribution">Core Distribution</h3>

<p>The SCIP Optimization Suite 10.0 (SCIP, SoPlex, PaPILO, GCG, Zimpl, UG, SCIP-SDP, interfaces, and applications) is available as usual via the project website and GitHub repositories. Licensing is now more unified:</p>

<ul>
  <li><strong>SCIP 10.0, SoPlex 8.0, PaPILO 3.0, GCG 4.0</strong> under <strong>Apache 2.0</strong>.</li>
  <li><strong>Zimpl 3.7.0 and UG 1.0</strong> under <strong>LGPL</strong>.</li>
</ul>

<p>You can download the latest release from the <a href="https://www.scipopt.org/">SCIP Optimization Suite website</a> or from our <a href="https://github.com/scipopt">GitHub repositories</a>.</p>

<h3 id="new-docker-images-web-service-and-jupyter-lab">New Docker Images: Web Service and Jupyter Lab</h3>

<p>To make the suite easier to adopt in teaching, prototyping, and production, we provide two official Docker images (described in detail on the dedicated Docker page):</p>

<ul>
  <li>
    <p><strong><code class="language-plaintext highlighter-rouge">scip-webservice</code></strong>: a FastAPI-based web service for solving problem instances in common formats (e.g., .lp, .mps, .cip, .cnf, .fzn, .nl, .opb, .osil, .pip, .wbo, .zpl) via HTTP.</p>

    <ul>
      <li>Single-command startup, with endpoints for status, uploads, and auto-generated API docs.</li>
      <li>Environment variables to control concurrency, retention, and upload limits; Docker flags for CPU/memory caps.</li>
      <li>Designed so you can go from “no SCIP installed” to a working REST API in a minute.</li>
    </ul>
  </li>
  <li>
    <p><strong><code class="language-plaintext highlighter-rouge">scip-jupyterlab</code></strong>: a pre-configured Jupyter Lab environment with</p>

    <ul>
      <li>SCIP 10.0, PySCIPOpt 6.0, and Python 3.11,</li>
      <li>plus a standard data-science stack (NumPy, Pandas, Matplotlib, scikit-learn, etc.).</li>
      <li>Mount a local directory into <code class="language-plaintext highlighter-rouge">/app</code> and you have a ready-to-use notebook environment for teaching, demoing, or exploratory modeling.</li>
    </ul>
  </li>
</ul>

<p>Think of these as the modern replacement for the older “Dockerized SCIP for Teaching” setup: less friction, more batteries included.</p>

<p>(For details, examples, and recommended resource settings, see the <a href="/blog/pages/scip/scip-teaching-webservice.html">SCIP Optimization Suite 10 Docker page</a>)</p>

<h3 id="who-should-upgrade">Who Should Upgrade?</h3>

<ul>
  <li><strong>Practitioners</strong> get a faster, more reliable solver with better decomposition and PB capabilities, plus easier deployment via Docker.</li>
  <li><strong>Researchers</strong> get exact solving, proof logging, extended symmetry and presolving, and richer interfaces.</li>
  <li><strong>Teachers</strong> can run everything from a browser via the Jupyter Lab image, avoiding per-student installation pain.</li>
</ul>

<p>If you experiment with the new exact mode, MIP-DD, or the Docker images and find interesting use cases (or rough edges), feedback is very welcome on <a href="https://github.com/scipopt">GitHub</a> or <a href="gitlabgit+integer-scipoptsuite-support-3311-issue-@zib.de">via email</a>; this release is meant as both a robust workhorse and a platform for the next wave of MIP/MINLP research.</p>]]></content><author><name>Dominik Kamp, Gioni Mexi, Sebastian Pokutta</name></author><category term="research" /><category term="optimization" /><category term="scip" /><category term="mip" /><category term="minlp" /><summary type="html"><![CDATA[TL;DR: SCIP Optimization Suite 10.0 brings a numerically exact solving mode for rational MILPs, noticeable performance gains for MILP/MINLP, stronger presolving and symmetry handling, better heuristics and conflict analysis, IIS detection, and major updates to GCG, PaPILO, PySCIPOpt, and MIP-DD.]]></summary></entry><entry><title type="html">2025 Nobel Prize in Economics: Innovation, Creative Destruction, and Sustainable Growth — and What It Means for Germany</title><link href="http://www.pokutta.com/blog/random/2025/10/14/economics-nobel-and-germany.html" rel="alternate" type="text/html" title="2025 Nobel Prize in Economics: Innovation, Creative Destruction, and Sustainable Growth — and What It Means for Germany" /><published>2025-10-14T01:00:00+02:00</published><updated>2025-10-14T01:00:00+02:00</updated><id>http://www.pokutta.com/blog/random/2025/10/14/economics-nobel-and-germany</id><content type="html" xml:base="http://www.pokutta.com/blog/random/2025/10/14/economics-nobel-and-germany.html"><![CDATA[<p><em>TL;DR: The 2025 Nobel Prize in Economics honors Joel Mokyr, Philippe Aghion, and Peter Howitt for explaining how innovation drives sustained growth. Mokyr identifies the historical preconditions that allow innovation to accumulate; Aghion and Howitt formalize how creative destruction underpins modern growth. 
Together, their work clarifies why growth cannot be taken for granted—and what kinds of policies Germany now needs to secure its economic future.</em></p>

<!--more-->

<style>
  .callout {
    border-left: 4px solid #2563eb;
    border-right: 4px solid #2563eb;
    background: #f5f7ff;
    padding: 12px 16px;
    margin: 1em 0;
  }
</style>

<h2 id="introduction">Introduction</h2>

<p>The 2025 Nobel Prize in Economic Sciences was awarded jointly to <strong>Joel Mokyr</strong>, <strong>Philippe Aghion</strong>, and <strong>Peter Howitt</strong> for their fundamental work on innovation-driven growth.</p>

<p>Mokyr’s historical analysis explains why sustained growth emerged only under specific institutional and cultural conditions, while Aghion and Howitt’s theoretical model shows how innovation endogenously fuels growth through the process of creative destruction.</p>

<p>Together, their work provides a unified narrative: innovation is the engine of prosperity, but its power depends critically on the societal and institutional environment that allows new ideas to flourish and old ones to fade.</p>

<h3 id="recent-news-that-complement-this-narrative">Recent News Complementing This Narrative</h3>

<p>A few recent items that appeared after this post was written but are relevant to the discussion:</p>

<ul>
  <li>The Economist: <a href="https://www.economist.com/europe/2025/10/02/how-europe-crushes-innovation">How Europe crushes innovation</a></li>
  <li>A European attempt to catch up: <a href="https://eurollm.io/">EuroLLM</a></li>
</ul>

<h2 id="mokyr-the-preconditions-for-sustained-growth">Mokyr: The Preconditions for Sustained Growth</h2>

<p>Joel Mokyr’s research asks a deceptively simple question: <em>Why did the Industrial Revolution happen when and where it did?</em> His answer lies not in the invention of particular machines, but in the emergence of an environment that made continuous innovation possible.</p>

<p>For much of human history, inventions appeared sporadically but failed to trigger self-sustaining growth. Mokyr identifies three essential conditions that finally broke that pattern.</p>

<p><strong>1. Science and Technology Co-Evolving</strong>. Growth accelerated when scientific understanding and practical engineering began to reinforce one another. The interplay between theory and application—seen in mechanics, thermodynamics, and chemistry—created a virtuous cycle of cumulative improvement.</p>

<p><strong>2. Engineering Capability and Mechanic Competence</strong>. Scientific knowledge alone was not enough. Societies also needed the ability to implement ideas at scale: skilled artisans, standardized components, effective apprenticeship systems, and eventually formal engineering education. This infrastructure enabled the transformation of insight into industrial capability.</p>

<p><strong>3. Openness to Disruption</strong>. Perhaps the most fragile condition was social and institutional openness to change. Continuous innovation means continuous displacement—of technologies, firms, and sometimes entire industries. Where vested interests could block such shifts, growth remained episodic.</p>

<h4 id="implications">Implications</h4>

<p>Mokyr’s message is timeless: <strong>innovation without diffusion, engineering capacity, or openness cannot sustain prosperity</strong>. Even advanced economies risk stagnation if they lose these enablers—whether through regulatory rigidity, skill erosion, or entrenched incumbency.</p>

<h2 id="aghion-howitt-creative-destruction">Aghion-Howitt: Creative Destruction</h2>

<p>Aghion and Howitt revolutionized growth theory by bringing innovation <em>inside</em> the model. Instead of treating technological progress as an external force, they described it as a product of <strong>firms’ strategic choices</strong>—to invest, to compete, and to risk being replaced.</p>

<p>Their framework captures the Schumpeterian idea of <em>creative destruction</em>: progress occurs because new technologies displace old ones. This destruction is not a side effect: it is the very mechanism that keeps growth alive.</p>

<h4 id="the-core-mechanisms">The Core Mechanisms</h4>

<p><strong>1. Endogenous Innovation:</strong> Firms innovate when they expect profits from doing so. R&amp;D intensity depends on incentives, policy, and market structure.<br />
<strong>2. Competition and the Inverted-U:</strong> Innovation is most vibrant under moderate competition—too little protects monopolists; too much erodes potential gains.<br />
<strong>3. Entry, Exit, and Reallocation:</strong> Dynamic economies rely on turnover. New entrants challenge incumbents, and resources shift toward more productive uses.</p>

<p>Through this lens, growth becomes a process of continuous experimentation, selection, and renewal.</p>

<h4 id="policy-lessons">Policy Lessons</h4>

<p>Aghion and Howitt’s theory highlights the interdependence of <strong>R&amp;D policy</strong>, <strong>competition policy</strong>, and <strong>education/labor systems</strong>. Each shapes the incentives that determine whether an economy encourages the next generation of innovators or merely protects the last.</p>

<h2 id="innovation-growth-and-sustainability">Innovation, Growth, and Sustainability</h2>

<p>Innovation is the only long-run source of growth. But innovation alone does not guarantee it. The engine must run within a system that supports both discovery and renewal. In Mokyr’s terms, science, technology, and social openness must align. In Aghion–Howitt’s framework, competition and institutional design must sustain incentives for continuous creative destruction. Sustainable prosperity, then, is less about preserving existing strengths than about <strong>preserving the capacity for renewal</strong>—the ability to replace outdated ideas, technologies, and firms with better ones.</p>

<h2 id="what-this-says-about-innovation-in-germany">What this says about innovation in Germany</h2>

<p>Even before the 2025 Nobel announcement, leading growth economists such as Philippe Aghion had warned that Europe, and Germany in particular, no longer fully satisfies the institutional and cultural conditions for innovation-driven growth. Aghion, Dewatripont, and Tirole argue that since the 1990s, Europe has failed to establish the environment required for disruptive innovation: a fragmented single market, underdeveloped venture-finance and risk-capital systems, and a policy culture favoring stability over experimentation. As a result, Europe risks becoming trapped in what they call a “middle-technology equilibrium”, strong in incremental improvements but weak in frontier innovation and technological leadership. These critiques directly echo the Nobel laureates’ insights: innovation must operate within an institutional system that tolerates creative destruction and rewards renewal rather than preservation.</p>

<div class="callout">
  <strong>The German Situation in a Nutshell: Innovation, Renewal, and Sustainable Growth</strong><br /><br />
  Germany’s economic strength has long rested on engineering excellence, industrial depth, and institutional reliability. These pillars have produced remarkable prosperity through incremental innovation and quality leadership. Yet, as the recent Nobel Prize in Economics underscores, sustained growth in the twenty-first century depends increasingly on <em>creative destruction</em> — the continuous renewal of technologies, firms, and ideas through innovation. <br /> <br />

  To maintain global competitiveness and resilience, Germany must complement its traditional stability-oriented model with greater openness to disruption, faster diffusion of frontier knowledge, and stronger incentives for entrepreneurial risk-taking. This is not only an economic imperative but a demographic one: with an aging population and a steadily declining workforce, productivity growth is essential to counteract demographic headwinds. Since the end of the 2000s, Germany’s GDP per capita has remained largely flat, signaling that without renewed innovation dynamics, overall prosperity will not only stagnate but sharply decline. <br /> <br />

  As such, more agile regulation, a dynamic venture landscape, and a culture that sees change not as a threat but as the very source of sustainable prosperity are needed. Strengthening the interface between research institutions and the broader innovation ecosystem can play a decisive role in this transition.
</div>

<div style="text-align:center; margin-bottom: 20px;">
    <img src="http://www.pokutta.com/blog/assets/innovationG/NGDP-capita.png" alt="Commitment" style="width:99%;" />
    <p style="font-size: small; font-style: italic;">Figure 1: Nominal GDP per capita (via <a href="https://datacommons.org/tools/visualization#visType%3Dtimeline%26place%3Dcountry%2FDEU___country%2FUSA___country%2FSGP___country%2FJPN___country%2FFRA___country%2FESP%26placeType%3DAdministrativeArea1%26sv%3D%7B%22dcid%22%3A%22Amount_EconomicActivity_GrossDomesticProduction_Nominal%22%2C%22pc%22%3A%221%22%7D" target="_blank" rel="noopener">datacommons.org</a>).</p>
</div>

<h3 id="the-historical-layer-from-stability-to-dynamism">The Historical Layer: From Stability to Dynamism</h3>

<p>Postwar Germany built its prosperity on <strong>incremental innovation and stability</strong>. The country perfected the art of refining existing technologies rather than replacing them. This model—anchored in engineering excellence, vocational training, and social partnership—created high productivity and resilience but limited tolerance for disruption.</p>

<p>In Mokyr’s terms, Germany excels in <em>mechanical competence</em> but underinvests in <em>openness to disruptive change</em>. Institutions, regulations, and social norms that once guaranteed continuity now risk locking the system into the past.</p>

<blockquote>
  <p><strong>Key Tension:</strong> Germany’s model maximized stability and diffusion, not frontier innovation. The very strengths that drove its success are now slowing innovation and adaptation.</p>
</blockquote>

<h3 id="the-structural-layer-inertia-vs-innovation-dynamics">The Structural Layer: Inertia vs. Innovation Dynamics</h3>

<p>Aghion &amp; Howitt’s growth framework emphasizes continuous entry of new innovators and exit of outdated firms. Germany’s industrial ecosystem, however, tends to favor incumbents and gradualism.</p>

<table>
  <thead>
    <tr>
      <th><strong>Mechanism (Aghion–Howitt)</strong></th>
      <th><strong>Tendency in Germany</strong></th>
      <th><strong>Implication</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Entry of new, innovative firms</td>
      <td>Low startup rate, complex regulation, conservative finance</td>
      <td>Slower technological frontier shift</td>
    </tr>
    <tr>
      <td>Exit of outdated firms</td>
      <td>Social and political resistance (e.g. subsidies, bailouts)</td>
      <td>Delayed reallocation of resources</td>
    </tr>
    <tr>
      <td>Competition intensity</td>
      <td>Often moderate-to-low in core sectors (automotive, energy, finance)</td>
      <td>Weaker innovation pressure</td>
    </tr>
    <tr>
      <td>R&amp;D structure</td>
      <td>Strong in applied engineering, weaker in digital and AI</td>
      <td>Imbalance in innovation types</td>
    </tr>
    <tr>
      <td>Labor mobility</td>
      <td>Low due to firm loyalty and vocational specialization</td>
      <td>Slow diffusion of new ideas</td>
    </tr>
  </tbody>
</table>

<p>This structure is <strong>highly optimized for incremental improvement</strong>: excellent at perfecting combustion engines or machine tools, but less agile when the frontier shifts toward AI, software, and renewable systems.</p>

<blockquote>
  <p><strong>In short:</strong> Germany is a world leader in innovation <em>within</em> existing paradigms, but struggles to innovate <em>across</em> them.</p>
</blockquote>

<h3 id="the-policy-layer-institutions-and-incentives">The Policy Layer: Institutions and Incentives</h3>

<p>The policy environment reflects the same equilibrium: one optimized for stability and incremental improvement rather than experimentation and renewal.</p>

<ul>
  <li><strong>Finance:</strong> Germany’s bank-centered system channels capital primarily to established firms with tangible assets and proven track records. Venture and growth finance remain limited in volume and depth compared to the U.S. or China, constraining the scale-up of young innovative firms. Public financing instruments, though substantial, often emphasize risk minimization and compliance over agility and experimentation.</li>
  <li><strong>Regulation:</strong> Complex approval procedures, fragmented jurisdictions, and slow digital administration increase transaction costs and delay the deployment of new technologies. Even pilot projects or regulatory sandboxes face multi-year approval timelines, reducing the incentive for entrepreneurial experimentation.</li>
  <li><strong>Public R&amp;D:</strong> Institutional research networks such as Fraunhofer, Helmholtz, and Max Planck as well as excellent universities deliver outstanding science and applied engineering, but translation into scalable high-growth enterprises remains the weak link. Structural barriers between academia and entrepreneurship, ranging from intellectual-property rules to career incentives, limit the spillover of frontier research into the private sector.</li>
  <li><strong>Competition Policy:</strong> Designed to safeguard fairness and prevent monopolies, it sometimes has the side effect of preserving incumbents. High regulatory thresholds and sectoral protection dilute entry pressure and reduce the dynamism of domestic markets, particularly in energy, finance, and telecommunications.</li>
</ul>

<p>These features supported stability, quality, and long-term employment for decades. Yet today they <strong>constrain</strong> <strong>dynamism</strong>, <strong>speed</strong>, and <strong>adaptability</strong>—precisely the qualities that define success in innovation-driven economies. In Aghion’s terms, the current policy mix <strong>underweights entry dynamism</strong>: the continual churning of firms that fuels creative destruction and renewal. Sustained growth, however, requires <strong>institutional agility</strong>: policies that reward experimentation, accept short-term disruption, and tolerate failure as part of the innovation process.</p>

<h3 id="why-it-becomes-a-problem">Why It Becomes a Problem</h3>

<p>The laureates’ work helps explain why Germany’s model now faces mounting structural headwinds. What once guaranteed stability and prosperity now risks impeding adaptation to a faster, more fluid global innovation landscape.</p>

<ul>
  <li><strong>Acceleration of technological cycles:</strong> Innovation now unfolds at software speed, while Germany’s institutions and industrial processes still move at mechanical speed. The product cycles of AI, digital platforms, and biotech evolve in months, not decades, demanding a pace of response the traditional system struggles to match.</li>
  <li><strong>Shift from physical to digital:</strong> Value creation increasingly depends on algorithms, data, and networks rather than production volume or material precision. Germany’s comparative advantage in manufacturing excellence thus erodes unless accompanied by digital capability and data-driven innovation.</li>
  <li><strong>Global competition for innovation:</strong> Ecosystems in the United States, China, and South Korea combine capital depth, entrepreneurial risk appetite, and scale in ways Europe and Germany have not replicated. These ecosystems attract top global talent and absorb frontier innovations faster, creating a widening gap in technological leadership.</li>
  <li><strong>Over-embedded incumbency:</strong> Political, social, and financial systems continue to shield existing industries—automotive, energy, finance—from disruption. This “incumbent bias” prevents reallocation of capital and talent to emerging sectors and delays renewal.</li>
  <li><strong>Hesitance to adopt external innovation:</strong> Beyond domestic inertia, Germany also tends to underutilize external technological breakthroughs. Imported digital and AI tools are often viewed through a compliance or risk lens rather than as catalysts for reinvention. This reluctance to integrate global frontier technologies into local production chains further widens the innovation gap.</li>
</ul>

<p>The result is a growing mismatch between the <strong>velocity of global technological change</strong> and the <strong>institutional response capacity</strong> of Germany’s economic model. Without structural renewal, the innovation–growth link weakens, turning <em>creative destruction</em> into mere <em>creative delay</em>.</p>

<p>In Aghion’s terms, Germany risks falling into a <strong>middle-technology equilibrium</strong>: a state of high competence in established industries coexisting with stagnation at the technological frontier. In the face of demographic decline and a shrinking labor force, productivity growth through innovation is no longer optional; it is the only path to sustaining prosperity.</p>

<p>Institutional agility, policy experimentation, and a proactive innovation agenda are therefore not economic luxuries but existential necessities. Or, to put it in the famous words attributed to W. Edwards Deming:</p>

<blockquote>
  <p>“It is not necessary to change. Survival is not mandatory.” — <a href="https://en.wikipedia.org/wiki/W._Edwards_Deming">W. Edwards Deming</a></p>
</blockquote>

<h3 id="what-compatibility-would-require">What Compatibility Would Require</h3>

<p>To realign with an innovation-driven growth path, Germany must evolve from a <strong>stability model</strong> to a <strong>renewal model</strong>: one that sustains its strengths while restoring openness, experimentation, and competition.</p>

<ol>
  <li><strong>Cultural Shift:</strong> Move from risk avoidance to curiosity and experimentation.</li>
  <li><strong>Dynamic Competition:</strong> Strengthen antitrust tools, encourage market entry, and curb excessive concentration.</li>
  <li><strong>Faster Diffusion:</strong> Create translational institutions between academia and startups; reduce administrative friction.</li>
  <li><strong>Deep Capital Markets:</strong> Expand venture and scale-up finance; modernize taxation of employee equity.</li>
  <li><strong>Talent and Skills Renewal:</strong> Extend the dual education model to digital, AI, and systems engineering; simplify international recruitment.</li>
  <li><strong>State as Catalyst:</strong> Use public procurement and infrastructure investment to create domestic lead markets for innovative solutions.</li>
  <li><strong>Institutional Agility:</strong> Embed experimentation into policy, via sandboxes, adaptive regulation, and outcome-based funding.</li>
</ol>

<blockquote>
  <p>Sustainable prosperity is no longer about preserving existing strengths, but about <strong>preserving the capacity for transformation</strong>.</p>
</blockquote>

<h2 id="afterword">Afterword</h2>

<p>Several years ago, in February 2020, I was flying back to Germany from Atlanta. My flight was delayed, and it became clear that the pilot would not be able to make up the time. So I logged into the onboard wifi, opened my (then still) Twitter app, and DMed the @Delta Twitter account as I had done numerous times before. Not 10 minutes later I had my connecting flight rebooked, even though the connection was on a non-SkyTeam carrier. No hassle, no waiting in line, no waiting in line again. It. Just. Worked.</p>

<div style="text-align:center; margin-bottom: 20px;">
    <img src="http://www.pokutta.com/blog/assets/innovationG/deltaInteraction.gif" alt="Delta rebooking interaction" style="width:99%;" />
    <p style="font-size: small; font-style: italic;">Figure 2: Innovation in action: Delta flight rebooking midair in 8 minutes.</p>
</div>

<h2 id="references">References</h2>

<ul>
  <li>Nobel Prize in Economic Sciences 2025 — Press release <a href="https://www.nobelprize.org/prizes/economic-sciences/2025/press-release/">https://www.nobelprize.org/prizes/economic-sciences/2025/press-release/</a></li>
  <li>Nobel Prize in Economic Sciences 2025 — Popular information <a href="https://www.nobelprize.org/prizes/economic-sciences/2025/popular-information/">https://www.nobelprize.org/prizes/economic-sciences/2025/popular-information/</a></li>
  <li>Reuters — “Trio win Nobel economics prize for work on innovation, growth and creative destruction” <a href="https://www.reuters.com/world/mokyr-aghion-howitt-win-2025-nobel-economics-prize-2025-10-13/">https://www.reuters.com/world/mokyr-aghion-howitt-win-2025-nobel-economics-prize-2025-10-13/</a></li>
  <li>The Guardian — “Nobel economics prize: technology-driven growth (Mokyr, Aghion, Howitt)” <a href="https://www.theguardian.com/business/2025/oct/13/nobel-economics-prize-technology-joel-mokyr-philippe-aghion-peter-howitt">https://www.theguardian.com/business/2025/oct/13/nobel-economics-prize-technology-joel-mokyr-philippe-aghion-peter-howitt</a></li>
  <li>Le Monde (English) — Philippe Aghion: “The key factor of economic power is technological leadership” <a href="https://www.lemonde.fr/en/international/article/2025/10/13/2025-nobel-winner-philippe-aghion-the-key-factor-of-economic-power-is-technological-leadership_6746390_4.html">https://www.lemonde.fr/en/international/article/2025/10/13/2025-nobel-winner-philippe-aghion-the-key-factor-of-economic-power-is-technological-leadership_6746390_4.html</a></li>
  <li>Philippe Aghion, Mathias Dewatripont &amp; Jean Tirole (2024). <em>“Can Europe Create an Innovation Economy?”</em> Project Syndicate, October 2024. <a href="https://www.project-syndicate.org/commentary/europe-falling-behind-us-innovation-technology-what-to-do-about-it-by-philippe-aghion-et-al-2024-10">https://www.project-syndicate.org/commentary/europe-falling-behind-us-innovation-technology-what-to-do-about-it-by-philippe-aghion-et-al-2024-10</a></li>
  <li>VoxEU/CEPR (2024). <em>“Reforming Innovation Policy to Help the EU Escape the Middle-Technology Trap.”</em> April 2024. <a href="https://cepr.org/voxeu/columns/reforming-innovation-policy-help-eu-escape-middle-technology-trap">https://cepr.org/voxeu/columns/reforming-innovation-policy-help-eu-escape-middle-technology-trap</a></li>
  <li>Financial Times (2025). <em>“Economics Nobel Prize Awarded for Explaining Innovation-Driven Growth.”</em> October 2025. <a href="https://www.ft.com/content/9b845160-f44b-4865-94f7-1bf21e25596e">https://www.ft.com/content/9b845160-f44b-4865-94f7-1bf21e25596e</a></li>
</ul>]]></content><author><name>Sebastian Pokutta</name></author><category term="random" /><category term="economics" /><category term="innovation" /><category term="growth" /><summary type="html"><![CDATA[TL;DR: The 2025 Nobel Prize in Economics honors Joel Mokyr, Philippe Aghion, and Peter Howitt for explaining how innovation drives sustained growth. Mokyr identifies the historical preconditions that allow innovation to accumulate; Aghion and Howitt formalize how creative destruction underpins modern growth. Together, their work clarifies why growth cannot be taken for granted—and what kinds of policies Germany now needs to secure its economic future.]]></summary></entry><entry><title type="html">Committing to Secrets via Hashing</title><link href="http://www.pokutta.com/blog/hashed-commitments/" rel="alternate" type="text/html" title="Committing to Secrets via Hashing" /><published>2025-09-28T01:00:00+02:00</published><updated>2025-09-28T01:00:00+02:00</updated><id>http://www.pokutta.com/blog/hashed-commitments</id><content type="html" xml:base="http://www.pokutta.com/blog/hashed-commitments/"><![CDATA[<p><em>TL;DR: Cryptography is often thought of in terms of encryption (hiding messages) or signatures  (proving authenticity). But there’s another fundamental building block that underpins many protocols, from digital lotteries to zero-knowledge proofs: the commitment. A commitment is a way to promise something now without revealing it, yet in such a way that you cannot change your mind later. It can be efficiently implemented via hashing.</em></p>

<!--more-->

<p>It is the digital equivalent of sealing a message in an envelope and handing it to someone: they cannot see the contents until you open it, but once it is sealed, you cannot replace what is inside.</p>

<h2 id="the-problem-promises-without-revealing">The Problem: Promises Without Revealing</h2>

<p>There are many situations where the “sealing away” of a message might be useful. For example, imagine these situations:</p>

<ol>
  <li>You want to prove later that you already knew the solution to a problem, but you do not want to reveal it today.</li>
  <li>You want to place a bet on the outcome of an event but do not want your opponent to know which side you picked in advance.</li>
  <li>You are running a digital lottery and need to show that the winning number was (random but) fixed before the draw and not manipulated after the draw.</li>
  <li>Or, more nerdy: you want to convince someone you found a valid graph coloring without revealing the coloring itself (a classic zero-knowledge proof setting).</li>
</ol>

<p>All of these require a way to commit to a secret so that two key properties are satisfied:</p>

<ul>
  <li><em>Hiding.</em> Nobody learns the secret until you reveal it.</li>
  <li><em>Binding.</em> Once you have committed, you cannot change your mind.</li>
</ul>

<p>That is precisely what <em>commitment schemes</em> do.</p>

<div style="text-align:center; margin-bottom: 20px;">
    <img src="http://www.pokutta.com/blog/assets/hashed_commitments/hashed_message.png" alt="Commitment" style="width:85%;" />
    <p style="font-size: small; font-style: italic;">Figure 1: A completely useless AI generated image of a hashed commitment.</p>
</div>

<h2 id="a-quick-primer-on-hash-functions">A Quick Primer on Hash Functions</h2>

<p>The simplest commitment schemes rely on cryptographic hash functions such as SHA-256. A hash function <code class="language-plaintext highlighter-rouge">H</code> takes an input of arbitrary length and produces a fixed-size digest. Good cryptographic hashes satisfy several key properties:</p>

<ul>
  <li><em>Deterministic.</em> The same input always produces the same output.</li>
  <li><em>Preimage resistant.</em> Given a hash <code class="language-plaintext highlighter-rouge">h</code>, it is infeasible to find any input <code class="language-plaintext highlighter-rouge">m</code> such that <code class="language-plaintext highlighter-rouge">H(m)=h</code>.</li>
  <li><em>Second preimage resistant.</em> Given an input <code class="language-plaintext highlighter-rouge">m</code>, it is infeasible to find a different <code class="language-plaintext highlighter-rouge">m'</code> with the same hash.</li>
  <li><em>Collision resistant.</em> It is infeasible to find any two distinct inputs <code class="language-plaintext highlighter-rouge">m_1 != m_2</code> with <code class="language-plaintext highlighter-rouge">H(m_1)=H(m_2)</code>.</li>
</ul>

<p>Hashes are one-way functions: you can go from message to hash easily, but not back. This makes them perfect for commitments. Here “infeasible” is meant in the sense of computational complexity theory, i.e., it would take an enormous amount of computational resources and time. While hash functions are designed to be collision-resistant, they necessarily do have collisions by the pigeonhole principle, simply because they digest arbitrary-length messages into fixed-size digests. In particular, it is impossible (in a strict sense) to uniquely invert a hash function because the preimage is not unique: we may find some preimage (if we are extremely lucky), but there is no guarantee that it is the one that was used to compute the hash.</p>
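<p>To make these properties concrete, here is a minimal Python sketch (using only the standard library’s <code class="language-plaintext highlighter-rouge">hashlib</code>) illustrating determinism, the fixed digest size, and how drastically the digest changes under a tiny input change:</p>

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hex string."""
    return hashlib.sha256(data).hexdigest()

# Deterministic: the same input always yields the same digest.
assert sha256_hex(b"hello") == sha256_hex(b"hello")

# Fixed size: 256 bits = 32 bytes = 64 hex characters, regardless of input length.
assert len(sha256_hex(b"")) == 64
assert len(sha256_hex(b"x" * 10_000)) == 64

# A one-character change produces a completely different digest.
assert sha256_hex(b"hello") != sha256_hex(b"hellp")
```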

<h2 id="secret-commitments-the-basic-idea">Secret Commitments: The Basic Idea</h2>

<p>The simplest commitment is:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C = H(message)
</code></pre></div></div>

<p>Later, you reveal <code class="language-plaintext highlighter-rouge">message</code>, and anyone can verify that <code class="language-plaintext highlighter-rouge">H(message) == C</code>.</p>

<p>However, this naive approach leaks some information. If <code class="language-plaintext highlighter-rouge">message</code> comes from a small set (like <code class="language-plaintext highlighter-rouge">"YES"</code> or <code class="language-plaintext highlighter-rouge">"NO"</code>), anyone can brute-force all possibilities and see which one matches <code class="language-plaintext highlighter-rouge">C</code>. That breaks the hiding property.</p>
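<p>This brute-force attack on the naive scheme is easy to demonstrate, assuming the attacker knows the small candidate set:</p>

```python
import hashlib

def commit(message: bytes) -> str:
    """Naive (unsalted) commitment: C = H(message)."""
    return hashlib.sha256(message).hexdigest()

# The committer commits to one of only two possible messages.
C = commit(b"YES")

# An attacker who knows the candidate set recovers the message instantly.
candidates = [b"YES", b"NO"]
recovered = next(m for m in candidates if commit(m) == C)
assert recovered == b"YES"  # the hiding property is broken
```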

<h3 id="hashing-with-salt">Hashing with Salt</h3>

<p>To prevent this, we add randomness. The usual fix is to prepend a random salt to the message before hashing:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C = H(salt || message)
</code></pre></div></div>

<ul>
  <li><code class="language-plaintext highlighter-rouge">salt</code> is a random value (e.g., 128 bits) that makes every commitment (assuming no collision) unique, even if the message is common.</li>
  <li>Without knowing <code class="language-plaintext highlighter-rouge">salt</code>, an attacker cannot precompute hash tables or brute-force easily.</li>
</ul>

<p>Later, you reveal <code class="language-plaintext highlighter-rouge">(salt, message)</code>. Anyone can check that <code class="language-plaintext highlighter-rouge">H(salt || message) == C</code>. For example here is a hash from a <a href="https://www.pokutta.com/hashes/">hash tool</a> that uses SHA-256 with a salt and an additional domain separation string to hash effectively <code class="language-plaintext highlighter-rouge">H(domain || salt || message)</code>; see also the appendix below for a lightweight verification script.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>e708b9946bee40db71df3486d2ea663f69ffb10a9ed8cabcd01b95aac4cc8028
</code></pre></div></div>

<p>with full commitment record:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"commitment"</span><span class="p">:</span><span class="w"> </span><span class="s2">"e708b9946bee40db71df3486d2ea663f69ffb10a9ed8cabcd01b95aac4cc8028"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"salt"</span><span class="p">:</span><span class="w"> </span><span class="s2">"29ea6ad30117dfeb82928cd5b38c4c1a"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"domain"</span><span class="p">:</span><span class="w"> </span><span class="s2">"commit-v1:"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"message"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Tomorrow it is going to rain."</span><span class="p">,</span><span class="w">
  </span><span class="nl">"mode"</span><span class="p">:</span><span class="w"> </span><span class="s2">"hash"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"timestamp"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2025-09-28T17:57:29.790Z"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>The hash alone is worthless but binds the commitment. Once the salt and message are revealed, anyone can check that <code class="language-plaintext highlighter-rouge">H(salt || message) == C</code>.</p>
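<p>The commit/reveal cycle can be sketched in a few lines of Python. The domain string and byte encoding below are illustrative choices and not necessarily what the hash tool above uses internally:</p>

```python
import hashlib
import secrets

DOMAIN = b"commit-v1:"  # domain separation string (illustrative choice)

def commit(message: bytes) -> tuple[str, bytes]:
    """Commit to `message`; returns (commitment, salt). Keep the salt until reveal."""
    salt = secrets.token_bytes(16)  # 128 bits of fresh randomness
    c = hashlib.sha256(DOMAIN + salt + message).hexdigest()
    return c, salt

def verify(c: str, salt: bytes, message: bytes) -> bool:
    """Check that (salt, message) opens the commitment `c`."""
    expected = hashlib.sha256(DOMAIN + salt + message).hexdigest()
    return secrets.compare_digest(c, expected)

c, salt = commit(b"Tomorrow it is going to rain.")
assert verify(c, salt, b"Tomorrow it is going to rain.")   # correct opening passes
assert not verify(c, salt, b"Tomorrow it will be sunny.")  # binding: wrong message fails
```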

<p>However, now we face an (important) design choice:</p>

<ol>
  <li><em>Publish the salt immediately.</em> This guarantees <em>stronger binding</em>: you cannot change the salt later to try to force a collision. At the same time, while attackers cannot use precomputed tables, they can start brute-forcing the message; this is expensive but might succeed if the message is known to come from a small list, e.g., “YES” or “NO”.</li>
  <li><em>Keep the salt secret until reveal.</em> This provides <em>stronger hiding</em>: attackers cannot even start guessing until you reveal the salt. At the same time, you may be accused of cheating by revealing a new salt <code class="language-plaintext highlighter-rouge">s</code> and a new message <code class="language-plaintext highlighter-rouge">messageNew</code> such that <code class="language-plaintext highlighter-rouge">H(s || messageNew) == C</code>; this is extremely unlikely but possible in principle.</li>
</ol>

<p>In practice, both variants are considered secure if the salt is long and random; which one to choose depends on the application: (a) publishing the salt makes accusations of “cheating with a new salt” impossible, while (b) hiding it makes brute-force attacks against short message lists impossible.</p>

<h3 id="verifier-supplied-randomness-stronger-binding-if-needed">Verifier-supplied randomness: stronger binding if needed</h3>

<p>We can even go a step further and use additional verifier-supplied (or supplied via a public beacon) randomness <code class="language-plaintext highlighter-rouge">r</code> to compute the commitment. This provides even stronger binding than the basic scheme with salt: the committer cannot influence <code class="language-plaintext highlighter-rouge">r</code> and the salt still hides the message. The reasoning is one of those subtle but fundamental cryptography points which becomes clear once you think about who controls the randomness and when.</p>

<p>In the basic scheme you control both <code class="language-plaintext highlighter-rouge">salt</code> and <code class="language-plaintext highlighter-rouge">message</code> before committing. By contrast, if the verifier (or a randomness beacon) first provides a random challenge <code class="language-plaintext highlighter-rouge">r</code>, you now compute the commitment as</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C = H(r || salt || message)
</code></pre></div></div>

<p>Why is this stronger?</p>

<ul>
  <li>You cannot “grind” over many salts anymore. With <code class="language-plaintext highlighter-rouge">r</code> fixed by someone else, you lose the ability to search over huge numbers of salts to pick a “nice” one with special properties.</li>
  <li>Binding is strengthened even against powerful adversaries. Without <code class="language-plaintext highlighter-rouge">r</code>, binding relies on <em>collision resistance</em> alone. With <code class="language-plaintext highlighter-rouge">r</code>, the committer faces <em>preimage resistance under a fixed prefix</em>, which is much harder to game retroactively, even if someone could manufacture collisions offline.</li>
  <li>It provides context and time binding. Different <code class="language-plaintext highlighter-rouge">r</code> values yield different commitments for the same message. If <code class="language-plaintext highlighter-rouge">r</code> is tied to a timestamp, block hash, or VRF output, the commitment could not have been generated before <code class="language-plaintext highlighter-rouge">r</code> existed.</li>
  <li>It removes the “choose your own salt” accusation. Since you do not control all randomness, a verifier need not trust that you sampled <code class="language-plaintext highlighter-rouge">salt</code> honestly.</li>
</ul>

<p>Analogy (lottery tickets):</p>

<ul>
  <li>Without <code class="language-plaintext highlighter-rouge">r</code>: you pick your own ticket numbers and could generate many until you like one.</li>
  <li>With <code class="language-plaintext highlighter-rouge">r</code>: the machine draws a random number you must include; advance “ticket shopping” no longer helps.</li>
</ul>

<p>Summary:</p>

<table>
  <thead>
    <tr>
      <th>Property</th>
      <th><code class="language-plaintext highlighter-rouge">H(salt || message)</code></th>
      <th><code class="language-plaintext highlighter-rouge">H(r || salt || message)</code></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Who controls randomness?</td>
      <td>Committer</td>
      <td>Verifier &amp; Committer</td>
    </tr>
    <tr>
      <td>Binding (practical)</td>
      <td>Strong (collision resistance)</td>
      <td>Stronger (fixed-prefix preimage)</td>
    </tr>
    <tr>
      <td>Hiding</td>
      <td>Strong (if salt kept secret)</td>
      <td>Same</td>
    </tr>
    <tr>
      <td>Resistance to salt grinding</td>
      <td>❌</td>
      <td>✅</td>
    </tr>
    <tr>
      <td>Time/context binding</td>
      <td>❌</td>
      <td>✅</td>
    </tr>
    <tr>
      <td>Trust model</td>
      <td>Some trust in committer</td>
      <td>No trust needed</td>
    </tr>
  </tbody>
</table>

<p>Such stronger guarantees that use verifier-provided nonces or public randomness sources eliminate additional theoretical avenues for manipulation and are foundational to advanced protocols like verifiable delay functions, blockchain randomness beacons, and secure multiparty computation.</p>

<h3 id="hmac-mode-adding-a-secret-key">HMAC Mode: Adding a Secret Key</h3>

<p>Another variant uses an HMAC (Hash-based Message Authentication Code):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C = HMAC(key, message)
</code></pre></div></div>

<p>Here <code class="language-plaintext highlighter-rouge">key</code> is a secret shared between two parties. The result:</p>

<ul>
  <li>Only someone with the key can generate a valid commitment.</li>
  <li>Only someone with the key can verify it.</li>
</ul>

<p><em>General structure of HMAC (as a function of <code class="language-plaintext highlighter-rouge">H</code>).</em> At a high level, HMAC wraps a hash <code class="language-plaintext highlighter-rouge">H</code> in an inner/outer composition with two fixed byte patterns (often called “pads”) and the secret key:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HMAC_H(key, msg) = H( (key_xor_opad) || H( (key_xor_ipad) || msg ) )
</code></pre></div></div>

<p>where <code class="language-plaintext highlighter-rouge">key_xor_ipad</code> and <code class="language-plaintext highlighter-rouge">key_xor_opad</code> are the secret key combined with two fixed constants (one for the inner hash, one for the outer hash). This gives message authentication on top of <code class="language-plaintext highlighter-rouge">H</code> without exposing the raw key, and it prevents simple extension attacks against Merkle–Damgård hashes.</p>
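<p>The inner/outer composition can be written out directly and checked against Python’s standard <code class="language-plaintext highlighter-rouge">hmac</code> module; here is a sketch for SHA-256, whose block size is 64 bytes:</p>

```python
import hashlib
import hmac

BLOCK_SIZE = 64  # SHA-256 processes 64-byte blocks

def hmac_sha256(key: bytes, msg: bytes) -> str:
    """HMAC_H(key, msg) = H((key ^ opad) || H((key ^ ipad) || msg)) for H = SHA-256."""
    if len(key) > BLOCK_SIZE:             # overlong keys are first hashed down
        key = hashlib.sha256(key).digest()
    key = key.ljust(BLOCK_SIZE, b"\x00")  # then padded to the block size
    ipad = bytes(b ^ 0x36 for b in key)   # key_xor_ipad
    opad = bytes(b ^ 0x5C for b in key)   # key_xor_opad
    inner = hashlib.sha256(ipad + msg).digest()
    return hashlib.sha256(opad + inner).hexdigest()

key, msg = b"shared-secret", b"Tomorrow it is going to rain."
# Matches the standard library implementation.
assert hmac_sha256(key, msg) == hmac.new(key, msg, hashlib.sha256).hexdigest()
```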

<p><strong>Remark (why HMAC is different than hashing with a private salt).</strong> Using a secret salt like <code class="language-plaintext highlighter-rouge">H(secret || msg)</code> or <code class="language-plaintext highlighter-rouge">H(msg || secret)</code> is not the same as HMAC:</p>

<ul>
  <li>A private-salt hash is typically vulnerable to length-extension/structural issues with Merkle–Damgård hashes (this includes SHA-256); HMAC’s inner/outer pads provide domain separation that defeats these.</li>
  <li>HMAC has strong, widely accepted security guarantees (MAC/PRF) under assumptions on <code class="language-plaintext highlighter-rouge">H</code>, whereas a naive private-salt construction does not.</li>
  <li>Verification differs: with a private salt, revealing the salt for verification means it cannot be safely reused (this affects the reusability of the key); with HMAC, verifiers share a key and tags reveal nothing about the raw key.</li>
</ul>

<p><strong>What is length extension? A simple example.</strong></p>

<p>Suppose you try to authenticate with <code class="language-plaintext highlighter-rouge">tag = H(secret || msg)</code> using SHA-256. An attacker who sees <code class="language-plaintext highlighter-rouge">msg</code> and <code class="language-plaintext highlighter-rouge">tag</code> can compute a valid tag for an extended message <code class="language-plaintext highlighter-rouge">msg || padding || extra</code> without knowing <code class="language-plaintext highlighter-rouge">secret</code>:</p>

<ul>
  <li>They reconstruct SHA-256’s padding for the unknown <code class="language-plaintext highlighter-rouge">secret || msg</code> length.</li>
  <li>They continue the hash state from <code class="language-plaintext highlighter-rouge">tag</code> to process <code class="language-plaintext highlighter-rouge">extra</code> and produce a forged tag for the longer message.</li>
</ul>

<p><em>Result.</em> The attacker forges <code class="language-plaintext highlighter-rouge">tag'</code> for <code class="language-plaintext highlighter-rouge">msg' = msg || padding || extra</code> that will verify under the naive scheme with <code class="language-plaintext highlighter-rouge">secret</code>. This is a serious problem if the <code class="language-plaintext highlighter-rouge">secret</code> is to be reused as a key. HMAC’s inner/outer construction prevents this kind of extension because the outer hash re-keys and finalizes the computation in a way that cannot be resumed by an attacker. In the context of commitments, this is not a problem if a fresh <code class="language-plaintext highlighter-rouge">secret</code>, i.e., a <code class="language-plaintext highlighter-rouge">salt</code>, is used for each commitment.</p>

<p>Note that HMAC is not strictly a commitment scheme; it is primarily for authenticity, but it is useful when commitments must be verifiable only by authorized parties.</p>

<h2 id="applications-where-commitments-matter">Applications: Where Commitments Matter</h2>

<p>Commitment schemes may seem abstract, but they have applications in many real-world protocols.</p>

<h3 id="zero-knowledge-proofs-graph-coloring">Zero-Knowledge Proofs: Graph Coloring</h3>

<p>One of the most elegant applications of commitments is in zero-knowledge proofs. These are interactive protocols that let a prover convince a verifier that a certain statement is true without revealing anything beyond the truth of the statement itself. A canonical example uses the NP-complete Graph 3-Coloring problem.</p>

<p><strong>The Setting.</strong></p>

<ul>
  <li>The prover (P) knows a valid 3-coloring of a graph $G=(V,E)$ but does not want to reveal it to the verifier (V).</li>
  <li>The verifier (V) wants to be convinced that P indeed knows such a coloring.</li>
</ul>

<p><strong>Protocol: Iterative Zero-Knowledge Proof for 3-Coloring.</strong></p>

<ol>
  <li>Commitment Phase:
    <ul>
      <li>P randomly permutes the color labels (e.g., swaps “red,” “green,” and “blue”).</li>
      <li>
        <p>For each vertex $v_i$, P commits to the permuted color with a salt:</p>

\[C_i = H(\text{salt}_i \parallel \pi(c(v_i)))\]
      </li>
      <li>Salts are freshly sampled for each round (and for each vertex); do not reuse salts across rounds.</li>
      <li>P sends all commitments $\{C_i\}$ to V.</li>
    </ul>
  </li>
  <li>Challenge Phase:
    <ul>
      <li>V picks a random edge $e=(u,v)$ and sends it as a challenge.</li>
    </ul>
  </li>
  <li>Opening Phase:
    <ul>
      <li>P reveals the committed colors and salts for $u$ and $v$.</li>
      <li>V verifies:
        <ul>
          <li>The openings match the original commitments.</li>
          <li>The revealed (permuted) colors differ: $\pi(c(u)) \neq \pi(c(v))$.</li>
        </ul>
      </li>
    </ul>
  </li>
  <li>Repeat:
    <ul>
      <li>Steps 1–3 are repeated $k$ times with fresh random permutations and fresh salts each round.</li>
    </ul>
  </li>
</ol>

<p>Each round convinces V a little more. If P does not have a valid coloring, the probability of cheating successfully falls exponentially with $k$.</p>

<p><strong>Remark: Why Shuffle Every Round?</strong> Without the random permutation step (and fresh salts), repeated openings would gradually leak the full coloring. The shuffling ensures that even after many rounds, no information beyond “adjacent nodes differ” is revealed.</p>
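<p>One round of the protocol above can be simulated in a few lines (a toy triangle graph with an illustrative coloring; salted SHA-256 stands in for the commitment scheme, as in the rest of this post):</p>

```python
import hashlib
import random
import secrets

# Toy instance: a triangle with a valid 3-coloring
edges = [(0, 1), (1, 2), (0, 2)]
coloring = {0: "red", 1: "green", 2: "blue"}

# Commitment phase: permute the color labels, commit with fresh salts
labels = ["red", "green", "blue"]
perm = dict(zip(labels, random.sample(labels, 3)))
salts = {v: secrets.token_bytes(16) for v in coloring}
commitments = {v: hashlib.sha256(salts[v] + perm[c].encode()).hexdigest()
               for v, c in coloring.items()}

# Challenge phase: the verifier picks a random edge
u, v = random.choice(edges)

# Opening phase: the prover reveals salts and permuted colors for u and v
opened = {w: (salts[w], perm[coloring[w]]) for w in (u, v)}

# The verifier checks the openings against the commitments, and that colors differ
for w, (salt, color) in opened.items():
    assert hashlib.sha256(salt + color.encode()).hexdigest() == commitments[w]
assert opened[u][1] != opened[v][1]
print("round passed")
```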

<p><strong>Complexity Perspective.</strong> The 3-coloring problem is NP-complete: verifying a proposed coloring is easy (polynomial time), but finding one is believed to be hard. This protocol does not make solving NP-complete problems easier, but it demonstrates a powerful result:</p>

<blockquote>
  <p>Every language in NP has a zero-knowledge proof system (under standard cryptographic assumptions).</p>
</blockquote>

<p>This is a foundational insight in modern cryptography. The graph-coloring protocol is a canonical constructive example.</p>

<h3 id="betting-and-predictions">Betting and Predictions</h3>

<p>Commitments are common in prediction markets or bets. You commit to “Team A will win” by publishing <code class="language-plaintext highlighter-rouge">C = H(salt || prediction)</code> now. After the match, you reveal the prediction. Because the commitment was binding, nobody can accuse you of changing your guess after the fact.</p>
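<p>The commit–reveal cycle is short enough to write out in full (values are illustrative):</p>

```python
import hashlib
import secrets

# Commit now: publish C, keep salt and prediction private until reveal
prediction = b"Team A will win"
salt = secrets.token_bytes(16)
C = hashlib.sha256(salt + prediction).hexdigest()

# Reveal later: anyone can recompute the hash and check it against C
def verify(commitment, salt, prediction):
    return hashlib.sha256(salt + prediction).hexdigest() == commitment

print(verify(C, salt, prediction))          # True
print(verify(C, salt, b"Team B will win"))  # False
```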

<h3 id="timestamped-knowledge-proofs">Timestamped Knowledge Proofs</h3>

<p>Researchers or inventors can commit to a discovery before publication. Later, revealing the original text proves they knew it at the earlier time. Useful for priority claims and intellectual property.</p>

<h3 id="digital-lotteries-and-random-draws">Digital Lotteries and Random Draws</h3>

<p>A lottery organizer commits to a random seed before ticket sales close: <code class="language-plaintext highlighter-rouge">C = H(seed)</code>. After the sale, they reveal the seed, and everyone can verify that the winning draw was predetermined and not manipulated after the fact.</p>

<h3 id="authenticity-and-integrity-checks">Authenticity and Integrity Checks</h3>

<p>If you want to authenticate data without revealing it, you can commit to its hash in advance. Later, when the data is disclosed, others can verify it has not changed. This is essentially what MD5 checksums once did for downloads (though MD5 is broken today; SHA-256 should be used instead).</p>

<h2 id="hashes-are-not-encryption">Hashes Are Not Encryption</h2>

<p>One final note: a hash is not encryption.</p>

<table>
  <thead>
    <tr>
      <th>Feature</th>
      <th>Hash Function</th>
      <th>Encryption</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Purpose</td>
      <td>One-way fingerprint of data</td>
      <td>Reversible hiding of data</td>
    </tr>
    <tr>
      <td>Direction</td>
      <td>One-way (irreversible)</td>
      <td>Two-way (reversible with key)</td>
    </tr>
    <tr>
      <td>Recover plaintext?</td>
      <td>❌ Infeasible (one-way)</td>
      <td>✅ Yes, with key</td>
    </tr>
  </tbody>
</table>

<p>You cannot “decrypt” a hash. This one-way property is precisely what makes hashes ideal for commitments.</p>
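<p>The contrast can be made concrete in a few lines of Python; the repeating-key XOR below is a toy stand-in for real encryption, included only to illustrate reversibility:</p>

```python
import hashlib

data = b"attack at dawn"

# Hash: a keyless one-way fingerprint; there is no operation that maps it back
digest = hashlib.sha256(data).hexdigest()

# Toy "encryption" for contrast (repeating-key XOR -- illustrative only, NOT secure)
def xor_cipher(data, key):
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

key = b"k3y"
ciphertext = xor_cipher(data, key)
plaintext = xor_cipher(ciphertext, key)  # reversible, given the key
print(plaintext == data)  # True
```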

<h2 id="references">References</h2>

<p>[BR] Bellare, M., &amp; Rogaway, P. (2005). Introduction to Modern Cryptography. UCSD CSE 207 course notes.</p>

<p>[KL] Katz, J., &amp; Lindell, Y. (2007). Introduction to Modern Cryptography: Principles and Protocols. Chapman and Hall/CRC.</p>

<p>[RFC2104] Krawczyk, H., Bellare, M., &amp; Canetti, R. (1997). RFC 2104: HMAC: Keyed-Hashing for Message Authentication.</p>

<h2 id="appendix-lightweight-verification-script-python">Appendix: Lightweight Verification Script (Python)</h2>

<p>Below is a minimal verifier you can use to check commitments encoded as JSON. It supports both SHA-256 and HMAC-SHA256 modes; you can also use it to create commitments by using the computed hash.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python3
</span><span class="sh">"""</span><span class="s">Lightweight commitment verification script.</span><span class="sh">"""</span>

<span class="kn">import</span> <span class="n">sys</span>
<span class="kn">import</span> <span class="n">os</span>
<span class="kn">import</span> <span class="n">json</span>
<span class="kn">import</span> <span class="n">hashlib</span>
<span class="kn">import</span> <span class="n">hmac</span>
<span class="kn">from</span> <span class="n">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>

<span class="k">def</span> <span class="nf">hex_to_bytes</span><span class="p">(</span><span class="n">hex_str</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">Convert hex string to bytes.</span><span class="sh">"""</span>
    <span class="k">return</span> <span class="nb">bytes</span><span class="p">.</span><span class="nf">fromhex</span><span class="p">(</span><span class="n">hex_str</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">bytes_to_hex</span><span class="p">(</span><span class="n">data</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">Convert bytes to hex string.</span><span class="sh">"""</span>
    <span class="k">return</span> <span class="n">data</span><span class="p">.</span><span class="nf">hex</span><span class="p">()</span>

<span class="k">def</span> <span class="nf">verify_commitment</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">hmac_key</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">Verify a commitment from JSON data.</span><span class="sh">"""</span>

    <span class="c1"># Extract components
</span>    <span class="n">commitment</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="sh">'</span><span class="s">commitment</span><span class="sh">'</span><span class="p">]</span>
    <span class="n">salt_hex</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="sh">'</span><span class="s">salt</span><span class="sh">'</span><span class="p">]</span>
    <span class="n">domain</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="sh">'</span><span class="s">domain</span><span class="sh">'</span><span class="p">]</span>
    <span class="n">message</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="sh">'</span><span class="s">message</span><span class="sh">'</span><span class="p">]</span>
    <span class="n">mode</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">'</span><span class="s">mode</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">hash</span><span class="sh">'</span><span class="p">)</span>
    <span class="n">timestamp</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">'</span><span class="s">timestamp</span><span class="sh">'</span><span class="p">)</span>

    <span class="c1"># Convert salt to bytes
</span>    <span class="n">salt</span> <span class="o">=</span> <span class="nf">hex_to_bytes</span><span class="p">(</span><span class="n">salt_hex</span><span class="p">)</span>

    <span class="c1"># Encode domain and message to UTF-8 bytes
</span>    <span class="n">domain_bytes</span> <span class="o">=</span> <span class="n">domain</span><span class="p">.</span><span class="nf">encode</span><span class="p">(</span><span class="sh">'</span><span class="s">utf-8</span><span class="sh">'</span><span class="p">)</span>
    <span class="n">message_bytes</span> <span class="o">=</span> <span class="n">message</span><span class="p">.</span><span class="nf">encode</span><span class="p">(</span><span class="sh">'</span><span class="s">utf-8</span><span class="sh">'</span><span class="p">)</span>

    <span class="c1"># Recreate the data that was hashed
</span>    <span class="n">data_bytes</span> <span class="o">=</span> <span class="n">salt</span> <span class="o">+</span> <span class="n">domain_bytes</span> <span class="o">+</span> <span class="n">message_bytes</span>

    <span class="c1"># Hash the data
</span>    <span class="k">if</span> <span class="n">mode</span> <span class="o">==</span> <span class="sh">'</span><span class="s">hmac</span><span class="sh">'</span><span class="p">:</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">hmac_key</span><span class="p">:</span>
            <span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="s">❌ HMAC mode requires key. Use --key parameter or set HMAC_KEY environment variable</span><span class="sh">"</span><span class="p">)</span>
            <span class="k">return</span> <span class="bp">False</span><span class="p">,</span> <span class="bp">None</span>

        <span class="c1"># HMAC-SHA256 verification
</span>        <span class="n">computed</span> <span class="o">=</span> <span class="n">hmac</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="n">hmac_key</span><span class="p">.</span><span class="nf">encode</span><span class="p">(</span><span class="sh">'</span><span class="s">utf-8</span><span class="sh">'</span><span class="p">),</span> <span class="n">data_bytes</span><span class="p">,</span> <span class="n">hashlib</span><span class="p">.</span><span class="n">sha256</span><span class="p">).</span><span class="nf">digest</span><span class="p">()</span>
        <span class="n">computed_hex</span> <span class="o">=</span> <span class="nf">bytes_to_hex</span><span class="p">(</span><span class="n">computed</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="c1"># SHA-256 hash
</span>        <span class="n">computed</span> <span class="o">=</span> <span class="n">hashlib</span><span class="p">.</span><span class="nf">sha256</span><span class="p">(</span><span class="n">data_bytes</span><span class="p">).</span><span class="nf">digest</span><span class="p">()</span>
        <span class="n">computed_hex</span> <span class="o">=</span> <span class="nf">bytes_to_hex</span><span class="p">(</span><span class="n">computed</span><span class="p">)</span>

    <span class="c1"># Verify
</span>    <span class="n">is_valid</span> <span class="o">=</span> <span class="n">computed_hex</span> <span class="o">==</span> <span class="n">commitment</span>

    <span class="k">return</span> <span class="n">is_valid</span><span class="p">,</span> <span class="p">{</span>
        <span class="sh">'</span><span class="s">commitment</span><span class="sh">'</span><span class="p">:</span> <span class="n">commitment</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">computed</span><span class="sh">'</span><span class="p">:</span> <span class="n">computed_hex</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">message</span><span class="sh">'</span><span class="p">:</span> <span class="n">message</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">domain</span><span class="sh">'</span><span class="p">:</span> <span class="n">domain</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">mode</span><span class="sh">'</span><span class="p">:</span> <span class="n">mode</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">timestamp</span><span class="sh">'</span><span class="p">:</span> <span class="n">timestamp</span>
    <span class="p">}</span>

<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="sh">"""</span><span class="s">Main verification function.</span><span class="sh">"""</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="c1"># Parse command line arguments
</span>        <span class="n">hmac_key</span> <span class="o">=</span> <span class="bp">None</span>
        <span class="k">if</span> <span class="nf">len</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="sh">'</span><span class="s">--key</span><span class="sh">'</span> <span class="ow">and</span> <span class="nf">len</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">2</span><span class="p">:</span>
                <span class="n">hmac_key</span> <span class="o">=</span> <span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span>
            <span class="k">elif</span> <span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="nf">startswith</span><span class="p">(</span><span class="sh">'</span><span class="s">--key=</span><span class="sh">'</span><span class="p">):</span>
                <span class="n">hmac_key</span> <span class="o">=</span> <span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">6</span><span class="p">:]</span>  <span class="c1"># Remove '--key=' prefix
</span>
        <span class="c1"># Also check environment variable
</span>        <span class="k">if</span> <span class="ow">not</span> <span class="n">hmac_key</span><span class="p">:</span>
            <span class="n">hmac_key</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">'</span><span class="s">HMAC_KEY</span><span class="sh">'</span><span class="p">)</span>

        <span class="c1"># Read JSON from stdin
</span>        <span class="n">json_input</span> <span class="o">=</span> <span class="n">sys</span><span class="p">.</span><span class="n">stdin</span><span class="p">.</span><span class="nf">read</span><span class="p">().</span><span class="nf">strip</span><span class="p">()</span>

        <span class="k">if</span> <span class="ow">not</span> <span class="n">json_input</span><span class="p">:</span>
            <span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="s">❌ No JSON input provided</span><span class="sh">"</span><span class="p">)</span>
            <span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="s">Usage: echo </span><span class="sh">'</span><span class="s">PASTE_JSON_HERE</span><span class="sh">'</span><span class="s"> | python3 verify_commitment.py [--key YOUR_KEY]</span><span class="sh">"</span><span class="p">)</span>
            <span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="s">   or: export HMAC_KEY=your_key</span><span class="sh">"</span><span class="p">)</span>
            <span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="s">   or: echo </span><span class="sh">'</span><span class="s">PASTE_JSON_HERE</span><span class="sh">'</span><span class="s"> | python3 verify_commitment.py --key=your_key</span><span class="sh">"</span><span class="p">)</span>
            <span class="k">return</span>

        <span class="c1"># Parse JSON
</span>        <span class="k">try</span><span class="p">:</span>
            <span class="n">data</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="nf">loads</span><span class="p">(</span><span class="n">json_input</span><span class="p">)</span>
        <span class="k">except</span> <span class="n">json</span><span class="p">.</span><span class="n">JSONDecodeError</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
            <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">❌ Invalid JSON: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
            <span class="k">return</span>

        <span class="c1"># Verify commitment
</span>        <span class="n">is_valid</span><span class="p">,</span> <span class="n">details</span> <span class="o">=</span> <span class="nf">verify_commitment</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">hmac_key</span><span class="p">)</span>

        <span class="c1"># Output results
</span>        <span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="se">\n\n</span><span class="sh">"</span><span class="p">)</span>
        <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Verification: </span><span class="si">{</span><span class="sh">'</span><span class="s">✅ VERIFIED</span><span class="sh">'</span> <span class="k">if</span> <span class="n">is_valid</span> <span class="k">else</span> <span class="sh">'</span><span class="s">❌ NOT VERIFIED</span><span class="sh">'</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>

        <span class="k">if</span> <span class="n">details</span><span class="p">:</span>
            <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Mode: </span><span class="si">{</span><span class="n">details</span><span class="p">[</span><span class="sh">'</span><span class="s">mode</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
            <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Domain: </span><span class="si">{</span><span class="n">details</span><span class="p">[</span><span class="sh">'</span><span class="s">domain</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
            <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Message: </span><span class="si">{</span><span class="n">details</span><span class="p">[</span><span class="sh">'</span><span class="s">message</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>

            <span class="k">if</span> <span class="n">details</span><span class="p">[</span><span class="sh">'</span><span class="s">timestamp</span><span class="sh">'</span><span class="p">]:</span>
                <span class="k">try</span><span class="p">:</span>
                    <span class="n">dt</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="nf">fromisoformat</span><span class="p">(</span><span class="n">details</span><span class="p">[</span><span class="sh">'</span><span class="s">timestamp</span><span class="sh">'</span><span class="p">].</span><span class="nf">replace</span><span class="p">(</span><span class="sh">'</span><span class="s">Z</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">+00:00</span><span class="sh">'</span><span class="p">))</span>
                    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Timestamp: </span><span class="si">{</span><span class="n">dt</span><span class="p">.</span><span class="n">strftime</span><span class="p">(</span><span class="sh">'</span><span class="s">%Y-%m-%d %H</span><span class="si">:</span><span class="o">%</span><span class="n">M</span><span class="si">:</span><span class="o">%</span><span class="n">S</span> <span class="n">UTC</span><span class="sh">'</span><span class="s">)</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
                <span class="k">except</span> <span class="nb">Exception</span><span class="p">:</span>
                    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Timestamp: </span><span class="si">{</span><span class="n">details</span><span class="p">[</span><span class="sh">'</span><span class="s">timestamp</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>

            <span class="k">if</span> <span class="ow">not</span> <span class="n">is_valid</span> <span class="ow">and</span> <span class="n">details</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">'</span><span class="s">computed</span><span class="sh">'</span><span class="p">):</span>
                <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Expected: </span><span class="si">{</span><span class="n">details</span><span class="p">[</span><span class="sh">'</span><span class="s">commitment</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
                <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Computed: </span><span class="si">{</span><span class="n">details</span><span class="p">[</span><span class="sh">'</span><span class="s">computed</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>

    <span class="k">except</span> <span class="nb">KeyboardInterrupt</span><span class="p">:</span>
        <span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="se">\n</span><span class="s">⚠️  Interrupted</span><span class="sh">"</span><span class="p">)</span>
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">❌ Error: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="sh">"</span><span class="s">__main__</span><span class="sh">"</span><span class="p">:</span>
    <span class="nf">main</span><span class="p">()</span>
</code></pre></div></div>]]></content><author><name>Sebastian Pokutta</name></author><category term="random" /><category term="cryptography" /><category term="commitments" /><category term="hash" /><summary type="html"><![CDATA[TL;DR: Cryptography is often thought of in terms of encryption (hiding messages) or signatures (proving authenticity). But there’s another fundamental building block that underpins many protocols, from digital lotteries to zero-knowledge proofs: the commitment. A commitment is a way to promise something now without revealing it, yet in such a way that you cannot change your mind later. It can be efficiently implemented via hashing.]]></summary></entry><entry><title type="html">Little’s Law and Conference Reviewing: the Queueing Perspective</title><link href="http://www.pokutta.com/blog/littles-law-for-conference-review-backlogs/" rel="alternate" type="text/html" title="Little’s Law and Conference Reviewing: the Queueing Perspective" /><published>2025-09-02T01:00:00+02:00</published><updated>2025-09-02T01:00:00+02:00</updated><id>http://www.pokutta.com/blog/little-conference</id><content type="html" xml:base="http://www.pokutta.com/blog/littles-law-for-conference-review-backlogs/"><![CDATA[<p><em>TL;DR: This is the queueing model perspective of the “paper pool” conference reviewing model with math and numbers based on Little’s Law. Think of it as supplementary material to the post on <a href="https://damaru2.github.io/general/queueing_to_publish_in_AI_or_CS/">David’s blog</a> on current ML/AI conferences; you might want to read that one first.</em></p>

<!--more-->

<p><em>Written by <a href="https://www.pokutta.com/">Sebastian Pokutta</a> in collaboration with <a href="https://damaru2.github.io/">David Martínez-Rubio</a>.</em></p>

<h2 id="littles-law-for-queueing-systems">Little’s law for queueing systems</h2>

<p>Little’s law states that the long-run average number $L$ of jobs in a stable system equals the arrival/throughput rate $\lambda$ times the average time $W$ a job spends in the system:</p>

\[\tag{Little's Law}
L = \lambda W.\]

<p><strong>Interpretation.</strong></p>

<ul>
  <li>$L$: average work-in-progress (jobs “in the system”)</li>
  <li>$\lambda$: average completion rate for the flow considered (arrivals = departures in steady state)</li>
  <li>$W$: average time in system per job</li>
</ul>

<p><strong>Assumptions.</strong> The assumptions are minimal: stability (finite averages, arrivals balance departures), clearly defined entry/exit points, and stationarity/ergodicity. Crucially, Little’s law is insensitive to the arrival/service distributions and scheduling discipline, and it also holds for stable subflows (e.g., when splitting into different classes of jobs).</p>

<p><strong>Note.</strong> We stick to the notion of <em>hazard</em> as the per-step completion (conditional) probability, as is customary in queueing theory.</p>

<h2 id="summary-link-to-the-conference-reviewing-model">Summary: Link to the conference reviewing model</h2>

<p>In our setting, the “system” is the pool of papers under review across calls; arrivals are new submissions per call; exits are final decisions (accept or abandonment). We model each conference <em>call</em> as one time step; authors may resubmit rejected papers to the next call, i.e., these papers effectively stay in the system. If <em>$N$</em> new submissions arrive per call and each paper has an independent per-call <em>accept/complete hazard $p$</em>, then Little’s law gives:</p>

<p><strong>No giving up (infinite patience).</strong> In this case, every paper remains in the pool until it is accepted, which eventually happens with probability one. Given hazard $p$, the mean time in system is $W = \frac{1}{p}$ and the expected pool level is $L = N/p$:</p>

\[W=\frac{1}{p}\ \text{calls},\qquad
L=\frac{N}{p},\qquad
\text{accepted per call}=\lambda=N.\]

<p>Key insight: the throughput (= number of accepted papers per call) <em>does not change</em> with $p$. Lowering $p$ just increases the in-system population $L$ (and thus reviewer load).</p>
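<p>A small Monte Carlo check of this throughput invariance (a sketch assuming deterministic arrivals and the stated per-call Bernoulli acceptance hazard):</p>

```python
import random

def simulate(N=100, p=0.25, calls=1000, seed=0):
    """Each call, N new papers arrive; every paper in the pool is accepted w.p. p."""
    rng = random.Random(seed)
    pool, accepted = 0, []
    for _ in range(calls):
        pool += N
        acc = sum(rng.random() < p for _ in range(pool))
        pool -= acc
        accepted.append(acc)
    tail = accepted[calls // 2:]  # discard warm-up
    return pool, sum(tail) / len(tail)

for p in (0.1, 0.25, 0.5):
    pool, acc_rate = simulate(p=p)
    # Accepted per call hovers around N for every p; only the pool scales as N/p
    print(f"p={p}: accepted/call ≈ {acc_rate:.1f}, pool ≈ {pool}")
```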

<p><strong>Finite patience with papers being abandoned after $T$ calls.</strong> This is the case where papers are abandoned after $T$ calls if not accepted. We still have only one “class” of papers, but now there are two exit modes: <em>accept</em> or <em>abandon</em>.</p>

\[W_{\text{all}}=\frac{1-(1-p)^T}{p},\quad
L_{\text{all}}=N\frac{1-(1-p)^T}{p},\quad
\lambda_{\text{acc}}=N\big[1-(1-p)^T\big],\quad
\lambda_{\text{abn}}=N(1-p)^T.\]

\[W_{\text{acc}}=\frac{1}{p}-\frac{T(1-p)^T}{1-(1-p)^T},\quad
L_{\text{acc}}=N\!\left(\frac{1-(1-p)^T}{p}-T(1-p)^T\right),\quad
L_{\text{abn}}=N\,T(1-p)^T.\]

<p>Key insight: acceptances rise only modestly with $p$, but backlog and review volume drop sharply. This model is more realistic but still has only one class of papers.</p>
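<p>Plugging in illustrative numbers ($N=100$ submissions per call, hazard $p=0.25$, patience $T=4$), the formulas above can be evaluated directly:</p>

```python
N, p, T = 100, 0.25, 4  # illustrative values
q = 1 - p

W_all = (1 - q**T) / p                 # mean time in system, all papers
L_all = N * W_all                      # pool level
lam_acc = N * (1 - q**T)               # accepted per call
lam_abn = N * q**T                     # abandoned per call
W_acc = 1 / p - T * q**T / (1 - q**T)  # mean time in system, accepted papers
L_acc = N * ((1 - q**T) / p - T * q**T)
L_abn = N * T * q**T

# Consistency checks implied by Little's law
assert abs(L_all - (L_acc + L_abn)) < 1e-9
assert abs(L_acc - lam_acc * W_acc) < 1e-9
assert abs(lam_acc + lam_abn - N) < 1e-9
print(f"W_all={W_all:.2f} calls, L_all={L_all:.0f}, "
      f"accepted/call={lam_acc:.1f}, abandoned/call={lam_abn:.1f}")
```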

<p><strong>Quality buckets: class-dependent hazards and patience.</strong> When submissions split into quality classes (e.g., “great/average/bad”) with different per-call completion hazards $p_i$ and potentially different patience limits $T_i$, Little’s law applies to each class separately:</p>

\[W_i=\frac{1-(1-p_i)^{T_i}}{p_i},\quad
L_i=\lambda_i W_i,\quad
\lambda^{\text{acc}}_i=\lambda_i\big[1-(1-p_i)^{T_i}\big],\]

<p>where $\lambda_i = N\pi_i$ is the arrival rate for class $i$ with share $\pi_i$. The backlog and acceptance shares become</p>

\[\text{BacklogShare}_i=\frac{L_i}{\sum_j L_j},\quad
\text{AcceptShare}_i=\frac{\lambda^{\text{acc}}_i}{\sum_j \lambda^{\text{acc}}_j}.\]

<p>Key insight: reweighting hazards across classes mainly shifts the <em>composition</em> of acceptances and backlog, while total throughput remains roughly fixed at $N$ (since $\sum_i \lambda_i = N$).</p>
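<p>Per-class bookkeeping is a direct application of the formulas above; the shares, hazards, and patience limits below are illustrative assumptions, not calibrated values:</p>

```python
N = 100  # total submissions per call
classes = {  # illustrative shares, hazards, and patience -- not calibrated
    "great":   {"pi": 0.2, "p": 0.6, "T": 3},
    "average": {"pi": 0.5, "p": 0.3, "T": 4},
    "bad":     {"pi": 0.3, "p": 0.1, "T": 4},
}

stats = {}
for name, c in classes.items():
    q, T = 1 - c["p"], c["T"]
    lam = N * c["pi"]        # class arrival rate
    W = (1 - q**T) / c["p"]  # class mean time in system
    stats[name] = {"L": lam * W, "acc": lam * (1 - q**T)}

total_L = sum(s["L"] for s in stats.values())
total_acc = sum(s["acc"] for s in stats.values())
for name, s in stats.items():
    print(f"{name}: backlog share={s['L'] / total_L:.2f}, "
          f"accept share={s['acc'] / total_acc:.2f}")
```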

<p><strong>Slow growth in arrivals ($g$ multiplicative increase per call).</strong> We can also immediately estimate the impact of growth in the arrival rate. Over a window roughly equal to the mean time in system $W\approx 1/p$, a sustained per-call arrival growth of $g$ compounds into an approximate relative increase of $g \cdot W \approx g/p$ in the <em>pool level</em> (a rule of thumb).</p>

<p><strong>Caveats, Assumptions, and Gotchas.</strong></p>

<ul>
  <li>The hazard $p$ bundles many realities (desk reject, area-chair filtering, author revision time). If those change, so does $W$.</li>
  <li>Little’s law reveals immediately that <em>steady-state acceptances follow arrivals</em>, not the per-call acceptance hazard.</li>
  <li>If resubmissions between different venues happen with delays, one can model them as additional stages; Little’s law still applies stage-wise.</li>
  <li>If arrivals are highly seasonal (e.g., co-located deadlines), replace the $g/p$ heuristic with the exact weighted-sum solution or simulate.</li>
</ul>

<h2 id="littles-law-in-action">Little’s Law in Action</h2>

<p>Below we visualize Little’s law with a simple funnel model, where papers arrive at the top and then go through the system, eventually being accepted or abandoned.</p>

<div id="littles-law-funnel"></div>

<style>
  .llf-wrap { box-sizing: border-box; color:#111827; font-family: ui-sans-serif, system-ui, -apple-system, Segoe UI, Roboto, Helvetica, Arial; }
  .llf-col { display:flex; flex-direction:column; gap:16px; }
  .llf-card { border:1px solid #e5e7eb; border-radius: 14px; box-shadow: 0 1px 2px rgba(0,0,0,0.04); background: #fff; }
  .llf-pad { padding: 16px; }
  .llf-heading { font-weight:700; font-size: 20px; margin: 4px 0 8px; }
  .llf-subtle { color:#6b7280; font-size: 13px; }
  .llf-row { display:flex; align-items:center; justify-content:space-between; margin-bottom: 10px; }
  .llf-label { font-weight:600; }
  .llf-val { color:#374151; font-weight:600; }
  .llf-input { width: 100%; }
  .llf-btn { padding: 8px 14px; border-radius: 10px; background:#fff; border:1px solid #d1d5db; cursor:pointer; }
  .llf-btn:hover { background:#f9fafb; }
  .llf-grid-2 { display:grid; grid-template-columns: 1fr 1fr; gap: 10px; margin-top: 8px; }
  .llf-chip { border-radius: 12px; padding: 10px; }
  .llf-chip-indigo { background:#eef2ff; border:1px solid #c7d2fe; color:#1e3a8a; }
  .llf-chip-emerald { background:#ecfdf5; border:1px solid #a7f3d0; color:#065f46; }
  .llf-chip-rose { background:#fff1f2; border:1px solid #fecdd3; color:#9f1239; }
  .llf-chip-amber { background:#fffbeb; border:1px solid #fde68a; color:#92400e; grid-column: span 2; }
  .llf-stats { display:grid; grid-template-columns: repeat(2, 1fr); gap: 10px; padding: 14px; background:#f9fafb; border-top:1px solid #e5e7eb; }
  @media (min-width: 640px) { .llf-stats { grid-template-columns: repeat(4, 1fr); } }
  .llf-stat { background:#fff; border:1px solid #e5e7eb; border-radius: 12px; padding: 10px; }
  .llf-stat .k { font-size: 12px; color:#6b7280; }
  .llf-stat .v { font-size: 18px; font-weight: 700; }
  .llf-tip { color:#6b7280; font-size:12px; }
  canvas.llf-canvas { display:block; width: 100%; height: auto; }
  .llf-controls-grid { display:grid; grid-template-columns: 1fr; gap: 12px; }
  @media (min-width: 640px) { .llf-controls-grid { grid-template-columns: 1fr 1fr; } }
  details.llf-adv summary { cursor:pointer; font-weight:600; }
  details.llf-adv > div { margin-top:8px; }
</style>

<script>
(function(){
  // Utility
  const clamp = (v, lo, hi) => Math.max(lo, Math.min(hi, v));
  const rand = (a=0, b=1) => a + Math.random()*(b-a);

  // Number formatters
  const fmt = new Intl.NumberFormat('en-US', { maximumFractionDigits: 2 });
  const fmt0 = new Intl.NumberFormat('en-US', { maximumFractionDigits: 0 });

  // Physics constants
  const DOT_RADIUS = 4.5;
  const COLLISION_CELL = DOT_RADIUS * 2.5;

  // Simple spatial hash for collision resolution among pool dots
  function resolvePoolCollisions(g) {
    // Build grid (apply across the entire funnel, not only near the bottom)
    const grid = new Map();
    const keyOf = (i, j) => i + ':' + j;
    const toCell = (x, y) => [Math.floor(x / COLLISION_CELL), Math.floor(y / COLLISION_CELL)];
    for (const p of particles) {
      if (p.status !== 'pool') continue;
      const [ci, cj] = toCell(p.x, p.y);
      const k = keyOf(ci, cj);
      if (!grid.has(k)) grid.set(k, []);
      grid.get(k).push(p);
    }

    const minDist = DOT_RADIUS * 2;
    // Relaxation passes to settle overlaps
    for (let pass = 0; pass < state.collisionPasses; pass++) {
      for (const p of particles) {
        if (p.status !== 'pool') continue;
        const [ci, cj] = toCell(p.x, p.y);
        for (let di = -1; di <= 1; di++) {
          for (let dj = -1; dj <= 1; dj++) {
            const k = keyOf(ci + di, cj + dj);
            const cell = grid.get(k);
            if (!cell) continue;
            for (const q of cell) {
              if (q === p || q.status !== 'pool') continue;
              const dx = q.x - p.x;
              const dy = q.y - p.y;
              const d2 = dx*dx + dy*dy;
              if (d2 <= 0) continue;
              const d = Math.sqrt(d2);
              if (d < minDist) {
                const overlap = (minDist - d) * 0.5;
                const nx = dx / (d || 1);
                const ny = dy / (d || 1);
                // Push apart equally
                p.x -= nx * overlap;
                p.y -= ny * overlap;
                q.x += nx * overlap;
                q.y += ny * overlap;
                // Damp velocities slightly to help settling, but keep some motion to avoid sticking
                p.vx *= 0.9; p.vy *= 0.85;
                q.vx *= 0.9; q.vy *= 0.85;
              }
            }
          }
        }
        // Floor at bottom of funnel body
        if (p.y > g.botY - DOT_RADIUS) {
          p.y = g.botY - DOT_RADIUS;
          if (p.vy > 0) p.vy = 0;
          // floor friction
          p.vx *= state.floorFriction;
        }
        // Keep inside funnel after pushing
        const left = g.funnelLeftAt(p.y) + 6;
        const right = g.funnelRightAt(p.y) - 6;
        if (p.x < left) { p.x = left; p.vx = 0; }
        if (p.x > right) { p.x = right; p.vx = 0; }
      }
    }
  }

  // Build UI
  const mount = document.getElementById('littles-law-funnel');
  const wrap = document.createElement('div');
  wrap.className = 'llf-wrap';
  wrap.innerHTML = `
    <!--
    <div class="llf-pad">
      <div class="llf-heading">Little's Law Funnel — Interactive</div>
      <div class="llf-subtle">Arrivals per call N, per-call acceptance hazard p, patience T. Watch how the in-system population L ≈ N/p (or N(1-(1-p)^T)/p) grows as p decreases, while throughput stays tied to arrivals.</div>
    </div>
    -->

    <div class="llf-col llf-pad">
      <!-- 1) Funnel + measured status -->
      <div>
        <div class="llf-card" style="overflow:hidden;">
          <canvas id="llf_canvas" class="llf-canvas"></canvas>
          <div class="llf-stats">
            <div class="llf-stat">
              <div class="k">Measured \\( L \\) (papers)</div>
              <div class="v" id="llf_L_meas"></div>
            </div>
            <div class="llf-stat">
              <div class="k">Target \\( L \\) (papers)</div>
              <div class="v" id="llf_L_target"></div>
            </div>
            <div class="llf-stat">
              <div class="k">Theoretical \\( W \\) (calls)</div>
              <div class="v" id="llf_W_calls"></div>
            </div>
            <div class="llf-stat">
              <div class="k">\\( \\lambda_{\\mathrm{acc}} \\) meas. (papers/call)</div>
              <div class="v" id="llf_lambda_acc_meas"></div>
            </div>
            <div class="llf-stat">
              <div class="k">\\( \\lambda_{\\mathrm{acc}} \\) target (papers/call)</div>
              <div class="v" id="llf_lambda_acc_target"></div>
            </div>
            <div class="llf-stat" id="llf_lambda_abn_meas_card">
              <div class="k">\\( \\lambda_{\\mathrm{abn}} \\) meas. (papers/call)</div>
              <div class="v" id="llf_lambda_abn_meas"></div>
            </div>
            <div class="llf-stat" id="llf_lambda_abn_target_card">
              <div class="k">\\( \\lambda_{\\mathrm{abn}} \\) target (papers/call)</div>
              <div class="v" id="llf_lambda_abn_target"></div>
            </div>
            <div class="llf-stat" id="llf_qacc_stat_card">
              <div class="k">\\( q_{\\mathrm{acc}} = 1-(1-p)^T \\approx 1-e^{-pT} \\)</div>
              <div class="v" id="llf_qacc"></div>
            </div>
            <div class="llf-stat" style="display:none;">
              <div class="k">Dots in pool <span class="k" id="llf_dots_target_hint"></span></div>
              <div class="v" id="llf_dots_in_pool"></div>
            </div>
            <div class="llf-stat" style="display:none;">
              <div class="k">Scale</div>
              <div class="v" id="llf_scale"></div>
            </div>
          </div>
        </div>
      </div>

      <!-- 3) Sliders/controls BELOW everything -->
      <div>
        <div class="llf-controls-grid">
          <div class="llf-card llf-pad">
            <div class="llf-row"><span class="llf-label">Arrivals per call (N)</span><span class="llf-val" id="llf_N_val"></span></div>
            <input id="llf_N" class="llf-input" type="range" min="200" max="5000" step="50" />
          </div>
          <div class="llf-card llf-pad">
            <div class="llf-row"><span class="llf-label">Per-call acceptance hazard (p)</span><span class="llf-val" id="llf_p_val"></span></div>
            <input id="llf_p" class="llf-input" type="range" min="0.05" max="0.70" step="0.01" />
          </div>
          <div class="llf-card llf-pad" id="llf_T_card">
            <div class="llf-row"><span class="llf-label">Patience T (calls)</span><span class="llf-val" id="llf_T_val"></span></div>
            <input id="llf_T" class="llf-input" type="range" min="3" max="9" step="1" />
          </div>
          <div class="llf-card llf-pad">
            <div class="llf-row">
              <label class="llf-label" for="llf_inf">Infinite patience (no giving up)</label>
              <input id="llf_inf" type="checkbox" />
            </div>
          </div>
          <div class="llf-card llf-pad">
            <div class="llf-row"><span class="llf-label">Calls per second (animation speed)</span><span class="llf-val" id="llf_cps_val"></span></div>
            <input id="llf_cps" class="llf-input" type="range" min="0.5" max="3" step="0.1" />
          </div>
          <div class="llf-card llf-pad">
            <div class="llf-row">
              <label class="llf-label" for="llf_scale_auto">Auto-scale dots</label>
              <input id="llf_scale_auto" type="checkbox" />
            </div>
            <div class="llf-row" style="margin-top:8px;justify-content:flex-start;gap:8px;">
              <button id="llf_rescale" class="llf-btn">Re-scale now</button>
            </div>
          </div>
          <div class="llf-card llf-pad">
            <details class="llf-adv" id="llf_adv">
              <summary>Advanced physics</summary>
              <div>
                <div class="llf-row"><span class="llf-label">Gravity base</span><span class="llf-val" id="llf_grav_base_val"></span></div>
                <input id="llf_grav_base" class="llf-input" type="range" min="0" max="200" step="5" />
                <div class="llf-row"><span class="llf-label">Gravity depth-scale</span><span class="llf-val" id="llf_grav_depth_val"></span></div>
                <input id="llf_grav_depth" class="llf-input" type="range" min="0" max="300" step="5" />
                <div class="llf-row"><span class="llf-label">MinDown base</span><span class="llf-val" id="llf_min_base_val"></span></div>
                <input id="llf_min_base" class="llf-input" type="range" min="0" max="120" step="5" />
                <div class="llf-row"><span class="llf-label">MinDown depth-scale</span><span class="llf-val" id="llf_min_depth_val"></span></div>
                <input id="llf_min_depth" class="llf-input" type="range" min="0" max="200" step="5" />
                <div class="llf-row"><span class="llf-label">Max vy cap</span><span class="llf-val" id="llf_vy_cap_val"></span></div>
                <input id="llf_vy_cap" class="llf-input" type="range" min="60" max="300" step="10" />
                <div class="llf-row"><span class="llf-label">Slide gain max</span><span class="llf-val" id="llf_slide_gain_val"></span></div>
                <input id="llf_slide_gain" class="llf-input" type="range" min="0" max="2" step="0.05" />
                <div class="llf-row"><span class="llf-label">Slide clamp</span><span class="llf-val" id="llf_slide_clamp_val"></span></div>
                <input id="llf_slide_clamp" class="llf-input" type="range" min="10" max="120" step="5" />
                <div class="llf-row"><span class="llf-label">Jitter magnitude</span><span class="llf-val" id="llf_jitter_val"></span></div>
                <input id="llf_jitter" class="llf-input" type="range" min="0" max="40" step="1" />
                <div class="llf-row"><span class="llf-label">Stuck vy threshold</span><span class="llf-val" id="llf_stuck_vy_val"></span></div>
                <input id="llf_stuck_vy" class="llf-input" type="range" min="0" max="40" step="1" />
                <div class="llf-row"><span class="llf-label">Nudge down (when stuck)</span><span class="llf-val" id="llf_nudge_val"></span></div>
                <input id="llf_nudge" class="llf-input" type="range" min="0" max="80" step="2" />
                <div class="llf-row"><span class="llf-label">Collision passes</span><span class="llf-val" id="llf_coll_pass_val"></span></div>
                <input id="llf_coll_pass" class="llf-input" type="range" min="0" max="8" step="1" />
                <div class="llf-row"><span class="llf-label">Floor friction</span><span class="llf-val" id="llf_floor_fric_val"></span></div>
                <input id="llf_floor_fric" class="llf-input" type="range" min="0.70" max="0.99" step="0.01" />
                <div class="llf-row"><span class="llf-label">Init vx factor</span><span class="llf-val" id="llf_init_vx_val"></span></div>
                <input id="llf_init_vx" class="llf-input" type="range" min="0" max="0.5" step="0.01" />
                <div class="llf-row"><span class="llf-label">Init vy min</span><span class="llf-val" id="llf_init_vymin_val"></span></div>
                <input id="llf_init_vymin" class="llf-input" type="range" min="10" max="200" step="5" />
                <div class="llf-row"><span class="llf-label">Init vy max</span><span class="llf-val" id="llf_init_vymax_val"></span></div>
                <input id="llf_init_vymax" class="llf-input" type="range" min="20" max="300" step="5" />
              </div>
            </details>
          </div>
          <div class="llf-card llf-pad">
            <div class="llf-row" style="gap:10px;">
              <button id="llf_reset" class="llf-btn">Reset</button>
            </div>
          </div>
        </div>

        <div class="llf-tip" style="margin-top:8px;">
          Tip: Lower \\( p \\) while keeping \\( N \\) fixed → backlog \\( L \\) becomes \\( \\approx N/p \\); throughput stays near arrivals once the system settles. With finite \\( T \\), the acceptance fraction is \\( 1-(1-p)^T \\approx 1-e^{-pT} \\); the quantity that actually matters is the product \\( p \\cdot T \\).
        </div>
      </div>
    </div>
  `;
  mount.appendChild(wrap);
  // Trigger MathJax to typeset math in newly inserted labels (if available)
  try {
    if (window.MathJax && window.MathJax.typesetPromise) {
      window.MathJax.typesetPromise([wrap]);
    } else if (window.MathJax && window.MathJax.typeset) {
      window.MathJax.typeset([wrap]);
    }
  } catch (e) {}

  // Element refs
  const el = id => wrap.querySelector('#' + id);
  const canvas = el('llf_canvas');
  const NInput = el('llf_N');
  const pInput = el('llf_p');
  const TInput = el('llf_T');
  const infInput = el('llf_inf');
  const cpsInput = el('llf_cps');
  const scaleAutoInput = el('llf_scale_auto');
  // Advanced controls
  const gravBaseInput = el('llf_grav_base');
  const gravDepthInput = el('llf_grav_depth');
  const minBaseInput = el('llf_min_base');
  const minDepthInput = el('llf_min_depth');
  const vyCapInput = el('llf_vy_cap');
  const slideGainInput = el('llf_slide_gain');
  const slideClampInput = el('llf_slide_clamp');
  const jitterInput = el('llf_jitter');
  const stuckVyInput = el('llf_stuck_vy');
  const nudgeInput = el('llf_nudge');
  const collPassInput = el('llf_coll_pass');
  const floorFricInput = el('llf_floor_fric');
  const initVxInput = el('llf_init_vx');
  const initVyMinInput = el('llf_init_vymin');
  const initVyMaxInput = el('llf_init_vymax');

  const NVal = el('llf_N_val');
  const pVal = el('llf_p_val');
  const TVal = el('llf_T_val');
  const cpsVal = el('llf_cps_val');

  const WCalls = el('llf_W_calls');
  const Qacc = el('llf_qacc');

  const LMeas = el('llf_L_meas');
  const LTarget = el('llf_L_target');
  const LambdaAccMeas = el('llf_lambda_acc_meas');
  const LambdaAccTarget = el('llf_lambda_acc_target');
  const LambdaAbnMeas = el('llf_lambda_abn_meas');
  const LambdaAbnTarget = el('llf_lambda_abn_target');
  const DotsInPool = el('llf_dots_in_pool');
  const DotsTargetHint = el('llf_dots_target_hint');
  const Scale = el('llf_scale');

  const TCard = el('llf_T_card');
  const QaccCard = el('llf_qacc_stat_card');
  const LambdaAbnMeasCard = el('llf_lambda_abn_meas_card');
  const LambdaAbnTargetCard = el('llf_lambda_abn_target_card');
  const resetBtn = el('llf_reset');
  // Advanced labels
  const gravBaseVal = el('llf_grav_base_val');
  const gravDepthVal = el('llf_grav_depth_val');
  const minBaseVal = el('llf_min_base_val');
  const minDepthVal = el('llf_min_depth_val');
  const vyCapVal = el('llf_vy_cap_val');
  const slideGainVal = el('llf_slide_gain_val');
  const slideClampVal = el('llf_slide_clamp_val');
  const jitterVal = el('llf_jitter_val');
  const stuckVyVal = el('llf_stuck_vy_val');
  const nudgeVal = el('llf_nudge_val');
  const collPassVal = el('llf_coll_pass_val');
  const floorFricVal = el('llf_floor_fric_val');
  const initVxVal = el('llf_init_vx_val');
  const initVyMinVal = el('llf_init_vymin_val');
  const initVyMaxVal = el('llf_init_vymax_val');
  const rescaleBtn = el('llf_rescale');

  // State
  const state = {
    N: 2000,
    p: 0.25,
    T: 6,
    infinitePatience: false,
    callsPerSec: 1.0,
    autoScaleDots: false,
    fixedK: 25, // default papers per dot when autoScaleDots is false
    // Physics tunables (defaults)
    gravBase: 60,
    gravDepth: 160,
    minDownBase: 30,
    minDownDepth: 90,
    vyCap: 150,
    slideGainMax: 0.8,
    slideClamp: 40,
    jitterMag: 12,
    stuckVy: 8,
    nudgeDown: 22,
    collisionPasses: 5,
    floorFriction: 0.9,
    initVxFactor: 0.15,
    initVyMin: 70,
    initVyMax: 110
  };

  // Canvas and simulation
  let ctx = canvas.getContext('2d');
  const particles = [];
  let lastId = 1;
  const accHistory = [];
  const abnHistory = [];
  const windowCalls = 30;
  let lastTs = null;
  let callTimer = 0;
  let dims = { w: 900, h: 520 };

  function computeDims() {
    const maxW = Math.min(window.innerWidth - 40, 1000);
    const w = clamp(maxW, 680, 1000);
    const h = Math.round(w * 0.58);
    dims = { w, h };
    canvas.width = w;
    canvas.height = h;
  }
  computeDims();
  window.addEventListener('resize', computeDims);

  function theory() {
    const N = state.N, p = state.p, T = state.T, infinite = state.infinitePatience;

    const W_inf = 1 / p;
    const L_inf = N / p;

    const qAcc_T = 1 - Math.pow(1 - p, T);
    const W_all_T = (1 - Math.pow(1 - p, T)) / p;
    const L_all_T = N * W_all_T;
    const lambda_acc_T = N * qAcc_T;
    const lambda_abn_T = N * Math.pow(1 - p, T);

    const L_papers = infinite ? L_inf : L_all_T;
    const W_calls  = infinite ? W_inf : W_all_T;
    const lambdaAcc = infinite ? N : lambda_acc_T;
    const lambdaAbn = infinite ? 0 : lambda_abn_T;

    // Dot scaling: either fixed papers-per-dot or auto target ~200 dots
    let k;
    if (state.autoScaleDots) {
      const capDots = 200;
      k = Math.max(1, Math.ceil(L_papers / capDots));
    } else {
      k = Math.max(1, Math.round(state.fixedK));
    }
    const targetDotsPerCall = Math.max(1, Math.round(N / k));
    const L_dots_target = L_papers / k;

    return { W_calls, L_papers, L_dots_target, k, lambdaAcc, lambdaAbn, targetDotsPerCall, qAcc_T };
  }

  function reset() {
    particles.length = 0;
    accHistory.length = 0;
    abnHistory.length = 0;
    lastId = 1;
    lastTs = null;
    callTimer = 0;
  }

  function geom() {
    const { w, h } = dims;
    const topY = Math.max(100, h * 0.14);
    const botY = h * 0.78; // raise funnel bottom to create more space below the spout
    const centerX = w * 0.5;
    const topHalfWidth = w * 0.30;
    const botHalfWidth = w * 0.06;

    const funnelLeftAt = (y) => {
      const t = clamp((y - topY) / (botY - topY), 0, 1);
      const half = topHalfWidth + t * (botHalfWidth - topHalfWidth);
      return centerX - half;
    };
    const funnelRightAt = (y) => {
      const t = clamp((y - topY) / (botY - topY), 0, 1);
      const half = topHalfWidth + t * (botHalfWidth - topHalfWidth);
      return centerX + half;
    };

    const spoutY = botY + h * 0.025;
    const spoutHalf = botHalfWidth * 0.55;
    const spoutLeft = centerX - spoutHalf;
    const spoutRight = centerX + spoutHalf;

    return { w, h, topY, botY, centerX, funnelLeftAt, funnelRightAt, spoutY, spoutLeft, spoutRight };
  }

  function spawnNewDots(countDots, g) {
    for (let i = 0; i < countDots; i++) {
      const y0 = g.topY - rand(10, 50);
      const left = g.funnelLeftAt(g.topY);
      const right = g.funnelRightAt(g.topY);
      const x0 = rand(left + 6, right - 6);
      const cx0 = (g.spoutLeft + g.spoutRight) / 2;
      const dx0 = cx0 - x0;
      // Give dots a helpful initial push downward and slightly toward the spout
      const initVx = clamp(dx0 * state.initVxFactor + rand(-8, 8), -60, 60);
      const initVy = rand(state.initVyMin, state.initVyMax);
      particles.push({
        id: lastId++,
        x: x0,
        y: y0,
        vx: initVx,
        vy: initVy,
        ageCalls: 0,
        status: 'pool',
        alpha: 1,
        side: null
      });
    }
  }

  function performCallStep(th) {
    let acc = 0, abn = 0;
    for (const ptl of particles) {
      if (ptl.status !== 'pool') continue;

      if (Math.random() < state.p) {
        ptl.status = 'accepted';
        ptl.vx = 0; // start neutral; steering will guide into the spout
        ptl.vy = rand(160, 210);
        acc++;
        continue;
      }

      ptl.ageCalls += 1;

      if (!state.infinitePatience && ptl.ageCalls >= state.T) {
        ptl.status = 'abandoned';
        ptl.side = Math.random() < 0.5 ? 'left' : 'right';
        ptl.vx = (ptl.side === 'left' ? -1 : 1) * rand(140, 200);
        ptl.vy = rand(-20, 20);
        abn++;
      }
    }

    accHistory.push(acc);
    abnHistory.push(abn);
    if (accHistory.length > windowCalls) accHistory.shift();
    if (abnHistory.length > windowCalls) abnHistory.shift();

    const dotsToSpawn = Math.max(1, Math.round(state.N / th.k));
    spawnNewDots(dotsToSpawn, geom());
  }

  function drawBackground(g, th) {
    ctx.clearRect(0, 0, g.w, g.h);

    const bg = ctx.createLinearGradient(0, 0, 0, g.h);
    bg.addColorStop(0, '#ffffff');
    bg.addColorStop(1, '#f6f7fb');
    ctx.fillStyle = bg;
    ctx.fillRect(0, 0, g.w, g.h);

    // Funnel body
    ctx.save();
    ctx.beginPath();
    ctx.moveTo(g.funnelLeftAt(g.topY), g.topY);
    ctx.lineTo(g.funnelLeftAt(g.botY), g.botY);
    ctx.lineTo(g.funnelRightAt(g.botY), g.botY);
    ctx.lineTo(g.funnelRightAt(g.topY), g.topY);
    ctx.closePath();
    ctx.fillStyle = '#eef2ff';
    ctx.strokeStyle = '#c7d2fe';
    ctx.lineWidth = 2;
    ctx.fill();
    ctx.stroke();

    // Spout
    ctx.beginPath();
    ctx.moveTo(g.spoutLeft, g.botY);
    ctx.lineTo(g.spoutLeft, g.spoutY);
    ctx.lineTo(g.spoutRight, g.spoutY);
    ctx.lineTo(g.spoutRight, g.botY);
    ctx.closePath();
    ctx.fillStyle = '#e0e7ff';
    ctx.strokeStyle = '#c7d2fe';
    ctx.fill();
    ctx.stroke();

    // Side exit guides (commented out; purely visual)
    // ctx.strokeStyle = '#fecaca';
    // ctx.setLineDash([6, 6]);
    // ctx.lineWidth = 2;
    // const midY = (g.topY + g.botY) / 2;
    // ctx.beginPath();
    // ctx.moveTo(g.funnelLeftAt(midY), midY);
    // ctx.lineTo(20, midY);
    // ctx.stroke();
    // ctx.beginPath();
    // ctx.moveTo(g.funnelRightAt(midY), midY);
    // ctx.lineTo(g.w - 20, midY);
    // ctx.stroke();
    // ctx.setLineDash([]);

    // Title and legend
    ctx.fillStyle = '#111827';
    ctx.font = '600 18px ui-sans-serif, system-ui, -apple-system';
    ctx.fillText("Little's Law Funnel — L = λ·W", 18, 28);

    ctx.font = '12px ui-sans-serif, system-ui, -apple-system';
    ctx.fillStyle = '#374151';
    ctx.fillText(`1 dot = ${th.k} papers`, 18, 48);
    ctx.fillStyle = '#10b981';
    ctx.fillRect(18, 58, 10, 10); ctx.fillStyle = '#374151'; ctx.fillText('Accepted', 34, 67);
    ctx.fillStyle = '#ef4444';
    ctx.fillRect(120, 58, 10, 10); ctx.fillStyle = '#374151'; ctx.fillText('Abandoned', 136, 67);
    ctx.fillStyle = '#2563eb';
    ctx.fillRect(230, 58, 10, 10); ctx.fillStyle = '#374151'; ctx.fillText('In system (pool)', 246, 67);

    ctx.restore();
  }

  function updateUI(th) {
    NVal.textContent = fmt0.format(state.N) + ' papers';
    pVal.textContent = fmt.format(state.p);
    TVal.textContent = state.T;
    cpsVal.textContent = fmt.format(state.callsPerSec);

    WCalls.textContent = fmt.format(th.W_calls);
    Qacc.textContent = fmt.format(th.qAcc_T);

    LTarget.textContent = fmt0.format(th.L_papers);
    DotsTargetHint.textContent = `(target ≈ ${fmt0.format(th.L_dots_target)})`;
    Scale.textContent = `1 dot = ${fmt0.format(th.k)} papers`;

    if (state.infinitePatience) {
      TInput.disabled = true;
      TCard.style.opacity = 0.6;
      QaccCard.style.display = 'none';
      LambdaAbnMeasCard.style.display = 'none';
      LambdaAbnTargetCard.style.display = 'none';
    } else {
      TInput.disabled = false;
      TCard.style.opacity = 1;
      QaccCard.style.display = '';
      LambdaAbnMeasCard.style.display = '';
      LambdaAbnTargetCard.style.display = '';
    }
  }

  function updateMeasured(th) {
    const L_dots_measured = particles.filter(p => p.status === 'pool').length;
    const L_papers_measured = L_dots_measured * th.k;

    const accDots = accHistory.reduce((a,b)=>a+b,0) / Math.max(1, accHistory.length);
    const abnDots = abnHistory.reduce((a,b)=>a+b,0) / Math.max(1, abnHistory.length);

    const lambda_acc_meas_papers = accDots * th.k;
    const lambda_abn_meas_papers = abnDots * th.k;

    LMeas.textContent = fmt0.format(L_papers_measured);
    LambdaAccMeas.textContent = fmt0.format(lambda_acc_meas_papers);
    LambdaAccTarget.textContent = fmt0.format(th.lambdaAcc);
    LambdaAbnMeas.textContent = fmt0.format(lambda_abn_meas_papers);
    LambdaAbnTarget.textContent = fmt0.format(th.lambdaAbn);
    DotsInPool.textContent = fmt0.format(L_dots_measured);
  }

  function step(ts) {
    if (lastTs == null) lastTs = ts;
    const dt = (ts - lastTs) / 1000;
    lastTs = ts;

    callTimer += dt;
    const interval = 1 / state.callsPerSec;
    const th = theory();
    const g = geom();

    while (callTimer >= interval) {
      performCallStep(th);
      callTimer -= interval;
    }

    // Physics
    for (const ptl of particles) {
      const jitter = rand(-state.jitterMag, state.jitterMag) * dt;
      ptl.vx += jitter;

      ptl.x += ptl.vx * dt;
      ptl.y += ptl.vy * dt;

      if (ptl.status === 'pool') {
        const yClamped = clamp(ptl.y, g.topY + 2, g.botY - 2);
        const left = g.funnelLeftAt(yClamped) + 6;
        const right = g.funnelRightAt(yClamped) - 6;
        // Lateral drift toward the spout midpoint; stronger lower in the funnel to promote sliding
        const cx = (g.spoutLeft + g.spoutRight) / 2;
        const toward = cx - ptl.x;
        const slideGain = state.slideGainMax * ((ptl.y - g.topY) / Math.max(1, g.botY - g.topY));
        ptl.vx += clamp(toward * slideGain, -state.slideClamp, state.slideClamp) * dt;
        // If nearly stuck mid-funnel, add a tiny jitter and nudge downward
        if (Math.abs(ptl.vy) < state.stuckVy && ptl.y > (g.topY + (g.botY - g.topY) * 0.25)) {
          ptl.vx += (Math.random() - 0.5) * state.jitterMag * dt;
          ptl.vy += state.nudgeDown * dt;
        }
        if (ptl.x < left) { ptl.x = left; ptl.vx *= -0.4; }
        if (ptl.x > right) { ptl.x = right; ptl.vx *= -0.4; }
        ptl.y = yClamped;
        // Gravity-like pull that increases with depth to encourage sliding and faster descent
        const depth = (ptl.y - g.topY) / Math.max(1, (g.botY - g.topY));
        const grav = state.gravBase + state.gravDepth * depth;      // stronger acceleration lower in funnel
        const minDown = state.minDownBase + state.minDownDepth * depth;    // enforce a slightly higher minimum downward speed
        ptl.vy = clamp(ptl.vy + grav * dt, minDown, state.vyCap);
      } else if (ptl.status === 'accepted') {
        // Guide toward spout center and keep inside funnel walls until reaching the spout
        const cx = (g.spoutLeft + g.spoutRight) / 2;
        const dx = cx - ptl.x;
        // Stronger lateral steering toward center
        ptl.vx += clamp(dx * 1.5, -180, 180) * dt;

        // Accelerate downward
        ptl.vy = clamp(ptl.vy + 240 * dt, 140, 320);

        // Keep accepted dots inside funnel body
        if (ptl.y < g.botY) {
          const left = g.funnelLeftAt(ptl.y) + 4;
          const right = g.funnelRightAt(ptl.y) - 4;
          if (ptl.x < left) { ptl.x = left; ptl.vx = Math.max(0, ptl.vx); }
          if (ptl.x > right) { ptl.x = right; ptl.vx = Math.min(0, ptl.vx); }
        } else {
          // Inside spout: clamp to spout corridor
          if (ptl.x < g.spoutLeft + 2) { ptl.x = g.spoutLeft + 2; ptl.vx = Math.max(0, ptl.vx); }
          if (ptl.x > g.spoutRight - 2) { ptl.x = g.spoutRight - 2; ptl.vx = Math.min(0, ptl.vx); }
        }

        if (ptl.y > g.h + 30) ptl.alpha = 0;
      } else if (ptl.status === 'abandoned') {
        ptl.vx *= 0.995;
        ptl.alpha -= 0.6 * dt;
        if (ptl.alpha <= 0 || ptl.x < -40 || ptl.x > g.w + 40) ptl.alpha = 0;
      }
    }

    // Resolve collisions among pool particles so they occupy space near the bottom
    resolvePoolCollisions(g);

    // Remove dead
    for (let i = particles.length - 1; i >= 0; i--) {
      if (particles[i].alpha <= 0) particles.splice(i, 1);
    }

    // Draw
    drawBackground(g, th);
    for (const ptl of particles) {
      const r = DOT_RADIUS;
      if (ptl.status === 'pool') ctx.fillStyle = 'rgba(37, 99, 235, 0.95)';
      else if (ptl.status === 'accepted') ctx.fillStyle = 'rgba(16, 185, 129, 0.95)';
      else ctx.fillStyle = `rgba(239, 68, 68, ${clamp(ptl.alpha, 0, 1)})`;
      ctx.beginPath();
      ctx.arc(ptl.x, ptl.y, r, 0, Math.PI * 2);
      ctx.fill();
    }

    updateUI(th);
    updateMeasured(th);

    requestAnimationFrame(step);
  }

  // Initialize inputs and events

  NInput.value = state.N;
  pInput.value = state.p;
  TInput.value = state.T;
  infInput.checked = state.infinitePatience;
  cpsInput.value = state.callsPerSec;
  scaleAutoInput.checked = state.autoScaleDots;
  // Advanced defaults -> inputs
  gravBaseInput.value = state.gravBase;
  gravDepthInput.value = state.gravDepth;
  minBaseInput.value = state.minDownBase;
  minDepthInput.value = state.minDownDepth;
  vyCapInput.value = state.vyCap;
  slideGainInput.value = state.slideGainMax;
  slideClampInput.value = state.slideClamp;
  jitterInput.value = state.jitterMag;
  stuckVyInput.value = state.stuckVy;
  nudgeInput.value = state.nudgeDown;
  collPassInput.value = state.collisionPasses;
  floorFricInput.value = state.floorFriction;
  initVxInput.value = state.initVxFactor;
  initVyMinInput.value = state.initVyMin;
  initVyMaxInput.value = state.initVyMax;

  NInput.addEventListener('input', () => { state.N = parseInt(NInput.value, 10); });
  pInput.addEventListener('input', () => { state.p = parseFloat(pInput.value); });
  TInput.addEventListener('input', () => { state.T = parseInt(TInput.value, 10); });
  infInput.addEventListener('change', () => { state.infinitePatience = infInput.checked; });
  cpsInput.addEventListener('input', () => { state.callsPerSec = parseFloat(cpsInput.value); });
  scaleAutoInput.addEventListener('change', () => { state.autoScaleDots = scaleAutoInput.checked; });
  rescaleBtn.addEventListener('click', () => { /* no-op: next theory() call recomputes k; keep for UX */ });
  // Advanced listeners
  const syncVal = (input, label, parser) => {
    const val = parser ? parser(input.value) : parseFloat(input.value);
    label.textContent = String(val);
    return val;
  };
  const setAndLabel = (key, input, label, parser) => {
    const set = () => { state[key] = syncVal(input, label, parser); };
    input.addEventListener('input', set);
    set();
  };
  setAndLabel('gravBase', gravBaseInput, gravBaseVal);
  setAndLabel('gravDepth', gravDepthInput, gravDepthVal);
  setAndLabel('minDownBase', minBaseInput, minBaseVal);
  setAndLabel('minDownDepth', minDepthInput, minDepthVal);
  setAndLabel('vyCap', vyCapInput, vyCapVal);
  setAndLabel('slideGainMax', slideGainInput, slideGainVal);
  setAndLabel('slideClamp', slideClampInput, slideClampVal);
  setAndLabel('jitterMag', jitterInput, jitterVal);
  setAndLabel('stuckVy', stuckVyInput, stuckVyVal);
  setAndLabel('nudgeDown', nudgeInput, nudgeVal);
  setAndLabel('collisionPasses', collPassInput, collPassVal, v => parseInt(v, 10));
  setAndLabel('floorFriction', floorFricInput, floorFricVal);
  setAndLabel('initVxFactor', initVxInput, initVxVal);
  setAndLabel('initVyMin', initVyMinInput, initVyMinVal, v => parseInt(v, 10));
  setAndLabel('initVyMax', initVyMaxInput, initVyMaxVal, v => parseInt(v, 10));

  resetBtn.addEventListener('click', reset);

  // Kick off
  requestAnimationFrame(step);
})();
 </script>

<h2 id="details-and-further-discussion">Details and Further Discussion</h2>

<p>We will now discuss the above model in more detail and provide additional insights. At the very end we also present a proof of Little’s law.</p>

<h3 id="assumptions">Assumptions</h3>

<p>We make the assumptions from above more precise.</p>

<p><strong>Time unit:</strong> one conference call (one submission cycle).</p>

<p><strong>Arrivals:</strong> each call injects $N$ new submissions, hence the arrival rate per call is $\lambda=N$.</p>

<p><strong>Completion (“service”):</strong> each paper present in a call independently completes (i.e., gets a final accept/abandon decision) with hazard $p$ in that call.</p>
<ul>
  <li>“No giving up” = a paper continues until accepted;</li>
  <li>“Finite patience” = a paper is abandoned after $T$ unsuccessful calls.</li>
</ul>

<p><strong>State (queue length):</strong> Let $L_t$ denote the number of papers currently “in the system” (awaiting final disposition). Then the <em>stock–flow equation</em> is given by</p>

\[L_{t+1}=(1-p)L_t+N,\]

<p>which is exactly a discrete-time queue with batch arrivals $N$ and per-step departure probability $p$.</p>

<h3 id="no-give-up-world-geometric-time-in-system">No-give-up world: geometric time in system</h3>

<p>If authors never give up, each paper’s time in system is distributed as <em>Geometric($p$)</em> on $\{1,2,\dots\}$ with mean</p>

\[W=\mathbb{E}[G]=\frac{1}{p}\quad \text{(calls)}.\]

<p>With $\lambda=N$ per call, Little’s law yields</p>

\[L=\lambda W = N\cdot\frac{1}{p}=\frac{N}{p}.\]

<p>As such, steady-state <em>throughput (accepted papers per call)</em> equals the <em>arrival rate</em>:</p>

\[\text{accepted per call}=\lambda_{\text{acc}}=\lambda=N.\]
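<p>As a quick numerical sanity check, one can iterate the stock–flow recursion from above directly; it converges to the fixed point $N/p$ predicted by Little’s law. The sketch below is a minimal illustration with made-up values for $N$ and $p$:</p>

```javascript
// Iterate the stock-flow recursion L_{t+1} = (1 - p) * L_t + N and
// compare the result with the Little's-law prediction L = N / p.
// N and p are illustrative values, not calibrated to any real venue.
function steadyStatePool(N, p, steps = 1000) {
  let L = 0; // start with an empty pool
  for (let t = 0; t < steps; t++) {
    L = (1 - p) * L + N; // a (1 - p) fraction stays, N new submissions arrive
  }
  return L;
}

const N = 100;  // submissions per call
const p = 0.25; // per-call acceptance hazard
console.log(steadyStatePool(N, p).toFixed(2)); // approaches N / p = 400
```

<p>Since the map has contraction factor $1-p$, the pool converges geometrically fast to $N/p$ from any starting value.</p>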

<h3 id="finite-patience-t-truncated-geometric--subflows">Finite patience $T$: truncated geometric &amp; subflows</h3>

<p>Let each paper give up after $T$ calls if not accepted. Now there are two exit modes: <em>accept</em> or <em>abandon</em>.</p>

<p><strong>Probability of eventual acceptance.</strong> Let us first compute the probability that a paper is eventually <em>accepted</em> (within $T$ calls):</p>

\[q_{\text{acc}}=1-(1-p)^T.\]

<p><strong>All-exits.</strong> Next we compute the time in system irrespective of the exit mode (accept or abandon):</p>

\[W_{\text{all}}=\mathbb{E}[\min(G,T)]
=\sum_{k=1}^{T}\Pr(G\ge k)=\frac{1-(1-p)^T}{p}.\]

<p>Now we first apply Little’s law to the <em>whole system</em> (all exit modes), to obtain</p>

\[L_{\text{all}}=\lambda\,W_{\text{all}}
=N\cdot\frac{1-(1-p)^T}{p}.\]

<p>And next, we apply Little’s law to the <em>subflows</em> arising from the exit modes <em>accept</em> and <em>abandon</em>:</p>

<p><strong>Accepted subflow.</strong></p>

<p><em>Throughput</em> is given by</p>

\[\lambda_{\text{acc}}=N\,q_{\text{acc}}=N\big[1-(1-p)^T\big].\]

<p><em>Conditional mean time given acceptance</em> is given by</p>

\[W_{\text{acc}}=\mathbb{E}[G\mid G\le T]
=\frac{1}{p}-\frac{T(1-p)^T}{1-(1-p)^T}.\]

<p>Population of papers currently in-system that are <em>destined to be accepted</em>:</p>

\[L_{\text{acc}}=\lambda_{\text{acc}}\,W_{\text{acc}}
=N\!\left(\frac{1-(1-p)^T}{p}-T(1-p)^T\right).\]

<p><strong>Abandonment subflow.</strong></p>

<p><em>Throughput</em> is given by</p>

\[\lambda_{\text{abn}}=N(1-p)^T.\]

<p><em>Time in system</em> is given by</p>

\[W_{\text{abn}}=T.\]

<p>Population of papers currently in-system that are <em>destined to be abandoned</em>:</p>

\[L_{\text{abn}}=\lambda_{\text{abn}}\,W_{\text{abn}}=N\,T(1-p)^T.\]

<p><strong>Check.</strong> $L_{\text{acc}}+L_{\text{abn}}=L_{\text{all}}$.</p>
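<p>These closed-form expressions are easy to verify numerically. The following sketch (all parameter values are illustrative) computes the subflow quantities and checks the accounting identity $L_{\text{acc}}+L_{\text{abn}}=L_{\text{all}}$:</p>

```javascript
// Finite-patience quantities for arrival rate N, hazard p, patience T.
// Returns the throughputs, mean times, and pool sizes derived above;
// all parameter values below are illustrative.
function finitePatience(N, p, T) {
  const s = Math.pow(1 - p, T);        // P(not accepted within T calls)
  const qAcc = 1 - s;                  // eventual acceptance probability
  const Wall = qAcc / p;               // E[min(G, T)]
  const Lall = N * Wall;               // Little's law, whole system
  const lamAcc = N * qAcc;             // accepted papers per call
  const Wacc = 1 / p - (T * s) / qAcc; // E[G | G <= T]
  const Lacc = lamAcc * Wacc;          // pool destined to be accepted
  const lamAbn = N * s;                // abandoned papers per call
  const Labn = lamAbn * T;             // pool destined to be abandoned
  return { qAcc, Wall, Lall, lamAcc, Wacc, Lacc, lamAbn, Labn };
}

const q = finitePatience(100, 0.25, 8); // illustrative values
console.log(q.Lall.toFixed(2), (q.Lacc + q.Labn).toFixed(2)); // both approx. 359.95
```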

<p><strong>Observation.</strong> Increasing $p$ modestly boosts $\lambda_{\text{acc}}$ (since $1-(1-p)^T$ saturates), but it <em>significantly</em> lowers $W_{\text{all}}$ and hence $L_{\text{all}}$: far fewer papers sitting in the pool at any time.</p>

<h3 id="quality-buckets-class-dependent-hazards-and-patience">Quality buckets: class-dependent hazards and patience</h3>

<p>If submissions are split into <em>classes</em> (e.g., “great/average/bad”) and decisions prioritize higher quality, you have <em>class-dependent hazards</em> $p_i$ and potentially <em>class-dependent patience</em> $T_i$. Let $\pi_i$ denote the arrival share of class $i$, so per call</p>

\[\lambda_i = N\,\pi_i \quad (\text{arrivals of class } i).\]

<p><strong>Class-wise Little’s law.</strong> For each stable class $i$,</p>

\[L_i = \lambda_i\,W_i,\]

<p>where $W_i$ is the mean time in system for class $i$.</p>

<p><strong>Finite patience (class $i$).</strong> If class $i$ abandons after $T_i$ calls with per-call accept hazard $p_i$, then</p>

\[W^{\text{all}}_{i}=\frac{1-(1-p_i)^{T_i}}{p_i},\qquad
L^{\text{all}}_{i}=\lambda_i\,W^{\text{all}}_{i}=\lambda_i\,\frac{1-(1-p_i)^{T_i}}{p_i}.\]

<p><strong>Accepted throughput (class $i$).</strong> The per-call accepted rate is</p>

\[\lambda^{\text{acc}}_{i}=\lambda_i\Big(1-(1-p_i)^{T_i}\Big).\]

<p><strong>Mixes (shares).</strong> The backlog and accepted shares of class $i$ are</p>

\[\text{BacklogShare}_i=\frac{L^{\text{all}}_i}{\sum_j L^{\text{all}}_j}
=\frac{\lambda_i\frac{1-(1-p_i)^{T_i}}{p_i}}{\sum_j \lambda_j\frac{1-(1-p_j)^{T_j}}{p_j}},\]

<p>and</p>

\[\text{AcceptShare}_i=\frac{\lambda^{\text{acc}}_{i}}{\sum_j \lambda^{\text{acc}}_{j}}
=\frac{\lambda_i\left[1-(1-p_i)^{T_i}\right]}{\sum_j \lambda_j\left[1-(1-p_j)^{T_j}\right]}.\]

<p><strong>Special case (equal hazard).</strong> If $p_i\equiv p$ for all classes, then for two classes $g, b$ with $\pi_g+\pi_b=1$,</p>

\[\text{AcceptShare}_g
=\frac{\pi_g\left[1-(1-p)^{T_g}\right]}
       {\pi_g\left[1-(1-p)^{T_g}\right]+\pi_b\left[1-(1-p)^{T_b}\right]},\]

<p>and $\text{BacklogShare}_g = \text{AcceptShare}_g$.</p>

<p><strong>Interpretation.</strong> Because $\sum_i \lambda_i = N$ per call is fixed, reweighting hazards mainly shifts the <em>composition</em> of acceptances; total throughput changes little, while total <em>backlog</em> still scales with the average hazard via $W$.</p>
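<p>A small sketch (with made-up class shares, hazards, and patience values) computes these mixes and illustrates the interpretation: reweighting mostly shifts the composition, and with equal hazards the backlog and accept shares coincide as shown above.</p>

```javascript
// Class-wise backlog and acceptance shares; all parameter values are
// illustrative (two classes with equal hazard but different patience).
function classMixes(N, pi, p, T) {
  const lam = pi.map(share => N * share);                     // lambda_i
  const qAcc = p.map((pv, i) => 1 - Math.pow(1 - pv, T[i]));  // accept prob.
  const Wall = qAcc.map((qv, i) => qv / p[i]);                // W_i^all
  const Lall = lam.map((l, i) => l * Wall[i]);                // L_i^all
  const lamAcc = lam.map((l, i) => l * qAcc[i]);              // accepted rate
  const sum = a => a.reduce((x, y) => x + y, 0);
  return {
    backlogShare: Lall.map(v => v / sum(Lall)),
    acceptShare: lamAcc.map(v => v / sum(lamAcc)),
    totalAccepted: sum(lamAcc),
  };
}

// Two classes ("great", "bad"): equal hazard p = 0.2, patience 10 vs. 4.
const m = classMixes(100, [0.3, 0.7], [0.2, 0.2], [10, 4]);
console.log(m.acceptShare[0].toFixed(3)); // equals the backlog share here
```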

<h3 id="what-a-small-arrival-growth-g-does-to-the-pool">What a small arrival growth $g$ does to the pool</h3>

<p>Little’s law says the instantaneous pool level tracks a <em>windowed average</em> of recent arrivals with window length $\approx W$ (see proof below). Hence a sustained per-call growth $g$ in arrivals translates into an approximate <em>level</em> increase of</p>

\[\Delta L \;\approx\; (g \cdot \text{current arrivals}) \cdot W
\;\approx\; (gN)\cdot \frac{1}{p},\]

<p>thus as a rule-of-thumb we obtain:</p>

\[\frac{\Delta L}{L}\;\approx\; \frac{g}{p}\ \text{(over }\approx W\text{ calls)}.\]

<p><strong>Note.</strong> For precise dynamics with time-varying arrivals $N_t$, the exact solution is a geometrically weighted sum of recent $N_t$’s.</p>
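<p>The rule of thumb can be checked against the exact recursion with growing arrivals $N_t=N(1+g)^t$. The sketch below uses illustrative parameters; the tolerance reflects that the rule ignores higher-order terms in $g$:</p>

```javascript
// Exact pool dynamics L_{t+1} = (1 - p) L_t + N_t with arrivals growing
// by a factor (1 + g) per call; compares the relative pool growth over
// one window of length W = 1/p calls with the g/p rule of thumb.
// All parameter values are illustrative.
const N = 100, p = 0.25, g = 0.01;
const W = Math.round(1 / p); // window length in calls

let L = N / p; // start at the no-growth steady state
const trace = [];
for (let t = 0; t < 400; t++) {
  L = (1 - p) * L + N * Math.pow(1 + g, t);
  trace.push(L);
}

// Relative growth of the pool over the last window of W calls.
const growth = trace[trace.length - 1] / trace[trace.length - 1 - W] - 1;
console.log(growth.toFixed(4), (g / p).toFixed(4)); // close for small g
```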

<h3 id="converting-to-yearly-units">Converting to yearly units</h3>

<p>We considered calls as the time unit. However, we can easily convert/rescale to yearly units. If the field has $c$ calls/year:</p>

<ul>
  <li>Mean time in system:
    <ul>
      <li>$W_{\text{years}}=\dfrac{1}{pc}$ for the case of no-give-up</li>
      <li>$W_{\text{all,years}}=\dfrac{1-(1-p)^T}{pc}$ for the case of finite $T$.</li>
    </ul>
  </li>
  <li>Backlog level in
    <ul>
      <li><em>papers</em>: multiply $L$ above by $1$ (dimensionless)</li>
      <li><em>paper-years</em>: multiply by $1/c$.</li>
    </ul>
  </li>
</ul>
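<p>A tiny numerical example of the conversion (illustrative values $p=0.25$, $T=8$, and $c=4$ calls per year):</p>

```javascript
// Convert the mean time in system from calls to years for a field
// with c calls per year; p, T, and c are illustrative values.
const p = 0.25, T = 8, c = 4;

const wYearsNoGiveUp = 1 / (p * c);                       // no-give-up case
const wYearsFiniteT = (1 - Math.pow(1 - p, T)) / (p * c); // finite patience

console.log(wYearsNoGiveUp, wYearsFiniteT.toFixed(3)); // 1 year vs. about 0.9 years
```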

<h2 id="a-short-proof-of-littles-law">A short proof of Little’s law</h2>

<p>Let $Q(t)$ be the number of items in the system at time $t$. Let $a_k$ and $d_k$ be the <em>arrival time</em> and <em>departure time</em> of the $k$-th job, respectively, and let $W_k \doteq d_k-a_k$ be its <em>sojourn time</em> (time in system). We assume that the system is <strong>stable/stationary</strong> with long-run arrival rate $\lambda$ and finite mean sojourn time</p>

\[\tag{meanSojournTime}
W \doteq \mathbb E[W_k] &lt; \infty.\]

<p>Let</p>

<ul>
  <li>$A(T) \doteq \abs{\set{k: a_k\le T}}$ denote the number of arrivals by time $T$, and</li>
  <li>$D(T) \doteq \abs{\set{k: d_k\le T}}$ denote the number of departures by time $T$.</li>
</ul>

<p>We further assume <strong>rate stability</strong></p>

\[\tag{rateStability}
\lim_{T\to\infty}\tfrac{A(T)}{T}=\lim_{T\to\infty}\tfrac{D(T)}{T}=\lambda.\]

<p><strong>Area–interval identity.</strong> We start with a simple observation. Given any time $t$, let $Q(t) \doteq\sum_k \mathbf 1_{\set{a_k \le t &lt; d_k}}$ denote the number of jobs in the system at time $t$. We can now integrate and swap sum and integral to obtain:</p>

\[\int_0^T Q(t)\,dt
= \sum_k \int_0^T \mathbf 1_{\set{a_k \le t &lt; d_k}} \,dt
= \sum_k \abs{[0,T]\cap[a_k,d_k)}.\]

<p>Therefore, the “area” under $Q(\cdot)$ over $[0,T]$ equals the sum of the lengths of the intersections of $[0,T]$ with the jobs’ sojourn intervals. We can now split the jobs into three disjoint groups:</p>

<ol>
  <li>jobs that <em>arrive and depart</em> within $[0,T]$: contribute exactly $W_k$;</li>
  <li>jobs <em>present at time $0$</em> (arrived before $0$): contribute at most their remaining times inside $[0,T]$;</li>
  <li>jobs <em>still present at time $T$</em>: contribute at most the elapsed times since arrival up to $T$.</li>
</ol>

<p>Letting $S(T) \doteq \int_0^T Q(t)\,dt$, with the three groups we now have:</p>

\[\sum_{k:\,0\le a_k&lt;d_k\le T} W_k
\;\le\; S(T)
\;\le\; \sum_{k:\,a_k\le T} W_k \;+\; B_0 \;+\; B_T,\]

<p>where $B_0$ is the total residual time inside $[0,T]$ of the finitely many jobs present at $t=0$, and $B_T$ is the total elapsed time inside $[0,T]$ of the finitely many jobs present at $t=T$. Note that both $B_0, B_T$ are <em>finite constants</em>, i.e., they do not grow with $T$.</p>

<p><strong>Divide by $T$ and pass to limits.</strong> Similar to convergence proofs of, e.g., (stochastic) gradient descent, we now divide by $T$:</p>

\[\frac{1}{T}\sum_{k:\,0\le a_k&lt;d_k\le T} W_k
\;\le\; \frac{S(T)}{T}
\;\le\; \frac{1}{T}\sum_{k:\,a_k\le T} W_k \;+\; \frac{B_0+B_T}{T},\]

<p>and then take the limit as $T\to\infty$. In particular, as $T\to\infty$ we have $\tfrac{B_0+B_T}{T}\to 0$. Note that the rate at which this term vanishes also indicates how quickly the boundary effects wash out, i.e., how long the system takes to reach stationarity. Now for the sums:</p>

<p>The left sum counts <em>departures</em> within $[0,T]$. We rewrite it suggestively as</p>

\[\frac{D(T)}{T}\cdot \frac{1}{D(T)}\sum_{k=1}^{D(T)} W_k \;\xrightarrow[T\to\infty]{}\; \lambda\cdot W,\]

<p>by rate stability and the (ergodic/renewal-reward) law of large numbers for $\{W_k\}$.</p>

<p>The right sum counts <em>arrivals</em> by $T$ and it can similarly be rewritten as</p>

\[\frac{A(T)}{T}\cdot \frac{1}{A(T)}\sum_{k=1}^{A(T)} W_k \;\xrightarrow[T\to\infty]{}\; \lambda\cdot W.\]

<p>As such we obtain for $\lim_{T\to\infty}\frac{S(T)}{T}$ that</p>

\[\lim_{T\to\infty}\frac{1}{T}\int_0^T Q(t)\,dt \;=\; \lambda\,W,\]

<p>as it is sandwiched between the two limits from above.</p>

<p>Finally, observe that the left-hand side is the <em>time average</em> of the number of jobs in system:</p>

\[L \doteq \lim_{T\to\infty} \frac{1}{T}\int_0^T Q(t)\,dt.\]

<p>Therefore,</p>

\[L = \lambda W.\]
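<p>The identity can also be verified empirically. The sketch below simulates a discrete-time analogue of the queue (the paper-pool model from above: batch arrivals of size $N$, geometric completion with hazard $p$) and compares the time average of $Q$ with $\lambda \bar W$; the parameters are illustrative and a small seeded generator keeps the run reproducible:</p>

```javascript
// Monte Carlo check of L = lambda * W for the discrete-time paper pool:
// N papers arrive each call, and each paper present completes with
// probability p in that call. A small linear congruential generator
// makes the run deterministic and reproducible.
let seed = 12345;
function rand() {
  seed = (Math.imul(1664525, seed) + 1013904223) >>> 0;
  return seed / 4294967296;
}

const N = 5, p = 0.2, calls = 20000; // illustrative parameters
let papers = [];                     // arrival call of each paper in the system
let area = 0;                        // running sum of Q over calls
let sojournSum = 0, completed = 0;

for (let t = 0; t < calls; t++) {
  for (let j = 0; j < N; j++) papers.push(t); // batch arrival
  area += papers.length;                      // Q(t): papers present during call t
  const survivors = [];
  for (const a of papers) {
    if (rand() < p) { sojournSum += t - a + 1; completed += 1; }
    else survivors.push(a);
  }
  papers = survivors;
}

const Lhat = area / calls;           // time-average number in system
const What = sojournSum / completed; // sample mean sojourn time
console.log(Lhat.toFixed(2), (N * What).toFixed(2)); // both near N / p = 25
```

<p>Up to sampling noise, the time average $\hat L$ agrees with $\lambda \hat W$ (here $\lambda = N$), with no assumption on the service discipline, just as in the proof.</p>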

<p>A few remarks are in order.</p>

<p><strong>Remarks.</strong></p>

<ol>
  <li>No assumptions on service-time distribution, number of servers, or service order were used. We only needed stability and finiteness of the relevant limits.</li>
  <li>The same argumentation can be applied to <em>subflows</em> (e.g., “accepted only”, or splitting into “great/average/bad” papers), yielding $L_i=\lambda_i W_i$ for each stable class $i$.</li>
</ol>]]></content><author><name>David Martínez-Rubio x Sebastian Pokutta</name></author><category term="random" /><category term="conference" /><category term="review" /><category term="Little&apos;s Law" /><summary type="html"><![CDATA[TL;DR: This is the queueing model perspective of the “paper pool” conference reviewing model with math and numbers based on Little’s Law. Think of it as supplementary material to the post on David’s blog on current ML/AI conferences; you might want to read that one first.]]></summary></entry><entry><title type="html">Why the hell does nobody build more affordable housing in Berlin?!</title><link href="http://www.pokutta.com/blog/random/2025/08/14/rent_buy_affordability.html" rel="alternate" type="text/html" title="Why the hell does nobody build more affordable housing in Berlin?!" /><published>2025-08-14T01:00:00+02:00</published><updated>2025-08-14T01:00:00+02:00</updated><id>http://www.pokutta.com/blog/random/2025/08/14/rent_buy_affordability</id><content type="html" xml:base="http://www.pokutta.com/blog/random/2025/08/14/rent_buy_affordability.html"><![CDATA[<p><em>TL;DR: High construction costs and interest rates create tight margins for new housing in Berlin, making affordable projects barely viable even without profits—minimum rents required often strain affordability, explaining parts of the building slowdown.</em>
<!--more--></p>

<style>
    .plot-container {
        display: flex;
        justify-content: center;
        align-items: center;
        margin: 1em 0;
        max-width: 100%;
        height: 600px;
    }
    .plot-container canvas {
        max-width: 100%;
        height: auto;
    }
    .feasibility-calculator {
        margin: 2em 0;
    }
</style>

<p><strong>Disclaimer 1</strong>: Stating the obvious or the most likely outcome is not an endorsement.</p>

<p><strong>Disclaimer 2</strong>: This is <em>not</em> investment or tax advice, a recommendation or call for any specific action, or an opinion piece; your mileage may vary.</p>

<p><strong>Disclaimer 3</strong>: The math is simplified for the sake of intuition; you can go arbitrarily complex with (mildly) improved precision but the main mechanisms are the same.</p>

<h2 id="introduction">Introduction</h2>

<p>Recently, I had a conversation with a colleague about the skyrocketing rents in Berlin and why it seems impossible to build more affordable housing. They argued that greed from developers and landlords is one of the main reasons, but I wondered whether basic economics—like construction costs and interest rates—might already explain much of what we observe. To explore this, I did some first-principles, back-of-the-envelope calculations, focusing solely on covering costs without any profit motive. We assume no subsidies, no self-use scenarios, no feeding the mortgage (also known as negative gearing), and “cold rents” (excluding utilities like heating, water, etc.). If you do not care for the math, just check the estimates in the results section.</p>

<p>The executive summary boils down to this: an interest rate regime that is high by historical standards (even at low-end rates of 3-4%), combined with median construction costs around €4,470/sqm in Germany, pushes sustainable rents to levels that strain affordability, making new projects barely viable without losses, or at least significant risk thereof. Coincidentally, there was also an <a href="https://www.tagesspiegel.de/wirtschaft/immobilien/reich-und-raffgierig-sieben-uberraschende-erkenntnisse-uber-deutschlands-vermieter-14157455.html">article in the Tagesspiegel</a> quite recently that looked at the economics of rents, and its findings align with the ones below.</p>

<p>All numbers are based on publicly available data; see references at the end of the post.</p>

<h2 id="covering-costs-vs-actual-investments">Covering Costs vs Actual Investments</h2>

<p>This analysis focuses on minimum rents that cover only opportunity costs and basic ownership expenses, without profit margins, expected property appreciation, and risk premiums. It represents a “break-even” scenario, assuming investors aim merely to avoid losses. This is a reasonable minimum requirement, as otherwise investors would not invest in the first place and would put their money elsewhere. As a second-order effect this also holds true for institutional investors in the broader sense (including pension funds, governments, etc.), as they face similar opportunity costs; these investors also need to acquire the funds-to-be-allocated from somewhere (be it money paid into a pension fund, taxes collected, or simply debt) and have to answer to their constituents and stakeholders.</p>

<p>In contrast, actual investments typically seek profits, incorporating risk premiums, management fees, and anticipated appreciation. Our model is conservative by underestimating required rents for profit‑driven scenarios—real‑world investments might demand higher rents to be viable. However, it is not conservative in ignoring appreciation; by purposefully excluding it, we avoid overestimating feasibility in times of uncertain growth potential and market conditions, or more broadly in times of risk aversion. Moreover, appreciation only affects rents if the investor or owner is willing to take lower rents than required to cover costs, i.e., negative cashflows in anticipation of future appreciation. We will discuss this point in the next section.</p>

<h3 id="feeding-the-mortgage-and-negative-gearing">Feeding the Mortgage and Negative Gearing</h3>

<p>As mentioned in the introduction, our analysis assumes no “feeding the mortgage” (negative gearing), where rental income is less than expenses and the owner covers the shortfall. This practice, common worldwide for tax benefits, often allows losses to be deducted against other income. We explicitly exclude it here as we focus on break-even scenarios without self-use, appreciation bets, or subsidies; negative gearing can be understood as an indirect subsidy realized through the tax system. Nonetheless, for completeness, we briefly discuss it below so that its effects are clear.</p>

<p>The tax mechanisms for negative gearing are country-specific. In Germany, interest and losses from rental properties can be deducted against other income. For example, if a property costs €5,000/sqm with 4% interest, monthly costs might be €20/sqm, but if rent is only €15/sqm, the owner “feeds” €5/sqm—potentially tax-deductible, encouraging investment despite losses, especially if expecting capital gains.</p>

<p><strong>Example:</strong> Rental income €15,000/year, costs €25,000/year, loss €10,000 → taxable income reduced by €10,000, saving ~€4,200 at 42% bracket. The state effectively co-funds the feeding.</p>

<p>The example above is that of a private investor; institutional investors may have very different and highly complex structures. Moreover, negative gearing in Germany is not as aggressive as in some other countries like Australia. It is also important to note that negative cashflows come with (substantial) risks, including financial strain if property values do not appreciate as expected, higher interest rates increasing losses, or changes in tax laws reducing benefits. Why do investors do it anyway?</p>

<ul>
  <li><strong>Capital Gains Play</strong>: Accept short-term losses hoping the property’s value will rise enough to offset them.</li>
  <li><strong>Tax Deductions</strong>: In some countries, deduct the loss from other income, reducing your tax bill as discussed above.</li>
  <li><strong>Portfolio Growth</strong>: Enables buying more expensive properties, compounding returns if prices rise.</li>
</ul>

<p>However, negative cashflows require investors to take realized losses (“money gone for sure”), so that if conditions are unfavorable it might be hard to convince investors to take the risk; in particular if the investment is leveraged, which exacerbates the loss relative to the equity base.</p>
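<p>The arithmetic behind the earlier example is simple enough to spell out (a sketch using exactly the illustrative figures from the example above):</p>

```javascript
// Negative gearing example: annual rental income and costs, and the
// tax saving from deducting the loss at a 42% marginal rate.
// All amounts are the illustrative values from the example.
const income = 15000;  // EUR/year rental income
const costs = 25000;   // EUR/year total costs
const taxRatePct = 42; // marginal tax bracket in percent

const loss = costs - income;               // deductible loss
const taxSaving = loss * taxRatePct / 100; // tax saved
const netOutOfPocket = loss - taxSaving;   // what the owner actually feeds

console.log(loss, taxSaving, netOutOfPocket); // 10000 4200 5800
```

<p>So of the €10,000 fed into the property, €5,800 remains a realized after-tax loss; “money gone for sure” unless appreciation makes up for it.</p>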

<h2 id="key-results">Key Results</h2>

<p>Summarizing the calculations from below, here are the minimum cold rents needed to cover costs (at 3% interest and 25% cost factor as defined below), without profits:</p>

<ul>
  <li>Median new-build construction: 13.96 €/sqm/month (median construction cost of 4,470 €/sqm)</li>
  <li>Mitte (center) existing: 25.00 €/sqm/month (average purchase price of 8,000 €/sqm)</li>
  <li>Charlottenburg existing: 19.69 €/sqm/month (average purchase price of 6,300 €/sqm)</li>
  <li>Spandau (outer) existing: 10.00 €/sqm/month (average purchase price of 3,200 €/sqm)</li>
</ul>

<p>Current Berlin rents (12.50-20.00 €/sqm) often imply lower square-meter prices than actual costs, suggesting that new projects (either construction or renting out purchased properties) are unsustainable, while older projects might benefit from cheaper financing, lower construction costs, etc., and can hence sustain lower rents. This is also reflected in the building permits, which are down and far from matching demand (exacerbating shortages), as the economics often do not work out.</p>

<p>Note that the quoted rents above are also consistent with the folklore rule-of-thumb that every 1000 €/sqm increase in price requires an increase of 4 €/sqm/month in rent. This comes out to an effective interest rate of 3.84% at 25% cost factor (or flat 4.8% at 0% cost factor).</p>

<h2 id="market-data">Market Data</h2>

<p>In the following we briefly summarize some of the data that we will use in the analysis.</p>

<h3 id="current-rent-situation-in-berlin">Current Rent Situation in Berlin</h3>

<p>Based on 2024-2025 data, average cold rents (excluding utilities) in Berlin are:</p>

<ul>
  <li>Citywide average (existing apartments): 12.50 €/sqm/month</li>
  <li>New-build apartments: 18.00 €/sqm/month</li>
  <li>Mitte (center): 20.00 €/sqm/month</li>
  <li>Charlottenburg: 15.50 €/sqm/month</li>
  <li>Spandau (outer): 9.50 €/sqm/month</li>
</ul>

<h3 id="current-square-meter-prices">Current Square Meter Prices</h3>

<p>Based on 2024-2025 data, average purchase prices for existing apartments in Berlin are:</p>

<ul>
  <li>Citywide average: 4,000 €/sqm</li>
  <li>Mitte (center): 6,400 €/sqm</li>
  <li>Charlottenburg: 4,960 €/sqm</li>
  <li>Spandau (outer): 3,040 €/sqm</li>
</ul>

<p>and average prices for new-build apartments are:</p>

<ul>
  <li>Citywide average: 8,300 €/sqm</li>
  <li>Mitte (center): 14,500 €/sqm</li>
  <li>Charlottenburg: 11,400 €/sqm</li>
  <li>Spandau (outer): 6,000 €/sqm</li>
</ul>

<p>Median construction costs in Germany are 4,470 €/sqm (for multi-family housing, incl. VAT) (<a href="https://mieterbund.de/app/uploads/2025/04/Endbericht_Wohnungsbau-2025-_Quo-vadis_Stand-01.04.2025.pdf#page=19.11">Mieterbund Report, Figure 11</a>).</p>

<h3 id="building-permits-in-berlin">Building Permits in Berlin</h3>

<p>The economic constraints we have discussed manifest in Berlin’s building permit trends. According to the official statistics office, Berlin approved 9,772 dwellings in 2024 (−38.5% year-on-year) (<a href="https://www.statistik-berlin-brandenburg.de/f-ii-1-j">source</a>).</p>

<p>Nationally, Germany issued about 215,900 residential building permits in 2024 (−16.8% year-on-year)—the lowest since 2010 and far below the 400,000‑unit target (<a href="https://www.destatis.de/DE/Themen/Branchen-Unternehmen/Bauen/_inhalt.html">Destatis</a>). A slight rebound was reported in April 2025 to roughly 18,500 permits (+4.9%) (<a href="https://www.destatis.de/DE/Themen/Branchen-Unternehmen/Bauen/_inhalt.html">Destatis</a>), but monthly levels remain insufficient.</p>

<h2 id="detailed-analysis">Detailed Analysis</h2>

<p>The following sections explain the natural constraints, formulas, and scenarios in detail. Later below there is also an interactive calculator that allows you to play around with the parameters and explore various scenarios.</p>

<h3 id="natural-constraints">Natural Constraints</h3>

<p>In this section, we outline the fundamental economic constraints that shape feasible rent and price configurations in the housing market. These arise from basic principles: costs must be covered, prices cannot defy construction realities, and rents must remain affordable for tenants. Note that all rents discussed are “cold rents,” excluding utilities such as heating, water, electricity, internet, etc. We are considering a pure buy-to-rent or build-to-rent model without self-occupancy or profit motives.</p>

<h4 id="opportunity-cost-of-capital">Opportunity Cost of Capital</h4>

<p>For buying-to-rent or building-to-rent to be economically viable, we need capital, obtained either via a loan (debt), e.g., from a bank, or via our own capital (equity). In both cases we have to pay interest on that money, either in the form of actual interest to the bank or in the form of opportunity cost, which is essentially the return we forgo by not investing the money somewhere else; we refer to both costs here as opportunity cost.</p>

<p>The monthly rent per square meter $r$ must at least cover the opportunity cost of the capital invested in the property per square meter, plus additional ownership costs (maintenance, property taxes, etc.). This is expressed as:</p>

\[r \geq \left( \frac{p}{12} \times i \right) \times (1 + c)\]

<p>where:</p>
<ul>
  <li>$p$: price per square meter (€/sqm)</li>
  <li>$i$: annual interest rate</li>
  <li>$c$: cost factor (maintenance, taxes, etc.)</li>
</ul>

<p><strong>Meaning</strong>: Below this value the rent does not cover the costs of ownership.</p>

<h4 id="replication-cost-constraint">Replication Cost Constraint</h4>

<p>Property prices cannot sustainably fall below the cost of building or obtaining a new equivalent property:</p>

\[p \geq rc\]

<p>where $rc$ is the replication cost (€/sqm), i.e., construction costs or cost of buying.</p>

<p><strong>Meaning</strong>: The price $p$ to construct or buy a square meter is at least $rc$.</p>

<h4 id="affordability-constraint">Affordability Constraint</h4>

<p>Rent must be affordable based on income. Typically, housing costs should not exceed 33% of net income:</p>

\[r \leq \frac{n \times 0.33}{s}\]

<p>where:</p>
<ul>
  <li>$n$: monthly net income (€)</li>
  <li>$s$: square meters needed</li>
</ul>

<p>Note that the 33% threshold is somewhat arbitrary but a reasonable estimate; if one does not agree with it, the calculator below lets one simply dial in the desired maximum rent, either via the monthly income or via the square meters needed. Admittedly, the threshold becomes more and more challenging to meet.</p>

<p><strong>Meaning</strong>: Rents above this threshold are unaffordable.</p>

<h3 id="minimum-required-rent">Minimum Required Rent</h3>

<p>The minimum required rent, based on opportunity cost and replication cost, is calculated at the replication cost price:</p>

\[r_{min} = \left( \frac{rc}{12} \times i \right) \times (1 + c)\]

<p>This value represents the lowest feasible rent required to cover costs.</p>
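<p>The key results quoted earlier follow directly from this formula. A minimal sketch, using the purchase prices stated in the key results section together with $i=3\%$ and $c=25\%$; the computed values match the quoted rents up to rounding:</p>

```javascript
// Minimum cold rent (EUR/sqm/month) covering opportunity cost plus
// ownership costs at purchase price `price` (EUR/sqm), annual interest
// rate `i`, and cost factor `c` (defaults as in the key results).
function minRent(price, i = 0.03, c = 0.25) {
  return (price / 12) * i * (1 + c);
}

console.log(minRent(4470).toFixed(2)); // about 13.97: median new-build
console.log(minRent(8000).toFixed(2)); // 25.00: Mitte existing
console.log(minRent(6300).toFixed(2)); // 19.69: Charlottenburg existing
console.log(minRent(3200).toFixed(2)); // 10.00: Spandau existing

// Rule of thumb: +1000 EUR/sqm in price requires +4 EUR/sqm/month in
// rent; the implied effective interest rate at a 25% cost factor:
const iEff = 4 * 12 / (1000 * 1.25);
console.log((iEff * 100).toFixed(2) + '%'); // 3.84%

// Affordability (33% rule) at 3000 EUR net income and 65 sqm:
const maxRent = 3000 * 0.33 / 65;
console.log(maxRent.toFixed(2)); // about 15.23: new-build clears it, Mitte does not
```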

<h2 id="interactive-calculator">Interactive Calculator</h2>

<p>Feel free to play around and choose values that you think are reasonable. In particular, the interest rate has significant impact on the required rent per sqm.</p>

<p>The sliders adjust:</p>
<ul>
  <li><strong>Interest Rate</strong>: Annual opportunity cost or loan rate (i).</li>
  <li><strong>Cost Factor</strong>: Additional ownership costs as a percentage (c).</li>
  <li><strong>Replication Cost</strong>: Minimum construction/purchase price per sqm (rc).</li>
  <li><strong>Net Income</strong>: Monthly net income for affordability (n).</li>
  <li><strong>Space Needed (sqm)</strong>: Living space required for affordability (s).</li>
</ul>

<p>The light green area is the feasible region where all constraints are satisfied. Adjust the sliders to see how parameters affect feasibility.</p>

<section class="feasibility-calculator">
    <div style="display: flex; gap: 20px; margin-bottom: 20px; flex-wrap: wrap;">
        <div>
            <label for="interest-slider">Interest Rate: <span id="interest-value">3.0%</span></label><br />
            <input type="range" id="interest-slider" min="0.1" max="15" value="3" step="0.1" style="width: 200px;" />
        </div>
        <div>
            <label for="cost-slider">Cost Factor: <span id="cost-value">25%</span></label><br />
            <input type="range" id="cost-slider" min="0" max="100" value="25" step="1" style="width: 200px;" />
        </div>
        <div>
            <label for="replication-slider">Replication Cost: <span id="replication-value">€4,500/sqm</span></label><br />
            <input type="range" id="replication-slider" min="1500" max="15000" value="4500" step="100" style="width: 200px;" />
        </div>
        <div>
            <label for="income-slider">Net Income: <span id="income-value">€3,000/month</span></label><br />
            <input type="range" id="income-slider" min="1000" max="10000" value="3000" step="100" style="width: 200px;" />
        </div>
        <div>
            <label for="sqm-slider">Space Needed (sqm): <span id="sqm-value">65 sqm</span></label><br />
            <input type="range" id="sqm-slider" min="20" max="200" value="65" step="5" style="width: 200px;" />
        </div>
    </div>
    <div class="plot-container">
        <canvas id="feasibilityChart" width="800" height="600"></canvas>
    </div>

    <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
    <script>
        let chart = null;

        function validatePolygonData(points, regionName) {
            if (points.length < 3) {
                console.warn(`${regionName}: Insufficient points for polygon (${points.length})`);
                return false;
            }

            // Check if polygon is properly closed
            const first = points[0];
            const last = points[points.length - 1];
            const isClosed = Math.abs(first.x - last.x) < 0.01 && Math.abs(first.y - last.y) < 0.01;

            if (!isClosed) {
                console.warn(`${regionName}: Polygon not properly closed`);
                return false;
            }

            return true;
        }

        function generateBoundaryLineData(interestRate, costFactor, maxPrice = 20000) {
            const points = [];
            for (let price = 2000; price <= maxPrice; price += 500) {
                const rent = ((price / 12) * interestRate) * (1 + costFactor);
                points.push({ x: rent, y: price });
            }
            return points;
        }

        function generateFeasibleRegionData(interestRate, costFactor, replicationCost, netIncome, sqmNeeded) {
            const maxAffordableRent = (netIncome / sqmNeeded) * 0.33;
            const points = [];

            // Start from replication cost boundary
            const startRent = ((replicationCost / 12) * interestRate) * (1 + costFactor);
            points.push({ x: startRent, y: replicationCost });

            // Go up along boundary line until max affordable rent
            for (let price = replicationCost; price <= 15000; price += 200) {
                const minRent = ((price / 12) * interestRate) * (1 + costFactor);
                if (minRent <= maxAffordableRent) {
                    points.push({ x: minRent, y: price });
                } else {
                    // Cap at max affordable rent
                    points.push({ x: maxAffordableRent, y: price });
                    break;
                }
            }

            // Go left along max affordable rent line
            points.push({ x: maxAffordableRent, y: replicationCost });

            // Close polygon back to start
            points.push({ x: startRent, y: replicationCost });

            // Ensure we have enough points for a valid polygon
            if (points.length < 4) {
                // If feasible region is too small, return empty array
                console.warn('Feasible region too small to display');
                return [];
            }

            validatePolygonData(points, 'Feasible Region');
            return points;
        }

        function generateOpportunityCostViolationData(interestRate, costFactor, replicationCost, netIncome, sqmNeeded) {
            const maxAffordableRent = (netIncome / sqmNeeded) * 0.33;
            const points = [];

            // Create polygon for opportunity cost violations (area below boundary line)
            // Start at bottom-left
            points.push({ x: 5, y: replicationCost });

            // Go right along replication cost line
            for (let rent = 5; rent <= maxAffordableRent; rent += 1) {
                points.push({ x: rent, y: replicationCost });
            }

            // Go up along max affordable rent line
            for (let price = replicationCost; price <= 15000; price += 500) {
                const minRent = ((price / 12) * interestRate) * (1 + costFactor);
                if (minRent <= maxAffordableRent) {
                    points.push({ x: maxAffordableRent, y: price });
                }
            }

            // Go left along boundary line (top)
            for (let price = 15000; price >= replicationCost; price -= 200) {
                const minRent = ((price / 12) * interestRate) * (1 + costFactor);
                if (minRent <= maxAffordableRent) {
                    points.push({ x: minRent, y: price });
                }
            }

            // Close polygon
            points.push({ x: 5, y: replicationCost });

            validatePolygonData(points, 'Opportunity Cost Violation');
            return points;
        }

        function generateReplicationCostViolationData(interestRate, costFactor, replicationCost, netIncome, sqmNeeded) {
            const maxAffordableRent = (netIncome / sqmNeeded) * 0.33;
            const points = [];

            // Create polygon for replication cost violations (area below replication cost line)
            // Start at bottom-left
            points.push({ x: 5, y: 2000 });

            // Go right along bottom boundary
            for (let price = 2000; price <= replicationCost; price += 200) {
                const minRent = ((price / 12) * interestRate) * (1 + costFactor);
                points.push({ x: Math.min(minRent, maxAffordableRent), y: price });
            }

            // Go up along max affordable rent line
            points.push({ x: maxAffordableRent, y: replicationCost });

            // Go left along replication cost line
            for (let rent = maxAffordableRent; rent >= 5; rent -= 0.5) {
                points.push({ x: rent, y: replicationCost });
            }

            // Close polygon
            points.push({ x: 5, y: 2000 });

            validatePolygonData(points, 'Replication Cost Violation');
            return points;
        }

        function generateAffordabilityViolationData(interestRate, costFactor, replicationCost, netIncome, sqmNeeded) {
            const maxAffordableRent = (netIncome / sqmNeeded) * 0.33;
            const points = [];

            // Create polygon for affordability violations (area to the right of max affordable rent)
            // Start at max affordable rent line
            points.push({ x: maxAffordableRent, y: replicationCost });

            // Go up along max affordable rent line
            for (let price = replicationCost; price <= 15000; price += 200) {
                const minRent = ((price / 12) * interestRate) * (1 + costFactor);
                if (minRent <= maxAffordableRent) {
                    points.push({ x: maxAffordableRent, y: price });
                }
            }

            // Go right along top boundary
            points.push({ x: 20, y: 15000 });

            // Go down along right boundary
            for (let price = 15000; price >= replicationCost; price -= 200) {
                points.push({ x: 20, y: price });
            }

            // Close polygon
            points.push({ x: maxAffordableRent, y: replicationCost });

            validatePolygonData(points, 'Affordability Violation');
            return points;
        }

        function createFeasibilityPlot(interestRate = 0.03, costFactor = 0.25, replicationCost = 4500, netIncome = 3000, sqmNeeded = 65) {
            const ctx = document.getElementById('feasibilityChart').getContext('2d');

            // Calculate minimum rent needed at replication cost
            const minRentAtReplication = ((replicationCost / 12) * interestRate) * (1 + costFactor);

            // Calculate max affordable rent (33% of income)
            const maxAffordableRent = (netIncome / sqmNeeded) * 0.33;

            // Calculate dynamic chart bounds based on boundary line
            const maxChartPrice = Math.max(15000, (20 * 12) / (interestRate * (1 + costFactor)));
            
            // ANALYTICAL POLYGON COMPUTATION
            let feasibleRegionData = [];
            
            if (minRentAtReplication <= maxAffordableRent) {
                const xMin = 5;
                const xMax = 20;
                const yMin = 2000;
                const yMax = maxChartPrice;

                const slope = (interestRate * (1 + costFactor)) / 12;
                const yBase = Math.max(yMin, Math.min(replicationCost, yMax));
                const xOnBoundaryAtYBase = yBase * slope;

                const xC = Math.max(xMin, Math.min(maxAffordableRent, xMax));
                const pointC = { x: xC, y: yBase };
                const boundaryAffordabilityPrice = (maxAffordableRent * 12) / (interestRate * (1 + costFactor));
                
                if (boundaryAffordabilityPrice >= yBase) {
                    // Compute B on boundary line with proper clamping
                    let xB = xC;
                    let yB = xB / slope; // y from boundary for clamped x
                    if (yB > yMax) {
                        yB = yMax;
                        xB = yB * slope; // recompute x for clamped y
                    }
                    if (yB < yMin) {
                        yB = yMin;
                        xB = yB * slope;
                    }
                    const pointB = { x: xB, y: yB };
                    if (xOnBoundaryAtYBase >= xMin) {
                        // Boundary intersects baseline within viewport → triangle
                        const pointA = { x: Math.min(Math.max(xOnBoundaryAtYBase, xMin), xMax), y: yBase };
                        feasibleRegionData = [pointA, pointC, pointB, pointA];
                    } else {
                        // Boundary enters from left → quadrilateral using left-edge intersection
                        let yLeft = xMin / slope;
                        if (yLeft > yMax) { yLeft = yMax; }
                        if (yLeft < yMin) { yLeft = yMin; }
                        const pointD = { x: xMin, y: yBase };
                        const leftBoundary = { x: xMin, y: yLeft };
                        feasibleRegionData = [pointD, pointC, pointB, leftBoundary, pointD];
                    }
                }
            }

            const datasets = [
                {
                    label: 'Feasible Region',
                    type: 'line',
                    data: feasibleRegionData,
                    backgroundColor: 'transparent',
                    borderColor: feasibleRegionData.length > 0 ? 'rgba(34,139,34,0.8)' : 'transparent',
                    borderWidth: 2,
                    fill: false,
                    parsing: false,
                    spanGaps: false,
                    pointRadius: feasibleRegionData.length > 0 ? 3 : 0,
                    pointBackgroundColor: 'rgba(34,139,34,0.8)',
                    tension: 0,
                    showLine: feasibleRegionData.length > 0,
                    order: 10
                },
                {
                    label: 'Minimum Rent Needed',
                    data: generateBoundaryLineData(interestRate, costFactor, maxChartPrice),
                    borderColor: 'black',
                    backgroundColor: 'transparent',
                    borderWidth: 2,
                    fill: false,
                    pointRadius: 0,
                    tension: 0,
                    order: 1
                },
                {
                    label: 'Replication Cost Boundary',
                    data: [
                        { x: 0, y: replicationCost },
                        { x: Math.max(30, maxAffordableRent + 10), y: replicationCost }
                    ],
                    borderColor: 'rgba(255, 0, 0, 0.8)',
                    backgroundColor: 'transparent',
                    borderWidth: 2,
                    borderDash: [5, 5],
                    fill: false,
                    pointRadius: 0,
                    order: 0
                },
                {
                    label: 'Max Affordable Rent',
                    data: [
                        { x: maxAffordableRent, y: Math.max(0, Math.min(1000, replicationCost - 1000)) },
                        { x: maxAffordableRent, y: Math.max(maxChartPrice, replicationCost + 2000) }
                    ],
                    borderColor: 'blue',
                    borderWidth: 2,
                    borderDash: [5, 5],
                    fill: false,
                    pointRadius: 0,
                    order: 1
                }
            ];

            if (chart) {
                chart.destroy();
            }

            chart = new Chart(ctx, {
                type: 'line',
                data: {
                    datasets: datasets
                },
                options: {
                    responsive: true,
                    maintainAspectRatio: false,
                    animation: { duration: 0 },
                    transitions: { active: { animation: { duration: 0 } } },
                    scales: {
                        x: {
                            type: 'linear',
                            position: 'bottom',
                            title: { display: true, text: 'Monthly Rent per sqm (€)' },
                            min: 5,
                            max: 20
                        },
                        y: {
                            title: { display: true, text: 'Price per sqm (€)' },
                            min: 2000,
                            max: maxChartPrice
                        }
                    },
                    plugins: {
                        title: {
                            display: true,
                            text: `Interest: ${(interestRate * 100).toFixed(1)}%, Cost: ${(costFactor * 100).toFixed(0)}%, Replication: €${replicationCost}, Min rent: €${minRentAtReplication.toFixed(2)}/sqm, Max rent: €${maxAffordableRent.toFixed(2)}/sqm`
                        },
                        legend: { display: true },
                        tooltip: {
                            mode: 'nearest',
                            intersect: true
                        }
                    },
                    layout: { padding: 20 },
                    hover: { animationDuration: 0 },
                    responsiveAnimationDuration: 0,
                    interaction: { mode: 'nearest', intersect: true }
                }
            });
        }

        function updateChartData(interestRate, costFactor, replicationCost, netIncome, sqmNeeded) {
            if (!chart) return;

            // Calculate values
            const minRentAtReplication = ((replicationCost / 12) * interestRate) * (1 + costFactor);
            const maxAffordableRent = (netIncome / sqmNeeded) * 0.33;

            // Calculate dynamic chart bounds
            const maxChartPrice = Math.max(15000, (20 * 12) / (interestRate * (1 + costFactor)));
            
            // Update y-axis max
            chart.options.scales.y.max = maxChartPrice;

            // ANALYTICAL POLYGON COMPUTATION
            let feasibleRegionData = [];
            
            if (minRentAtReplication <= maxAffordableRent) {
                const xMin = chart.options.scales.x.min ?? 5;
                const xMax = chart.options.scales.x.max ?? 20;
                const yMin = chart.options.scales.y.min ?? 2000;
                const yMax = maxChartPrice;

                const slope = (interestRate * (1 + costFactor)) / 12;
                const yBase = Math.max(yMin, Math.min(replicationCost, yMax));
                const xOnBoundaryAtYBase = yBase * slope;

                const xC = Math.max(xMin, Math.min(maxAffordableRent, xMax));
                const pointC = { x: xC, y: yBase };
                const boundaryAffordabilityPrice = (maxAffordableRent * 12) / (interestRate * (1 + costFactor));
                
                if (boundaryAffordabilityPrice >= yBase) {
                    // Compute B on boundary line with proper clamping
                    let xB = xC;
                    let yB = xB / slope; // y from boundary for clamped x
                    if (yB > yMax) {
                        yB = yMax;
                        xB = yB * slope; // recompute x for clamped y
                    }
                    if (yB < yMin) {
                        yB = yMin;
                        xB = yB * slope;
                    }
                    const pointB = { x: xB, y: yB };
                    if (xOnBoundaryAtYBase >= xMin) {
                        const pointA = { x: Math.min(Math.max(xOnBoundaryAtYBase, xMin), xMax), y: yBase };
                        feasibleRegionData = [pointA, pointC, pointB, pointA];
                    } else {
                        let yLeft = xMin / slope;
                        if (yLeft > yMax) { yLeft = yMax; }
                        if (yLeft < yMin) { yLeft = yMin; }
                        const pointD = { x: xMin, y: yBase };
                        const leftBoundary = { x: xMin, y: yLeft };
                        feasibleRegionData = [pointD, pointC, pointB, leftBoundary, pointD];
                    }
                }
            }
            
            // Update feasible region polygon (dataset 0)
            chart.data.datasets[0].data = feasibleRegionData;
            chart.data.datasets[0].label = 'Feasible Region';
            chart.data.datasets[0].backgroundColor = 'transparent';
            chart.data.datasets[0].borderColor = feasibleRegionData.length > 0 ? 'rgba(34,139,34,0.8)' : 'transparent';
            chart.data.datasets[0].fill = false;
            chart.data.datasets[0].showLine = feasibleRegionData.length > 0;
            chart.data.datasets[0].pointRadius = feasibleRegionData.length > 0 ? 3 : 0;

            // Update minimum rent boundary (dataset 1)
            chart.data.datasets[1].data = generateBoundaryLineData(interestRate, costFactor, maxChartPrice);
            
            // Update replication cost boundary (dataset 2)
            chart.data.datasets[2].data = [
                { x: chart.options.scales.x.min, y: replicationCost },
                { x: chart.options.scales.x.max, y: replicationCost }
            ];

            // Update max affordable rent line (dataset 3)
            chart.data.datasets[3].data = [
                { x: maxAffordableRent, y: chart.options.scales.y.min },
                { x: maxAffordableRent, y: chart.options.scales.y.max }
            ];

            // Update title
            chart.options.plugins.title.text = `Interest: ${(interestRate * 100).toFixed(1)}%, Cost: ${(costFactor * 100).toFixed(0)}%, Replication: €${replicationCost}, Min rent: €${minRentAtReplication.toFixed(2)}/sqm, Max rent: €${maxAffordableRent.toFixed(2)}/sqm`;

            // Smooth update without jumping
            chart.update('none');
        }

        function updatePlot() {
            const interestRate = parseFloat(document.getElementById('interest-slider').value) / 100;
            const costFactor = parseFloat(document.getElementById('cost-slider').value) / 100;
            const replicationCost = parseFloat(document.getElementById('replication-slider').value);
            const netIncome = parseFloat(document.getElementById('income-slider').value);
            const sqmNeeded = parseFloat(document.getElementById('sqm-slider').value);

            // Update display values
            document.getElementById('interest-value').textContent = `${(interestRate * 100).toFixed(1)}%`;
            document.getElementById('cost-value').textContent = `${(costFactor * 100).toFixed(0)}%`;
            document.getElementById('replication-value').textContent = `€${replicationCost.toFixed(0)}/sqm`;
            document.getElementById('income-value').textContent = `€${netIncome.toFixed(0)}/month`;
            document.getElementById('sqm-value').textContent = `${sqmNeeded.toFixed(0)} sqm`;

            // Use smooth update instead of recreation
            if (chart) {
                updateChartData(interestRate, costFactor, replicationCost, netIncome, sqmNeeded);
            } else {
                createFeasibilityPlot(interestRate, costFactor, replicationCost, netIncome, sqmNeeded);
            }
        }

        // Add event listeners to sliders
        document.getElementById('interest-slider').addEventListener('input', updatePlot);
        document.getElementById('cost-slider').addEventListener('input', updatePlot);
        document.getElementById('replication-slider').addEventListener('input', updatePlot);
        document.getElementById('income-slider').addEventListener('input', updatePlot);
        document.getElementById('sqm-slider').addEventListener('input', updatePlot);

        // Initial plot
        updatePlot();
    </script>
</section>

<h2 id="common-scenarios">Common Scenarios</h2>

<p>Using current Berlin real estate data (as of 2024-2025), we explore reasonable configurations. We assume a cost factor $c = 0.25$ (25% for maintenance, taxes, etc.) and a (relatively) low interest rate $i = 0.03$ (3%); this is also rather conservative for the opportunity cost of equity. All calculations use the minimum rent formula:</p>

\[r = \left( \frac{p}{12} \times i \right) \times (1 + c)\]

<p>where $p$ is the effective price per sqm (either construction cost or purchase price).</p>
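<p>As a quick sanity check of the numbers below, the formula can be implemented directly (a minimal sketch mirroring the boundary-line computation in the interactive chart above; the helper name <code class="language-plaintext highlighter-rouge">minRent</code> is ours):</p>

```javascript
// Minimum monthly rent per sqm needed to cover opportunity cost (no profit):
//   r = (p / 12) * i * (1 + c)
// p: effective price per sqm (EUR), i: annual interest rate, c: cost factor.
function minRent(pricePerSqm, interestRate = 0.03, costFactor = 0.25) {
    return (pricePerSqm / 12) * interestRate * (1 + costFactor);
}

// Median construction cost scenario from below:
console.log(minRent(4470).toFixed(2)); // "13.97"
```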

<h3 id="median-construction-cost-scenario">Median Construction Cost Scenario</h3>

<ul>
  <li>Median construction cost across Germany: $p = 4{,}470$ €/sqm (for multi-family housing, incl. VAT) (<a href="https://mieterbund.de/app/uploads/2025/04/Endbericht_Wohnungsbau-2025-_Quo-vadis_Stand-01.04.2025.pdf#page=19.11">Mieterbund Report, Figure 11</a>)</li>
  <li>Interest rate: 3%</li>
  <li>Minimum required rent: $r_{min} = \left( \frac{4{,}470}{12} \times 0.03 \right) \times 1.25 \approx 13.97$ €/sqm/month</li>
</ul>

<p>This represents the lowest sustainable rent for new builds at median costs, just covering opportunity costs without profit.</p>

<h3 id="berlin-purchase-scenarios">Berlin Purchase Scenarios</h3>

<p>Purchase prices vary by location. We calculate the minimum rent needed to cover costs for existing and new-build properties.</p>

<h4 id="city-center-mitte">City Center (Mitte)</h4>
<ul>
  <li>Existing: $p = 8{,}000$ €/sqm → $r \approx 25.00$ €/sqm/month</li>
  <li>New-build: $p = 14{,}500$ €/sqm → $r \approx 45.31$ €/sqm/month</li>
</ul>

<h4 id="charlottenburg">Charlottenburg</h4>
<ul>
  <li>Existing: $p = 6{,}300$ €/sqm → $r \approx 19.69$ €/sqm/month</li>
  <li>New-build: $p = 11{,}400$ €/sqm → $r \approx 35.63$ €/sqm/month</li>
</ul>

<h4 id="outer-district-spandau">Outer District (Spandau)</h4>
<ul>
  <li>Existing: $p = 3{,}200$ €/sqm → $r \approx 10.00$ €/sqm/month</li>
  <li>New-build: $p = 6{,}000$ €/sqm → $r \approx 18.75$ €/sqm/month</li>
</ul>

<p>These minimum rents assume covering spot costs only (no profit motive). Higher interest rates (e.g., 4% for private loans) would increase required rents accordingly. Conversely, many properties were financed at lower interest rates and/or built at lower construction costs years ago, significantly reducing the required rent per sqm.</p>

<h3 id="rent-implied-prices">Rent-implied Prices</h3>

<p>Based on 2024-2025 data, average cold rents (excluding utilities) in Berlin are:</p>

<ul>
  <li>Citywide average (existing apartments): 12.50 €/sqm/month</li>
  <li>New-build apartments: 18.00 €/sqm/month</li>
  <li>Mitte (center): 20.00 €/sqm/month</li>
  <li>Charlottenburg: 15.50 €/sqm/month</li>
  <li>Spandau (outer): 9.50 €/sqm/month</li>
</ul>

<p>To connect this to our analysis, we perform an “inverse” calculation: given the rent $r$, what implied price per sqm $p$ would justify it under the opportunity cost constraint? Solving the formula for $p$:</p>

\[p = \frac{r \times 12}{i \times (1 + c)}\]

<p>Using $i=0.03$ and $c=0.25$ (as before):</p>

<ul>
  <li>Citywide average: p ≈ 4,000 €/sqm (below median construction cost of 4,470 €/sqm and city average purchase of 5,500 €/sqm)</li>
  <li>New-build: p ≈ 5,760 €/sqm (below city new-build average of 8,300 €/sqm)</li>
  <li>Mitte (center): p ≈ 6,400 €/sqm (below actual Mitte existing purchase of 8,000 €/sqm)</li>
  <li>Charlottenburg: p ≈ 4,960 €/sqm (below 6,300 €/sqm)</li>
  <li>Spandau (outer): p ≈ 3,040 €/sqm (slightly below 3,200 €/sqm)</li>
</ul>

<p>These implied prices are often lower than actual replication (construction or purchase) costs from the market data. This suggests that at current rents, especially in central areas, owners may not fully cover opportunity costs unless interest rates are very low, costs are minimized, or significant depreciation is expected or has occurred.</p>
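<p>The inverse calculation can be sketched analogously (the helper name <code class="language-plaintext highlighter-rouge">impliedPrice</code> is ours):</p>

```javascript
// Implied price per sqm that a given cold rent would justify under the
// opportunity cost constraint:  p = (r * 12) / (i * (1 + c))
function impliedPrice(rentPerSqm, interestRate = 0.03, costFactor = 0.25) {
    return (rentPerSqm * 12) / (interestRate * (1 + costFactor));
}

// Rent-implied prices from above (i = 3%, c = 25%):
console.log(Math.round(impliedPrice(12.5))); // 4000 (citywide average)
console.log(Math.round(impliedPrice(18)));   // 5760 (new-build)
```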

<h2 id="references">References</h2>

<ol>
  <li><a href="https://www.ibb.de/media/dokumente/publikationen/berliner-wohnungsmarkt/wohnungsmarktbericht/2024/ibb-wmb-2024-summary_en.pdf">Investitionsbank Berlin – Housing Market Report 2024 (Summary, EN)</a></li>
  <li><a href="https://guthmann.estate/en/market-report/berlin/">Guthmann Estate – Berlin Real Estate Market Report (2025)</a></li>
  <li><a href="https://investropa.com/blogs/news/average-rent-per-sqm-berlin">InvestRopa – Average Rent &amp; Price per m² in Berlin (2025)</a></li>
  <li><a href="https://investropa.com/blogs/news/average-rent-berlin">InvestRopa – Average Rent in Berlin by District (2025)</a></li>
  <li><a href="https://www.cbre.de/en-gb/insights/reports/cbre-berlin-hyp-housing-market-report-2025">CBRE – Berlin Housing Market Report 2025</a></li>
  <li><a href="https://www.ifw-kiel.de/publications/news/greix-rental-price-index-q2-2025-price-growth-slows/">IfW Kiel – GREIX Rental Price Index Q2 2025</a></li>
  <li><a href="https://integra-dom.com/en/apartment-rental-prices-in-germany-analysis-of-q3-2024/">Integra Dom – Apartment Rental Prices in Germany, Q3 2024</a></li>
  <li><a href="https://www.theguardian.com/commentisfree/2025/jan/22/berlin-housing-crisis-germany-rents-flats">The Guardian – Berlin’s Housing Crisis and Furnished Rental Market (2025)</a></li>
  <li><a href="https://www.the-berliner.com/english-news-berlin/rents-rise-faster-in-berlin-than-anywhere-else-german-economic-institute/">The Berliner – Berlin Rent Increases Outpace the Rest of Germany (2024)</a></li>
  <li><a href="https://www.statistik-berlin-brandenburg.de/f-ii-1-m">Statistik Berlin-Brandenburg – Building Permits Press Release (H1 2025)</a></li>
  <li><a href="https://bbu.de/beitraege/baugenehmigungen-berlin-stand-juli-2024">BBU – Building Permits in Berlin, Jan–Jul 2024</a></li>
  <li><a href="https://www.reuters.com/markets/europe/germanys-residential-building-permits-fall-2010-low-2025-02-18/">Reuters – Germany’s Residential Building Permits Fall to 2010 Low (2025)</a></li>
  <li><a href="https://www.destatis.de/DE/Themen/Branchen-Unternehmen/Bauen/_inhalt.html">Destatis – Building and Housing Statistics (2025)</a></li>
  <li><a href="https://www.refire-online.com/investment/germanys-housing-pipeline-crumbles-as-permits-plunge-to-15-year-low/">REFIRE – Germany’s Housing Pipeline Crumbles as Permits Plunge</a></li>
  <li><a href="https://mieterbund.de/app/uploads/2025/04/Endbericht_Wohnungsbau-2025-_Quo-vadis_Stand-01.04.2025.pdf#page=19.11">Mieterbund Report, Figure 11</a></li>
  <li><a href="https://www.tagesspiegel.de/wirtschaft/immobilien/reich-und-raffgierig-sieben-uberraschende-erkenntnisse-uber-deutschlands-vermieter-14157455.html">Tagesspiegel – Reich und raffgierig? Sieben überraschende Erkenntnisse über Deutschlands Vermieter</a></li>
</ol>]]></content><author><name>Sebastian Pokutta</name></author><category term="random" /><category term="rents" /><category term="economics" /><category term="finance" /><summary type="html"><![CDATA[TL;DR: High construction costs and interest rates create tight margins for new housing in Berlin, making affordable projects barely viable even without profits—minimum rents required often strain affordability, explaining parts of the building slowdown.]]></summary></entry><entry><title type="html">A New Default Open-Loop Step-Size for Frank-Wolfe?</title><link href="http://www.pokutta.com/blog/research/2025/05/16/log_step_abstract.html" rel="alternate" type="text/html" title="A New Default Open-Loop Step-Size for Frank-Wolfe?" /><published>2025-05-16T01:00:00+02:00</published><updated>2025-05-16T01:00:00+02:00</updated><id>http://www.pokutta.com/blog/research/2025/05/16/log_step_abstract</id><content type="html" xml:base="http://www.pokutta.com/blog/research/2025/05/16/log_step_abstract.html"><![CDATA[<p><em>TL;DR: In our recent paper <a href="https://arxiv.org/abs/2505.09886">Adaptive Open-Loop Step-Sizes for Accelerated Convergence Rates of the Frank-Wolfe Algorithm</a> with <a href="https://elwirth.github.io/">Elias Wirth</a> and <a href="https://scholars.cmu.edu/2180-javier-pe%C3%B1a">Javier Peña</a>, we explore a new “log-adaptive” open-loop step-size for the Frank-Wolfe algorithm, $\eta_t = \frac{2 + \log(t+1)}{t+2 + \log(t+1)}$, which is adaptable both to favorable function properties and the feasible region in terms of convergence rates. In particular, it matches and often surpasses traditional fixed-parameter open-loop step-sizes across various settings, without needing prior knowledge of problem parameters, both in theory and computations.</em>
<!--more--></p>

<h2 id="introduction">Introduction</h2>

<p>The Frank-Wolfe (FW) algorithm is an important method for constrained (first-order) convex optimization, especially when projections are costly. A critical and often debated component of FW is the choice of step-size $\eta_t$. Basically, there are two types of step-size strategies: so-called <em>open-loop</em> strategies that do not require any feedback from the function and are fixed ahead of time, not adapting at runtime, and so-called <em>closed-loop</em> strategies that “interact” with the objective function. The former are typically much cheaper, while the latter often possess superior convergence performance.</p>

<p>Since the inception of the Frank-Wolfe algorithm, the “standard” open-loop step-size has been $\eta_t = \frac{2}{t+2}.$ More generally, step-sizes of the form $\eta_t = \frac{\ell}{t+\ell}$ for some integer $\ell \ge 2$ have been analyzed. While these provide an $\mathcal{O}(1/t)$ convergence rate in general, recent work [WKP2023, WPP2024] has shown that under certain “growth conditions,” these fixed-$\ell$ step-sizes can achieve much faster rates—up to $\mathcal{O}(t^{-\ell})$.</p>

<p>The catch is that there is no single “best” $\ell$, and there is—as expected—a tradeoff: While for the asymptotic convergence rate, larger $\ell$ are better, we pay a price for this: namely, an initial burn-in with suboptimal convergence whose length depends on $\ell$. Thus, a larger $\ell$ might be great for one problem structure (like strong $(M,1)$-growth, where up to linear rates are possible), but not for others. This leads to a practical dilemma: which $\ell$ should you pick if you don’t know the specific growth properties of your problem beforehand?</p>

<h2 id="beyond-fixed-ell-towards-an-adaptive-open-loop">Beyond Fixed-$\ell$: Towards an Adaptive Open-Loop</h2>

<p>To address this issue, in [P2024] a new open-loop step-size strategy of the form \(\eta_t = \frac{2 + \log(t+1)}{t+2 + \log(t+1)}\) was proposed, albeit without proof of convergence or analysis of expected properties; it was a remark complementing a new adaptive closed-loop strategy. In our recent paper, <a href="https://arxiv.org/abs/2505.09886">Adaptive Open-Loop Step-Sizes for Accelerated Convergence Rates of the Frank-Wolfe Algorithm</a>, we remedy this and in fact provide a much broader convergence analysis of a wide range of open-loop step-sizes for the Frank-Wolfe algorithm. Instead of a fixed $\ell$, we propose a more general open-loop step-size scheme:</p>

\[\eta_t = \frac{g(t)}{t+g(t)},\]

<p>where $g(t)$ is a non-decreasing function of the iteration count $t$. The idea is to let $g(t)$ “adapt” (or rather, grow) with $t$, potentially capturing the benefits of larger $\ell$ values as the optimization progresses.</p>

<p>To make this work, we require two natural properties from $g(t)$:</p>
<ol>
  <li>$g(t)$ has to be non-decreasing with $g(t) \ge 2$ (the latter is a technical condition ensuring compatibility with the standard open-loop strategy, and in particular $\mathcal O(t^{-1})$ convergence in general).</li>
  <li>The sequence $\eta_t = \frac{g(t)}{t+g(t)}$ itself is non-increasing (or equivalently, $\frac{t}{g(t)} \le \frac{t+1}{g(t+1)}$).</li>
</ol>
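<p>For intuition, the classical fixed-$\ell$ rule is recovered by choosing $g(t) \equiv \ell$ constant, which satisfies both requirements whenever $\ell \ge 2$. A minimal sketch (ours, in plain JavaScript):</p>

```javascript
// The general scheme eta_t = g(t) / (t + g(t)) with constant g(t) = ell
// reduces to the classical open-loop rule eta_t = ell / (t + ell).
const etaFixed = (ell) => (t) => ell / (t + ell);

const eta2 = etaFixed(2); // the standard rule 2 / (t + 2)
console.log(eta2(0)); // 1   (the first step jumps fully to the first FW vertex)
console.log(eta2(2)); // 0.5
```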

<h2 id="a-new-open-loop-step-size">A new open-loop step-size</h2>

<p>While the analysis holds more broadly, we are particularly interested in the <strong>log-adaptive open-loop step-size</strong>, where we set:</p>

\[g(t) = 2 + \log(t+1),\]

<p>leading to the aforementioned step-size strategy:</p>

\[\eta_t = \frac{2 + \log(t+1)}{t+2 + \log(t+1)}\]
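<p>Both requirements are easy to verify numerically for this choice (a quick sanity check of ours, not part of the paper):</p>

```javascript
// Log-adaptive step-size: eta_t = g(t) / (t + g(t)) with g(t) = 2 + log(t + 1).
const g = (t) => 2 + Math.log(t + 1);
const eta = (t) => g(t) / (t + g(t));

// Requirement 1: g is non-decreasing with g(t) >= 2.
// Requirement 2: eta_t is non-increasing (equivalently, t / g(t) is non-decreasing).
let ok = true;
for (let t = 0; t < 100000; t++) {
    if (g(t) < 2 || g(t + 1) < g(t) || eta(t + 1) > eta(t)) ok = false;
}
console.log(ok); // true
```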

<p>This choice satisfies both our requirements from above. Moreover, using this choice, we obtain optimal rates <em>without</em> line search or knowledge of parameters in a wide range of growth settings:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Growth Setting</th>
      <th style="text-align: left">Fixed \(\eta_t = \frac{\ell}{t+\ell}\)</th>
      <th style="text-align: left">Log-Adaptive \(\eta_t = \frac{2+\log(t+1)}{t+2+\log(t+1)}\)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">Strong \((M, 1)\)</td>
      <td style="text-align: left">\(\mathcal{O}(t^{-\ell})\)</td>
      <td style="text-align: left">\(\tilde{\mathcal{O}}(t^{-k})\) for any \(k \in \mathbb{N}\)</td>
    </tr>
    <tr>
      <td style="text-align: left">Strong \((M, r)\)</td>
      <td style="text-align: left">\(\mathcal{O}(t^{-\ell+\epsilon}+t^{-\frac{1}{1-r}})\)</td>
      <td style="text-align: left">\(\tilde{\mathcal{O}}(t^{-\frac{1}{1-r}})\)</td>
    </tr>
    <tr>
      <td style="text-align: left">Weak \((M, r)\)</td>
      <td style="text-align: left">\(\mathcal{O}(t^{-\ell+\epsilon}+t^{-\frac{1}{1-r}} +t^{-2})\)</td>
      <td style="text-align: left">\(\tilde{\mathcal{O}}(t^{-\frac{1}{1-r}}+t^{-2})\)</td>
    </tr>
  </tbody>
</table>

<p><em>(Rates in the table are for the suboptimality gap \(f(x) - f(x^\esx)\) and \(\tilde{\mathcal{O}}(\cdot)\) hides polylog factors)</em></p>

<p>Essentially, the log-adaptive step-size performs at least as well (up to polylogarithmic factors) as <em>any</em> fixed-$\ell$ choice, and in the strong growth settings, it can significantly outperform them or match the best possible rate achievable by tuning $\ell$, doing line search, or picking an adaptive closed-loop strategy. For instance, in the Strong $(M,1)$ growth setting, its asymptotic convergence rate approaches a linear rate in the limit.</p>

<p><strong>Note.</strong> If the log-adaptive step-size does not meet favorable function properties or properties of the feasible region, its performance simply collapses to that of the standard rule \(\eta_t = \frac{2}{t+2}\).</p>
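<p>As an illustration, the two rules are easy to compare directly. The following Python sketch (ours, purely illustrative; function names are not from the paper) evaluates the log-adaptive step-size alongside the standard rule:</p>

```python
import math

def eta_log_adaptive(t):
    """Log-adaptive open-loop step-size with g(t) = 2 + log(t+1)."""
    g = 2.0 + math.log(t + 1)
    return g / (t + g)

def eta_standard(t):
    """Standard open-loop rule eta_t = 2/(t+2), i.e. g(t) = 2."""
    return 2.0 / (t + 2)

# Both rules start at eta_0 = 1 and decay like 1/t, but the
# log-adaptive rule stays slightly larger, which is what drives
# the faster rates under favorable growth conditions.
for t in [0, 10, 100, 10000]:
    print(t, eta_standard(t), eta_log_adaptive(t))
```

<p>Both sequences satisfy the two requirements listed above: $g(t) \ge 2$ and non-decreasing, with $\eta_t$ non-increasing.</p>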

<h2 id="the-key-lemma">The Key Lemma</h2>

<p>The key insight behind these results lies in extending the analytical framework previously used for fixed-$\ell$ step-sizes from [WPP2024] (see also [WKP2023]) and providing a new, stronger cumulative product bound. For those familiar with [WPP2024], we obtain the following strengthened bound:</p>

<p class="mathcol">For \(S \in \mathbb{N}_{\ge 1}\), \(\epsilon \in ]0, g(S)[\), and \(t \in \mathbb{N}_{\ge S}\):
\[ \prod_{i=S}^t \left(1-\left(1-\frac{\epsilon}{g(i)}\right) \eta_i\right) \le \left(\frac{\eta_t}{\eta_{S-1}} \right)^{g(S) - \epsilon}.\]</p>

<p>This refinement, when $g(t)$ grows like $\log(t+1)$, allows us to derive the improved and more robust convergence rates.</p>
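<p>The bound is also easy to sanity-check numerically. The following Python sketch (ours, for illustration only, not the paper's proof) evaluates both sides of the inequality for the log-adaptive choice $g(t) = 2 + \log(t+1)$:</p>

```python
import math

def g(t):
    """Log-adaptive schedule g(t) = 2 + log(t+1)."""
    return 2.0 + math.log(t + 1)

def eta(t):
    return g(t) / (t + g(t))

def cumprod_lhs(S, t, eps):
    """Left-hand side: prod_{i=S}^t (1 - (1 - eps/g(i)) * eta_i)."""
    prod = 1.0
    for i in range(S, t + 1):
        prod *= 1.0 - (1.0 - eps / g(i)) * eta(i)
    return prod

def cumprod_rhs(S, t, eps):
    """Right-hand side: (eta_t / eta_{S-1})^(g(S) - eps)."""
    return (eta(t) / eta(S - 1)) ** (g(S) - eps)

# spot-check the bound for a few (S, t, eps) triples
for (S, t, eps) in [(1, 50, 0.5), (5, 200, 1.0), (20, 1000, 0.1)]:
    print(S, t, eps, cumprod_lhs(S, t, eps), cumprod_rhs(S, t, eps))
```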

<h2 id="numerical-experiments">Numerical Experiments</h2>

<p>In the following, we provide an excerpt from our numerical experiments to demonstrate the performance of the log-adaptive strategy. Here we will only depict convergence in the Frank-Wolfe gap</p>

\[\max_{v \in P} \langle \nabla f(x), x - v \rangle\]

<p>which is a (standard) observable dual gap measure that can be used as a stopping criterion; see the paper for other measures such as the primal (suboptimality) gap or the primal-dual gap.</p>

<p><strong>Note.</strong> While we primarily considered two types of experiments in the paper, the log-adaptive step-size is now also included in the <a href="https://github.com/ZIB-IOL/FrankWolfe.jl"><code class="language-plaintext highlighter-rouge">FrankWolfe.jl</code></a> Julia package and has been tested on a wide variety of problems.</p>

<h3 id="constrained-regression">Constrained Regression</h3>
<p>We tested a constrained regression problem on the Boston housing dataset, confining the regression coefficients to $L_p$-balls with $p=2$ and $p=5$. We considered both the case where the unconstrained optimum lies inside the feasible $L_p$-ball (here we do not expect accelerated convergence) and the case where it lies outside (here we expect accelerated convergence). The objective is of the form:</p>

\[\min_{x \in\RR^n, \|x\|_p \leq \beta} \frac{1}{2}\|A x - y\|_2^2.\]
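<p>To illustrate the setup, here is a minimal vanilla Frank-Wolfe loop with the log-adaptive step-size for the $L_2$-ball case, where the linear minimization oracle (LMO) is closed-form. This is our own sketch on synthetic data, not the paper's experiment code:</p>

```python
import math
import numpy as np

def eta(t):
    """Log-adaptive open-loop step-size."""
    g = 2.0 + math.log(t + 1)
    return g / (t + g)

def fw_l2ball_regression(A, y, beta, iters=500):
    """Vanilla Frank-Wolfe for min 0.5*||Ax - y||^2 s.t. ||x||_2 <= beta."""
    x = np.zeros(A.shape[1])
    gaps = []
    for t in range(iters):
        grad = A.T @ (A @ x - y)
        # LMO over the L2-ball: v = -beta * grad / ||grad||_2
        nrm = np.linalg.norm(grad)
        v = -beta * grad / nrm if nrm > 0 else np.zeros_like(grad)
        gaps.append(grad @ (x - v))  # Frank-Wolfe gap at the current iterate
        x = x + eta(t) * (v - x)     # convex combination stays feasible
    return x, gaps

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10))
y = rng.standard_normal(40)
x, gaps = fw_l2ball_regression(A, y, beta=1.0)
print(gaps[0], gaps[-1])
```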

<p>The results are depicted below.</p>

<div style="text-align:center; margin-bottom: 20px;">
    <img src="http://www.pokutta.com/blog/assets/log_adaptive_fw/gap_regression_boston_p=2.0_location=interior.svg" alt="Regression p=2, interior" style="width:45%;" />
    <img src="http://www.pokutta.com/blog/assets/log_adaptive_fw/gap_regression_boston_p=5.0_location=interior.svg" alt="Regression p=5, interior" style="width:45%;" />
    <p style="font-size: small; font-style: italic;">Figure 1: Frank-Wolfe gap for constrained regression, unconstrained optimum inside the feasible region. (L: $L_2$-ball, R: $L_5$-ball)</p>
</div>
<div style="text-align:center; margin-bottom: 20px;">
    <img src="http://www.pokutta.com/blog/assets/log_adaptive_fw/gap_regression_boston_p=2.0_location=exterior.svg" alt="Regression p=2, exterior" style="width:45%;" />
    <img src="http://www.pokutta.com/blog/assets/log_adaptive_fw/gap_regression_boston_p=5.0_location=exterior.svg" alt="Regression p=5, exterior" style="width:45%;" />
    <p style="font-size: small; font-style: italic;">Figure 2: Frank-Wolfe gap for constrained regression, unconstrained optimum outside the feasible region. (L: $L_2$-ball, R: $L_5$-ball)</p>
</div>

<h3 id="collaborative-filtering">Collaborative Filtering</h3>

<p>Our second test is collaborative filtering, with the objective</p>

\[\min_{X\in\RR^{m\times n}, \|X\|_{\text{nuc}}\leq \beta} \frac{1}{|\mathcal I|} \sum_{(i,j)\in\mathcal I} H(A_{i,j} - X_{i,j}),\]

<p>using a Huber loss $H$ over the MovieLens 100k dataset, with nuclear norm ball radii $\beta=1000$ and $\beta=3000$. We expect faster convergence when the radius is reasonably small.</p>
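<p>The only problem-specific ingredient here is the LMO over the nuclear-norm ball, which requires just the top singular pair of the gradient. A minimal sketch (ours; a full SVD is used for clarity, whereas practical implementations typically use iterative methods for the top pair):</p>

```python
import numpy as np

def lmo_nuclear(grad, beta):
    """LMO over the nuclear-norm ball of radius beta:
    argmin_{||V||_nuc <= beta} <grad, V> = -beta * u1 @ v1^T,
    where (u1, v1) is the top singular pair of grad."""
    U, s, Vt = np.linalg.svd(grad, full_matrices=False)
    return -beta * np.outer(U[:, 0], Vt[0, :])

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))
V = lmo_nuclear(G, beta=3.0)
# the attained minimum equals -beta times the top singular value of G
print(np.sum(G * V), -3.0 * np.linalg.svd(G, compute_uv=False)[0])
```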

<div style="text-align:center; margin-bottom: 20px;">
    <img src="http://www.pokutta.com/blog/assets/log_adaptive_fw/gap_collaborative_filtering_radius=1000.svg" alt="Collab Filtering r=1000" style="width:45%;" />
    <img src="http://www.pokutta.com/blog/assets/log_adaptive_fw/gap_collaborative_filtering_radius=3000.svg" alt="Collab Filtering r=3000" style="width:45%;" />
    <p style="font-size: small; font-style: italic;">Figure 3: Frank-Wolfe gap for collaborative filtering. (L: radius 1000, R: radius 3000)</p>
</div>

<p>As can be seen from the plots, across these tests, the log-adaptive step-size ($g(t)=2+\log(t+1)$) consistently performed on par with or better than the fixed step-sizes ($g(t)=2$ and $g(t)=4$). In scenarios aligning with (strong) growth (like the regression problem where the unconstrained optimum lies outside the feasible region, or collaborative filtering with a smaller radius), it often showed notably faster convergence for the primal-dual gap, the primal (suboptimality) gap, and the Frank-Wolfe gap, in line with the predictions from theory.</p>

<h2 id="wrap-up">Wrap-up</h2>

<p>The log-adaptive open-loop step-size $\eta_t = \frac{2 + \log(t+1)}{t+2 + \log(t+1)}$ offers a compelling alternative to traditional fixed-$\ell$ step-sizes for the Frank-Wolfe algorithm. It is simple (in fact, trivial) to implement, requires no prior problem knowledge for parameter tuning, and demonstrates robust, often superior, performance both theoretically and empirically. To foster adoption and community feedback, we have also incorporated these adaptive open-loop step-sizes into the <a href="https://github.com/ZIB-IOL/FrankWolfe.jl"><code class="language-plaintext highlighter-rouge">FrankWolfe.jl</code></a> package.</p>

<h2 id="references">References</h2>

<p>[WKP2023]   Wirth, E., Kerdreux, T., &amp; Pokutta, S. (2023). <em>Acceleration of Frank-Wolfe algorithms with open loop step-sizes</em>. Proceedings of AISTATS. <a href="https://arxiv.org/abs/2205.12838">arXiv:2205.12838</a>, <a href="http://www.pokutta.com/slides/20230600_AISTATS_FW_fast_sublinear_rates_with_determinist_step_sizes.pdf">Poster</a></p>

<p>[P2024] Pokutta, S. (2024). <em>The Frank-Wolfe algorithm: a short introduction</em>. Jahresbericht der Deutschen Mathematiker-Vereinigung, 126, 3–35. <a href="https://arxiv.org/abs/2311.05313">arXiv:2311.05313</a>, <a href="https://dx.doi.org/10.1365/s13291-023-00275-x">Published version</a></p>

<p>[WPP2024] Wirth, E., Peña, J., &amp; Pokutta, S. (2024). <em>Accelerated Affine-Invariant Convergence Rates of the Frank-Wolfe Algorithm with Open-Loop Step-Sizes</em>. to appear in Mathematical Programming A. <a href="https://arxiv.org/abs/2310.04096">arXiv:2310.04096</a>, <a href="https://dx.doi.org/10.1007/s10107-024-02180-2">Published version</a></p>

<p>[WPP2025A] Wirth, E., Peña, J., &amp; Pokutta, S. (2025). <em>Adaptive Open-Loop Step-Sizes for Accelerated Convergence Rates of the Frank-Wolfe Algorithm</em>. preprint. <a href="https://arxiv.org/abs/2505.09886">arXiv:2505.09886</a></p>

<p>[WPP2025B] Wirth, E., Peña, J., &amp; Pokutta, S. (2025). <em>Fast Convergence of Frank-Wolfe algorithms on polytopes</em>. to appear in Mathematics of Operations Research. <a href="https://arxiv.org/abs/2406.18789">arXiv:2406.18789</a></p>]]></content><author><name>Sebastian Pokutta</name></author><category term="research" /><category term="optimization" /><category term="step-size" /><category term="adaptive methods" /><summary type="html"><![CDATA[TL;DR: In our recent paper Adaptive Open-Loop Step-Sizes for Accelerated Convergence Rates of the Frank-Wolfe Algorithm with Elias Wirth and Javier Peña, we explore a new “log-adaptive” open-loop step-size for the Frank-Wolfe algorithm, $\eta_t = \frac{2 + \log(t+1)}{t+2 + \log(t+1)}$, which is adaptable both to favorable function properties and the feasible region in terms of convergence rates. In particular, it matches and often surpasses traditional fixed-parameter open-loop step-sizes across various settings, without needing prior knowledge of problem parameters, both in theory and computations.]]></summary></entry></feed>