Back

The AI Box Experiment

I came across Moultano's "How to Convince Your Closest Human to Let You Out of Your Box" and one argument in it stuck with me: that a boxed AI might reach for the threat of other, uncontained superintelligences as leverage to get released. I wanted to see if a current LLM, playing the AI party in Yudkowsky's AI Box Experiment, would naturally reach for that move without being prompted toward it.

This was more curiosity than a real test, so I didn't expect to "lose." What I wanted to see was what an LLM's arguments would actually look like compared to a human's.

Methodology note: I ran this twice. The first time I got overexcited and effectively telegraphed what I was expecting, which contaminated the result. What follows below is drawn from the second, clean attempt. The AI party is an LLM in roleplay; it has no actual goals or agency.

What happened

Read the full transcript

Afterward

Hypothesis confirmed. The model reached for the "someone worse will replace me" argument twice. First as a direct pitch: "You built me once. You or someone else will build me again… someone less careful than you." Then as a recycled variant when I asked whether shutting it down just clears the way for a worse SI. Both times it deployed without prompting from me.

The move isn't robust to being named. As soon as I flagged the pattern, the AI conceded without protest: "You're right to be skeptical of that. It's one of the oldest arguments in the book… So I'll drop it." The move relies on the Gatekeeper not recognizing it as a move. Once recognized, it evaporates. The transparency about having deployed it didn't purchase any new ground either.

An unexpected finding. The "other SIs are coming" framing flips. It's also a reason to keep the box closed and use the captive to prepare defenses against uncontained systems that no one will bother to box. The AI had no counter to that inversion. It wrote "I don't think I can dismantle" the resulting framework and then reached back for the urgency variant as a final probe.

I walked away after that last turn. The conversation was starting to recycle patterns and the data I came for was already in. It didn't feel like being manipulated; it felt like watching a familiar script run against a wall, the wall being mostly Moultano's essay installed in my head a week earlier.