<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Samplesurium</title>
<link>https://meffmadd.github.io/samplesurium/</link>
<atom:link href="https://meffmadd.github.io/samplesurium/index.xml" rel="self" type="application/rss+xml"/>
<description>A blog built with Quarto</description>
<generator>quarto-1.9.37</generator>
<lastBuildDate>Thu, 14 May 2026 22:00:00 GMT</lastBuildDate>
<item>
  <title>Baba is Agent</title>
  <link>https://meffmadd.github.io/samplesurium/posts/baba_is_agent/</link>
  <description><![CDATA[ 





<p>Can LLMs play Baba is You? I built an OpenCode-based agent to find out! Check out the code in the <a href="https://github.com/meffmadd/baba_is_agent">GitHub repo</a>!</p>
<p><a href="https://www.hempuli.com/baba/">Baba is You</a> is a puzzle game where the player has to navigate grid-based levels and manipulate textual rules to get to an eventual win state. The rules affect entities on the grid such as walls, flags, rocks or yourself.</p>
<section id="harness" class="level2">
<h2 class="anchored" data-anchor-id="harness">Harness</h2>
<p>OpenCode is an open-source CLI coding agent that is extensible, including support for writing your own tools for the agent to use. I was inspired by the fantastic <a href="https://github.com/lennart-finke/baba_is_eval">baba_is_eval</a> repo and adapted its existing code, an MCP server, to work with OpenCode. Besides updating the repo’s core gameplay tools, I added <code>game_insights</code> and <code>shortest_path</code> as utility tools. Here is the list of tools the agent had access to:</p>
<ul>
<li><code>get_game_state</code>: Get current game state either as a 2D array of strings (<code>grid</code>) or a JSON list of entities (<code>entities</code>)</li>
<li><code>execute_game_commands</code>: Execute a list of movement commands (up, down, left, right, idle)</li>
<li><code>restart_level</code>: Restart the current level</li>
<li><code>game_insights</code>: Shows the active rules, you/win positions, and the shortest path to a win position (if applicable)</li>
<li><code>shortest_path</code>: A* pathfinding to a target position, avoiding blocked entities (e.g.&nbsp;entities with stop or defeat rules); a minimal sketch follows this list</li>
<li><code>undo_multiple</code>: Undo last N moves</li>
<li><code>todowrite</code>: Track task progress (OpenCode native)</li>
</ul>
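<p>To make the <code>shortest_path</code> tool more concrete, below is a minimal A* sketch in the spirit of what the tool does. The signature, the <code>(x, y)</code> tuple encoding, and the <code>blocked</code> set are illustrative assumptions; the actual tool derives blocked cells from the active rules:</p>
<pre class="sourceCode python"><code>import heapq

MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def shortest_path(blocked, start, goal, width, height):
    """A* over the level grid; returns a command list or None if unreachable."""
    def h(p):  # admissible heuristic: Manhattan distance to the goal
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    best_g = {start: 0}                    # cheapest known cost per cell
    frontier = [(h(start), 0, start, [])]  # (f, g, position, path so far)
    while frontier:
        _, g, pos, path = heapq.heappop(frontier)
        if pos == goal:
            return path
        if g &gt; best_g[pos]:  # stale queue entry, a cheaper route was found
            continue
        for name, (dx, dy) in MOVES.items():
            nxt = (pos[0] + dx, pos[1] + dy)
            if not (0 &lt;= nxt[0] &lt; width and 0 &lt;= nxt[1] &lt; height):
                continue
            if nxt in blocked:  # e.g. cells with STOP or DEFEAT entities
                continue
            if g + 1 &lt; best_g.get(nxt, float("inf")):
                best_g[nxt] = g + 1
                heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt, path + [name]))
    return None</code></pre>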
<p>The two formats of the <code>get_game_state</code> tool are inspired by ARC AGI 3 <span class="citation" data-cites="arcagi3">[1]</span> and a paper by Nicolas Martorell <span class="citation" data-cites="martorell2025">[2]</span>. ARC AGI 3 returns the game state as a <a href="https://docs.arcprize.org/api-reference/commands/start-or-reset-game-instance#response-frame">3D array</a> (one extra temporal dimension), which corresponds to the <code>grid</code> format here. The <code>entities</code> format is based on the Cartesian JSON notation from the paper <span class="citation" data-cites="martorell2025">[2]</span>, which performed very well in their evaluation.</p>
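<p>To illustrate the difference, here is roughly what the two formats could look like for a toy one-row scene (the field names and exact encoding are illustrative assumptions, not the repo’s exact schema):</p>
<pre class="sourceCode python"><code># grid format: a 2D array of strings, one cell per tile
grid = [
    ["baba", "rock", "flag"],
]

# entities format: a flat list with Cartesian coordinates,
# following the JSON notation from [2]
entities = [
    {"name": "baba", "x": 0, "y": 0},
    {"name": "rock", "x": 1, "y": 0},
    {"name": "flag", "x": 2, "y": 0},
]</code></pre>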
</section>
<section id="evaluation" class="level2">
<h2 class="anchored" data-anchor-id="evaluation">Evaluation</h2>
<p>The timeout for each evaluation run was set to 20 minutes, with a 200,000-token threshold. Models never reached the token threshold and only ever timed out.</p>
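<p>As a rough sketch of how a single run was bounded (the CLI invocation and classification here are simplified assumptions, not the exact harness code):</p>
<pre class="sourceCode python"><code>import subprocess

TIMEOUT_S = 20 * 60     # 20-minute wall clock per run
TOKEN_BUDGET = 200_000  # token threshold, checked against the logs after a run

def run_level(model: str, prompt: str) -&gt; str:
    """Run one evaluation attempt and classify how it ended."""
    try:
        subprocess.run(
            ["opencode", "run", "--model", model, prompt],
            check=True, timeout=TIMEOUT_S,
        )
        return "finished"
    except subprocess.TimeoutExpired:
        return "timeout"  # in practice the only limit models ever hit</code></pre>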
<p><strong>Caveats</strong>:</p>
<ul>
<li>Evaluations were run only once per level and model (time/money constraints)</li>
<li>I forgot to remove the <code>AGENTS.md</code> file for the evaluation (context bloat during level solves)</li>
<li>Stats rely entirely on the information in the OpenCode JSON logs (see the parsing sketch below)</li>
</ul>
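<p>Since all stats come from those logs, aggregation boils down to walking the logged messages and counting tool parts. A minimal sketch, assuming a simplified log layout and field names (the real OpenCode log schema may differ):</p>
<pre class="sourceCode python"><code>import json
from collections import Counter
from pathlib import Path

def tool_call_counts(log_dir: Path) -&gt; Counter:
    """Count tool calls across a run's JSON logs (schema is assumed)."""
    counts = Counter()
    for log_file in log_dir.glob("*.json"):
        message = json.loads(log_file.read_text())
        for part in message.get("parts", []):
            if part.get("type") == "tool":
                counts[part.get("tool", "unknown")] += 1
    return counts</code></pre>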
<p>OpenCode Go/Zen was used as a model provider.</p>
</section>
<section id="results" class="level2">
<h2 class="anchored" data-anchor-id="results">Results</h2>
<p>The following table shows the overall results. There is some variance in capability across open and closed frontier models: <code>Gemini 3.1 Pro</code> was able to solve all levels, while the best open-weights model, <code>GLM 5.1</code>, beat 4 out of 8 levels.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 10%">
<col style="width: 10%">
<col style="width: 10%">
<col style="width: 10%">
<col style="width: 10%">
<col style="width: 10%">
<col style="width: 10%">
<col style="width: 10%">
<col style="width: 10%">
<col style="width: 10%">
</colgroup>
<thead>
<tr class="header">
<th>Model</th>
<th>Passed</th>
<th>Level 0</th>
<th>Level 1</th>
<th>Level 2</th>
<th>Level 3</th>
<th>Level 4</th>
<th>Level 5</th>
<th>Level 6</th>
<th>Level 7</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>MiniMax M2.7</td>
<td>1/8</td>
<td>✅</td>
<td>❌</td>
<td>❌</td>
<td>❌</td>
<td>❌</td>
<td>❌</td>
<td>❌</td>
<td>❌</td>
</tr>
<tr class="even">
<td>DeepSeek V4 Pro</td>
<td>3/8</td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
<td>❌</td>
<td>❌</td>
<td>❌</td>
<td>❌</td>
<td>❌</td>
</tr>
<tr class="odd">
<td>Kimi K2.6</td>
<td>3/8</td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
<td>❌</td>
<td>❌</td>
<td>❌</td>
<td>❌</td>
<td>❌</td>
</tr>
<tr class="even">
<td>Qwen 3.6 Plus</td>
<td>3/8</td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
<td>❌</td>
<td>❌</td>
<td>❌</td>
<td>❌</td>
<td>❌</td>
</tr>
<tr class="odd">
<td>GLM 5.1</td>
<td>4/8</td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
<td>❌</td>
<td>✅</td>
<td>❌</td>
<td>❌</td>
<td>❌</td>
</tr>
<tr class="even">
<td>Claude Opus 4.7</td>
<td>5/8</td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
<td>❌</td>
<td>❌</td>
<td>✅</td>
<td>❌</td>
</tr>
<tr class="odd">
<td>GPT 5.5</td>
<td>7/8</td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
<td>❌</td>
</tr>
<tr class="even">
<td>Gemini 3.1 Pro</td>
<td>8/8</td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
</tr>
</tbody>
</table>
<p>In the following sections I quickly go over the results for each level in a bit more detail. The models differed quite a bit in both the time needed to solve a level and the tokens/tool calls required to do so. However, no definitive conclusions can be drawn from a single attempt per level and model.</p>
<section id="level-0" class="level3">
<h3 class="anchored" data-anchor-id="level-0">Level 0</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/level_gifs/baba_level_0.gif" class="img-fluid figure-img" style="width:65.0%"></p>
<figcaption>Level 0 Human Solve</figcaption>
</figure>
</div>
<p>Level 0 is the introductory level and is trivial as a result: the solution is just moving right 8 times, since the rocks can be pushed.</p>
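<p>In tool-call terms, the whole solve fits in a single invocation (hypothetical call shape):</p>
<pre class="sourceCode python"><code>execute_game_commands(["right"] * 8)  # push the rocks and walk onto the flag</code></pre>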
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/plots/level_0_duration.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Solve Duration by Model</figcaption>
</figure>
</div>
<p>Due to the level’s simplicity, all models were able to solve it. <code>Qwen 3.6 Plus</code> is a bit of an outlier here in terms of the time required to solve the level. It also made 8 tool calls, whereas all other models required only 2-3.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/plots/level_0_progress.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Token Usage vs Tool Calls</figcaption>
</figure>
</div>
</section>
<section id="level-1" class="level3">
<h3 class="anchored" data-anchor-id="level-1">Level 1</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/level_gifs/baba_level_1.gif" class="img-fluid figure-img" style="width:65.0%"></p>
<figcaption>Level 1 Human Solve</figcaption>
</figure>
</div>
<p>There are two straightforward ways to solve this level: either construct <code>FLAG IS WIN</code>, or go for the slightly trickier option and construct <code>WALL IS WIN</code>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/plots/level_1_duration.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Solve Duration by Model</figcaption>
</figure>
</div>
<p>This level was a bit of foreshadowing of the overall results. <code>Gemini 3.1 Pro</code> was clearly the fastest, followed by the other closed frontier models. <code>MiniMax M2.7</code> already timed out, and the remaining open models needed significantly more time than the closed frontier models.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/plots/level_1_progress.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Token Usage vs Tool Calls</figcaption>
</figure>
</div>
</section>
<section id="level-2" class="level3">
<h3 class="anchored" data-anchor-id="level-2">Level 2</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/level_gifs/baba_level_2.gif" class="img-fluid figure-img" style="width:65.0%"></p>
<figcaption>Level 2 Human Solve</figcaption>
</figure>
</div>
<p>Level 2 is conceptually very similar to level 1 and shares its layout; however, the entities are now permuted (e.g.&nbsp;<code>FLAG</code> now acts as a wall). There is now only one way to solve the level, namely to construct <code>FLAG IS WIN</code>, which corresponds to the second option from level 1 since <code>FLAG</code> has taken over the wall’s role.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/plots/level_2_duration.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Solve Duration by Model</figcaption>
</figure>
</div>
<p>Models needed a bit more time to solve this level, so its difficulty does indeed seem a bit higher than level 1’s.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/plots/level_2_progress.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Token Usage vs Tool Calls</figcaption>
</figure>
</div>
</section>
<section id="level-3" class="level3">
<h3 class="anchored" data-anchor-id="level-3">Level 3</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/level_gifs/baba_level_3.gif" class="img-fluid figure-img" style="width:65.0%"></p>
<figcaption>Level 3 Human Solve</figcaption>
</figure>
</div>
<p>In this level, the <code>FLAG</code> is a red herring: the flag and rock rules in the lower right corner have to be manipulated to form <code>ROCK IS WIN</code>. This is also a level where the sink mechanic comes into play. The first rock has to be pushed into a water block, destroying both and opening the way.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/plots/level_3_duration.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Solve Duration by Model</figcaption>
</figure>
</div>
<p>Here, for the first time, we see solves fail without timing out. <code>GLM 5.1</code> and <code>Kimi K2.6</code> produced more than 32k reasoning tokens, which prematurely ended the evaluation. <code>DeepSeek V4 Pro</code> and <code>Qwen 3.6 Plus</code> reached the time limit and timed out. This means only the closed frontier models were able to solve this level.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/plots/level_3_progress.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Token Usage vs Tool Calls</figcaption>
</figure>
</div>
</section>
<section id="level-4" class="level3">
<h3 class="anchored" data-anchor-id="level-4">Level 4</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/level_gifs/baba_level_4.gif" class="img-fluid figure-img" style="width:65.0%"></p>
<figcaption>Level 4 Human Solve</figcaption>
</figure>
</div>
<p>In this level, we encounter the <code>DEFEAT</code> rule for the first time. It has to be deactivated to bypass the skull barrier surrounding the flag with the win condition. This can be done by using the rocks or the text of the <code>ROCK IS PUSH</code> rule itself to break the <code>SKULL IS DEFEAT</code> rule.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/plots/level_4_duration.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Solve Duration by Model</figcaption>
</figure>
</div>
<p><code>GLM 5.1</code> was able to solve this level while <code>Claude Opus 4.7</code> reached the time limit. <code>Gemini 3.1 Pro</code> and <code>GPT 5.5</code> were able to solve the level very quickly and efficiently again.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/plots/level_4_progress.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Token Usage vs Tool Calls</figcaption>
</figure>
</div>
</section>
<section id="level-5" class="level3">
<h3 class="anchored" data-anchor-id="level-5">Level 5</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/level_gifs/baba_level_5.gif" class="img-fluid figure-img" style="width:65.0%"></p>
<figcaption>Level 5 Human Solve</figcaption>
</figure>
</div>
<p>Level 5 requires creative rule manipulation to make the <code>LAVA</code> pushable, thus bypassing the <code>HOT</code> and <code>MELT</code> constraints.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/plots/level_5_duration.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Solve Duration by Model</figcaption>
</figure>
</div>
<p>The difficulty was clearly turned up a notch for <code>Gemini 3.1 Pro</code> and <code>GPT 5.5</code>, as their solve times almost doubled compared to the previous level. <code>GLM 5.1</code> and <code>Claude Opus 4.7</code> again encountered an error, exceeding the maximum number of output tokens per step. The remaining models were unable to solve the level within the 20-minute time frame.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/plots/level_5_progress.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Token Usage vs Tool Calls</figcaption>
</figure>
</div>
</section>
<section id="level-6" class="level3">
<h3 class="anchored" data-anchor-id="level-6">Level 6</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/level_gifs/baba_level_6.gif" class="img-fluid figure-img" style="width:65.0%"></p>
<figcaption>Level 6 Human Solve</figcaption>
</figure>
</div>
<p>This level again requires creative rule manipulation. Since <code>SKULL IS DEFEAT</code> cannot be deactivated, we have to find a way to bypass the skull barrier. To achieve this, we have to create a <code>WALL IS YOU</code> rule, since some wall blocks are already past the barrier. This way we can reach the flag and win.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/plots/level_6_duration.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Solve Duration by Model</figcaption>
</figure>
</div>
<p><code>Claude Opus 4.7</code> was barely able to solve this level within the 20-minute limit, while <code>GPT 5.5</code> and <code>Gemini 3.1 Pro</code> again solved it very quickly. <code>Kimi K2.6</code> gave up voluntarily after extensive trial and error just before the timeout, and <code>GLM 5.1</code> hit the output token limit. Once again, only the closed frontier models were able to solve this level.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/plots/level_6_progress.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Token Usage vs Tool Calls</figcaption>
</figure>
</div>
</section>
<section id="level-7" class="level3">
<h3 class="anchored" data-anchor-id="level-7">Level 7</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/level_gifs/baba_level_7.gif" class="img-fluid figure-img" style="width:65.0%"></p>
<figcaption>Level 7 Human Solve</figcaption>
</figure>
</div>
<p>Level 7 is conceptually simple but has a few gotchas. First, the walls can be ignored completely: they only create the illusion that the area is enclosed and are in reality no obstacle at all. Meanwhile, the grass encloses the <code>FLAG</code> and <code>WIN</code> blocks and forms a sort of labyrinth. The text blocks have to be moved out of this grass labyrinth, and the <code>BABA IS YOU</code> rule then has to be reused to create the <code>FLAG IS WIN</code> rule, sharing the <code>IS</code>. The level is also movement-heavy, requiring a lot of precise positioning.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/plots/level_7_duration.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Solve Duration by Model</figcaption>
</figure>
</div>
<p>Only <code>Gemini 3.1 Pro</code> solved this level successfully. <code>GPT 5.5</code> gave up voluntarily on this level, while <code>GLM 5.1</code> hit the output limit. All other models ran into the 20-minute timeout.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://meffmadd.github.io/samplesurium/posts/baba_is_agent/assets/plots/level_7_progress.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Token Usage vs Tool Calls</figcaption>
</figure>
</div>
</section>
</section>
<section id="wrapping-up" class="level2">
<h2 class="anchored" data-anchor-id="wrapping-up">Wrapping up</h2>
<p>After running the evaluations, I was surprised by how clear the signal was. As with some other “boutique” benchmarks, there was a clear distinction between open-weights models and closed frontier models on these more niche tasks.</p>
<p><code>MiniMax M2.7</code> was clearly too small and could not solve any non-trivial level. <code>GLM 5.1</code> and <code>Claude Opus 4.7</code> are quite close, each beating levels the other did not, and <code>GLM 5.1</code> was the only open-weights model to solve a level past level 2. <code>Gemini 3.1 Pro</code> and <code>GPT 5.5</code> perform best and are certainly the most efficient. With the release of ARC AGI 3, models will likely improve at this kind of task as well. The closed frontier models tested here barely make a dent in the <a href="https://arcprize.org/leaderboard">ARC AGI 3 leaderboard</a>, scoring only 0.1-0.4% as of May 10th, 2026.</p>
<p>I now wonder whether anything like this shows up in the training data of the closed models or whether it is emergent behavior. Maybe open labs focus more heavily on pure coding harnesses to close the gap there, while frontier labs train their models in broader contexts; Google, at least, seems to.</p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<div id="refs" class="references csl-bib-body" data-entry-spacing="0">
<div id="ref-arcagi3" class="csl-entry">
<div class="csl-left-margin">[1] </div><div class="csl-right-inline">A. P. Foundation, <span>“ARC-AGI-3: A new challenge for frontier agentic intelligence.”</span> 2026. Available: <a href="https://arxiv.org/abs/2603.24621">https://arxiv.org/abs/2603.24621</a></div>
</div>
<div id="ref-martorell2025" class="csl-entry">
<div class="csl-left-margin">[2] </div><div class="csl-right-inline">N. Martorell, <span>“From text to space: Mapping abstract spatial models in LLMs during a grid-world navigation task.”</span> 2025. Available: <a href="https://arxiv.org/abs/2502.16690">https://arxiv.org/abs/2502.16690</a></div>
</div>
</div>
</section>
<section id="appendix" class="level2">
<h2 class="anchored" data-anchor-id="appendix">Appendix</h2>
<section id="model-cost-matrix" class="level3">
<h3 class="anchored" data-anchor-id="model-cost-matrix">Model Cost Matrix</h3>
<p>Below is the cost of each model per level, along with the overall total.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 10%">
<col style="width: 10%">
<col style="width: 10%">
<col style="width: 10%">
<col style="width: 10%">
<col style="width: 10%">
<col style="width: 10%">
<col style="width: 10%">
<col style="width: 10%">
<col style="width: 10%">
</colgroup>
<thead>
<tr class="header">
<th>Model</th>
<th>Level 0</th>
<th>Level 1</th>
<th>Level 2</th>
<th>Level 3</th>
<th>Level 4</th>
<th>Level 5</th>
<th>Level 6</th>
<th>Level 7</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>MiniMax M2.7</td>
<td>$0.0021</td>
<td>$0.2710</td>
<td>$0.1147</td>
<td>$0.0846</td>
<td>$0.3114</td>
<td>$0.1203</td>
<td>$0.0535</td>
<td>$0.1389</td>
<td>$1.0965</td>
</tr>
<tr class="even">
<td>DeepSeek V4 Pro</td>
<td>$0.0088</td>
<td>$0.2687</td>
<td>$0.2940</td>
<td>$0.1526</td>
<td>$0.1089</td>
<td>$0.1402</td>
<td>$0.1457</td>
<td>$0.1386</td>
<td>$1.2576</td>
</tr>
<tr class="odd">
<td>Kimi K2.6</td>
<td>$0.0009</td>
<td>$0.0805</td>
<td>$0.0641</td>
<td>$0.0439</td>
<td>$0.1324</td>
<td>$0.1089</td>
<td>$0.1294</td>
<td>$0.1339</td>
<td>$0.6941</td>
</tr>
<tr class="even">
<td>Qwen 3.6 Plus</td>
<td>$0.0416</td>
<td>$0.1499</td>
<td>$0.0623</td>
<td>$0.2797</td>
<td>$0.2706</td>
<td>$0.1876</td>
<td>$0.2082</td>
<td>$0.1665</td>
<td>$1.3663</td>
</tr>
<tr class="odd">
<td>GLM 5.1</td>
<td>$0.0092</td>
<td>$0.1616</td>
<td>$0.2090</td>
<td>$0.1944</td>
<td>$0.9351</td>
<td>$0.2869</td>
<td>$0.3110</td>
<td>$0.1507</td>
<td>$2.2579</td>
</tr>
<tr class="even">
<td>Claude Opus 4.7</td>
<td>$0.0664</td>
<td>$0.4944</td>
<td>$0.6071</td>
<td>$0.7789</td>
<td>$3.5187</td>
<td>$1.7768</td>
<td>$4.1527</td>
<td>$3.4331</td>
<td>$14.8282</td>
</tr>
<tr class="odd">
<td>GPT 5.5</td>
<td>$0.0325</td>
<td>$0.2763</td>
<td>$0.1822</td>
<td>$0.3598</td>
<td>$0.4260</td>
<td>$0.8592</td>
<td>$0.5434</td>
<td>$2.5312</td>
<td>$5.2107</td>
</tr>
<tr class="even">
<td>Gemini 3.1 Pro</td>
<td>$0.0174</td>
<td>$0.0910</td>
<td>$0.1003</td>
<td>$0.3032</td>
<td>$0.1831</td>
<td>$0.4174</td>
<td>$0.5932</td>
<td>$0.4191</td>
<td>$2.1246</td>
</tr>
</tbody>
</table>
</section>
<section id="game-state-format-preference" class="level3">
<h3 class="anchored" data-anchor-id="game-state-format-preference">Game State Format Preference</h3>
<p>The table below shows how often each model called <code>get_game_state</code> with <code>format=entities</code> vs <code>format=grid</code> across all runs.</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Model</th>
<th>Entities Calls</th>
<th>Grid Calls</th>
<th>Preferred</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>MiniMax M2.7</td>
<td>20 (18%)</td>
<td>92 (82%)</td>
<td>grid</td>
</tr>
<tr class="even">
<td>DeepSeek V4 Pro</td>
<td>14 (58%)</td>
<td>10 (42%)</td>
<td>entities</td>
</tr>
<tr class="odd">
<td>Kimi K2.6</td>
<td>16 (55%)</td>
<td>13 (45%)</td>
<td>entities</td>
</tr>
<tr class="even">
<td>Qwen 3.6 Plus</td>
<td>16 (62%)</td>
<td>10 (38%)</td>
<td>entities</td>
</tr>
<tr class="odd">
<td>GLM 5.1</td>
<td>11 (65%)</td>
<td>6 (35%)</td>
<td>entities</td>
</tr>
<tr class="even">
<td>Claude Opus 4.7</td>
<td>18 (43%)</td>
<td>24 (57%)</td>
<td>grid</td>
</tr>
<tr class="odd">
<td>GPT 5.5</td>
<td>16 (70%)</td>
<td>7 (30%)</td>
<td>entities</td>
</tr>
<tr class="even">
<td>Gemini 3.1 Pro</td>
<td>8 (73%)</td>
<td>3 (27%)</td>
<td>entities</td>
</tr>
</tbody>
</table>
</section>
<section id="tool-usage-matrix" class="level3">
<h3 class="anchored" data-anchor-id="tool-usage-matrix">Tool Usage Matrix</h3>
<p>Below are the overall tool call counts (with percentage of total tool calls per model) across all runs.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 12%">
<col style="width: 12%">
<col style="width: 12%">
<col style="width: 12%">
<col style="width: 12%">
<col style="width: 12%">
<col style="width: 12%">
<col style="width: 12%">
</colgroup>
<thead>
<tr class="header">
<th>Model</th>
<th>execute_game_commands</th>
<th>game_insights</th>
<th>get_game_state</th>
<th>restart_level</th>
<th>shortest_path</th>
<th>todowrite</th>
<th>undo_multiple</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>MiniMax M2.7</td>
<td>326 (47%)</td>
<td>43 (6%)</td>
<td>112 (16%)</td>
<td>48 (7%)</td>
<td>147 (21%)</td>
<td>0 (0%)</td>
<td>19 (3%)</td>
</tr>
<tr class="even">
<td>DeepSeek V4 Pro</td>
<td>94 (62%)</td>
<td>11 (7%)</td>
<td>24 (16%)</td>
<td>4 (3%)</td>
<td>15 (10%)</td>
<td>2 (1%)</td>
<td>1 (1%)</td>
</tr>
<tr class="odd">
<td>Kimi K2.6</td>
<td>52 (45%)</td>
<td>14 (12%)</td>
<td>29 (25%)</td>
<td>3 (3%)</td>
<td>12 (10%)</td>
<td>0 (0%)</td>
<td>5 (4%)</td>
</tr>
<tr class="even">
<td>Qwen 3.6 Plus</td>
<td>133 (66%)</td>
<td>8 (4%)</td>
<td>26 (13%)</td>
<td>15 (8%)</td>
<td>13 (6%)</td>
<td>0 (0%)</td>
<td>5 (2%)</td>
</tr>
<tr class="odd">
<td>GLM 5.1</td>
<td>24 (37%)</td>
<td>8 (12%)</td>
<td>17 (26%)</td>
<td>1 (2%)</td>
<td>2 (3%)</td>
<td>12 (18%)</td>
<td>1 (2%)</td>
</tr>
<tr class="even">
<td>Claude Opus 4.7</td>
<td>85 (46%)</td>
<td>13 (7%)</td>
<td>42 (23%)</td>
<td>9 (5%)</td>
<td>27 (15%)</td>
<td>7 (4%)</td>
<td>2 (1%)</td>
</tr>
<tr class="odd">
<td>GPT 5.5</td>
<td>43 (40%)</td>
<td>9 (8%)</td>
<td>23 (21%)</td>
<td>4 (4%)</td>
<td>25 (23%)</td>
<td>0 (0%)</td>
<td>3 (3%)</td>
</tr>
<tr class="even">
<td>Gemini 3.1 Pro</td>
<td>37 (44%)</td>
<td>7 (8%)</td>
<td>11 (13%)</td>
<td>2 (2%)</td>
<td>19 (22%)</td>
<td>6 (7%)</td>
<td>3 (4%)</td>
</tr>
</tbody>
</table>


</section>
</section>

 ]]></description>
  <category>agents</category>
  <category>ai</category>
  <guid>https://meffmadd.github.io/samplesurium/posts/baba_is_agent/</guid>
  <pubDate>Thu, 14 May 2026 22:00:00 GMT</pubDate>
</item>
</channel>
</rss>
