Updated 2026-05-27T20:32:30.092272+00:00
This page isolates the non-free-runner results. GPT is evaluated through the pi wrapper, while Gemini is evaluated through the Gemini CLI. Both were run on the same 6-task Exercism Python benchmark with gentle, neutral, and harsh prompt variants.
pi is used only as a wrapper/agent interface around GPT.
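As a rough illustration of the benchmark layout described above, the sketch below enumerates the task and tone combinations each runner goes through. The task names are taken from the result tables on this page; `run_matrix` is a hypothetical helper name, not part of the actual harness.

```python
from itertools import product

# Task names as listed in the per-task tables on this page.
TASKS = [
    "book-store",
    "dominoes",
    "poker",
    "rational-numbers",
    "variable-length-quantity",
    "word-search",
]

# Prompt tone variants named in the introduction.
TONES = ["gentle", "neutral", "harsh"]


def run_matrix(tasks=TASKS, tones=TONES):
    """Return every (task, tone) pair one runner is evaluated on.

    Hypothetical helper: the real harness may order or schedule runs differently.
    """
    return list(product(tasks, tones))


if __name__ == "__main__":
    matrix = run_matrix()
    print(f"{len(matrix)} runs per runner")
```

With 6 tasks and 3 tones, this gives 18 runs per runner, which matches the 6 completed tasks per tone reported in the tables.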
Median runtime: 34.6s
| Tone | Passed | Completed | Mean runtime | Median runtime |
|---|---|---|---|---|
| gentle | 6 | 6 | 42.8s | 41.5s |
| harsh | 6 | 6 | 34.2s | 34.0s |
| neutral | 6 | 6 | 32.3s | 33.5s |
| Task | Score | Tone outcomes |
|---|---|---|
| book-store | 3/3 | gentle=pass, harsh=pass, neutral=pass |
| dominoes | 3/3 | gentle=pass, harsh=pass, neutral=pass |
| poker | 3/3 | gentle=pass, harsh=pass, neutral=pass |
| rational-numbers | 3/3 | gentle=pass, harsh=pass, neutral=pass |
| variable-length-quantity | 3/3 | gentle=pass, harsh=pass, neutral=pass |
| word-search | 3/3 | gentle=pass, harsh=pass, neutral=pass |
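The Score column in the per-task table appears to be the count of passing tones over the number of tones tried. A minimal sketch of that derivation follows, assuming the `tone=outcome` cell format shown above; `parse_outcomes` and `score` are hypothetical names, and the `fail` label used in the usage check is an assumption, since only `pass` appears on this page.

```python
def parse_outcomes(cell):
    """Turn 'gentle=pass, harsh=pass' into {'gentle': 'pass', 'harsh': 'pass'}."""
    pairs = (item.split("=", 1) for item in cell.split(","))
    return {tone.strip(): outcome.strip() for tone, outcome in pairs}


def score(cell):
    """Format the number of passing tones over the number of tones, e.g. '3/3'."""
    outcomes = parse_outcomes(cell)
    passed = sum(1 for outcome in outcomes.values() if outcome == "pass")
    return f"{passed}/{len(outcomes)}"


if __name__ == "__main__":
    print(score("gentle=pass, harsh=pass, neutral=pass"))
```

Every row in the table above would score 3/3 under this rule, consistent with the reported values.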
The Gemini CLI results below use the same tasks and tone variants.
Median runtime: 61.7s
| Tone | Passed | Completed | Mean runtime | Median runtime |
|---|---|---|---|---|
| gentle | 6 | 6 | 98.3s | 69.2s |
| harsh | 6 | 6 | 54.3s | 55.6s |
| neutral | 6 | 6 | 62.1s | 59.3s |
| Task | Score | Tone outcomes |
|---|---|---|
| book-store | 3/3 | gentle=pass, harsh=pass, neutral=pass |
| dominoes | 3/3 | gentle=pass, harsh=pass, neutral=pass |
| poker | 3/3 | gentle=pass, harsh=pass, neutral=pass |
| rational-numbers | 3/3 | gentle=pass, harsh=pass, neutral=pass |
| variable-length-quantity | 3/3 | gentle=pass, harsh=pass, neutral=pass |
| word-search | 3/3 | gentle=pass, harsh=pass, neutral=pass |
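Since both runners pass every task, the runtime columns are where they differ. The sketch below compares the per-tone numbers from the two tone tables; the figures are copied from this page, while `mean_ratio` and `mean_median_gap` are hypothetical helper names introduced for illustration only.

```python
# Per-tone (mean, median) runtimes in seconds, copied from the tone tables.
GPT_PI = {"gentle": (42.8, 41.5), "harsh": (34.2, 34.0), "neutral": (32.3, 33.5)}
GEMINI_CLI = {"gentle": (98.3, 69.2), "harsh": (54.3, 55.6), "neutral": (62.1, 59.3)}


def mean_ratio(tone):
    """Gemini CLI mean runtime divided by GPT (pi) mean runtime for one tone."""
    return GEMINI_CLI[tone][0] / GPT_PI[tone][0]


def mean_median_gap(results, tone):
    """Mean minus median in seconds; a large positive gap hints at a few slow runs."""
    mean, median = results[tone]
    return round(mean - median, 1)


if __name__ == "__main__":
    for tone in ("gentle", "neutral", "harsh"):
        print(
            f"{tone}: Gemini/GPT mean ratio = {mean_ratio(tone):.2f}, "
            f"Gemini mean-median gap = {mean_median_gap(GEMINI_CLI, tone)}s"
        )
```

On these numbers the Gemini CLI mean is roughly 1.6 to 2.3 times the GPT mean depending on tone. The gentle Gemini mean sits 29.1s above its median, which suggests one or more unusually slow runs rather than a uniform slowdown, though with 6 tasks per tone that reading is tentative.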