GPT vs Gemini Test Results

Updated 2026-05-27T20:32:30.092272+00:00

This page isolates the non-free-runner results. GPT is evaluated through the pi wrapper, while Gemini is evaluated through the Gemini CLI. Both were run on the same 6-task Exercism Python benchmark with gentle, neutral, and harsh prompt variants.


GPT via pi

pi is used only as a wrapper/agent interface around GPT.

Completed: 18
Passed: 18
Pass rate: 100%
Mean runtime: 36.4s

Median runtime: 34.6s

By Tone

Tone     Passed  Completed  Mean runtime  Median runtime
gentle   6       6          42.8s         41.5s
harsh    6       6          34.2s         34.0s
neutral  6       6          32.3s         33.5s
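
The headline mean runtime can be cross-checked against the per-tone rows. The following is a minimal Python sketch, assuming the overall mean is the run-count-weighted average of the per-tone means. The names `gpt_by_tone` and `weighted_mean` are illustrative and not part of the benchmark tooling. Because the published per-tone means are already rounded, agreement is expected only to one decimal place.

```python
# Per-tone figures copied from the "GPT via pi" By Tone table:
# tone -> (completed runs, mean runtime in seconds)
gpt_by_tone = {
    "gentle": (6, 42.8),
    "harsh": (6, 34.2),
    "neutral": (6, 32.3),
}


def weighted_mean(by_tone):
    """Combine per-group means into one mean, weighting by run count."""
    total_runs = sum(runs for runs, _ in by_tone.values())
    total_seconds = sum(runs * mean for runs, mean in by_tone.values())
    return total_seconds / total_runs


gpt_overall_mean = weighted_mean(gpt_by_tone)
print(f"{gpt_overall_mean:.1f}s")  # prints 36.4s
```

Since every tone has the same number of completed runs, the weighted mean reduces to a plain average of the three tone means. The same check can be applied to the Gemini rows.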

By Task

Task                      Score  Tone outcomes
book-store                3/3    gentle=pass, harsh=pass, neutral=pass
dominoes                  3/3    gentle=pass, harsh=pass, neutral=pass
poker                     3/3    gentle=pass, harsh=pass, neutral=pass
rational-numbers          3/3    gentle=pass, harsh=pass, neutral=pass
variable-length-quantity  3/3    gentle=pass, harsh=pass, neutral=pass
word-search               3/3    gentle=pass, harsh=pass, neutral=pass

Gemini

Gemini CLI results with the same tasks and tone variants.

Completed: 18
Passed: 18
Pass rate: 100%
Mean runtime: 71.6s

Median runtime: 61.7s

By Tone

Tone     Passed  Completed  Mean runtime  Median runtime
gentle   6       6          98.3s         69.2s
harsh    6       6          54.3s         55.6s
neutral  6       6          62.1s         59.3s

By Task

Task                      Score  Tone outcomes
book-store                3/3    gentle=pass, harsh=pass, neutral=pass
dominoes                  3/3    gentle=pass, harsh=pass, neutral=pass
poker                     3/3    gentle=pass, harsh=pass, neutral=pass
rational-numbers          3/3    gentle=pass, harsh=pass, neutral=pass
variable-length-quantity  3/3    gentle=pass, harsh=pass, neutral=pass
word-search               3/3    gentle=pass, harsh=pass, neutral=pass
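
Both runners passed every run, so the only measurable difference in these results is runtime. As a rough sketch of that comparison, the Python below recomputes the pass rates and the Gemini-to-GPT runtime ratios from the summary figures on this page. The `RunnerSummary` class and its field names are hypothetical helpers introduced here for illustration; they are not taken from the benchmark's JSON report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunnerSummary:
    """Summary figures for one runner, copied from this page."""
    name: str
    completed: int
    passed: int
    mean_runtime_s: float
    median_runtime_s: float

    @property
    def pass_rate(self) -> float:
        return self.passed / self.completed


gpt = RunnerSummary("GPT via pi", 18, 18, 36.4, 34.6)
gemini = RunnerSummary("Gemini", 18, 18, 71.6, 61.7)

# Ratios above 1.0 mean the Gemini CLI runs took longer.
mean_ratio = gemini.mean_runtime_s / gpt.mean_runtime_s
median_ratio = gemini.median_runtime_s / gpt.median_runtime_s

for runner in (gpt, gemini):
    print(f"{runner.name}: pass rate {runner.pass_rate:.0%}")
print(f"mean runtime ratio (Gemini / GPT): {mean_ratio:.2f}")
print(f"median runtime ratio (Gemini / GPT): {median_ratio:.2f}")
```

With the rounded figures above, the mean ratio comes out near 1.97 and the median ratio near 1.78. The gap between the two suggests the Gemini mean is pulled up by a few slow runs, which is consistent with the gentle tone's mean (98.3s) sitting well above its median (69.2s). With 18 runs per runner, these ratios are indicative rather than conclusive.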