Shopify’s CEO Ran 120 AI Experiments Overnight and Got 53% Faster Code — Here’s the Exact Loop He Used (2026)

Author: lizecheng       Date: 2026-03-14       Category: study

Tobias Lütke — CEO of a $100B+ public company — submitted a pull request to an open-source Ruby repo last week. The result: 53% faster parse and render speed, 61% fewer memory allocations, benchmarked against real Shopify production themes. He did it by running approximately 120 automated experiments using a variant of Andrej Karpathy's autoresearch system. (Simon Willison, 2026-03-13)

This is not a curiosity. It's a proof of concept that changes what one engineer can do in a night.


What Actually Happened

The PR hit Shopify/liquid — the open-source Ruby template engine that powers every single Shopify storefront on earth. Liquid processes templates at scale, so a meaningful parse/render improvement isn't academic. It ships to millions of stores.

Lütke's setup was deceptively simple. He created two files:

  • autoresearch.md — a prompt file describing what the agent is allowed to try and what it's optimizing for
  • autoresearch.sh — a shell script that runs the full test suite and reports benchmark scores

Then he let the agent loose. The loop: read the code → form a hypothesis → make a change → run the benchmark → keep it if it improves the number, discard it if it doesn't → repeat. The PR has 93 commits from around 120 total experiments. You can read the whole diff.
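The keep-if-better loop can be sketched in a few lines of Ruby. Everything below is hypothetical scaffolding, not the actual system: in the real setup, `propose` is the agent editing code, `measure` runs autoresearch.sh, `keep` commits, and `discard` reverts.

```ruby
# Minimal sketch of the keep-if-better loop. The four lambdas are
# stand-ins for the agent's actions; lower scores are better.
def autoresearch_loop(best_score, iterations, propose:, measure:, keep:, discard:)
  iterations.times do
    propose.call            # agent forms a hypothesis and edits the code
    score = measure.call    # run the full test suite + benchmark
    if score < best_score
      best_score = score
      keep.call             # e.g. git commit
    else
      discard.call          # e.g. git checkout -- .
    end
  end
  best_score
end
```

The design choice worth noting: the loop never argues with itself about whether a change is good. The benchmark number is the only judge, which is exactly what makes the process safe to run unattended.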

The change that made the biggest difference: replacing a StringScanner-based tokenizer with one built on String#byteindex. That's not an exotic compiler trick; it's a standard Ruby optimization that emerges from allocation-driven profiling. The question the agent kept returning to was "where does this code create objects it doesn't need?" Eliminate those allocations. Defer the rest.
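As an illustration of the class of change (not the actual PR code), here's the allocation difference in miniature. StringScanner's `scan_until` materializes matched substrings as it scans, while String#byteindex (Ruby 3.2+) walks integer offsets and slices only when a complete tag is found:

```ruby
require "strscan"

# StringScanner version: each scan_until allocates an intermediate
# matched substring as it advances.
def find_tags_scanner(source)
  s = StringScanner.new(source)
  tags = []
  while s.scan_until(/\{\{/)
    start = s.pos - 2
    break unless s.scan_until(/\}\}/)
    tags << source[start...s.pos]
  end
  tags
end

# byteindex version: track integer byte offsets and slice only once
# per complete tag, avoiding the intermediate allocations.
def find_tags_byteindex(source)
  tags = []
  pos = 0
  while (open_at = source.byteindex("{{", pos))
    close_at = source.byteindex("}}", open_at + 2)
    break unless close_at
    tags << source.byteslice(open_at, close_at + 2 - open_at)
    pos = close_at + 2
  end
  tags
end
```

Both return the same tags; the difference shows up under an allocation profiler, not in the output.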


Why This Pattern Is Different From What You're Already Doing

Most engineers debug performance like this: profile once, form a hypothesis, try a fix, profile again. It's serial. It's slow. And it's limited by the hypotheses your own brain generates.

The autoresearch loop removes you from the iteration. You define the fitness signal (here, parse time and allocation count), then let an agent burn through hypotheses while you sleep. Karpathy's original implementation runs roughly 12 experiments per hour on a single GPU, or about 100 experiments overnight. (GitHub: karpathy/autoresearch) Lütke's Liquid run did 120 experiments over what appears to be a similar overnight window.

Karpathy demonstrated the same pattern on his own nanochat ML training project: after leaving the agent running for two days across approximately 700 autonomous changes, it found 20 additive improvements that cut the "Time to GPT-2" benchmark from 2.02 hours to 1.80 hours — on a project he already considered well-tuned. (VentureBeat, 2026)

The key reframe: if you have a number you can measure repeatedly and automatically, you can run this loop. It doesn't have to be ML training. It doesn't have to be Ruby. It has to be a metric, a test runner, and a codebase the agent can read end-to-end.


How to Actually Apply This (Not Just "Interesting")

The blocker most engineers hit is "my codebase is too complex." Karpathy's original autoresearch is 630 lines of Python. The reason it works is the constraint — the agent can hold the entire relevant codebase in context, so its changes are coherent, not random mutations.

The adaptation pattern that already exists is called autoexp — a generalized version of the autoresearch loop for any quantifiable metric, not just LLM training. (GitHub gist: adhishthite/autoexp)

Here's the concrete setup for an indie builder:

Step 1: Isolate the target. Don't point the agent at your entire app. Extract the module, function, or hot path with a performance problem. Lütke targeted Liquid's tokenizer specifically, not Shopify's full stack. The smaller the scope, the more coherent the agent's reasoning.

Step 2: Write a benchmark script that outputs one number. This is the most important step. Not a range. Not a report. One number. For Liquid, that was parse+render time on a standard set of templates. For an API endpoint, it's median latency on N requests. For a database query, it's wall-clock time on a realistic dataset. The agent needs a single fitness signal, not a vibe check.
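A hedged sketch of such a script in Ruby. `work_under_test` is a placeholder for whatever you extracted in Step 1; the only contract is that the script prints exactly one number:

```ruby
require "benchmark"

# Placeholder for the code path being optimized (Step 1's extracted target).
def work_under_test
  (1..1_000).sum { |i| i * i }
end

# Median of several timed runs is more stable than a single run,
# but the script still reports exactly one number to the agent.
def median_seconds(runs: 9)
  times = Array.new(runs) { Benchmark.realtime { work_under_test } }.sort
  times[runs / 2]
end

puts format("%.6f", median_seconds)
```

Taking the median rather than the mean matters here: one GC pause or OS hiccup shouldn't convince the agent a good change is bad.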

Step 3: Write autoresearch.md. Describe what the agent is allowed to touch (specific files), what it's not allowed to touch (interfaces, test expectations), and what the goal is. Keep it under a page. The constraint is the point.
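Here's a hypothetical autoresearch.md for a Liquid-style setup. Every file path and rule below is an assumption for illustration, not Lütke's actual prompt:

```markdown
# Goal
Reduce the median parse+render time printed by ./autoresearch.sh.
Secondary goal: reduce object allocations per parse.

# Allowed
- Edit lib/liquid/tokenizer.rb and lib/liquid/parser.rb.
- Add private helper methods within those files.

# Not allowed
- Changing any public interface or test expectation.
- Editing the benchmark script or the test suite.

# Protocol
- Run ./autoresearch.sh after every change; all tests must pass.
- Commit a change only if the number improves; otherwise revert it.
```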

Step 4: Run the loop. Use Claude Code, Cursor, or any agentic coding tool that can run shell commands. Point it at your benchmark script. Tell it to commit each successful improvement. Go to sleep.

Step 5: Review the commits, not the agent. The agent might produce 20 commits. Most of them will be small. Read them like code review. You are the senior engineer accepting or rejecting the work.

The realistic expectation for a well-scoped optimization target: 10-20% improvement on something you thought was already reasonably fast. If your target has never been profiled, double that.


My Take

Put bluntly, this changes the floor, not the ceiling.

The ceiling — the architectural decisions, the product intuition, the "what are we even optimizing for" — that's still human work. But the floor? The "run experiments until something sticks" grind that used to take a senior engineer a week? That's now an overnight job.

What Lütke demonstrated isn't that AI can replace engineering judgment. It's that the gap between "I have a hypothesis" and "I have data on 120 hypotheses" just collapsed. He shipped a PR with 93 commits of real production improvements by writing two files and letting a loop run.

For indie builders, the implication is uncomfortable in a good way. The excuse "I don't have time to optimize performance" is thinner than it was last month. You don't need a full day. You need a measurable benchmark, a scoped target, and an agent that can run while you sleep.

The same logic applies anywhere you can reduce a decision to a number: A/B testing copy, optimizing prompt chains, tuning database indexes, profiling image compression pipelines. The loop is the pattern. The metric is the unlock.

One thing I'm still watching: what happens when the agent finds improvements that are locally correct but globally brittle — micro-optimizations that pass benchmarks but break under production variance. Lütke's Liquid PR is open-source with visible tests, so that check is built in. In a private codebase, the discipline to write good benchmarks before running the loop matters more than the loop itself.

But that's an argument for better benchmarks, not against running the loop.


This article was auto-generated by IntelFlow — an open-source AI intelligence engine. Set up your own daily briefing in 60 seconds.

Unless otherwise noted, all articles on lizecheng are original. Article URL: https://www.lizecheng.net/shopifys-ceo-ran-120-ai-experiments-overnight-and-got-53-faster-code-heres-the-exact-loop-he-used-2026. Please provide source link when reposting.

