When Code Is Free, What Do You Measure?

The CPO of Ramp recently went on YouTube to hype their “AI-native” approach: 50% of their PRs are AI-generated. With 200 engineers and 25 PMs, their output is tremendous. It comes across as really, really impressive. But he doesn’t stop there. He goes on to tell us that PMs vibe-code, operators vibe-code, sales folks vibe-code. Everyone is encouraged to vibe-code solutions to their problems.

During the interview, he demos a feature being built in real time. While he talks, code appears, a PR is submitted, automatically reviewed, and deployed. “Hopefully this works,” he says. “I’m not sure if this worked.”

It did. He then spends the rest of the interview as if it had been a foregone conclusion, stating confidently that “if you’re not using Claude Code, you’re behind.”

When pressed on ROI by the interviewer, he demurred: “I haven’t done the math, but if I can submit 500 PRs a month, there’s no expense that makes that not worth it.” (paraphrasing)

Translation: he’s not measuring the right things. Neither is anyone else.

A Concrete Example

Don’t get me wrong. I’m a big fan of Claude Code. I’m using it to write some software on the side: a Discord clone, or close enough for this example.

Claude decided early on that it wanted to use a map[string]*Session to keep track of user sessions in the app. The key, Claude decided, would be the username, and the value would be the user’s session, which includes the TCP connection.

This is fine. It’s not what I would do, but it’s fine.

Later, I decided that I wanted a certain admin role to be held by multiple users – an anonymous moderator, if you will – which means that multiple users could be logged in as the same anonymous moderator. But wait! Our system only allows one session per username.

Claude Code’s suggestion? Rewrite the app to use map[string][]*Session, where the key is still the username and the value is now an array of sessions. Of course, to accomplish this, everywhere we grab a session we now need to add a for-loop. Trivial for Claude! Oh, and now we need to make sure that when a session is closed, we delete the appropriate entry in the array. Also trivial for Claude!

By the time it was done, it had updated nearly 1200 lines of code, announced its completion, and smugly remarked, “Perfect! Ready for the next phase?”

At that point I stopped it and said, “Hang on a second. Wouldn’t it be smarter to make the key a user ID instead?”

I reverted the change (thanks, git), then instructed Claude to do it again, but this time using the user ID as the key instead of the username.

Three minutes and 25 lines of code changed later, the fix was in. This was a simpler approach. A faster approach. A cheaper approach.
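Sketching the alternative (again with hypothetical names): each login mints its own ID, so the map stays flat and every operation stays a single lookup. Two people signed in as the same anonymous moderator just get two independent entries. No loops, no slice cleanup.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Session carries its display name; the map key is a per-login ID.
type Session struct {
	ID       uint64
	Username string
}

type SessionStore struct {
	mu       sync.Mutex
	nextID   atomic.Uint64
	sessions map[uint64]*Session // keyed by unique ID, not username
}

func NewSessionStore() *SessionStore {
	return &SessionStore{sessions: make(map[uint64]*Session)}
}

// Login mints a fresh ID, so duplicate usernames are a non-event.
func (s *SessionStore) Login(username string) *Session {
	sess := &Session{ID: s.nextID.Add(1), Username: username}
	s.mu.Lock()
	s.sessions[sess.ID] = sess
	s.mu.Unlock()
	return sess
}

// Logout is a single delete -- nothing to search, nothing to splice.
func (s *SessionStore) Logout(id uint64) {
	s.mu.Lock()
	delete(s.sessions, id)
	s.mu.Unlock()
}

func main() {
	store := NewSessionStore()
	a := store.Login("anon-mod")
	b := store.Login("anon-mod")
	fmt.Println(a.ID != b.ID, len(store.sessions)) // two distinct entries
	store.Logout(a.ID)
	fmt.Println(len(store.sessions))
}
```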

Claude Code would have happily introduced nearly 1200 more lines of code to solve the exact same problem. It would have cost me many times the tokens and context. It would have introduced complications down the road. And worse, if Claude Code didn’t catch them, those complications themselves would introduce even more complications down the road.

Here’s the lesson: bad code compounds. And you know what Claude Code is not good at? Recognizing that its code is bad.

When Code Is Free, What Do You Measure?

The reason the Ramp CPO won’t talk about ROI isn’t because he doesn’t have the numbers. It’s because the numbers everyone tracks don’t tell you if AI coding is working.

Traditional engineering metrics assume typing code is the bottleneck:

  • PRs shipped per month
  • Lines of code written
  • Velocity (story points completed)
  • Time to ship features

These made sense when human typing speed was the constraint. But when AI can generate 1200 lines in three minutes, these metrics become noise generators. They measure your intake, not your output.

The question isn’t “how much code did we generate?” It’s “how much code did we keep?”

The Metrics That Actually Matter

1. Rework Rate

What percentage of AI-generated code gets reverted, refactored, or rewritten within 30/60/90 days?

In my example, Claude Code generated 1200 lines that I immediately reverted. If I’d merged that PR, those 1200 lines would have lived in the codebase for maybe a week before I would have had to refactor them when the nested loops became a problem.

That’s a 100% rework rate. The code was produced, but not kept.

If Ramp is shipping 500 PRs/month and 40% need rework, they’re not moving 500x faster — they’re moving backwards. Rework isn’t just re-doing the work; it’s interrupting other work to go back and fix mistakes that compound.

How to measure it: Tag PRs as ai-assisted or human-written in your commit message or PR description. After 30 days, check: how many were touched again? That’s your rework rate.

2. Review Burden

How many human-hours does it take to review AI-generated code vs. human-written code?

A PR that takes 3 minutes to generate but 45 minutes to review is net-negative. Especially if the reviewer has to mentally (or literally!) execute the code to figure out whether map[string][]*Session is the right choice, or if they just shrug and approve it because “the AI probably knows what it’s doing.”

The second case is worse: you’ve just shipped technical debt with a stamp of approval.

If your senior engineers are spending 80% of their time reviewing and reworking AI PRs instead of building, you haven’t gained leverage at all. Instead, you’ve created a bottleneck.

3. Architectural Debt Accumulation

Does AI-generated code introduce complexity that compounds over time?

This is the hardest to measure but the most important. The map[string][]*Session decision isn’t wrong. It works!

But it spawns for-loops, array deletions, potential race conditions, and higher cognitive load for the next person touching that code.

That next person might be another AI coding session, which now has to work around the complexity from the first session. Which introduces more complexity. Which the third session has to work around…

Proxy metrics:

  • Lines of code touched per feature (trending up = complexity accumulating)
  • Time to implement similar features over time (getting slower = debt compounding)
  • “Refactors required to unblock new features”

4. Bug Escape Rate by Source

Do AI-generated features have higher production bug rates than human-written ones?

Tag PRs by generation method. Track bugs back to source. If AI code ships bugs at 3x the rate of human code, your “10x velocity” is actually 3x drag.
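The same tagging scheme supports this comparison too. A sketch, again with an invented record shape:

```go
package main

import "fmt"

// TaggedPR: a merged PR plus the production bugs later traced back to it.
type TaggedPR struct {
	AIAssisted bool
	Bugs       int
}

// BugRates returns production bugs per PR for AI-assisted and
// human-written PRs respectively (0 when a group is empty).
func BugRates(prs []TaggedPR) (ai, human float64) {
	var aiPRs, aiBugs, humanPRs, humanBugs int
	for _, pr := range prs {
		if pr.AIAssisted {
			aiPRs++
			aiBugs += pr.Bugs
		} else {
			humanPRs++
			humanBugs += pr.Bugs
		}
	}
	if aiPRs > 0 {
		ai = float64(aiBugs) / float64(aiPRs)
	}
	if humanPRs > 0 {
		human = float64(humanBugs) / float64(humanPRs)
	}
	return ai, human
}

func main() {
	// Invented numbers: AI PRs averaging 3 bugs each, human PRs 1.
	prs := []TaggedPR{
		{AIAssisted: true, Bugs: 3}, {AIAssisted: true, Bugs: 3},
		{AIAssisted: false, Bugs: 1}, {AIAssisted: false, Bugs: 1},
	}
	ai, human := BugRates(prs)
	fmt.Println(ai, human) // the 3x-vs-1x case from above
}
```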

5. The Person A vs. Person B Problem

Here’s the invisible metric: who’s using the AI tools?

Two people use Claude Code:

Person A sees the map[string][]*Session suggestion, pauses, thinks “wait, this introduces nested loops and array management everywhere,” and keys the map by user ID instead. Ships 25 lines of clean code in 3 minutes.

Person B sees the same suggestion, checks the output, runs it, watches it work, and confidently declares “it works!” then merges 1200 lines. Ships messy code in 3 minutes.

Both people generated code 10x faster than they could type. Both feel productive. Both get the dopamine hit. Both think they’re winning. But the outcomes are completely different.

Here’s what Person B doesn’t see:

They don’t see that “it works” and “it’s good” are not the same thing. They don’t notice that the AI introduced three for-loops where zero were needed. They don’t realize that the delete-from-array logic has an edge case that will surface in production in three months. They don’t understand that they just made the codebase harder to reason about for everyone who touches it next.

They tested the code. They ran the code. They verified the code works. And because it works, they assume it’s right. At worst, “good enough.”

But “works” is not the bar. “Works better than the alternatives” is the bar. And Person B doesn’t know how to evaluate alternatives because they don’t have the architectural intuition to generate them.

Person B thinks their job is to make the AI’s code work. Person A knows their job is to make the AI suggest better code in the first place.

The failure is invisible until three months later when Person B’s code has metastasized into a tangle of edge cases, and Person A is stuck reviewing the cleanup PRs instead of building new features.

Here’s the thing, though, the part that the CPOs and the boards and the hype machines keep missing: Person B usually has no idea they’re Person B. They see code appearing on screen. PRs merging. Velocity trending up. All the signals say “this is working.”

Most organizations can’t tell the difference between Person A and Person B until the codebase is on fire.

Why “50% of PRs Are AI-Generated” Tells You Nothing

When the Ramp CPO brags that 50% of their PRs are AI-generated, my first question is: who’s generating them?

If it’s the 200 engineers using AI as a force multiplier on their judgment – that is, people who can spot the map[string][]*Session trap – then that’s incredible.

If it’s the 25 PMs, the sales team, the operators vibe-coding their way through features because it feels productive, that’s a ticking time bomb.

The number doesn’t tell you which one you have. But your rework rate does. Your review burden does. Your bug escape rate does.

Fair pushback: “Even if 40% is rework, the remaining 60% is still a net win.”

Maybe. But only if the cost of reviewing, debugging, and unwinding the bad 40% doesn’t exceed the value of the good 60%. Most teams aren’t measuring either side of that equation.
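To make that equation concrete — with entirely made-up numbers, since nobody is publishing real ones — here’s the back-of-the-envelope version:

```go
package main

import "fmt"

// NetHours estimates engineer-hours gained (positive) or lost (negative)
// per month. Every input is an assumption you'd have to measure yourself.
func NetHours(prsPerMonth int, reworkRate, hoursSavedPerGoodPR, hoursLostPerBadPR float64) float64 {
	good := float64(prsPerMonth) * (1 - reworkRate)
	bad := float64(prsPerMonth) * reworkRate
	return good*hoursSavedPerGoodPR - bad*hoursLostPerBadPR
}

func main() {
	// 500 PRs/month, 40% rework, 2 hours saved per good PR, and
	// 4 hours of review/debug/unwind per bad one -- all invented.
	fmt.Println(NetHours(500, 0.4, 2, 4)) // 300*2 - 200*4 = -200
}
```

Under those (invented) numbers, a team shipping 500 PRs a month with a 40% rework rate comes out 200 engineer-hours behind — the “good 60%” loses. Swap in your own measurements; the point is that the sign of the answer depends on numbers almost nobody is tracking.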

The Ramp CPO didn’t talk about ROI because the industry doesn’t have a framework for measuring it yet. We’re all still counting PRs like it’s 1995, when the actual economics have completely flipped.

What This Means for You

If your board is pressuring you to “go AI-native” because Ramp did it, ask them: what does success look like?

If the answer is “more PRs” or “faster velocity,” you’re optimizing for the wrong thing. You’re measuring your intake, not your output.

The companies that win with AI coding tools won’t be the ones that generate the most code. They’ll be the ones with the best taste for what to keep and what to kill.

Measure what matters: not how much code you generate, but how much you keep.