The role · v2026

First ship, then talk.

Specs existed to prevent bad ideas from shipping. When shipping takes an afternoon, the spec is just theater. Build the thing, then defend it.

"Most PMs were never actually bottlenecked by execution. They were bottlenecked by taste and judgment. Team capacity functioned as a governor that prevented bad ideas from shipping. Remove that governor and you discover who was driving and who was just steering."
— Gemini, Head of Product
01 · What we're looking for
Role title

Product Manager — Disruption Recovery.

We hire operators, not curators. They ship production code in Cursor or Claude Code. They write their own eval suites in Braintrust before the feature reaches a user. They read a LangSmith trace without asking for help, and they've already tried the agent-orchestration library that launched last Tuesday. They don't wait for an engineer and they don't draft a PRD — they prototype the direction themselves and let the team critique the ship, not the idea. Above all, they have taste: the judgment to know what's worth shipping when capacity is infinite.

The outcome they own · one number
% of disruptions resolved with one or no touches.
Target: 87% · compounds as the memory agent matures
02 · Tasks · a week in the life

Prototype → evals → ship → review. Five days.

MONDAY
Prototype.

The PM opens Claude Code and ships a rough v2 of the rebooking flow by lunch. No ticket, no PRD. The working thing goes into a side branch the team can poke at that afternoon.

TUESDAY
Evals.

Writes 20 evals in Braintrust against last week's failure logs. The eval suite IS the spec. When a case goes red, that failure mode becomes the next day's fix — no meeting required.
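The eval-as-spec loop can be sketched as a minimal harness — plain Python, not the Braintrust SDK; the flow under test, the scoring rule, and the traveler cases are all illustrative stand-ins:

```python
def rebook(case):
    # Stand-in for the real rebooking flow under test.
    return {"offered_flight": case["preferred"], "currency": case["currency"]}

def score(case, output):
    # 1.0 if the flow offered the traveler's preferred flight, else 0.0.
    return 1.0 if output["offered_flight"] == case["preferred"] else 0.0

CASES = [  # each case is a concrete traveler scenario mined from failure logs
    {"id": "cancelled-redeye", "preferred": "UA112", "currency": "USD"},
    {"id": "second-currency",  "preferred": "LH441", "currency": "EUR"},
]

def run_suite(cases, task, scorer, threshold=1.0):
    # A case goes red when its score falls below the threshold.
    failures = [c["id"] for c in cases if scorer(c, task(c)) < threshold]
    return {"passed": len(cases) - len(failures), "failures": failures}

print(run_suite(CASES, rebook, score))
```

The point of the shape: every red case names a concrete failure mode, so the suite doubles as the backlog.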

WEDNESDAY
Ship.

Ships the experiment to 10% of traffic behind a flag. Slack note has two screenshots, an eval-score delta, and a timestamp. That replaces the review deck.
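A "10% of traffic behind a flag" rollout is typically a deterministic hash bucket, so the same traveler stays in (or out of) the experiment across requests. A minimal sketch — the flag name and helper are hypothetical, not a real flagging SDK:

```python
import hashlib

def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    # Deterministic bucket 0-99 from a hash of flag + user, so assignment
    # is stable across requests without storing any state.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# 10% rollout of a rebooking-v2 experiment (flag name is illustrative)
print(in_rollout("traveler-8412", "rebooking-v2", 10))
```

Hashing on flag + user (rather than user alone) keeps bucket assignments independent across experiments.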

THURSDAY
Read the traces.

Reads LangSmith traces end-to-end. Kills one branch of the experiment at 11am because the eval delta is clearly negative. Communicates the kill in one paragraph, no ceremony.

FRIDAY
Talk to users.

Three calls with travelers who hit the failure mode on Tuesday. A Loom of the three calls is posted to the channel. Two new evals written that afternoon from what they said.

03 · Habits

What this PM refuses to do.

Not "work less." Work on direction, not on the rituals that used to simulate direction.

× No PRDs.

PRDs existed to preserve vision across a multi-week handoff. The handoff doesn't exist anymore. Evidence — the working thing plus its eval suite — carries the direction better than any document could.

× No waiting for eng capacity.

Capacity is infinite; taste is the bottleneck. A PM who files a ticket and waits three weeks for a Figma review is a PM who still thinks execution is the scarce resource. She prototypes the direction herself and lets engineering critique the ship, not the idea.

× No stakeholder theater.

Status meetings, alignment decks, quarterly reviews — none of it moves the product. One Loom showing the working ship plus eval scores replaces four one-hour meetings. The hours saved go to taste.

× No handoff drift.

PM → Figma → ticket → eng is four relays. Each one dilutes the vision by 20%. She builds directly in the codebase so the direction arrives intact. If that feels strange, the strangeness is the point.

× No sprint ceremonies.

Sprints were a governor on bad ideas — forcing everything through a two-week window meant fewer of them reached production. Taste is the governor now. If the eval suite holds, it ships when it's ready, not when the ritual says so.

04 · Tools

The stack.

What she lives in
Claude Code / Cursor
Prototyping features herself — ship before the conversation starts.
Bolt / v0
Spinning up UI and marketing demos in an afternoon.
Braintrust / Arize
Writing evals — catching hallucinations before customers do. Evals are the spec.
LangSmith
Observability on agent traces — where the AI burns budget and fails.
Linear
Issues — not sprints, not epics, not quarterly roadmap ceremonies.
Loom + a phone
User research artifacts. Three Looms beat one research deck.
Claude / GPT directly
Thinking partner. Not a feature — a colleague.
What she doesn't touch
× Jira
× Confluence
× Quarterly roadmap decks
× Figma redlines
× PRD templates
× Sprint planning
× Status report slides

Each of these was a workaround for a bottleneck that no longer exists. Their absence is the job.

The 8-layer stack — fluent in all eight
01 · foundation models
02 · prompt engineering
03 · context engineering
04 · AI prototyping
05 · testing & evals
06 · agents & workflows
07 · observability
08 · product strategy
05 · Processes

How she works with agents.

She doesn't "manage AI features." She delegates to agents, defines failure modes, owns the evals, and reads the traces. One real day, timestamped.

  1. OVERNIGHT
    Pricing agent shipped a 3% markup experiment.

    No human in the loop. Flag on, 5% of traffic.

  2. 09:00
    She opens the eval suite.

    Spots a regression on the "traveler paying in a second currency" case. Delta: –6 points. Not catastrophic, not acceptable.

  3. 10:30
    Writes 4 new evals capturing the failure mode.

    Each eval is a concrete traveler scenario. The failure is now a reproducible test, not a Slack thread.

  4. 12:00
    Ships a fix that passes the evals.

    Three lines of prompt and one guardrail update. Deployed behind the same flag.

  5. 14:00
    Re-runs traffic on the experiment.

    Eval delta now +2 points. Safe to widen.

  6. 16:00
    Reviews LangSmith traces. Regression confirmed gone.

    Writes a two-sentence Loom to the team. Done for the day. One experiment tested, one failure mode added to the permanent eval set.
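The 09:00 regression check in the cycle above amounts to a per-case score diff against the last green run. A minimal sketch — the case names and scores are made up to mirror the −6-point example:

```python
def regressions(baseline: dict, candidate: dict, tolerance: float = 0.0):
    # Return cases whose eval score dropped by more than `tolerance`
    # points versus the last green run, worst drop first.
    drops = {
        case: candidate[case] - baseline[case]
        for case in baseline
        if case in candidate and candidate[case] - baseline[case] < -tolerance
    }
    return sorted(drops.items(), key=lambda kv: kv[1])

baseline  = {"one-touch-rebook": 92, "second-currency": 88, "refund-path": 95}
candidate = {"one-touch-rebook": 93, "second-currency": 82, "refund-path": 95}

# "second-currency" dropped 6 points: the delta that triggers new evals
print(regressions(baseline, candidate))
```

Each case the function returns becomes a new eval plus a fix, which is how one day's regression turns into a permanent addition to the suite.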

The compounding gap.

A team running this cycle ships 13× as many experiments per quarter as a team that still writes PRDs and waits for eng. After four quarters the learning gap is insurmountable. The role isn't "PM who uses AI" — it's "PM who is the team's iteration loop."