My role is shifting from product and design into design engineering. On the consumer side, I did the user research, ran database analytics, figured out what to build, and used an AI-native approach to shape the experience. Now I’m building an enterprise interface for an agentic workflow. It maps hundreds of signals from a building’s automation system to the energy services running on top of it. What to build is already defined. The risk lives on the human side — a wrong binding confirmed without anyone realizing, or trust lost the first time the AI does something weird. The real question is how the agents talk to the operator. Sometimes chat is enough. Sometimes you need a purpose-built interface.

At a small startup, you can’t staff a separate PM, designer, design engineer, frontend engineer, and backend engineer for every slice. So the lines blur. On this project, I owned design and frontend engineering end to end.

I built Forge to make that kind of ownership repeatable. It’s an agent harness. A coordinated team of AI specialists with structured artifacts, human review gates, and an adversarial second opinion from Codex at every phase boundary. The thesis is simple: with the right harness, one product-minded builder can ship work that survives engineering-level review.

What follows isn’t the whole workflow. Just the pieces I think are most distinctive. The harness is flexible by design, so the shape can be adapted to other contexts. Some specifics are omitted for confidentiality.

A cycle, seven agents, three human gates

Forge is a nine-stage cycle. Seven specialist agents run it, orchestrated by Yuki — my Claude Code instance, named after my dog. Yuki queues the specialists in the right order, prepares the briefings I read at each human gate, and tracks every finding as the cycle runs.

  • Step 1: Claude Design (Designer)
  • Step 2: Scenarios (PM)
  • Step 3: Risk profile (Engineer, PM)
  • Step 4: Contract (All)
  • Step 5: Build (Engineer)
  • Step 6: Smoke-test (Engineer)
  • Step 7: Grade (QA, Designer)
  • Step 8: Integrate + audit (Integration, Performance, Security)
  • Step 9: Scorecard + retro (Yuki)
Three human review gates, plus one conditional mid-build. Performance and Security audits run only when the surface's risk profile triggers them.
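Seen from Yuki’s side, the cycle is mostly a data structure to walk in order. A minimal sketch in TypeScript with hypothetical names; Forge’s real orchestration lives in prompts and markdown, not in code like this:

type Owner =
  | "Designer" | "PM" | "Engineer" | "QA"
  | "Integration" | "Performance" | "Security" | "Yuki" | "All";

interface Stage {
  step: number;
  name: string;
  owners: Owner[];
  humanGate?: "after" | "conditional"; // review gates sit at stage boundaries
}

// The nine stages as Yuki queues them. Gate placement follows the text above:
// after scenarios, after the contract, before final ship, plus one conditional mid-build.
const cycle: Stage[] = [
  { step: 1, name: "Claude Design",     owners: ["Designer"] },
  { step: 2, name: "Scenarios",         owners: ["PM"], humanGate: "after" },
  { step: 3, name: "Risk profile",      owners: ["Engineer", "PM"] },
  { step: 4, name: "Contract",          owners: ["All"], humanGate: "after" },
  { step: 5, name: "Build",             owners: ["Engineer"], humanGate: "conditional" },
  { step: 6, name: "Smoke-test",        owners: ["Engineer"] },
  { step: 7, name: "Grade",             owners: ["QA", "Designer"] },
  { step: 8, name: "Integrate + audit", owners: ["Integration", "Performance", "Security"], humanGate: "after" },
  { step: 9, name: "Scorecard + retro", owners: ["Yuki"] },
];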

The seven specialists, briefly:

  • Product Manager. Writes scenarios and use cases from our product principles doc. Eleven principles, each with a falsifiable test and explicit anti-patterns. The only agent that decides whether something should exist.
  • Frontend Engineer. Owns data contracts, implementation, feasibility, and integration. When artifacts conflict, FE proposes a resolution, gets the relevant agent to confirm or counter, and escalates anything unresolved.
  • Enterprise Designer. Reads our UX-pattern library and grades visual hierarchy, behavioral expression, and DS compliance. Two depths: light (tokens only) or full (interaction and states). Coordinates with our design-system repo when a needed component is missing.
  • QA Engineer. Writes scenario-traced functional and accessibility criteria, edge cases included. Grades each surface in two passes. The second pass explicitly looks for what the first one missed.
  • Performance Engineer. Pulled in for scale-sensitive or full surfaces. Sets structural constraints in the contract: virtualization thresholds, lazy loading boundaries, bundle splits, fetch strategy. Audits the implementation after build.
  • Security Engineer. Pulled in for data-sensitive or full surfaces. Defines auth level, tenant scoping, data exposure, and mutation safety. Audits the implementation after build.
  • Integration Reviewer. Looks across contract sections for contradictions, ambiguities, orphan requirements, hidden couplings, and unresolved assumptions. Reports findings only. FE owns resolution.

Three human gates are mandatory: after scenarios, after the contract is drafted, and before final ship. A fourth is conditional, triggered mid-build if assumptions wobble. The contract gate is the highest-leverage one by far. Catching a misalignment there is cheap. Catching it after build is not.

Design first or product first?

Why do we have so many roles in product? Designer, product designer, design engineer, product manager, product engineer. The distinctions are real, but the boundaries blur when one person with AI agents can cover more of the slice.

On this project, the work boils down to three concerns: what to build, how the interface should look, and how the implementation talks to the data. The interesting question isn’t which title leads. It’s whether you have a well-defined scope. If you do, design-first is viable. If you don’t, PM comes first. Someone has to figure out what to build before pixels land.

In my case, we already know what to build. Our CTO wrote a well-defined architectural spec and API, so design-first is the right starting point. The PM role still exists in the cycle, run by a PM agent.

Design with Claude Design

Claude Design arrived a few weeks into the enterprise onboarding workflow and changed the loop.

Before Claude Design, the loop was simpler. Feed the engineering spec to the PM agent. It produces use cases, scenarios, and edge cases. Claude Code takes those and generates pages using DS tokens and components. PM-first, scenario-first, code after.

Two things made the old loop painful on this project.

The first was hardware. To design for it, I had to learn vocabulary I’d never used before — sensors, setpoints, control circuits, watchdog timers, BMS signals. Without it I couldn’t even picture what a good interface should look like.

The second was agentic UI. What should the operator notice first? Where should questions live? When does chat help, and when does it just distract? Regardless of how much AI is under the hood, the software still has to serve the operator’s actual problem.

So I sent the onboarding spec to Claude Design first, not the PM agent. It mocked up the hardware layer: sensors, control circuits, what shows up when auto-binding fails, the operator’s link back to the building. The visualization filled my domain gaps and unlocked the rest of the design.

It also gave me a low-cost canvas. Iterating on small layout and microcopy changes inside Claude Design is cheap. Token-expensive, sure, but still far cheaper than iterating once the real UI has shipped.

When the design is good, I trigger Claude Design’s handoff. The artifacts drop into a scratchpad. Claude Code reads from there, and the rest of Forge runs against the mockup. The mockup is reference, not source. The design system owns the final tokens, colors, and typography. What carries forward into the build is the shape: layout, interaction patterns, microcopy, component composition. PM, Designer, and FE share it as reference.

Where Claude Design could go further

It works for simple sites. It strains against a mature design system.

A few pieces of feedback after using it on this project.

Importing the design system is still primitive. Even a limited import — foundation tokens, type, spacing, colors, semantic tokens — came through half wrong, with hallucinated values that didn’t exist in the source. That meant hours of manual correction.

Component usage is the next gap. It wasn’t obvious how to make Claude Design actually call our components when sketching, rather than invent its own.

The handoff is the third gap. Claude Design can push artifacts to Claude Code, but design isn’t a static activity. After Claude Code starts building, I want a revised mockup and a clean delta back into the build. That round-trip doesn’t exist yet. For bidirectional iteration, Claude Design still isn’t better than Figma. The handoff between design and code is manual.

The deeper opportunity is taste. We keep treating design taste as a judgment call. A lot of it isn’t. When should an animation use ease-in versus ease-out? At how many milliseconds? Conventions exist. Durations are measurable. Spacing rhythm, contrast ratios, motion curves, type hierarchy — all measurable. If Claude Design stops chasing Figma’s feature surface and starts thinking of itself as part of the code system, those parts of taste can become evaluable like any engineering practice. That’s the shift I want to see.
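A concrete example of what “evaluable taste” could look like: motion conventions written down as tokens an agent can check against, instead of a feeling a reviewer has to defend. A sketch with assumed values, not our design system’s actual numbers:

// Hypothetical motion tokens: the convention (enter decelerates, exit accelerates)
// and the durations are written down, so compliance is measurable, not argued.
const motionTokens = {
  enter: { easing: "ease-out", durationMs: 200 },     // arriving elements slow into place
  exit: { easing: "ease-in", durationMs: 150 },       // leaving elements accelerate away
  emphasis: { easing: "ease-in-out", durationMs: 300 },
} as const;

type MotionRole = keyof typeof motionTokens;

// An agent (or a lint rule) can then grade an implementation against the token,
// the same way token compliance is already checked for color and spacing.
function motionCompliant(role: MotionRole, easing: string, durationMs: number): boolean {
  const t = motionTokens[role];
  return t.easing === easing && Math.abs(t.durationMs - durationMs) <= 50; // tolerance is a judgment call
}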

Match ceremony to scope

Each surface gets a risk profile, and the profile decides which specialists participate. A static text panel doesn’t pull in a Performance Engineer. A five-thousand-row table does.

The FE agent proposes the profile after reading the scenarios. The PM agent confirms it. Two hands on the wheel: FE sees the technical surface area, PM sees what the scenarios imply about hidden risk. A “simple” display surface that actually exposes tenant-scoped data gets caught here, not later.

Five profiles:

  • Standard. FE plus QA plus Designer (tokens only).
  • High-polish. Add full Designer review.
  • Data-sensitive. Add Security review.
  • Scale-sensitive. Add Performance review.
  • Full. All specialists.

Triggers route surfaces automatically, so we don’t relitigate the mix every cycle. Anything that mutates data, touches a new endpoint, or reads tenant-scoped data goes data-sensitive. Anything streaming, real-time, or carrying more than a hundred items goes scale-sensitive. First screens go high-polish. Cross-surface state or auth-level changes go full. Triggers stack.
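The routing is mechanical enough to sketch. A rough TypeScript approximation with hypothetical trait names; in Forge the real triggers live in the agents’ instructions, not in code:

// Hypothetical sketch of how triggers stack extra reviews on top of the Standard baseline.
interface SurfaceTraits {
  mutatesData: boolean;
  touchesNewEndpoint: boolean;
  readsTenantScopedData: boolean;
  streamingOrRealtime: boolean;
  maxItems: number;
  isFirstScreen: boolean;
  crossSurfaceState: boolean;
  changesAuthLevel: boolean;
}

type ExtraReview = "Security" | "Performance" | "FullDesigner" | "AllSpecialists";

function routeProfile(t: SurfaceTraits): Set<ExtraReview> {
  const extras = new Set<ExtraReview>(); // baseline: FE + QA + Designer (tokens only)
  if (t.mutatesData || t.touchesNewEndpoint || t.readsTenantScopedData) extras.add("Security"); // data-sensitive
  if (t.streamingOrRealtime || t.maxItems > 100) extras.add("Performance");                     // scale-sensitive
  if (t.isFirstScreen) extras.add("FullDesigner");                                              // high-polish
  if (t.crossSurfaceState || t.changesAuthLevel) extras.add("AllSpecialists");                  // full
  return extras; // triggers stack: one surface can collect several
}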

The same scoping logic applies at the cycle level. /forge runs the full cycle for a new surface. /forge-feature runs a lighter cycle on an existing one. /forge-retrofit writes a contract for screens that pre-date the framework.

The contract is the highest-leverage moment

A Forge contract is a structured markdown document.

forge/contracts/major-feature/{screen}/contract.md
Section 0      Scenarios
Round 0        Surface Risk Profiles
Section 1      Data Contract
Section 2      Functional Criteria (QA)
Section 3      Visual Criteria (Designer)
Section 4      Feasibility Notes
Section 5      Performance Constraints
Section 6      Security Requirements
Human Gate     Records of reviews and approvals
Section 7      Contract Intelligence
Section 8      Grading table
Section 9      Amendments (running history)

Our contract is 1,066 lines of markdown. Every data field looks like this:

contract.md
Field:         building.name
Type:          string
UX shows as:   Page header
Scenario IDs:  S1, S2
Notes:         confirmed in upstream spec

Scenarios are the gravity. Every field, criterion, and grading row points back to a scenario ID. If it can’t, it doesn’t belong in the contract.
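That rule is easy to enforce mechanically. A minimal sketch of the idea, with hypothetical types rather than the format Forge actually parses:

// Hypothetical check: anything in the contract that can't trace to a known scenario gets flagged.
interface ContractField {
  field: string;          // e.g. "building.name"
  type: string;
  uxShowsAs: string;      // e.g. "Page header"
  scenarioIds: string[];  // e.g. ["S1", "S2"]
  notes?: string;
}

function untraceable(fields: ContractField[], knownScenarios: Set<string>): ContractField[] {
  return fields.filter(
    (f) => f.scenarioIds.length === 0 || f.scenarioIds.some((id) => !knownScenarios.has(id))
  );
}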

The contract gets written in three rounds, on top of Round 0’s risk profile. Round 1 is Data and Intent: FE drafts Section 1, PM reviews the product intent, Security joins if the surface is data-sensitive. Round 2 is Criteria and Constraints: QA writes functional, Designer writes visual, Performance writes structural if the surface is scale-sensitive. Round 3 is Feasibility and Integration: FE reviews the whole contract, locks risk profiles, replays any reviews affected by changes, and fills in Section 7. Codex runs an adversarial review at each round boundary.

The Designer agent picks up what the design system can’t. Token compliance is mechanical. Behavioral expression, hierarchy, state transitions, the “DS-compliant but feels wrong” feeling — that needs a human-trained eye. The design system catches violations. The Designer catches drift.

Human Gate

This is the gate I spend the most time at. Honestly, it’s also the most exhausting part of the cycle. The back-and-forth with the AI to get the contract right can run a couple of hours. The contract describes what’s about to be built across every surface. Cheap to change here. Expensive to change after FE has shipped two surfaces against the wrong assumption.

Build, check, grade

Nothing moves on until a surface clears three checks.

First, the boring machines: type check, lint, build, a manifest-driven audit on changed DS components, a schema-drift check, and a security lint. Fast, mechanical, no judgment.

Then Playwright opens the surface like a user would. Does it load? Do the links work? Did the button actually write what it claimed? Surfaces with live state get the two-tab check: change something in tab A, verify tab B updates without a refresh. It’s the bug I keep finding in my own code. The server pushes the update back to whoever made the change, and forgets everyone else watching the same record.
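The two-tab check itself is ordinary Playwright. A sketch with hypothetical routes and selectors, assuming a baseURL is set in the Playwright config:

import { test, expect } from "@playwright/test";

// Sketch of the two-tab check: one user, two tabs in the same browser context.
test("edits in tab A appear in tab B without a refresh", async ({ browser }) => {
  const context = await browser.newContext();
  const tabA = await context.newPage();
  const tabB = await context.newPage();

  await tabA.goto("/bindings/sensor-42");
  await tabB.goto("/bindings/sensor-42");

  // Tab A renames the binding.
  await tabA.getByLabel("Binding name").fill("Supply air temp");
  await tabA.getByRole("button", { name: "Save" }).click();

  // Tab B should pick up the change over the live connection — no reload() here.
  await expect(tabB.getByRole("heading", { name: "Supply air temp" })).toBeVisible();

  await context.close();
});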

Then the QA and Designer agents grade against the contract. Every finding carries a confidence tag: HIGH, MEDIUM, or LOW. A LOW pass doesn’t commit. A LOW finding sounds like “I think this meets the empty-state requirement, but I only verified the happy path.” That’s not a pass. That’s a request for evidence — another agent confirming, a fresh screenshot, or an explicit human override.
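In code terms, confidence is part of the verdict, not a footnote on it. A rough sketch with hypothetical shapes:

// Hypothetical sketch: a LOW-confidence pass doesn't count until evidence backs it up.
type Confidence = "HIGH" | "MEDIUM" | "LOW";

interface Finding {
  criterion: string;     // traced back to a contract section and scenario ID
  verdict: "pass" | "fail";
  confidence: Confidence;
  evidence: string[];    // another agent confirming, a fresh screenshot, an explicit human override
}

function surfaceCanCommit(findings: Finding[]): boolean {
  return findings.every(
    (f) => f.verdict === "pass" && (f.confidence !== "LOW" || f.evidence.length > 0)
  );
}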

A single grading pass tends to confirm rather than challenge. The second pass starts from “I missed something, find it” — and consistently surfaces issues the first pass overlooked.

All of this runs without me in the loop. If any layer fails, the surface bounces back: FE fixes, gates re-run, graders re-grade. The hardest part of the cycle is the contract — once that’s locked, build and verification are the perfect things to hand off. I’ll kick a surface off before bed and read the result the next morning with coffee.

Once individual surfaces clear grading, the Integration Reviewer looks for the weird stuff between them: contradictions, loose ends, hidden couplings, assumptions nobody noticed they were making. Performance and Security audit the implementation against what the contract promised. The contract said the list virtualizes. The audit verifies it actually does.

How Forge keeps AI honest

Every LLM hallucinates. Opus 4.7 in particular has hallucinated on me a lot, especially around API shape. The mechanisms below exist because the alternative is shipping invented fields, made-up vocabulary, and confidently wrong claims. They run across the cycle, not at one stage.

Explore-first source grounding. Before any agent drafts a field, criterion, or endpoint, it dispatches an Explore subagent. The subagent reads the API spec and engineering docs with line citations. Drafting from memory isn’t allowed. Ambiguity in the source gets reported back, not filled in. This is the single most explicit anti-hallucination mechanism in the framework.
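What “with line citations” means in practice is that every drafted claim carries a pointer back into the source. A sketch of the shape, hypothetical rather than the framework’s actual schema:

// Hypothetical sketch: a claim with no citation is a claim drafted from memory, which isn't allowed.
interface SourceCitation {
  file: string;               // e.g. "docs/api-spec.md"
  lines: [number, number];    // the span the Explore subagent actually read
  quote: string;              // the exact text the claim rests on
}

interface DraftedClaim {
  claim: string;              // a field, criterion, or endpoint description
  citations: SourceCitation[];
  ambiguity?: string;         // ambiguity in the source gets reported back, never filled in
}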

Spec fidelity. Even with a well-written API spec, frontend and backend negotiate constantly. Sometimes the shape we need isn’t the shape published. To keep up with a moving spec, the API spec wins over the reference prototype. Conflicts get flagged, not papered over. Every field in the data contract carries a provenance tag from a four-tier ladder. Confirmed means the spec publishes the schema. Documented means prose-only. Composed means derived from spec fields. Proposed means FE-invented and tagged for the upstream team: what the frontend needs but the backend hasn’t agreed to yet.
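The ladder maps naturally onto a discriminated union, which is roughly how I think about it. Hypothetical field names:

// Hypothetical sketch of the provenance tag every data-contract field carries.
type Provenance =
  | { tier: "confirmed"; schemaRef: string }               // the spec publishes the schema
  | { tier: "documented"; proseRef: string }               // described in prose only
  | { tier: "composed"; derivedFrom: string[] }            // derived from confirmed spec fields
  | { tier: "proposed"; requestedBy: "FE"; note: string }; // FE-invented, flagged for the upstream team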

Amendment log. Section 9 is a running history of corrections, hallucinations Codex caught, vocabulary fixes, and scope realignments. Required reading on every resume. The contract describes the current state. The log describes how you got there. Skipping it is how prior corrections quietly repeat. Honestly, I should have structured this section better up front — it grows long over time and gets harder to scan. A future iteration probably needs categories or timestamps so the signal doesn’t drown in the chronology.

Codex review at phase boundaries. Codex runs adversarially at four cycle boundaries: after Round 1, after Round 2, before the contract gate (H2), and before final review (H3). Plus every code PR before merge. Codex has been my unsung hero. I run it through Claude, but it’s become my trusted partner — catching Claude’s hallucinations and giving me sound system-design thinking when I need a second brain.

Calibrating the framework

After every screen, Yuki writes a report card. Which agent caught real issues. Which agent surfaced noise. What slipped past each review. Where rework piled up. The point is calibration, not blame.

Every three screens, we check the trend. Are agents getting more reliable, or drifting overconfident? Are issues caught earlier? Is shipping with Forge actually better than shipping without it?

Where this falls short

I’d rather be honest about this than not.

Contracts get too big. Ours is one document at 1,066 lines, which is more than any reviewer can hold in their head at once. It also runs up the token bill: loading the whole thing into seven agents across multiple rounds adds up, and Claude Code caches automatically but doesn’t expose explicit cache breakpoints the way the SDK does. The next iteration splits the contract into per-surface mini-contracts that compose into a screen-level index — decomposition is the more reliable lever.
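One possible shape for that split, sketched with hypothetical names:

// Hypothetical sketch: per-surface mini-contracts, composed by a screen-level index.
interface MiniContract {
  surface: string;                   // e.g. "binding-table"
  riskProfile: "standard" | "high-polish" | "data-sensitive" | "scale-sensitive" | "full";
  sections: Record<string, string>;  // data contract, criteria, constraints, scoped to this surface
  amendments: string[];              // local history stays local, so it stays scannable
}

interface ScreenIndex {
  screen: string;
  scenarioIds: string[];             // the shared gravity every mini-contract points back to
  surfacePaths: string[];            // where the mini-contracts live
  crossSurfaceNotes: string[];       // the couplings the Integration Reviewer watches
}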

Evaluation has to evolve. UI quality still depends on a human reading the screen. AI coding remains uneven at visual design, and there’s no automated UI eval I trust yet. Take animation timing — whether an element uses ease-out at 200ms or ease-in at 350ms is documented and measurable, but no agent in the cycle currently verifies it lands right. The same gap shows up every time we improve a surface: no automated way to grade whether the new version is actually better. Forge grades itself in trends, not proof. Three open problems on my list.

Forge isn’t autonomy. It’s discipline. Like any discipline, it has blind spots. The honest version of this article includes the ones I haven’t solved.

Engineering, by any name

There’s no shortage of labels these days with the rise of AI: design engineer, product engineer, GTM engineer, growth engineer. We’ve debated a lot about which roles are going away and who owns which lines of work. In my humble opinion, the missing piece is bringing the same engineering discipline to whatever line of work you’re doing. Forge is one shape of that discipline, for someone who wants her product and design work to ship to production. The shape will look different in your context, but the move is the same: build the harness that gets you to that discipline.


Written by me via Wispr Flow. Reviewed and polished by Yuki (Claude Code).