
The Pipeline

PRD in. Production microservice out.

Backend microservices, verified, scored, and deployable. Multiple agents per stage. Every output checked before the next stage starts. Runs on your machine, uses your tools.

You control the stack

Mix and match models and CLI tools per task tier. Use Opus for the hard stuff, Gemini for the routine tasks, Haiku for scaffolding. Your config, your cost profile.

# constraints.yaml
model_tiers:
  fast: gemini-2.5-flash        # scaffolding, formatting, simple tasks
  default: claude-sonnet-4      # routes, services, test files
  critical: claude-opus-4       # security, auth, payment logic

execution:
  mode: cli-connected         # uses your local CLI directly
  cli: claude-code            # or: codex, gemini-cli

timeouts:
  scaffolding: 300
  testing: 900
  documentation: 600

Multi-model

Different model per task tier. Opus where it matters, Flash where it doesn't.

Multi-CLI

Claude Code, Codex, Gemini CLI. Use whichever subscription you have.

Your cost profile

Flat-rate subscription or per-token API keys. Your choice per run.

Plugs into your workflow

Shipwright connects to your existing tools via MCP (Model Context Protocol). Pull tickets, push results, get notified. No changes to how your team already works.

GitHub

Pull issues as PRDs. Create PRs from completed runs. Readiness score as a status check.

Jira

Pull tickets as input. Push readiness scores, cost breakdowns, and audit results back to the issue.

Linear

Sync tickets and cycles. Auto-update status as the pipeline progresses through stages.

Slack

Run notifications, approval requests, and readiness reports delivered to your channel.

CI / CD

Trigger runs from your pipeline. Use the readiness score as a deployment gate.

Any MCP Server

The pipeline speaks MCP natively. If your tool has an MCP server, it plugs in.

MCP is an open protocol, so there is no vendor lock-in.
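As a sketch, a tool connection could be declared alongside the model config. The `integrations` key and every field below are illustrative, not documented Shipwright syntax; the GitHub server package and its token variable are from the MCP reference servers.

```yaml
# constraints.yaml (hypothetical extension; keys are illustrative)
integrations:
  - name: github
    transport: stdio                  # spawn the MCP server as a subprocess
    command: npx
    args: ["-y", "@modelcontextprotocol/server-github"]
    env:
      GITHUB_PERSONAL_ACCESS_TOKEN: ${GITHUB_PERSONAL_ACCESS_TOKEN}
  - name: jira
    transport: sse                    # or connect to a remote MCP server
    url: https://mcp.example.com/jira
```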

01 Input

Describe what you need

Start from a PRD, a pack template, or a plain description. Point it at an existing codebase and it adapts to your stack and conventions.

  • Pack templates for common patterns (auth, analytics, payments, file upload)
  • Constraints file sets stack, framework, test runner, linting rules
  • Existing codebase detection. Reads your project and works within it

02 Discovery

Specialist agents research your domain

Not a single prompt. Multiple agents with different roles each produce a specific artifact. Outputs feed forward: architecture informs test plans, threat models inform security checks. Live API docs pulled and cached for every integration.

  • Requirements, architecture, test plan, security model. Each from a dedicated agent
  • Artifacts are interdependent. Later agents build on earlier agents' output
  • Real API docs fetched and cached for every integration you use
  • All artifacts validated at a gate before planning starts

03 Plan

Task DAG with dependency layers and validation gates

Tickets generated from discovery artifacts, not from a single prompt. Dependencies mapped across layers. The plan is validated before execution begins.

  • Dozens of tickets depending on service complexity
  • Multiple dependency layers, so scaffolding runs before routes before tests
  • Structural validation: cycle prevention, duplicate detection, coverage checks
  • Plan is locked after validation. Agents cannot drift from it
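The layering above can be sketched with Kahn's algorithm: tickets with no unmet dependencies form the first layer, each layer unlocks the next, and any ticket left unplaced means the graph has a cycle. This is a generic sketch, not Shipwright's actual planner.

```typescript
type Ticket = { id: string; deps: string[] };

// Group tickets into dependency layers; throw if the DAG contains a cycle.
function planLayers(tickets: Ticket[]): string[][] {
  const indegree = new Map<string, number>();
  const dependents = new Map<string, string[]>();
  for (const t of tickets) {
    indegree.set(t.id, t.deps.length);
    for (const d of t.deps) {
      dependents.set(d, [...(dependents.get(d) ?? []), t.id]);
    }
  }
  const layers: string[][] = [];
  let frontier = tickets.filter((t) => t.deps.length === 0).map((t) => t.id);
  let placed = 0;
  while (frontier.length > 0) {
    layers.push(frontier);
    placed += frontier.length;
    const next: string[] = [];
    for (const id of frontier) {
      for (const dep of dependents.get(id) ?? []) {
        const n = (indegree.get(dep) ?? 0) - 1;
        indegree.set(dep, n);
        if (n === 0) next.push(dep);
      }
    }
    frontier = next;
  }
  if (placed !== tickets.length) throw new Error("cycle detected in task DAG");
  return layers;
}
```

With scaffold → routes → tests tickets, this yields three layers in that order, matching the "scaffolding runs before routes before tests" ordering above.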

04 Execution

Each ticket gets its own agent with assembled context

The orchestrator builds each agent's context from previous agents' outputs: relevant spec sections, domain docs, and dependency files. No shared 200k-token window. Each agent sees only what it needs to complete its task.

  • Model tier per task complexity. Critical tasks get Opus, routine tasks get Haiku
  • Parallel execution where the DAG allows
  • Configurable timeouts per task type
  • Protected outputs. Verified files cannot be overwritten by later tasks

05 Verification

Every ticket verified before the next one starts

Not just "does it compile." Does it boot. Does it respond. Does the test actually test something.

  • tsc --noEmit, ESLint, Prettier
  • Test runner (vitest/jest) must exit 0
  • Docker build + health check within 10 seconds
  • Wiring check: are new routes and plugins actually registered?
  • Failed verification triggers retry with error context, not a blind retry
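As an illustration, that gate could be expressed per project in the same constraints file. Every key below is hypothetical; only the tool commands (`tsc --noEmit`, ESLint, Prettier, vitest) are real.

```yaml
# hypothetical verification config (illustrative keys)
verification:
  static:
    - tsc --noEmit
    - eslint .
    - prettier --check .
  tests:
    runner: vitest            # or jest; must exit 0
  runtime:
    docker_build: true
    health_check:
      path: /health
      timeout_seconds: 10     # container must respond within 10s
  wiring:
    require_registered_routes: true
  on_failure:
    retry_with_error_context: true
    max_retries: 2
```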

06 Audit

47 binary criteria, pass or fail

No subjective scoring. Every criterion maps to an industry standard or a specific failure we observed in real runs. You get a checklist, not a vibe.

  • Functional completeness (8): route registration, config fail-fast, dependency completeness
  • Code quality (7): no dead code, lock files, pinned versions, CI pipeline
  • Security (10): no secrets in code, auth on endpoints, input validation, CORS, headers
  • Testing (8): integration tests exist, assertion depth, error path coverage
  • Observability (6): health checks, structured logging, graceful shutdown, metrics
  • Deployment (8): Dockerfile, multi-stage build, non-root container, entry point consistency

07 Self-Correction

Audit failures become corrective tickets

The engine generates corrective tickets from audit failures. Same execution, verification, and audit loop. Capped to prevent infinite cycles.

  • Corrections respect protected outputs from the main plan
  • Capped to prevent infinite cycles
  • Each correction re-verified before moving on
  • Patterns learned, so the same failures do not repeat across runs
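The capped loop described above can be sketched generically: audit, fix each failure, repeat until clean or the cap is hit. The function and its parameters are invented for illustration, not Shipwright internals.

```typescript
type Failure = { criterion: string };

// Run audit → corrective fixes → re-audit, bounded by maxCycles.
function selfCorrect(
  audit: () => Failure[],            // run the binary-criteria audit
  fix: (f: Failure) => void,         // generate + execute a corrective ticket
  maxCycles = 3,                     // cap prevents infinite correction cycles
): Failure[] {
  for (let cycle = 0; cycle < maxCycles; cycle++) {
    const failures = audit();
    if (failures.length === 0) return [];
    for (const f of failures) fix(f); // each fix is itself re-verified
  }
  return audit();                    // whatever still fails after the cap
}
```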

08 Output

Scored, traced, deployable

Every decision documented. Readiness score with per-category breakdown. Full cost tracking. Patterns cached for future runs.

  • Readiness report: 47 criteria with pass/fail + evidence per item
  • Test results, security audit, coverage metrics
  • Run cost breakdown by task and model tier
  • Doc cache updated so API shapes are learned for the next run
  • Typical: 2-4 hours, 300+ tests, 45+/47 criteria passing

Estimate before you run

You do not commit blind. Shipwright runs discovery and planning first, then shows you the task count, model distribution, and estimated cost before execution starts. You confirm or walk away.

$ shipwright plan --prd ./my-service.md

Discovery    complete, 14 artifacts
Planning     42 tasks across 5 layers

Estimated cost:
  CLI-Connected  $0  (uses your subscription)
  API keys      ~$12  (8 Opus + 22 Sonnet + 12 Haiku)

Proceed? [y/N]
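The estimate is simple arithmetic over per-task rates by tier. The rates below are invented to make the sample plan's numbers work out (roughly $1.00 per Opus task, $0.15 per Sonnet, $0.05 per Haiku); Shipwright's real cost model is not documented here.

```typescript
// Illustrative per-task rates in USD; not Shipwright's real numbers.
const RATE: Record<string, number> = { opus: 1.0, sonnet: 0.15, haiku: 0.05 };

// Estimate API-key cost from a plan's model distribution.
function estimateCost(tasksByTier: Record<string, number>): number {
  let total = 0;
  for (const [tier, count] of Object.entries(tasksByTier)) {
    total += (RATE[tier] ?? 0) * count;
  }
  return Math.round(total);
}
```

With the sample plan's distribution (8 Opus, 22 Sonnet, 12 Haiku), these assumed rates give 8.0 + 3.3 + 0.6 ≈ $12, matching the printed estimate.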

What it costs

CLI-Connected (recommended)

$0

Shipwright spawns your local CLI as a subprocess. You pay nothing extra beyond your existing subscription.

Claude Code Max, $20/mo or $200/mo

Codex Pro, included with ChatGPT Pro

Gemini CLI, free with Google account

API Keys (BYOK)

$7–20

Direct API access. Full control over model selection per task tier.

Simple service, ~$7 (25-30 tasks)

Standard service, ~$12 (35-50 tasks)

Complex service, ~$20 (50-70 tasks)

Free while in beta. We charge when the product earns it.

What Shipwright builds today

Backend microservices

APIs, integrations, background workers, webhooks. TypeScript and Node today, more runtimes coming. Every service deployed with tests, security headers, health checks, and structured logging out of the box.

Not UI. Deliberately.

Frontend code is subjective and hard to verify programmatically. Backend services have clear, measurable quality criteria: routes registered, tests passing, containers booting, security headers present. That's where automated verification actually works, so that's where we started.

Free while in beta. Works with your existing CLI subscription.

Join the Beta