Specification-Driven Development
Spec is source of truth. Generate and verify artifacts from it.
The Problem
AI can write code faster than humans. But if you skip executable checks, AI tends to drift from what you actually need.
Core principle:
Spec → executable checks → code → review against spec
The spec is the source of truth. Use AI to generate and verify artifacts from it (tests, API contracts, implementation)—not to invent requirements.
If the spec does not define failure behavior, AI will guess.
Vague specs produce unpredictable code. Precise specs produce correct code. The effort shifts from writing code to writing specifications—which is where the thinking should happen anyway.
Current Options
| Option | Pros | Cons |
|---|---|---|
| Contract-First (OpenAPI, Zod, Proto): generate code from schema, not from prose alone | Executable, unambiguous contract; frontend and backend stay in sync | Upfront schema work; contract changes must be managed for compatibility |
| Test-First (TDD with AI): write failing tests, then let AI implement | Tests judge correctness; AI iterates until green | Untested behavior goes unimplemented; tests check correctness, not efficiency |
| Prose-Only Specifications: natural language requirements without executable checks | Fast to write; no tooling required | Vague specs let AI invent behavior; frontend/backend drift without a contract |
Future Outlook
Specifications become the durable artifact. Code becomes disposable.
When AI rewrites implementations in minutes, tests are what survives.
The spec encodes intent. The tests encode behavior. The implementation is just one possible realization—replaceable if the tests still pass.
The human role shifts from writing code to:
- Defining interfaces and contracts
- Writing acceptance criteria
- Reviewing AI output against specs
- Handling ambiguity and edge cases
The key insight: Good specs are hard. Good code is cheap. Invest accordingly.
Our Decision
✓ Why we chose this
- Executable truth: Tests pass or fail. No ambiguity between human intent and AI interpretation.
- Fearless iteration: AI tries ten approaches while a human tries one. Tests judge correctness.
- Verification at every step: Run tests after every change. AI iterates until green.
- Durable artifacts: Specs and tests survive implementation rewrites. Code is disposable.
× Trade-offs we accept
- Spec quality = code quality: Garbage specs produce garbage code. No shortcut.
- Gaps are invisible: If the test does not check for it, AI will not implement it.
- Performance blind spots: Tests check correctness, not efficiency. Add performance tests explicitly.
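A performance gate can itself be an explicit test. A minimal sketch, assuming a hypothetical `processBatch` function and an illustrative 500 ms budget (neither is from the spec):

```typescript
// Hypothetical function under test: sums n items in one pass.
function processBatch(items: number[]): number {
  return items.reduce((sum, x) => sum + x, 0);
}

// Explicit performance check: correctness tests alone would never catch
// an accidental O(n^2) rewrite of this function.
const items = Array.from({ length: 100_000 }, (_, i) => i);
const start = Date.now();
const total = processBatch(items);
const elapsedMs = Date.now() - start;

if (total !== (items.length * (items.length - 1)) / 2) throw new Error("wrong sum");
if (elapsedMs > 500) throw new Error(`too slow: ${elapsedMs} ms`);
```

The budget is arbitrary here; the point is that an efficiency requirement only constrains AI output once it is encoded as a failing check.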
Motivation
What a good spec should contain:
- Goal and non-goals — What we're building, what we're not
- Inputs/outputs — Data shapes, types, examples
- Invariants — What must always be true
- Error cases — How failures are handled
- Example requests/responses — Concrete, not abstract
- Acceptance tests — Plain English is fine, but precise
- Performance/security constraints — If relevant
If the spec does not define it, AI will guess.
The right order for generating artifacts:
- Spec → contract/schema (OpenAPI, Zod, Proto)
- Contract/schema → generated types/stubs
- Spec examples → contract tests
- Invariants → unit/property tests
- Implementation
Generate code from schema, not from prose alone.
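Zod or generated OpenAPI validators give you this out of the box; as a dependency-free sketch of the same idea, here is the spec's `PaymentRequest` shape expressed as an executable check rather than prose (the function name is illustrative):

```typescript
// Contract for processPayment's input. With Zod this would be
// z.object({ amount: z.number().int().positive(), ... }); this is a
// hand-rolled sketch of the same contract.
interface PaymentRequest {
  amount: number; // integer cents, minimum 1
  currency: "USD" | "EUR" | "GBP";
  token: string;
}

function parsePaymentRequest(input: unknown): PaymentRequest {
  const { amount, currency, token } =
    (input ?? {}) as { amount?: unknown; currency?: unknown; token?: unknown };
  if (typeof amount !== "number" || !Number.isInteger(amount) || amount < 1)
    throw new Error("amount must be a positive integer (cents)");
  if (currency !== "USD" && currency !== "EUR" && currency !== "GBP")
    throw new Error("unsupported currency");
  if (typeof token !== "string" || token.length === 0)
    throw new Error("token required");
  return { amount, currency, token };
}

// The contract rejects what the prose only implies: fractional amounts.
parsePaymentRequest({ amount: 1000, currency: "USD", token: "tok_visa" }); // ok
// parsePaymentRequest({ amount: 10.5, currency: "USD", token: "t" });     // throws
```

Whether hand-rolled or generated, the value is the same: the "all amounts in cents" invariant becomes a check AI cannot silently skip.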
How to prompt AI effectively:
Ask for constrained outputs, not open-ended implementation.
Good prompts:
- "Read this spec and list ambiguities before coding."
- "Generate failing tests from these acceptance criteria only."
- "Create Zod/OpenAPI schema from the spec examples."
- "Implement minimal code to satisfy tests; do not refactor unrelated files."
- "Review this diff against the spec and list mismatches."
This keeps AI acting like a deterministic worker, not a creative collaborator.
Recommendation
The Practical Workflow
- Write a small, testable spec.
- Turn spec into acceptance criteria and API/schema contracts.
- Ask AI to generate failing tests first.
- Ask AI to scaffold API/types from the contract.
- Implement the smallest vertical slice to make tests pass.
- Run lint/typecheck/tests in CI.
- Review diff against the spec (not just "does code look fine").
- Merge quickly; repeat in small increments.
Branching Strategy
Best default: short-lived branches or stacked changes, not long-lived feature branches.
- One branch/change per spec slice.
- Keep slices small — ~300 lines when possible.
- Rebase/merge from trunk frequently.
- Use stacked PRs for larger features.
- Avoid concurrent edits to the same files when multiple agents work.
If you use git, use short feature branches. If you use jj, stacked changes and automatic rebasing make this easier.
Quality Gates (Must-Have)
Before merge:
- ✓ Lint
- ✓ Typecheck
- ✓ Unit tests
- ✓ Contract tests
- ✓ Integration tests (for changed flows)
For API changes, add compatibility checks (breaking/non-breaking).
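The gate ordering itself can be expressed as code. A minimal sketch (the gate names mirror the checklist above; the results map stands in for actually running the commands in CI):

```typescript
// Ordered quality gates; merge is blocked at the first failure.
const gates = ["lint", "typecheck", "unit", "contract", "integration"] as const;
type Gate = (typeof gates)[number];

// Given pass/fail results from CI, return the first gate that blocks
// the merge, or null when everything is green.
function firstFailingGate(results: Record<Gate, boolean>): Gate | null {
  for (const gate of gates) {
    if (!results[gate]) return gate; // fail fast: later gates are moot
  }
  return null;
}
```

Running cheap gates (lint, typecheck) before expensive ones (integration) keeps the AI iteration loop fast.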
Common Failure Modes
| Failure | Result |
|---|---|
| Spec too vague | AI invents behavior |
| Spec too large | Slow progress, large diffs, hard reviews |
| No contract schema | Frontend/backend drift |
| Multiple agents editing same files | Merge conflict churn |
| Long-lived feature branches | Integration pain |
| No reviewer | Subtle regressions |
For multi-agent orchestration patterns, see Collaborating Agents.
Examples
```markdown
# Payment Service Specification

## Goal
Process payments and refunds via Stripe.

## Non-Goals
- Subscription management (separate service)
- Invoice generation

## Interface
- processPayment(amount, currency, token) → PaymentResult
- refundPayment(transactionId, amount?) → RefundResult

## Invariants
- No payment without valid token
- Refund cannot exceed original amount
- All amounts in cents (no decimals)

## Error Cases
- INSUFFICIENT_FUNDS → decline, no charge
- CARD_DECLINED → decline, log reason
- DUPLICATE_IDEMPOTENCY → return original result

## Acceptance Tests
1. Valid card with funds → success + transactionId
2. Insufficient funds → INSUFFICIENT_FUNDS error
3. Duplicate request → same result, no double charge
4. Partial refund → success + remaining balance
5. Refund > original → REFUND_EXCEEDS_ORIGINAL error
```

A good specification includes: goal/non-goals, interface, invariants, error cases, and acceptance tests. If the spec does not define failure behavior, AI will guess.
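The spec's refund invariants translate directly into an executable check. A sketch, assuming a hypothetical helper that is not part of the spec's interface:

```typescript
// Invariants from the spec: refund cannot exceed original amount, and
// all amounts are integer cents. A pure helper like this lets the
// invariant be unit- and property-tested independently of Stripe.
function assertRefundable(originalCents: number, refundCents: number): void {
  if (!Number.isInteger(originalCents) || !Number.isInteger(refundCents))
    throw new Error("amounts must be integer cents");
  if (refundCents <= 0) throw new Error("refund must be positive");
  if (refundCents > originalCents) throw new Error("REFUND_EXCEEDS_ORIGINAL");
}

assertRefundable(1000, 400);  // partial refund: ok
// assertRefundable(1000, 2000); // throws REFUND_EXCEEDS_ORIGINAL
```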
```yaml
openapi: 3.0.0
info:
  title: Payment API
  version: 1.0.0
paths:
  /payments:
    post:
      operationId: processPayment
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/PaymentRequest'
      responses:
        '200':
          description: Payment processed
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PaymentResult'
        '402':
          description: Payment declined
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PaymentError'
components:
  schemas:
    PaymentRequest:
      type: object
      required: [amount, currency, token]
      properties:
        amount:
          type: integer
          minimum: 1
        currency:
          type: string
          enum: [USD, EUR, GBP]
        token:
          type: string
    PaymentResult:
      type: object
      properties:
        transactionId:
          type: string
        status:
          type: string
          enum: [success]
    PaymentError:
      type: object
      properties:
        code:
          type: string
          enum: [INSUFFICIENT_FUNDS, CARD_DECLINED, INVALID_TOKEN]
        message:
          type: string
```

Generate code from schema, not from prose alone. OpenAPI, Zod, Proto—the contract is executable and unambiguous. Frontend and backend stay in sync.
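Generated types for this contract might look like the following. This is hand-written to illustrate the shape; a real pipeline would emit the types from the YAML with a generator so the frontend cannot drift from the contract:

```typescript
// Types mirroring the OpenAPI components above. In a real pipeline these
// are generated, not hand-maintained.
type Currency = "USD" | "EUR" | "GBP";

interface PaymentRequest {
  amount: number; // integer, minimum 1 (cents)
  currency: Currency;
  token: string;
}

interface PaymentResult {
  transactionId: string;
  status: "success";
}

interface PaymentError {
  code: "INSUFFICIENT_FUNDS" | "CARD_DECLINED" | "INVALID_TOKEN";
  message: string;
}

// Narrowing helper: a 402 response body is a PaymentError, not a PaymentResult.
function isPaymentError(body: PaymentResult | PaymentError): body is PaymentError {
  return "code" in body;
}
```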
```typescript
import { describe, it, expect } from 'vitest';
import { processPayment, refundPayment } from './payment';

describe('Payment Contract Tests', () => {
  it('processes valid payment', async () => {
    const result = await processPayment({
      amount: 1000,
      currency: 'USD',
      token: 'tok_visa',
    });
    expect(result.status).toBe('success');
    expect(result.transactionId).toBeDefined();
  });

  it('rejects insufficient funds', async () => {
    await expect(
      processPayment({
        amount: 1000,
        currency: 'USD',
        token: 'tok_insufficient_funds',
      })
    ).rejects.toMatchObject({
      code: 'INSUFFICIENT_FUNDS',
    });
  });

  it('returns same result for duplicate request', async () => {
    const idempotencyKey = 'test-123';
    const first = await processPayment({
      amount: 1000,
      currency: 'USD',
      token: 'tok_visa',
      idempotencyKey,
    });
    const second = await processPayment({
      amount: 1000,
      currency: 'USD',
      token: 'tok_visa',
      idempotencyKey,
    });
    expect(second.transactionId).toBe(first.transactionId);
  });

  it('rejects refund exceeding original', async () => {
    const payment = await processPayment({
      amount: 1000,
      currency: 'USD',
      token: 'tok_visa',
    });
    await expect(
      refundPayment(payment.transactionId, 2000)
    ).rejects.toMatchObject({
      code: 'REFUND_EXCEEDS_ORIGINAL',
    });
  });
});
```

Tests generated from spec acceptance criteria. These are the judge. AI implements until they pass. If a behavior is not tested, AI will not implement it.
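As one possible realization of the duplicate-request test, here is a minimal in-memory idempotency sketch. It covers only the idempotency behavior; validation and the actual charge are omitted, and a real service would persist keys rather than hold them in a Map:

```typescript
interface PaymentResult {
  transactionId: string;
  status: "success";
}

// In-memory idempotency store: same key returns the original result
// instead of charging twice.
const seen = new Map<string, PaymentResult>();
let nextId = 1;

function processPayment(req: {
  amount: number;
  currency: string;
  token: string;
  idempotencyKey?: string;
}): PaymentResult {
  if (req.idempotencyKey) {
    const prior = seen.get(req.idempotencyKey);
    if (prior) return prior; // duplicate: return original, no double charge
  }
  const result: PaymentResult = { transactionId: `txn_${nextId++}`, status: "success" };
  if (req.idempotencyKey) seen.set(req.idempotencyKey, result);
  return result;
}

const a = processPayment({ amount: 1000, currency: "USD", token: "tok_visa", idempotencyKey: "test-123" });
const b = processPayment({ amount: 1000, currency: "USD", token: "tok_visa", idempotencyKey: "test-123" });
// a.transactionId === b.transactionId: the duplicate returned the original result
```

This implementation is disposable in exactly the sense the article describes: any rewrite that keeps the contract tests green is an acceptable replacement.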