Specification-Driven Development
Spec is source of truth. Generate and verify artifacts from it.
The Problem
AI can write code faster than humans. But if you skip executable checks, AI tends to drift from what you actually need.
Core principle:
Spec → executable checks → code → review against spec
The spec is the source of truth. Use AI to generate and verify artifacts from it (tests, API contracts, implementation)—not to invent requirements.
If the spec does not define failure behavior, AI will guess.
Vague specs produce unpredictable code. Precise specs produce correct code. The effort shifts from writing code to writing specifications—which is where the thinking should happen anyway.
Current Options
| Option | Pros | Cons |
|---|---|---|
| Contract-First (OpenAPI, Zod, Proto): generate code from schema, not from prose alone | Executable, unambiguous contract; frontend and backend stay in sync | Upfront schema work; contract changes must be managed for compatibility |
| Test-First (TDD with AI): write failing tests, then let AI implement | Tests judge correctness; AI iterates until green | Untested behavior goes unimplemented; tests check correctness, not efficiency |
| Prose-Only Specifications: natural language requirements without executable checks | Fast to write; no tooling required | Vague specs let AI invent behavior; frontend/backend drift without a contract |
Future Outlook
Specifications become the durable artifact. Code becomes disposable.
When AI rewrites implementations in minutes, tests are what survives.
The spec encodes intent. The tests encode behavior. The implementation is just one possible realization—replaceable if the tests still pass.
The human role shifts from writing code to:
- Defining interfaces and contracts
- Writing acceptance criteria
- Reviewing AI output against specs
- Handling ambiguity and edge cases
The key insight: Good specs are hard. Good code is cheap. Invest accordingly.
Our Decision
✓ Why we chose this
- Executable truth: Tests pass or fail. No ambiguity between human intent and AI interpretation.
- Fearless iteration: AI tries ten approaches while a human tries one. Tests judge correctness.
- Verification at every step: Run tests after every change. AI iterates until green.
- Durable artifacts: Specs and tests survive implementation rewrites. Code is disposable.
× Trade-offs we accept
- Spec quality = code quality: Garbage specs produce garbage code. No shortcut.
- Gaps are invisible: If the test does not check for it, AI will not implement it.
- Performance blind spots: Tests check correctness, not efficiency. Add performance tests explicitly.
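A performance gate can itself be an explicit test. A minimal sketch, assuming a hypothetical `processBatch` function and an illustrative 500 ms budget (neither is from the spec):

```typescript
// Hypothetical function under test: sums n items in one pass.
function processBatch(items: number[]): number {
  return items.reduce((sum, x) => sum + x, 0);
}

// Explicit performance check: correctness tests alone would never catch
// an accidental O(n^2) rewrite of this function.
const items = Array.from({ length: 100_000 }, (_, i) => i);
const start = Date.now();
const total = processBatch(items);
const elapsedMs = Date.now() - start;

if (total !== (items.length * (items.length - 1)) / 2) throw new Error("wrong sum");
if (elapsedMs > 500) throw new Error(`too slow: ${elapsedMs} ms`);
```

The budget is arbitrary here; the point is that an efficiency requirement only constrains AI output once it is encoded as a failing check.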
Motivation
What a good spec should contain:
- Goal and non-goals — What we're building, what we're not
- Inputs/outputs — Data shapes, types, examples
- Invariants — What must always be true
- Error cases — How failures are handled
- Example requests/responses — Concrete, not abstract
- Acceptance tests — Plain English is fine, but precise
- Performance/security constraints — If relevant
If the spec does not define it, AI will guess.
The right order for generating artifacts:
- Spec → contract/schema (OpenAPI, Zod, Proto)
- Contract/schema → generated types/stubs
- Spec examples → contract tests
- Invariants → unit/property tests
- Implementation
Generate code from schema, not from prose alone.
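Zod or generated OpenAPI validators give you this out of the box; as a dependency-free sketch of the same idea, here is the spec's `PaymentRequest` shape expressed as an executable check rather than prose (the function name is illustrative):

```typescript
// Contract for processPayment's input. With Zod this would be
// z.object({ amount: z.number().int().positive(), ... }); this is a
// hand-rolled sketch of the same contract.
interface PaymentRequest {
  amount: number; // integer cents, minimum 1
  currency: "USD" | "EUR" | "GBP";
  token: string;
}

function parsePaymentRequest(input: unknown): PaymentRequest {
  const { amount, currency, token } =
    (input ?? {}) as { amount?: unknown; currency?: unknown; token?: unknown };
  if (typeof amount !== "number" || !Number.isInteger(amount) || amount < 1)
    throw new Error("amount must be a positive integer (cents)");
  if (currency !== "USD" && currency !== "EUR" && currency !== "GBP")
    throw new Error("unsupported currency");
  if (typeof token !== "string" || token.length === 0)
    throw new Error("token required");
  return { amount, currency, token };
}

// The contract rejects what the prose only implies: fractional amounts.
parsePaymentRequest({ amount: 1000, currency: "USD", token: "tok_visa" }); // ok
// parsePaymentRequest({ amount: 10.5, currency: "USD", token: "t" });     // throws
```

Whether hand-rolled or generated, the value is the same: the "all amounts in cents" invariant becomes a check AI cannot silently skip.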
How to prompt AI effectively:
Ask for constrained outputs, not open-ended implementation.
Good prompts:
- "Read this spec and list ambiguities before coding."
- "Generate failing tests from these acceptance criteria only."
- "Create Zod/OpenAPI schema from the spec examples."
- "Implement minimal code to satisfy tests; do not refactor unrelated files."
- "Review this diff against the spec and list mismatches."
This keeps AI acting like a deterministic worker, not a creative collaborator.
Recommendation
The Practical Workflow
- Write a small, testable spec.
- Turn spec into acceptance criteria and API/schema contracts.
- Ask AI to generate failing tests first.
- Ask AI to scaffold API/types from the contract.
- Implement the smallest vertical slice to make tests pass.
- Run lint/typecheck/tests in CI.
- Review diff against the spec (not just "does code look fine").
- Merge quickly; repeat in small increments.
Branching Strategy
Best default: short-lived branches or stacked changes, not long-lived feature branches.
- One branch/change per spec slice.
- Keep slices small — ~300 lines when possible.
- Rebase/merge from trunk frequently.
- Use stacked PRs for larger features.
- Avoid concurrent edits to the same files when multiple agents work.
If you use git, use short feature branches. If you use jj, stacked changes and automatic rebasing make this easier.
Quality Gates (Must-Have)
Before merge:
- ✓ Lint
- ✓ Typecheck
- ✓ Unit tests
- ✓ Contract tests
- ✓ Integration tests (for changed flows)
For API changes, add compatibility checks (breaking/non-breaking).
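The gate ordering itself can be expressed as code. A minimal sketch (the gate names mirror the checklist above; the results map stands in for actually running the commands in CI):

```typescript
// Ordered quality gates; merge is blocked at the first failure.
const gates = ["lint", "typecheck", "unit", "contract", "integration"] as const;
type Gate = (typeof gates)[number];

// Given pass/fail results from CI, return the first gate that blocks
// the merge, or null when everything is green.
function firstFailingGate(results: Record<Gate, boolean>): Gate | null {
  for (const gate of gates) {
    if (!results[gate]) return gate; // fail fast: later gates are moot
  }
  return null;
}
```

Running cheap gates (lint, typecheck) before expensive ones (integration) keeps the AI iteration loop fast.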
Common Failure Modes
| Failure | Result |
|---|---|
| Spec too vague | AI invents behavior |
| Spec too large | Slow progress, large diffs, hard reviews |
| No contract schema | Frontend/backend drift |
| Multiple agents editing same files | Merge conflict churn |
| Long-lived feature branches | Integration pain |
| No reviewer | Subtle regressions |
For multi-agent orchestration patterns, see Collaborating Agents.
Examples
```markdown
# Payment Service Specification

## Goal
Process payments and refunds via Stripe.

## Non-Goals
- Subscription management (separate service)
- Invoice generation

## Interface
- processPayment(amount, currency, token) → PaymentResult
- refundPayment(transactionId, amount?) → RefundResult

## Invariants
- No payment without valid token
- Refund cannot exceed original amount
- All amounts in cents (no decimals)

## Error Cases
- INSUFFICIENT_FUNDS → decline, no charge
- CARD_DECLINED → decline, log reason
- DUPLICATE_IDEMPOTENCY → return original result

## Acceptance Tests
1. Valid card with funds → success + transactionId
2. Insufficient funds → INSUFFICIENT_FUNDS error
3. Duplicate request → same result, no double charge
4. Partial refund → success + remaining balance
5. Refund > original → REFUND_EXCEEDS_ORIGINAL error
```

A good specification includes: goal/non-goals, interface, invariants, error cases, and acceptance tests. If the spec does not define failure behavior, AI will guess.
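The spec's refund invariants translate directly into an executable check. A sketch, assuming a hypothetical helper that is not part of the spec's interface:

```typescript
// Invariants from the spec: refund cannot exceed original amount, and
// all amounts are integer cents. A pure helper like this lets the
// invariant be unit- and property-tested independently of Stripe.
function assertRefundable(originalCents: number, refundCents: number): void {
  if (!Number.isInteger(originalCents) || !Number.isInteger(refundCents))
    throw new Error("amounts must be integer cents");
  if (refundCents <= 0) throw new Error("refund must be positive");
  if (refundCents > originalCents) throw new Error("REFUND_EXCEEDS_ORIGINAL");
}

assertRefundable(1000, 400);  // partial refund: ok
// assertRefundable(1000, 2000); // throws REFUND_EXCEEDS_ORIGINAL
```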
```yaml
openapi: 3.0.0
info:
  title: Payment API
  version: 1.0.0
paths:
  /payments:
    post:
      operationId: processPayment
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/PaymentRequest'
      responses:
        '200':
          description: Payment processed
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PaymentResult'
        '402':
          description: Payment declined
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PaymentError'
components:
  schemas:
    PaymentRequest:
      type: object
      required: [amount, currency, token]
      properties:
        amount:
          type: integer
          minimum: 1
        currency:
          type: string
          enum: [USD, EUR, GBP]
        token:
          type: string
    PaymentResult:
      type: object
      properties:
        transactionId:
          type: string
        status:
          type: string
          enum: [success]
    PaymentError:
      type: object
      properties:
        code:
          type: string
          enum: [INSUFFICIENT_FUNDS, CARD_DECLINED, INVALID_TOKEN]
        message:
          type: string
```

Generate code from schema, not from prose alone. OpenAPI, Zod, Proto—the contract is executable and unambiguous. Frontend and backend stay in sync.
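Generated types for this contract might look like the following. This is hand-written to illustrate the shape; a real pipeline would emit the types from the YAML with a generator so the frontend cannot drift from the contract:

```typescript
// Types mirroring the OpenAPI components above. In a real pipeline these
// are generated, not hand-maintained.
type Currency = "USD" | "EUR" | "GBP";

interface PaymentRequest {
  amount: number; // integer, minimum 1 (cents)
  currency: Currency;
  token: string;
}

interface PaymentResult {
  transactionId: string;
  status: "success";
}

interface PaymentError {
  code: "INSUFFICIENT_FUNDS" | "CARD_DECLINED" | "INVALID_TOKEN";
  message: string;
}

// Narrowing helper: a 402 response body is a PaymentError, not a PaymentResult.
function isPaymentError(body: PaymentResult | PaymentError): body is PaymentError {
  return "code" in body;
}
```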
```typescript
import { describe, it, expect } from 'vitest';
import { processPayment, refundPayment } from './payment';

describe('Payment Contract Tests', () => {
  it('processes valid payment', async () => {
    const result = await processPayment({
      amount: 1000,
      currency: 'USD',
      token: 'tok_visa',
    });
    expect(result.status).toBe('success');
    expect(result.transactionId).toBeDefined();
  });

  it('rejects insufficient funds', async () => {
    await expect(
      processPayment({
        amount: 1000,
        currency: 'USD',
        token: 'tok_insufficient_funds',
      })
    ).rejects.toMatchObject({
      code: 'INSUFFICIENT_FUNDS',
    });
  });

  it('returns same result for duplicate request', async () => {
    const idempotencyKey = 'test-123';
    const first = await processPayment({
      amount: 1000,
      currency: 'USD',
      token: 'tok_visa',
      idempotencyKey,
    });
    const second = await processPayment({
      amount: 1000,
      currency: 'USD',
      token: 'tok_visa',
      idempotencyKey,
    });
    expect(second.transactionId).toBe(first.transactionId);
  });

  it('rejects refund exceeding original', async () => {
    const payment = await processPayment({
      amount: 1000,
      currency: 'USD',
      token: 'tok_visa',
    });
    await expect(
      refundPayment(payment.transactionId, 2000)
    ).rejects.toMatchObject({
      code: 'REFUND_EXCEEDS_ORIGINAL',
    });
  });
});
```

Tests generated from spec acceptance criteria. These are the judge. AI implements until they pass. If a behavior is not tested, AI will not implement it.
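As one possible realization of the duplicate-request test, here is a minimal in-memory idempotency sketch. It covers only the idempotency behavior; validation and the actual charge are omitted, and a real service would persist keys rather than hold them in a Map:

```typescript
interface PaymentResult {
  transactionId: string;
  status: "success";
}

// In-memory idempotency store: same key returns the original result
// instead of charging twice.
const seen = new Map<string, PaymentResult>();
let nextId = 1;

function processPayment(req: {
  amount: number;
  currency: string;
  token: string;
  idempotencyKey?: string;
}): PaymentResult {
  if (req.idempotencyKey) {
    const prior = seen.get(req.idempotencyKey);
    if (prior) return prior; // duplicate: return original, no double charge
  }
  const result: PaymentResult = { transactionId: `txn_${nextId++}`, status: "success" };
  if (req.idempotencyKey) seen.set(req.idempotencyKey, result);
  return result;
}

const a = processPayment({ amount: 1000, currency: "USD", token: "tok_visa", idempotencyKey: "test-123" });
const b = processPayment({ amount: 1000, currency: "USD", token: "tok_visa", idempotencyKey: "test-123" });
// a.transactionId === b.transactionId: the duplicate returned the original result
```

This implementation is disposable in exactly the sense the article describes: any rewrite that keeps the contract tests green is an acceptable replacement.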