Kenneth Pernyér · 12 min read

Converge-LLM: Why I Built a Contract-Driven LLM Module in Pure Rust

Deterministic reasoning infrastructure for convergent agents

Tags: llm, rust, agents, architecture, contracts

Most LLM integrations start with an API call and a prompt. That works until you try to build an agent framework where multiple models collaborate, decisions must be reproducible, outputs must be constrained, and latency and uncertainty are engineering constraints, not "model vibes".

This article is the story of how converge-llm evolved into a contract-driven, Rust-native LLM subsystem designed for a broader agent framework focused on convergence: collaborating components that align on outcomes through structure, not spectacle.

The short version: a lot of reasoning does not need a giant foundation model. It needs deterministic control surfaces, structured state and tight output contracts so that small and medium models can reliably do real work.

The motivation: collaboration, not charisma

In most LLM setups, you treat the model like a charismatic oracle. You feed it a narrative prompt and hope it returns something useful. That is fine for chat.

But in an agent framework, you want something else:

  • multiple agents (sometimes multiple models) collaborate
  • decisions must be debuggable
  • failures must be attributable
  • outputs must be safe to consume by software
  • latency matters, especially when you chain steps
  • uncertainty is normal and must be handled explicitly

If your agent runtime depends on a single large model behaving perfectly, you have not built an agent framework. You have built a brittle dependency.

So I took a different approach: design the intelligence layer like infrastructure.

The Rust constraint: this is an engineering system

I have a strong bias toward pure Rust for core infrastructure. Not because Rust is fashionable, but because the benefits are concrete:

  • explicit types and invariants
  • early failure instead of late surprises
  • deterministic behavior
  • predictable performance
  • code that can be audited and evolved

This influenced the entire stack:

  • SurrealDB for storage (multi-model, flexible schemas)
  • LanceDB for vector search (fast retrieval on structured embeddings)
  • Polars for compute (feature engineering, metrics, state extraction)
  • Burn for ML (embedded inference, Rust-native control)

This is not "Rust for Rust's sake". It is Rust to keep the system honest.

The key insight: decision quality is mostly about structure

Large models provide broad capability. They also come with:

  • higher latency
  • higher cost (even locally, in compute and memory)
  • more variable behavior
  • a bigger "uncertainty surface"
  • more reliance on prompt tricks

For many agent tasks, you do not need that. You need:

  • structured input state
  • clear task framing
  • explicit output constraints
  • predictable inference envelopes
  • reproducibility for debugging

So the architecture became decision-first.

Behavior emerges from contracts, not training.

This decision-first posture is what makes small and medium models viable in practice.

The evolution: from "LLM wrapper" to "reasoning kernel"

The module did not start as a grand design. It evolved through successive constraints.

Phase 1: separating responsibilities

The first milestone was refusing to lump "LLM integration" into one file.

The module split into clear surfaces:

  • config.rs: configuration as a contract, not convenience
  • tokenizer.rs: tokenization as a correctness surface
  • prompt.rs: structured prompts as a versioned stack
  • inference.rs: deterministic inference envelopes
  • agent.rs: integration with the agent framework

This separation mattered because every later capability (validation, tracing and chain execution) depends on not having a ball of mud.

Phase 2: making it real

Architecture is cheap until you run real inference.

So I introduced the "engine boundary":

  • engine.rs: real inference via llama-burn
  • golden_test.rs: deterministic verification ("same input -> same output")

This is the moment the system became engineering-grade. If you cannot reproduce model behavior, you cannot debug an agent system.
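To make "same input -> same output" concrete, here is a minimal sketch of a golden check. It is an illustration only: fake_generate is a hypothetical stand-in for the real engine (a small seeded xorshift generator, not llama-burn), and GoldenCase is a name invented for this example. The idea is to record the output once and fail loudly when a later run drifts.

```rust
/// Stand-in for a real engine: a tiny seeded xorshift generator.
/// Hypothetical; the actual module runs inference through llama-burn.
fn fake_generate(prompt: &str, seed: u64, max_tokens: usize) -> Vec<u32> {
    // Fold the prompt bytes into the seed so the output depends on both.
    let mut state = seed ^ 0x9E37_79B9_7F4A_7C15;
    for byte in prompt.bytes() {
        state = state.wrapping_mul(31).wrapping_add(byte as u64);
    }
    if state == 0 {
        state = 1; // xorshift never leaves the all-zero state
    }
    (0..max_tokens)
        .map(|_| {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            (state % 32_000) as u32 // pretend vocabulary of 32k token ids
        })
        .collect()
}

/// A recorded input together with the output it must keep producing.
struct GoldenCase {
    prompt: String,
    seed: u64,
    max_tokens: usize,
    expected: Vec<u32>,
}

impl GoldenCase {
    /// Runs the generator once and stores the result as the golden value.
    fn record(prompt: &str, seed: u64, max_tokens: usize) -> Self {
        GoldenCase {
            prompt: prompt.to_string(),
            seed,
            max_tokens,
            expected: fake_generate(prompt, seed, max_tokens),
        }
    }

    /// Re-runs the generator and reports any drift from the golden value.
    fn verify(&self) -> Result<(), String> {
        let actual = fake_generate(&self.prompt, self.seed, self.max_tokens);
        if actual == self.expected {
            Ok(())
        } else {
            Err(format!(
                "golden mismatch for seed {}: expected {:?}, got {:?}",
                self.seed, self.expected, actual
            ))
        }
    }
}
```

In a real setup the recorded value would live in a fixture file next to the test, so a model or tokenizer upgrade shows up as an explicit diff.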

Phase 3: building a real decision chain

Agents rarely do one thing. They chain:

  • infer signals
  • evaluate options
  • produce a plan

So the system gained:

  • DecisionChain and DecisionTrace
  • a ChainExecutor that runs a real topology: Reasoning, then Evaluation, then Planning
  • no retries
  • no self-reflection
  • no tool calling
  • fail-fast behavior
  • full observability

Now it behaved like infrastructure: predictable, inspectable, composable.
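A fail-fast chain of this shape can be sketched in a few lines. This is a simplified illustration, not the module's actual code: the Stage enum, StepRecord, the then builder and the string-in/string-out step signature are all assumptions made for the example. Each stage runs exactly once, the first error ends the run, and every attempted step leaves a record.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Stage {
    Reasoning,
    Evaluation,
    Planning,
}

/// What one executed step saw and produced.
#[derive(Debug)]
struct StepRecord {
    stage: Stage,
    input: String,
    outcome: Result<String, String>,
}

/// Hypothetical step signature: text in, text or error out.
type StepFn = Box<dyn Fn(&str) -> Result<String, String>>;

struct ChainExecutor {
    steps: Vec<(Stage, StepFn)>,
}

impl ChainExecutor {
    fn new() -> Self {
        ChainExecutor { steps: Vec::new() }
    }

    /// Appends a stage to the chain.
    fn then<F>(mut self, stage: Stage, step: F) -> Self
    where
        F: Fn(&str) -> Result<String, String> + 'static,
    {
        let boxed: StepFn = Box::new(step);
        self.steps.push((stage, boxed));
        self
    }

    /// Runs each stage once, in order. No retries: the first error ends the run.
    fn run(&self, initial: &str) -> Vec<StepRecord> {
        let mut records = Vec::new();
        let mut current = initial.to_string();
        for (stage, step) in &self.steps {
            let outcome = step(&current);
            let next = outcome.clone().ok();
            records.push(StepRecord {
                stage: *stage,
                input: current.clone(),
                outcome,
            });
            match next {
                Some(output) => current = output, // feed the next stage
                None => break,                    // fail fast
            }
        }
        records
    }
}
```

If Evaluation fails, the returned records end at Evaluation and Planning never runs, which is what makes the failure attributable to a single step.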

Phase 4A: stress testing contracts until they break

At this point, the question was no longer "does it compile?"

It was: are the contracts sufficient to constrain behavior under adversarial inputs?

So I added output-side stress tests and tightened contracts where they were weak:

  • Reasoning required at least one explicit reasoning step before conclusion
  • Evaluation required meaningful justifications when configured

The result: the system became robust in the one way that matters. It fails clearly when it should fail.
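As a rough illustration of what "fails clearly" can look like for the reasoning contract, the sketch below parses raw model text and rejects a conclusion that arrives without a preceding step. The STEP: and CONCLUSION: line prefixes, the ContractViolation enum and parse_reasoning are assumptions invented for this example, not the module's actual output format.

```rust
#[derive(Debug, PartialEq)]
enum ContractViolation {
    NoReasoningStep,
    MissingConclusion,
}

#[derive(Debug, PartialEq)]
struct Reasoning {
    steps: Vec<String>,
    conclusion: String,
}

/// Parses lines of the (assumed) form "STEP: ..." and "CONCLUSION: ...".
/// A conclusion is only accepted after at least one non-empty step.
fn parse_reasoning(raw: &str) -> Result<Reasoning, ContractViolation> {
    let mut steps = Vec::new();
    let mut conclusion = None;
    for line in raw.lines().map(str::trim) {
        if let Some(rest) = line.strip_prefix("STEP:") {
            if !rest.trim().is_empty() {
                steps.push(rest.trim().to_string());
            }
        } else if let Some(rest) = line.strip_prefix("CONCLUSION:") {
            if steps.is_empty() {
                return Err(ContractViolation::NoReasoningStep);
            }
            conclusion = Some(rest.trim().to_string());
            break; // anything after the conclusion is ignored
        }
    }
    match conclusion {
        Some(text) if !text.is_empty() => Ok(Reasoning { steps, conclusion: text }),
        _ => Err(ContractViolation::MissingConclusion),
    }
}
```

Adversarial cases then become ordinary table-driven tests: a bare conclusion, a blank step, or a step with no conclusion each map to one named violation.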

The contract stack: the "six invariants" that changed everything

The most important design element is the set of explicit contracts that govern behavior.

1) PromptStack: a five-layer cognitive API

Instead of a single monolithic "system prompt", the system uses a stack:

  • Priming: identity and invariants (small, stable)
  • Policy: versioned constraints
  • Task frame: per capability framing
  • State injection: structured data (from Polars)
  • Intent: minimal user ask (intent + criteria)

This prevents prompt entropy and makes cognition testable.
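The five layers map naturally onto a struct with one field per layer. The sketch below is an assumption about shape, not the real prompt.rs: the field names, the bracketed section labels and the key=value rendering of state are all invented for illustration. What it shows is that the layer order is fixed in code, so a rendered prompt can be asserted on in a test.

```rust
/// One field per layer; the names here are illustrative only.
struct PromptStack {
    priming: String,
    policy: String,
    task_frame: String,
    state: Vec<(String, String)>, // structured key/value state, e.g. computed metrics
    intent: String,
}

impl PromptStack {
    /// Renders the layers in a fixed order with explicit section labels.
    fn render(&self) -> String {
        let state = self
            .state
            .iter()
            .map(|(key, value)| format!("{}={}", key, value))
            .collect::<Vec<_>>()
            .join("\n");
        let layers = [
            ("PRIMING", self.priming.as_str()),
            ("POLICY", self.policy.as_str()),
            ("TASK", self.task_frame.as_str()),
            ("STATE", state.as_str()),
            ("INTENT", self.intent.as_str()),
        ];
        layers
            .iter()
            .map(|(name, body)| format!("[{}]\n{}", name, body))
            .collect::<Vec<_>>()
            .join("\n\n")
    }
}
```

Because state arrives as data rather than prose, changing a metric changes exactly one line of the rendered prompt.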

2) PromptVersion: prompts co-versioned with models

Prompts are part of the runtime contract. They are versioned explicitly:

  • reasoning:v1:llama3
  • deployment:v1:...

This makes upgrades and regressions traceable.
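A version string such as reasoning:v1:llama3 can be treated as a typed value instead of a free-form label. The sketch below assumes the format is capability:vN:model with a numeric N; the PromptVersion fields and the error messages are guesses made for this example.

```rust
use std::fmt;
use std::str::FromStr;

#[derive(Debug, Clone, PartialEq)]
struct PromptVersion {
    capability: String,
    version: u32,
    model: String,
}

impl FromStr for PromptVersion {
    type Err = String;

    /// Parses "capability:vN:model" (assumed format) and rejects anything else.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let parts: Vec<&str> = s.split(':').collect();
        if parts.len() != 3 || parts.iter().any(|p| p.is_empty()) {
            return Err(format!("expected capability:vN:model, got '{}'", s));
        }
        let version = parts[1]
            .strip_prefix('v')
            .and_then(|n| n.parse::<u32>().ok())
            .ok_or_else(|| format!("bad version segment '{}'", parts[1]))?;
        Ok(PromptVersion {
            capability: parts[0].to_string(),
            version,
            model: parts[2].to_string(),
        })
    }
}

impl fmt::Display for PromptVersion {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "{}:v{}:{}", self.capability, self.version, self.model)
    }
}
```

Parsing and printing round-trip, so the same value can be stored in a trace and compared across runs.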

3) Config validation: fail fast on incompatibility

Config is validated at startup:

  • tokenizer/model mismatches
  • precision/LoRA incompatibilities
  • invalid context budgets
  • special token errors

This avoids "it runs but the output is garbage" failure modes.
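Startup validation of this kind might look like the sketch below. Every field name and every rule here is an assumption made for illustration: in particular, the rule that a LoRA adapter cannot be combined with Int8 weights is a placeholder for whatever incompatibilities the real engine has. The point is the shape: collect every violation at startup and refuse to run.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Precision {
    F32,
    F16,
    Int8,
}

#[derive(Debug, PartialEq)]
enum ConfigError {
    VocabMismatch { tokenizer: usize, model: usize },
    LoraOnQuantizedWeights,
    InvalidContextBudget { requested: usize, window: usize },
    SpecialTokenOutOfRange { name: &'static str, id: u32 },
}

/// Hypothetical configuration; the real config.rs may differ.
struct LlmConfig {
    tokenizer_vocab: usize,
    model_vocab: usize,
    precision: Precision,
    lora_adapter: Option<String>,
    context_window: usize,
    prompt_budget: usize,
    output_budget: usize,
    eos_token_id: u32,
}

impl LlmConfig {
    /// Returns every violation at once so startup fails with a full report.
    fn validate(&self) -> Result<(), Vec<ConfigError>> {
        let mut errors = Vec::new();
        if self.tokenizer_vocab != self.model_vocab {
            errors.push(ConfigError::VocabMismatch {
                tokenizer: self.tokenizer_vocab,
                model: self.model_vocab,
            });
        }
        // Assumed rule for this sketch only: adapters need float weights.
        if self.lora_adapter.is_some() && self.precision == Precision::Int8 {
            errors.push(ConfigError::LoraOnQuantizedWeights);
        }
        let requested = self.prompt_budget + self.output_budget;
        if requested == 0 || requested > self.context_window {
            errors.push(ConfigError::InvalidContextBudget {
                requested,
                window: self.context_window,
            });
        }
        if self.eos_token_id as usize >= self.tokenizer_vocab {
            errors.push(ConfigError::SpecialTokenOutOfRange {
                name: "eos",
                id: self.eos_token_id,
            });
        }
        if errors.is_empty() { Ok(()) } else { Err(errors) }
    }
}
```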

4) InferenceEnvelope: determinism as a first-class concept

Inference is not "just generate tokens". It is a reproducible envelope:

  • tokenizer snapshot
  • generation params
  • seed policy
  • stopping criteria

This enables golden tests and debugging.
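One way to picture the envelope is as a plain value that can be compared, stored in a trace and checked for reproducibility. The sketch below is an assumption about its shape: the field names, the SeedPolicy variants and the apply_stop helper are invented for illustration and are not taken from inference.rs.

```rust
#[derive(Debug, Clone, PartialEq)]
enum SeedPolicy {
    Fixed(u64),
    Random,
}

#[derive(Debug, Clone, PartialEq)]
struct GenerationParams {
    temperature: f32,
    top_p: f32,
    max_new_tokens: usize,
}

/// Everything that must match for two runs to be comparable (illustrative fields).
#[derive(Debug, Clone, PartialEq)]
struct InferenceEnvelope {
    tokenizer_snapshot: String, // e.g. a content hash of the tokenizer files
    params: GenerationParams,
    seed: SeedPolicy,
    stop_sequences: Vec<String>,
}

impl InferenceEnvelope {
    /// Only a fixed seed qualifies a run for golden testing in this sketch.
    fn is_reproducible(&self) -> bool {
        matches!(self.seed, SeedPolicy::Fixed(_))
    }

    /// Cuts generated text at the earliest stop sequence, if one occurs.
    fn apply_stop<'a>(&self, text: &'a str) -> &'a str {
        let cut = self
            .stop_sequences
            .iter()
            .filter(|stop| !stop.is_empty())
            .filter_map(|stop| text.find(stop.as_str()))
            .min();
        match cut {
            Some(index) => &text[..index],
            None => text,
        }
    }
}
```

Two runs are comparable only when their envelopes are equal, which is why the whole struct derives PartialEq.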

5) OutputContract: output shape is explicit

Outputs are validated against contracts, not "vibes":

  • reasoning must include steps and conclusion
  • evaluation must include scores, confidence, justification
  • planning must include ordered steps and capability references

This is the key that makes small models useful. They are not asked to be brilliant. They are asked to be bounded and consistent.
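For evaluation and planning, output contracts of this kind can be sketched as small checkers over already-parsed output. The types and thresholds below are assumptions for illustration (scores in [0, 1], a configurable minimum justification length, plan steps numbered from 1, capabilities drawn from a known set); the real contracts may be shaped differently.

```rust
use std::collections::HashSet;

struct Evaluation {
    score: f32,
    confidence: f32,
    justification: String,
}

/// Setting min_justification_words to 0 turns the justification rule off.
struct EvaluationContract {
    min_justification_words: usize,
}

impl EvaluationContract {
    fn check(&self, e: &Evaluation) -> Result<(), String> {
        if !(0.0..=1.0).contains(&e.score) {
            return Err(format!("score {} outside [0, 1]", e.score));
        }
        if !(0.0..=1.0).contains(&e.confidence) {
            return Err(format!("confidence {} outside [0, 1]", e.confidence));
        }
        let words = e.justification.split_whitespace().count();
        if words < self.min_justification_words {
            return Err(format!("justification has {} words, need {}", words, self.min_justification_words));
        }
        Ok(())
    }
}

struct PlanStep {
    order: usize,
    capability: String,
}

struct PlanContract {
    known_capabilities: HashSet<String>,
}

impl PlanContract {
    /// Steps must be numbered 1..=n in order and reference known capabilities only.
    fn check(&self, steps: &[PlanStep]) -> Result<(), String> {
        if steps.is_empty() {
            return Err("plan has no steps".to_string());
        }
        for (index, step) in steps.iter().enumerate() {
            if step.order != index + 1 {
                return Err(format!("step {} is out of order", step.order));
            }
            if !self.known_capabilities.contains(&step.capability) {
                return Err(format!("unknown capability '{}'", step.capability));
            }
        }
        Ok(())
    }
}
```

A plan that references a capability outside the known set is rejected before any software consumes it.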

6) DecisionTrace: provenance for every step

Each chain step produces a trace:

  • inputs
  • prompt version
  • envelope
  • raw output
  • validation result

This creates a pipeline that is:

  • debuggable
  • auditable
  • training-ready (later)

This is how you build an agent system that can improve over time without guessing.
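A trace with those five parts can be modelled as a plain record. The sketch below is illustrative only: the field types (strings throughout, with the envelope reduced to an identifier) and the helpers first_failure and audit_line are assumptions, chosen to show how a failure gets attributed to one step.

```rust
/// One record per chain step; field types are simplified for illustration.
#[derive(Debug, Clone)]
struct DecisionTrace {
    step: String,
    input: String,
    prompt_version: String,
    envelope_id: String,
    raw_output: String,
    validation: Result<(), String>,
}

/// Finds the first step whose output failed validation.
fn first_failure(traces: &[DecisionTrace]) -> Option<&DecisionTrace> {
    traces.iter().find(|trace| trace.validation.is_err())
}

/// Formats a one-line audit entry for logs or review.
fn audit_line(trace: &DecisionTrace) -> String {
    let status = match &trace.validation {
        Ok(()) => "ok".to_string(),
        Err(reason) => format!("failed: {}", reason),
    };
    format!(
        "{} [{}] envelope={} -> {}",
        trace.step, trace.prompt_version, trace.envelope_id, status
    )
}
```

Because the input and raw output are stored alongside the verdict, a failed step can be replayed in isolation.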

Why this enables convergence of collaborating models

The word "converge" matters.

In a collaborative agent framework, you may have:

  • a fast small model for classification or scoring
  • a medium model for planning
  • a retrieval layer for grounding
  • a database for state and memory

Convergence happens when each component has:

  • a well-defined role
  • a controlled interface
  • explicit success criteria
  • observable failure modes

Contracts are how you get that.

Without contracts, collaboration becomes:

  • prompt spaghetti
  • ad-hoc retries
  • brittle "model whispering"

With contracts, collaboration becomes:

  • predictable
  • improvable
  • composable

Why small and medium models are enough for many reasoning tasks

"Reasoning" is overloaded.

Many tasks we call reasoning are actually:

  • structured decision-making
  • constraint satisfaction
  • prioritization
  • scoring
  • drafting a plan from computed state

These tasks are not primarily limited by model IQ. They are limited by:

  • unclear inputs
  • unclear outputs
  • unclear policies

Once you fix those, the model's job becomes easier. Smaller models suddenly become viable:

  • lower latency
  • lower memory footprint
  • more predictable behavior
  • easier scaling across many agent steps

In other words:

You do not need a bigger brain. You need a better nervous system.

Polars gives you the compute substrate. Burn gives you embedded inference. Contracts give you the nervous system.

The long-term payoff: training becomes justified, not speculative

This is the part most systems miss.

Because every decision step is traced and validated, you automatically accumulate:

  • successful chains (training candidates)
  • failures with explicit reasons
  • boundary cases that expose ambiguity

So when you eventually do LoRA, it is not "let's see if training helps".

It is:

  • "Evaluation confidence is miscalibrated in these scenarios"
  • "Planning consistently invents capabilities when state is underspecified"
  • "Reasoning reaches confident conclusions under contradictory metrics"

Training becomes targeted correction on measured deficiencies.
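To show how traces turn into a measured list of deficiencies, here is a small sketch that counts validation failures per stage and reason. TracedStep and failure_report are hypothetical names, and the failure reasons in any real system would come from its own contracts; this only illustrates the aggregation.

```rust
use std::collections::BTreeMap;

/// Minimal view of a trace: which stage ran and why it failed, if it did.
struct TracedStep {
    stage: String,
    failure: Option<String>,
}

/// Counts validation failures per (stage, reason), most frequent first.
fn failure_report(traces: &[TracedStep]) -> Vec<((String, String), usize)> {
    let mut counts: BTreeMap<(String, String), usize> = BTreeMap::new();
    for trace in traces {
        if let Some(reason) = &trace.failure {
            *counts
                .entry((trace.stage.clone(), reason.clone()))
                .or_insert(0) += 1;
        }
    }
    let mut report: Vec<_> = counts.into_iter().collect();
    // Stable sort: ties keep the alphabetical order from the BTreeMap.
    report.sort_by(|a, b| b.1.cmp(&a.1));
    report
}

/// Steps that passed validation and could serve as training candidates.
fn training_candidates(traces: &[TracedStep]) -> Vec<&TracedStep> {
    traces.iter().filter(|trace| trace.failure.is_none()).collect()
}
```

The top entries of the report are the scenarios worth a targeted adapter; the passing steps are the raw material for it.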

That is how learning should be introduced.

Closing: the real goal is engineering intelligence, not calling it

The point of converge-llm is not to prove that Rust can run an LLM. It can.

The point is to prove a system design thesis:

  • structure beats scale for many agent decisions
  • contracts beat prompt artistry
  • reproducibility beats intuition
  • pure Rust makes the system honest
  • Polars + Burn turn "LLM work" into a compute-first pipeline
  • convergence emerges when collaborating components share explicit interfaces

The result is a module that behaves less like a chatbot and more like a dependable subsystem in a larger agent runtime.

That is the kind of intelligence infrastructure I want to build.

Appendix: the stack (practical summary)

  • SurrealDB: state and memory storage (flexible, multi-model)
  • LanceDB: vector retrieval (fast embedding search)
  • Polars: compute substrate (metrics, feature extraction, state injection)
  • Burn: embedded inference and future learning (pure Rust control)
  • converge-llm: prompt contracts, inference envelopes, validation, decision chains

Stockholm, Sweden

January 16, 2026

Kenneth Pernyér