
Why Local LLMs

AI that runs on your hardware, under your control

v1.1 · 9 min read · Kenneth Pernyér
Tags: llm, ollama, local, privacy, edge

The Problem

Cloud AI APIs have three fundamental constraints: latency, privacy, and cost.

Latency: Every API call crosses the internet. For real-time applications, this adds 100-500ms minimum.

Privacy: Your prompts and data go to third-party servers. For sensitive data, this may violate policy or regulation.

Cost: Per-token pricing adds up. High-volume applications can become prohibitively expensive.

We needed AI capabilities that:

  • Run with millisecond-scale latency (no network round-trip)
  • Keep all data on-premises
  • Have predictable costs at scale
  • Work offline or in air-gapped environments

Current Options

Cloud APIs (OpenAI, Anthropic): frontier capabilities via API.

  Pros:
  • Best-in-class capabilities
  • No infrastructure to manage
  • Automatic updates and improvements

  Cons:
  • Latency (100ms+ minimum)
  • Data leaves your control
  • Per-token costs at scale
  • Dependency on provider availability

Ollama: simple local LLM runner; one command to run models.

  Pros:
  • Dead-simple setup
  • Runs Llama, Mistral, and many others
  • OpenAI-compatible API
  • Active community

  Cons:
  • Requires capable hardware
  • Models are less capable than frontier models
  • No fine-tuning support

Self-hosted vLLM/TGI: production-grade inference servers.

  Pros:
  • Maximum throughput
  • Batching and optimization
  • Fine-tuned model support
  • Full control

  Cons:
  • Complex setup and tuning
  • Significant hardware investment
  • Operational expertise required

Future Outlook

Local LLMs are getting good enough for production use cases.

The capability gap is shrinking.

Llama 3, Mistral, and Qwen are approaching GPT-3.5 level for many tasks. For structured generation, code completion, and domain-specific tasks, local models are viable.

Hardware is improving faster than model requirements. Apple Silicon runs 7B parameter models smoothly. NVIDIA's inference optimizations keep pushing throughput higher.

The future is hybrid: frontier models for complex reasoning, local models for routine tasks, edge models for latency-critical paths. The decision is per-use-case, not all-or-nothing.
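This per-use-case decision can be expressed directly in code. The sketch below is illustrative: the task shape, the latency threshold, and the routing rules are assumptions, not a prescribed policy.

```typescript
// Hypothetical per-use-case router between a local model and a frontier API.
// The task fields and thresholds are illustrative assumptions.
type Task = {
  kind: 'autocomplete' | 'classification' | 'extraction' | 'reasoning';
  latencyBudgetMs: number;
  containsSensitiveData: boolean;
};

type Backend = 'local' | 'frontier';

// Route each task to the backend that satisfies its constraints.
export function chooseBackend(task: Task): Backend {
  // Sensitive data must never leave our infrastructure.
  if (task.containsSensitiveData) return 'local';
  // Tight latency budgets rule out a network round-trip.
  if (task.latencyBudgetMs < 100) return 'local';
  // Complex reasoning still favors frontier models.
  if (task.kind === 'reasoning') return 'frontier';
  // Routine tasks default to the local model.
  return 'local';
}
```

The point is that routing is a small, testable function, not an architectural commitment.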

Our Decision

Why we chose this

  • Low latency: no network round-trip; responses in milliseconds
  • Complete privacy: data never leaves your infrastructure
  • Predictable cost: hardware cost is fixed; no per-token fees
  • Always available: no dependency on external services

Trade-offs we accept

  • Lower capability: local models trail frontier by 1-2 years
  • Hardware requirements: need a GPU or Apple Silicon for good performance
  • Operational overhead: must manage model updates and deployment

Motivation

We use local LLMs for latency-sensitive and privacy-critical paths. Real-time autocomplete, on-device inference, and processing sensitive documents all benefit from local execution.

The capability tradeoff is acceptable for these use cases. We're not asking local models to write complex code or reason through novel problems. We're asking them to complete text, classify content, and extract structured data.

Keeping this option open also provides flexibility. If cloud API costs spike or availability drops, we have a fallback that keeps core functionality running.
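One way to keep that fallback honest is to wrap both backends behind a single interface. A minimal sketch, assuming both the cloud and local paths expose the same prompt-to-string signature (the names here are illustrative, not from our codebase):

```typescript
// Hypothetical fallback wrapper: try the primary (cloud) completer, and fall
// back to the local model if the call fails or exceeds a timeout.
type Completer = (prompt: string) => Promise<string>;

export function withFallback(
  primary: Completer,
  fallback: Completer,
  timeoutMs = 5000,
): Completer {
  return async (prompt: string) => {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error('primary timed out')), timeoutMs);
    });
    try {
      // Race the primary call against the timeout.
      return await Promise.race([primary(prompt), timeout]);
    } catch {
      // Cloud unavailable or too slow: keep core functionality running locally.
      return fallback(prompt);
    } finally {
      // Clear the timer so a late timeout never becomes an unhandled rejection.
      clearTimeout(timer);
    }
  };
}
```

Because the wrapper takes plain functions, the same pattern also works in reverse (local first, cloud for overflow).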

Recommendation

Start with Ollama for experimentation. It runs on Mac, Linux, and Windows with minimal setup:

ollama run llama3.2

For production, evaluate:

  • Use case fit: Is the task simple enough for a smaller model?
  • Latency requirements: Does your use case benefit from local inference?
  • Privacy constraints: Does data sensitivity require local processing?
  • Volume: At what volume do API costs exceed hardware costs?
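The volume question is simple arithmetic. A back-of-envelope sketch, where every price is an illustrative assumption rather than a quote:

```typescript
// Break-even estimate: at what monthly token volume does per-token API
// pricing exceed amortized hardware cost? All figures are assumptions.
const API_COST_PER_MILLION_TOKENS = 0.5; // USD, assumed blended rate
const HARDWARE_COST = 2400;              // USD, assumed GPU workstation
const AMORTIZATION_MONTHS = 24;          // write the hardware off over two years

// Monthly hardware cost, ignoring power and ops for simplicity.
const monthlyHardwareCost = HARDWARE_COST / AMORTIZATION_MONTHS; // 100 USD

// Tokens per month at which API spend matches hardware amortization.
export function breakEvenTokensPerMonth(): number {
  return (monthlyHardwareCost / API_COST_PER_MILLION_TOKENS) * 1_000_000;
}
// Under these assumptions: 100 / 0.5 = 200, i.e. 200 million tokens per month.
```

Below the break-even volume the API is cheaper; above it, local hardware wins, before counting the latency and privacy benefits.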

Common wins for local models:

  • Autocomplete and text expansion
  • Document classification
  • Entity extraction
  • Embeddings generation
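Locally generated embeddings are consumed exactly like cloud ones: compare vectors, rank by similarity. A minimal cosine-similarity helper (independent of any provider):

```typescript
// Cosine similarity between two embedding vectors, regardless of whether
// they came from a local or a cloud model.
export function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // 1 = identical direction, 0 = orthogonal, -1 = opposite.
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Keeping this math in your own code also means you can switch embedding providers without touching the retrieval logic.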

Examples

lib/ai/local.ts (TypeScript)
// Ollama exposes an OpenAI-compatible API
import OpenAI from 'openai';

const ollama = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama', // Required but not used
});

export async function completeLocally(prompt: string): Promise<string> {
  const response = await ollama.chat.completions.create({
    model: 'llama3.2',
    messages: [{ role: 'user', content: prompt }],
    max_tokens: 256,
  });

  return response.choices[0]?.message?.content ?? '';
}

// Use for latency-sensitive paths
export async function classifyIntent(text: string): Promise<string> {
  const response = await completeLocally(`
Classify this text into one of: question, request, complaint, other.
Output only the category, nothing else.

Text: ${text}
Category:`);

  return response.trim().toLowerCase();
}

Ollama provides an OpenAI-compatible API, so existing code using the OpenAI SDK works with minimal changes.


Stockholm, Sweden

