Why Local LLMs
AI that runs on your hardware, under your control
The Problem
Cloud AI APIs have three fundamental constraints: latency, privacy, and cost.
Latency: Every API call crosses the internet. For real-time applications, this adds 100-500ms minimum.
Privacy: Your prompts and data go to third-party servers. For sensitive data, this may violate policy or regulation.
Cost: Per-token pricing adds up. High-volume applications can become prohibitively expensive.
We needed AI capabilities that:
- Run with single-digit millisecond latency
- Keep all data on-premises
- Have predictable costs at scale
- Work offline or in air-gapped environments
Current Options
| Option | Pros | Cons |
|---|---|---|
| Cloud APIs (OpenAI, Anthropic) | Frontier capabilities via API | Network latency; data leaves your infrastructure; per-token pricing |
| Ollama | Simple local LLM runner; one command to run models | Single-node tool, not built for high-throughput serving |
| Self-hosted vLLM/TGI | Production-grade inference servers | Requires GPUs and significant operational expertise |
Future Outlook
Local LLMs are getting good enough for production use cases.
The capability gap is shrinking.
Llama 3, Mistral, and Qwen are approaching GPT-3.5 level for many tasks. For structured generation, code completion, and domain-specific tasks, local models are viable.
Hardware is improving faster than model requirements. Apple Silicon runs 7B parameter models smoothly. NVIDIA's inference optimizations keep pushing throughput higher.
The future is hybrid: frontier models for complex reasoning, local models for routine tasks, edge models for latency-critical paths. The decision is per-use-case, not all-or-nothing.
Our Decision
✓ Why we chose this
- Low latency: no network round-trip; responses in milliseconds
- Complete privacy: data never leaves your infrastructure
- Predictable cost: hardware cost is fixed; no per-token fees
- Always available: no dependency on external services
× Trade-offs we accept
- Lower capability: local models trail frontier models by 1-2 years
- Hardware requirements: need a GPU or Apple Silicon for good performance
- Operational overhead: must manage model updates and deployment
Motivation
We use local LLMs for latency-sensitive and privacy-critical paths. Real-time autocomplete, on-device inference, and processing sensitive documents all benefit from local execution.
The capability tradeoff is acceptable for these use cases. We're not asking local models to write complex code or reason through novel problems. We're asking them to complete text, classify content, and extract structured data.
Keeping this option open also provides flexibility. If cloud API costs spike or availability drops, we have a fallback that keeps core functionality running.
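That fallback can be sketched as a small wrapper: prefer the cloud model, and fall back to the local one when the cloud call fails or times out. The function names and the 2-second timeout here are illustrative assumptions, not a prescribed design.

```typescript
// Fallback sketch: try the cloud completion first, fall back to the
// local model on error or timeout. Both completion functions are
// hypothetical stand-ins for your actual cloud and local clients.
type Complete = (prompt: string) => Promise<string>;

export function withLocalFallback(
  cloud: Complete,
  local: Complete,
  timeoutMs = 2000,
): Complete {
  return async (prompt: string) => {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error('cloud timeout')), timeoutMs);
    });
    try {
      return await Promise.race([cloud(prompt), timeout]);
    } catch {
      return local(prompt); // keep core functionality running locally
    } finally {
      clearTimeout(timer); // don't leave the timer pending
    }
  };
}
```

Wrapping call sites this way means the cloud outage path is exercised by ordinary error handling rather than a separate code path.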
Recommendation
Start with Ollama for experimentation. It runs on Mac, Linux, and Windows with minimal setup:
```shell
ollama run llama3.2
```
For production, evaluate:
- Use case fit: Is the task simple enough for a smaller model?
- Latency requirements: Does your use case benefit from local inference?
- Privacy constraints: Does data sensitivity require local processing?
- Volume: At what volume do API costs exceed hardware costs?
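The volume question reduces to simple arithmetic: compare amortized hardware cost against per-token API pricing. The numbers below are hypothetical placeholders, not actual prices.

```typescript
// Break-even sketch: at what monthly token volume does fixed hardware
// cost beat per-token API pricing? All dollar figures are hypothetical.
export function breakEvenTokensPerMonth(
  hardwareCostUsd: number,          // upfront hardware cost
  amortizationMonths: number,       // months to amortize over
  apiPricePerMillionTokens: number, // API price per 1M tokens
): number {
  const monthlyHardwareCost = hardwareCostUsd / amortizationMonths;
  return (monthlyHardwareCost / apiPricePerMillionTokens) * 1_000_000;
}

// e.g. a $2,400 GPU amortized over 24 months vs $0.50 per 1M tokens:
// local hardware wins above 200M tokens per month.
const breakEven = breakEvenTokensPerMonth(2400, 24, 0.5);
```

Real comparisons should also fold in electricity, operations time, and the cloud model's higher capability, but the break-even volume is the first filter.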
Common wins for local models:
- Autocomplete and text expansion
- Document classification
- Entity extraction
- Embeddings generation
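For the embeddings case, a minimal sketch against Ollama's local embeddings endpoint looks like the following. The endpoint path, request shape, and model name (`nomic-embed-text`) are assumptions to verify against your Ollama version; the cosine-similarity helper is plain math.

```typescript
// Embeddings sketch against a local Ollama server (assumed endpoint
// and model name; check your Ollama version's API docs).
export async function embed(text: string): Promise<number[]> {
  const res = await fetch('http://localhost:11434/api/embeddings', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: 'nomic-embed-text', prompt: text }),
  });
  const json = (await res.json()) as { embedding: number[] };
  return json.embedding;
}

// Pure helper for comparing two embedding vectors.
export function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Because embeddings are generated locally, you can re-embed whole corpora without per-token fees, which is where local models most clearly beat APIs on cost.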
Examples
```typescript
// Ollama exposes an OpenAI-compatible API
import OpenAI from 'openai';

const ollama = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama', // Required by the SDK but not checked by Ollama
});

export async function completeLocally(prompt: string): Promise<string> {
  const response = await ollama.chat.completions.create({
    model: 'llama3.2',
    messages: [{ role: 'user', content: prompt }],
    max_tokens: 256,
  });
  return response.choices[0]?.message?.content ?? '';
}

// Use for latency-sensitive paths
export async function classifyIntent(text: string): Promise<string> {
  const response = await completeLocally(`
Classify this text into one of: question, request, complaint, other.
Output only the category, nothing else.
Text: ${text}
Category:`);
  return response.trim().toLowerCase();
}
```

Ollama provides an OpenAI-compatible API, so existing code using the OpenAI SDK works with minimal changes.
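The structured-extraction use case follows the same pattern. This sketch takes any prompt-to-string completion function (such as completeLocally above); the prompt wording and field names are illustrative assumptions. The parsing helper exists because small models often wrap JSON in markdown fences or surrounding prose.

```typescript
// Structured extraction sketch. `complete` is any prompt -> string
// function, e.g. a local completion helper; field names are examples.
export async function extractContact(
  complete: (prompt: string) => Promise<string>,
  text: string,
): Promise<unknown> {
  const response = await complete(`
Extract the person's name and email from the text below.
Respond with JSON only, e.g. {"name": "...", "email": "..."}.
Text: ${text}`);
  return parseModelJson(response);
}

// Pure helper: pull the first {...} span out of model output before
// parsing, tolerating markdown fences and leading/trailing prose.
export function parseModelJson(raw: string): unknown {
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) throw new Error('no JSON object in model output');
  return JSON.parse(match[0]);
}
```

Keeping the parse step separate from the model call makes the fragile part (model output formatting) easy to test without running inference.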