Why Local LLMs
AI that runs on your hardware, under your control
The Problem
Cloud AI APIs have three fundamental constraints: latency, privacy, and cost.
Latency: Every API call crosses the internet. For real-time applications, this adds 100-500ms minimum.
Privacy: Your prompts and data go to third-party servers. For sensitive data, this may violate policy or regulation.
Cost: Per-token pricing adds up. High-volume applications can become prohibitively expensive.
We needed AI capabilities that:
- Run with single-digit millisecond latency
- Keep all data on-premises
- Have predictable costs at scale
- Work offline or in air-gapped environments
Current Options
| Option | Pros | Cons |
|---|---|---|
| Cloud APIs (OpenAI, Anthropic) | Frontier capabilities via API | Network latency; data leaves your infrastructure; per-token pricing |
| Ollama | Simple local LLM runner; one command to run models | Single-node tool, not built for high-throughput serving |
| Self-hosted vLLM/TGI | Production-grade inference servers | Requires GPUs and significant operational expertise |
Future Outlook
Local LLMs are getting good enough for production use cases.
The capability gap is shrinking.
Llama 3, Mistral, and Qwen are approaching GPT-3.5 level for many tasks. For structured generation, code completion, and domain-specific tasks, local models are viable.
Hardware is improving faster than model requirements. Apple Silicon runs 7B parameter models smoothly. NVIDIA's inference optimizations keep pushing throughput higher.
The future is hybrid: frontier models for complex reasoning, local models for routine tasks, edge models for latency-critical paths. The decision is per-use-case, not all-or-nothing.
Our Decision
✓ Why we chose this
- Low latency: no network round-trip; responses in milliseconds
- Complete privacy: data never leaves your infrastructure
- Predictable cost: hardware cost is fixed; no per-token fees
- Always available: no dependency on external services
× Trade-offs we accept
- Lower capability: local models trail frontier models by 1-2 years
- Hardware requirements: need a GPU or Apple Silicon for good performance
- Operational overhead: must manage model updates and deployment
Motivation
We use local LLMs for latency-sensitive and privacy-critical paths. Real-time autocomplete, on-device inference, and processing sensitive documents all benefit from local execution.
The capability tradeoff is acceptable for these use cases. We're not asking local models to write complex code or reason through novel problems. We're asking them to complete text, classify content, and extract structured data.
Keeping this option open also provides flexibility. If cloud API costs spike or availability drops, we have a fallback that keeps core functionality running.
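That fallback can be sketched as a small wrapper: prefer the cloud model, and fall back to the local one when the cloud call fails or times out. The function names and the 2-second timeout here are illustrative assumptions, not a prescribed design.

```typescript
// Fallback sketch: try the cloud completion first, fall back to the
// local model on error or timeout. Both completion functions are
// hypothetical stand-ins for your actual cloud and local clients.
type Complete = (prompt: string) => Promise<string>;

export function withLocalFallback(
  cloud: Complete,
  local: Complete,
  timeoutMs = 2000,
): Complete {
  return async (prompt: string) => {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error('cloud timeout')), timeoutMs);
    });
    try {
      return await Promise.race([cloud(prompt), timeout]);
    } catch {
      return local(prompt); // keep core functionality running locally
    } finally {
      clearTimeout(timer); // don't leave the timer pending
    }
  };
}
```

Wrapping call sites this way means the cloud outage path is exercised by ordinary error handling rather than a separate code path.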
Recommendation
Start with Ollama for experimentation. It runs on Mac, Linux, and Windows with minimal setup:
```shell
ollama run llama3.2
```
For production, evaluate:
- Use case fit: Is the task simple enough for a smaller model?
- Latency requirements: Does your use case benefit from local inference?
- Privacy constraints: Does data sensitivity require local processing?
- Volume: At what volume do API costs exceed hardware costs?
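The volume question reduces to simple arithmetic: compare amortized hardware cost against per-token API pricing. The numbers below are hypothetical placeholders, not actual prices.

```typescript
// Break-even sketch: at what monthly token volume does fixed hardware
// cost beat per-token API pricing? All dollar figures are hypothetical.
export function breakEvenTokensPerMonth(
  hardwareCostUsd: number,          // upfront hardware cost
  amortizationMonths: number,       // months to amortize over
  apiPricePerMillionTokens: number, // API price per 1M tokens
): number {
  const monthlyHardwareCost = hardwareCostUsd / amortizationMonths;
  return (monthlyHardwareCost / apiPricePerMillionTokens) * 1_000_000;
}

// e.g. a $2,400 GPU amortized over 24 months vs $0.50 per 1M tokens:
// local hardware wins above 200M tokens per month.
const breakEven = breakEvenTokensPerMonth(2400, 24, 0.5);
```

Real comparisons should also fold in electricity, operations time, and the cloud model's higher capability, but the break-even volume is the first filter.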
Common wins for local models:
- Autocomplete and text expansion
- Document classification
- Entity extraction
- Embeddings generation
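For the embeddings case, a minimal sketch against Ollama's local embeddings endpoint looks like the following. The endpoint path, request shape, and model name (`nomic-embed-text`) are assumptions to verify against your Ollama version; the cosine-similarity helper is plain math.

```typescript
// Embeddings sketch against a local Ollama server (assumed endpoint
// and model name; check your Ollama version's API docs).
export async function embed(text: string): Promise<number[]> {
  const res = await fetch('http://localhost:11434/api/embeddings', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: 'nomic-embed-text', prompt: text }),
  });
  const json = (await res.json()) as { embedding: number[] };
  return json.embedding;
}

// Pure helper for comparing two embedding vectors.
export function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Because embeddings are generated locally, you can re-embed whole corpora without per-token fees, which is where local models most clearly beat APIs on cost.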
Examples
```typescript
// Ollama exposes an OpenAI-compatible API
import OpenAI from 'openai';

const ollama = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama', // Required by the SDK but not checked by Ollama
});

export async function completeLocally(prompt: string): Promise<string> {
  const response = await ollama.chat.completions.create({
    model: 'llama3.2',
    messages: [{ role: 'user', content: prompt }],
    max_tokens: 256,
  });
  return response.choices[0]?.message?.content ?? '';
}

// Use for latency-sensitive paths
export async function classifyIntent(text: string): Promise<string> {
  const response = await completeLocally(`
Classify this text into one of: question, request, complaint, other.
Output only the category, nothing else.
Text: ${text}
Category:`);
  return response.trim().toLowerCase();
}
```

Ollama provides an OpenAI-compatible API, so existing code using the OpenAI SDK works with minimal changes.
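The structured-extraction use case follows the same pattern. This sketch takes any prompt-to-string completion function (such as completeLocally above); the prompt wording and field names are illustrative assumptions. The parsing helper exists because small models often wrap JSON in markdown fences or surrounding prose.

```typescript
// Structured extraction sketch. `complete` is any prompt -> string
// function, e.g. a local completion helper; field names are examples.
export async function extractContact(
  complete: (prompt: string) => Promise<string>,
  text: string,
): Promise<unknown> {
  const response = await complete(`
Extract the person's name and email from the text below.
Respond with JSON only, e.g. {"name": "...", "email": "..."}.
Text: ${text}`);
  return parseModelJson(response);
}

// Pure helper: pull the first {...} span out of model output before
// parsing, tolerating markdown fences and leading/trailing prose.
export function parseModelJson(raw: string): unknown {
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) throw new Error('no JSON object in model output');
  return JSON.parse(match[0]);
}
```

Keeping the parse step separate from the model call makes the fragile part (model output formatting) easy to test without running inference.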