From a single GPU server to a multi-device stack: local LLMs, voice assistants, document intelligence, a high-memory workstation, and automation — all on your own hardware, no cloud required.
Cloud AI is convenient. It's also opaque about data handling, expensive at heavy usage, and entirely dependent on someone else's infrastructure and policies. A well-built local stack gives you something different: models that run on your hardware, answer to no one, and keep every conversation off someone else's servers.
The economics shifted in 2025–2026. Quantized open-weight models running on consumer hardware now match or exceed GPT-4-class quality on most everyday tasks. The hardware is widely available. The tooling — llama.cpp, Open WebUI, Whisper, Piper — is mature and actively developed.
Every token stays on your hardware. No telemetry, no training on your data, no Terms of Service changes that affect what you can ask.
Hardware pays for itself at heavy usage. No per-token fees, no subscription tiers, no bills that scale with usage.
Swap models freely. Tune context windows. Run specialized variants. Integrate with anything via OpenAI-compatible APIs.
Everything works on your local network. No internet dependency for inference — better reliability and lower latency.
This isn't a single-machine setup — it's a distributed stack that grew to cover different use cases. Each piece solves a specific problem and they work together through standard HTTP APIs.
The workhorse of the stack. A desktop machine with a 24GB NVIDIA GPU running llama.cpp as a system service. It handles primary chat inference, the voice backend, embeddings, and document Q&A simultaneously.
| CPU | 16-core desktop processor (AM5 platform) |
| GPU | NVIDIA RTX 3090 Ti, 24GB GDDR6X |
| RAM | 64GB DDR5-6000 |
| Storage | Two NVMe SSDs — OS on one, data and models on the other |
| OS | Ubuntu Server 22.04 LTS (headless) |
llama.cpp runs as a systemd service, exposing an OpenAI-compatible HTTP API internally and (via reverse proxy) externally with API key authentication.
cmake .. -DGGML_CUDA=ON cmake --build . --config Release -j$(nproc) # Key flags for 24GB VRAM + 26B MoE model llama-server \ --model model.gguf \ --port 8001 \ --n-gpu-layers 999 \ --ctx-size 65536 \ --jinja \ -fa on \ -ctk q8_0 -ctv q8_0 \ --metrics
| Architecture | Google Gemma 4, MoE — 26B total parameters, 4B active per token |
| Context | 65,536 tokens (~50,000 words in a single conversation) |
| KV cache | q8_0 quantization — higher quality than q4_0, enables the large context window |
| Flash attention | Enabled — reduces memory pressure on long contexts |
| Thinking mode | On in web chat, off in voice/task extraction paths (3–10x speed difference) |
With a 26B model, Whisper, and an embedding server all running, the 24GB card stays near capacity. Key strategies that make it work:
-ctk/-ctv q8_0) trades a small quality loss for major VRAM savings--ctx-size to 32768 if out-of-memory errors appearGemma 4 26B-A4B — MoE, excellent reasoning, 65K+ context
GLM-4.7-Flash or Mistral Small 3 24B — ~100–120 tok/s, ideal for real-time voice
Qwen2.5-Coder 32B — specialized training, strong completions and refactoring
Qwen3-30B-A3B with thinking mode — MoE efficiency with extended reasoning chains
A fundamentally different approach. The AMD Ryzen AI Max+ integrates an iGPU that can access the entire system memory pool — 128GB — eliminating the VRAM ceiling that constrains discrete GPU setups.
Models that would require a multi-GPU server run on a single consumer device. Qwen 3.5 122B at Q4_K_M (~70GB) loads with room to spare. 128K context windows are practical. At idle, the whole system draws about 5W.
| Chip | AMD Ryzen AI Max+ 395 — Zen 5 CPU cores, RDNA 3.5 iGPU, XDNA2 NPU |
| Memory | 128GB LPDDR5X-8000 unified — CPU and GPU share one pool |
| GPU access | ~118GB available to the iGPU via GTT (after minimal UMA frame buffer) |
| OS | Ubuntu 26.04 LTS Server |
| AI stack | ROCm 7.x via apt, llama.cpp + whisper.cpp with HIP support |
# Install ROCm — no DKMS needed on Ubuntu 26.04 sudo apt install rocm # Build llama.cpp with HIP backend cmake .. \ -DGGML_HIP=ON \ -DGGML_RPC=ON \ -DAMDGPU_TARGETS=gfx1151 cmake --build . -j$(nproc) # Required in systemd service file: Environment="HSA_OVERRIDE_GFX_VERSION=11.5.1"
| Idle power | ~4–5W — efficient enough for always-on deployment |
| Load power | ~120W during active inference |
| Prompt speed | ~67 tok/s (fills 128K context quickly) |
| Gen speed | ~10–20 tok/s — fine for async tasks, slower for chat |
| Best use cases | Long document analysis, deep reasoning, tasks needing massive context |
When the LLM and Whisper services start simultaneously at boot, they race for ROCm initialization and Whisper loses, falling back to CPU. Fix by sequencing the services:
# whisper-server.service [Unit] section: After=network.target llama-server.service # [Service] section: ExecStartPre=/bin/sleep 15 # Without this, whisper runs on CPU after every reboot.
HSA_OVERRIDE_GFX_VERSION is required for gfx1151. Power monitoring via amd-smi returns N/A — use sensors | grep PPT. TDP is set via the front panel button, not software.Open WebUI is the primary chat interface. It runs in Docker, connects to llama.cpp's OpenAI-compatible API, and handles multiple users with separate conversation histories.
services:
open-webui:
image: ghcr.io/open-webui/open-webui:main
ports:
- "8080:8080"
volumes:
- /data/open-webui:/app/backend/data
restart: always
docker compose up -d
# Update: docker compose pull && docker compose down && docker compose up -d
A custom FastAPI service wrapping Whisper and the LLM into a multi-purpose voice processing API. One service handles transcription, AI task extraction from voice memos, the voice assistant endpoint, and a web dashboard.
Audio → Whisper transcription → LLM extracts tasks with owners and deadlines → saved to SQLite → structured JSON response.
Transcription only. Returns transcript, language, and segments. No LLM call, no DB write.
Voice command handler for the kitchen assistant. Classifies intent and routes to the right handler — often without any LLM call.
Password-protected interface showing all recordings, transcripts, extracted tasks, and audio stats. Web recorder built in.
| Time / date | System clock directly. ~0ms. No LLM. |
| Weather | SearXNG web search. No LLM. |
| Timers | JSON response only — managed on Pi. No LLM. |
| Time-sensitive | SearXNG + current date appended. |
| General queries | Full LLM inference. ~7s total with Whisper. |
| Whisper | ~1 second (GPU, large-v3-turbo) |
| Task extraction | ~6 seconds (thinking mode disabled) |
| Full /process | ~7 seconds end-to-end |
| Search query | ~10–15 seconds (includes web search) |
Whisper hallucination filtering rejects transcripts with too few ASCII alpha characters before they reach the LLM — preventing garbage responses from silence or non-speech audio.
Supported audio formats: .wav .mp3 .m4a .ogg .flac .webm
A Raspberry Pi 4 on the kitchen counter running a fully local voice assistant. No Alexa. No Google Home. Wake word detection, speech recognition, AI responses, text-to-speech, and timer management — all self-hosted.
| Device | Raspberry Pi 4 (4GB RAM) |
| Audio | USB speakerphone — mic and speaker in one unit, single cable |
| Processing | Wake word and TTS run on Pi CPU; heavy inference offloaded to the server |
Porcupine v4 with a custom "Hey Helix" model. Runs on Pi CPU continuously at negligible power draw.
Audio sent to server Whisper endpoint over HTTPS. ~1 second round-trip.
Voice model loaded once at startup. Raw PCM streamed directly to speaker — near-instant playback, no cold-start.
Timers are managed entirely on the Pi without server round-trips:
Linux assigns audio device numbers dynamically at boot. A USB speakerphone that's card 2 today may be card 3 after adding a different USB device. Configure by device name, not number, for boot stability:
# ~/.asoundrc — use name, not card number
pcm.!default {
type hw
card Plus # stable across reboots
device 0
}
ExecStartPre=/bin/sleep 10 to let USB audio enumerate fully before the Python script opens it. Without this, the service frequently fails on boot.A second Raspberry Pi built as a dedicated voice recorder for capturing thoughts, tasks, and meeting notes. Press a button → record → release → automatically transcribed and task-extracted within seconds.
| Device | Raspberry Pi 4 |
| Microphone | USB cardioid microphone |
| Button | GPIO momentary switch with internal pull-up |
| LED | GPIO indicator — off (idle), solid (recording), blinking (uploading) |
Button held → recording starts (LED solid)
Button release → recording stops
→ audio chunked + sent to /process endpoint
→ Whisper transcription + LLM task extraction
→ saved to dashboard database
→ LED off
Audio is written in chunks to handle long recordings without memory issues. Recovery scripts handle failed uploads and split oversized recordings into smaller segments automatically.
A custom FastAPI + LlamaIndex service handles document Q&A across multiple independent collections. Query your own documents in natural language — manuals, PDFs, notes, research — with semantic search and LLM synthesis.
| Framework | FastAPI + LlamaIndex |
| Embeddings | nomic-embed-text-v1.5 via local embedding server |
| LLM | Gemma 4 via local llama.cpp |
| Collections | Multiple independent indexes — one per document set |
| Chunk size | 512 tokens with 50-token overlap |
| Top-k | 3 most relevant chunks per query |
POST /upload/{collection} # Upload documents
POST /query # Natural language query
GET /indexes # List collections
POST /rebuild/{collection} # Rebuild after adding files
GET /health # Health check
A self-hosted SearXNG instance aggregates results from DuckDuckGo, Brave, Startpage, and Wikipedia. No tracking. Used by both the voice assistant and Open WebUI. Exposes a simple JSON API.
A dedicated Proxmox VM runs supporting services — workflow automation, document RAG, and container management — isolated from the primary inference server.
Web-based document RAG with workspace management. Connects to the main LLM and embedding server.
Visual workflow automation. Running — workflows not yet configured.
Docker container management UI. Check container status without SSH.
General purpose assistant bot. Skills include weather queries, service health checks, terminal session management, and task workflows. Full 65K context via Gemma 4.
Self-improving memory agent. Builds and refines a persistent memory store across conversations — designed for long-term context retention across sessions. Routes long context tasks to the AMD workstation.
Internal ports are never exposed directly. All external traffic enters through Nginx Proxy Manager, which handles SSL termination for every domain. Nginx on the primary server sits behind NPM and handles internal routing and API key enforcement.
Internet → NPM (SSL termination, all domains)
│
├── yourdomain.com → server :80 → Nginx → static / Open WebUI
├── api.yourdomain.com → server :8002 → Nginx (API key check) → LLM
└── voice.yourdomain.com → server :8005 → voice backend
Nginx Proxy Manager handles all SSL certificates via Let's Encrypt and routes every public domain. The GUI makes adding proxy hosts and managing certs straightforward. All subdomains managed here — no separate Certbot needed.
# Extended timeouts required for voice endpoint (long audio uploads) proxy_read_timeout 3600; proxy_connect_timeout 3600; proxy_send_timeout 3600; # Without these: 504 Gateway Timeout on recordings over 60 seconds
Nginx sits behind NPM on the server, handling API key validation before proxying requests to llama.cpp. NPM forwards requests; Nginx checks the key. Ensure NPM passes custom headers through unchanged.
# nginx.conf — API key enforcement via map directive
map $http_x_api_key $api_key_valid {
default 0;
"your-secret-key" 1;
}
# In server block for LLM port:
if ($api_key_valid = 0) { return 401; }
# Test the full chain:
curl -v https://api.yourdomain.com/v1/models \
-H "X-API-Key: your-key" 2>&1 | grep -E "HTTP|401|200"
BorgBackup provides local backups with deduplication, AES-256 encryption, and mountable archives.
sudo borg init --encryption=repokey /path/to/repo sudo borg create /path/to/repo::$(date +%Y-%m-%d) / sudo borg list /path/to/repo sudo borg mount /path/to/repo::archive-name /mnt/restore
| Always back up | Service configs, app data, API keys, Docker volumes, RAG indexes, voice scripts, databases |
| Skip | Model files (10–70GB each) — re-downloadable from Hugging Face if needed |
┌─────────────┐ ┌──────────────┐ ┌──────────────┐
│ Internet │ │ Kitchen Pi │ │ Recorder Pi │
└──────┬──────┘ │ Hey Helix │ │ GPIO button │
│ │ TTS · Timers│ │ audio memo │
│ └──────┬───────┘ └──────┬───────┘
│ │ │
└──────────┬────────┘ │
└───────────────┬────────────┘
│
┌────────────────────▼────────────────────┐
│ Reverse Proxy │
│ NPM — SSL termination │
│ Nginx — API key auth │
└──┬──────────────┬─────────────┬──────────┘
│ │ │
┌──────────▼──┐ ┌────▼───┐ ┌───▼─────────────┐ ┌──────────────────────────────┐
│ Open WebUI │ │ LLM │ │ Voice Backend │ │ Automation VM │
│ Web·Search │ │ API │ │ Whisper · STT │ │ OpenClaw · Hermes agent │
└─────────────┘ └────┬───┘ │ Dashboard │ │ AnythingLLM · n8n │
│ └────────┬─────────┘ └──────────────┬───────────────┘
│ │ │
└─────────────────┼─────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────┐
│ Inference Layer │
│ │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ GPU Server │ │ AMD Workstation │ │
│ │ Gemma 4 26B │ │ Qwen 3.5 122B │ │
│ │ 65K context │ │ 128K context │ │
│ │ fast chat · voice │ │ long context │ │
│ │ embeddings · RAG │ │ agents · reasoning │ │
│ └────────────┬────────────┘ └─────────────────────────┘ │
└────────────────┼───────────────────────────────────────────────┘
│
┌────────────┴────────────┐
│ │
┌────▼─────┐ ┌─────▼───┐
│Embeddings│ │ RAG API │
│ nomic │ │ Docs │
└──────────┘ └─────────┘
This stack has run five different main models in under a year. Keep old ExecStart lines commented in service files. Keep backup models on disk. A model swap should take minutes. The architecture outlasts any particular model.
Thinking mode produces better output but takes 3–10x longer. Route time, weather, and timer queries through fast paths that skip the LLM entirely. Reserve extended reasoning for queries that actually benefit from it.
When GPU services start simultaneously at boot, they race for hardware initialization. Use After= dependencies and ExecStartPre=/bin/sleep N to sequence them. This applies to both CUDA and ROCm systems.
Linux assigns USB audio device numbers dynamically. Configure ALSA by device name, not number. One line in ~/.asoundrc eliminates an entire category of reliability issues on embedded voice hardware.
A GPIO button for voice recording removes all friction from the capture workflow. The constraint of hardware forces thinking about the actual user experience. Software always has one more step; hardware doesn't.
On 24GB, the difference between q4_0 and q8_0 KV cache is roughly 8K vs 65K usable context for a 26B model. Combined with flash attention, quantized KV cache is what makes large context windows practical on consumer hardware.
Running a 122B model locally on a single consumer device changes what's possible. Token generation is slower than a discrete GPU on smaller models, but the capability tier — and the context window — is completely different.