self-hosted AI infrastructure

Building a
Self-Hosted AI Ecosystem

From a single GPU server to a multi-device stack: local LLMs, voice assistants, document intelligence, a high-memory workstation, and automation — all on your own hardware, no cloud required.

Updated May 2026
Status Production
Primary Gemma 4 26B
Workstation Qwen 3.5 122B
// 01

Why Self-Host in 2026

Cloud AI is convenient. It's also opaque about data handling, expensive at heavy usage, and entirely dependent on someone else's infrastructure and policies. A well-built local stack gives you something different: models that run on your hardware, answer to no one, and keep every conversation off someone else's servers.

The economics shifted in 2025–2026. Quantized open-weight models running on consumer hardware now match or exceed GPT-4-class quality on most everyday tasks. The hardware is widely available. The tooling — llama.cpp, Open WebUI, Whisper, Piper — is mature and actively developed.

Privacy by architecture

Every token stays on your hardware. No telemetry, no training on your data, no Terms of Service changes that affect what you can ask.

No ongoing costs

Hardware pays for itself at heavy usage. No per-token fees, no subscription tiers, no bills that scale with usage.

Full control

Swap models freely. Tune context windows. Run specialized variants. Integrate with anything via OpenAI-compatible APIs.

Offline capable

Everything works on your local network. No internet dependency for inference — better reliability and lower latency.

// 02

Ecosystem Overview

This isn't a single-machine setup — it's a distributed stack that grew to cover different use cases. Each piece solves a specific problem and they work together through standard HTTP APIs.

GPU Server
NVIDIA · Gemma 4 26B · 65K ctx
AMD Workstation
128GB unified · Qwen 122B · 128K ctx
Open WebUI
Web chat · model switching · search
Voice Backend
Whisper · task extraction · dashboard
Voice Assistant
Pi 4 · wake word · timers · TTS
Hardware Recorder
Pi 4 · GPIO button · transcribe
RAG API
Document Q&A · multiple collections
Private Search
SearXNG · self-hosted · no tracking
Telegram Bots
Two bots · memory · skills
n8n Automation
Installed · workflows pending
// 03

Primary GPU Server

The workhorse of the stack. A desktop machine with a 24GB NVIDIA GPU running llama.cpp as a system service. It handles primary chat inference, the voice backend, embeddings, and document Q&A simultaneously.

// Hardware

CPU16-core desktop processor (AM5 platform)
GPUNVIDIA RTX 3090 Ti, 24GB GDDR6X
RAM64GB DDR5-6000
StorageTwo NVMe SSDs — OS on one, data and models on the other
OSUbuntu Server 22.04 LTS (headless)

// LLM inference — llama.cpp

llama.cpp runs as a systemd service, exposing an OpenAI-compatible HTTP API internally and (via reverse proxy) externally with API key authentication.

cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j$(nproc)

# Key flags for 24GB VRAM + 26B MoE model
llama-server \
  --model model.gguf \
  --port 8001 \
  --n-gpu-layers 999 \
  --ctx-size 65536 \
  --jinja \
  -fa on \
  -ctk q8_0 -ctv q8_0 \
  --metrics

// Current model: Gemma 4 26B

ArchitectureGoogle Gemma 4, MoE — 26B total parameters, 4B active per token
Context65,536 tokens (~50,000 words in a single conversation)
KV cacheq8_0 quantization — higher quality than q4_0, enables the large context window
Flash attentionEnabled — reduces memory pressure on long contexts
Thinking modeOn in web chat, off in voice/task extraction paths (3–10x speed difference)

// VRAM management on 24GB

With a 26B model, Whisper, and an embedding server all running, the 24GB card stays near capacity. Key strategies that make it work:

// Model recommendations (2026, 24GB VRAM)

Best overall current

Gemma 4 26B-A4B — MoE, excellent reasoning, 65K+ context

Fastest responses

GLM-4.7-Flash or Mistral Small 3 24B — ~100–120 tok/s, ideal for real-time voice

Best for code

Qwen2.5-Coder 32B — specialized training, strong completions and refactoring

Strong reasoning

Qwen3-30B-A3B with thinking mode — MoE efficiency with extended reasoning chains

Quantization: Q4_K_M and Q4_K_L offer the best quality-to-size ratio. For MoE models, Unsloth Dynamic quants apply different quantization levels per parameter type — worth seeking out for the same file size at better quality.
// 04

AMD Unified Memory Workstation

A fundamentally different approach. The AMD Ryzen AI Max+ integrates an iGPU that can access the entire system memory pool — 128GB — eliminating the VRAM ceiling that constrains discrete GPU setups.

// What this enables

Models that would require a multi-GPU server run on a single consumer device. Qwen 3.5 122B at Q4_K_M (~70GB) loads with room to spare. 128K context windows are practical. At idle, the whole system draws about 5W.

ChipAMD Ryzen AI Max+ 395 — Zen 5 CPU cores, RDNA 3.5 iGPU, XDNA2 NPU
Memory128GB LPDDR5X-8000 unified — CPU and GPU share one pool
GPU access~118GB available to the iGPU via GTT (after minimal UMA frame buffer)
OSUbuntu 26.04 LTS Server
AI stackROCm 7.x via apt, llama.cpp + whisper.cpp with HIP support

// ROCm setup

# Install ROCm — no DKMS needed on Ubuntu 26.04
sudo apt install rocm

# Build llama.cpp with HIP backend
cmake .. \
  -DGGML_HIP=ON \
  -DGGML_RPC=ON \
  -DAMDGPU_TARGETS=gfx1151
cmake --build . -j$(nproc)

# Required in systemd service file:
Environment="HSA_OVERRIDE_GFX_VERSION=11.5.1"

// Performance

Idle power~4–5W — efficient enough for always-on deployment
Load power~120W during active inference
Prompt speed~67 tok/s (fills 128K context quickly)
Gen speed~10–20 tok/s — fine for async tasks, slower for chat
Best use casesLong document analysis, deep reasoning, tasks needing massive context

// Whisper on AMD — boot sequencing fix

When the LLM and Whisper services start simultaneously at boot, they race for ROCm initialization and Whisper loses, falling back to CPU. Fix by sequencing the services:

# whisper-server.service [Unit] section:
After=network.target llama-server.service

# [Service] section:
ExecStartPre=/bin/sleep 15
# Without this, whisper runs on CPU after every reboot.
AMD quirks: HSA_OVERRIDE_GFX_VERSION is required for gfx1151. Power monitoring via amd-smi returns N/A — use sensors | grep PPT. TDP is set via the front panel button, not software.
Complementary pair: The GPU server handles fast chat and real-time voice. The AMD workstation handles models too large for 24GB VRAM and tasks benefiting from massive context. Different tiers, different strengths.
// 05

Web Interface — Open WebUI

Open WebUI is the primary chat interface. It runs in Docker, connects to llama.cpp's OpenAI-compatible API, and handles multiple users with separate conversation histories.

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "8080:8080"
    volumes:
      - /data/open-webui:/app/backend/data
    restart: always

docker compose up -d
# Update: docker compose pull && docker compose down && docker compose up -d

// Key features in use

// 06

Voice Backend — Transcription & Task Extraction

A custom FastAPI service wrapping Whisper and the LLM into a multi-purpose voice processing API. One service handles transcription, AI task extraction from voice memos, the voice assistant endpoint, and a web dashboard.

// Endpoints

/process recorder

Audio → Whisper transcription → LLM extracts tasks with owners and deadlines → saved to SQLite → structured JSON response.

/transcribe utility

Transcription only. Returns transcript, language, and segments. No LLM call, no DB write.

/assistant Helix

Voice command handler for the kitchen assistant. Classifies intent and routes to the right handler — often without any LLM call.

/dashboard web UI

Password-protected interface showing all recordings, transcripts, extracted tasks, and audio stats. Web recorder built in.

// Intent routing — avoiding unnecessary LLM calls

Time / dateSystem clock directly. ~0ms. No LLM.
WeatherSearXNG web search. No LLM.
TimersJSON response only — managed on Pi. No LLM.
Time-sensitiveSearXNG + current date appended.
General queriesFull LLM inference. ~7s total with Whisper.

// Performance

Whisper~1 second (GPU, large-v3-turbo)
Task extraction~6 seconds (thinking mode disabled)
Full /process~7 seconds end-to-end
Search query~10–15 seconds (includes web search)

Whisper hallucination filtering rejects transcripts with too few ASCII alpha characters before they reach the LLM — preventing garbage responses from silence or non-speech audio.

Supported audio formats: .wav .mp3 .m4a .ogg .flac .webm

// 07

Kitchen Voice Assistant — "Hey Helix"

A Raspberry Pi 4 on the kitchen counter running a fully local voice assistant. No Alexa. No Google Home. Wake word detection, speech recognition, AI responses, text-to-speech, and timer management — all self-hosted.

// Hardware

DeviceRaspberry Pi 4 (4GB RAM)
AudioUSB speakerphone — mic and speaker in one unit, single cable
ProcessingWake word and TTS run on Pi CPU; heavy inference offloaded to the server

// Software stack

Wake word

Porcupine v4 with a custom "Hey Helix" model. Runs on Pi CPU continuously at negligible power draw.

Transcription

Audio sent to server Whisper endpoint over HTTPS. ~1 second round-trip.

TTS — Piper

Voice model loaded once at startup. Raw PCM streamed directly to speaker — near-instant playback, no cold-start.

// Timer system

Timers are managed entirely on the Pi without server round-trips:

// ALSA — name-based config is critical

Linux assigns audio device numbers dynamically at boot. A USB speakerphone that's card 2 today may be card 3 after adding a different USB device. Configure by device name, not number, for boot stability:

# ~/.asoundrc — use name, not card number
pcm.!default {
    type hw
    card Plus    # stable across reboots
    device 0
}
Boot sequencing: The assistant service uses ExecStartPre=/bin/sleep 10 to let USB audio enumerate fully before the Python script opens it. Without this, the service frequently fails on boot.
// 08

Hardware Voice Recorder

A second Raspberry Pi built as a dedicated voice recorder for capturing thoughts, tasks, and meeting notes. Press a button → record → release → automatically transcribed and task-extracted within seconds.

// Hardware

DeviceRaspberry Pi 4
MicrophoneUSB cardioid microphone
ButtonGPIO momentary switch with internal pull-up
LEDGPIO indicator — off (idle), solid (recording), blinking (uploading)

// Flow

Button held   → recording starts (LED solid)
Button release → recording stops
               → audio chunked + sent to /process endpoint
               → Whisper transcription + LLM task extraction
               → saved to dashboard database
               → LED off

Audio is written in chunks to handle long recordings without memory issues. Recovery scripts handle failed uploads and split oversized recordings into smaller segments automatically.

Why hardware? A physical button has zero friction. When a thought needs capturing, unlocking a phone and opening an app is enough delay to lose it. The button is always there, always ready.
// 09

Document Intelligence — RAG & Private Search

A custom FastAPI + LlamaIndex service handles document Q&A across multiple independent collections. Query your own documents in natural language — manuals, PDFs, notes, research — with semantic search and LLM synthesis.

// RAG API

FrameworkFastAPI + LlamaIndex
Embeddingsnomic-embed-text-v1.5 via local embedding server
LLMGemma 4 via local llama.cpp
CollectionsMultiple independent indexes — one per document set
Chunk size512 tokens with 50-token overlap
Top-k3 most relevant chunks per query
POST /upload/{collection}  # Upload documents
POST /query                # Natural language query
GET  /indexes              # List collections
POST /rebuild/{collection} # Rebuild after adding files
GET  /health               # Health check

// Private web search — SearXNG

A self-hosted SearXNG instance aggregates results from DuckDuckGo, Brave, Startpage, and Wikipedia. No tracking. Used by both the voice assistant and Open WebUI. Exposes a simple JSON API.

// 10

Automation, Bots & Workflow

A dedicated Proxmox VM runs supporting services — workflow automation, document RAG, and container management — isolated from the primary inference server.

AnythingLLM RAG

Web-based document RAG with workspace management. Connects to the main LLM and embedding server.

n8n automation

Visual workflow automation. Running — workflows not yet configured.

Portainer ops

Docker container management UI. Check container status without SSH.

// Telegram bots

OpenClaw running

General purpose assistant bot. Skills include weather queries, service health checks, terminal session management, and task workflows. Full 65K context via Gemma 4.

Hermes running

Self-improving memory agent. Builds and refines a persistent memory store across conversations — designed for long-term context retention across sessions. Routes long context tasks to the AMD workstation.

// 11

Networking & Reverse Proxy

Internal ports are never exposed directly. All external traffic enters through Nginx Proxy Manager, which handles SSL termination for every domain. Nginx on the primary server sits behind NPM and handles internal routing and API key enforcement.

// Traffic flow

Internet → NPM (SSL termination, all domains)
         │
         ├── yourdomain.com      → server :80   → Nginx → static / Open WebUI
         ├── api.yourdomain.com  → server :8002  → Nginx (API key check) → LLM
         └── voice.yourdomain.com → server :8005 → voice backend

// NPM — the entry point

Nginx Proxy Manager handles all SSL certificates via Let's Encrypt and routes every public domain. The GUI makes adding proxy hosts and managing certs straightforward. All subdomains managed here — no separate Certbot needed.

# Extended timeouts required for voice endpoint (long audio uploads)
proxy_read_timeout 3600;
proxy_connect_timeout 3600;
proxy_send_timeout 3600;
# Without these: 504 Gateway Timeout on recordings over 60 seconds

// Nginx — API key enforcement

Nginx sits behind NPM on the server, handling API key validation before proxying requests to llama.cpp. NPM forwards requests; Nginx checks the key. Ensure NPM passes custom headers through unchanged.

# nginx.conf — API key enforcement via map directive
map $http_x_api_key $api_key_valid {
    default 0;
    "your-secret-key" 1;
}

# In server block for LLM port:
if ($api_key_valid = 0) { return 401; }

# Test the full chain:
curl -v https://api.yourdomain.com/v1/models \
  -H "X-API-Key: your-key" 2>&1 | grep -E "HTTP|401|200"
Header passthrough: If the API returns 401 with a valid key, NPM may be stripping custom headers. Fix in the proxy host Advanced tab in NPM.
// 12

Security Practices

Key rotation: API keys live across multiple devices — servers, Pi devices, desktop clients. Document every location so rotation doesn't miss one.
// 13

Backup Strategy

BorgBackup provides local backups with deduplication, AES-256 encryption, and mountable archives.

sudo borg init --encryption=repokey /path/to/repo
sudo borg create /path/to/repo::$(date +%Y-%m-%d) /
sudo borg list /path/to/repo
sudo borg mount /path/to/repo::archive-name /mnt/restore
Always back upService configs, app data, API keys, Docker volumes, RAG indexes, voice scripts, databases
SkipModel files (10–70GB each) — re-downloadable from Hugging Face if needed
Local-only backup is a single point of failure. Add offsite replication via rclone to Backblaze B2 or Wasabi. Hardware can be replaced; data and configuration cannot.
// 14

System Architecture

  ┌─────────────┐    ┌──────────────┐    ┌──────────────┐
  │   Internet  │    │  Kitchen Pi  │    │  Recorder Pi │
  └──────┬──────┘    │  Hey Helix   │    │  GPIO button │
         │           │  TTS · Timers│    │  audio memo  │
         │           └──────┬───────┘    └──────┬───────┘
         │                  │                   │
         └──────────┬────────┘                   │
                    └───────────────┬────────────┘
                                    │
               ┌────────────────────▼────────────────────┐
               │              Reverse Proxy               │
               │       NPM — SSL termination              │
               │       Nginx — API key auth               │
               └──┬──────────────┬─────────────┬──────────┘
                  │              │             │
       ┌──────────▼──┐      ┌────▼───┐    ┌───▼─────────────┐    ┌──────────────────────────────┐
       │  Open WebUI │      │  LLM   │    │  Voice Backend  │    │       Automation VM          │
       │  Web·Search │      │  API   │    │  Whisper · STT  │    │  OpenClaw · Hermes agent     │
       └─────────────┘      └────┬───┘    │  Dashboard      │    │  AnythingLLM · n8n           │
                                 │        └────────┬─────────┘    └──────────────┬───────────────┘
                                 │                 │                             │
                                 └─────────────────┼─────────────────────────────┘
                                                   │
                                                   ▼
               ┌───────────────────────────────────────────────────────────────┐
               │                       Inference Layer                         │
               │                                                               │
               │   ┌─────────────────────────┐   ┌─────────────────────────┐  │
               │   │       GPU Server        │   │    AMD Workstation      │  │
               │   │   Gemma 4 26B           │   │   Qwen 3.5 122B         │  │
               │   │   65K context           │   │   128K context          │  │
               │   │   fast chat · voice     │   │   long context          │  │
               │   │   embeddings · RAG      │   │   agents · reasoning    │  │
               │   └────────────┬────────────┘   └─────────────────────────┘  │
               └────────────────┼───────────────────────────────────────────────┘
                                │
                   ┌────────────┴────────────┐
                   │                         │
              ┌────▼─────┐             ┌─────▼───┐
              │Embeddings│             │ RAG API │
              │  nomic   │             │  Docs   │
              └──────────┘             └─────────┘
      
// 15

Lessons Learned

Build for model swappability from the start

This stack has run five different main models in under a year. Keep old ExecStart lines commented in service files. Keep backup models on disk. A model swap should take minutes. The architecture outlasts any particular model.

Separate fast paths from slow paths

Thinking mode produces better output but takes 3–10x longer. Route time, weather, and timer queries through fast paths that skip the LLM entirely. Reserve extended reasoning for queries that actually benefit from it.

Service startup ordering matters

When GPU services start simultaneously at boot, they race for hardware initialization. Use After= dependencies and ExecStartPre=/bin/sleep N to sequence them. This applies to both CUDA and ROCm systems.

Name-based audio config prevents boot instability

Linux assigns USB audio device numbers dynamically. Configure ALSA by device name, not number. One line in ~/.asoundrc eliminates an entire category of reliability issues on embedded voice hardware.

Hardware interaction reduces friction meaningfully

A GPIO button for voice recording removes all friction from the capture workflow. The constraint of hardware forces thinking about the actual user experience. Software always has one more step; hardware doesn't.

KV cache quantization unlocks context

On 24GB, the difference between q4_0 and q8_0 KV cache is roughly 8K vs 65K usable context for a 26B model. Combined with flash attention, quantized KV cache is what makes large context windows practical on consumer hardware.

AMD unified memory is a different category

Running a 122B model locally on a single consumer device changes what's possible. Token generation is slower than a discrete GPU on smaller models, but the capability tier — and the context window — is completely different.