Complete Web Interface, API Access, and Document Intelligence
A comprehensive guide to building a powerful self-hosted AI server with web-based chat interface, programmatic API access, and advanced document Q&A capabilities. This setup provides privacy-focused, high-performance AI without cloud dependencies.
Last Updated: February 2026
Updated for modern LLMs, improved RAG, and current best practices
The server is built for high-performance AI workloads with balanced component selection:
Ubuntu Server 22.04 LTS provides a stable, well-supported foundation with long-term security updates.
Bulk storage for models, application data, and backups is mounted at /data.
llama.cpp provides efficient LLM inference with GPU acceleration.
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release
Recommended models for 24GB VRAM include Qwen3-30B-A3B, the model used in the reference architecture below.
sudo nano /etc/systemd/system/llama-server.service
[Unit]
Description=Llama.cpp API Server
After=network.target
[Service]
Type=simple
User=youruser
Group=youruser
WorkingDirectory=/home/youruser/llama.cpp/build
ExecStart=/home/youruser/llama.cpp/build/bin/llama-server \
--model /path/to/model.gguf \
--port 8001 \
--host 0.0.0.0 \
--ctx-size 8192 \
--n-gpu-layers 999 \
--threads 8 \
--parallel 2
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server
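Once the service is up, a quick smoke test confirms the API responds. llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint; the sketch below assumes it is reachable on localhost:8001 as configured in the unit file above.

```python
# Minimal smoke test for the llama-server chat endpoint.
# Assumes the systemd service above is running on port 8001.
import json
import urllib.request

BASE_URL = "http://localhost:8001"  # adjust to your server's address

def build_chat_payload(prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

def chat(prompt: str) -> str:
    """Send a prompt and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (with the server running):
# print(chat("Say hello in one sentence."))
```

The same request shape works from any OpenAI-compatible client library pointed at this base URL.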
Embeddings convert text to vectors for semantic search and RAG applications.
sudo nano /etc/systemd/system/llama-embedding-server.service
[Unit]
Description=Llama.cpp Embedding Server
After=network.target
[Service]
Type=simple
User=youruser
Group=youruser
WorkingDirectory=/home/youruser/llama.cpp/build
ExecStart=/home/youruser/llama.cpp/build/bin/llama-server \
--model /path/to/embedding-model.gguf \
--port 8003 \
--host 0.0.0.0 \
--embedding \
--n-gpu-layers 999
Restart=always
[Install]
WantedBy=multi-user.target
Recommended embedding models:
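With the embedding service running, semantic similarity can be computed client-side. The sketch below assumes the OpenAI-compatible /v1/embeddings route on port 8003, as configured in the unit file above, and compares two texts by cosine similarity.

```python
# Fetch embeddings from the llama-server instance on port 8003 and
# compare two texts by cosine similarity.
import json
import math
import urllib.request

BASE_URL = "http://localhost:8003"  # adjust to your server's address

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def embed(text: str) -> list[float]:
    """Request an embedding via the OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/embeddings",
        data=json.dumps({"input": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"][0]["embedding"]

# Usage (with the server running):
# sim = cosine_similarity(embed("backup strategy"), embed("disaster recovery"))
```

This is the same similarity measure the RAG service uses internally to rank document chunks against a query.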
Open WebUI provides a modern web interface for LLM interaction, with built-in chat history, user accounts, and document upload.
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "8080:8080"
    volumes:
      - /data/open-webui:/app/backend/data
      - /var/run/docker.sock:/var/run/docker.sock
    restart: always
docker compose up -d
Custom FastAPI service for programmatic document Q&A with multiple collections.
POST /upload/{index_name} - Upload documents
POST /query - Query documents
GET /indexes - List all indexes
POST /rebuild/{index_name} - Rebuild index
DELETE /index/{index_name} - Delete index
GET /health - Health check
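A minimal client for these routes might look like the following. The paths match the list above, but the JSON field names (index_name, question) are assumptions; adapt them to your FastAPI implementation.

```python
# Thin client sketch for the RAG API routes listed above.
import json
import urllib.request

BASE_URL = "http://localhost:8004"  # RAG API port from the architecture diagram

def endpoint(path: str) -> str:
    """Join the base URL with an API path."""
    return f"{BASE_URL}/{path.lstrip('/')}"

def list_indexes() -> dict:
    """GET /indexes - list all document collections."""
    with urllib.request.urlopen(endpoint("/indexes")) as resp:
        return json.load(resp)

def query(index_name: str, question: str) -> dict:
    """POST /query - ask a question against a named index.
    Field names here are placeholders for your own schema."""
    body = json.dumps({"index_name": index_name, "question": question}).encode()
    req = urllib.request.Request(
        endpoint("/query"),
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (with the service running):
# print(list_indexes())
# print(query("manuals", "How do I reset the device?"))
```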
Nginx provides reverse proxy, SSL/TLS termination, and API key authentication.
sudo apt install certbot python3-certbot-nginx
sudo certbot --nginx -d yourdomain.com
Note: Certbot automatically renews certificates. Configure hardware firewall to allow only necessary ports.
BorgBackup provides efficient, encrypted backups with deduplication.
# Initialize repository
sudo borg init --encryption=repokey /data/backup_repo
# Create backup
sudo borg create --exclude /data/backup_repo --exclude /proc --exclude /sys --exclude /dev /data/backup_repo::archive_name /
# List backups
sudo borg list /data/backup_repo/
# Mount for browsing
sudo borg mount /data/backup_repo::archive_name /mnt/restore
⚠️ Best Practice:
Add offsite backup via rclone to cloud storage (Backblaze B2, Wasabi) for disaster recovery.
┌─────────────────────────────────────────────────────┐
│                      Internet                       │
└────────────────────┬────────────────────────────────┘
                     │
        ┌────────────▼────────────┐
        │   Nginx Reverse Proxy   │
        │   (SSL/TLS, API Auth)   │
        └────────────┬────────────┘
                     │
        ┌────────────┴────────────┐
        │                         │
  ┌─────▼──────┐           ┌──────▼──────┐
  │ Open WebUI │           │   LLM API   │
  │ (Port 8080)│           │ (Port 8002) │
  └─────┬──────┘           └──────┬──────┘
        │                         │
        └────────────┬────────────┘
                     │
        ┌────────────▼────────────┐
        │    llama.cpp Server     │
        │     (Qwen3-30B-A3B)     │
        │        Port 8001        │
        └────────────┬────────────┘
                     │
        ┌────────────▼────────────┐
        │   NVIDIA RTX 3090 Ti    │
        │        24GB VRAM        │
        └─────────────────────────┘

        ┌─────────────────────────┐
        │         RAG API         │
        │     (Document Q&A)      │
        │        Port 8004        │
        └──────┬──────────────────┘
               │
        ┌──────▼──────┐
        │ Embeddings  │
        │  Port 8003  │
        └─────────────┘
Web-based interface for general AI conversations, writing assistance, and research.
Query personal documents, game guides, technical manuals, research papers via semantic search.
Connect external applications (VSCode, Obsidian, custom scripts) to local AI models.
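As an example of wiring an external script to the stack, the sketch below sends an authenticated chat request through the Nginx proxy. The X-API-Key header name is an assumption; use whichever header your Nginx auth configuration checks.

```python
# Authenticated request from an external application through the
# Nginx reverse proxy. Header name and domain are placeholders.
import json
import urllib.request

def make_headers(api_key: str) -> dict:
    """Headers for an authenticated JSON request through the proxy."""
    return {"Content-Type": "application/json", "X-API-Key": api_key}

def ask(base_url: str, api_key: str, prompt: str) -> str:
    """Send a chat completion request and return the reply text."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers=make_headers(api_key),
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (with the proxy configured):
# print(ask("https://yourdomain.com", "your-api-key", "Summarize this note."))
```

Editor plugins and custom scripts can reuse this pattern anywhere an OpenAI-style endpoint plus an auth header is accepted.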
Home Assistant + Raspberry Pi nodes for self-hosted Alexa/Google Home alternative.
All processing local - no data sent to cloud providers.
Modern models support optional "thinking mode" for complex reasoning. Disable for faster responses in casual chat (3-5x speedup).
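For Qwen3-family models, thinking can be toggled per message with a soft switch appended to the prompt; other model families may use a different mechanism, so treat this as a sketch.

```python
# Per-request thinking-mode toggle using the Qwen3 /no_think soft switch.
def with_thinking(prompt: str, think: bool) -> str:
    """Append the soft switch that disables chain-of-thought reasoning."""
    return prompt if think else f"{prompt} /no_think"

# Usage: send with_thinking("Plan a menu", think=False) for a fast casual
# reply, or think=True for harder reasoning tasks.
```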
Router mode allows multiple models on disk, auto-loading on demand. Only one large model fits in 24GB VRAM at a time.
Keep the system patched with regular package updates:

sudo apt update && sudo apt upgrade

This self-hosted AI server provides enterprise-grade capabilities with complete privacy and control. The modular architecture allows easy upgrades and expansion as technology evolves.
Whether for personal use, small team collaboration, or development experimentation, this setup delivers powerful AI capabilities without cloud dependencies or ongoing costs.
Note:
The AI landscape evolves rapidly. Model recommendations and software versions should be verified against current releases. The architecture and principles remain consistent across updates.