Building a Self-Hosted AI Server

Complete Web Interface, API Access, and Document Intelligence

A comprehensive guide to building a powerful self-hosted AI server with web-based chat interface, programmatic API access, and advanced document Q&A capabilities. This setup provides privacy-focused, high-performance AI without cloud dependencies.

Last Updated: February 2026

Updated for modern LLMs, improved RAG, and current best practices

1. Hardware Selection

The server is built for high-performance AI workloads with balanced component selection:

2. Operating System

Ubuntu Server 22.04 LTS provides a stable, well-supported foundation with long-term security updates.

Partitioning Strategy:

3. Core AI Engine: llama.cpp

llama.cpp provides efficient LLM inference with GPU acceleration.

Installation:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release

Model Selection (2026):

Recommended models for 24GB VRAM:

Running as System Service:

sudo nano /etc/systemd/system/llama-server.service

[Unit]
Description=Llama.cpp API Server
After=network.target

[Service]
Type=simple
User=youruser
Group=youruser
WorkingDirectory=/home/youruser/llama.cpp/build
ExecStart=/home/youruser/llama.cpp/build/bin/llama-server \
  --model /path/to/model.gguf \
  --port 8001 \
  --host 0.0.0.0 \
  --ctx-size 8192 \
  --n-gpu-layers 999 \
  --threads 8 \
  --parallel 2
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server
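Once the service is running, llama-server exposes an OpenAI-compatible HTTP API, so any OpenAI-style client can talk to it. A minimal standard-library sketch (host, port, and sampling parameters mirror the unit file above; adjust to your setup):

```python
import json
import urllib.request

SERVER = "http://localhost:8001"  # llama-server port from the unit file above

def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completion payload for llama-server."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

def chat(prompt: str) -> str:
    """POST the payload to the OpenAI-compatible endpoint, return the reply."""
    req = urllib.request.Request(
        f"{SERVER}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Summarize this paragraph...")` returns the assistant's reply as a plain string; the same endpoint also supports streaming responses if your client handles them.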

4. Embedding Server

Embeddings convert text to vectors for semantic search and RAG applications.
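To make the vector idea concrete: when started with `--embedding`, llama-server serves an OpenAI-style `/v1/embeddings` route, and two texts can be compared by the cosine similarity of their vectors. A sketch using only the standard library (the port matches the unit file below; the response shape follows the OpenAI convention):

```python
import json
import math
import urllib.request

EMBED_SERVER = "http://localhost:8003"  # embedding server port

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def embed(text: str) -> list[float]:
    """Fetch an embedding vector for `text` from the local server."""
    req = urllib.request.Request(
        f"{EMBED_SERVER}/v1/embeddings",
        data=json.dumps({"input": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"][0]["embedding"]
```

Semantically related texts (say, `embed("dog")` against `embed("puppy")`) score closer to 1.0 than unrelated ones, which is the property RAG retrieval relies on.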

Setup:

sudo nano /etc/systemd/system/llama-embedding-server.service

[Unit]
Description=Llama.cpp Embedding Server
After=network.target

[Service]
Type=simple
User=youruser
Group=youruser
WorkingDirectory=/home/youruser/llama.cpp/build
ExecStart=/home/youruser/llama.cpp/build/bin/llama-server \
  --model /path/to/embedding-model.gguf \
  --port 8003 \
  --host 0.0.0.0 \
  --embedding \
  --n-gpu-layers 999
Restart=always

[Install]
WantedBy=multi-user.target

Recommended embedding models:

5. Open WebUI - Web Interface

Modern web interface for LLM interaction with built-in features.

Docker Deployment:

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "8080:8080"
    volumes:
      - /data/open-webui:/app/backend/data
    restart: always

Start the container:

docker compose up -d

Features (2026):

6. RAG API - Document Intelligence

Custom FastAPI service for programmatic document Q&A with multiple collections.

Architecture:

API Endpoints:

POST   /upload/{index_name}     - Upload documents
POST   /query                   - Query documents
GET    /indexes                 - List all indexes
POST   /rebuild/{index_name}    - Rebuild index
DELETE /index/{index_name}      - Delete index
GET    /health                  - Health check
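A client for this API could look like the following sketch. The JSON field names (`index`, `question`, `top_k`) are illustrative only, since the exact request schema depends on your FastAPI implementation; the port matches the architecture diagram below.

```python
import json
import urllib.request

RAG_API = "http://localhost:8004"  # RAG API port

def build_query(index: str, question: str, top_k: int = 5) -> dict:
    """Illustrative body for POST /query; field names are assumptions."""
    return {"index": index, "question": question, "top_k": top_k}

def query_documents(index: str, question: str) -> dict:
    """Ask a question against one document collection."""
    req = urllib.request.Request(
        f"{RAG_API}/query",
        data=json.dumps(build_query(index, question)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def list_indexes() -> dict:
    """GET /indexes returns the available document collections."""
    with urllib.request.urlopen(f"{RAG_API}/indexes") as resp:
        return json.load(resp)
```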

Optimization Settings:

7. Secure Web Access with Nginx

Nginx provides reverse proxy, SSL/TLS termination, and API key authentication.

Port Configuration:

SSL/TLS:

sudo apt install certbot python3-certbot-nginx
sudo certbot --nginx -d yourdomain.com
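An illustrative server block tying the pieces together. The domain, certificate paths, upstream ports, header name (`X-API-Key`), and key value are all placeholders to adapt; in production, prefer keeping the key out of the main config file.

```nginx
server {
    listen 443 ssl;
    server_name yourdomain.com;

    # Certificates issued and renewed by certbot
    ssl_certificate     /etc/letsencrypt/live/yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/yourdomain.com/privkey.pem;

    # Web interface
    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    # LLM API, gated by a shared API key
    location /api/ {
        if ($http_x_api_key != "change-me") {
            return 401;
        }
        proxy_pass http://127.0.0.1:8001/;
        proxy_set_header Host $host;
    }
}
```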

Note: Certbot automatically renews certificates. Configure hardware firewall to allow only necessary ports.

8. Backup Strategy

BorgBackup provides efficient, encrypted backups with deduplication.

Key Features:

Basic Commands:

# Initialize repository
sudo borg init --encryption=repokey /data/backup_repo

# Create backup of the root filesystem
# (--one-file-system skips /proc, /sys, /dev, and other mounts)
sudo borg create --one-file-system /data/backup_repo::archive_name /

# List backups
sudo borg list /data/backup_repo/

# Mount for browsing
sudo borg mount /data/backup_repo::archive_name /mnt/restore

# Prune old archives on a retention schedule
sudo borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 /data/backup_repo

⚠️ Best Practice:

Add offsite backup via rclone to cloud storage (Backblaze B2, Wasabi) for disaster recovery.

System Architecture

┌─────────────────────────────────────────────────────┐
│                   Internet                          │
└────────────────────┬────────────────────────────────┘
                     │
        ┌────────────▼────────────┐
        │   Nginx Reverse Proxy   │
        │  (SSL/TLS, API Auth)    │
        └────────────┬────────────┘
                     │
        ┌────────────┴────────────┐
        │                         │
  ┌─────▼──────┐          ┌──────▼──────┐
  │ Open WebUI │          │  LLM API    │
  │ (Port 8080)│          │ (Port 8002) │
  └─────┬──────┘          └──────┬──────┘
        │                        │
        └────────┬───────────────┘
                 │
    ┌────────────▼────────────┐
    │    llama.cpp Server     │
    │   (Qwen3-30B-A3B)       │
    │      Port 8001          │
    └────────────┬────────────┘
                 │
    ┌────────────▼────────────┐
    │   NVIDIA RTX 3090 Ti    │
    │      24GB VRAM          │
    └─────────────────────────┘

    ┌─────────────────────────┐
    │      RAG API            │
    │  (Document Q&A)         │
    │    Port 8004            │
    └──────┬──────────────────┘
           │
    ┌──────▼──────┐
    │  Embeddings │
    │  Port 8003  │
    └─────────────┘
                

Use Cases & Applications

Interactive Web Chat

Web-based interface for general AI conversations, writing assistance, and research.

Document Intelligence

Query personal documents, game guides, technical manuals, research papers via semantic search.

API Integration

Connect external applications (VSCode, Obsidian, custom scripts) to local AI models.

Voice Assistant (Planned)

Home Assistant + Raspberry Pi nodes for self-hosted Alexa/Google Home alternative.

Privacy-Focused AI

All processing local - no data sent to cloud providers.

Performance Considerations

Context Window vs VRAM:

Thinking Mode:

Modern models support optional "thinking mode" for complex reasoning. Disable for faster responses in casual chat (3-5x speedup).
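How thinking is toggled varies by model family. Qwen3-family models, for example, recognize a `/no_think` soft switch in the user turn; the sketch below prepends it to the prompt (other models may use a different mechanism, so check the model card):

```python
def build_fast_chat_request(prompt: str) -> dict:
    """Chat payload that disables thinking for Qwen3-style models.

    The `/no_think` soft switch is a Qwen3 convention; other model
    families may need a different toggle.
    """
    return {
        "messages": [{"role": "user", "content": f"/no_think {prompt}"}],
        "max_tokens": 256,
    }
```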

Model Swapping:

Router mode allows multiple models on disk, auto-loading on demand. Only one large model fits in 24GB VRAM at a time.

Security Best Practices

Future Enhancements

Conclusion

This self-hosted AI server provides enterprise-grade capabilities with complete privacy and control. The modular architecture allows easy upgrades and expansion as technology evolves.

Whether for personal use, small team collaboration, or development experimentation, this setup delivers powerful AI capabilities without cloud dependencies or ongoing costs.

Note:

The AI landscape evolves rapidly. Model recommendations and software versions should be verified against current releases. The architecture and principles remain consistent across updates.