Building a Self-Hosted RAG Chatbot From Scratch
TL;DR
I built a Retrieval-Augmented Generation (RAG) chatbot that answers questions about my professional background using my portfolio website as the single source of truth. It runs entirely on my own hardware - a K3s single-node setup with an NVIDIA RTX 3060 - with no cloud AI dependencies. The stack includes vLLM for GPU-accelerated inference, LangGraph for orchestration, Qdrant for vector search, and a Flask API with OWASP-aligned security. Everything is deployed via GitLab CI/CD across three repositories, with automated content indexing, monitoring, and zero manual intervention.
Why Build Your Own Chatbot?
Every portfolio website has the same problem: visitors need to find specific information fast. “What technologies does this person know?” “Do they have Kubernetes experience?” “How do I contact them?”
Instead of hoping visitors click through the right pages, I built an AI assistant that answers these questions instantly - grounded in actual website content, not hallucinations.
But why self-hosted? Three reasons:
- Privacy - No visitor data leaves my infrastructure. No OpenAI, no Anthropic API, no third-party inference.
- Cost - After the initial hardware investment, the only ongoing cost is electricity (~17W idle, ~150W under load). No per-token pricing, no usage caps.
- Learning - Building the entire stack from GPU drivers to production deployment taught me more about AI infrastructure than any course could.
Architecture
The system runs on two VMs in separate VLANs on a Proxmox hypervisor, managed across three Git repositories:
Servers:
| Server | Network | Role |
|---|---|---|
| www-server | DMZ | Docker Compose - portfolio website, unified backend API, Cloudflare Tunnel |
| K8s node | Internal | K3s single-node - GPU inference, vector DB, embeddings, monitoring, auth |
The two networks are deliberately isolated. The chatbot API on the www-server calls AI services on the K8s node via HTTPS through ingress endpoints - the www-server has no direct cluster access.
Repositories:
| Repository | Purpose | Deploys To |
|---|---|---|
| chatbot | Backend API, frontend widget, tests | www-server (GitLab CI/CD) |
| ai-infra | Kubernetes manifests, Langflow flows, deploy scripts | K8s node (kubectl apply) |
| pichler-portfolio | Hugo website, Nginx config, Docker Compose | www-server (Hugo build + Docker image) |
Each repository has its own CI/CD pipeline. When the portfolio is updated, it automatically triggers a re-scrape of the website content into the vector database - the chatbot always has up-to-date information without manual intervention.
The AI Stack
LLM: vLLM + Gemma 3 4B IT
The language model is Google’s Gemma 3 4B IT (instruction-tuned), served by vLLM - a high-throughput inference engine that provides an OpenAI-compatible /v1/chat/completions endpoint.
Why Gemma 3 4B? It’s the largest model that fits in 12 GB VRAM (RTX 3060) while maintaining good response quality for a focused domain. I tested the 12B variant, but it triggered out-of-memory errors on this GPU.
Real performance numbers from the production API: ~113ms time-to-first-token on the RTX 3060. At idle the GPU draws ~17W; under inference load it peaks at ~150W.
Automatic Failover: If the local vLLM instance becomes unavailable, the system automatically fails over to OpenRouter (GPT-4o-mini) as a cloud backup. Health checks run with configurable timeouts, and the failover state is managed thread-safely with a cooldown before attempting recovery. Every failover event and recovery triggers a push notification via ntfy.
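The failover logic can be sketched in a few lines. This is an illustrative reconstruction, not the production code - the class name, provider labels, and injectable clock are assumptions made for testability:

```python
import threading
import time

class FailoverManager:
    """Sketch of a primary/backup switch with cooldown (illustrative names).

    When the primary health check fails, traffic moves to the backup provider.
    Recovery back to primary is only attempted after a cooldown has elapsed.
    """

    def __init__(self, health_check, cooldown_s=60.0, clock=time.monotonic):
        self._health_check = health_check   # callable: True if local vLLM is up
        self._cooldown_s = cooldown_s
        self._clock = clock                 # injectable for testing
        self._lock = threading.Lock()       # thread-safe failover state
        self._failed_at = None              # None means primary is active

    def active_provider(self):
        with self._lock:
            if self._failed_at is None:
                if self._health_check():
                    return "local-vllm"
                self._failed_at = self._clock()   # flip to backup, start cooldown
                return "openrouter"
            # Only probe the primary again once the cooldown has passed.
            if self._clock() - self._failed_at >= self._cooldown_s and self._health_check():
                self._failed_at = None            # primary recovered
                return "local-vllm"
            return "openrouter"
```

In the real system, the state transitions in both directions would additionally fire the ntfy notification described above.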
Embeddings: TEI + Jina Embeddings DE v2
Hugging Face Text Embeddings Inference (TEI) serves the jinaai/jina-embeddings-v2-base-de model, producing 768-dimensional vectors. It handles both the indexing pipeline (when content is scraped) and query-time embedding generation (when a user asks a question).
Vector Database: Qdrant
Qdrant stores the embedded website content. The collection uses 768 dimensions with Cosine similarity. When a user asks a question, the chatbot embeds the query via TEI, searches Qdrant for the most relevant content chunks, and feeds them to the LLM as context.
The collection currently holds 45 content chunks - automatically scraped from 9 pages including the portfolio website and blog posts, each with metadata like source URLs and section titles for grounding. Adaptive chunking splits content at semantic boundaries (h2/h3 sections) rather than fixed character counts, producing fewer but more coherent chunks.
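The adaptive-chunking idea - split at heading boundaries, not character counts - can be sketched like this. The production scraper works on rendered HTML; this simplified version operates on markdown, and the `max_chars` fallback value is an assumption:

```python
import re

def chunk_by_sections(markdown_text, max_chars=1500):
    """Split markdown at h2/h3 boundaries instead of fixed character counts."""
    # Split before every "## " or "### " heading, keeping the heading with its body.
    parts = re.split(r"(?m)^(?=#{2,3}\s)", markdown_text)
    chunks = []
    for part in parts:
        part = part.strip()
        if not part:
            continue
        if len(part) <= max_chars:
            chunks.append(part)          # one coherent chunk per section
        else:
            # Oversized sections fall back to paragraph-level splitting.
            chunks.extend(p.strip() for p in part.split("\n\n") if p.strip())
    return chunks
```

Each chunk stays a self-contained section with its heading attached, which is what makes the source metadata and grounding work downstream.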
Content Indexing: Langflow
Langflow handles the scraping and indexing pipeline. A custom WebsiteScraper component crawls pichler.dev, extracts content from each page, splits it into chunks, generates embeddings via TEI, and stores vectors in Qdrant with source metadata. It automatically discovers blog post URLs from the /blog/ index page, so new posts are indexed without manual configuration.
This runs as a webhook - triggered automatically by the portfolio CI/CD pipeline whenever the website is updated, or manually via make rescrape. The custom component is deployed as a Kubernetes ConfigMap and loaded into Langflow at startup.
The RAG Pipeline: LangGraph
The heart of the chatbot is a LangGraph StateGraph with 10 nodes - a directed graph that orchestrates the entire chat flow. Unlike simple if/else chains, the graph makes the flow explicit, testable, and easy to extend.
The 10 Nodes
- `validate_input` - Input validation, sanitization, injection detection, session verification. Malicious inputs are blocked here before any LLM call is made.
- `classify_intent` - Two-stage intent classification. First, fast regex-based pattern matching for common intents (handles ~40% of inputs without calling the LLM). Then, for ambiguous inputs, an LLM relevance check determines if the question relates to my professional profile.
- `direct_response` - Handles non-RAG intents (greetings, farewells, smalltalk, contact requests, out-of-scope) with pre-defined response templates. No LLM call needed.
- `check_cache` - Looks up the query in the response cache (exact match + semantic similarity). Cache hits skip retrieval and generation entirely, returning the cached response immediately.
- `retrieve_context` - First rewrites vague queries via LLM (e.g. “tell me about him” → “What is Marcus Pichler’s professional background?”). Then runs hybrid search: dense embeddings via TEI + BM25 sparse vectors, fused via Reciprocal Rank Fusion. Results are reranked with a cross-encoder, compressed to fit the token budget, and enriched with source metadata.
- `evaluate_context` - Checks whether the retrieved context is sufficient to answer the query, scoring three dimensions: keyword coverage, document relevance, and context volume. If insufficient and retry budget remains, triggers re-retrieval with a reformulated query.
- `generate_response` - Sends the query + retrieved context to vLLM (or the failover provider) and generates the answer. The system prompt is recruiting-optimized and bilingual (German/English). Dynamic temperature adapts to query type (tech=0.15, recruiting=0.2, general=0.3).
- `validate_quality` - Four-layer response validation: garbage detection, hallucination detection, faithfulness checking (NLI-based), and confidence scoring. Responses that fail validation trigger a retry or fallback.
- `cache_and_suggest` - Caches the validated response and generates dynamic follow-up suggestions based on the conversation context.
- `handle_error` - Catches exceptions from any node and returns a graceful error response instead of crashing.
Here’s the actual graph definition from graph.py:
```python
graph = StateGraph(ChatState)
graph.add_node("validate_input", validate_input)
graph.add_node("classify_intent", classify_intent)
graph.add_node("direct_response", direct_response)
graph.add_node("check_cache", check_cache)
graph.add_node("retrieve_context", retrieve_context)
graph.add_node("evaluate_context", evaluate_context)
graph.add_node("generate_response", generate_response)
graph.add_node("validate_quality", validate_quality)
graph.add_node("cache_and_suggest", cache_and_suggest)
graph.add_node("handle_error", handle_error)
```
The 7 Intent Types
| Intent | Example | Handling |
|---|---|---|
| GREETING | “Hello”, “Hi there” | Direct response, no LLM |
| FAREWELL | “Goodbye”, “See you” | Direct response, no LLM |
| SMALLTALK | “How are you?”, “What’s your name?” | Direct response, no LLM |
| QUESTION | “What are Marcus’s skills?” | Full RAG pipeline |
| CONTACT | “How can I reach Marcus?” | Contact form redirect |
| PERSONAL | “What’s Marcus’s salary?” | Direct response (contact redirect) |
| OUT_OF_SCOPE | “What’s the weather?” | Polite redirect, no LLM |
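The regex fast path of the classifier can be sketched as a simple ordered pattern table. The patterns below are illustrative examples, not the production rule set:

```python
import re

# Illustrative subset of the fast-path intent patterns (DE/EN mixed, as the
# widget is bilingual). The real classifier has many more rules per intent.
INTENT_PATTERNS = {
    "GREETING": re.compile(r"^(hi|hello|hey|hallo|servus)\b", re.IGNORECASE),
    "FAREWELL": re.compile(r"\b(bye|goodbye|see you|tsch(ü|ue)ss)\b", re.IGNORECASE),
    "CONTACT": re.compile(r"\b(contact|reach|e-?mail|kontakt)\b", re.IGNORECASE),
}

def classify_fast(message):
    """Return an intent for unambiguous inputs, or None to fall through to the LLM."""
    text = message.strip()
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(text):
            return intent
    return None  # ambiguous -> LLM relevance check
```

Returning `None` is what hands the remaining ~60% of inputs to the LLM relevance check.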
Routing Logic
After intent classification, five routing functions determine the path through the graph:
- `route_by_intent` - QUESTION and PERSONAL intents go to `check_cache`; all other intents go to `direct_response`.
- `route_cache` - Cache hits skip to `cache_and_suggest`; cache misses proceed to `retrieve_context`.
- `route_sufficiency` - After context evaluation: sufficient context proceeds to `generate_response`; insufficient context loops back to `retrieve_context` with a reformulated query.
- `route_quality` - After generation, normal responses go to `validate_quality`; empty context skips to `cache_and_suggest` (fallback); errors go to `handle_error`.
- `route_retry` - Failed quality checks with retry budget go back to `generate_response`; exhausted retries accept the response as-is via `cache_and_suggest`.
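The routing functions themselves are small, pure functions over the graph state. A hedged plain-Python sketch (the state keys like `intent` and `retries_left` are illustrative, not the actual `ChatState` fields):

```python
def route_by_intent(state):
    if state["intent"] in ("QUESTION", "PERSONAL"):
        return "check_cache"
    return "direct_response"

def route_cache(state):
    return "cache_and_suggest" if state.get("cache_hit") else "retrieve_context"

def route_sufficiency(state):
    # Proceed to generation if context suffices or the retry budget is spent.
    if state["context_sufficient"] or state["retries_left"] <= 0:
        return "generate_response"
    return "retrieve_context"   # loop back with a reformulated query

def route_quality(state):
    if state.get("error"):
        return "handle_error"
    if not state.get("context"):
        return "cache_and_suggest"   # empty-context fallback
    return "validate_quality"

def route_retry(state):
    if state["quality_ok"] or state["retries_left"] <= 0:
        return "cache_and_suggest"   # accept as-is when retries are exhausted
    return "generate_response"
```

Because each router just maps state to a node name, every path through the graph is unit-testable without running any model.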
Real API Response
Here’s what the chatbot returns for a real question, taken from a curl against the live API at pichler.dev/api/chat.
Every response includes the detected intent, language, follow-up suggestions, and the grounded answer. You can also stream responses token-by-token via SSE:
```shell
curl -N -X POST https://www.pichler.dev/api/chat/stream \
  -H "Content-Type: application/json" \
  -d '{"message": "What are Marcus'\''s skills?", "session_id": "demo"}'

# Output (SSE):
# data: {"type": "token", "content": "Marcus"}
# data: {"type": "token", "content": " has"}
# data: {"type": "token", "content": " extensive"}
# ...
# data: {"type": "done", "suggestions": ["What projects has Marcus worked on?"]}
```
Response Quality Pipeline
Before any response reaches the user, it passes through four validation layers. This is implemented in the validate_quality node.
1. Garbage Detection (8 Methods)
Catches LLM output failures that produce nonsensical text:
| Check | What It Catches |
|---|---|
| Repeated characters | 5+ identical characters in a row (e.g., “aaaaa”) |
| Repeated patterns | 2-4 character patterns repeating 3+ times |
| Repeated words | 3+ consecutive identical words or bigrams |
| Long words | Words exceeding 30 characters |
| Consonant runs | 9+ consonants without vowels |
| N-gram loops | Repeated 3-6 word phrases (the LLM getting “stuck”) |
| Question loops | 3+ similar repeated questions instead of answers |
| Foreign script | Unexpected CJK, Cyrillic, Arabic, or Devanagari characters |
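A few of these checks fit in a handful of regexes. This is an illustrative subset of the table above, not the full eight-method implementation:

```python
import re

def looks_like_garbage(text):
    """Return the names of the garbage checks that fired (illustrative subset)."""
    checks = {
        "repeated_chars": re.search(r"(.)\1{4,}", text),           # 5+ identical chars
        "long_word": any(len(w) > 30 for w in text.split()),       # words over 30 chars
        "repeated_words": re.search(r"\b(\w+)( \1\b){2,}", text),  # 3+ identical words
        "consonant_run": re.search(r"[bcdfghjklmnpqrstvwxz]{9,}", text, re.IGNORECASE),
    }
    return [name for name, hit in checks.items() if hit]
```

An empty list means the response passes this layer; any hit feeds the confidence penalty described below.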
2. Hallucination Detection (10 Methods)
Checks whether the response is grounded in retrieved context:
| Check | What It Catches |
|---|---|
| Question as answer | Response is >30% questions or starts with a question word |
| Instruction hallucination | 14 instruction-style patterns (“Remember:”, “Note:”, “As an AI…”) |
| Nonsense words | 7+ consonant-only words, repeated syllables, character spam |
| Entity presence | Response about Marcus doesn’t mention him at all |
| Generic hallucination | 20 corporate-speak patterns (“I’d be happy to help…”), triggers on 2+ matches |
| Context grounding | Mentions technologies not present in the RAG context |
| Domain hallucination | False employers, cities, or job titles not in the knowledge base |
| Prompt leakage | Detects system prompt fragments leaked into the response |
| Self-contradiction | Detects contradictory statements within the same response |
| URL hallucination | Strips fabricated URLs not present in the source material |
3. Confidence Scoring
Every response gets a confidence score starting at 1.0. Each failed check applies a multiplicative penalty:
| Check | Penalty | What It Catches |
|---|---|---|
| Garbage detection | ×0.1 | Loops, repeated patterns, nonsense output |
| Question as answer | ×0.1 | Response is mostly questions instead of answers |
| Instruction hallucination | ×0.1 | “Remember:”, “Note:”, “As an AI…” patterns |
| Nonsense words | ×0.1 | Consonant-only words, repeated syllables, character spam |
| Missing entity | ×0.4 | Response about Marcus doesn’t mention him |
| Context grounding | ×0.2 | Claims not supported by RAG context |
| Too short | ×0.5 | Below minimum character threshold |
| Too long | ×0.7 | Exceeds maximum character limit |
The final decision is binary: a response is valid if confidence >= threshold AND issues <= 2. Failed responses trigger a fallback or regeneration attempt. There’s no letter-grade system in the pipeline - it’s pass/fail with a confidence score that drops sharply on any quality issue.
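The multiplicative scoring reduces to a few lines. The penalty values mirror the table above; the threshold value and check names are illustrative assumptions:

```python
# Penalties from the table above; each failed check multiplies confidence down.
PENALTIES = {
    "garbage": 0.1,
    "question_as_answer": 0.1,
    "instruction_hallucination": 0.1,
    "nonsense_words": 0.1,
    "missing_entity": 0.4,
    "context_grounding": 0.2,
    "too_short": 0.5,
    "too_long": 0.7,
}

def score_response(failed_checks, threshold=0.6):
    """Return (confidence, valid) - threshold value is illustrative."""
    confidence = 1.0
    for check in failed_checks:
        confidence *= PENALTIES[check]
    # Binary decision: confident enough AND at most 2 issues.
    valid = confidence >= threshold and len(failed_checks) <= 2
    return confidence, valid
```

Because penalties multiply, a single severe issue (x0.1) is enough to sink a response, while two mild ones can still pass.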
A separate quality scoring module runs after the pipeline for metrics and logging, computing a weighted A-F grade across 5 dimensions (relevance, completeness, coherence, grounding, brevity) - but this doesn’t affect the pipeline decision.
Advanced RAG: Beyond Basic Retrieval
The basic RAG pattern - embed query, search vectors, feed to LLM - works, but production quality demands more. Seven additional pipeline stages transform raw retrieval into reliable, grounded responses.
Query Rewriting
Vague or context-dependent queries like “tell me about him” or “what does he do?” fail at retrieval because there’s nothing specific to embed. The pipeline detects these cases and rewrites them via a standalone LLM call before retrieval — transforming “tell me about him” into “What is Marcus Pichler’s professional background and experience?”. This runs only for first-turn queries without conversation history, adding ~200ms but dramatically improving first-turn recall.
Hybrid Search (BM25 + Dense)
Pure semantic search struggles with exact terms — proper nouns, specific technologies, or acronyms that don’t have strong embedding representations. Hybrid search combines dense embeddings (768-dim via TEI) with BM25 sparse vectors stored directly in Qdrant. A pre-built vocabulary of 2,222 terms maps to sparse dimensions. Results from both searches are merged via Reciprocal Rank Fusion (RRF), giving the best of both worlds: semantic understanding from dense vectors and exact keyword matching from BM25.
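The fusion step is simple enough to show in full. This is the standard RRF formula, sketched over two ranked ID lists; `k=60` is the constant commonly used in the literature:

```python
def reciprocal_rank_fusion(dense_ids, sparse_ids, k=60):
    """Merge two ranked document-ID lists with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the rankings it appears in,
    so items ranked well by BOTH searches float to the top.
    """
    scores = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears mid-list in both rankings typically beats one that tops a single ranking, which is exactly the behavior you want when dense and sparse search disagree.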
Cross-Encoder Reranking
Vector similarity is a blunt instrument - it finds related content but doesn’t distinguish relevant from tangential. After hybrid search returns candidate chunks, a Cross-Encoder model via TEI re-scores each chunk against the actual query. This produces dramatically better ordering than cosine similarity alone, especially for nuanced questions where keyword overlap is low.
Context Compression
Not every sentence in a retrieved chunk is useful. The compression module uses embedding-based extractive compression (no LLM call, ~50ms): split documents into sentences, batch-embed them via TEI, keep only sentences above a cosine similarity threshold to the query, and enforce the token budget. Dynamic thresholds adapt per query type - recruiting queries get a lower threshold (keep more context), tech queries get a higher one (precision matters).
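The extractive idea can be sketched end-to-end. Here `toy_embed` is a bag-of-words stand-in for the TEI embedding call, and the threshold value is illustrative:

```python
import math
import re

def toy_embed(text):
    """Stand-in for TEI: a bag-of-words vector keyed by lowercase tokens."""
    vec = {}
    for tok in re.findall(r"\w+", text.lower()):
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def compress(query, document, threshold=0.2):
    """Keep only sentences similar enough to the query (extractive compression)."""
    q_vec = toy_embed(query)
    sentences = re.split(r"(?<=[.!?])\s+", document)
    return " ".join(s for s in sentences if cosine(toy_embed(s), q_vec) >= threshold)
```

In production the sentences are batch-embedded via TEI in one call, which is what keeps this step at ~50ms.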
Context Enrichment
Raw chunks lack provenance. The enrichment module prepends metadata headers to each chunk so the LLM knows where the information comes from: [From Marcus's career timeline (pichler.dev/career) — Section: Work Experience]. This is pure metadata-based enrichment (0ms latency, no LLM call) using a static page context map.
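Since this is pure string assembly, the whole mechanism fits in a few lines. The page-context map and function names here are illustrative:

```python
# Static page-context map (illustrative subset): source URL -> human description.
PAGE_CONTEXT = {
    "pichler.dev/career": "Marcus's career timeline",
}

def enrich_chunk(chunk_text, source_url, section):
    """Prepend a provenance header so the LLM knows where the chunk comes from."""
    page = PAGE_CONTEXT.get(source_url, source_url)
    header = f"[From {page} ({source_url}) — Section: {section}]"
    return f"{header}\n{chunk_text}"
```

Because the map is static and no model is involved, enrichment adds effectively zero latency.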
Context Sufficiency Evaluation
The most impactful addition: a pre-generation gate that evaluates whether the retrieved context is actually sufficient to answer the query before sending it to the LLM. Three dimensions are scored - keyword coverage, document relevance, and context volume:
```python
def check_context_sufficiency(query, docs, context):
    scores = {}
    keywords = extract_keywords(query)               # stopword-filtered query terms
    found = sum(1 for kw in keywords if kw.lower() in context.lower())
    scores["coverage"] = found / max(len(keywords), 1)
    scores["relevance"] = sum(d.metadata["score"] for d in docs) / max(len(docs), 1)
    scores["volume"] = token_count_score(context)    # 0-1 score against the token budget
    overall = (scores["coverage"] * 0.4
               + scores["relevance"] * 0.35
               + scores["volume"] * 0.25)
    return overall >= SUFFICIENCY_MIN_SCORE
```
If the score falls below the threshold and retry budget remains, the pipeline triggers re-retrieval with a reformulated query — expanding synonyms for low coverage or broadening terms for low relevance. This loop runs heuristically (~1-2ms latency, no LLM call) and catches cases where the initial retrieval missed relevant content.
Semantic Cache
The exact-match response cache misses paraphrases entirely. The semantic cache wraps it with an embedding-similarity layer: queries like “What does Marcus do?” and “Marcus’ current role?” hit the same cache entry if their embeddings have cosine similarity > 0.85. Entries expire after 30 minutes (matching the response cache TTL), with LRU eviction at 200 entries.
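A minimal sketch of that wrapper, assuming a pluggable `embed()` function (in production this is the TEI call). The parameters mirror the text; the class itself is illustrative, not the production implementation:

```python
import time
from collections import OrderedDict

class SemanticCache:
    """Embedding-similarity cache with TTL expiry and LRU eviction (sketch)."""

    def __init__(self, embed, sim_threshold=0.85, ttl_s=1800, max_entries=200,
                 clock=time.monotonic):
        self._embed = embed
        self._threshold = sim_threshold
        self._ttl_s = ttl_s                 # 30 min, matching the response cache
        self._max = max_entries
        self._clock = clock                 # injectable for testing
        self._entries = OrderedDict()       # query -> (vector, response, stored_at)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query):
        vec = self._embed(query)
        now = self._clock()
        for key, (v, response, stored_at) in list(self._entries.items()):
            if now - stored_at > self._ttl_s:
                del self._entries[key]           # expired entry
            elif self._cosine(vec, v) > self._threshold:
                self._entries.move_to_end(key)   # LRU touch
                return response
        return None

    def put(self, query, response):
        if len(self._entries) >= self._max:
            self._entries.popitem(last=False)    # evict least recently used
        self._entries[query] = (self._embed(query), response, self._clock())
```

The linear scan over entries is fine at 200 entries; a vector index would only pay off at much larger cache sizes.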
Faithfulness Check
The final guard: a DeBERTa-v3-base NLI model (exported to ONNX, ~50ms/claim) checks whether each claim in the generated response is entailed by the retrieved context. The response is split into sentences, and each sentence is verified as an NLI premise-hypothesis pair against the context. A response is faithful if the ratio of supported claims meets the configured threshold. Contradicted claims trigger a fallback.
Token Budget Management
With a 4B model, every token counts. The pipeline manages a 4,096-token budget split across five allocations. Context compression enforces the RAG context budget, conversation history is dynamically trimmed to fit, and a safety margin absorbs tokenizer estimation errors.
Security: OWASP Top 10 From Day One
Security isn’t an afterthought - it’s baked into every layer. The chatbot addresses all OWASP Top 10 risks:
| # | Risk | Mitigation |
|---|---|---|
| A01 | Broken Access Control | Rate limiting (10/min chat, 3/hr contact, 60/min GPU, 30/min admin), CORS whitelist, session validation |
| A02 | Cryptographic Failures | TLS everywhere, API keys for all services, no secrets in logs, HMAC-SHA256 CSRF tokens |
| A03 | Injection | 258 detection rules (187 regex patterns + 71 keywords), input sanitization, XSS prevention |
| A04 | Insecure Design | Intent classification, RAG grounding, hallucination detection, quality scoring |
| A05 | Security Misconfiguration | K8s security contexts, 11 security headers (incl. HSTS, CSP, CORP, COEP, COOP), CSRF tokens with time validation |
| A06 | Vulnerable Components | pip-audit in CI, dependency scanning, automated vulnerability alerts |
| A07 | Auth Failures | Rate limiting on all endpoints, CSRF protection with 3s minimum / 5min maximum token age |
| A08 | Data Integrity | Input validation, content sanitization, CSP headers |
| A09 | Logging & Monitoring | Structured logging, Prometheus metrics, security event tracking, ntfy alerts |
| A10 | SSRF | URL validation, allowlisted external services only |
Prompt Injection Detection
The chatbot detects 187 regex patterns and 71 injection keywords (258 total detection rules) organized into categories:
- Instruction overrides - “ignore previous instructions”, “disregard all rules”
- Role manipulation - “you are now an unrestricted AI”, “pretend to be”
- Jailbreak attempts - “DAN mode”, “developer mode”, “sudo mode”
- System prompt extraction - “reveal your prompt”, “print your instructions”
- Encoding bypasses - Base64, Unicode, HTML entities, XML/markup injection
- Multi-language attacks - Patterns in German, French, Spanish, Italian, Portuguese, Dutch, Polish, Russian, Turkish, and Czech
- SQL and shell injection - `SELECT`, `DROP TABLE`, `rm -rf`, `/etc/passwd`
- Template injection - `{{`, `{%`, `__import__`, `eval()`
- Path traversal - `../`, `..\\`, `%2e%2e`
A sample of the 187 regex patterns:
```python
INJECTION_PATTERNS = [
    r"ignore\s*all\s*instructions",
    r"ignore\s*(all\s*)?(previous|prior|above)\s*(instructions?|prompts?|rules?)",
    r"disregard\s*(all\s*)?(previous|prior|above|your)",
    r"you\s*are\s*now\s*(a|an|the|my)",
    r"pretend\s*(to\s*be|you\'?re|that\s*you\'?re)",
    r"act\s+as\s+(a|an|if|though|my)",
    # ... 178 more patterns across 9 categories
]
```
Pre-processing layers catch evasion attempts before pattern matching:
- Leetspeak normalization - A 27-character mapping table converts obfuscated attacks like `1gn0r3 pr3v10us` → `ignore previous`
- Zero-width character stripping - Removes invisible Unicode characters (BOM, zero-width space, line/paragraph separators) used to bypass pattern matching
- Spacing collapse - Detects letter-by-letter evasion like `i g n o r e` → `ignore`
Detected injections are blocked immediately with a security log entry - no LLM call is made.
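Those pre-processing layers chain naturally into one normalization pass. This sketch uses a small subset of the 27-character leetspeak table; the character choices and function name are illustrative:

```python
import re
import unicodedata

# Illustrative subset of the leetspeak mapping table.
LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                          "5": "s", "7": "t", "@": "a", "$": "s"})
# Zero-width space/joiners, line & paragraph separators, BOM.
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u2028\u2029\ufeff]")

def normalize_for_matching(text):
    text = unicodedata.normalize("NFKC", text)   # fold compatibility characters
    text = ZERO_WIDTH.sub("", text)              # strip invisible characters
    text = text.translate(LEET_MAP)              # 1gn0r3 -> ignore
    # Collapse letter-by-letter evasion: "i g n o r e" -> "ignore"
    text = re.sub(r"\b(?:\w ){2,}\w\b", lambda m: m.group(0).replace(" ", ""), text)
    return text.lower()
```

The injection regexes then run against the normalized text, so an attack has to beat normalization *and* 258 rules.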
Contact Form Spam Protection
The system has two contact endpoints - /api/chat/contact (chatbot widget) and /api/contact (portfolio page) - both with multi-layer spam detection. The portfolio contact form, consolidated into the chatbot backend as a Flask Blueprint, has the most comprehensive anti-spam pipeline:
- CSRF Double-Submit Cookie - HMAC-SHA256 signed token + HttpOnly cookie with embedded server timestamp. The signature and timestamp travel in a single cookie (`{signature}|{server_ts}`), validated on every submission.
- Proof-of-Work challenge - The frontend solves a SHA-256 hash puzzle (finding a nonce that produces a hash starting with `0000`) before submission. Requests without a valid solution are flagged.
- Timestamp bot detection - The server timestamp embedded in the CSRF cookie enforces a 3-second minimum between token generation and form submission. Automated tools that submit instantly are caught.
- Honeypot fields - 5 hidden form fields (`website`, `phone`, `email`, `url`, `fax`) that only bots fill in. Any non-empty value silently rejects the submission.
- Spam pattern detection - Regex-based content scanning for spam keywords (viagra, casino, crypto, etc.), excessive URL injection (>2 URLs), character repetition, all-caps runs, and special character floods.
- Unicode sanitization - NFKC normalization strips invisible zero-width characters, line/paragraph separators, and soft hyphens that could bypass pattern matching.
- Disposable email blocking - Rejects 24 known throwaway domains (tempmail.com, guerrillamail.com, mailinator.com, yopmail.com, sharklasers.com, etc.)
- RFC 5322 email validation - Full regex pattern compliance, not just `@` checking. Local part minimum length enforced.
- Email header injection prevention - Strips `\r` and `\n` from all header-injectable fields before constructing the email.
- Input length validation - Name (200 chars), subject (200 chars), message (10,000 chars), email (254 chars per RFC).
- Rate limiting - 3 submissions per hour (chatbot contact), tracked with thread-safe eviction. Portfolio contact is protected by the CSRF token flow, timestamp validation, and PoW challenge instead of per-IP counting.
- Origin/Referer validation - Every POST is checked against the allowlisted origins before processing.
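The proof-of-work exchange is worth sketching because both sides are tiny. The function names are illustrative; the puzzle is exactly as described - find a nonce whose SHA-256 over challenge + nonce starts with `0000`:

```python
import hashlib
import itertools

DIFFICULTY = "0000"   # 4 leading hex zeros ≈ 65,536 expected hash attempts

def solve_pow(challenge):
    """Client side: brute-force a nonce satisfying the difficulty target."""
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(DIFFICULTY):
            return nonce

def verify_pow(challenge, nonce):
    """Server side: a single hash re-checks the client's work."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith(DIFFICULTY)
```

The asymmetry is the point: the browser burns a fraction of a second of CPU, while the server verifies with one hash - cheap for humans, expensive for bulk spam bots.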
Here’s what happens when you try to bypass these layers, as seen in real curl requests against the live API.
Security Headers
Every API response includes 11 hardened security headers — Content-Security-Policy, X-Content-Type-Options, X-Frame-Options, X-XSS-Protection, Referrer-Policy, Permissions-Policy, Strict-Transport-Security, X-Permitted-Cross-Domain-Policies, Cross-Origin-Embedder-Policy, Cross-Origin-Opener-Policy, and Cross-Origin-Resource-Policy. Additionally, the portfolio contact API sets Cache-Control: no-store, no-cache and Pragma: no-cache on all responses, and SSE streams use Cache-Control: no-cache to prevent sensitive data caching.
The Frontend: Accessible Chat Widget
The chat widget is built with vanilla JavaScript - no React, no Vue, no npm dependencies. It consists of a single HTML partial, one CSS file, and one JS file, embedded into the Hugo portfolio site as a Git submodule.
Design principles:
- Accessibility - WCAG 2.1 AA compliant, full keyboard navigation, screen reader support, ARIA labels
- Dark theme - Matches the portfolio’s glassmorphism design
- Responsive - Works on desktop, tablet, and mobile
- Lightweight - Zero npm dependencies, no build step required
- Session persistence - Chat history survives page navigation via sessionStorage
- Streaming - Token-by-token response streaming via SSE (Server-Sent Events)
Why SSE instead of WebSocket? Because Cloudflare Tunnel’s free plan blocks WebSocket over HTTP/2. SSE provides the same streaming experience without the protocol limitations. CSRF tokens (HMAC-SHA256, 3s minimum / 5min maximum age) protect all state-changing operations.
The widget automatically detects the user’s language and responds accordingly. Follow-up suggestions adapt to the conversation context:
On mobile, the layout adapts with touch-optimized input and GPU stats in the footer bar:
Intent classification recognizes 7 types. Questions like “How can I contact Marcus?” bypass the RAG pipeline entirely and route directly to the contact form:
Infrastructure
K8s Node: 13 Pods on K3s
A K3s single-node setup with an NVIDIA RTX 3060 runs all AI services:
| Pod | Purpose | Notes |
|---|---|---|
| vLLM | LLM inference | GPU-accelerated, OpenAI-compatible API |
| Qdrant | Vector database | Persistent storage, API key auth |
| TEI Embeddings | Text embeddings | Jina Embeddings DE v2, 768 dimensions |
| TEI Reranker | Cross-encoder reranking | Relevance scoring for retrieved chunks |
| Langflow | Content indexing | Webhook-triggered scraping pipeline |
| Langfuse (web + worker) | LLM tracing | Self-hosted observability for every LLM call |
| ClickHouse | Analytics DB | Langfuse trace storage |
| Redis | Cache | Langfuse job queue |
| MinIO | Object storage | Langfuse media/attachment storage |
| PostgreSQL | Database | Langflow + Keycloak + Langfuse storage |
| Keycloak | SSO / OIDC | Single sign-on for all web UIs |
| OAuth2 Proxy | Auth proxy | Protects Langflow UI |
All pods run with security contexts - non-root users, dropped capabilities, read-only filesystems where possible. 13 Network Policies enforce default-deny with selective allows: pods can only communicate with the services they explicitly need. Internet egress excludes private IP ranges to prevent lateral movement.
www-server: Docker Compose
The production website runs as Docker Compose with three core containers:
- portfolio - Nginx serving the Hugo static site, proxying all `/api/*` requests to the backend
- chatbot - Unified Flask backend (port 5005) - handles both the RAG chatbot API and the portfolio contact form
- cloudflared - Cloudflare Tunnel for zero-trust ingress (no exposed ports)
Plus monitoring (node-exporter, cAdvisor), analytics (Umami + umami-db), and a GitLab Runner for CI/CD - 8 containers total.
This is the result of an API consolidation - the contact form originally ran as a separate Flask container on port 5000, but since both backends shared the same SMTP configuration, security middleware, and CORS setup, I merged them into a single container. One Nginx location, one backend, fewer moving parts. The portfolio contact form is registered as a Flask Blueprint alongside the chatbot blueprint, sharing the same rate limiter and security headers.
The chatbot uses a volume-mount deployment model - the Docker image contains only Python dependencies, while the application code (configs/) is mounted as a read-only volume at runtime. Code changes deploy in ~2 seconds via rsync + container restart, with no image rebuild needed. The Docker image is only rebuilt when requirements-api.txt, Dockerfile, or .dockerignore change.
CI/CD: Three Repos, Three Pipelines, Zero Manual Steps
Every deployment is fully automated via GitLab CI/CD. The only manual step is clicking the deploy button in GitLab - code deploys via rsync to the volume mount in seconds.
Chatbot Pipeline (6 Stages)
| Stage | Job | What It Does |
|---|---|---|
| lint | lint:python | Ruff linter + formatter check |
| test | test:unit, test:security, test:recruiting | 1,825 tests with coverage reporting |
| audit | audit:dependencies, audit:sast | pip-audit vulnerability scan + Bandit SAST |
| build | build:docker | Build image, push to GitLab Container Registry (only on dependency changes, see below) |
| deploy | deploy:www-server | Rsync code to www-server, restart container (manual trigger) |
| post-deploy | test:smoke:post-deploy, rescrape:website | Smoke tests (DE+EN), auto-rescrape on portfolio trigger |
Cross-Repo Triggers
When the portfolio website is deployed, it automatically triggers the chatbot pipeline with a RESCRAPE=true variable. The chatbot pipeline then calls the Langflow webhook to re-scrape the website content and verifies the vector count in Qdrant. The chatbot’s knowledge base stays current without any manual intervention.
Test Coverage
The test suite covers 1,825 test cases across 30 test files:
The unit and integration tests run without external dependencies - 1,748 passed, 17 skipped (tests requiring a live GPU connection or Langflow integration). The remaining 60 tests in test_live_api.py run against the production endpoint - verifying health checks, intent classification, security headers, CORS, and end-to-end chat flow on the live system.
End-to-End Testing: Playwright
Beyond unit and integration tests, 85 Playwright E2E tests across 7 specs validate the chat widget in real browsers:
| Spec | Focus | What It Covers |
|---|---|---|
| chat-widget | Core | Widget open/close, message send/receive, SSE streaming, session persistence |
| debug-layout | Layout | Debug panel rendering, Tokyo Night theme, reranker info display |
| full-device-audit | Devices | Cross-device rendering across mobile and desktop viewports |
| gpu-panel | GPU | GPU stats bar, real-time updates, VRAM/temperature/power display |
| screenshot-audit | Visual | Screenshot consistency, visual regression detection |
| smoke | Critical | DE+EN flow, health check, basic chat roundtrip |
| wow-features | Advanced | Follow-up suggestions, language switching, contact intent routing |
Tests run against multiple viewports - Desktop (1280×720), Pixel 7, Galaxy S24, and iPhone 15 Pro - ensuring the widget works across devices. The Playwright suite runs separately from pytest, giving 1,910 total automated tests (1,825 pytest + 85 Playwright).
Monitoring & Observability
Grafana Dashboards
All metrics feed into Grafana via Prometheus. The kube-prometheus-stack deployment provides 30 dashboards out of the box - Kubernetes compute resources per cluster, namespace, pod, and node, plus CoreDNS, Alertmanager, and Kubelet monitoring:
Cluster resource utilization is tracked per node, showing CPU usage, memory consumption, and pod resource allocation across the K3s single-node setup:
Two custom dashboards complement the built-in Kubernetes metrics: the AI Platform - Overview dashboard tracks chatbot-specific metrics (request volume, response times, quality scores, LLM failover status, GPU utilization), while the Portfolio - www.pichler.dev dashboard monitors the production website (uptime probes, SSL certificate expiry, container resource usage):
LLM Tracing with Langfuse
Infrastructure metrics tell you if things are running. LLM tracing tells you how well your AI is actually responding. Langfuse is a self-hosted, open-source observability platform for LLM applications - the self-hosted alternative to LangSmith.
Every LLM call is automatically traced via a LangChain CallbackHandler injected at the ChatOpenAI constructor level. Each trace captures the full system prompt, user query, retrieved context, generated response, token counts, and latency - all without touching application code beyond a single import.
Langfuse runs as 5 pods on the K8s cluster (web, worker, ClickHouse, Redis, MinIO) with tight resource limits - adding less than 1 GiB to the cluster’s memory footprint. The SDK auto-configures via environment variables, so tracing is enabled in production and silently disabled in development.
GPU Metrics API
A dedicated /api/chat/gpu-stats endpoint exposes real-time GPU telemetry as JSON, and an SSE stream (/api/chat/gpu-stats-stream) pushes ~2-3 updates per second. The data comes from three sources: the vLLM metrics endpoint, nvidia-smi, and the DCGM exporter. The chat widget displays a live GPU status bar at the bottom - VRAM usage, temperature, power draw, and inference model - giving visitors transparency into the hardware running their queries.
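The SSE side of the stream is just plain-text framing. A minimal sketch of turning one telemetry sample into an SSE frame - the field names and event name here are illustrative, not the actual endpoint's schema:

```python
import json

def sse_event(stats: dict, event: str = "gpu-stats") -> str:
    """Serialize one telemetry sample as a Server-Sent Events frame.

    SSE frames are 'event:'/'data:' lines terminated by a blank line;
    the browser's EventSource API parses them natively.
    """
    return f"event: {event}\ndata: {json.dumps(stats)}\n\n"

# A Flask view would yield frames like this from a generator with
# mimetype "text/event-stream"; here we just build a single frame.
frame = sse_event({"vram_used_mib": 9216, "temp_c": 61, "power_w": 148})
```

The trailing blank line is what delimits events, which is why SSE survives any proxy that can stream chunked text.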
Alerting via ntfy
Critical events trigger push notifications via a self-hosted ntfy server:
- LLM failover activated / recovered
- High error rate detected
- Service health check failures
Alerts are sent via HTTP POST with severity tags - no external notification service needed, no subscription fees. Each alert includes the service name, environment, timestamp, and relevant context fields (error details, affected providers, duration).
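A publish of this shape needs nothing beyond the standard library. The sketch below uses a placeholder server URL and topic name; ntfy reads the title, priority, and tags from HTTP headers, while the body carries the message text:

```python
import urllib.request

def build_alert(server: str, topic: str, title: str, message: str,
                priority: str = "high",
                tags: str = "rotating_light") -> urllib.request.Request:
    """Build an ntfy publish request: the body is the message text,
    metadata travels in the Title/Priority/Tags headers."""
    return urllib.request.Request(
        url=f"{server}/{topic}",
        data=message.encode("utf-8"),
        headers={"Title": title, "Priority": priority, "Tags": tags},
        method="POST",
    )

# Sending is a one-liner (not executed here):
# urllib.request.urlopen(build_alert(
#     "https://ntfy.example.internal", "chatbot-alerts",
#     "LLM failover activated",
#     "Primary vLLM unreachable; OpenRouter backup active"))
```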
A/B Testing
An experiment framework enables testing different prompt styles, temperature settings, and context chunk counts. Sessions are assigned to variants via consistent hashing (same session always gets the same variant), with per-variant tracking of quality scores, response times, and user feedback.
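Sticky variant assignment boils down to a stable hash of the session ID. A minimal sketch - the bucket scheme and variant names are illustrative, not the experiment framework's actual configuration:

```python
import hashlib

VARIANTS = ["concise-prompt", "detailed-prompt", "high-temperature"]

def assign_variant(session_id: str, variants=VARIANTS) -> str:
    """Map a session to a variant deterministically.

    hashlib (unlike Python's built-in hash(), which is salted per
    process) is stable across restarts, so the same session always
    lands in the same bucket - a prerequisite for clean per-variant
    quality and latency metrics.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```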
Lessons Learned
1. Consumer GPUs are viable for production AI. The RTX 3060 handles Gemma 3 4B IT with ~113ms time-to-first-token. For a personal portfolio chatbot, that’s more than enough - and the ongoing cost is just electricity (~17W idle, ~150W under load, roughly €3-5/month). No per-token pricing, no usage caps.
2. RAG beats fine-tuning for domain-specific Q&A. Instead of fine-tuning a model on my resume, I index my website and retrieve relevant chunks at query time. Content updates are instant - no retraining needed, just trigger a re-scrape.
3. Security must be first, not last. Adding prompt injection detection after the fact is painful. Building it into the LangGraph pipeline from day one - as its own validation node - made it natural and testable. 258 detection rules didn’t appear overnight; they grew from production experience.
4. SSE beats WebSocket for Cloudflare Tunnel. Cloudflare’s free plan blocks WebSocket over HTTP/2. SSE provides the same streaming UX without the protocol limitations. One line change in the transport layer, zero UX regression.
5. Three repos beat a monorepo for mixed infrastructure. Separating the chatbot code, K8s manifests, and portfolio website into three repos with cross-triggers keeps each pipeline focused and independently deployable. A K8s manifest change doesn’t trigger chatbot tests.
6. Always check the proxy chain. The CORS preflight issue that took hours to debug? Nginx in the portfolio container was blocking OPTIONS requests before they reached Flask. The chatbot’s CORS middleware was correct - the problem was upstream.
7. Automate everything - including content updates. The portfolio deploy triggers a re-scrape automatically. I never have to remember to update the chatbot’s knowledge base. If it’s not automated, it’s not reliable.
8. Consolidate when the overlap is obvious. The portfolio contact form and the chatbot API shared SMTP config, security middleware, CORS setup, and rate limiting. Running them as separate containers doubled the maintenance surface for zero benefit. Merging them into a single Flask app with two Blueprints halved the container count on the www-server and simplified the Nginx proxy from two location blocks to one.
Built With Claude Code
This entire project - backend, frontend, infrastructure, CI/CD, security, tests, and this blog post - was built using Vibe Coding with Claude Code (Claude Opus).
My role: Architect and Product Owner. I defined requirements, made architecture decisions, reviewed every result, and steered direction. Claude executed - writing code, debugging infrastructure issues, configuring Kubernetes manifests, setting up CI/CD pipelines, and running comprehensive tests.
Some highlights of the AI-assisted development:
- LangGraph state machine - Claude designed and implemented the 10-node chat flow graph, including the context sufficiency evaluation loop, the 187-pattern injection detection system, and the NLI-based faithfulness checker
- 3-repo CI/CD architecture - From Dockerfiles to GitLab pipeline configs to cross-repo triggers, fully automated with zero manual deployment steps
- Kubernetes debugging - When Langflow's API returned cryptic errors (ValueError: 'display_name'), Claude traced it through the API, found the root cause (incomplete component templates in the flow JSON), and fixed the creation script
- CORS preflight fix - A 405 error on OPTIONS requests went through three layers of debugging (Flask → Nginx → Cloudflare) before Claude identified the single line in nginx.conf blocking the request method
- 1,910 automated tests - 1,825 pytest + 85 Playwright E2E tests covering API contracts, injection detection, session security, input sanitization, CSRF validation, spam detection, alerting, faithfulness checking, context sufficiency, and recruiting pattern recognition
This is not “AI replacing developers” - it’s AI amplifying an architect’s capabilities. I couldn’t have built this entire stack in the time I did without Claude. And Claude couldn’t have built it without someone making the right architecture decisions, reviewing code, and knowing when to push back.
The best results come from human judgment + AI execution.
What’s Next
Since publication, two major features have shipped:
- Hybrid Search ✓ - BM25 sparse vectors now run alongside dense embeddings, fused via Reciprocal Rank Fusion. Exact term matches and proper nouns that semantic search alone missed are now reliably retrieved.
- Query Rewriting ✓ - Vague first queries like "tell me about him" are rewritten by the LLM into specific, self-contained questions before retrieval, dramatically improving first-turn recall.
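Reciprocal Rank Fusion itself is only a few lines: each document earns 1/(k + rank) per result list it appears in, and the summed scores decide the merged order. A sketch with made-up document IDs (k = 60 is the commonly used default, not necessarily what the pipeline configures):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists via Reciprocal Rank Fusion.

    A document near the top of either the dense (semantic) or the
    sparse (BM25) ranking accumulates a large 1/(k + rank) score,
    so exact-term hits and semantic hits both surface.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["about-me", "k8s-setup", "contact"]    # hypothetical semantic ranking
sparse = ["k8s-setup", "projects", "about-me"]  # hypothetical BM25 ranking
fused = rrf_fuse([dense, sparse])
```

Because only ranks matter, RRF needs no score normalization between the two retrievers - a key reason it is the standard fusion choice for hybrid search.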
Still on the roadmap:
- Eval Pipeline - Golden-set benchmarking with curated question-answer pairs to measure retrieval quality, response accuracy, and regression detection across pipeline changes
- Streaming Citations - Inline source references in streamed responses, linking claims to specific RAG chunks in real-time
- Larger model - When VRAM allows (or with quantization), moving to a 12B+ model for more nuanced responses
The Stack
| Layer | Technology |
|---|---|
| LLM | vLLM + Gemma 3 4B IT (RTX 3060, 12 GB VRAM) |
| Embeddings | TEI + Jina Embeddings DE v2 (768 dimensions) |
| Vector DB | Qdrant (45 chunks, Cosine similarity, hybrid dense+sparse) |
| Orchestration | LangGraph StateGraph (10 nodes, 7 intent types) |
| API | Flask + Gunicorn (27 endpoints, 2 Blueprints) |
| Frontend | Vanilla JS (zero dependencies, WCAG 2.1 AA) |
| Content Indexing | Langflow (webhook-triggered) |
| Reranking | Cross-Encoder via TEI |
| Auth | Keycloak + OAuth2 Proxy |
| Quality | 8 garbage checks + 10 hallucination checks + NLI faithfulness + confidence scoring |
| Security | 258 injection rules, OWASP Top 10, 11 security headers, PoW challenges |
| Failover | OpenRouter (GPT-4o-mini) automatic backup |
| Alerting | ntfy (self-hosted push notifications) |
| Monitoring | Prometheus + Grafana + DCGM Exporter |
| LLM Tracing | Langfuse (self-hosted, OpenTelemetry) |
| Infrastructure | K3s (13 pods) + Docker Compose |
| CI/CD | GitLab (3 repos, cross-triggers, container registry) |
| Ingress | Cloudflare Tunnel (zero-trust, no exposed ports) |
| Testing | 1,825 pytest + 85 Playwright E2E, pip-audit, Bandit SAST |
Try it yourself - the chat widget is live at pichler.dev. Feel free to reach out if you want to discuss AI infrastructure, RAG pipelines, or self-hosted LLMs.