Building a Self-Hosted RAG Chatbot From Scratch
TL;DR
I built a Retrieval-Augmented Generation (RAG) chatbot that answers questions about my professional background using my portfolio website as the single source of truth. It runs entirely on my own hardware - a K3s single-node setup with an NVIDIA RTX 3060 - with no cloud AI dependencies. The stack includes vLLM for GPU-accelerated inference, LangGraph for orchestration, Qdrant for vector search, and a Flask API with OWASP-aligned security. Everything is deployed via GitLab CI/CD across three repositories, with automated content indexing, monitoring, and zero manual intervention.
Why Build Your Own Chatbot?
Every portfolio website has the same problem: visitors need to find specific information fast. “What technologies does this person know?” “Do they have Kubernetes experience?” “How do I contact them?”
Instead of hoping visitors click through the right pages, I built an AI assistant that answers these questions instantly - grounded in actual website content, not hallucinations.
But why self-hosted? Three reasons:
- Privacy - No visitor data leaves my infrastructure. No OpenAI, no Anthropic API, no third-party inference.
- Cost - After the initial hardware investment, the only ongoing cost is electricity (~17W idle, ~150W under load). No per-token pricing, no usage caps.
- Learning - Building the entire stack from GPU drivers to production deployment taught me more about AI infrastructure than any course could.
Architecture
The system runs on two VMs in separate VLANs on a Proxmox hypervisor, managed across three Git repositories:
Servers:
| Server | Network | Role |
|---|---|---|
| www-server | DMZ | Docker Compose - portfolio website, unified backend API, Cloudflare Tunnel |
| K8s node | Internal | K3s single-node - GPU inference, vector DB, embeddings, monitoring, auth |
The two networks are deliberately isolated. The chatbot API on the www-server calls AI services on the K8s node via HTTPS through ingress endpoints - the www-server has no direct cluster access.
Repositories:
| Repository | Purpose | Deploys To |
|---|---|---|
| chatbot | Backend API, frontend widget, tests | www-server (GitLab CI/CD) |
| ai-infra | Kubernetes manifests, Langflow flows, deploy scripts | K8s node (kubectl apply) |
| pichler-portfolio | Hugo website, Nginx config, Docker Compose | www-server (Hugo build + Docker image) |
Each repository has its own CI/CD pipeline. When the portfolio is updated, it automatically triggers a re-scrape of the website content into the vector database - the chatbot always has up-to-date information without manual intervention.
The AI Stack
LLM: vLLM + Gemma 3 4B IT
The language model is Google’s Gemma 3 4B IT (instruction-tuned), served by vLLM - a high-throughput inference engine that provides an OpenAI-compatible /v1/chat/completions endpoint.
Why Gemma 3 4B? It’s the largest model that fits in 12 GB VRAM (RTX 3060) while maintaining good response quality for a focused domain. I tested the 12B variant, but it triggered out-of-memory errors on this GPU.
Real performance numbers from the production API: ~113ms time-to-first-token on the RTX 3060. At idle the GPU draws ~17W; under inference load it peaks at ~150W.
Automatic Failover: If the local vLLM instance becomes unavailable, the system automatically fails over to OpenRouter (GPT-4o-mini) as a cloud backup. Health checks run with configurable timeouts, and the failover state is managed thread-safely with a cooldown before attempting recovery. Every failover event and recovery triggers a push notification via ntfy.
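The failover logic can be sketched in a few lines. This is an illustrative reconstruction, not the production code - the class name, provider labels, and injectable clock are assumptions made for testability:

```python
import threading
import time

class FailoverManager:
    """Sketch of a primary/backup switch with cooldown (illustrative names).

    When the primary health check fails, traffic moves to the backup provider.
    Recovery back to primary is only attempted after a cooldown has elapsed.
    """

    def __init__(self, health_check, cooldown_s=60.0, clock=time.monotonic):
        self._health_check = health_check   # callable: True if local vLLM is up
        self._cooldown_s = cooldown_s
        self._clock = clock                 # injectable for testing
        self._lock = threading.Lock()       # thread-safe failover state
        self._failed_at = None              # None means primary is active

    def active_provider(self):
        with self._lock:
            if self._failed_at is None:
                if self._health_check():
                    return "local-vllm"
                self._failed_at = self._clock()   # flip to backup, start cooldown
                return "openrouter"
            # Only probe the primary again once the cooldown has passed.
            if self._clock() - self._failed_at >= self._cooldown_s and self._health_check():
                self._failed_at = None            # primary recovered
                return "local-vllm"
            return "openrouter"
```

In the real system, the state transitions in both directions would additionally fire the ntfy notification described above.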
Embeddings: TEI + Jina Embeddings DE v2
Hugging Face Text Embeddings Inference (TEI) serves the jinaai/jina-embeddings-v2-base-de model, producing 768-dimensional vectors. It handles both the indexing pipeline (when content is scraped) and query-time embedding generation (when a user asks a question).
Vector Database: Qdrant
Qdrant stores the embedded website content. The collection uses 768 dimensions with Cosine similarity. When a user asks a question, the chatbot embeds the query via TEI, searches Qdrant for the most relevant content chunks, and feeds them to the LLM as context.
The collection currently holds 45 content chunks - automatically scraped from 9 pages including the portfolio website and blog posts, each with metadata like source URLs and section titles for grounding. Adaptive chunking splits content at semantic boundaries (h2/h3 sections) rather than fixed character counts, producing fewer but more coherent chunks.
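The adaptive-chunking idea - split at heading boundaries, not character counts - can be sketched like this. The production scraper works on rendered HTML; this simplified version operates on markdown, and the `max_chars` fallback value is an assumption:

```python
import re

def chunk_by_sections(markdown_text, max_chars=1500):
    """Split markdown at h2/h3 boundaries instead of fixed character counts."""
    # Split before every "## " or "### " heading, keeping the heading with its body.
    parts = re.split(r"(?m)^(?=#{2,3}\s)", markdown_text)
    chunks = []
    for part in parts:
        part = part.strip()
        if not part:
            continue
        if len(part) <= max_chars:
            chunks.append(part)          # one coherent chunk per section
        else:
            # Oversized sections fall back to paragraph-level splitting.
            chunks.extend(p.strip() for p in part.split("\n\n") if p.strip())
    return chunks
```

Each chunk stays a self-contained section with its heading attached, which is what makes the source metadata and grounding work downstream.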
Content Indexing: Langflow
Langflow handles the scraping and indexing pipeline. A custom WebsiteScraper component crawls pichler.dev, extracts content from each page, splits it into chunks, generates embeddings via TEI, and stores vectors in Qdrant with source metadata. It automatically discovers blog post URLs from the /blog/ index page, so new posts are indexed without manual configuration.
This runs as a webhook - triggered automatically by the portfolio CI/CD pipeline whenever the website is updated, or manually via make rescrape. The custom component is deployed as a Kubernetes ConfigMap and loaded into Langflow at startup.
The RAG Pipeline: LangGraph
The heart of the chatbot is a LangGraph StateGraph with 10 nodes - a directed graph that orchestrates the entire chat flow. Unlike simple if/else chains, the graph makes the flow explicit, testable, and easy to extend.
The 10 Nodes
- `validate_input` - Input validation, sanitization, injection detection, session verification. Malicious inputs are blocked here before any LLM call is made.
- `classify_intent` - Two-stage intent classification. First, fast regex-based pattern matching for common intents (handles ~40% of inputs without calling the LLM). Then, for ambiguous inputs, an LLM relevance check determines if the question relates to my professional profile.
- `direct_response` - Handles non-RAG intents (greetings, farewells, smalltalk, contact requests, out-of-scope) with pre-defined response templates. No LLM call needed.
- `check_cache` - Looks up the query in the response cache (exact match + semantic similarity). Cache hits skip retrieval and generation entirely, returning the cached response immediately.
- `retrieve_context` - First rewrites vague queries via LLM (e.g. “tell me about him” → “What is Marcus Pichler’s professional background?”). Then runs hybrid search: dense embeddings via TEI + BM25 sparse vectors, fused via Reciprocal Rank Fusion. Results are reranked with a cross-encoder, compressed to fit the token budget, and enriched with source metadata.
- `evaluate_context` - Checks whether the retrieved context is sufficient to answer the query, scoring three dimensions: keyword coverage, document relevance, and context volume. If insufficient and retry budget remains, triggers re-retrieval with a reformulated query.
- `generate_response` - Sends the query + retrieved context to vLLM (or the failover provider) and generates the answer. The system prompt is recruiting-optimized and bilingual (German/English). Dynamic temperature adapts to query type (tech=0.15, recruiting=0.2, general=0.3).
- `validate_quality` - Four-layer response validation: garbage detection, hallucination detection, faithfulness checking (NLI-based), and confidence scoring. Responses that fail validation trigger a retry or fallback.
- `cache_and_suggest` - Caches the validated response and generates dynamic follow-up suggestions based on the conversation context.
- `handle_error` - Catches exceptions from any node and returns a graceful error response instead of crashing.
Here’s the actual graph definition from graph.py:
```python
graph = StateGraph(ChatState)
graph.add_node("validate_input", validate_input)
graph.add_node("classify_intent", classify_intent)
graph.add_node("direct_response", direct_response)
graph.add_node("check_cache", check_cache)
graph.add_node("retrieve_context", retrieve_context)
graph.add_node("evaluate_context", evaluate_context)
graph.add_node("generate_response", generate_response)
graph.add_node("validate_quality", validate_quality)
graph.add_node("cache_and_suggest", cache_and_suggest)
graph.add_node("handle_error", handle_error)
```
The 7 Intent Types
| Intent | Example | Handling |
|---|---|---|
| GREETING | “Hello”, “Hi there” | Direct response, no LLM |
| FAREWELL | “Goodbye”, “See you” | Direct response, no LLM |
| SMALLTALK | “How are you?”, “What’s your name?” | Direct response, no LLM |
| QUESTION | “What are Marcus’s skills?” | Full RAG pipeline |
| CONTACT | “How can I reach Marcus?” | Contact form redirect |
| PERSONAL | “What’s Marcus’s salary?” | Direct response (contact redirect) |
| OUT_OF_SCOPE | “What’s the weather?” | Polite redirect, no LLM |
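The regex fast path of the classifier can be sketched as a simple ordered pattern table. The patterns below are illustrative examples, not the production rule set:

```python
import re

# Illustrative subset of the fast-path intent patterns (DE/EN mixed, as the
# widget is bilingual). The real classifier has many more rules per intent.
INTENT_PATTERNS = {
    "GREETING": re.compile(r"^(hi|hello|hey|hallo|servus)\b", re.IGNORECASE),
    "FAREWELL": re.compile(r"\b(bye|goodbye|see you|tsch(ü|ue)ss)\b", re.IGNORECASE),
    "CONTACT": re.compile(r"\b(contact|reach|e-?mail|kontakt)\b", re.IGNORECASE),
}

def classify_fast(message):
    """Return an intent for unambiguous inputs, or None to fall through to the LLM."""
    text = message.strip()
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(text):
            return intent
    return None  # ambiguous -> LLM relevance check
```

Returning `None` is what hands the remaining ~60% of inputs to the LLM relevance check.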
Routing Logic
After intent classification, five routing functions determine the path through the graph:
- `route_by_intent` - QUESTION and PERSONAL intents go to `check_cache`; all other intents go to `direct_response`.
- `route_cache` - Cache hits skip to `cache_and_suggest`; cache misses proceed to `retrieve_context`.
- `route_sufficiency` - After context evaluation: sufficient context proceeds to `generate_response`; insufficient context loops back to `retrieve_context` with a reformulated query.
- `route_quality` - After generation, normal responses go to `validate_quality`; empty context skips to `cache_and_suggest` (fallback); errors go to `handle_error`.
- `route_retry` - Failed quality checks with retry budget go back to `generate_response`; exhausted retries accept the response as-is via `cache_and_suggest`.
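The routing functions themselves are small, pure functions over the graph state. A hedged plain-Python sketch (the state keys like `intent` and `retries_left` are illustrative, not the actual `ChatState` fields):

```python
def route_by_intent(state):
    if state["intent"] in ("QUESTION", "PERSONAL"):
        return "check_cache"
    return "direct_response"

def route_cache(state):
    return "cache_and_suggest" if state.get("cache_hit") else "retrieve_context"

def route_sufficiency(state):
    # Proceed to generation if context suffices or the retry budget is spent.
    if state["context_sufficient"] or state["retries_left"] <= 0:
        return "generate_response"
    return "retrieve_context"   # loop back with a reformulated query

def route_quality(state):
    if state.get("error"):
        return "handle_error"
    if not state.get("context"):
        return "cache_and_suggest"   # empty-context fallback
    return "validate_quality"

def route_retry(state):
    if state["quality_ok"] or state["retries_left"] <= 0:
        return "cache_and_suggest"   # accept as-is when retries are exhausted
    return "generate_response"
```

Because each router just maps state to a node name, every path through the graph is unit-testable without running any model.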
Real API Response
Here’s what the chatbot returns for a real question, taken from a curl against the live API at pichler.dev/api/chat.
Every response includes the detected intent, language, follow-up suggestions, and the grounded answer. You can also stream responses token-by-token via SSE:
```shell
curl -N -X POST https://www.pichler.dev/api/chat/stream \
  -H "Content-Type: application/json" \
  -d '{"message": "What are Marcus'\''s skills?", "session_id": "demo"}'

# Output (SSE):
# data: {"type": "token", "content": "Marcus"}
# data: {"type": "token", "content": " has"}
# data: {"type": "token", "content": " extensive"}
# ...
# data: {"type": "done", "suggestions": ["What projects has Marcus worked on?"]}
```
Response Quality Pipeline
Before any response reaches the user, it passes through four validation layers. This is implemented in the validate_quality node.
1. Garbage Detection (8 Methods)
Catches LLM output failures that produce nonsensical text:
| Check | What It Catches |
|---|---|
| Repeated characters | 5+ identical characters in a row (e.g., “aaaaa”) |
| Repeated patterns | 2-4 character patterns repeating 3+ times |
| Repeated words | 3+ consecutive identical words or bigrams |
| Long words | Words exceeding 30 characters |
| Consonant runs | 9+ consonants without vowels |
| N-gram loops | Repeated 3-6 word phrases (the LLM getting “stuck”) |
| Question loops | 3+ similar repeated questions instead of answers |
| Foreign script | Unexpected CJK, Cyrillic, Arabic, or Devanagari characters |
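A few of these checks fit in a handful of regexes. This is an illustrative subset of the table above, not the full eight-method implementation:

```python
import re

def looks_like_garbage(text):
    """Return the names of the garbage checks that fired (illustrative subset)."""
    checks = {
        "repeated_chars": re.search(r"(.)\1{4,}", text),           # 5+ identical chars
        "long_word": any(len(w) > 30 for w in text.split()),       # words over 30 chars
        "repeated_words": re.search(r"\b(\w+)( \1\b){2,}", text),  # 3+ identical words
        "consonant_run": re.search(r"[bcdfghjklmnpqrstvwxz]{9,}", text, re.IGNORECASE),
    }
    return [name for name, hit in checks.items() if hit]
```

An empty list means the response passes this layer; any hit feeds the confidence penalty described below.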
2. Hallucination Detection (10 Methods)
Checks whether the response is grounded in retrieved context:
| Check | What It Catches |
|---|---|
| Question as answer | Response is >30% questions or starts with a question word |
| Instruction hallucination | 14 instruction-style patterns (“Remember:”, “Note:”, “As an AI…”) |
| Nonsense words | 7+ consonant-only words, repeated syllables, character spam |
| Entity presence | Response about Marcus doesn’t mention him at all |
| Generic hallucination | 20 corporate-speak patterns (“I’d be happy to help…”), triggers on 2+ matches |
| Context grounding | Mentions technologies not present in the RAG context |
| Domain hallucination | False employers, cities, or job titles not in the knowledge base |
| Prompt leakage | Detects system prompt fragments leaked into the response |
| Self-contradiction | Detects contradictory statements within the same response |
| URL hallucination | Strips fabricated URLs not present in the source material |
3. Confidence Scoring
Every response gets a confidence score starting at 1.0. Each failed check applies a multiplicative penalty:
| Check | Penalty | What It Catches |
|---|---|---|
| Garbage detection | ×0.1 | Loops, repeated patterns, nonsense output |
| Question as answer | ×0.1 | Response is mostly questions instead of answers |
| Instruction hallucination | ×0.1 | “Remember:”, “Note:”, “As an AI…” patterns |
| Nonsense words | ×0.1 | Consonant-only words, repeated syllables, character spam |
| Missing entity | ×0.4 | Response about Marcus doesn’t mention him |
| Context grounding | ×0.2 | Claims not supported by RAG context |
| Too short | ×0.5 | Below minimum character threshold |
| Too long | ×0.7 | Exceeds maximum character limit |
The final decision is binary: a response is valid if confidence >= threshold AND issues <= 2. Failed responses trigger a fallback or regeneration attempt. There’s no letter-grade system in the pipeline - it’s pass/fail with a confidence score that drops sharply on any quality issue.
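The multiplicative scoring reduces to a few lines. The penalty values mirror the table above; the threshold value and check names are illustrative assumptions:

```python
# Penalties from the table above; each failed check multiplies confidence down.
PENALTIES = {
    "garbage": 0.1,
    "question_as_answer": 0.1,
    "instruction_hallucination": 0.1,
    "nonsense_words": 0.1,
    "missing_entity": 0.4,
    "context_grounding": 0.2,
    "too_short": 0.5,
    "too_long": 0.7,
}

def score_response(failed_checks, threshold=0.6):
    """Return (confidence, valid) - threshold value is illustrative."""
    confidence = 1.0
    for check in failed_checks:
        confidence *= PENALTIES[check]
    # Binary decision: confident enough AND at most 2 issues.
    valid = confidence >= threshold and len(failed_checks) <= 2
    return confidence, valid
```

Because penalties multiply, a single severe issue (x0.1) is enough to sink a response, while two mild ones can still pass.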
A separate quality scoring module runs after the pipeline for metrics and logging, computing a weighted A-F grade across 5 dimensions (relevance, completeness, coherence, grounding, brevity) - but this doesn’t affect the pipeline decision.
Advanced RAG: Beyond Basic Retrieval
The basic RAG pattern - embed query, search vectors, feed to LLM - works, but production quality demands more. Seven additional pipeline stages transform raw retrieval into reliable, grounded responses.
Query Rewriting
Vague or context-dependent queries like “tell me about him” or “what does he do?” fail at retrieval because there’s nothing specific to embed. The pipeline detects these cases and rewrites them via a standalone LLM call before retrieval — transforming “tell me about him” into “What is Marcus Pichler’s professional background and experience?”. This runs only for first-turn queries without conversation history, adding ~200ms but dramatically improving first-turn recall.
Hybrid Search (BM25 + Dense)
Pure semantic search struggles with exact terms — proper nouns, specific technologies, or acronyms that don’t have strong embedding representations. Hybrid search combines dense embeddings (768-dim via TEI) with BM25 sparse vectors stored directly in Qdrant. A pre-built vocabulary of 2,222 terms maps to sparse dimensions. Results from both searches are merged via Reciprocal Rank Fusion (RRF), giving the best of both worlds: semantic understanding from dense vectors and exact keyword matching from BM25.
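The fusion step is simple enough to show in full. This is the standard RRF formula, sketched over two ranked ID lists; `k=60` is the constant commonly used in the literature:

```python
def reciprocal_rank_fusion(dense_ids, sparse_ids, k=60):
    """Merge two ranked document-ID lists with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the rankings it appears in,
    so items ranked well by BOTH searches float to the top.
    """
    scores = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears mid-list in both rankings typically beats one that tops a single ranking, which is exactly the behavior you want when dense and sparse search disagree.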
Cross-Encoder Reranking
Vector similarity is a blunt instrument - it finds related content but doesn’t distinguish relevant from tangential. After hybrid search returns candidate chunks, a Cross-Encoder model via TEI re-scores each chunk against the actual query. This produces dramatically better ordering than cosine similarity alone, especially for nuanced questions where keyword overlap is low.
Context Compression
Not every sentence in a retrieved chunk is useful. The compression module uses embedding-based extractive compression (no LLM call, ~50ms): split documents into sentences, batch-embed them via TEI, keep only sentences above a cosine similarity threshold to the query, and enforce the token budget. Dynamic thresholds adapt per query type - recruiting queries get a lower threshold (keep more context), tech queries get a higher one (precision matters).
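The extractive idea can be sketched end-to-end. Here `toy_embed` is a bag-of-words stand-in for the TEI embedding call, and the threshold value is illustrative:

```python
import math
import re

def toy_embed(text):
    """Stand-in for TEI: a bag-of-words vector keyed by lowercase tokens."""
    vec = {}
    for tok in re.findall(r"\w+", text.lower()):
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def compress(query, document, threshold=0.2):
    """Keep only sentences similar enough to the query (extractive compression)."""
    q_vec = toy_embed(query)
    sentences = re.split(r"(?<=[.!?])\s+", document)
    return " ".join(s for s in sentences if cosine(toy_embed(s), q_vec) >= threshold)
```

In production the sentences are batch-embedded via TEI in one call, which is what keeps this step at ~50ms.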
Context Enrichment
Raw chunks lack provenance. The enrichment module prepends metadata headers to each chunk so the LLM knows where the information comes from: [From Marcus's career timeline (pichler.dev/career) — Section: Work Experience]. This is pure metadata-based enrichment (0ms latency, no LLM call) using a static page context map.
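Since this is pure string assembly, the whole mechanism fits in a few lines. The page-context map and function names here are illustrative:

```python
# Static page-context map (illustrative subset): source URL -> human description.
PAGE_CONTEXT = {
    "pichler.dev/career": "Marcus's career timeline",
}

def enrich_chunk(chunk_text, source_url, section):
    """Prepend a provenance header so the LLM knows where the chunk comes from."""
    page = PAGE_CONTEXT.get(source_url, source_url)
    header = f"[From {page} ({source_url}) — Section: {section}]"
    return f"{header}\n{chunk_text}"
```

Because the map is static and no model is involved, enrichment adds effectively zero latency.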
Context Sufficiency Evaluation
The most impactful addition: a pre-generation gate that evaluates whether the retrieved context is actually sufficient to answer the query before sending it to the LLM. Three dimensions are scored - keyword coverage, document relevance, and context volume:
```python
def check_context_sufficiency(query, docs, context):
    scores = {}
    keywords = extract_keywords(query)               # stopword-filtered query terms
    found = sum(1 for kw in keywords if kw.lower() in context.lower())
    scores["coverage"] = found / max(len(keywords), 1)
    scores["relevance"] = sum(d.metadata["score"] for d in docs) / max(len(docs), 1)
    scores["volume"] = token_count_score(context)    # 0-1 score against the token budget
    overall = (scores["coverage"] * 0.4
               + scores["relevance"] * 0.35
               + scores["volume"] * 0.25)
    return overall >= SUFFICIENCY_MIN_SCORE
```
If the score falls below the threshold and retry budget remains, the pipeline triggers re-retrieval with a reformulated query — expanding synonyms for low coverage or broadening terms for low relevance. This loop runs heuristically (~1-2ms latency, no LLM call) and catches cases where the initial retrieval missed relevant content.
Semantic Cache
The exact-match response cache misses paraphrases entirely. The semantic cache wraps it with an embedding-similarity layer: queries like “What does Marcus do?” and “Marcus’ current role?” hit the same cache entry if their embeddings have cosine similarity > 0.85. Entries expire after 30 minutes (matching the response cache TTL), with LRU eviction at 200 entries.
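A minimal sketch of that wrapper, assuming a pluggable `embed()` function (in production this is the TEI call). The parameters mirror the text; the class itself is illustrative, not the production implementation:

```python
import time
from collections import OrderedDict

class SemanticCache:
    """Embedding-similarity cache with TTL expiry and LRU eviction (sketch)."""

    def __init__(self, embed, sim_threshold=0.85, ttl_s=1800, max_entries=200,
                 clock=time.monotonic):
        self._embed = embed
        self._threshold = sim_threshold
        self._ttl_s = ttl_s                 # 30 min, matching the response cache
        self._max = max_entries
        self._clock = clock                 # injectable for testing
        self._entries = OrderedDict()       # query -> (vector, response, stored_at)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query):
        vec = self._embed(query)
        now = self._clock()
        for key, (v, response, stored_at) in list(self._entries.items()):
            if now - stored_at > self._ttl_s:
                del self._entries[key]           # expired entry
            elif self._cosine(vec, v) > self._threshold:
                self._entries.move_to_end(key)   # LRU touch
                return response
        return None

    def put(self, query, response):
        if len(self._entries) >= self._max:
            self._entries.popitem(last=False)    # evict least recently used
        self._entries[query] = (self._embed(query), response, self._clock())
```

The linear scan over entries is fine at 200 entries; a vector index would only pay off at much larger cache sizes.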
Faithfulness Check
The final guard: a DeBERTa-v3-base NLI model (exported to ONNX, ~50ms/claim) checks whether each claim in the generated response is entailed by the retrieved context. The response is split into sentences, and each sentence is verified as an NLI premise-hypothesis pair against the context. A response is faithful if the ratio of supported claims meets the configured threshold. Contradicted claims trigger a fallback.
Token Budget Management
With a 4B model, every token counts. The pipeline manages a 4,096-token budget split across five allocations. Context compression enforces the RAG context budget, conversation history is dynamically trimmed to fit, and a safety margin absorbs tokenizer estimation errors.
Security: OWASP Top 10 From Day One
Security isn’t an afterthought - it’s baked into every layer. The chatbot addresses all OWASP Top 10 risks:
| # | Risk | Mitigation |
|---|---|---|
| A01 | Broken Access Control | Rate limiting (10/min chat, 3/hr contact, 60/min GPU, 30/min admin), CORS whitelist, session validation |
| A02 | Cryptographic Failures | TLS everywhere, API keys for all services, no secrets in logs, HMAC-SHA256 CSRF tokens |
| A03 | Injection | 258 detection rules (187 regex patterns + 71 keywords), input sanitization, XSS prevention |
| A04 | Insecure Design | Intent classification, RAG grounding, hallucination detection, quality scoring |
| A05 | Security Misconfiguration | K8s security contexts, 11 security headers (incl. HSTS, CSP, CORP, COEP, COOP), CSRF tokens with time validation |
| A06 | Vulnerable Components | pip-audit in CI, dependency scanning, automated vulnerability alerts |
| A07 | Auth Failures | Rate limiting on all endpoints, CSRF protection with 3s minimum / 5min maximum token age |
| A08 | Data Integrity | Input validation, content sanitization, CSP headers |
| A09 | Logging & Monitoring | Structured logging, Prometheus metrics, security event tracking, ntfy alerts |
| A10 | SSRF | URL validation, allowlisted external services only |
Prompt Injection Detection
The chatbot detects 187 regex patterns and 71 injection keywords (258 total detection rules) organized into categories:
- Instruction overrides - “ignore previous instructions”, “disregard all rules”
- Role manipulation - “you are now an unrestricted AI”, “pretend to be”
- Jailbreak attempts - “DAN mode”, “developer mode”, “sudo mode”
- System prompt extraction - “reveal your prompt”, “print your instructions”
- Encoding bypasses - Base64, Unicode, HTML entities, XML/markup injection
- Multi-language attacks - Patterns in German, French, Spanish, Italian, Portuguese, Dutch, Polish, Russian, Turkish, and Czech
- SQL and shell injection - `SELECT`, `DROP TABLE`, `rm -rf`, `/etc/passwd`
- Template injection - `{{`, `{%`, `__import__`, `eval()`
- Path traversal - `../`, `..\\`, `%2e%2e`
A sample of the 187 regex patterns:
```python
INJECTION_PATTERNS = [
    r"ignore\s*all\s*instructions",
    r"ignore\s*(all\s*)?(previous|prior|above)\s*(instructions?|prompts?|rules?)",
    r"disregard\s*(all\s*)?(previous|prior|above|your)",
    r"you\s*are\s*now\s*(a|an|the|my)",
    r"pretend\s*(to\s*be|you\'?re|that\s*you\'?re)",
    r"act\s+as\s+(a|an|if|though|my)",
    # ... 178 more patterns across 9 categories
]
```
Pre-processing layers catch evasion attempts before pattern matching:
- Leetspeak normalization - A 27-character mapping table converts obfuscated attacks like `1gn0r3 pr3v10us` → `ignore previous`
- Zero-width character stripping - Removes invisible Unicode characters (BOM, zero-width space, line/paragraph separators) used to bypass pattern matching
- Spacing collapse - Detects letter-by-letter evasion like `i g n o r e` → `ignore`
Detected injections are blocked immediately with a security log entry - no LLM call is made.
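Those pre-processing layers chain naturally into one normalization pass. This sketch uses a small subset of the 27-character leetspeak table; the character choices and function name are illustrative:

```python
import re
import unicodedata

# Illustrative subset of the leetspeak mapping table.
LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                          "5": "s", "7": "t", "@": "a", "$": "s"})
# Zero-width space/joiners, line & paragraph separators, BOM.
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u2028\u2029\ufeff]")

def normalize_for_matching(text):
    text = unicodedata.normalize("NFKC", text)   # fold compatibility characters
    text = ZERO_WIDTH.sub("", text)              # strip invisible characters
    text = text.translate(LEET_MAP)              # 1gn0r3 -> ignore
    # Collapse letter-by-letter evasion: "i g n o r e" -> "ignore"
    text = re.sub(r"\b(?:\w ){2,}\w\b", lambda m: m.group(0).replace(" ", ""), text)
    return text.lower()
```

The injection regexes then run against the normalized text, so an attack has to beat normalization *and* 258 rules.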
Contact Form Spam Protection
The system has two contact endpoints - /api/chat/contact (chatbot widget) and /api/contact (portfolio page) - both with multi-layer spam detection. The portfolio contact form, consolidated into the chatbot backend as a Flask Blueprint, has the most comprehensive anti-spam pipeline:
- CSRF Double-Submit Cookie - HMAC-SHA256 signed token + HttpOnly cookie with embedded server timestamp. The signature and timestamp travel in a single cookie (`{signature}|{server_ts}`), validated on every submission.
- Proof-of-Work challenge - The frontend solves a SHA-256 hash puzzle (finding a nonce that produces a hash starting with `0000`) before submission. Requests without a valid solution are flagged.
- Timestamp bot detection - The server timestamp embedded in the CSRF cookie enforces a 3-second minimum between token generation and form submission. Automated tools that submit instantly are caught.
- Honeypot fields - 5 hidden form fields (`website`, `phone`, `email`, `url`, `fax`) that only bots fill in. Any non-empty value silently rejects the submission.
- Spam pattern detection - Regex-based content scanning for spam keywords (viagra, casino, crypto, etc.), excessive URL injection (>2 URLs), character repetition, all-caps runs, and special character floods.
- Unicode sanitization - NFKC normalization strips invisible zero-width characters, line/paragraph separators, and soft hyphens that could bypass pattern matching.
- Disposable email blocking - Rejects 24 known throwaway domains (tempmail.com, guerrillamail.com, mailinator.com, yopmail.com, sharklasers.com, etc.)
- RFC 5322 email validation - Full regex pattern compliance, not just `@` checking. Local part minimum length enforced.
- Email header injection prevention - Strips `\r` and `\n` from all header-injectable fields before constructing the email.
- Input length validation - Name (200 chars), subject (200 chars), message (10,000 chars), email (254 chars per RFC).
- Rate limiting - 3 submissions per hour (chatbot contact), tracked with thread-safe eviction. Portfolio contact is protected by the CSRF token flow, timestamp validation, and PoW challenge instead of per-IP counting.
- Origin/Referer validation - Every POST is checked against the allowlisted origins before processing.
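The proof-of-work exchange is worth sketching because both sides are tiny. The function names are illustrative; the puzzle is exactly as described - find a nonce whose SHA-256 over challenge + nonce starts with `0000`:

```python
import hashlib
import itertools

DIFFICULTY = "0000"   # 4 leading hex zeros ≈ 65,536 expected hash attempts

def solve_pow(challenge):
    """Client side: brute-force a nonce satisfying the difficulty target."""
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(DIFFICULTY):
            return nonce

def verify_pow(challenge, nonce):
    """Server side: a single hash re-checks the client's work."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith(DIFFICULTY)
```

The asymmetry is the point: the browser burns a fraction of a second of CPU, while the server verifies with one hash - cheap for humans, expensive for bulk spam bots.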
Here’s what happens when you try to bypass these layers, as seen in real curl requests against the live API.
Security Headers
Every API response includes 11 hardened security headers — Content-Security-Policy, X-Content-Type-Options, X-Frame-Options, X-XSS-Protection, Referrer-Policy, Permissions-Policy, Strict-Transport-Security, X-Permitted-Cross-Domain-Policies, Cross-Origin-Embedder-Policy, Cross-Origin-Opener-Policy, and Cross-Origin-Resource-Policy. Additionally, the portfolio contact API sets Cache-Control: no-store, no-cache and Pragma: no-cache on all responses, and SSE streams use Cache-Control: no-cache to prevent sensitive data caching.
The Frontend: Accessible Chat Widget
The chat widget is built with vanilla JavaScript - no React, no Vue, no npm dependencies. It consists of a single HTML partial, one CSS file, and one JS file, embedded into the Hugo portfolio site as a Git submodule.
Design principles:
- Accessibility - WCAG 2.1 AA compliant, full keyboard navigation, screen reader support, ARIA labels
- Dark theme - Matches the portfolio’s glassmorphism design
- Responsive - Works on desktop, tablet, and mobile
- Lightweight - Zero npm dependencies, no build step required
- Session persistence - Chat history survives page navigation via sessionStorage
- Streaming - Token-by-token response streaming via SSE (Server-Sent Events)
Why SSE instead of WebSocket? Because Cloudflare Tunnel’s free plan blocks WebSocket over HTTP/2. SSE provides the same streaming experience without the protocol limitations. CSRF tokens (HMAC-SHA256, 3s minimum / 5min maximum age) protect all state-changing operations.
The widget automatically detects the user’s language and responds accordingly. Follow-up suggestions adapt to the conversation context:
On mobile, the layout adapts with touch-optimized input and GPU stats in the footer bar:
Intent classification recognizes 7 types. Questions like “How can I contact Marcus?” bypass the RAG pipeline entirely and route directly to the contact form:
Infrastructure
K8s Node: 13 Pods on K3s
A K3s single-node setup with an NVIDIA RTX 3060 runs all AI services:
| Pod | Purpose | Notes |
|---|---|---|
| vLLM | LLM inference | GPU-accelerated, OpenAI-compatible API |
| Qdrant | Vector database | Persistent storage, API key auth |
| TEI Embeddings | Text embeddings | Jina Embeddings DE v2, 768 dimensions |
| TEI Reranker | Cross-encoder reranking | Relevance scoring for retrieved chunks |
| Langflow | Content indexing | Webhook-triggered scraping pipeline |
| Langfuse (web + worker) | LLM tracing | Self-hosted observability for every LLM call |
| ClickHouse | Analytics DB | Langfuse trace storage |
| Redis | Cache | Langfuse job queue |
| MinIO | Object storage | Langfuse media/attachment storage |
| PostgreSQL | Database | Langflow + Keycloak + Langfuse storage |
| Keycloak | SSO / OIDC | Single sign-on for all web UIs |
| OAuth2 Proxy | Auth proxy | Protects Langflow UI |
All pods run with security contexts - non-root users, dropped capabilities, read-only filesystems where possible. 13 Network Policies enforce default-deny with selective allows: pods can only communicate with the services they explicitly need. Internet egress excludes private IP ranges to prevent lateral movement.
www-server: Docker Compose
The production website runs as Docker Compose with three core containers:
- portfolio - Nginx serving the Hugo static site, proxying all `/api/*` requests to the backend
- chatbot - Unified Flask backend (port 5005) - handles both the RAG chatbot API and the portfolio contact form
- cloudflared - Cloudflare Tunnel for zero-trust ingress (no exposed ports)
Plus monitoring (node-exporter, cAdvisor), analytics (Umami + umami-db), and a GitLab Runner for CI/CD - 8 containers total.
This is the result of an API consolidation - the contact form originally ran as a separate Flask container on port 5000, but since both backends shared the same SMTP configuration, security middleware, and CORS setup, I merged them into a single container. One Nginx location, one backend, fewer moving parts. The portfolio contact form is registered as a Flask Blueprint alongside the chatbot blueprint, sharing the same rate limiter and security headers.
The chatbot uses a volume-mount deployment model - the Docker image contains only Python dependencies, while the application code (configs/) is mounted as a read-only volume at runtime. Code changes deploy in ~2 seconds via rsync + container restart, with no image rebuild needed. The Docker image is only rebuilt when requirements-api.txt, Dockerfile, or .dockerignore change.
CI/CD: Three Repos, Three Pipelines, Zero Manual Steps
Every deployment is fully automated via GitLab CI/CD. The only manual step is clicking the deploy button in GitLab - code deploys via rsync to the volume mount in seconds.
Chatbot Pipeline (6 Stages)
| Stage | Job | What It Does |
|---|---|---|
| lint | lint:python | Ruff linter + formatter check |
| test | test:unit, test:security, test:recruiting | 1,825 tests with coverage reporting |
| audit | audit:dependencies, audit:sast | pip-audit vulnerability scan + Bandit SAST |
| build | build:docker | Build image, push to GitLab Container Registry (only on dependency changes, see below) |
| deploy | deploy:www-server | Rsync code to www-server, restart container (manual trigger) |
| post-deploy | test:smoke:post-deploy, rescrape:website | Smoke tests (DE+EN), auto-rescrape on portfolio trigger |
Cross-Repo Triggers
When the portfolio website is deployed, it automatically triggers the chatbot pipeline with a RESCRAPE=true variable. The chatbot pipeline then calls the Langflow webhook to re-scrape the website content and verifies the vector count in Qdrant. The chatbot’s knowledge base stays current without any manual intervention.
Test Coverage
The test suite covers 1,825 test cases across 30 test files:
The unit and integration tests run without external dependencies - 1,748 passed, 17 skipped (tests requiring a live GPU connection or Langflow integration). The remaining 60 tests in test_live_api.py run against the production endpoint - verifying health checks, intent classification, security headers, CORS, and end-to-end chat flow on the live system.
End-to-End Testing: Playwright
Beyond unit and integration tests, 85 Playwright E2E tests across 7 specs validate the chat widget in real browsers:
| Spec | Focus | What It Covers |
|---|---|---|
| chat-widget | Core | Widget open/close, message send/receive, SSE streaming, session persistence |
| debug-layout | Layout | Debug panel rendering, Tokyo Night theme, reranker info display |
| full-device-audit | Devices | Cross-device rendering across mobile and desktop viewports |
| gpu-panel | GPU | GPU stats bar, real-time updates, VRAM/temperature/power display |
| screenshot-audit | Visual | Screenshot consistency, visual regression detection |
| smoke | Critical | DE+EN flow, health check, basic chat roundtrip |
| wow-features | Advanced | Follow-up suggestions, language switching, contact intent routing |
Tests run against multiple viewports - Desktop (1280×720), Pixel 7, Galaxy S24, and iPhone 15 Pro - ensuring the widget works across devices. The Playwright suite runs separately from pytest, giving 1,910 total automated tests (1,825 pytest + 85 Playwright).
Monitoring & Observability
Grafana Dashboards
All metrics feed into Grafana via Prometheus. The kube-prometheus-stack deployment provides 30 dashboards out of the box - Kubernetes compute resources per cluster, namespace, pod, and node, plus CoreDNS, Alertmanager, and Kubelet monitoring:
Cluster resource utilization is tracked per node, showing CPU usage, memory consumption, and pod resource allocation across the K3s single-node setup:
Two custom dashboards complement the built-in Kubernetes metrics: the AI Platform - Overview dashboard tracks chatbot-specific metrics (request volume, response times, quality scores, LLM failover status, GPU utilization), while the Portfolio - www.pichler.dev dashboard monitors the production website (uptime probes, SSL certificate expiry, container resource usage):
LLM Tracing with Langfuse
Infrastructure metrics tell you if things are running. LLM tracing tells you how well your AI is actually responding. Langfuse is a self-hosted, open-source observability platform for LLM applications - the self-hosted alternative to LangSmith.
Every LLM call is automatically traced via a LangChain CallbackHandler injected at the ChatOpenAI constructor level. Each trace captures the full system prompt, user query, retrieved context, generated response, token counts, and latency - all without touching application code beyond a single import.
Langfuse runs as 5 pods on the K8s cluster (web, worker, ClickHouse, Redis, MinIO) with tight resource limits - adding less than 1 GiB to the cluster’s memory footprint. The SDK auto-configures via environment variables, so tracing is enabled in production and silently disabled in development.
GPU Metrics API
A dedicated /api/chat/gpu-stats endpoint exposes real-time GPU telemetry as JSON, and an SSE stream (/api/chat/gpu-stats-stream) pushes ~2-3 updates per second. The data comes from three sources: the vLLM metrics endpoint, nvidia-smi, and the DCGM exporter. The chat widget displays a live GPU status bar at the bottom - VRAM usage, temperature, power draw, and inference model - giving visitors transparency into the hardware running their queries.
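The SSE side of the stream is just plain-text framing. A minimal sketch of turning one telemetry sample into an SSE frame - the field names and event name here are illustrative, not the actual endpoint's schema:

```python
import json

def sse_event(stats: dict, event: str = "gpu-stats") -> str:
    """Serialize one telemetry sample as a Server-Sent Events frame.

    SSE frames are 'event:'/'data:' lines terminated by a blank line;
    the browser's EventSource API parses them natively.
    """
    return f"event: {event}\ndata: {json.dumps(stats)}\n\n"

# A Flask view would yield frames like this from a generator with
# mimetype "text/event-stream"; here we just build a single frame.
frame = sse_event({"vram_used_mib": 9216, "temp_c": 61, "power_w": 148})
```

The trailing blank line is what delimits events, which is why SSE survives any proxy that can stream chunked text.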
Alerting via ntfy
Critical events trigger push notifications via a self-hosted ntfy server:
- LLM failover activated / recovered
- High error rate detected
- Service health check failures
Alerts are sent via HTTP POST with severity tags - no external notification service needed, no subscription fees. Each alert includes the service name, environment, timestamp, and relevant context fields (error details, affected providers, duration).
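A publish of this shape needs nothing beyond the standard library. The sketch below uses a placeholder server URL and topic name; ntfy reads the title, priority, and tags from HTTP headers, while the body carries the message text:

```python
import urllib.request

def build_alert(server: str, topic: str, title: str, message: str,
                priority: str = "high",
                tags: str = "rotating_light") -> urllib.request.Request:
    """Build an ntfy publish request: the body is the message text,
    metadata travels in the Title/Priority/Tags headers."""
    return urllib.request.Request(
        url=f"{server}/{topic}",
        data=message.encode("utf-8"),
        headers={"Title": title, "Priority": priority, "Tags": tags},
        method="POST",
    )

# Sending is a one-liner (not executed here):
# urllib.request.urlopen(build_alert(
#     "https://ntfy.example.internal", "chatbot-alerts",
#     "LLM failover activated",
#     "Primary vLLM unreachable; OpenRouter backup active"))
```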
A/B Testing
An experiment framework enables testing different prompt styles, temperature settings, and context chunk counts. Sessions are assigned to variants via consistent hashing (same session always gets the same variant), with per-variant tracking of quality scores, response times, and user feedback.
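Sticky variant assignment boils down to a stable hash of the session ID. A minimal sketch - the bucket scheme and variant names are illustrative, not the experiment framework's actual configuration:

```python
import hashlib

VARIANTS = ["concise-prompt", "detailed-prompt", "high-temperature"]

def assign_variant(session_id: str, variants=VARIANTS) -> str:
    """Map a session to a variant deterministically.

    hashlib (unlike Python's built-in hash(), which is salted per
    process) is stable across restarts, so the same session always
    lands in the same bucket - a prerequisite for clean per-variant
    quality and latency metrics.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```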
Lessons Learned
1. Consumer GPUs are viable for production AI. The RTX 3060 handles Gemma 3 4B IT with ~113ms time-to-first-token. For a personal portfolio chatbot, that’s more than enough - and the ongoing cost is just electricity (~17W idle, ~150W under load, roughly €3-5/month). No per-token pricing, no usage caps.
2. RAG beats fine-tuning for domain-specific Q&A. Instead of fine-tuning a model on my resume, I index my website and retrieve relevant chunks at query time. Content updates are instant - no retraining needed, just trigger a re-scrape.
3. Security must be first, not last. Adding prompt injection detection after the fact is painful. Building it into the LangGraph pipeline from day one - as its own validation node - made it natural and testable. 258 detection rules didn’t appear overnight; they grew from production experience.
4. SSE beats WebSocket for Cloudflare Tunnel. Cloudflare’s free plan blocks WebSocket over HTTP/2. SSE provides the same streaming UX without the protocol limitations. One line change in the transport layer, zero UX regression.
5. Three repos beat a monorepo for mixed infrastructure. Separating the chatbot code, K8s manifests, and portfolio website into three repos with cross-triggers keeps each pipeline focused and independently deployable. A K8s manifest change doesn’t trigger chatbot tests.
6. Always check the proxy chain. The CORS preflight issue that took hours to debug? Nginx in the portfolio container was blocking OPTIONS requests before they reached Flask. The chatbot’s CORS middleware was correct - the problem was upstream.
7. Automate everything - including content updates. The portfolio deploy triggers a re-scrape automatically. I never have to remember to update the chatbot’s knowledge base. If it’s not automated, it’s not reliable.
8. Consolidate when the overlap is obvious. The portfolio contact form and the chatbot API shared SMTP config, security middleware, CORS setup, and rate limiting. Running them as separate containers doubled the maintenance surface for zero benefit. Merging them into a single Flask app with two Blueprints halved the container count on the www-server and simplified the Nginx proxy from two location blocks to one.
Built With Claude Code
This entire project - backend, frontend, infrastructure, CI/CD, security, tests, and this blog post - was built using Vibe Coding with Claude Code (Claude Opus).
My role: Architect and Product Owner. I defined requirements, made architecture decisions, reviewed every result, and steered direction. Claude executed - writing code, debugging infrastructure issues, configuring Kubernetes manifests, setting up CI/CD pipelines, and running comprehensive tests.
Some highlights of the AI-assisted development:
- LangGraph state machine - Claude designed and implemented the 10-node chat flow graph, including the context sufficiency evaluation loop, the 187-pattern injection detection system, and the NLI-based faithfulness checker
- 3-repo CI/CD architecture - From Dockerfiles to GitLab pipeline configs to cross-repo triggers, fully automated with zero manual deployment steps
- Kubernetes debugging - When Langflow's API returned cryptic errors (ValueError: 'display_name'), Claude traced it through the API, found the root cause (incomplete component templates in the flow JSON), and fixed the creation script
- CORS preflight fix - A 405 error on OPTIONS requests went through three layers of debugging (Flask → Nginx → Cloudflare) before Claude identified the single line in nginx.conf blocking the request method
- 1,910 automated tests - 1,825 pytest + 85 Playwright E2E tests covering API contracts, injection detection, session security, input sanitization, CSRF validation, spam detection, alerting, faithfulness checking, context sufficiency, and recruiting pattern recognition
This is not “AI replacing developers” - it’s AI amplifying an architect’s capabilities. I couldn’t have built this entire stack in the time I did without Claude. And Claude couldn’t have built it without someone making the right architecture decisions, reviewing code, and knowing when to push back.
The best results come from human judgment + AI execution.
What’s Next
Since publication, two major features have shipped:
- Hybrid Search ✓ - BM25 sparse vectors now run alongside dense embeddings, fused via Reciprocal Rank Fusion. Exact term matches and proper nouns that semantic search alone missed are now reliably retrieved.
- Query Rewriting ✓ - Vague first queries like "tell me about him" are rewritten by the LLM into specific, self-contained questions before retrieval, dramatically improving first-turn recall.
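Reciprocal Rank Fusion itself is only a few lines: each document earns 1/(k + rank) per result list it appears in, and the summed scores decide the merged order. A sketch with made-up document IDs (k = 60 is the commonly used default, not necessarily what the pipeline configures):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists via Reciprocal Rank Fusion.

    A document near the top of either the dense (semantic) or the
    sparse (BM25) ranking accumulates a large 1/(k + rank) score,
    so exact-term hits and semantic hits both surface.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["about-me", "k8s-setup", "contact"]    # hypothetical semantic ranking
sparse = ["k8s-setup", "projects", "about-me"]  # hypothetical BM25 ranking
fused = rrf_fuse([dense, sparse])
```

Because only ranks matter, RRF needs no score normalization between the two retrievers - a key reason it is the standard fusion choice for hybrid search.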
Still on the roadmap:
- Eval Pipeline - Golden-set benchmarking with curated question-answer pairs to measure retrieval quality, response accuracy, and regression detection across pipeline changes
- Streaming Citations - Inline source references in streamed responses, linking claims to specific RAG chunks in real-time
- Larger model - When VRAM allows (or with quantization), moving to a 12B+ model for more nuanced responses
The Stack
| Layer | Technology |
|---|---|
| LLM | vLLM + Gemma 3 4B IT (RTX 3060, 12 GB VRAM) |
| Embeddings | TEI + Jina Embeddings DE v2 (768 dimensions) |
| Vector DB | Qdrant (45 chunks, Cosine similarity, hybrid dense+sparse) |
| Orchestration | LangGraph StateGraph (10 nodes, 7 intent types) |
| API | Flask + Gunicorn (27 endpoints, 2 Blueprints) |
| Frontend | Vanilla JS (zero dependencies, WCAG 2.1 AA) |
| Content Indexing | Langflow (webhook-triggered) |
| Reranking | Cross-Encoder via TEI |
| Auth | Keycloak + OAuth2 Proxy |
| Quality | 8 garbage checks + 10 hallucination checks + NLI faithfulness + confidence scoring |
| Security | 258 injection rules, OWASP Top 10, 11 security headers, PoW challenges |
| Failover | OpenRouter (GPT-4o-mini) automatic backup |
| Alerting | ntfy (self-hosted push notifications) |
| Monitoring | Prometheus + Grafana + DCGM Exporter |
| LLM Tracing | Langfuse (self-hosted, OpenTelemetry) |
| Infrastructure | K3s (13 pods) + Docker Compose |
| CI/CD | GitLab (3 repos, cross-triggers, container registry) |
| Ingress | Cloudflare Tunnel (zero-trust, no exposed ports) |
| Testing | 1,825 pytest + 85 Playwright E2E, pip-audit, Bandit SAST |
Try it yourself - the chat widget is live at pichler.dev. Feel free to reach out if you want to discuss AI infrastructure, RAG pipelines, or self-hosted LLMs.