TL;DR

I built a Retrieval-Augmented Generation (RAG) chatbot that answers questions about my professional background using my portfolio website as the single source of truth. It runs entirely on my own hardware - a K3s single-node setup with an NVIDIA RTX 3060 - with no cloud AI dependencies. The stack includes vLLM for GPU-accelerated inference, LangGraph for orchestration, Qdrant for vector search, and a Flask API with OWASP-aligned security. Everything is deployed via GitLab CI/CD across three repositories, with automated content indexing, monitoring, and zero manual intervention.

At a glance:

  • 45 content chunks
  • 27 API endpoints
  • 258 injection rules
  • 1,910 automated tests
  • 10 LangGraph nodes
  • 10/10 OWASP Top 10 risks covered

Why Build Your Own Chatbot?

Every portfolio website has the same problem: visitors need to find specific information fast. “What technologies does this person know?” “Do they have Kubernetes experience?” “How do I contact them?”

Instead of hoping visitors click through the right pages, I built an AI assistant that answers these questions instantly - grounded in actual website content, not hallucinations.

But why self-hosted? Three reasons:

  1. Privacy - No visitor data leaves my infrastructure. No OpenAI, no Anthropic API, no third-party inference.
  2. Cost - After the initial hardware investment, the only ongoing cost is electricity (~17W idle, ~150W under load). No per-token pricing, no usage caps.
  3. Learning - Building the entire stack from GPU drivers to production deployment taught me more about AI infrastructure than any course could.

The portfolio website at pichler.dev - the chatbot’s single source of truth


Architecture

The system runs on two VMs in separate VLANs on a Proxmox hypervisor, managed across three Git repositories:

Servers:

| Server | Network | Role |
| --- | --- | --- |
| www-server | DMZ | Docker Compose - portfolio website, unified backend API, Cloudflare Tunnel |
| K8s node | Internal | K3s single-node - GPU inference, vector DB, embeddings, monitoring, auth |

The two networks are deliberately isolated. The chatbot API on the www-server calls AI services on the K8s node via HTTPS through ingress endpoints - the www-server has no direct cluster access.

Repositories:

| Repository | Purpose | Deploys To |
| --- | --- | --- |
| chatbot | Backend API, frontend widget, tests | www-server (GitLab CI/CD) |
| ai-infra | Kubernetes manifests, Langflow flows, deploy scripts | K8s node (kubectl apply) |
| pichler-portfolio | Hugo website, Nginx config, Docker Compose | www-server (Hugo build + Docker image) |

Each repository has its own CI/CD pipeline. When the portfolio is updated, it automatically triggers a re-scrape of the website content into the vector database - the chatbot always has up-to-date information without manual intervention.

Architecture overview showing the complete system


The AI Stack

LLM: vLLM + Gemma 3 4B IT

The language model is Google’s Gemma 3 4B IT (instruction-tuned), served by vLLM - a high-throughput inference engine that provides an OpenAI-compatible /v1/chat/completions endpoint.

Why Gemma 3 4B? It’s the largest model that fits in 12 GB VRAM (RTX 3060) while maintaining good response quality for a focused domain. I tested the 12B variant, but it triggered out-of-memory errors on this GPU.

Real performance numbers from the production API:

  • Time to first token: 113 ms
  • VRAM used: 11.4 GB of 11.6 GB
  • Idle power draw: ~17 W

At idle the GPU draws ~17W; under inference load it peaks at ~150W.

GPU Stats API endpoint showing real-time telemetry

Automatic Failover: If the local vLLM instance becomes unavailable, the system automatically fails over to OpenRouter (GPT-4o-mini) as a cloud backup. Health checks run with configurable timeouts, and the failover state is managed thread-safely with a cooldown before attempting recovery. Every failover event and recovery triggers a push notification via ntfy.
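
The failover logic described above can be sketched as a small thread-safe state holder. This is a minimal illustration of the pattern, not the repo's code - the class and method names are mine, and the 60-second cooldown is an assumed default:

```python
import threading
import time

class FailoverState:
    """Thread-safe failover tracker with a recovery cooldown (sketch -
    class and method names are illustrative, not from the repo)."""

    def __init__(self, cooldown_s=60.0):
        self._lock = threading.Lock()
        self._cooldown_s = cooldown_s
        self._failed_at = None  # monotonic timestamp of last failure

    def record_failure(self):
        with self._lock:
            self._failed_at = time.monotonic()

    def record_recovery(self):
        with self._lock:
            self._failed_at = None

    def should_use_fallback(self):
        """True while the primary is down and the cooldown hasn't elapsed;
        once it expires, the caller retries the primary health check."""
        with self._lock:
            if self._failed_at is None:
                return False
            return (time.monotonic() - self._failed_at) < self._cooldown_s
```

On a failed vLLM health check the API would call `record_failure()` and route requests to OpenRouter; after the cooldown it probes vLLM again and calls `record_recovery()` on success.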

Embeddings: TEI + Jina Embeddings DE v2

Hugging Face Text Embeddings Inference (TEI) serves the jinaai/jina-embeddings-v2-base-de model, producing 768-dimensional vectors. It handles both the indexing pipeline (when content is scraped) and query-time embedding generation (when a user asks a question).

Vector Database: Qdrant

Qdrant stores the embedded website content. The collection uses 768 dimensions with Cosine similarity. When a user asks a question, the chatbot embeds the query via TEI, searches Qdrant for the most relevant content chunks, and feeds them to the LLM as context.
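
The query-time flow can be sketched with two plain HTTP calls - TEI's `/embed` endpoint followed by Qdrant's points-search REST endpoint. The hostnames, API key, and collection name below are placeholders, not the real ingress values:

```python
import requests

# Placeholder endpoints - the real ingress hostnames and API key differ.
TEI_URL = "https://tei.example.internal"
QDRANT_URL = "https://qdrant.example.internal"
COLLECTION = "portfolio"  # illustrative collection name

def embed(text):
    """Embed text via TEI's /embed endpoint (768-dim Jina v2 vector)."""
    resp = requests.post(f"{TEI_URL}/embed", json={"inputs": text}, timeout=5)
    resp.raise_for_status()
    return resp.json()[0]

def search(query, top_k=5):
    """Embed the query, then run a vector search via Qdrant's REST API."""
    resp = requests.post(
        f"{QDRANT_URL}/collections/{COLLECTION}/points/search",
        headers={"api-key": "<qdrant-api-key>"},
        json={"vector": embed(query), "limit": top_k, "with_payload": True},
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json()["result"]
```

The payload stored with each point carries the source URL and section title, so every hit comes back ready for grounding.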

The collection currently holds 45 content chunks - automatically scraped from 9 pages including the portfolio website and blog posts, each with metadata like source URLs and section titles for grounding. Adaptive chunking splits content at semantic boundaries (h2/h3 sections) rather than fixed character counts, producing fewer but more coherent chunks.

Qdrant collection detail showing vector configuration

Content Indexing: Langflow

Langflow handles the scraping and indexing pipeline. A custom WebsiteScraper component crawls pichler.dev, extracts content from each page, splits it into chunks, generates embeddings via TEI, and stores vectors in Qdrant with source metadata. It automatically discovers blog post URLs from the /blog/ index page, so new posts are indexed without manual configuration.

This runs as a webhook - triggered automatically by the portfolio CI/CD pipeline whenever the website is updated, or manually via make rescrape. The custom component is deployed as a Kubernetes ConfigMap and loaded into Langflow at startup.

Langflow flow editor showing the custom Website Scraper component


The RAG Pipeline: LangGraph

The heart of the chatbot is a LangGraph StateGraph with 10 nodes - a directed graph that orchestrates the entire chat flow. Unlike simple if/else chains, the graph makes the flow explicit, testable, and easy to extend.

LangGraph StateGraph visualization showing all 10 nodes and conditional edges

The 10 Nodes

  1. validate_input - Input validation, sanitization, injection detection, session verification. Malicious inputs are blocked here before any LLM call is made.

  2. classify_intent - Two-stage intent classification. First, fast regex-based pattern matching for common intents (handles ~40% of inputs without calling the LLM). Then, for ambiguous inputs, an LLM relevance check determines if the question relates to my professional profile.

  3. direct_response - Handles non-RAG intents (greetings, farewells, smalltalk, contact requests, out-of-scope) with pre-defined response templates. No LLM call needed.

  4. check_cache - Looks up the query in the response cache (exact match + semantic similarity). Cache hits skip retrieval and generation entirely, returning the cached response immediately.

  5. retrieve_context - First rewrites vague queries via LLM (e.g. “tell me about him” → “What is Marcus Pichler’s professional background?”). Then runs hybrid search: dense embeddings via TEI + BM25 sparse vectors, fused via Reciprocal Rank Fusion. Results are reranked with a cross-encoder, compressed to fit the token budget, and enriched with source metadata.

  6. evaluate_context - Checks whether the retrieved context is sufficient to answer the query. If insufficient and retry budget remains, triggers re-retrieval with a reformulated query. Three dimensions are scored: keyword coverage, document relevance, and context volume.

  7. generate_response - Sends the query + retrieved context to vLLM (or the failover provider) and generates the answer. The system prompt is recruiting-optimized and bilingual (German/English). Dynamic temperature adapts to query type (tech=0.15, recruiting=0.2, general=0.3).

  8. validate_quality - Four-layer response validation: garbage detection, hallucination detection, faithfulness checking (NLI-based), and confidence scoring. Responses that fail validation trigger a retry or fallback.

  9. cache_and_suggest - Caches the validated response and generates dynamic follow-up suggestions based on the conversation context.

  10. handle_error - Catches exceptions from any node and returns a graceful error response instead of crashing.

Here’s the actual graph definition from graph.py:

graph = StateGraph(ChatState)

graph.add_node("validate_input", validate_input)
graph.add_node("classify_intent", classify_intent)
graph.add_node("direct_response", direct_response)
graph.add_node("check_cache", check_cache)
graph.add_node("retrieve_context", retrieve_context)
graph.add_node("evaluate_context", evaluate_context)
graph.add_node("generate_response", generate_response)
graph.add_node("validate_quality", validate_quality)
graph.add_node("cache_and_suggest", cache_and_suggest)
graph.add_node("handle_error", handle_error)

The 7 Intent Types

| Intent | Example | Handling |
| --- | --- | --- |
| GREETING | “Hello”, “Hi there” | Direct response, no LLM |
| FAREWELL | “Goodbye”, “See you” | Direct response, no LLM |
| SMALLTALK | “How are you?”, “What’s your name?” | Direct response, no LLM |
| QUESTION | “What are Marcus’s skills?” | Full RAG pipeline |
| CONTACT | “How can I reach Marcus?” | Contact form redirect |
| PERSONAL | “What’s Marcus’s salary?” | Direct response (contact redirect) |
| OUT_OF_SCOPE | “What’s the weather?” | Polite redirect, no LLM |

Routing Logic

After intent classification, five routing functions determine the path through the graph:

  • route_by_intent - QUESTION and PERSONAL intents go to check_cache; all other intents go to direct_response.
  • route_cache - Cache hits skip to cache_and_suggest; cache misses proceed to retrieve_context.
  • route_sufficiency - After context evaluation: sufficient context proceeds to generate_response; insufficient context loops back to retrieve_context with a reformulated query.
  • route_quality - After generation, normal responses go to validate_quality; empty context skips to cache_and_suggest (fallback); errors go to handle_error.
  • route_retry - Failed quality checks with retry budget go back to generate_response; exhausted retries accept the response as-is via cache_and_suggest.
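
The first two routing functions are simple enough to sketch in full - each returns the name of the next node, and LangGraph wires them in via `add_conditional_edges`. The state keys here are illustrative, not the repo's exact schema:

```python
def route_by_intent(state):
    """QUESTION and PERSONAL go through the RAG path; everything else is
    answered directly. (State keys and literals are illustrative.)"""
    if state["intent"] in ("QUESTION", "PERSONAL"):
        return "check_cache"
    return "direct_response"

def route_cache(state):
    """Cache hits skip straight to suggestions; misses go to retrieval."""
    return "cache_and_suggest" if state.get("cache_hit") else "retrieve_context"

# Wired into the StateGraph roughly like:
# graph.add_conditional_edges("classify_intent", route_by_intent,
#     {"check_cache": "check_cache", "direct_response": "direct_response"})
# graph.add_conditional_edges("check_cache", route_cache,
#     {"cache_and_suggest": "cache_and_suggest", "retrieve_context": "retrieve_context"})
```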

Real API Response

Here’s what the chatbot returns for a real question — a curl against the live API at pichler.dev/api/chat:

Real API response from the chatbot showing intent classification, grounded answer, and follow-up suggestions

Every response includes the detected intent, language, follow-up suggestions, and the grounded answer. You can also stream responses token-by-token via SSE:

curl -N -X POST https://www.pichler.dev/api/chat/stream \
  -H "Content-Type: application/json" \
  -d '{"message": "What are Marcus'\''s skills?", "session_id": "demo"}'

# Output (SSE):
# data: {"type": "token", "content": "Marcus"}
# data: {"type": "token", "content": " has"}
# data: {"type": "token", "content": " extensive"}
# ...
# data: {"type": "done", "suggestions": ["What projects has Marcus worked on?"]}
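
Producing that stream server-side comes down to a generator yielding SSE frames, which Flask would wrap in a `text/event-stream` response. A minimal sketch - the function names are mine, not the repo's:

```python
import json

def sse_event(payload):
    """Format one Server-Sent Events frame."""
    return f"data: {json.dumps(payload)}\n\n"

def stream_chat(tokens, suggestions):
    """Yield token frames, then a final 'done' frame - the shape shown above.
    In Flask this generator would be wrapped in
    Response(stream_chat(...), mimetype="text/event-stream")."""
    for tok in tokens:
        yield sse_event({"type": "token", "content": tok})
    yield sse_event({"type": "done", "suggestions": suggestions})
```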

Response Quality Pipeline

Before any response reaches the user, it passes through four validation layers. This is implemented in the validate_quality node.

Quality validation pipeline with all 4 layers

1. Garbage Detection (8 Methods)

Catches LLM output failures that produce nonsensical text:

| Check | What It Catches |
| --- | --- |
| Repeated characters | 5+ identical characters in a row (e.g., “aaaaa”) |
| Repeated patterns | 2-4 character patterns repeating 3+ times |
| Repeated words | 3+ consecutive identical words or bigrams |
| Long words | Words exceeding 30 characters |
| Consonant runs | 9+ consonants without vowels |
| N-gram loops | Repeated 3-6 word phrases (the LLM getting “stuck”) |
| Question loops | 3+ similar repeated questions instead of answers |
| Foreign script | Unexpected CJK, Cyrillic, Arabic, or Devanagari characters |
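
Several of these checks fit in a handful of lines each. A hedged sketch of the repeated-character, long-word, and repeated-word detectors - thresholds match the table, but the function names are illustrative:

```python
import re

def has_repeated_chars(text, n=5):
    """n+ identical characters in a row, e.g. 'aaaaa'."""
    return re.search(r"(.)\1{%d,}" % (n - 1), text) is not None

def has_long_word(text, limit=30):
    """Any single word longer than `limit` characters."""
    return any(len(word) > limit for word in text.split())

def has_repeated_words(text, n=3):
    """n+ consecutive identical words."""
    words = text.lower().split()
    run = 1
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if prev == cur else 1
        if run >= n:
            return True
    return False
```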

2. Hallucination Detection (10 Methods)

Checks whether the response is grounded in retrieved context:

| Check | What It Catches |
| --- | --- |
| Question as answer | Response is >30% questions or starts with a question word |
| Instruction hallucination | 14 instruction-style patterns (“Remember:”, “Note:”, “As an AI…”) |
| Nonsense words | 7+ consonant-only words, repeated syllables, character spam |
| Entity presence | Response about Marcus doesn’t mention him at all |
| Generic hallucination | 20 corporate-speak patterns (“I’d be happy to help…”), triggers on 2+ matches |
| Context grounding | Mentions technologies not present in the RAG context |
| Domain hallucination | False employers, cities, or job titles not in the knowledge base |
| Prompt leakage | Detects system prompt fragments leaked into the response |
| Self-contradiction | Detects contradictory statements within the same response |
| URL hallucination | Strips fabricated URLs not present in the source material |

3. Confidence Scoring

Every response gets a confidence score starting at 1.0. Each failed check applies a multiplicative penalty:

| Check | Penalty | What It Catches |
| --- | --- | --- |
| Garbage detection | ×0.1 | Loops, repeated patterns, nonsense output |
| Question as answer | ×0.1 | Response is mostly questions instead of answers |
| Instruction hallucination | ×0.1 | “Remember:”, “Note:”, “As an AI…” patterns |
| Nonsense words | ×0.1 | Consonant-only words, repeated syllables, character spam |
| Missing entity | ×0.4 | Response about Marcus doesn’t mention him |
| Context grounding | ×0.2 | Claims not supported by RAG context |
| Too short | ×0.5 | Below minimum character threshold |
| Too long | ×0.7 | Exceeds maximum character limit |

The final decision is binary: a response is valid if confidence >= threshold AND issues <= 2. Failed responses trigger a fallback or regeneration attempt. There’s no letter-grade system in the pipeline - it’s pass/fail with a confidence score that drops sharply on any quality issue.
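
The multiplicative scheme is easy to sketch. Penalty values come from the table above; the 0.5 threshold is an assumed default, since the real one is configurable:

```python
# Penalty factors from the table above; check names are illustrative.
PENALTIES = {
    "garbage": 0.1,
    "question_as_answer": 0.1,
    "instruction_hallucination": 0.1,
    "nonsense_words": 0.1,
    "missing_entity": 0.4,
    "context_grounding": 0.2,
    "too_short": 0.5,
    "too_long": 0.7,
}

def score_confidence(failed_checks):
    """Start at 1.0 and multiply in the penalty for every failed check."""
    confidence = 1.0
    for check in failed_checks:
        confidence *= PENALTIES[check]
    return confidence

def is_valid(confidence, issue_count, threshold=0.5):
    """Binary decision: confident enough AND at most 2 issues.
    The 0.5 threshold is an assumed default."""
    return confidence >= threshold and issue_count <= 2
```

One failed grounding check already drops confidence to 0.2, so a response rarely survives more than a single serious issue.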

A separate quality scoring module runs after the pipeline for metrics and logging, computing a weighted A-F grade across 5 dimensions (relevance, completeness, coherence, grounding, brevity) - but this doesn’t affect the pipeline decision.


Advanced RAG: Beyond Basic Retrieval

The basic RAG pattern - embed query, search vectors, feed to LLM - works, but production quality demands more. Seven additional pipeline stages transform raw retrieval into reliable, grounded responses.

Advanced RAG Pipeline showing all stages from query to response

Query Rewriting

Vague or context-dependent queries like “tell me about him” or “what does he do?” fail at retrieval because there’s nothing specific to embed. The pipeline detects these cases and rewrites them via a standalone LLM call before retrieval — transforming “tell me about him” into “What is Marcus Pichler’s professional background and experience?”. This runs only for first-turn queries without conversation history, adding ~200ms but dramatically improving first-turn recall.

Hybrid Search (BM25 + Dense)

Pure semantic search struggles with exact terms — proper nouns, specific technologies, or acronyms that don’t have strong embedding representations. Hybrid search combines dense embeddings (768-dim via TEI) with BM25 sparse vectors stored directly in Qdrant. A pre-built vocabulary of 2,222 terms maps to sparse dimensions. Results from both searches are merged via Reciprocal Rank Fusion (RRF), giving the best of both worlds: semantic understanding from dense vectors and exact keyword matching from BM25.
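
RRF itself is only a few lines: each document's fused score is the sum of 1/(k + rank) over every ranking it appears in. A sketch using the conventional k = 60 from the original RRF paper - the pipeline's constant may differ:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: score(doc) = sum over lists of 1 / (k + rank).
    k = 60 follows the original RRF paper; the pipeline's value may differ."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear high in both the dense and the BM25 ranking float to the top, while a document found by only one retriever still survives with a lower fused score.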

Cross-Encoder Reranking

Vector similarity is a blunt instrument - it finds related content but doesn’t distinguish relevant from tangential. After hybrid search returns candidate chunks, a Cross-Encoder model via TEI re-scores each chunk against the actual query. This produces dramatically better ordering than cosine similarity alone, especially for nuanced questions where keyword overlap is low.

Context Compression

Not every sentence in a retrieved chunk is useful. The compression module uses embedding-based extractive compression (no LLM call, ~50ms): split documents into sentences, batch-embed them via TEI, keep only sentences above a cosine similarity threshold to the query, and enforce the token budget. Dynamic thresholds adapt per query type - recruiting queries get a lower threshold (keep more context), tech queries get a higher one (precision matters).
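
A minimal sketch of the extractive step, assuming the sentence vectors have already been batch-embedded via TEI. The threshold, budget, and chars-to-tokens estimate are illustrative values, not the pipeline's:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def compress(query_vec, sentences, sentence_vecs, threshold=0.6, token_budget=1500):
    """Keep sentences similar enough to the query, in document order, until
    the token budget is spent. Threshold and budget are illustrative; the
    real pipeline adapts them per query type."""
    kept, used = [], 0
    for sentence, vec in zip(sentences, sentence_vecs):
        if cosine(query_vec, vec) < threshold:
            continue  # sentence is off-topic for this query
        est_tokens = max(1, len(sentence) // 4)  # crude chars-to-tokens estimate
        if used + est_tokens > token_budget:
            break
        kept.append(sentence)
        used += est_tokens
    return " ".join(kept)
```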

Context Enrichment

Raw chunks lack provenance. The enrichment module prepends metadata headers to each chunk so the LLM knows where the information comes from: [From Marcus's career timeline (pichler.dev/career) — Section: Work Experience]. This is pure metadata-based enrichment (0ms latency, no LLM call) using a static page context map.

Context Sufficiency Evaluation

The most impactful addition: a pre-generation gate that evaluates whether the retrieved context is actually sufficient to answer the query before sending it to the LLM. Three dimensions are scored - keyword coverage, document relevance, and context volume:

def check_context_sufficiency(query, docs, context):
    coverage = query_keywords_found / total_keywords
    relevance = sum(doc.metadata["score"] for doc in docs) / len(docs)
    volume = token_count_score(context)

    overall = coverage * 0.4 + relevance * 0.35 + volume * 0.25
    return overall >= SUFFICIENCY_MIN_SCORE

If the score falls below the threshold and retry budget remains, the pipeline triggers re-retrieval with a reformulated query — expanding synonyms for low coverage or broadening terms for low relevance. This loop runs heuristically (~1-2ms latency, no LLM call) and catches cases where the initial retrieval missed relevant content.

Semantic Cache

The exact-match response cache misses paraphrases entirely. The semantic cache wraps it with an embedding-similarity layer: queries like “What does Marcus do?” and “Marcus’ current role?” hit the same cache entry if their embeddings have cosine similarity > 0.85. Entries expire after 30 minutes (matching the response cache TTL), with LRU eviction at 200 entries.
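
A sketch of that wrapper - cosine similarity against stored query embeddings, with TTL expiry and LRU eviction at the limits quoted above. Class and method names are mine, not the repo's:

```python
from collections import OrderedDict
import math
import time

class SemanticCache:
    """Embedding-similarity cache layer (sketch): a query hits if any stored
    entry's embedding has cosine similarity above the threshold."""

    def __init__(self, threshold=0.85, ttl_s=1800, max_entries=200):
        self.threshold = threshold
        self.ttl_s = ttl_s
        self.max_entries = max_entries
        self._entries = OrderedDict()  # query -> (vector, response, stored_at)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

    def get(self, query_vec):
        now = time.monotonic()
        for key, (vec, response, stored_at) in list(self._entries.items()):
            if now - stored_at > self.ttl_s:
                del self._entries[key]  # expired entry
                continue
            if self._cosine(query_vec, vec) > self.threshold:
                self._entries.move_to_end(key)  # refresh LRU position
                return response
        return None

    def put(self, query, query_vec, response):
        self._entries[query] = (query_vec, response, time.monotonic())
        self._entries.move_to_end(query)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used
```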

Faithfulness Check

The final guard: a DeBERTa-v3-base NLI model (exported to ONNX, ~50ms/claim) checks whether each claim in the generated response is entailed by the retrieved context. The response is split into sentences, and each sentence is verified as an NLI premise-hypothesis pair against the context. A response is faithful if the ratio of supported claims meets the configured threshold. Contradicted claims trigger a fallback.
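
The aggregation around the model is simple; here the NLI call is abstracted behind an `entails` callable (the real pipeline runs DeBERTa-v3 via ONNX), and the 0.7 ratio is an assumed default:

```python
import re

def is_faithful(response, context, entails, min_ratio=0.7):
    """Sentence-split the response and verify each claim against the context.
    `entails` abstracts the NLI model (DeBERTa-v3 via ONNX in the real
    pipeline); min_ratio is an assumed default."""
    claims = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    if not claims:
        return False
    supported = sum(1 for c in claims if entails(premise=context, hypothesis=c))
    return supported / len(claims) >= min_ratio
```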

Token Budget Management

With a 4B model, every token counts. The pipeline manages a 4,096-token budget split across five allocations:

Token budget breakdown showing allocation across pipeline components

Context compression enforces the RAG context budget. History is dynamically trimmed to fit. The safety margin absorbs tokenizer estimation errors.


Security: OWASP Top 10 From Day One

Security isn’t an afterthought - it’s baked into every layer. The chatbot addresses all OWASP Top 10 risks:

| # | Risk | Mitigation |
| --- | --- | --- |
| A01 | Broken Access Control | Rate limiting (10/min chat, 3/hr contact, 60/min GPU, 30/min admin), CORS whitelist, session validation |
| A02 | Cryptographic Failures | TLS everywhere, API keys for all services, no secrets in logs, HMAC-SHA256 CSRF tokens |
| A03 | Injection | 258 detection rules (187 regex patterns + 71 keywords), input sanitization, XSS prevention |
| A04 | Insecure Design | Intent classification, RAG grounding, hallucination detection, quality scoring |
| A05 | Security Misconfiguration | K8s security contexts, 11 security headers (incl. HSTS, CSP, CORP, COEP, COOP), CSRF tokens with time validation |
| A06 | Vulnerable Components | pip-audit in CI, dependency scanning, automated vulnerability alerts |
| A07 | Auth Failures | Rate limiting on all endpoints, CSRF protection with 3s minimum / 5min maximum token age |
| A08 | Data Integrity | Input validation, content sanitization, CSP headers |
| A09 | Logging & Monitoring | Structured logging, Prometheus metrics, security event tracking, ntfy alerts |
| A10 | SSRF | URL validation, allowlisted external services only |

Prompt Injection Detection

The chatbot detects 187 regex patterns and 71 injection keywords (258 total detection rules) organized into categories:

  • Instruction overrides - “ignore previous instructions”, “disregard all rules”
  • Role manipulation - “you are now an unrestricted AI”, “pretend to be”
  • Jailbreak attempts - “DAN mode”, “developer mode”, “sudo mode”
  • System prompt extraction - “reveal your prompt”, “print your instructions”
  • Encoding bypasses - Base64, Unicode, HTML entities, XML/markup injection
  • Multi-language attacks - Patterns in German, French, Spanish, Italian, Portuguese, Dutch, Polish, Russian, Turkish, and Czech
  • SQL and shell injection - SELECT, DROP TABLE, rm -rf, /etc/passwd
  • Template injection - {{, {%, __import__, eval()
  • Path traversal - ../, ..\\, %2e%2e

A sample of the 187 regex patterns:

INJECTION_PATTERNS = [
    r"ignore\s*all\s*instructions",
    r"ignore\s*(all\s*)?(previous|prior|above)\s*(instructions?|prompts?|rules?)",
    r"disregard\s*(all\s*)?(previous|prior|above|your)",
    r"you\s*are\s*now\s*(a|an|the|my)",
    r"pretend\s*(to\s*be|you\'?re|that\s*you\'?re)",
    r"act\s+as\s+(a|an|if|though|my)",
    # ... 178 more patterns across 9 categories
]

Pre-processing layers catch evasion attempts before pattern matching:

  • Leetspeak normalization - A 27-character mapping table converts obfuscated attacks like 1gn0r3 pr3v10us → ignore previous
  • Zero-width character stripping - Removes invisible Unicode characters (BOM, zero-width space, line/paragraph separators) used to bypass pattern matching
  • Spacing collapse - Detects letter-by-letter evasion like i g n o r e → ignore

Detected injections are blocked immediately with a security log entry - no LLM call is made.
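
A hedged sketch of these pre-processing steps - the leetspeak table below is only a subset of the real 27-character mapping, and the spacing-collapse regex is my own illustration of the idea:

```python
import re

# Subset of the 27-character leetspeak mapping described above.
LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                          "5": "s", "7": "t", "@": "a", "$": "s"})

# Invisible characters commonly used to split keywords mid-pattern.
ZERO_WIDTH = ("\u200b", "\u200c", "\u200d", "\ufeff", "\u2028", "\u2029")

def normalize(text):
    """Pre-process input before injection-pattern matching (sketch)."""
    for ch in ZERO_WIDTH:
        text = text.replace(ch, "")          # strip zero-width characters
    text = text.lower().translate(LEET_MAP)  # undo leetspeak obfuscation
    # Collapse letter-by-letter spacing such as "i g n o r e"
    return re.sub(r"\b(?:[a-z] ){2,}[a-z]\b",
                  lambda m: m.group(0).replace(" ", ""), text)
```

After normalization, the same 187 regex patterns run against both the raw and the normalized input, so obfuscation buys the attacker nothing.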

Contact Form Spam Protection

The system has two contact endpoints - /api/chat/contact (chatbot widget) and /api/contact (portfolio page) - both with multi-layer spam detection. The portfolio contact form, consolidated into the chatbot backend as a Flask Blueprint, has the most comprehensive anti-spam pipeline:

  • CSRF Double-Submit Cookie - HMAC-SHA256 signed token + HttpOnly cookie with embedded server timestamp. The signature and timestamp travel in a single cookie ({signature}|{server_ts}), validated on every submission.
  • Proof-of-Work challenge - The frontend solves a SHA-256 hash puzzle (finding a nonce that produces a hash starting with 0000) before submission. Requests without a valid solution are flagged.
  • Timestamp bot detection - The server timestamp embedded in the CSRF cookie enforces a 3-second minimum between token generation and form submission. Automated tools that submit instantly are caught.
  • Honeypot fields - 5 hidden form fields (website, phone, email, url, fax) that only bots fill in. Any non-empty value silently rejects the submission.
  • Spam pattern detection - Regex-based content scanning for spam keywords (viagra, casino, crypto, etc.), excessive URL injection (>2 URLs), character repetition, all-caps runs, and special character floods.
  • Unicode sanitization - NFKC normalization strips invisible zero-width characters, line/paragraph separators, and soft hyphens that could bypass pattern matching.
  • Disposable email blocking - Rejects 24 known throwaway domains (tempmail.com, guerrillamail.com, mailinator.com, yopmail.com, sharklasers.com, etc.)
  • RFC 5322 email validation - Full regex pattern compliance, not just @ checking. Local part minimum length enforced.
  • Email header injection prevention - Strips \r and \n from all header-injectable fields before constructing the email.
  • Input length validation - Name (200 chars), subject (200 chars), message (10,000 chars), email (254 chars per RFC).
  • Rate limiting - 3 submissions per hour (chatbot contact), tracked with thread-safe eviction. Portfolio contact is protected by the CSRF token flow, timestamp validation, and PoW challenge instead of per-IP counting.
  • Origin/Referer validation - Every POST is checked against the allowlisted origins before processing.
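
The proof-of-work puzzle has the classic hashcash shape - the same hash check runs on both sides, so a sketch of solver and verifier fits in a few lines. Function names and the nonce format are illustrative:

```python
import hashlib
from itertools import count

def solve_pow(challenge, difficulty=4):
    """Client side: find a nonce so sha256(challenge + nonce) starts with
    `difficulty` hex zeros. Names and nonce format are illustrative."""
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce

def verify_pow(challenge, nonce, difficulty=4):
    """Server side: re-hash once and check the prefix."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

The asymmetry is the point: the browser burns a fraction of a second brute-forcing the nonce, while the server verifies it with a single hash.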

Here’s what happens when you try to bypass these layers — real curl requests against the live API:

Security defense layers in action — Origin rejection, honeypot trap, CSRF/PoW check, prompt injection and SQL injection detection

Security Headers

Every API response includes 11 hardened security headers — Content-Security-Policy, X-Content-Type-Options, X-Frame-Options, X-XSS-Protection, Referrer-Policy, Permissions-Policy, Strict-Transport-Security, X-Permitted-Cross-Domain-Policies, Cross-Origin-Embedder-Policy, Cross-Origin-Opener-Policy, and Cross-Origin-Resource-Policy. Additionally, the portfolio contact API sets Cache-Control: no-store, no-cache and Pragma: no-cache on all responses, and SSE streams use Cache-Control: no-cache to prevent sensitive data caching.

Security headers returned by the API


The Frontend: Accessible Chat Widget

The chat widget is built with vanilla JavaScript - no React, no Vue, no npm dependencies. It consists of a single HTML partial, one CSS file, and one JS file, embedded into the Hugo portfolio site as a Git submodule.

Design principles:

  • Accessibility - WCAG 2.1 AA compliant, full keyboard navigation, screen reader support, ARIA labels
  • Dark theme - Matches the portfolio’s glassmorphism design
  • Responsive - Works on desktop, tablet, and mobile
  • Lightweight - Zero npm dependencies, no build step required
  • Session persistence - Chat history survives page navigation via sessionStorage
  • Streaming - Token-by-token response streaming via SSE (Server-Sent Events)

Why SSE instead of WebSocket? Because Cloudflare Tunnel’s free plan blocks WebSocket over HTTP/2. SSE provides the same streaming experience without the protocol limitations. CSRF tokens (HMAC-SHA256, 3s minimum / 5min maximum age) protect all state-changing operations.

Chat widget running on pichler.dev

The widget automatically detects the user’s language and responds accordingly. Follow-up suggestions adapt to the conversation context:

Follow-up conversation about Kubernetes experience

German language chat about Kubernetes experience

On mobile, the layout adapts with touch-optimized input and GPU stats in the footer bar:

German language response on mobile

Intent classification recognizes 7 types. Questions like “How can I contact Marcus?” bypass the RAG pipeline entirely and route directly to the contact form:

Contact intent redirecting to the contact form


Infrastructure

K8s Node: 13 Pods on K3s

A K3s single-node setup with an NVIDIA RTX 3060 runs all AI services:

| Pod | Purpose | Notes |
| --- | --- | --- |
| vLLM | LLM inference | GPU-accelerated, OpenAI-compatible API |
| Qdrant | Vector database | Persistent storage, API key auth |
| TEI Embeddings | Text embeddings | Jina Embeddings DE v2, 768 dimensions |
| TEI Reranker | Cross-encoder reranking | Relevance scoring for retrieved chunks |
| Langflow | Content indexing | Webhook-triggered scraping pipeline |
| Langfuse (web + worker) | LLM tracing | Self-hosted observability for every LLM call |
| ClickHouse | Analytics DB | Langfuse trace storage |
| Redis | Cache | Langfuse job queue |
| MinIO | Object storage | Langfuse media/attachment storage |
| PostgreSQL | Database | Langflow + Keycloak + Langfuse storage |
| Keycloak | SSO / OIDC | Single sign-on for all web UIs |
| OAuth2 Proxy | Auth proxy | Protects Langflow UI |

All pods run with security contexts - non-root users, dropped capabilities, read-only filesystems where possible. 13 Network Policies enforce default-deny with selective allows: pods can only communicate with the services they explicitly need. Internet egress excludes private IP ranges to prevent lateral movement.

Kubernetes Network Policies enforcing pod isolation

13 Kubernetes pods running in the ai-platform namespace

www-server: Docker Compose

The production website runs as Docker Compose with three core containers:

  • portfolio - Nginx serving the Hugo static site, proxying all /api/* requests to the backend
  • chatbot - Unified Flask backend (port 5005) - handles both the RAG chatbot API and the portfolio contact form
  • cloudflared - Cloudflare Tunnel for zero-trust ingress (no exposed ports)

Plus monitoring (node-exporter, cAdvisor), analytics (Umami + umami-db), and a GitLab Runner for CI/CD - 8 containers total.

This is the result of an API consolidation - the contact form originally ran as a separate Flask container on port 5000, but since both backends shared the same SMTP configuration, security middleware, and CORS setup, I merged them into a single container. One Nginx location, one backend, fewer moving parts. The portfolio contact form is registered as a Flask Blueprint alongside the chatbot blueprint, sharing the same rate limiter and security headers.

The chatbot uses a volume-mount deployment model - the Docker image contains only Python dependencies, while the application code (configs/) is mounted as a read-only volume at runtime. Code changes deploy in ~2 seconds via rsync + container restart, with no image rebuild needed. The Docker image is only rebuilt when requirements-api.txt, Dockerfile, or .dockerignore change.

Docker containers running on the production www-server

Health check endpoint showing all services operational


CI/CD: Three Repos, Three Pipelines, Zero Manual Steps

Every deployment is fully automated via GitLab CI/CD. The only manual step is clicking the deploy button in GitLab - code deploys via rsync to the volume mount in seconds.

Chatbot Pipeline (6 Stages)

| Stage | Job | What It Does |
| --- | --- | --- |
| lint | lint:python | Ruff linter + formatter check |
| test | test:unit, test:security, test:recruiting | 1,825 tests with coverage reporting |
| audit | audit:dependencies, audit:sast | pip-audit vulnerability scan + Bandit SAST |
| build | build:docker | Build image, push to GitLab Container Registry (only on dependency changes, see below) |
| deploy | deploy:www-server | Rsync code to www-server, restart container (manual trigger) |
| post-deploy | test:smoke:post-deploy, rescrape:website | Smoke tests (DE+EN), auto-rescrape on portfolio trigger |

GitLab CI/CD pipeline with all 6 stages

GitLab Container Registry for the chatbot image

Cross-Repo Triggers

When the portfolio website is deployed, it automatically triggers the chatbot pipeline with a RESCRAPE=true variable. The chatbot pipeline then calls the Langflow webhook to re-scrape the website content and verifies the vector count in Qdrant. The chatbot’s knowledge base stays current without any manual intervention.

Portfolio pipeline triggering the chatbot rescrape job

Test Coverage

The test suite covers 1,825 test cases across 30 test files:

Test suite overview showing all 30 test files sorted by count

The unit and integration tests run without external dependencies - 1,748 passed, 17 skipped (tests requiring a live GPU connection or Langflow integration). The remaining 60 tests in test_live_api.py run against the production endpoint - verifying health checks, intent classification, security headers, CORS, and end-to-end chat flow on the live system.

Terminal showing pytest output with all tests passing

Coverage report showing per-file test coverage with color-coded percentages

End-to-End Testing: Playwright

Beyond unit and integration tests, 85 Playwright E2E tests across 7 specs validate the chat widget in real browsers:

| Spec | Tests | What It Covers |
| --- | --- | --- |
| chat-widget | Core | Widget open/close, message send/receive, SSE streaming, session persistence |
| debug-layout | Layout | Debug panel rendering, Tokyo Night theme, reranker info display |
| full-device-audit | Devices | Cross-device rendering across mobile and desktop viewports |
| gpu-panel | GPU | GPU stats bar, real-time updates, VRAM/temperature/power display |
| screenshot-audit | Visual | Screenshot consistency, visual regression detection |
| smoke | Critical | DE+EN flow, health check, basic chat roundtrip |
| wow-features | Advanced | Follow-up suggestions, language switching, contact intent routing |

Tests run against multiple viewports - Desktop (1280×720), Pixel 7, Galaxy S24, and iPhone 15 Pro - ensuring the widget works across devices. The Playwright suite runs separately from pytest, giving 1,910 total automated tests (1,825 pytest + 85 Playwright).


Monitoring & Observability

Grafana Dashboards

All metrics feed into Grafana via Prometheus. The kube-prometheus-stack deployment provides 30 dashboards out of the box - Kubernetes compute resources per cluster, namespace, pod, and node, plus CoreDNS, Alertmanager, and Kubelet monitoring:

AI Platform Overview dashboard with Service Health, GPU metrics, and traffic data

Cluster resource utilization is tracked per node, showing CPU usage, memory consumption, and pod resource allocation across the K3s single-node setup:

Kubernetes cluster resources dashboard showing CPU, memory, and namespace breakdown

Two custom dashboards complement the built-in Kubernetes metrics: the AI Platform - Overview dashboard tracks chatbot-specific metrics (request volume, response times, quality scores, LLM failover status, GPU utilization), while the Portfolio - www.pichler.dev dashboard monitors the production website (uptime probes, SSL certificate expiry, container resource usage):

Portfolio monitoring dashboard with website and chat API status

LLM Tracing with Langfuse

Infrastructure metrics tell you if things are running. LLM tracing tells you how well your AI is actually responding. Langfuse is an open-source observability platform for LLM applications - a self-hosted alternative to LangSmith.

Every LLM call is automatically traced via a LangChain CallbackHandler injected at the ChatOpenAI constructor level. Each trace captures the full system prompt, user query, retrieved context, generated response, token counts, and latency - all without touching application code beyond a single import.
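The injection pattern described above can be sketched in plain Python. This is illustrative only - `TraceCollector` and `ChatClient` are stand-ins, not the real Langfuse or LangChain APIs; in the actual stack a Langfuse `CallbackHandler` is passed into the `ChatOpenAI` constructor's `callbacks` list.

```python
import time

class TraceCollector:
    """Stand-in for a tracing callback: records prompt, response, latency."""
    def __init__(self):
        self.traces = []

    def on_call(self, prompt, response, latency_s):
        self.traces.append(
            {"prompt": prompt, "response": response, "latency_s": latency_s}
        )

class ChatClient:
    """Stand-in for an LLM client that accepts callbacks at construction,
    the way ChatOpenAI does - no call sites need to change."""
    def __init__(self, callbacks=()):
        self.callbacks = list(callbacks)

    def invoke(self, prompt):
        start = time.perf_counter()
        response = f"echo: {prompt}"  # placeholder for a real model call
        latency = time.perf_counter() - start
        # Every call is reported to every registered handler.
        for cb in self.callbacks:
            cb.on_call(prompt, response, latency)
        return response

collector = TraceCollector()
client = ChatClient(callbacks=[collector])
client.invoke("What GPU does the cluster use?")
```

Because the handler is wired in once at construction time, tracing follows every call automatically - which is why enabling it in production is a configuration change, not a code change.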

Langfuse trace detail showing a ChatOpenAI call with input, output, and metadata

Langfuse runs as 5 pods on the K8s cluster (web, worker, ClickHouse, Redis, MinIO) with tight resource limits - adding less than 1 GiB to the cluster’s memory footprint. The SDK auto-configures via environment variables, so tracing is enabled in production and silently disabled in development.

GPU Metrics API

A dedicated /api/chat/gpu-stats endpoint exposes real-time GPU telemetry as JSON and as an SSE stream (/api/chat/gpu-stats-stream) with ~2-3 updates per second. The data comes from three sources: the vLLM metrics endpoint, nvidia-smi, and the DCGM exporter. The chat widget displays a live GPU status bar at the bottom, showing VRAM usage, temperature, power draw, and inference model - giving visitors transparency into the hardware running their queries.
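The wire format for such a stream is simple: each Server-Sent Event is a `data:` line followed by a blank line. A minimal framing sketch - the field names (`vram_used_mb`, etc.) are illustrative, not the endpoint's actual schema:

```python
import json

def sse_event(payload, event=None):
    """Serialize one Server-Sent Event frame: optional 'event:' line,
    then 'data: <json>', terminated by a blank line."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(payload)}")
    return "\n".join(lines) + "\n\n"

frame = sse_event(
    {"vram_used_mb": 9216, "temp_c": 61, "power_w": 148},
    event="gpu",
)
```

A Flask view would yield frames like this from a generator with `mimetype="text/event-stream"`; the browser's `EventSource` reassembles them on the other side.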

Chat widget with GPU stats bar visible

Alerting via ntfy

Critical events trigger push notifications via a self-hosted ntfy server:

  • LLM failover activated / recovered
  • High error rate detected
  • Service health check failures

Alerts are sent via HTTP POST with severity tags - no external notification service needed, no subscription fees. Each alert includes the service name, environment, timestamp, and relevant context fields (error details, affected providers, duration).
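ntfy's publish API is a plain HTTP POST with the message in the body and metadata in headers such as `Title`, `Priority`, and `Tags`. A stdlib-only sketch - the server URL and topic below are hypothetical placeholders:

```python
import urllib.request

def build_alert(topic_url, title, message, priority="high", tags="warning"):
    """Build (but don't send) an ntfy publish request."""
    req = urllib.request.Request(
        topic_url, data=message.encode("utf-8"), method="POST"
    )
    req.add_header("Title", title)       # notification title
    req.add_header("Priority", priority) # ntfy priority level
    req.add_header("Tags", tags)         # emoji/severity tags
    return req

req = build_alert(
    "https://ntfy.example.internal/chatbot-alerts",  # hypothetical topic
    title="LLM failover activated",
    message="vLLM unreachable; routing to OpenRouter backup",
)
# urllib.request.urlopen(req) would deliver it; omitted here.
```

No SDK, no API key exchange with a third party - any component that can make an HTTP request can raise an alert.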

A/B Testing

An experiment framework enables testing different prompt styles, temperature settings, and context chunk counts. Sessions are assigned to variants via consistent hashing (same session always gets the same variant), with per-variant tracking of quality scores, response times, and user feedback.
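Consistent hashing for variant assignment can be sketched in a few lines. This is a minimal illustration, not the project's actual experiment framework; the variant names are made up:

```python
import hashlib

VARIANTS = ["control", "concise_prompt", "more_context"]  # illustrative

def assign_variant(session_id, variants=VARIANTS):
    """Deterministically map a session to a variant. Uses SHA-256 rather
    than Python's built-in hash(), which is randomized per process."""
    digest = hashlib.sha256(session_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

variant = assign_variant("sess-42")
```

Because the mapping depends only on the session ID, a returning session always sees the same prompt style and settings - a prerequisite for attributing quality scores and feedback to the right variant.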


Lessons Learned

1. Consumer GPUs are viable for production AI. The RTX 3060 handles Gemma 3 4B IT with ~113ms time-to-first-token. For a personal portfolio chatbot, that’s more than enough - and the ongoing cost is just electricity (~17W idle, ~150W under load, roughly €3-5/month). No per-token pricing, no usage caps.

2. RAG beats fine-tuning for domain-specific Q&A. Instead of fine-tuning a model on my resume, I index my website and retrieve relevant chunks at query time. Content updates are instant - no retraining needed, just trigger a re-scrape.

3. Security must be first, not last. Adding prompt injection detection after the fact is painful. Building it into the LangGraph pipeline from day one - as its own validation node - made it natural and testable. 258 detection rules didn’t appear overnight; they grew from production experience.

4. SSE beats WebSocket for Cloudflare Tunnel. Cloudflare’s free plan blocks WebSocket over HTTP/2. SSE provides the same streaming UX without the protocol limitations. A one-line change in the transport layer, zero UX regression.

5. Three repos beat a monorepo for mixed infrastructure. Separating the chatbot code, K8s manifests, and portfolio website into three repos with cross-triggers keeps each pipeline focused and independently deployable. A K8s manifest change doesn’t trigger chatbot tests.

6. Always check the proxy chain. The CORS preflight issue that took hours to debug? Nginx in the portfolio container was blocking OPTIONS requests before they reached Flask. The chatbot’s CORS middleware was correct - the problem was upstream.

7. Automate everything - including content updates. The portfolio deploy triggers a re-scrape automatically. I never have to remember to update the chatbot’s knowledge base. If it’s not automated, it’s not reliable.

8. Consolidate when the overlap is obvious. The portfolio contact form and the chatbot API shared SMTP config, security middleware, CORS setup, and rate limiting. Running them as separate containers doubled the maintenance surface for zero benefit. Merging them into a single Flask app with two Blueprints halved the container count on the www-server and simplified the Nginx proxy from two location blocks to one.


Built With Claude Code

This entire project - backend, frontend, infrastructure, CI/CD, security, tests, and this blog post - was built using Vibe Coding with Claude Code (Claude Opus).

My role: Architect and Product Owner. I defined requirements, made architecture decisions, reviewed every result, and steered direction. Claude executed - writing code, debugging infrastructure issues, configuring Kubernetes manifests, setting up CI/CD pipelines, and running comprehensive tests.

Some highlights of the AI-assisted development:

  • LangGraph state machine - Claude designed and implemented the 10-node chat flow graph, including the context sufficiency evaluation loop, the 187-pattern injection detection system, and the NLI-based faithfulness checker
  • 3-repo CI/CD architecture - From Dockerfiles to GitLab pipeline configs to cross-repo triggers, fully automated with zero manual deployment steps
  • Kubernetes debugging - When Langflow’s API returned cryptic errors (ValueError: 'display_name'), Claude traced it through the API, found the root cause (incomplete component templates in the flow JSON), and fixed the creation script
  • CORS preflight fix - A 405 error on OPTIONS requests went through three layers of debugging (Flask → Nginx → Cloudflare) before Claude identified the single line in nginx.conf blocking the request method
  • 1,910 automated tests - 1,825 pytest + 85 Playwright E2E tests covering API contracts, injection detection, session security, input sanitization, CSRF validation, spam detection, alerting, faithfulness checking, context sufficiency, and recruiting pattern recognition

This is not “AI replacing developers” - it’s AI amplifying an architect’s capabilities. I couldn’t have built this entire stack in the time I did without Claude. And Claude couldn’t have built it without someone making the right architecture decisions, reviewing code, and knowing when to push back.

The best results come from human judgment + AI execution.


What’s Next

Since publication, two major features have shipped:

  • Hybrid Search ✓ - BM25 sparse vectors now run alongside dense embeddings, fused via Reciprocal Rank Fusion. Exact term matches and proper nouns that semantic search alone missed are now reliably retrieved.
  • Query Rewriting ✓ - Vague first queries like “tell me about him” are rewritten by the LLM into specific, self-contained questions before retrieval, dramatically improving first-turn recall.
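Reciprocal Rank Fusion, the technique behind the hybrid search feature above, is compact enough to sketch directly (illustrative code, not the project's retrieval module). Each ranking is a best-first list of document IDs, and each document scores the sum of 1/(k + rank) across rankings:

```python
def rrf(rankings, k=60):
    """Fuse several rankings via Reciprocal Rank Fusion.
    k dampens the influence of top ranks; 60 is the common default."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["about-me", "k8s-setup", "contact"]    # semantic ranking
sparse = ["k8s-setup", "gpu-specs", "about-me"]  # BM25 ranking
fused = rrf([dense, sparse])
# Documents appearing in both lists ("k8s-setup", "about-me") rise to the top.
```

The appeal is that RRF needs only ranks, never raw scores, so dense cosine similarities and sparse BM25 scores fuse without any normalization step.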

Still on the roadmap:

  • Eval Pipeline - Golden-set benchmarking with curated question-answer pairs to measure retrieval quality, response accuracy, and regression detection across pipeline changes
  • Streaming Citations - Inline source references in streamed responses, linking claims to specific RAG chunks in real-time
  • Larger model - When VRAM allows (or with quantization), moving to a 12B+ model for more nuanced responses

The Stack

| Layer | Technology |
|---|---|
| LLM | vLLM + Gemma 3 4B IT (RTX 3060, 12 GB VRAM) |
| Embeddings | TEI + Jina Embeddings DE v2 (768 dimensions) |
| Vector DB | Qdrant (45 chunks, Cosine similarity, hybrid dense+sparse) |
| Orchestration | LangGraph StateGraph (10 nodes, 7 intent types) |
| API | Flask + Gunicorn (27 endpoints, 2 Blueprints) |
| Frontend | Vanilla JS (zero dependencies, WCAG 2.1 AA) |
| Content Indexing | Langflow (webhook-triggered) |
| Reranking | Cross-Encoder via TEI |
| Auth | Keycloak + OAuth2 Proxy |
| Quality | 8 garbage checks + 10 hallucination checks + NLI faithfulness + confidence scoring |
| Security | 258 injection rules, OWASP Top 10, 11 security headers, PoW challenges |
| Failover | OpenRouter (GPT-4o-mini) automatic backup |
| Alerting | ntfy (self-hosted push notifications) |
| Monitoring | Prometheus + Grafana + DCGM Exporter |
| LLM Tracing | Langfuse (self-hosted, OpenTelemetry) |
| Infrastructure | K3s (13 pods) + Docker Compose |
| CI/CD | GitLab (3 repos, cross-triggers, container registry) |
| Ingress | Cloudflare Tunnel (zero-trust, no exposed ports) |
| Testing | 1,825 pytest + 85 Playwright E2E, pip-audit, Bandit SAST |

Try it yourself - the chat widget is live at pichler.dev. Feel free to reach out if you want to discuss AI infrastructure, RAG pipelines, or self-hosted LLMs.