TL;DR

I built a production-grade monitoring and SIEM platform for my entire homelab infrastructure running on a single-node K3s cluster. The system combines Prometheus for metrics, Grafana for visualization, Loki for log aggregation, and Wazuh for security event management — all deployed via Ansible and Helm with full Infrastructure as Code.

Key Metrics:

  • 42 Prometheus scrape targets
  • 33 custom Grafana dashboards + 10 community imports
  • 52 custom alert rules with intelligent inhibition
  • 22 Wazuh security agents across 3 OS families
  • 8 agent groups with specialized detection rules
  • ~100 custom Wazuh rules (IDs 100100–100499)
  • 3 Wazuh Grafana dashboards (SIEM, Compliance, Vulnerabilities)
  • 14-day log retention in Loki
  • Multi-tier alerting: Email + ntfy mobile push

Why Build Enterprise Monitoring for a Homelab?

Many homelabs run blind. Services crash, disks fill up, certificates expire, and you only notice when something stops working. I wanted the opposite: know about problems before they become outages.

Three goals drove the implementation:

Visibility: Every host, every service, every metric — in one place. From GPU temperature on the AI VM to battery charge on the UPS, from LoRaWAN sensor signal strength to Proxmox guest status.

Security: Running 20+ hosts with internet-facing services demands real intrusion detection, not just hoping firewalls are enough. Wazuh provides file integrity monitoring, vulnerability scanning, and active response across the entire fleet.

Automation: No manual log checking, no SSH-ing into boxes to check disk space. Alerts come to my phone. Dashboards show the full picture. Problems get detected and — in some cases — resolved automatically.


Architecture

The monitoring stack runs entirely within the monitoring namespace on a single-node K3s cluster, while Wazuh operates as an All-in-One LXC container on Proxmox. Both systems feed into the same Grafana instance.

Infrastructure:

| Component | Location | Role |
|---|---|---|
| Prometheus | K3s Pod | Metrics collection, 15-day retention, 15Gi storage |
| Grafana | K3s Pod | Visualization, 33+ custom dashboards |
| Loki | K3s Pod | Log aggregation, single-binary, 14-day retention |
| Alertmanager | K3s Pod | Alert routing, email + ntfy |
| Promtail | K3s DaemonSet + external agents | Log shipping from pods, PVE hosts, Wazuh |
| Wazuh Manager | LXC on Proxmox | SIEM: Manager + Indexer + Dashboard |
| 8 Exporters | K3s Pods | PVE, UniFi, Blackbox, SNMP, NUT, PBS, AdGuard, Speedtest |

Deployment Method:

| Stack | Tool | Source |
|---|---|---|
| Prometheus + Grafana + Alertmanager | Helm (kube-prometheus-stack) | kubernetes/monitoring/install.sh |
| Loki + Promtail | Helm | kubernetes/monitoring/install.sh |
| All exporters | Kustomize | kubernetes/monitoring/<exporter>/ |
| Wazuh Manager | Ansible | ansible/playbooks/configure-wazuh-manager.yml |
| Wazuh Agents | Ansible | ansible/playbooks/setup-wazuh-agents.yml |
| External monitoring agents | Shell scripts | scripts/setup-monitoring-hosts.sh |

Everything lives in a single Git repository — true Infrastructure as Code with Ansible playbooks for Wazuh and Helm/Kustomize for Kubernetes workloads.


The Monitoring Stack

Prometheus: 42 Scrape Targets

Prometheus sits at the center, scraping metrics from every layer of the infrastructure. The configuration in values.yaml defines 42 jobs organized by category:

Infrastructure Hosts (7 targets):

| Target | Host | Port | Interval |
|---|---|---|---|
| pve-nodes | 3 Proxmox hypervisors | 9100 | 30s |
| ai-vm | AI/GPU VM | 9100 | 30s |
| relay | Mail relay LXC | 9100 | 30s |
| opnsense | Firewall | 9100 | 30s |
| smartctl | Storage host | 9633 | 300s |
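Each of these jobs maps to a short entry in the kube-prometheus-stack values file. A minimal sketch of what two of them might look like (hostnames are placeholders; the real values.yaml isn't reproduced here):

```yaml
additionalScrapeConfigs:
  - job_name: pve-nodes
    scrape_interval: 30s
    static_configs:
      - targets:            # placeholder hostnames
          - pve1:9100
          - pve2:9100
          - pve3:9100
  - job_name: smartctl
    scrape_interval: 300s   # SMART polling is slow, so 5 min is plenty
    static_configs:
      - targets:
          - pve3:9633
```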

API Exporters (8 targets):

| Target | What It Monitors | Port | Interval |
|---|---|---|---|
| pve-exporter | Proxmox API (all VMs/CTs) | 9221 | 60s |
| pbs-exporter | Proxmox Backup Server | 10019 | 60s |
| unifi-poller | UniFi Controller (APs, clients) | 9130 | 30s |
| snmp-switch | D-Link switch | 9116 | 60s |
| snmp-truenas | TrueNAS SNMP | 9116 | 60s |
| nut-exporter | Eaton UPS (battery, load) | 9199 | 30s |
| adguard-exporter | AdGuard DNS analytics | 9618 | 30s |
| speedtest | Internet bandwidth (every 4h) | 9798 | 300s |

Service Health Probes (Blackbox Exporter):

| Probe Type | Targets | Module |
|---|---|---|
| ICMP Ping | 21 hosts (all infrastructure) | icmp_ping |
| HTTP 2xx | 12 internal services | http_2xx |
| HTTP Any | OPNsense (Let’s Encrypt) | http_any |
| DNS | OPNsense Unbound | dns_test |
| SMTP | Internal mail relay | smtp_relay |

Application-Specific (20+ targets):

GitLab alone exposes 5 scrape targets (exporter, webservice, gitaly, postgresql, redis). Additional targets include Traefik ingress metrics, cert-manager certificate lifecycle, Cloudflare Tunnel stats, ChirpStack LoRaWAN, MQTT sensor exporter, NVIDIA DCGM GPU metrics (15s interval!), Home Assistant, Wazuh SIEM exporter, and external services like the portfolio chatbot API with bearer token authentication.

The full target list reads like a network inventory — because it essentially is one.

Grafana: 33 Custom Dashboards

Grafana Dashboard Folders — AI Platform, Homelab, Infrastructure, Services

Every dashboard is deployed as a Kubernetes ConfigMap with the grafana_dashboard: "1" label, automatically discovered by Grafana’s sidecar. No manual import, no clicking through UIs — git push deploys dashboards.
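A minimal sketch of one such ConfigMap (the name and JSON body are illustrative; only the `grafana_dashboard: "1"` label matters to the sidecar):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboard-homelab-overview    # illustrative name
  namespace: monitoring
  labels:
    grafana_dashboard: "1"            # the sidecar watches for this label
data:
  homelab-overview.json: |
    {"title": "Homelab Overview", "panels": []}
```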

Custom Dashboards (ConfigMap-based):

| Dashboard | Key Panels | Data Source |
|---|---|---|
| Homelab Overview | Service status grid, host health, quick links | Prometheus |
| Wazuh SIEM | Agent fleet, alert categories, top rules | Prometheus (Wazuh exporter) |
| Wazuh Compliance & Threats | SCA scores, MITRE ATT&CK tactics, auth events | Prometheus |
| Wazuh Vulnerability Deep Dive | CVE counts by severity, per-host breakdown, trends | Prometheus |
| AI Platform Overview | GPU temp/utilization/VRAM, inference latency | Prometheus-AI |
| Portfolio www.pichler.dev | HTTP probe phases, SSL expiry, Docker metrics | Prometheus |
| OPNsense Firewall | Interface throughput, packet stats, rules | Prometheus |
| PBS Backup | Backup/verify age, datastore usage, job status | Prometheus |
| SMART & ZFS Health | Disk temperatures, pool status, error counts | Prometheus |
| NUT UPS | Battery charge, runtime, load, input voltage | Prometheus |
| Traefik Ingress | Request rate, latency percentiles, error codes | Prometheus |
| Loki & Promtail | Ingestion rate, query latency, dropped logs | Prometheus + Loki |
| LoRaWAN Sensors | Battery %, RSSI, SNR, last seen | Prometheus |
| Power Cost & Energy | UPS consumption → €/month estimation | Prometheus |
| SLO & Uptime Tracking | Service availability percentages | Prometheus |
| Network Map & Status | L2 topology, link utilization | Prometheus |

Plus 10 community dashboards imported by gnetId:

| Dashboard | gnetId | Purpose |
|---|---|---|
| Node Exporter Full | 1860 (rev 42) | Comprehensive host metrics |
| Proxmox VE Cluster | 10347 | VM/CT overview |
| UniFi Client/UAP/USW/Sites | 11315/11314/11312/11311 | Network analytics |
| Blackbox Exporter | 7587 | Probe results |
| MinIO | 13502 | Object storage |
| SNMP Stats | 11169 | Switch metrics |
| cert-manager | 20842 | Certificate lifecycle |

Wazuh SIEM Grafana Dashboard — Manager UP, 19 agents, 236k alerts, pie charts for categories and severity

Alertmanager: 52 Custom Rules with Smart Routing

Alerts aren’t useful if they wake you up for non-issues. The alerting system uses inhibition rules to suppress noise — if a host is down, don’t also alert about its services being unreachable.

Alert Groups (a selection of the 20 categories, 52 rules total):

| Group | Rules | Examples |
|---|---|---|
| host-alerts | 11 | HostDown, HighCPU, HighMemory, DiskSpaceCritical/Warning, SmartDiskErrors |
| service-alerts | 5 | ServiceDown, SlowResponse, SmtpRelayDown, DnsDown |
| ups-alerts | 3 | UpsOnBattery (immediate!), UpsLowBattery, UpsBatteryReplace |
| kubernetes-alerts | 2 | K3sNodeNotReady, PodCrashLooping |
| wazuh-alerts | 4 | WazuhManagerDown, AgentDisconnected, HighCriticalCVEs, AlertSpike |
| gpu-alerts | 4 | GPUHighTemperature, GPUCriticalTemperature, GPUMemoryHigh |
| backup-alerts | 4 | PbsBackupStale, PbsVerifyStale, K3sBackupStale |
| certificate-alerts | 2 | CertExpiringSoon (30d), CertExpiryCritical (7d) |
| chatbot-alerts | 3 | ChatbotAPIDown, HighErrorRate, HighResponseTime |
| lorawan-alerts | 2 | LoRaSensorOffline (>2h), LoRaSensorBatteryLow (<15%) |
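As a sketch, a host-alerts rule in kube-prometheus-stack notation (the selector, duration, and annotation text are illustrative, not the exact rule from the repository's values.yaml):

```yaml
additionalPrometheusRulesMap:
  homelab-rules:
    groups:
      - name: host-alerts
        rules:
          - alert: HostDown
            expr: up{job="pve-nodes"} == 0   # illustrative selector
            for: 2m                          # avoid flapping on one missed scrape
            labels:
              severity: critical
            annotations:
              summary: "Host {{ $labels.instance }} is down"
```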

Inhibition Logic (5 rules):

HostDown         → suppresses all warnings for that host
UpsOnBattery     → suppresses non-critical alerts
DiskCritical     → suppresses DiskWarning (same mountpoint)
CertCritical     → suppresses CertWarning (same instance)
DiskFill3Days    → suppresses DiskFill7Days
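In Alertmanager configuration, two of these suppressions look roughly like this (the matcher labels are assumptions about how the rules are labeled):

```yaml
inhibit_rules:
  # HostDown silences every warning-level alert for the same instance
  - source_matchers: ['alertname = HostDown']
    target_matchers: ['severity = warning']
    equal: ['instance']
  # The critical disk alert silences the warning tier on the same mountpoint
  - source_matchers: ['alertname = DiskSpaceCritical']
    target_matchers: ['alertname = DiskSpaceWarning']
    equal: ['instance', 'mountpoint']
```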

Notification Routing:

| Severity | Channel | Timing |
|---|---|---|
| Critical | Email + ntfy push | Immediate, repeat 7d |
| Warning | ntfy push only | No repeat |
| UpsOnBattery | Email + ntfy | group_wait: 0s, repeat: 5min |
| Info | Silence | Dashboard only |

The ntfy-bridge is a custom Python service that translates Alertmanager webhooks into ntfy push notifications with priority mapping. Critical alerts get priority 5 (urgent), warnings priority 3 (default), and resolved notifications priority 2 (low). Separate topics for homelab-critical and homelab-warnings keep the notification channels clean.
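A minimal sketch of such a bridge (hypothetical reconstruction, not the actual service: the ntfy URL is a placeholder, and only the priority/topic mapping mirrors the scheme described above):

```python
# Sketch of an Alertmanager -> ntfy webhook bridge.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request

NTFY_URL = "http://ntfy.local"     # assumption: internal ntfy server
TOPIC_CRITICAL = "homelab-critical"
TOPIC_WARNING = "homelab-warnings"

def ntfy_priority(severity: str, status: str) -> int:
    """Map Alertmanager severity/status to ntfy priority (1-5)."""
    if status == "resolved":
        return 2                    # low: resolved notifications
    return 5 if severity == "critical" else 3   # urgent vs. default

def topic_for(severity: str) -> str:
    """Keep critical and warning streams on separate topics."""
    return TOPIC_CRITICAL if severity == "critical" else TOPIC_WARNING

class BridgeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        payload = json.loads(body)
        for alert in payload.get("alerts", []):
            sev = alert["labels"].get("severity", "warning")
            req = request.Request(
                f"{NTFY_URL}/{topic_for(sev)}",
                data=alert["annotations"].get("summary", "alert").encode(),
                headers={"Priority": str(ntfy_priority(sev, alert["status"]))},
            )
            request.urlopen(req)    # fire-and-forget push
        self.send_response(200)
        self.end_headers()

def main() -> None:
    HTTPServer(("0.0.0.0", 8080), BridgeHandler).serve_forever()
```

Alertmanager then points a webhook receiver at this service's port.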

Loki: Centralized Log Aggregation

Loki runs in single-binary mode — all components in one pod. For a homelab, this is the sweet spot between simplicity and capability.

Log Sources (3 tiers):

Tier 1 — K3s Pods (Promtail DaemonSet): All pod logs from /var/log/pods are automatically collected with Kubernetes label enrichment. Zero configuration per service.

Tier 2 — PVE Hosts (External Promtail agents): Installed via scripts/install-node-exporter.sh --with-promtail on pve1, pve2, pve3. Ships syslog, auth.log, pveproxy, pvedaemon, kernel, and systemd journal to Loki’s NodePort (31000).
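A sketch of the Promtail client config these agents run with (label names and the per-host scrape entry are illustrative; only the NodePort is from the setup above):

```yaml
clients:
  - url: http://<k3s-node>:31000/loki/api/v1/push   # Loki NodePort
scrape_configs:
  - job_name: pve-syslog
    static_configs:
      - targets: [localhost]
        labels:
          job: syslog
          host: pve1               # set per host by the install script
          __path__: /var/log/syslog
```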

Tier 3 — Wazuh Manager (Dedicated Promtail): The most interesting source. Promtail on the Wazuh LXC ships:

| Log Source | Format | Content |
|---|---|---|
| wazuh-alerts | JSON | Security alerts with rule IDs, severity levels |
| wazuh-manager | Text | Manager operational logs (ossec.log) |
| active-responses | Text | IP blocks, firewall drops |
| wazuh-api | Text | Dashboard/API access logs |
| syslog + auth | Text | System and authentication events |
| systemd journal | Structured | Service lifecycle events |

This creates a powerful correlation capability: Prometheus shows you what is happening (metrics), Loki shows you why (logs), and Wazuh shows you who is responsible (security events).

Configuration:

# Loki retention
limits_config:
  retention_period: 336h  # 14 days
  max_query_series: 50000
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_streams_per_user: 10000

External Host Monitoring

Not everything runs in Kubernetes. The PVE hypervisors, OPNsense firewall, and Wazuh LXC need monitoring agents installed directly.

Installation:

# One script to rule them all
./scripts/setup-monitoring-hosts.sh

# Installs on pve1/2/3:
#   - node_exporter (port 9100) — CPU, RAM, disk, network
#   - promtail (log shipping to Loki NodePort)
# Installs on pve3 additionally:
#   - smartctl_exporter (port 9633) — SMART disk health

OPNsense uses its native os-node_exporter plugin — installed via System → Firmware → Plugins. No SSH needed.

TrueNAS and MinIO expose Prometheus endpoints natively — just enable in their respective UIs.


Wazuh SIEM: Security for Every Host

Wazuh Dashboard Overview — 19 active agents, alert severity breakdown, endpoint and threat intelligence modules

Why Wazuh?

Wazuh is an open-source SIEM that combines log analysis, intrusion detection, file integrity monitoring, vulnerability detection, and active response in one platform. For a homelab with internet-facing services, this isn’t optional — it’s essential.

Deployment: All-in-One LXC

| Property | Value |
|---|---|
| Platform | LXC on Proxmox |
| Resources | 4 vCPU, 6GB RAM, 50GB Disk |
| Version | Wazuh 4.14.3 |
| Components | Manager + OpenSearch Indexer + Dashboard |

Deployed and configured entirely via Ansible:

# Deploy agents to all Linux hosts
ansible-playbook -i inventory.yml playbooks/setup-wazuh-agents.yml

# Configure Manager (groups, rules, active response, email)
ansible-playbook -i inventory.yml playbooks/configure-wazuh-manager.yml

Agent Fleet: 22 Agents, 8 Groups

Every host in the homelab runs a Wazuh agent. Agents are organized into groups with tailored configurations:

Agent Groups:

| Group | Agents | Specialization |
|---|---|---|
| proxmox | 3 hypervisors | Hypervisor config monitoring, /etc/pve FIM |
| kubernetes | K3s node | K3s audit logs, K8s events, pod status, containerd |
| storage | 4 storage hosts | Backup configs, ZFS settings, storage credentials |
| network | 3 network services | Network service configs, Docker listener |
| services | 10 agents | Docker lifecycle, Nginx logs, service configs |
| ai-workload | GPU VM | Ignores large model files (.gguf, .safetensors) |
| windows | 1 Windows host | Windows Event Logs, Sysmon (pending deployment) |
| siem | Wazuh (self) | Self-monitoring: Manager, Dashboard, OpenSearch configs |

Each group has its own agent.conf in ansible/files/wazuh/shared/<group>/, defining:

  • FIM (File Integrity Monitoring): Which paths to watch in realtime vs. scheduled scans
  • Syscollector: Hardware/software inventory intervals
  • Localfile: Which logs to collect and parse
  • Docker listener: Container lifecycle events (enabled on 6 hosts)
  • Vulnerability detection: OS and package scanning

Example — Kubernetes Group Configuration:

The kubernetes agent does the heavy lifting for cluster security:

<!-- K3s audit log (JSON) -->
<localfile>
  <log_format>json</log_format>
  <location>/var/log/k3s-audit.log</location>
</localfile>

<!-- K8s Warning events (streamed) -->
<localfile>
  <log_format>json</log_format>
  <location>/var/log/k8s-events.log</location>
</localfile>

<!-- Pod status check (every 2 min) -->
<localfile>
  <log_format>full_command</log_format>
  <command>kubectl get pods --all-namespaces -o json | jq ...</command>
  <frequency>120</frequency>
</localfile>

<!-- Container status check (every 2 min) -->
<localfile>
  <log_format>full_command</log_format>
  <command>crictl ps -a -o json | jq ...</command>
  <frequency>120</frequency>
</localfile>

Custom Rules: 100 Rules Across 5 Categories

Every custom rule has a specific ID range and severity level. Severity determines action: level 10+ gets logged prominently, level 13+ triggers email alerts.

Proxmox Rules (100100–100199):

| Rule ID | Event | Level |
|---|---|---|
| 100100-101 | VM/CT start/stop | 5 |
| 100102 | VM migration failure | 10 |
| 100103 | Cluster membership change | 8 |
| 100104-105 | Backup job failed/success | 10/3 |
| 100106 | Storage config change | 7 |
| 100107 | Ceph degradation | 12 |

K3s Audit Rules (100200–100206):

| Rule ID | Event | Level |
|---|---|---|
| 100200 | Secret access/modification | 10 |
| 100201 | Pod deletion | 8 |
| 100202 | RBAC denial (403) | 12 |
| 100203 | Namespace operations | 7 |
| 100204 | Workload changes (deploy/ds/sts) | 5 |
| 100205 | RBAC config changes | 10 |
| 100206 | kubectl exec into pods | 8 |

Kubernetes Container Monitoring (100210–100241):

| Rule ID | Event | Level |
|---|---|---|
| 100211 | OOMKilled | 12 |
| 100212 | CrashLoopBackOff | 10 |
| 100213 | Image pull errors | 8 |
| 100215 | Pod evictions | 10 |
| 100219 | Node not ready | 12 |
| 100240 | Multiple OOMKills (3+ in 10min) | 13 (email!) |
| 100241 | Multiple CrashLoopBackOff (3+ in 5min) | 12 |

Docker Container Monitoring (100250–100256):

| Rule ID | Event | Level |
|---|---|---|
| 100250 | Container start | 5 |
| 100252 | Container died unexpectedly | 10 |
| 100253 | Container OOMKilled | 8 |
| 100255 | Command executed in container | 10 |
| 100256 | Multiple container deaths (3+ in 10min) | 12 |

Homelab Security Rules (100400–100410):

| Rule ID | Event | Level |
|---|---|---|
| 100400/450 | SSH from non-local (suppressed for trusted subnets) | 8/0 |
| 100404 | SSH brute force detection | 10 → Active Response |
| 100405 | New user account creation | 8 |
| 100408 | Disk space exhausted | 12 |
| 100409 | OOM killer triggered | 12 |
| 100410 | Certificate expiration warning | 8 |
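A hedged reconstruction of what a rule like 100404 looks like in Wazuh rule XML (the parent SID, frequency, and timeframe are assumptions, not the repository's actual values):

```xml
<group name="homelab,sshd,">
  <rule id="100404" level="10" frequency="8" timeframe="120">
    <!-- 5716 is Wazuh's built-in "sshd: authentication failed" rule -->
    <if_matched_sid>5716</if_matched_sid>
    <same_source_ip/>
    <description>SSH brute force from a single source IP</description>
  </rule>
</group>
```

Level 10 crosses the active-response threshold, which is what hands the source IP to firewall-drop.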

Custom Decoders

Seven custom decoders parse non-standard log formats:

proxmox-task    → pvedaemon/pveproxy/pvestatd/pveceph logs
pbs-task        → Proxmox Backup Server proxy/manager logs
k3s-audit       → K3s API server audit events (JSON)
k8s-event       → Kubernetes Warning events from k8s-event-logger
k8s-pod-status  → kubectl pod status JSON output
containerd-status → crictl container status JSON output
opnsense-filterlog → OPNsense packet filter log parsing

Active Response: Automated Threat Mitigation

Wazuh doesn’t just detect — it responds. The active response system implements escalating IP blocking:

SSH Brute Force Response:

Trigger: Repeated failed SSH login attempts
Action: firewall-drop (iptables block)

Escalation: Increasing block durations from minutes to hours
Port Scan: Automatic temporary block

An IP whitelist protects infrastructure hosts (hypervisors, K3s master, Wazuh manager) from accidental self-lockout — a lesson learned the hard way in many SIEM deployments.

K3s Audit Logging

Kubernetes API audit logging captures every request to the K3s API server. The audit policy defines four levels:

| Level | Events |
|---|---|
| None | Health checks, list/watch, system service accounts |
| Metadata | Secret ops, RBAC changes, namespace ops, workload changes |
| RequestResponse | Pod exec/attach/portforward |
| RequestResponse | RBAC modifications |
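In Kubernetes audit-policy form these levels look roughly like this (a sketch; the actual policy file in the playbook has more rules, and order matters because the first matching rule wins):

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: None                      # drop health-check and watch noise
    nonResourceURLs: ["/healthz*", "/readyz*", "/livez*"]
  - level: RequestResponse           # full payload for interactive access
    resources:
      - group: ""
        resources: ["pods/exec", "pods/attach", "pods/portforward"]
  - level: Metadata                  # who touched Secrets, not their contents
    resources:
      - group: ""
        resources: ["secrets"]
```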

Implementation:

# WARNING: ~30s API downtime during deployment
ansible-playbook -i inventory.yml playbooks/setup-k3s-audit-logging.yml

The playbook configures K3s with an audit policy file and log rotation (7 days, 100MB max, 3 backups). Wazuh’s kubernetes agent reads /var/log/k3s-audit.log via the custom k3s-audit decoder.

This catches critical events: someone accessing Secrets, RBAC permission denials (potential privilege escalation attempts), unauthorized kubectl exec into pods, and workload modifications.

K8s Container Monitoring

A two-part system provides container-level visibility:

Part 1 — k8s-event-logger (systemd service):

# Streams K8s Warning events in real-time
kubectl get events --all-namespaces \
  --field-selector type!=Normal \
  --watch-only \
  -o json >> /var/log/k8s-events.log

This captures OOMKilled, CrashLoopBackOff, ImagePullBackOff, scheduling failures, and evictions as they happen.

Part 2 — Periodic Status Checks (every 2 minutes):

# Pods with restartCount > 3, waiting containers, or OOMKilled
kubectl get pods --all-namespaces -o json | jq '...'

# Non-running containers via containerd
crictl ps -a -o json | jq '...'

The combination of real-time event streaming and periodic health checks ensures nothing slips through.

Docker Container Monitoring

Six hosts run Docker alongside the Wazuh agent. The Docker listener wodle captures container lifecycle events:

| Host Type | Containers Monitored |
|---|---|
| Web server | Portfolio, chatbot, tunnel, analytics |
| Network | UniFi Controller |
| Security | Password manager |
| Media | Media server |
| Communication | Matrix/Synapse |
| Infrastructure | UPS notification service |

Events tracked: start, stop, die (unexpected), oom, pull, exec_start. An unexpected container death (rule 100252, level 10) gets immediate attention; three deaths in 10 minutes (rule 100256, level 12) signal a systemic problem.

OPNsense: FreeBSD Agent

OPNsense Firewall Grafana Dashboard — interface throughput, packet statistics

OPNsense requires special handling — it’s FreeBSD, not Linux. Deployment uses the native plugin:

System → Firmware → Plugins → os-wazuh-agent → Install
Services → Wazuh Agent → Settings → Manager: <wazuh-ip> → Enable

The custom opnsense-filterlog decoder parses OPNsense’s unique packet filter log format, extracting rule numbers, interfaces, source/destination IPs, and actions.

Email Alerting

Email alerting is configured with a high severity threshold — only critical events (multiple OOMKills in rapid succession, RBAC denials, Ceph degradation, or brute force escalations) trigger email notifications. This prevents warning spam. A daily summary report covers lower-severity events for non-urgent review.

Vulnerability Detection

Wazuh Vulnerability Deep Dive — 14,339 total CVEs: 76 Critical, 2000 High, per-host breakdown and trends

Wazuh Vulnerability Detection Dashboard — CVE severity counters, top vulnerabilities, CVSS scores, trends by year

Wazuh scans every agent for known CVEs using package inventory data and vulnerability feeds (updated hourly). The Grafana Vulnerability Deep Dive dashboard visualizes:

  • Total CVE count by severity (Critical/High/Medium/Low)
  • Top 15 CVEs by affected host count
  • Per-host vulnerability breakdown
  • CVE trends over time
  • Agent keepalive staleness (detecting disconnected agents)

This drives my vulnerability remediation workflow:

# Patch all hosts (rolling update, one at a time)
ansible-playbook -i inventory.yml playbooks/patch-vulnerabilities.yml

Wazuh ↔ Grafana Integration

Wazuh Compliance & Threats Dashboard — SCA scores, MITRE ATT&CK tactics, authentication events

The bridge between Wazuh and Grafana is a custom Prometheus exporter running on the Wazuh LXC. It queries the Wazuh Manager API and exposes 33 metric families:

Key Metrics Exported:

| Metric | Description |
|---|---|
| wazuh_agents_active | Number of connected agents |
| wazuh_agents_disconnected | Agents that lost connection |
| wazuh_alerts_24h | Alert volume (last 24 hours) |
| wazuh_alerts_by_level | Alerts grouped by severity |
| wazuh_vulnerabilities_by_severity | CVE counts (critical/high/medium/low) |
| wazuh_sca_score | Security Configuration Assessment score per agent |
| wazuh_mitre_tactic_count | MITRE ATT&CK tactic distribution |
| wazuh_fim_entries | File integrity monitoring file count |
| wazuh_active_response_24h | Automated blocks in last 24 hours |
| wazuh_agent_keepalive_age_seconds | Agent staleness indicator |
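The core of such a bridge is small: query the Manager API, reshape the response into Prometheus text exposition. A hedged sketch (the endpoint path and field names are assumptions modeled on the Wazuh 4.x REST API, not the actual exporter code):

```python
# Sketch of the Wazuh -> Prometheus exporter bridge.
import json
from urllib import request

WAZUH_API = "https://localhost:55000"   # assumption: Manager API on the LXC

def fetch_agent_summary(token: str) -> dict:
    """Query the Manager API for agent connection status (assumed endpoint)."""
    req = request.Request(
        f"{WAZUH_API}/agents/summary/status",
        headers={"Authorization": f"Bearer {token}"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["data"]

def render_metrics(summary: dict) -> str:
    """Render an agent-status summary as Prometheus text exposition."""
    lines = [
        "# TYPE wazuh_agents_active gauge",
        f"wazuh_agents_active {summary.get('active', 0)}",
        "# TYPE wazuh_agents_disconnected gauge",
        f"wazuh_agents_disconnected {summary.get('disconnected', 0)}",
    ]
    return "\n".join(lines) + "\n"
```

Serve `render_metrics()` on a `/metrics` endpoint and Prometheus scrapes it like any other exporter; the real exporter repeats this pattern for 33 metric families.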

Three Dedicated Dashboards:

1. Wazuh SIEM Dashboard (uid: wazuh-siem)

  • SIEM Overview: Manager status, agent counts, 24h alert volume
  • Security Alerts: By agent (pie), by category (pie), by severity
  • Events Breakdown: SCA, Rootcheck, FIM, AppArmor, K8s Audit stats
  • Vulnerability Assessment: Severity distribution, per-host CVE bars
  • Agent Fleet: Status grid (online/offline), OS distribution
  • Trends: Alert rate, agent fleet, FIM entries over time

2. Wazuh Compliance & Threats (uid: wazuh-compliance)

  • SCA Compliance: Score per agent (bar gauge), pass/fail breakdown
  • MITRE ATT&CK: Tactic distribution, top 15 techniques, trends
  • Authentication: Success/failed counters, active response count

3. Wazuh Vulnerability Deep Dive (uid: wazuh-vulns)

  • Total CVEs with critical/high/medium/low breakdown
  • Top 15 CVEs by count
  • Per-host analysis with severity breakdown
  • Agent health via keepalive age

The Complete Data Flow

Complete data flow: Infrastructure fleet → Wazuh SIEM + Exporters + Promtail → Prometheus, Loki, Alertmanager → Grafana dashboards and notifications


Lessons Learned

1. Inhibition rules are non-negotiable. Without them, a single host going down generates 5+ alerts (host down + services down + probes failing). With inhibition, you get one alert.

2. Separate critical from noise early. Wazuh’s email threshold at level 13 and Alertmanager’s critical/warning split prevent alert fatigue. If everything is urgent, nothing is.

3. Single-binary Loki is perfect for homelabs. The microservices deployment mode is overkill. Single-binary with 14-day retention on local storage handles everything a homelab needs.

4. Custom decoders make or break Wazuh. Out-of-the-box Wazuh doesn’t understand Proxmox, K3s audit, or OPNsense logs. Seven custom decoders were needed to make the data useful.

5. Active response needs a whitelist. Without one, Wazuh’s SSH brute force blocking will eventually block your management IPs. The whitelist for PVE hosts and K3s master prevents self-lockout.

6. GitOps dashboards > manual dashboards. ConfigMap-based Grafana dashboards survive pod restarts, are version-controlled, and deploy automatically. Never create dashboards through the UI in production.

7. The Prometheus exporter bridge is worth building. Wazuh’s native dashboard is great for investigation, but Grafana provides the unified view. A custom exporter bridging the two gives you the best of both worlds.

8. Monitor the monitoring. Dedicated alerts for LokiDown, PromtailDown, WazuhManagerDown, WazuhAgentDisconnected, and PbsExporterDown ensure the observability stack itself stays healthy.


Built With Claude Code

The entire monitoring and SIEM stack — all Ansible playbooks, Helm values, custom rules, decoders, dashboards, exporters, and alert configurations — was built using Claude Code (Claude Opus).

My role: Architect defining requirements, reviewing outputs, and making security decisions.

Claude’s execution:

  • 1,596-line Helm values.yaml with 42 scrape targets and 52 alert rules
  • 100+ Wazuh custom rules with proper severity levels and escalation
  • 7 custom decoders for non-standard log formats
  • 33 Grafana dashboard ConfigMaps with PromQL queries
  • 8 exporter Kustomize deployments with proper resource limits
  • Ansible playbooks for automated agent deployment across Linux, FreeBSD, and Windows
  • ntfy-bridge Python service for mobile push notifications
  • Shell scripts for external monitoring agent installation

This is infrastructure-as-code at scale — the kind of work that would take weeks manually, delivered in days with AI-assisted development.


What’s Next

Shipped:

  • 42 Prometheus targets covering all infrastructure
  • 22 Wazuh agents with specialized group configs
  • 52 custom alert rules with inhibition
  • 33 Grafana dashboards
  • Multi-tier alerting (email + ntfy)
  • Active response (SSH brute force, port scan blocking)
  • Restore testing automation (verify-backup.sh — monthly random VM restore from PBS, boot verification, auto-cleanup, ntfy alerts)
  • Server closet sensor (LoRaWAN sensors for temperature, humidity, CO2, VOC via ChirpStack MQTT exporter)
  • UPS auto-shutdown orchestration (NUT client on all PVE hosts, graceful shutdown on LOWBATT, ntfy notifications)
  • Incident Timeline dashboard (alert correlation, Loki error logs, ArgoCD deploy tracking, system resources)

Roadmap:

  • Server room sensor: LoRaWAN sensor in the server closet measuring temperature, humidity (and depending on model CO2/VOC), displayed via ChirpStack → MQTT Exporter → Prometheus → Grafana

The Stack (Complete Reference)

| Layer | Technology |
|---|---|
| Metrics | Prometheus (kube-prometheus-stack Helm, 15d retention) |
| Visualization | Grafana (33 custom + 10 community dashboards) |
| Logs | Loki single-binary (14d retention, 10Gi storage) |
| Log Shipping | Promtail (K3s DaemonSet + 4 external agents) |
| SIEM | Wazuh 4.14.3 All-in-One (LXC, 22 agents, 8 groups) |
| Alerting | Alertmanager → Email (critical) + ntfy (all) |
| Host Metrics | node_exporter on 7 hosts + OPNsense plugin |
| Proxmox | pve-exporter (API scrape, cluster mode) |
| Backup Monitoring | pbs-exporter (Proxmox Backup Server) |
| Network | UniFi Poller + SNMP Exporter (switch, TrueNAS) |
| UPS | NUT Exporter (Eaton Ellipse PRO 850) |
| DNS | AdGuard Exporter |
| Probing | Blackbox Exporter (21 ICMP, 12 HTTP, DNS, SMTP) |
| Disk Health | smartctl_exporter on pve3 |
| Bandwidth | Speedtest Exporter (4h intervals) |
| GPU | DCGM Exporter (NVIDIA RTX 3060, 15s scrape) |
| LoRaWAN | ChirpStack MQTT Exporter |
| Security Rules | 100+ custom Wazuh rules (IDs 100100–100499) |
| Active Response | SSH brute force + port scan blocking with escalation |
| Vulnerability | Wazuh CVE scanning + Grafana deep-dive dashboard |
| Deployment | Ansible (Wazuh) + Helm/Kustomize (K8s) |
| Ingress | Traefik (internal) + Cloudflare Tunnel (external) |
| TLS | cert-manager with Cloudflare DNS-01 wildcard |
| Infrastructure | Proxmox → Ubuntu 24.04 → K3s single-node |