TL;DR

I built a production-grade monitoring and SIEM platform for my entire homelab infrastructure running on a single-node K3s cluster. The system combines Prometheus for metrics, Grafana for visualization, Loki for log aggregation, and Wazuh for security event management — all deployed via Ansible and Helm with full Infrastructure as Code.

Key Metrics:

  • 42 Prometheus scrape targets
  • 33 custom Grafana dashboards + 10 community imports
  • 52 custom alert rules with intelligent inhibition
  • 22 Wazuh security agents across 3 OS families
  • 8 agent groups with specialized detection rules
  • ~100 custom Wazuh rules (IDs 100100–100499)
  • 3 Wazuh Grafana dashboards (SIEM, Compliance, Vulnerabilities)
  • 14-day log retention in Loki
  • Multi-tier alerting: Email + ntfy mobile push

Why Build Enterprise Monitoring for a Homelab?

Many homelabs run blind. Services crash, disks fill up, certificates expire, and you only notice when something stops working. I wanted the opposite: know about problems before they become outages.

Three goals drove the implementation:

Visibility: Every host, every service, every metric — in one place. From GPU temperature on the AI VM to battery charge on the UPS, from LoRaWAN sensor signal strength to Proxmox guest status.

Security: Running 20+ hosts with internet-facing services demands real intrusion detection, not just hoping firewalls are enough. Wazuh provides file integrity monitoring, vulnerability scanning, and active response across the entire fleet.

Automation: No manual log checking, no SSH-ing into boxes to check disk space. Alerts come to my phone. Dashboards show the full picture. Problems get detected and — in some cases — resolved automatically.


Architecture

The monitoring stack runs entirely within the monitoring namespace on a single-node K3s cluster, while Wazuh operates as an All-in-One LXC container on Proxmox. Both systems feed into the same Grafana instance.

Infrastructure:

| Component | Location | Role |
|---|---|---|
| Prometheus | K3s Pod | Metrics collection, 15-day retention, 15Gi storage |
| Grafana | K3s Pod | Visualization, 33+ custom dashboards |
| Loki | K3s Pod | Log aggregation, single-binary, 14-day retention |
| Alertmanager | K3s Pod | Alert routing, email + ntfy |
| Promtail | K3s DaemonSet + external agents | Log shipping from pods, PVE hosts, Wazuh |
| Wazuh Manager | LXC on Proxmox | SIEM: Manager + Indexer + Dashboard |
| 8 Exporters | K3s Pods | PVE, UniFi, Blackbox, SNMP, NUT, PBS, AdGuard, Speedtest |

Deployment Method:

| Stack | Tool | Source |
|---|---|---|
| Prometheus + Grafana + Alertmanager | Helm (kube-prometheus-stack) | kubernetes/monitoring/install.sh |
| Loki + Promtail | Helm | kubernetes/monitoring/install.sh |
| All exporters | Kustomize | kubernetes/monitoring/<exporter>/ |
| Wazuh Manager | Ansible | ansible/playbooks/configure-wazuh-manager.yml |
| Wazuh Agents | Ansible | ansible/playbooks/setup-wazuh-agents.yml |
| External monitoring agents | Shell scripts | scripts/setup-monitoring-hosts.sh |

Everything lives in a single Git repository — true Infrastructure as Code with Ansible playbooks for Wazuh and Helm/Kustomize for Kubernetes workloads.


The Monitoring Stack

Prometheus: 42 Scrape Targets

Prometheus sits at the center, scraping metrics from every layer of the infrastructure. The configuration in values.yaml defines 42 jobs organized by category:

Infrastructure Hosts (7 targets):

| Target | Host | Port | Interval |
|---|---|---|---|
| pve-nodes | 3 Proxmox hypervisors | 9100 | 30s |
| ai-vm | AI/GPU VM | 9100 | 30s |
| relay | Mail relay LXC | 9100 | 30s |
| opnsense | Firewall | 9100 | 30s |
| smartctl | Storage host | 9633 | 300s |
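Each of these jobs maps to a short entry in the kube-prometheus-stack values file. A minimal sketch of what two of them might look like (hostnames are placeholders; the real values.yaml isn't reproduced here):

```yaml
additionalScrapeConfigs:
  - job_name: pve-nodes
    scrape_interval: 30s
    static_configs:
      - targets:            # placeholder hostnames
          - pve1:9100
          - pve2:9100
          - pve3:9100
  - job_name: smartctl
    scrape_interval: 300s   # SMART polling is slow, so 5 min is plenty
    static_configs:
      - targets:
          - pve3:9633
```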

API Exporters (8 targets):

| Target | What It Monitors | Port | Interval |
|---|---|---|---|
| pve-exporter | Proxmox API (all VMs/CTs) | 9221 | 60s |
| pbs-exporter | Proxmox Backup Server | 10019 | 60s |
| unifi-poller | UniFi Controller (APs, clients) | 9130 | 30s |
| snmp-switch | D-Link switch | 9116 | 60s |
| snmp-truenas | TrueNAS SNMP | 9116 | 60s |
| nut-exporter | Eaton UPS (battery, load) | 9199 | 30s |
| adguard-exporter | AdGuard DNS analytics | 9618 | 30s |
| speedtest | Internet bandwidth (every 4h) | 9798 | 300s |

Service Health Probes (Blackbox Exporter):

| Probe Type | Targets | Module |
|---|---|---|
| ICMP Ping | 21 hosts (all infrastructure) | icmp_ping |
| HTTP 2xx | 12 internal services | http_2xx |
| HTTP Any | OPNsense (Let’s Encrypt) | http_any |
| DNS | OPNsense Unbound | dns_test |
| SMTP | Internal mail relay | smtp_relay |

Application-Specific (20+ targets):

GitLab alone exposes 5 scrape targets (exporter, webservice, gitaly, postgresql, redis). Additional targets include Traefik ingress metrics, cert-manager certificate lifecycle, Cloudflare Tunnel stats, ChirpStack LoRaWAN, MQTT sensor exporter, NVIDIA DCGM GPU metrics (15s interval!), Home Assistant, Wazuh SIEM exporter, and external services like the portfolio chatbot API with bearer token authentication.

The full target list reads like a network inventory — because it essentially is one.

Grafana: 33 Custom Dashboards

Grafana Dashboard Folders — AI Platform, Homelab, Infrastructure, Services

Every dashboard is deployed as a Kubernetes ConfigMap with the grafana_dashboard: "1" label, automatically discovered by Grafana’s sidecar. No manual import, no clicking through UIs — git push deploys dashboards.
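A minimal sketch of one such ConfigMap (the name and JSON body are illustrative; only the `grafana_dashboard: "1"` label matters to the sidecar):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboard-homelab-overview    # illustrative name
  namespace: monitoring
  labels:
    grafana_dashboard: "1"            # the sidecar watches for this label
data:
  homelab-overview.json: |
    {"title": "Homelab Overview", "panels": []}
```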

Custom Dashboards (ConfigMap-based):

| Dashboard | Key Panels | Data Source |
|---|---|---|
| Homelab Overview | Service status grid, host health, quick links | Prometheus |
| Wazuh SIEM | Agent fleet, alert categories, top rules | Prometheus (Wazuh exporter) |
| Wazuh Compliance & Threats | SCA scores, MITRE ATT&CK tactics, auth events | Prometheus |
| Wazuh Vulnerability Deep Dive | CVE counts by severity, per-host breakdown, trends | Prometheus |
| AI Platform Overview | GPU temp/utilization/VRAM, inference latency | Prometheus-AI |
| Portfolio www.pichler.dev | HTTP probe phases, SSL expiry, Docker metrics | Prometheus |
| OPNsense Firewall | Interface throughput, packet stats, rules | Prometheus |
| PBS Backup | Backup/verify age, datastore usage, job status | Prometheus |
| SMART & ZFS Health | Disk temperatures, pool status, error counts | Prometheus |
| NUT UPS | Battery charge, runtime, load, input voltage | Prometheus |
| Traefik Ingress | Request rate, latency percentiles, error codes | Prometheus |
| Loki & Promtail | Ingestion rate, query latency, dropped logs | Prometheus + Loki |
| LoRaWAN Sensors | Battery %, RSSI, SNR, last seen | Prometheus |
| Power Cost & Energy | UPS consumption → €/month estimation | Prometheus |
| SLO & Uptime Tracking | Service availability percentages | Prometheus |
| Network Map & Status | L2 topology, link utilization | Prometheus |

Plus 10 community dashboards imported by gnetId:

| Dashboard | gnetId | Purpose |
|---|---|---|
| Node Exporter Full | 1860 (rev 42) | Comprehensive host metrics |
| Proxmox VE Cluster | 10347 | VM/CT overview |
| UniFi Client/UAP/USW/Sites | 11315/11314/11312/11311 | Network analytics |
| Blackbox Exporter | 7587 | Probe results |
| MinIO | 13502 | Object storage |
| SNMP Stats | 11169 | Switch metrics |
| cert-manager | 20842 | Certificate lifecycle |

Wazuh SIEM Grafana Dashboard — Manager UP, 19 agents, 236k alerts, pie charts for categories and severity

Alertmanager: 52 Custom Rules with Smart Routing

Alerts aren’t useful if they wake you up for non-issues. The alerting system uses inhibition rules to suppress noise — if a host is down, don’t also alert about its services being unreachable.

Alert Groups (a selection of the 20 categories, 52 rules total):

| Group | Rules | Examples |
|---|---|---|
| host-alerts | 11 | HostDown, HighCPU, HighMemory, DiskSpaceCritical/Warning, SmartDiskErrors |
| service-alerts | 5 | ServiceDown, SlowResponse, SmtpRelayDown, DnsDown |
| ups-alerts | 3 | UpsOnBattery (immediate!), UpsLowBattery, UpsBatteryReplace |
| kubernetes-alerts | 2 | K3sNodeNotReady, PodCrashLooping |
| wazuh-alerts | 4 | WazuhManagerDown, AgentDisconnected, HighCriticalCVEs, AlertSpike |
| gpu-alerts | 4 | GPUHighTemperature, GPUCriticalTemperature, GPUMemoryHigh |
| backup-alerts | 4 | PbsBackupStale, PbsVerifyStale, K3sBackupStale |
| certificate-alerts | 2 | CertExpiringSoon (30d), CertExpiryCritical (7d) |
| chatbot-alerts | 3 | ChatbotAPIDown, HighErrorRate, HighResponseTime |
| lorawan-alerts | 2 | LoRaSensorOffline (>2h), LoRaSensorBatteryLow (<15%) |
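As a sketch, a host-alerts rule in kube-prometheus-stack notation (the selector, duration, and annotation text are illustrative, not the exact rule from the repository's values.yaml):

```yaml
additionalPrometheusRulesMap:
  homelab-rules:
    groups:
      - name: host-alerts
        rules:
          - alert: HostDown
            expr: up{job="pve-nodes"} == 0   # illustrative selector
            for: 2m                          # avoid flapping on one missed scrape
            labels:
              severity: critical
            annotations:
              summary: "Host {{ $labels.instance }} is down"
```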

Inhibition Logic (5 rules):

HostDown         → suppresses all warnings for that host
UpsOnBattery     → suppresses non-critical alerts
DiskCritical     → suppresses DiskWarning (same mountpoint)
CertCritical     → suppresses CertWarning (same instance)
DiskFill3Days    → suppresses DiskFill7Days
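In Alertmanager configuration, two of these suppressions look roughly like this (the matcher labels are assumptions about how the rules are labeled):

```yaml
inhibit_rules:
  # HostDown silences every warning-level alert for the same instance
  - source_matchers: ['alertname = HostDown']
    target_matchers: ['severity = warning']
    equal: ['instance']
  # The critical disk alert silences the warning tier on the same mountpoint
  - source_matchers: ['alertname = DiskSpaceCritical']
    target_matchers: ['alertname = DiskSpaceWarning']
    equal: ['instance', 'mountpoint']
```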

Notification Routing:

| Severity | Channel | Timing |
|---|---|---|
| Critical | Email + ntfy push | Immediate, repeat 7d |
| Warning | ntfy push only | No repeat |
| UpsOnBattery | Email + ntfy | group_wait: 0s, repeat: 5min |
| Info | Silence | Dashboard only |

The ntfy-bridge is a custom Python service that translates Alertmanager webhooks into ntfy push notifications with priority mapping. Critical alerts get priority 5 (urgent), warnings priority 3 (default), and resolved notifications priority 2 (low). Separate topics for homelab-critical and homelab-warnings keep the notification channels clean.
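A minimal sketch of such a bridge (hypothetical reconstruction, not the actual service: the ntfy URL is a placeholder, and only the priority/topic mapping mirrors the scheme described above):

```python
# Sketch of an Alertmanager -> ntfy webhook bridge.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request

NTFY_URL = "http://ntfy.local"     # assumption: internal ntfy server
TOPIC_CRITICAL = "homelab-critical"
TOPIC_WARNING = "homelab-warnings"

def ntfy_priority(severity: str, status: str) -> int:
    """Map Alertmanager severity/status to ntfy priority (1-5)."""
    if status == "resolved":
        return 2                    # low: resolved notifications
    return 5 if severity == "critical" else 3   # urgent vs. default

def topic_for(severity: str) -> str:
    """Keep critical and warning streams on separate topics."""
    return TOPIC_CRITICAL if severity == "critical" else TOPIC_WARNING

class BridgeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        payload = json.loads(body)
        for alert in payload.get("alerts", []):
            sev = alert["labels"].get("severity", "warning")
            req = request.Request(
                f"{NTFY_URL}/{topic_for(sev)}",
                data=alert["annotations"].get("summary", "alert").encode(),
                headers={"Priority": str(ntfy_priority(sev, alert["status"]))},
            )
            request.urlopen(req)    # fire-and-forget push
        self.send_response(200)
        self.end_headers()

def main() -> None:
    HTTPServer(("0.0.0.0", 8080), BridgeHandler).serve_forever()
```

Alertmanager then points a webhook receiver at this service's port.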

Loki: Centralized Log Aggregation

Loki runs in single-binary mode — all components in one pod. For a homelab, this is the sweet spot between simplicity and capability.

Log Sources (3 tiers):

Tier 1 — K3s Pods (Promtail DaemonSet): All pod logs from /var/log/pods are automatically collected with Kubernetes label enrichment. Zero configuration per service.

Tier 2 — PVE Hosts (External Promtail agents): Installed via scripts/install-node-exporter.sh --with-promtail on pve1, pve2, pve3. Ships syslog, auth.log, pveproxy, pvedaemon, kernel, and systemd journal to Loki’s NodePort (31000).
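A sketch of the Promtail client config these agents run with (label names and the per-host scrape entry are illustrative; only the NodePort is from the setup above):

```yaml
clients:
  - url: http://<k3s-node>:31000/loki/api/v1/push   # Loki NodePort
scrape_configs:
  - job_name: pve-syslog
    static_configs:
      - targets: [localhost]
        labels:
          job: syslog
          host: pve1               # set per host by the install script
          __path__: /var/log/syslog
```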

Tier 3 — Wazuh Manager (Dedicated Promtail): The most interesting source. Promtail on the Wazuh LXC ships:

| Log Source | Format | Content |
|---|---|---|
| wazuh-alerts | JSON | Security alerts with rule IDs, severity levels |
| wazuh-manager | Text | Manager operational logs (ossec.log) |
| active-responses | Text | IP blocks, firewall drops |
| wazuh-api | Text | Dashboard/API access logs |
| syslog + auth | Text | System and authentication events |
| systemd journal | Structured | Service lifecycle events |

This creates a powerful correlation capability: Prometheus shows you what is happening (metrics), Loki shows you why (logs), and Wazuh shows you who is responsible (security events).

Configuration:

# Loki retention
limits_config:
  retention_period: 336h  # 14 days
  max_query_series: 50000
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_streams_per_user: 10000

External Host Monitoring

Not everything runs in Kubernetes. The PVE hypervisors, OPNsense firewall, and Wazuh LXC need monitoring agents installed directly.

Installation:

# One script to rule them all
./scripts/setup-monitoring-hosts.sh

# Installs on pve1/2/3:
#   - node_exporter (port 9100) — CPU, RAM, disk, network
#   - promtail (log shipping to Loki NodePort)
# Installs on pve3 additionally:
#   - smartctl_exporter (port 9633) — SMART disk health

OPNsense uses its native os-node_exporter plugin — installed via System → Firmware → Plugins. No SSH needed.

TrueNAS and MinIO expose Prometheus endpoints natively — just enable in their respective UIs.


Wazuh SIEM: Security for Every Host

Wazuh Dashboard Overview — 19 active agents, alert severity breakdown, endpoint and threat intelligence modules

Why Wazuh?

Wazuh is an open-source SIEM that combines log analysis, intrusion detection, file integrity monitoring, vulnerability detection, and active response in one platform. For a homelab with internet-facing services, this isn’t optional — it’s essential.

Deployment: All-in-One LXC

| Property | Value |
|---|---|
| Platform | LXC on Proxmox |
| Resources | 4 vCPU, 6GB RAM, 50GB Disk |
| Version | Wazuh 4.14.3 |
| Components | Manager + OpenSearch Indexer + Dashboard |

Deployed and configured entirely via Ansible:

# Deploy agents to all Linux hosts
ansible-playbook -i inventory.yml playbooks/setup-wazuh-agents.yml

# Configure Manager (groups, rules, active response, email)
ansible-playbook -i inventory.yml playbooks/configure-wazuh-manager.yml

Agent Fleet: 22 Agents, 8 Groups

Every host in the homelab runs a Wazuh agent. Agents are organized into groups with tailored configurations:

Agent Groups:

| Group | Agents | Specialization |
|---|---|---|
| proxmox | 3 hypervisors | Hypervisor config monitoring, /etc/pve FIM |
| kubernetes | K3s node | K3s audit logs, K8s events, pod status, containerd |
| storage | 4 storage hosts | Backup configs, ZFS settings, storage credentials |
| network | 3 network services | Network service configs, Docker listener |
| services | 10 agents | Docker lifecycle, Nginx logs, service configs |
| ai-workload | GPU VM | Ignores large model files (.gguf, .safetensors) |
| windows | 1 Windows host | Windows Event Logs, Sysmon (pending deployment) |
| siem | Wazuh (self) | Self-monitoring: Manager, Dashboard, OpenSearch configs |

Each group has its own agent.conf in ansible/files/wazuh/shared/<group>/, defining:

  • FIM (File Integrity Monitoring): Which paths to watch in realtime vs. scheduled scans
  • Syscollector: Hardware/software inventory intervals
  • Localfile: Which logs to collect and parse
  • Docker listener: Container lifecycle events (enabled on 6 hosts)
  • Vulnerability detection: OS and package scanning

Example — Kubernetes Group Configuration:

The kubernetes agent does the heavy lifting for cluster security:

<!-- K3s audit log (JSON) -->
<localfile>
  <log_format>json</log_format>
  <location>/var/log/k3s-audit.log</location>
</localfile>

<!-- K8s Warning events (streamed) -->
<localfile>
  <log_format>json</log_format>
  <location>/var/log/k8s-events.log</location>
</localfile>

<!-- Pod status check (every 2 min) -->
<localfile>
  <log_format>full_command</log_format>
  <command>kubectl get pods --all-namespaces -o json | jq ...</command>
  <frequency>120</frequency>
</localfile>

<!-- Container status check (every 2 min) -->
<localfile>
  <log_format>full_command</log_format>
  <command>crictl ps -a -o json | jq ...</command>
  <frequency>120</frequency>
</localfile>

Custom Rules: 100 Rules Across 5 Categories

Every custom rule has a specific ID range and severity level. Severity determines action: level 10+ gets logged prominently, level 13+ triggers email alerts.

Proxmox Rules (100100–100199):

| Rule ID | Event | Level |
|---|---|---|
| 100100-101 | VM/CT start/stop | 5 |
| 100102 | VM migration failure | 10 |
| 100103 | Cluster membership change | 8 |
| 100104-105 | Backup job failed/success | 10/3 |
| 100106 | Storage config change | 7 |
| 100107 | Ceph degradation | 12 |

K3s Audit Rules (100200–100206):

| Rule ID | Event | Level |
|---|---|---|
| 100200 | Secret access/modification | 10 |
| 100201 | Pod deletion | 8 |
| 100202 | RBAC denial (403) | 12 |
| 100203 | Namespace operations | 7 |
| 100204 | Workload changes (deploy/ds/sts) | 5 |
| 100205 | RBAC config changes | 10 |
| 100206 | kubectl exec into pods | 8 |

Kubernetes Container Monitoring (100210–100241):

| Rule ID | Event | Level |
|---|---|---|
| 100211 | OOMKilled | 12 |
| 100212 | CrashLoopBackOff | 10 |
| 100213 | Image pull errors | 8 |
| 100215 | Pod evictions | 10 |
| 100219 | Node not ready | 12 |
| 100240 | Multiple OOMKills (3+ in 10min) | 13 (email!) |
| 100241 | Multiple CrashLoopBackOff (3+ in 5min) | 12 |

Docker Container Monitoring (100250–100256):

| Rule ID | Event | Level |
|---|---|---|
| 100250 | Container start | 5 |
| 100252 | Container died unexpectedly | 10 |
| 100253 | Container OOMKilled | 8 |
| 100255 | Command executed in container | 10 |
| 100256 | Multiple container deaths (3+ in 10min) | 12 |

Homelab Security Rules (100400–100410):

| Rule ID | Event | Level |
|---|---|---|
| 100400/450 | SSH from non-local (suppressed for trusted subnets) | 8/0 |
| 100404 | SSH brute force detection | 10 → Active Response |
| 100405 | New user account creation | 8 |
| 100408 | Disk space exhausted | 12 |
| 100409 | OOM killer triggered | 12 |
| 100410 | Certificate expiration warning | 8 |
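A hedged reconstruction of what a rule like 100404 looks like in Wazuh rule XML (the parent SID, frequency, and timeframe are assumptions, not the repository's actual values):

```xml
<group name="homelab,sshd,">
  <rule id="100404" level="10" frequency="8" timeframe="120">
    <!-- 5716 is Wazuh's built-in "sshd: authentication failed" rule -->
    <if_matched_sid>5716</if_matched_sid>
    <same_source_ip/>
    <description>SSH brute force from a single source IP</description>
  </rule>
</group>
```

Level 10 crosses the active-response threshold, which is what hands the source IP to firewall-drop.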

Custom Decoders

Seven custom decoders parse non-standard log formats:

proxmox-task    → pvedaemon/pveproxy/pvestatd/pveceph logs
pbs-task        → Proxmox Backup Server proxy/manager logs
k3s-audit       → K3s API server audit events (JSON)
k8s-event       → Kubernetes Warning events from k8s-event-logger
k8s-pod-status  → kubectl pod status JSON output
containerd-status → crictl container status JSON output
opnsense-filterlog → OPNsense packet filter log parsing

Active Response: Automated Threat Mitigation

Wazuh doesn’t just detect — it responds. The active response system implements escalating IP blocking:

SSH Brute Force Response:

Trigger: Repeated failed SSH login attempts
Action: firewall-drop (iptables block)

Escalation: Increasing block durations from minutes to hours
Port Scan: Automatic temporary block

An IP whitelist protects infrastructure hosts (hypervisors, K3s master, Wazuh manager) from accidental self-lockout — a lesson learned the hard way in many SIEM deployments.

K3s Audit Logging

Kubernetes API audit logging captures every request to the K3s API server. The audit policy defines four levels:

| Level | Events |
|---|---|
| None | Health checks, list/watch, system service accounts |
| Metadata | Secret ops, RBAC changes, namespace ops, workload changes |
| RequestResponse | Pod exec/attach/portforward |
| RequestResponse | RBAC modifications |
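In Kubernetes audit-policy form these levels look roughly like this (a sketch; the actual policy file in the playbook has more rules, and order matters because the first matching rule wins):

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: None                      # drop health-check and watch noise
    nonResourceURLs: ["/healthz*", "/readyz*", "/livez*"]
  - level: RequestResponse           # full payload for interactive access
    resources:
      - group: ""
        resources: ["pods/exec", "pods/attach", "pods/portforward"]
  - level: Metadata                  # who touched Secrets, not their contents
    resources:
      - group: ""
        resources: ["secrets"]
```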

Implementation:

# WARNING: ~30s API downtime during deployment
ansible-playbook -i inventory.yml playbooks/setup-k3s-audit-logging.yml

The playbook configures K3s with an audit policy file and log rotation (7 days, 100MB max, 3 backups). Wazuh’s kubernetes agent reads /var/log/k3s-audit.log via the custom k3s-audit decoder.

This catches critical events: someone accessing Secrets, RBAC permission denials (potential privilege escalation attempts), unauthorized kubectl exec into pods, and workload modifications.

K8s Container Monitoring

A two-part system provides container-level visibility:

Part 1 — k8s-event-logger (systemd service):

# Streams K8s Warning events in real-time
kubectl get events --all-namespaces \
  --field-selector type!=Normal \
  --watch-only \
  -o json >> /var/log/k8s-events.log

This captures OOMKilled, CrashLoopBackOff, ImagePullBackOff, scheduling failures, and evictions as they happen.

Part 2 — Periodic Status Checks (every 2 minutes):

# Pods with restartCount > 3, waiting containers, or OOMKilled
kubectl get pods --all-namespaces -o json | jq '...'

# Non-running containers via containerd
crictl ps -a -o json | jq '...'

The combination of real-time event streaming and periodic health checks ensures nothing slips through.

Docker Container Monitoring

Six hosts run Docker alongside the Wazuh agent. The Docker listener wodle captures container lifecycle events:

| Host Type | Containers Monitored |
|---|---|
| Web server | Portfolio, chatbot, tunnel, analytics |
| Network | UniFi Controller |
| Security | Password manager |
| Media | Media server |
| Communication | Matrix/Synapse |
| Infrastructure | UPS notification service |

Events tracked: start, stop, die (unexpected), oom, pull, exec_start. An unexpected container death (rule 100252, level 10) gets immediate attention; three deaths in 10 minutes (rule 100256, level 12) signal a systemic problem.

OPNsense: FreeBSD Agent

OPNsense Firewall Grafana Dashboard — interface throughput, packet statistics

OPNsense requires special handling — it’s FreeBSD, not Linux. Deployment uses the native plugin:

System → Firmware → Plugins → os-wazuh-agent → Install
Services → Wazuh Agent → Settings → Manager: <wazuh-ip> → Enable

The custom opnsense-filterlog decoder parses OPNsense’s unique packet filter log format, extracting rule numbers, interfaces, source/destination IPs, and actions.

Email Alerting

Email alerting is configured with a high severity threshold — only critical events (multiple OOMKills in rapid succession, RBAC denials, Ceph degradation, or brute force escalations) trigger email notifications. This prevents warning spam. A daily summary report covers lower-severity events for non-urgent review.

Vulnerability Detection

Wazuh Vulnerability Deep Dive — 14,339 total CVEs: 76 Critical, 2000 High, per-host breakdown and trends

Wazuh Vulnerability Detection Dashboard — CVE severity counters, top vulnerabilities, CVSS scores, trends by year

Wazuh scans every agent for known CVEs using package inventory data and vulnerability feeds (updated hourly). The Grafana Vulnerability Deep Dive dashboard visualizes:

  • Total CVE count by severity (Critical/High/Medium/Low)
  • Top 15 CVEs by affected host count
  • Per-host vulnerability breakdown
  • CVE trends over time
  • Agent keepalive staleness (detecting disconnected agents)

This drives my vulnerability remediation workflow:

# Patch all hosts (rolling update, one at a time)
ansible-playbook -i inventory.yml playbooks/patch-vulnerabilities.yml

Wazuh ↔ Grafana Integration

Wazuh Compliance & Threats Dashboard — SCA scores, MITRE ATT&CK tactics, authentication events

The bridge between Wazuh and Grafana is a custom Prometheus exporter running on the Wazuh LXC. It queries the Wazuh Manager API and exposes 33 metric families:

Key Metrics Exported:

| Metric | Description |
|---|---|
| wazuh_agents_active | Number of connected agents |
| wazuh_agents_disconnected | Agents that lost connection |
| wazuh_alerts_24h | Alert volume (last 24 hours) |
| wazuh_alerts_by_level | Alerts grouped by severity |
| wazuh_vulnerabilities_by_severity | CVE counts (critical/high/medium/low) |
| wazuh_sca_score | Security Configuration Assessment score per agent |
| wazuh_mitre_tactic_count | MITRE ATT&CK tactic distribution |
| wazuh_fim_entries | File integrity monitoring file count |
| wazuh_active_response_24h | Automated blocks in last 24 hours |
| wazuh_agent_keepalive_age_seconds | Agent staleness indicator |
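The core of such a bridge is small: query the Manager API, reshape the response into Prometheus text exposition. A hedged sketch (the endpoint path and field names are assumptions modeled on the Wazuh 4.x REST API, not the actual exporter code):

```python
# Sketch of the Wazuh -> Prometheus exporter bridge.
import json
from urllib import request

WAZUH_API = "https://localhost:55000"   # assumption: Manager API on the LXC

def fetch_agent_summary(token: str) -> dict:
    """Query the Manager API for agent connection status (assumed endpoint)."""
    req = request.Request(
        f"{WAZUH_API}/agents/summary/status",
        headers={"Authorization": f"Bearer {token}"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["data"]

def render_metrics(summary: dict) -> str:
    """Render an agent-status summary as Prometheus text exposition."""
    lines = [
        "# TYPE wazuh_agents_active gauge",
        f"wazuh_agents_active {summary.get('active', 0)}",
        "# TYPE wazuh_agents_disconnected gauge",
        f"wazuh_agents_disconnected {summary.get('disconnected', 0)}",
    ]
    return "\n".join(lines) + "\n"
```

Serve `render_metrics()` on a `/metrics` endpoint and Prometheus scrapes it like any other exporter; the real exporter repeats this pattern for 33 metric families.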

Three Dedicated Dashboards:

1. Wazuh SIEM Dashboard (uid: wazuh-siem)

  • SIEM Overview: Manager status, agent counts, 24h alert volume
  • Security Alerts: By agent (pie), by category (pie), by severity
  • Events Breakdown: SCA, Rootcheck, FIM, AppArmor, K8s Audit stats
  • Vulnerability Assessment: Severity distribution, per-host CVE bars
  • Agent Fleet: Status grid (online/offline), OS distribution
  • Trends: Alert rate, agent fleet, FIM entries over time

2. Wazuh Compliance & Threats (uid: wazuh-compliance)

  • SCA Compliance: Score per agent (bar gauge), pass/fail breakdown
  • MITRE ATT&CK: Tactic distribution, top 15 techniques, trends
  • Authentication: Success/failed counters, active response count

3. Wazuh Vulnerability Deep Dive (uid: wazuh-vulns)

  • Total CVEs with critical/high/medium/low breakdown
  • Top 15 CVEs by count
  • Per-host analysis with severity breakdown
  • Agent health via keepalive age

The Complete Data Flow

Complete data flow: Infrastructure fleet → Wazuh SIEM + Exporters + Promtail → Prometheus, Loki, Alertmanager → Grafana dashboards and notifications


Lessons Learned

1. Inhibition rules are non-negotiable. Without them, a single host going down generates 5+ alerts (host down + services down + probes failing). With inhibition, you get one alert.

2. Separate critical from noise early. Wazuh’s email threshold at level 13 and Alertmanager’s critical/warning split prevent alert fatigue. If everything is urgent, nothing is.

3. Single-binary Loki is perfect for homelabs. The microservices deployment mode is overkill. Single-binary with 14-day retention on local storage handles everything a homelab needs.

4. Custom decoders make or break Wazuh. Out-of-the-box Wazuh doesn’t understand Proxmox, K3s audit, or OPNsense logs. Seven custom decoders were needed to make the data useful.

5. Active response needs a whitelist. Without one, Wazuh’s SSH brute force blocking will eventually block your management IPs. The whitelist for PVE hosts and K3s master prevents self-lockout.

6. GitOps dashboards > manual dashboards. ConfigMap-based Grafana dashboards survive pod restarts, are version-controlled, and deploy automatically. Never create dashboards through the UI in production.

7. The Prometheus exporter bridge is worth building. Wazuh’s native dashboard is great for investigation, but Grafana provides the unified view. A custom exporter bridging the two gives you the best of both worlds.

8. Monitor the monitoring. Dedicated alerts for LokiDown, PromtailDown, WazuhManagerDown, WazuhAgentDisconnected, and PbsExporterDown ensure the observability stack itself stays healthy.


Built With Claude Code

The entire monitoring and SIEM stack — all Ansible playbooks, Helm values, custom rules, decoders, dashboards, exporters, and alert configurations — was built using Claude Code (Claude Opus).

My role: Architect defining requirements, reviewing outputs, and making security decisions.

Claude’s execution:

  • 1,596-line Helm values.yaml with 42 scrape targets and 52 alert rules
  • 100+ Wazuh custom rules with proper severity levels and escalation
  • 7 custom decoders for non-standard log formats
  • 33 Grafana dashboard ConfigMaps with PromQL queries
  • 8 exporter Kustomize deployments with proper resource limits
  • Ansible playbooks for automated agent deployment across Linux, FreeBSD, and Windows
  • ntfy-bridge Python service for mobile push notifications
  • Shell scripts for external monitoring agent installation

This is infrastructure-as-code at scale — the kind of work that would take weeks manually, delivered in days with AI-assisted development.


What’s Next

Shipped:

  • 42 Prometheus targets covering all infrastructure
  • 22 Wazuh agents with specialized group configs
  • 52 custom alert rules with inhibition
  • 33 Grafana dashboards
  • Multi-tier alerting (email + ntfy)
  • Active response (SSH brute force, port scan blocking)
  • Restore testing automation (verify-backup.sh — monthly random VM restore from PBS, boot verification, auto-cleanup, ntfy alerts)
  • Server closet sensor (LoRaWAN sensors for temperature, humidity, CO2, VOC via ChirpStack MQTT exporter)
  • UPS auto-shutdown orchestration (NUT client on all PVE hosts, graceful shutdown on LOWBATT, ntfy notifications)
  • Incident Timeline dashboard (alert correlation, Loki error logs, ArgoCD deploy tracking, system resources)

Roadmap:

  • Server room sensor: LoRaWAN sensor in the server closet measuring temperature, humidity (and depending on model CO2/VOC), displayed via ChirpStack → MQTT Exporter → Prometheus → Grafana

The Stack (Complete Reference)

| Layer | Technology |
|---|---|
| Metrics | Prometheus (kube-prometheus-stack Helm, 15d retention) |
| Visualization | Grafana (33 custom + 10 community dashboards) |
| Logs | Loki single-binary (14d retention, 10Gi storage) |
| Log Shipping | Promtail (K3s DaemonSet + 4 external agents) |
| SIEM | Wazuh 4.14.3 All-in-One (LXC, 22 agents, 8 groups) |
| Alerting | Alertmanager → Email (critical) + ntfy (all) |
| Host Metrics | node_exporter on 7 hosts + OPNsense plugin |
| Proxmox | pve-exporter (API scrape, cluster mode) |
| Backup Monitoring | pbs-exporter (Proxmox Backup Server) |
| Network | UniFi Poller + SNMP Exporter (switch, TrueNAS) |
| UPS | NUT Exporter (Eaton Ellipse PRO 850) |
| DNS | AdGuard Exporter |
| Probing | Blackbox Exporter (21 ICMP, 12 HTTP, DNS, SMTP) |
| Disk Health | smartctl_exporter on pve3 |
| Bandwidth | Speedtest Exporter (4h intervals) |
| GPU | DCGM Exporter (NVIDIA RTX 3060, 15s scrape) |
| LoRaWAN | ChirpStack MQTT Exporter |
| Security Rules | 100+ custom Wazuh rules (IDs 100100–100499) |
| Active Response | SSH brute force + port scan blocking with escalation |
| Vulnerability | Wazuh CVE scanning + Grafana deep-dive dashboard |
| Deployment | Ansible (Wazuh) + Helm/Kustomize (K8s) |
| Ingress | Traefik (internal) + Cloudflare Tunnel (external) |
| TLS | cert-manager with Cloudflare DNS-01 wildcard |
| Infrastructure | Proxmox → Ubuntu 24.04 → K3s single-node |