🧨 Cluster Outage Postmortem: Missing Keepalived
2026-01-14 10:04
🧹 Incident Trigger
This outage was not caused by a power failure, kernel panic, or misconfiguration.
It was caused by my wife shutting down my main server (Tiny) while vacuuming the living room.
Tiny was not just another node. Tiny happened to be the entry point of the entire cluster. All external port forwarding rules pointed at Tiny:
- HTTPS
- SSH
- Service endpoints
- Gemini
- Gopher
- Finger
This meant that when Tiny went down, NOTHING could be reached from outside my home network, even though other nodes were still up and healthy.
⚠️ Root Cause
The cluster had no high-availability IP management. No:
- Virtual IP (VIP)
- Automatic failover
- Master election
- Redundancy at the network layer
All service ports were implicitly tied to a single machine.
This meant: One shutdown = total service loss.
Detection
The failure was immediately visible:
- All services unreachable
- External access dead
- Internal routing broken
- No automatic recovery
This confirmed the presence of a hard single point of failure at the network ingress layer.
🛠️ Resolution: Keepalived Deployment
Keepalived is a Linux daemon that provides high availability (HA) and load balancing for server clusters. Its HA side implements the Virtual Router Redundancy Protocol (VRRP), which lets a Virtual IP (VIP) fail over between servers so that service continues even if the master node fails.
Keepalived Official Website
I installed keepalived on all 5 machines in the cluster.
Each node now participates in a VRRP group.
New Setup
- 5 nodes
- 1 shared Virtual IP (VIP)
- Automatic master election
- Automatic failover
- Health-based priority handling
If the active node goes down, another node takes over the VIP within seconds. No manual intervention required.
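The health-based priority handling can be sketched with a `vrrp_script` block. This is a minimal sketch under assumptions, not the configuration actually deployed: the check name `chk_ingress`, the idea that a running nginx process means "healthy", the interface `eth0`, the router ID 51, and the VIP 192.168.1.250/24 are all hypothetical placeholders.

```
# Hypothetical health check: the node counts as healthy while an nginx process exists.
vrrp_script chk_ingress {
    script "/usr/bin/pgrep nginx"   # assumed path and service; exit code 0 means healthy
    interval 2                      # run the check every 2 seconds
    fall 2                          # 2 consecutive failures mark the check as failed
    rise 2                          # 2 consecutive successes mark it as healthy again
    weight -20                      # subtract 20 from this node's priority while failed
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0                  # assumed interface name
    virtual_router_id 51            # assumed ID; must match on every node
    priority 150
    advert_int 1
    virtual_ipaddress {
        192.168.1.250/24            # assumed VIP
    }
    track_script {
        chk_ingress
    }
}
```

For the check to trigger a failover, the penalty has to push the node below the next node's priority. With priorities spaced 10 apart, a weight of -20 is enough.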
What Is VRRP?
VRRP (Virtual Router Redundancy Protocol) allows multiple machines to share a single IP address.
At any given time:
- One node is MASTER
- The others are BACKUP
- The master advertises its state via multicast; the backups listen for those advertisements
- Priority determines who becomes master
- If the master disappears, the highest-priority backup takes over
The VIP is moved automatically between machines. To the network, nothing changes. To clients, nothing breaks.
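How quickly a backup notices a dead master follows from the VRRP timers. As a rough worked example, assuming VRRP version 2 (Keepalived's default), an advertisement interval of 1 second, and a hypothetical backup priority of 140:

```latex
\begin{align*}
\text{Skew\_Time} &= \frac{256 - \text{Priority}}{256}\ \text{s}
                   = \frac{256 - 140}{256}\ \text{s} \approx 0.45\ \text{s} \\
\text{Master\_Down\_Interval} &= 3 \times \text{Advertisement\_Interval} + \text{Skew\_Time}
                   \approx 3 \times 1\ \text{s} + 0.45\ \text{s} = 3.45\ \text{s}
\end{align*}
```

A higher priority gives a smaller skew time, so the highest-priority backup times out first and claims the VIP before the others.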
Network Architecture Update
All port forwardings now target the shared virtual IP, not a specific machine. This removes the dependency on any single node acting as the gateway.
Benefits:
- No more node-specific NAT rules
- No more hardcoded ingress points
- Transparent failover
- Stable external endpoint
- Services survive node shutdowns
From the outside, the cluster now appears as one resilient system.
Keepalived Configuration Examples
MASTER Node Example
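A minimal sketch of what a MASTER node's `/etc/keepalived/keepalived.conf` might look like. The instance name `VI_1`, the interface `eth0`, the router ID 51, the password, and the VIP 192.168.1.250/24 are illustrative assumptions, not the values used in this cluster.

```
vrrp_instance VI_1 {
    state MASTER                # initial state; the election itself is decided by priority
    interface eth0              # assumed interface name; check with `ip link`
    virtual_router_id 51        # assumed ID (1-255); must match on every node
    priority 150                # highest priority in the group
    advert_int 1                # send an advertisement every second

    authentication {
        auth_type PASS
        auth_pass changeme      # placeholder; at most 8 characters are used
    }

    virtual_ipaddress {
        192.168.1.250/24        # assumed VIP; must not be configured statically anywhere
    }
}
```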
BACKUP Node Example
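A BACKUP node would carry the same VRRP instance with a lower priority. Again a sketch under assumptions: the interface name, router ID, password, and VIP below are placeholders and have to match whatever the rest of the group uses.

```
vrrp_instance VI_1 {
    state BACKUP                # starts as backup and waits for advertisements
    interface eth0              # assumed interface name; may differ per machine
    virtual_router_id 51        # assumed ID; must match on every node
    priority 140                # lower than the master's priority
    advert_int 1                # must match the master's interval

    authentication {
        auth_type PASS
        auth_pass changeme      # placeholder; must match the other nodes
    }

    virtual_ipaddress {
        192.168.1.250/24        # same assumed VIP as on every other node
    }
}
```

With five nodes, the remaining backups could use, for example, 130, 120, and 110, which gives a fixed failover order.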
Notes
- virtual_router_id must match on all nodes
- Highest priority wins
- The network interface name must be correct for each machine
- The VIP (shared IP address) must not be statically assigned to any machine
With 5 nodes, I simply stagger priorities.
🧠 Lessons Learned
High availability is not about having multiple machines. It's about:
- eliminating single points of failure
- automating failover
- designing for accidental shutdowns
- assuming humans will unplug things
🪦 Epilogue: The Sacrifice of Tiny
Tiny did not crash. Tiny was chosen. Chosen by fate. Chosen by dust. Chosen by a power button.
And from its untimely shutdown, a highly available cluster was born.
Your sacrifice will be remembered, Tiny.