🧨 Cluster Outage Postmortem: Missing Keepalived

📆 2026-01-14 10:04

🧹 Incident Trigger

This outage was not caused by a power failure, kernel panic, or misconfiguration.

It was caused by my wife shutting down my main server (Tiny) while vacuuming the living room.

Tiny was not just another node. Tiny happened to be the entry point of the entire cluster. All external port forwarding rules pointed at Tiny:

HTTPS
SSH
Service endpoints
Gemini
Gopher
Finger

This meant that when Tiny went down, NOTHING could be reached from outside my home network, even though other nodes were still up and healthy.

☠️ Root Cause

The cluster had no high-availability IP management. No:

Virtual IP (VIP)
Automatic failover
Master election
Redundancy at the network layer

All service ports were implicitly tied to a single machine.

This meant: One shutdown = total service loss.

📡 Detection

The failure was immediately visible:

All services unreachable
External access dead
Internal routing broken
No automatic recovery

This confirmed the presence of a hard single point of failure at the network ingress layer.

🛠️ Resolution: Keepalived Deployment

Keepalived is a Linux daemon that provides high availability (HA) and load balancing for server clusters. It implements the Virtual Router Redundancy Protocol (VRRP), which lets a Virtual IP (VIP) fail over between servers so that services stay reachable even if the master node fails.

🔖 KeepAlived Official Website

I installed keepalived on all 5 machines in the cluster.

sudo apt install keepalived

Each node now participates in a VRRP group.

New Setup

5 nodes
1 shared Virtual IP (VIP)
Automatic master election
Automatic failover
Health-based priority handling

If the active node goes down, another node takes over the VIP within seconds. No manual intervention required.
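To put a number on "within seconds": a backup declares the master dead after a fixed silence window. The sketch below assumes the VRRP version 2 timing formula (three advertisement intervals plus a priority-based skew) and the `advert_int 1` used in this post. The function name is mine, not part of keepalived.

```python
def master_down_interval(priority: int, advert_int: float = 1.0) -> float:
    """Seconds a BACKUP waits without advertisements before taking over.

    Assumes the VRRP version 2 formula: 3 * advert_int + skew_time.
    """
    if not 1 <= priority <= 254:
        raise ValueError("backup priority must be between 1 and 254")
    # Higher priority means a smaller skew, so the best backup reacts first.
    skew_time = (256 - priority) / 256
    return 3 * advert_int + skew_time


print(master_down_interval(100))  # 3.609375
print(master_down_interval(200))  # 3.21875
```

So with a one-second advertisement interval, failover should start roughly three to four seconds after the master goes silent.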

🚪 What Is VRRP?

VRRP (Virtual Router Redundancy Protocol) allows multiple machines to share a single IP address.

At any given time:

One node is MASTER
The others are BACKUP
The MASTER advertises its state via multicast; the BACKUP nodes listen
Priority determines who becomes master
If the master disappears, the highest-priority backup takes over

The VIP is moved automatically between machines. To the network, nothing changes. To clients, nothing breaks.
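The election rule is simple enough to model in a few lines. This is only a toy illustration of "highest-priority live node wins", not how keepalived is implemented. The node names other than Tiny and the middle priority are made up, and ties (which real VRRP breaks by IP address) are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    priority: int
    alive: bool = True


def elect_master(nodes):
    """Return the live node with the highest priority, or None if all are down."""
    candidates = [n for n in nodes if n.alive]
    return max(candidates, key=lambda n: n.priority, default=None)


# Hypothetical three-node group; only "tiny" is a real name from this post.
cluster = [Node("tiny", 200), Node("node2", 150), Node("node3", 100)]
print(elect_master(cluster).name)  # tiny

cluster[0].alive = False           # the vacuum cleaner strikes
print(elect_master(cluster).name)  # node2
```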

🌐 Network Architecture Update

All port forwardings now target the shared virtual IP, not a specific machine. This removes the dependency on any single node acting as the gateway.

Benefits:

No more node-specific NAT rules
No more hardcoded ingress points
Transparent failover
Stable external endpoint
Services survive node shutdowns

From the outside, the cluster now appears as one resilient system.
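A quick way to check that claim is to probe the VIP while powering nodes off one at a time. The sketch below is one possible check, not something from my actual setup: it assumes the services listen on their well-known default ports, and `probe` is a hypothetical helper name.

```python
import socket

# Assumed default ports for the services listed earlier in this post.
PORTS = {"https": 443, "ssh": 22, "gemini": 1965, "gopher": 70, "finger": 79}


def probe(host, ports=PORTS, timeout=2.0):
    """Return {service: bool}, True when a TCP connection to the port succeeds."""
    results = {}
    for name, port in ports.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[name] = True
        except OSError:  # refused, unreachable, or timed out
            results[name] = False
    return results
```

Calling `probe("<your VIP>")` before and after shutting down the current master should return the same results if failover works.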

📄 Keepalived Configuration Examples

MASTER Node Example

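vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 41
    priority 200
    advert_int 1
    virtual_ipaddress {
        # Example address: use a free host address in your subnet.
        # 192.168.1.255 is the broadcast address of a /24 and cannot serve as a VIP.
        192.168.1.250/24
    }
}

BACKUP Node Example

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 41
    priority 100
    advert_int 1
    virtual_ipaddress {
        # Must be the same VIP as on the master.
        192.168.1.250/24
    }
}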

📘 Notes

virtual_router_id must match on all nodes
Highest priority wins
interface must name the correct network interface on each machine
The VIP (shared IP address) must not be statically assigned to any machine

With 5 nodes, I simply stagger priorities.
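Staggering by hand works, but it is also easy to generate. The sketch below renders one config per node from a template that mirrors the examples above. The step of 25, the helper names, and the VIP `192.168.1.250/24` are placeholders of mine; substitute your own values.

```python
TEMPLATE = """vrrp_instance VI_1 {{
    state {state}
    interface {interface}
    virtual_router_id 41
    priority {priority}
    advert_int 1
    virtual_ipaddress {{
        {vip}
    }}
}}
"""


def staggered_priorities(count, top=200, step=25):
    """Return `count` descending priorities, e.g. 200, 175, 150, ..."""
    if count < 1:
        raise ValueError("need at least one node")
    priorities = [top - i * step for i in range(count)]
    if top > 254 or priorities[-1] < 1:
        raise ValueError("VRRP priorities must stay within 1..254")
    return priorities


def render(priority, is_master, vip, interface="eth0"):
    """Fill in the template for one node."""
    state = "MASTER" if is_master else "BACKUP"
    return TEMPLATE.format(state=state, interface=interface,
                           priority=priority, vip=vip)


# Placeholder VIP; the first node gets the highest priority and starts as MASTER.
for i, prio in enumerate(staggered_priorities(5)):
    print(f"# node {i + 1}")
    print(render(prio, is_master=(i == 0), vip="192.168.1.250/24"))
```

With five nodes this yields priorities 200, 175, 150, 125 and 100, so the takeover order is fully determined.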

🧠 Lessons Learned

High availability is not about having multiple machines. It's about:

eliminating single points of failure
automating failover
designing for accidental shutdowns
assuming humans will unplug things

🪦 Epilogue: The Sacrifice of Tiny

Tiny did not crash. Tiny was chosen. Chosen by fate. Chosen by dust. Chosen by a power button.

And from its untimely shutdown, a highly available cluster was born.

Your sacrifice will be remembered, Tiny.
