🧨 Cluster Outage Postmortem: Missing Keepalived

📆 2026-01-14 10:04

🧹 Incident Trigger

This outage was not caused by a power failure, kernel panic, or misconfiguration.

It was caused by my wife shutting down my main server (Tiny) while vacuuming the living room.

Tiny was not just another node. Tiny happened to be the entry point of the entire cluster: every external port-forwarding rule pointed at Tiny.

This meant that when Tiny went down, NOTHING could be reached from outside my home network, even though other nodes were still up and healthy.

☠️ Root Cause

The cluster had no high-availability IP management: no virtual IP, no VRRP, no automatic failover.

All service ports were implicitly tied to a single machine.

This meant: One shutdown = total service loss.

📡 Detection

The failure was immediately visible: nothing could be reached from outside the home network.

This confirmed the presence of a hard single point of failure at the network ingress layer.

🛠️ Resolution: Keepalived Deployment

Keepalived is a Linux daemon that provides high availability (HA) and load balancing for server clusters. It implements the Virtual Router Redundancy Protocol (VRRP), which lets a Virtual IP (VIP) fail over between servers so that service continues even if the master node fails.

🔖 Keepalived Official Website

I installed keepalived on all 5 machines in the cluster.

Each node now participates in a VRRP group.

New Setup

If the active node goes down, another node takes over the VIP within seconds. No manual intervention required.

🚪 What Is VRRP?

VRRP (Virtual Router Redundancy Protocol) allows multiple machines to share a single IP address.

At any given time, one node is MASTER and holds the VIP; the others are BACKUP and stand by to take over.

The VIP is moved automatically between machines. To the network, nothing changes. To clients, nothing breaks.

🌐 Network Architecture Update

All port-forwarding rules now target the shared virtual IP, not a specific machine. This removes the dependency on any single node acting as the gateway.

Benefits:

From the outside, the cluster now appears as one resilient system.

📄 Keepalived Configuration Examples

MASTER Node Example
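
The original configuration did not survive on this page, so what follows is only a minimal sketch of what a MASTER node's /etc/keepalived/keepalived.conf could look like. The interface name (eth0), the virtual_router_id (51), the VIP (192.168.1.100/24), the instance name (VI_1), and the password are all assumptions for illustration, not the values used in this cluster.

```conf
# Sketch of a MASTER node. All concrete values are placeholders.
vrrp_instance VI_1 {
    state MASTER             # initial state; the election is decided by priority
    interface eth0           # assumed NIC name; check yours with `ip addr`
    virtual_router_id 51     # must be identical on every node in the group
    priority 150             # highest priority in the group holds the VIP
    advert_int 1             # seconds between VRRP advertisements

    authentication {
        auth_type PASS
        auth_pass changeme   # placeholder; must match on all nodes
    }

    virtual_ipaddress {
        192.168.1.100/24     # assumed VIP; this is what port forwarding targets
    }
}
```

The important part is that the router's port-forwarding rules point at the address under virtual_ipaddress rather than at the node's own IP.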

BACKUP Node Example
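
As with the MASTER example, the original block is missing here, so this is a hedged sketch rather than the actual configuration. It reuses the same assumed placeholders (eth0, virtual_router_id 51, 192.168.1.100/24, VI_1); only state and priority differ from the MASTER.

```conf
# Sketch of a BACKUP node. All concrete values are placeholders.
vrrp_instance VI_1 {
    state BACKUP             # starts passive and listens for advertisements
    interface eth0           # assumed NIC name
    virtual_router_id 51     # same group ID as the MASTER
    priority 100             # lower than the MASTER's priority
    advert_int 1             # must match the MASTER's interval

    authentication {
        auth_type PASS
        auth_pass changeme   # placeholder; same value as on the MASTER
    }

    virtual_ipaddress {
        192.168.1.100/24     # same assumed VIP as on the MASTER
    }
}
```

Under VRRPv2 timing, a backup declares the master dead after 3 × advert_int plus a skew of (256 − priority) / 256 seconds. With the values assumed above, that is 3 + 156/256, or roughly 3.6 seconds, which is consistent with a takeover "within seconds".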

📘 Notes

With 5 nodes, I simply stagger priorities.
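
To illustrate what staggering could look like, here is a sketch for one node in the middle of the order. The node names other than Tiny, the priority values, and the other settings are hypothetical; the only real requirement is that each node gets a distinct priority, since the highest-priority node that is still alive takes the VIP.

```conf
# Hypothetical priority ladder for five nodes (names and values assumed):
#   Tiny  -> state MASTER, priority 150
#   node2 -> state BACKUP, priority 140
#   node3 -> state BACKUP, priority 130   (this file)
#   node4 -> state BACKUP, priority 120
#   node5 -> state BACKUP, priority 110
vrrp_instance VI_1 {
    state BACKUP
    interface eth0           # assumed NIC name
    virtual_router_id 51     # identical on all five nodes
    priority 130             # third in line for the VIP
    advert_int 1

    virtual_ipaddress {
        192.168.1.100/24     # assumed VIP, identical on all five nodes
    }
}
```

By default Keepalived preempts, so when a higher-priority node comes back online it reclaims the VIP from the node that took over.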

🧠 Lessons Learned

High availability is not about having multiple machines. It's about making sure that no single one of them is a point of failure.

🪦 Epilogue: The Sacrifice of Tiny

Tiny did not crash. Tiny was chosen. Chosen by fate. Chosen by dust. Chosen by a power button.

And from its untimely shutdown, a highly available cluster was born.

Your sacrifice will be remembered, Tiny.
