High-availability GNU/Linux with Heartbeat

Introduction

Is the mean time between failures of your GNU/Linux box unacceptable? Do you need to provide continuous service, but recognise the need for kernel upgrades? Heartbeat may be what you need.

The Heartbeat software allows two computers to act as one. One computer does all the work whilst the other waits for it to fail. When a failure is detected, the other computer takes over. This is called Active-Passive, because at any one time one computer is active whilst the other isn't.

Heartbeat is actually more flexible than this and also supports scenarios in which more than one computer is active at once (active-active), but this example does not need that.

Scenario

For this project I need to connect two networks over a WAN link. The WAN link is a 1000 Mbit/s LES across Leeds provided by British Telecom. The connection is to be made with a GNU/Linux router at each end, providing firewalling and QoS (to guarantee VoIP bandwidth). As all our voice lines and a lot of important data will be travelling over this link, it needs to be reliable.

Even with redundant power supplies and hard disks, we still need downtime for kernel upgrades, and if somebody makes a mistake with a config it could take time to fix. Step in Heartbeat (after writing various procedures and policies to minimise mistakes ;).

Now we'll have two machines at each end of the WAN, each pair in an active-passive partnership. If one machine fails or somebody seriously breaks something, the other machine takes over the IP address and continues routing and firewalling with minimal interruption to service. If we need to upgrade a kernel, we can manually switch over the active server and reboot the passive machine.

see larger diagram

So, at each end we have 2 machines acting as one. I've chosen names for the machines to reflect this.

The machines julius and caesar provide the virtual machine juliuscaesar. At the other end, the machines marcus and aurelius provide the virtual machine marcusaurelius.

In addition to the virtual IP addresses that can float between machines, each machine needs an IP address of its own. Without these static IP addresses, we wouldn't be able to connect to the passive machine to administer it. SSH also complains when an IP address moves to a different machine, because the host key no longer matches.

For example, on the LAN side juliuscaesar has the (virtual) IP address of 10.0.0.254, but in addition, julius has the static address 10.0.0.252 and caesar has the static address 10.0.0.253. Whenever we need to add firewall rules or upgrade kernels or whatever, we log in to these static IP addresses. See the larger diagram for all the other IP addresses.

Heartbeat is configured to communicate over an RS232 serial cable between the machines. A message is sent back and forth over this link every few seconds. If either box stops sending these messages, the other assumes it has died. Manual switch-over commands are also received over this link. These messages can be sent over the Ethernet links if necessary, but I've chosen serial.

Generic configuration

The config file /etc/ha.d/ha.cf on julius and caesar looks like this:

serial  /dev/ttyS0
auto_failback off
node julius
node caesar
keepalive 5
deadtime 15
warntime 10
initdead 30

We specify which serial device to use and not to fail back when the failed machine comes back alive. We then specify the hostnames of all machines in the group. These names must match the output of the hostname command on each box.

We then specify that "keepalive" heartbeat messages should be sent every 5 seconds, that a host is declared dead when it hasn't been heard from for 15 seconds, and that a warning is logged if a host takes 10 seconds or more to answer (this might help flag potential problems). We then specify that on first boot, heartbeat sits quietly for at least 30 seconds before deciding who is live and who is dead. This is important as some networks can take a little while before hosts appear properly.
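
To make the relationship between these timers concrete, here is a minimal Python sketch. It is not Heartbeat's own code: the function name peer_status and the three status strings are made up for illustration, and the model only looks at how long a peer has been silent.

```python
# Timer values copied from the ha.cf above (all in seconds).
WARNTIME = 10   # silence after which a late-heartbeat warning is logged
DEADTIME = 15   # silence after which the peer is declared dead


def peer_status(seconds_since_last_heartbeat):
    """Classify a peer by how long it has been silent (illustrative only)."""
    if seconds_since_last_heartbeat >= DEADTIME:
        return "dead"
    if seconds_since_last_heartbeat >= WARNTIME:
        return "late"
    return "ok"


# Try a few silence durations against the thresholds.
for silence in (4, 12, 20):
    print(silence, peer_status(silence))
```

With keepalive at 5 and deadtime at 15, a peer is declared dead after roughly three consecutive missed heartbeats, and the warning appears after about two.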

This config looks the same on marcus and aurelius except for the node declarations.
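
As a rough illustration of how little differs between the two pairs, the following Python sketch renders ha.cf for any pair from the shared settings. The helper name render_ha_cf is hypothetical and not part of Heartbeat; the directives and values are the ones shown above, and it assumes (as stated) that only the node lines change.

```python
# Settings shared by both pairs, copied from the ha.cf shown earlier.
SHARED_LINES = [
    "serial  /dev/ttyS0",
    "auto_failback off",
]
TIMER_LINES = [
    "keepalive 5",
    "deadtime 15",
    "warntime 10",
    "initdead 30",
]


def render_ha_cf(nodes):
    """Return ha.cf text for one pair; only the node lines vary."""
    node_lines = ["node " + name for name in nodes]
    return "\n".join(SHARED_LINES + node_lines + TIMER_LINES) + "\n"


# The config for the pair at the other end of the WAN.
print(render_ha_cf(["marcus", "aurelius"]))
```

Remember that each node name still has to match the output of the hostname command on the corresponding box.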

Resource configuration

The floating virtual IP addresses are known as heartbeat resources. Heartbeat resources are defined in /etc/ha.d/haresources. This config file must be identical on all hosts in each group. The resource config for julius and caesar looks like this:

julius IPaddr::10.0.0.254 IPaddr::1.1.1.254 MailTo::alerts@myworkaddress.com::JuliusCaesar

This states that the IP addresses 10.0.0.254 and 1.1.1.254 are heartbeat resources that exist only on the active machine, and that the preferred active machine is julius. This means that if both machines are booted at the same time, julius should end up with the resources to begin with (assuming it hasn't failed). If we had set auto_failback on, julius would always hold the resources as long as it hadn't failed.
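
The interplay between the preferred node and auto_failback can be sketched as a small decision function. This is a simplified model for illustration, not how Heartbeat is implemented; the function resource_owner and its parameters are invented here.

```python
def resource_owner(current, preferred, alive, auto_failback=False):
    """Return the node that should hold the resources next.

    current       -- node holding the resources now, or None at first boot
    preferred     -- node named first in haresources (julius here)
    alive         -- set of nodes currently considered alive
    auto_failback -- mirrors the ha.cf setting of the same name
    """
    if not alive:
        return None  # nobody is up to hold the resources
    if current not in alive:
        # First boot, or the holder has failed: favour the preferred node.
        return preferred if preferred in alive else sorted(alive)[0]
    if auto_failback and preferred in alive:
        return preferred  # move the resources back to the preferred node
    return current  # otherwise leave the resources where they are


both = {"julius", "caesar"}
print(resource_owner(None, "julius", both))            # both boot together
print(resource_owner("julius", "julius", {"caesar"}))  # julius fails
print(resource_owner("caesar", "julius", both))        # julius returns, failback off
print(resource_owner("caesar", "julius", both, auto_failback=True))
```

With auto_failback off, as in our config, the resources stay on caesar after julius recovers, which avoids a second interruption until we choose to switch back by hand.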

Installing Heartbeat