© Copyright John Leach <john@johnleach.co.uk>
Last updated: 2005-06-18 15:39:55 +0100 (Sat, 18 Jun 2005)
Is the mean time between failures of your GNU/Linux box unacceptable? Do you need to provide continuous service, but recognise the need for kernel upgrades? Heartbeat may be what you need.
The Heartbeat software allows two computers to act as one. One computer does all the work whilst the other waits for it to fail. When a failure is detected, the other computer takes over. This is called Active-Passive, because at any one time one computer is active whilst the other isn't.
Actually, Heartbeat is more flexible than this and allows various other scenarios, including active-active setups where more than one computer is active, but that was unnecessary for this example.
For this project I need to connect two networks together over a WAN link. The WAN link is a 1000MBit LES across Leeds provided by British Telecom. This is to be achieved with a GNU/Linux router at each end, providing firewalling and QoS (to guarantee VoIP bandwidth). As all our voice lines and a lot of important data will be travelling over this link, it needs to be reliable.
Even with redundant power supplies and hard disks, we still need downtime for kernel upgrades, and if somebody makes a mistake with a config it could take time to fix. Step in Heartbeat (after writing various procedures and policies to minimize mistakes ;).
Now we'll have 2 machines at each end of the WAN, each pair in an active-passive partnership. If one machine fails or somebody seriously breaks something, the other machine takes over the IP address and continues routing and firewalling with minimal interruption to service. If we need to upgrade a kernel, we can manually switch over the active server and reboot the passive machine.
So, at each end we have 2 machines acting as one. I've chosen names for the machines to reflect this.
The machines julius and caesar provide the virtual machine juliuscaesar. At the other end, the machines marcus and aurelius provide the virtual machine marcusaurelius.
In addition to the virtual IP addresses that can float between machines, each machine needs an IP address of its own. Without these static IP addresses, we wouldn't be able to connect to the passive machine to administer it. Plus SSH gets upset when IP addresses change machine.
For example, on the LAN side juliuscaesar has the (virtual) IP address of 10.0.0.254, but in addition, julius has the static address 10.0.0.252 and caesar has the static address 10.0.0.253.
Whenever we need to add firewall rules or upgrade kernels or whatever, we log
in to these static IP addresses. See the larger diagram for all the other IP
addresses.
Heartbeat is configured to communicate over an RS232 serial cable between the machines. A message is sent back and forth over this link every few seconds. If either box stops sending these messages, the other knows it has died. Manual switch-over commands are also received over this link. These messages can be sent over the Ethernet links instead if necessary, but I've chosen serial.
The config file /etc/ha.d/ha.cf on julius and caesar looks like this:
serial /dev/ttyS0
auto_failback off
node julius
node caesar
keepalive 5
deadtime 15
warntime 10
initdead 30
We specify which serial device to use and tell Heartbeat not to fail back when the failed machine comes back alive. We then specify the hostnames of all machines in the group. These names must match the output of the hostname command on each box (strictly, what uname -n reports, which is normally the same thing).
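A mismatch between the node lines and the machine's own name is easy to miss, so a quick check can help. The following is only a rough sketch in Python, not part of Heartbeat: the function names parse_nodes and local_node_listed are my own, and it assumes that a node line may list one or more names and that # starts a comment.

```python
import os

# Example ha.cf contents, taken from the config shown above.
EXAMPLE_HA_CF = """\
serial /dev/ttyS0
auto_failback off
node julius
node caesar
keepalive 5
"""


def parse_nodes(ha_cf_text):
    """Return the names given in 'node' directives, in file order."""
    nodes = []
    for raw in ha_cf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line:
            continue
        fields = line.split()
        if fields[0] == "node":
            nodes.extend(fields[1:])  # tolerate several names on one line
    return nodes


def local_node_listed(ha_cf_text, local_name=None):
    """True if this machine's node name appears in the node directives.

    local_name defaults to os.uname().nodename, which is the value
    that 'uname -n' prints.
    """
    if local_name is None:
        local_name = os.uname().nodename
    return local_name in parse_nodes(ha_cf_text)


if __name__ == "__main__":
    print(parse_nodes(EXAMPLE_HA_CF))
    print(local_node_listed(EXAMPLE_HA_CF, "julius"))
```

To check a real box, the text of /etc/ha.d/ha.cf would be read from disk and passed in without the explicit local_name argument.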
We then specify that "keepalive" heartbeat messages should be sent every 5 seconds, that a host is declared dead when it isn't heard from for 15 seconds, and that a warning is logged if a host takes 10 seconds or more to answer (this might help flag potential problems). We then specify that on first boot, heartbeat sits quietly for at least 30 seconds before deciding who is live and who is dead. This is important as some networks can take a little while before hosts appear properly.
This config looks the same on marcus and aurelius except for the node declarations.
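To make the relationship between these four timers concrete, here is a small Python model of how a peer might be classified. It is an illustration of the settings described above and not Heartbeat's actual code: the function peer_status and its status strings are invented for this sketch, and treating exactly 10 or 15 seconds as already "late" or "dead" is my assumption.

```python
# Timer values from the ha.cf shown above, in seconds.
KEEPALIVE = 5
WARNTIME = 10
DEADTIME = 15
INITDEAD = 30


def peer_status(silence, uptime, warntime=WARNTIME,
                deadtime=DEADTIME, initdead=INITDEAD):
    """Classify the peer.

    silence: seconds since the peer's last heartbeat was received.
    uptime:  seconds since heartbeat started on this machine.
    """
    if uptime < initdead:
        return "undecided"  # still inside the initdead grace period
    if silence >= deadtime:
        return "dead"       # the other machine should take over
    if silence >= warntime:
        return "late"       # a warning gets logged, nothing else happens
    return "alive"


if __name__ == "__main__":
    # With these values a peer is declared dead after missing
    # DEADTIME // KEEPALIVE consecutive heartbeats.
    print(DEADTIME // KEEPALIVE)
    for silence in (4, 10, 15):
        print(silence, peer_status(silence, uptime=60))
```

The ordering matters: warntime should sit between keepalive and deadtime, otherwise the warning never fires before the peer is declared dead.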
The floating virtual IP addresses are known as heartbeat resources. Heartbeat resources are defined in /etc/ha.d/haresources. This config file must be identical on all hosts in each group. The resource config for julius and caesar looks like this:
julius IPaddr::10.0.0.254 IPaddr::1.1.1.254 MailTo::alerts@myworkaddress.com::JuliusCaesar
This states that the IP addresses 10.0.0.254 and 1.1.1.254 are heartbeat resources that exist only on the active machine, and that the preferred active machine is julius. This means if both machines are booted at the same time, julius should end up with the resources to begin with (assuming it hasn't failed). If we had set auto_failback on, julius would always have the resources as long as it hadn't failed.
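The structure of that line, and the effect of auto_failback, can be sketched in a few lines of Python. This is a simplified model based on the description above rather than anything taken from Heartbeat itself: parse_haresources_line and choose_holder are hypothetical names, and the model only covers a two-machine group like this one.

```python
LINE = ("julius IPaddr::10.0.0.254 IPaddr::1.1.1.254 "
        "MailTo::alerts@myworkaddress.com::JuliusCaesar")


def parse_haresources_line(line):
    """Split a line into (preferred_node, [(resource, [args...]), ...])."""
    fields = line.split()
    preferred, specs = fields[0], fields[1:]
    resources = []
    for spec in specs:
        name, *args = spec.split("::")  # arguments are separated by '::'
        resources.append((name, args))
    return preferred, resources


def choose_holder(preferred, alive, current=None, auto_failback=False):
    """Pick which node should hold the resources.

    alive:   set of node names that are currently up.
    current: node holding the resources now, or None just after boot.
    """
    if not alive:
        return None
    # The preferred node wins at boot, after the current holder dies,
    # or whenever auto_failback is on.
    if preferred in alive and (
            auto_failback or current is None or current not in alive):
        return preferred
    if current in alive:
        return current          # auto_failback off: stay where we are
    return sorted(alive)[0]     # holder died: a surviving node takes over


if __name__ == "__main__":
    preferred, resources = parse_haresources_line(LINE)
    print(preferred, resources)
    both = {"julius", "caesar"}
    print(choose_holder(preferred, both))                    # at boot
    print(choose_holder(preferred, {"caesar"}, "julius"))    # julius fails
    print(choose_holder(preferred, both, "caesar"))          # julius returns
    print(choose_holder(preferred, both, "caesar", auto_failback=True))
```

With auto_failback off, as in my config, the third call leaves the resources on caesar when julius comes back, which avoids a second interruption just to restore the original arrangement.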
I'm using the Debian GNU/Linux distribution, which comes with install packages for heartbeat (apt-get install heartbeat). Packages for RedHat/Fedora can be obtained from Ultramonkey.