Introduction
Building a highly available cluster with Pacemaker and Corosync ensures that critical services remain online even when individual nodes fail. This article dives deep into the architecture, configuration, best practices, and advanced topics to help you design and operate a robust HA environment.
1. Architectural Overview
1.1 Components
- Corosync: Cluster messaging and membership management.
- Pacemaker: Resource manager and fencing coordination.
- STONITH: Shoot-The-Other-Node-In-The-Head for safe node isolation.
- Resources: Services (IP, Filesystem, Databases) managed by Pacemaker.
1.2 Cluster Communication
Corosync provides reliable cluster messaging and membership via the Totem protocol, running over unicast (udpu) or multicast rings. It detects membership changes and network partitions and propagates heartbeat messages; Pacemaker consumes this information and orchestrates resource failover.
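For illustration, a minimal `corosync.conf` sketch for a two-node cluster follows; the cluster name, node names, and addresses are placeholders, and exact directives and defaults vary by Corosync version (Corosync 3 defaults to the `knet` transport):

```
# /etc/corosync/corosync.conf (minimal sketch; adjust for your environment)
totem {
    version: 2
    cluster_name: mycluster
    transport: udpu          # unicast UDP; multicast and knet are also supported
}

nodelist {
    node {
        ring0_addr: 10.0.0.1   # heartbeat address on the private network
        name: node1
        nodeid: 1
    }
    node {
        ring0_addr: 10.0.0.2
        name: node2
        nodeid: 2
    }
}
```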
2. Prerequisites and Environment
2.1 Hardware and Network
- Minimum two nodes (three recommended for quorum).
- Private network interface for cluster heartbeat.
- Public interface for client traffic.
2.2 Software Requirements
- Linux distribution (RHEL, CentOS, Ubuntu LTS).
- Pacemaker and Corosync packages from official repos.
- SSH key-based trust between nodes (a minimal setup sketch follows this list).
- Optional VPN for secure cross-site cluster communication: OpenVPN, WireGuard, IPsec.
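For the SSH trust requirement, a minimal sketch; running as root and the peer name `node2` are assumptions:

```
ssh-keygen -t ed25519 -N '' -f ~/.ssh/id_ed25519   # generate a passphrase-less key
ssh-copy-id root@node2                             # copy the public key to the peer node
```

Repeat from each node to every other node so administrative commands work in both directions.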
3. Installing Pacemaker and Corosync
- Update the package index:

```
yum update          # RHEL/CentOS
apt-get update      # Ubuntu/Debian
```

- Install the components:

```
yum install pacemaker corosync pcs fence-agents-all -y         # RHEL/CentOS
apt-get install pacemaker corosync pcs fence-agents-all -y     # Ubuntu/Debian
```

- Enable and start the services:

```
systemctl enable corosync pacemaker pcsd
systemctl start corosync pacemaker pcsd
```

- Set a password for the `hacluster` user:

```
passwd hacluster
```
4. Basic Cluster Configuration
4.1 Authentication and Setup
```
pcs cluster auth node1 node2 [node3] -u hacluster -p password
pcs cluster setup --name mycluster node1 node2 [node3]
```

Note: on pcs 0.10 and later, authentication is done with `pcs host auth`, and `pcs cluster setup` takes the cluster name as a positional argument instead of `--name`.
4.2 Starting the Cluster
```
pcs cluster start --all
```

Verify with:

```
pcs status --full
```
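Optionally, enable the cluster stack to start at boot and check Corosync ring health:

```
pcs cluster enable --all   # start cluster services automatically at boot
corosync-cfgtool -s        # show Corosync ring status on the local node
```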
5. Resource Configuration
Define your services as resources. Common examples:
| Resource Type | Example CLI |
|---|---|
| Virtual IP | pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.168.0.100 cidr_netmask=24 op monitor interval=30s |
| Filesystem | pcs resource create sharedfs ocf:heartbeat:Filesystem device=/dev/sdb1 directory=/mnt/shared fstype=ext4 op monitor interval=20s |
| MySQL | pcs resource create mysql ocf:heartbeat:mysql binary=/usr/bin/mysqld config=/etc/my.cnf op monitor interval=30s |
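After creating a resource, confirm that it runs and can relocate; a short test sequence using the `vip` resource from the table (the target node `node2` is illustrative):

```
pcs resource                  # list resources and where they run
pcs resource move vip node2   # push vip to node2 to test relocation
pcs resource clear vip        # remove the temporary move constraint afterwards
```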
5.1 Constraints and Ordering
Control start/stop order and colocation:
```
pcs constraint order start sharedfs then mysql
pcs constraint colocation add mysql with vip INFINITY
```
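Alternatively, a resource group both colocates and orders its members in one step; a sketch using the resources defined above (the group name `mygroup` is arbitrary):

```
pcs resource group add mygroup vip sharedfs mysql   # members start in listed order
```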
6. Fencing and STONITH
Implement reliable fencing to avoid split-brain scenarios:
- Choose a fencing agent (iLO, IPMI, DRAC).
- Configure it with `pcs stonith create`:

```
pcs stonith create fence-node1 fence_ipmilan pcmk_host_list=node1 ipaddr=192.168.1.100 login=ADMIN passwd=secret op monitor interval=60s
```
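Every node needs a fence device capable of isolating it; a matching device for node2 follows the same pattern (the BMC address shown is an assumption):

```
pcs stonith create fence-node2 fence_ipmilan pcmk_host_list=node2 ipaddr=192.168.1.101 login=ADMIN passwd=secret op monitor interval=60s
```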
7. Testing and Failure Scenarios
Simulate node failure:
- Stop Corosync on a node: `systemctl stop corosync`.
- Observe resource failover with `pcs status`.
- Rejoin the node: `pcs cluster start` (run on the node itself, or pass the node name from another member).
Simulate fencing by powering a node off via the fence agent, then verify that it reboots safely and reintegrates into the cluster.
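pcs can trigger fencing directly, which exercises the full STONITH path end to end:

```
pcs stonith fence node1   # fence node1 via its configured agent
pcs status                # confirm failover and watch node1 reintegrate
```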
8. Monitoring and Maintenance
- Use `crm_mon -A` for live status.
- Integrate with Nagios, Zabbix, or Prometheus.
- Regularly run fencing and failover drills.
- Keep CIB backups: save with `pcs cluster cib > backup.xml`, restore with `pcs cluster cib-push backup.xml`.
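The status and backup commands together, as a sketch (the backup path is illustrative):

```
crm_mon -A -1                               # one-shot status including node attributes
pcs cluster cib > /root/cib-backup.xml      # snapshot the current CIB
pcs cluster cib-push /root/cib-backup.xml   # restore the snapshot if needed
```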
9. Advanced Topics
9.1 Multi-Site Cluster
Extend the cluster across datacenters with stretched Corosync rings or VPN tunnels, using OpenVPN or WireGuard to secure the traffic.
9.2 Quorum and Ticketing
Use `pcs quorum device` to configure a quorum device (QDevice with a qnetd arbitrator, ideally on a third site). For multi-site clusters, Booth with Pacemaker ticket constraints (which support `loss-policy=fence`) provides advanced split-brain avoidance.
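For example, attaching a QDevice arbitrator might look like the following; the hostname `qnetd-host` and the `ffsplit` algorithm are assumptions, and the corosync-qdevice package must be installed on the cluster nodes:

```
pcs quorum device add model net host=qnetd-host algorithm=ffsplit
pcs quorum status   # verify quorum votes and qdevice membership
```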
10. Security Considerations
- Encrypt Corosync traffic (`crypto_cipher` and `crypto_hash` settings in `corosync.conf`).
- Restrict SSH and management interfaces.
- Use VPNs (IPsec, WireGuard) for cross-site links.
- Regularly patch cluster nodes and agents.
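A minimal sketch of enabling Corosync encryption in the `totem` section (the cipher and hash shown are common choices, not requirements; restart Corosync on all nodes afterwards):

```
totem {
    version: 2
    cluster_name: mycluster
    crypto_cipher: aes256
    crypto_hash: sha256
}
```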
Conclusion
Implementing a high-availability cluster with Pacemaker and Corosync requires careful planning, reliable fencing, and continuous testing. Following best practices—secure communication, proper resource constraints, and scheduled failover drills—ensures your services remain resilient in the face of hardware or software failures. Armed with this guide, you can architect and maintain a production-grade HA cluster ready to handle real-world challenges.