Introduction
Building a highly available cluster with Pacemaker and Corosync ensures that critical services remain online even when individual nodes fail. This article dives deep into the architecture, configuration, best practices, and advanced topics to help you design and operate a robust HA environment.
1. Architectural Overview
1.1 Components
- Corosync: Cluster messaging and membership management.
- Pacemaker: Resource manager and fencing coordination.
- STONITH: Shoot-The-Other-Node-In-The-Head for safe node isolation.
- Resources: Services (IP, Filesystem, Databases) managed by Pacemaker.
1.2 Cluster Communication
Corosync provides reliable cluster messaging and membership via the Totem protocol, running over unicast (udpu) or multicast rings. It detects membership changes and network partitions and propagates heartbeat messages; Pacemaker consumes this information and orchestrates resource failover.
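For illustration, a minimal `corosync.conf` sketch for a two-node cluster follows; the cluster name, node names, and addresses are placeholders, and exact directives and defaults vary by Corosync version (Corosync 3 defaults to the `knet` transport):

```
# /etc/corosync/corosync.conf (minimal sketch; adjust for your environment)
totem {
    version: 2
    cluster_name: mycluster
    transport: udpu          # unicast UDP; multicast and knet are also supported
}

nodelist {
    node {
        ring0_addr: 10.0.0.1   # heartbeat address on the private network
        name: node1
        nodeid: 1
    }
    node {
        ring0_addr: 10.0.0.2
        name: node2
        nodeid: 2
    }
}
```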
2. Prerequisites and Environment
2.1 Hardware and Network
- Minimum two nodes (three recommended for quorum).
- Private network interface for cluster heartbeat.
- Public interface for client traffic.
2.2 Software Requirements
- Linux distribution (RHEL, CentOS, Ubuntu LTS).
- Pacemaker and Corosync packages from official repos.
- SSH key-based trust between nodes (a minimal setup sketch follows this list).
- Optional VPN for secure cross-site cluster communication: OpenVPN, WireGuard, IPsec.
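For the SSH trust requirement, a minimal sketch; running as root and the peer name `node2` are assumptions:

```
ssh-keygen -t ed25519 -N '' -f ~/.ssh/id_ed25519   # generate a passphrase-less key
ssh-copy-id root@node2                             # copy the public key to the peer node
```

Repeat from each node to every other node so administrative commands work in both directions.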
3. Installing Pacemaker and Corosync
- Update the package index:

```
yum update          # RHEL/CentOS
apt-get update      # Ubuntu/Debian
```

- Install the components:

```
yum install pacemaker corosync pcs fence-agents-all -y         # RHEL/CentOS
apt-get install pacemaker corosync pcs fence-agents-all -y     # Ubuntu/Debian
```

- Enable and start the services:

```
systemctl enable corosync pacemaker pcsd
systemctl start corosync pacemaker pcsd
```

- Set a password for the `hacluster` user:

```
passwd hacluster
```
4. Basic Cluster Configuration
4.1 Authentication and Setup
```
pcs cluster auth node1 node2 [node3] -u hacluster -p password
pcs cluster setup --name mycluster node1 node2 [node3]
```

Note: on pcs 0.10 and later, authentication is done with `pcs host auth`, and `pcs cluster setup` takes the cluster name as a positional argument instead of `--name`.
4.2 Starting the Cluster
```
pcs cluster start --all
```

Verify with:

```
pcs status --full
```
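Optionally, enable the cluster stack to start at boot and check Corosync ring health:

```
pcs cluster enable --all   # start cluster services automatically at boot
corosync-cfgtool -s        # show Corosync ring status on the local node
```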
5. Resource Configuration
Define your services as resources. Common examples:
| Resource Type | Example CLI |
|---|---|
| Virtual IP | pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.168.0.100 cidr_netmask=24 op monitor interval=30s |
| Filesystem | pcs resource create sharedfs ocf:heartbeat:Filesystem device=/dev/sdb1 directory=/mnt/shared fstype=ext4 op monitor interval=20s |
| MySQL | pcs resource create mysql ocf:heartbeat:mysql binary=/usr/bin/mysqld config=/etc/my.cnf op monitor interval=30s |
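After creating a resource, confirm that it runs and can relocate; a short test sequence using the `vip` resource from the table (the target node `node2` is illustrative):

```
pcs resource                  # list resources and where they run
pcs resource move vip node2   # push vip to node2 to test relocation
pcs resource clear vip        # remove the temporary move constraint afterwards
```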
5.1 Constraints and Ordering
Control start/stop order and colocation:
```
pcs constraint order start sharedfs then mysql
pcs constraint colocation add mysql with vip INFINITY
```
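Alternatively, a resource group both colocates and orders its members in one step; a sketch using the resources defined above (the group name `mygroup` is arbitrary):

```
pcs resource group add mygroup vip sharedfs mysql   # members start in listed order
```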
6. Fencing and STONITH
Implement reliable fencing to avoid split-brain scenarios:
- Choose a fencing agent (iLO, IPMI, DRAC).
- Configure it with `pcs stonith create`:

```
pcs stonith create fence-node1 fence_ipmilan pcmk_host_list=node1 ipaddr=192.168.1.100 login=ADMIN passwd=secret op monitor interval=60s
```
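Every node needs a fence device capable of isolating it; a matching device for node2 follows the same pattern (the BMC address shown is an assumption):

```
pcs stonith create fence-node2 fence_ipmilan pcmk_host_list=node2 ipaddr=192.168.1.101 login=ADMIN passwd=secret op monitor interval=60s
```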
7. Testing and Failure Scenarios
Simulate node failure:
- Stop Corosync on a node: `systemctl stop corosync`.
- Observe resource failover with `pcs status`.
- Rejoin the node: `pcs cluster start` (run on the node itself, or pass the node name from another member).
Simulate fencing by powering a node off via the fence agent, then verify that it reboots safely and reintegrates into the cluster.
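pcs can trigger fencing directly, which exercises the full STONITH path end to end:

```
pcs stonith fence node1   # fence node1 via its configured agent
pcs status                # confirm failover and watch node1 reintegrate
```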
8. Monitoring and Maintenance
- Use `crm_mon -A` for live status.
- Integrate with Nagios, Zabbix, or Prometheus.
- Regularly run fencing and failover drills.
- Keep CIB backups: save with `pcs cluster cib > backup.xml`, restore with `pcs cluster cib-push backup.xml`.
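The status and backup commands together, as a sketch (the backup path is illustrative):

```
crm_mon -A -1                               # one-shot status including node attributes
pcs cluster cib > /root/cib-backup.xml      # snapshot the current CIB
pcs cluster cib-push /root/cib-backup.xml   # restore the snapshot if needed
```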
9. Advanced Topics
9.1 Multi-Site Cluster
Extend the cluster across datacenters with stretched Corosync rings or VPN tunnels, using OpenVPN or WireGuard to secure the traffic.
9.2 Quorum and Ticketing
Use `pcs quorum device` to configure a quorum device (QDevice with a qnetd arbitrator, ideally on a third site). For multi-site clusters, Booth with Pacemaker ticket constraints (which support `loss-policy=fence`) provides advanced split-brain avoidance.
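For example, attaching a QDevice arbitrator might look like the following; the hostname `qnetd-host` and the `ffsplit` algorithm are assumptions, and the corosync-qdevice package must be installed on the cluster nodes:

```
pcs quorum device add model net host=qnetd-host algorithm=ffsplit
pcs quorum status   # verify quorum votes and qdevice membership
```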
10. Security Considerations
- Encrypt Corosync traffic (`crypto_cipher` and `crypto_hash` settings in `corosync.conf`).
- Restrict SSH and management interfaces.
- Use VPNs (IPsec, WireGuard) for cross-site links.
- Regularly patch cluster nodes and agents.
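A minimal sketch of enabling Corosync encryption in the `totem` section (the cipher and hash shown are common choices, not requirements; restart Corosync on all nodes afterwards):

```
totem {
    version: 2
    cluster_name: mycluster
    crypto_cipher: aes256
    crypto_hash: sha256
}
```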
Conclusion
Implementing a high-availability cluster with Pacemaker and Corosync requires careful planning, reliable fencing, and continuous testing. Following best practices—secure communication, proper resource constraints, and scheduled failover drills—ensures your services remain resilient in the face of hardware or software failures. Armed with this guide, you can architect and maintain a production-grade HA cluster ready to handle real-world challenges.