Change or Add a Corosync Ring on a Proxmox Cluster
A Proxmox cluster relies on Corosync for its quorum. When the network carrying the ring goes down, typically because an OVH vRack link disappears after a hardware fault, the cluster wobbles. Here is the procedure we ran in production to add a second ring, move ZFS replication onto the public IP and restore service in degraded mode while waiting for the hardware to be repaired.
The setup: 2 OVH nodes, ring over the vRack
The starting architecture is a classic OVH one: two Proxmox servers interconnected by a vRack (OVH's private L2 network between servers). Corosync uses that vRack as its single ring (link 0), and ZFS replication between the two nodes runs over those same private IPs. The public interface carries no ring: it serves the VMs' traffic, not the cluster.
[Figure: before / after the vRack failure. Panels: nominal; vRack card failed.]
Goal: add the public interface as a ring and bring ZFS replication onto it, to get the service back without waiting for the vRack card repair.
Two files, two readers: the basics
This is what wastes the most time. Corosync on Proxmox handles two distinct files, read by different components:
| File | Read by | Role |
|---|---|---|
| /etc/corosync/corosync.conf | corosync daemon | Quorum, knet transport, rings |
| /etc/pve/corosync.conf | pmxcfs master + PVE tools | Source of truth, propagated to the cluster; remote_node_ip, replication, migration |
Golden rule: as long as the cluster is quorate, you edit only /etc/pve/corosync.conf (the master) and pmxcfs automatically propagates it to each node's local file. Without quorum, /etc/pve is read-only: you must edit the local file and copy it by hand (see the "degraded mode" step).
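The golden rule can be turned into a small check. The sketch below is only an illustration: pick_corosync_conf is a hypothetical helper name, and it assumes that the output of pvecm status contains a line starting with "Quorate:" followed by Yes or No.

```shell
# pick_corosync_conf: read `pvecm status` output on stdin and print the file
# that should be edited. Hypothetical helper, not a Proxmox tool.
pick_corosync_conf() {
    if grep -Eq '^Quorate:[[:space:]]+Yes'; then
        echo /etc/pve/corosync.conf        # quorate: edit the master copy
    else
        echo /etc/corosync/corosync.conf   # no quorum: edit the local copy
    fi
}
```

On a node, it would be used as: pvecm status | pick_corosync_conf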
The gotchas that cost you 2 hours
- Replication / migration follows the primary IP = ring0_addr, not the corosync runtime state nor /etc/hosts.
- The live reload only applies if config_version increases. When in doubt: restart.
- After changing ring0_addr, pmxcfs keeps the old nodelist cached: you must restart pve-cluster to refresh /etc/pve/.members.
- The migration network in datacenter.cfg only works if target IPs fit in a single CIDR. Public IPs in disjoint /24s are unusable there, hence putting the desired network in ring0_addr.
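To make the last gotcha concrete, here is a minimal sketch that tests whether two IPv4 addresses share one network. The function names ip_to_int and in_same_net are hypothetical, and the addresses in the usage line are documentation placeholders, not our nodes.

```shell
# ip_to_int A.B.C.D: print the address as a single integer.
ip_to_int() {
    old_ifs=$IFS; IFS=.
    set -- $1                 # split the dotted quad into four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# in_same_net IP1 IP2 PREFIX: succeed if both addresses fall in one /PREFIX.
in_same_net() {
    shift_bits=$(( 32 - $3 ))
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    [ $(( a >> shift_bits )) -eq $(( b >> shift_bits )) ]
}
```

For example, in_same_net 192.0.2.10 198.51.100.20 24 fails, which is the situation where no single migration network can cover both nodes.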
Step 1: Edit the configuration
Edit /etc/pve/corosync.conf if you are quorate, otherwise /etc/corosync/corosync.conf locally (see degraded mode below). Principles:
- ring0_addr = primary link (the one replication will follow). ring1_addr = backup link.
- Declare one interface block per linknumber used (0 and 1).
- Increment config_version on every edit.
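The increment rule can be sketched as a helper that reads the version written in a file and prints the next one. next_config_version is a hypothetical name, and the sketch assumes a single config_version line inside the totem section.

```shell
# next_config_version FILE: print the file's config_version plus one.
next_config_version() {
    current=$(sed -n 's/^[[:space:]]*config_version:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1)
    [ -n "$current" ] || { echo "no config_version in $1" >&2; return 1; }
    echo $(( current + 1 ))
}
```

It only reads the file; the value it prints still has to be compared with the one the daemon has actually loaded.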
Before picking the new version, check the one actually loaded by the daemon:
corosync-cmapctl -g totem.config_version
In our case, we go from a single ring (vRack) to two rings by adding the public IP. To move replication onto the public network, we set the public IP as ring0_addr and the vRack (still alive on the other node) as backup:
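A two-ring layout of this kind can be sketched as below. This is an illustration, not our production file: the node names (pve1, pve2), the cluster name, the version number and every address are placeholders (documentation ranges for the public side, 10.0.0.x for the vRack). The script only writes a candidate file to a scratch location for review.

```shell
# Write a candidate two-ring config to a temp file; nothing under /etc is touched.
# All names, addresses and version numbers below are placeholders.
candidate=$(mktemp)
cat > "$candidate" <<'EOF'
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.0.2.10
    ring1_addr: 10.0.0.1
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 198.51.100.20
    ring1_addr: 10.0.0.2
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: demo
  config_version: 5
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
EOF

# Show the lines that matter for the ring change.
grep -En 'ring[01]_addr|linknumber|config_version' "$candidate"
```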
For reference, here is the starting configuration (single ring over the vRack), handy to compare:
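A single-ring starting point of that shape can be sketched as follows, again with placeholder names and addresses rather than our real values. Only the nodelist and totem sections are shown; the logging and quorum sections are left out of this sketch.

```shell
# Write a placeholder single-ring baseline to a temp file for comparison.
# Names, addresses and the version number are assumptions.
baseline=$(mktemp)
cat > "$baseline" <<'EOF'
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.1
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.0.0.2
  }
}

totem {
  cluster_name: demo
  config_version: 4
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
EOF

# One address per node and a single interface block: one ring only.
grep -En 'ring[01]_addr|linknumber|config_version' "$baseline"
```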
Step 2: Restart Corosync
The live reload is finicky; we prefer a full restart which reloads the file regardless of version. Do it on both nodes:
systemctl restart corosync # on BOTH nodes
corosync-cfgtool -s # LINK 0 = primary, status: connected
pvecm status # Quorate: Yes, without 'expected 1'
Step 3: Refresh the pmxcfs nodelist
Essential after changing ring0_addr: otherwise PVE tools keep the old IP cached and replication keeps targeting the dead vRack.
systemctl restart pve-cluster # on BOTH nodes
cat /etc/pve/.members # the "ip" fields must show the new primary
Then restart the other PVE services so they reload the topology:
systemctl restart pvestatd pvedaemon pveproxy
# if HA is enabled:
systemctl restart pve-ha-lrm pve-ha-crm
Step 4: ZFS replication switches to the public IP
Why that's enough
Proxmox ZFS replication (pvesr) opens an SSH session to the target node's primary IP = its ring0_addr. By setting the public IP as ring0 and restarting pve-cluster, the zfs send/recv jobs resume automatically over the public network, without touching the job definitions.
If the node IPs changed, refresh the cluster SSH host keys, then force a job to validate:
pvecm updatecerts # refreshes the cluster ssh_known_hosts
pvesr run --id <vmid-job> --verbose # ssh must target the new public IP
Degraded case: no quorum at all
If the primary link is dead on both sides (or the failure already dropped the quorum before you intervened), /etc/pve is read-only and cluster propagation no longer works. You must push the config by hand:
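The manual push can be sketched as the sequence below. It is a dry run by default: each command is printed rather than executed, and only runs if APPLY=1 is set. PEER is a placeholder address for the other node, and the sketch assumes root SSH between the nodes still works over the surviving network.

```shell
# Dry run by default: commands are printed, not executed. Set APPLY=1 to run.
PEER=${PEER:-198.51.100.20}        # placeholder address of the other node
CONF=/etc/corosync/corosync.conf   # local file read by the corosync daemon

run() {
    if [ "${APPLY:-0}" = 1 ]; then "$@"; else echo "+ $*"; fi
}

push_config_by_hand() {
    run cp "$CONF" "$CONF.bak"                        # keep a copy before editing
    run "${EDITOR:-vi}" "$CONF"                       # add the ring, bump config_version
    run scp "$CONF" "root@$PEER:$CONF"                # identical file on the other node
    run systemctl restart corosync                    # local node
    run ssh "root@$PEER" systemctl restart corosync   # remote node
    run pvecm status                                  # expect Quorate: Yes
}

push_config_by_hand
```

Reading the printed commands before setting APPLY=1 is the point of the wrapper: in degraded mode a wrong file on both nodes is hard to undo.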
Once the new ring is established, quorum returns (no more expected 1). Then run steps 2 to 4 (restart corosync, pve-cluster and services) so that pmxcfs takes back control and replication resumes.
Final verification
Exit checklist
- corosync-cfgtool -s: LINK 0 connected on the right network, LINK 1 listed.
- pvecm status: Quorate: Yes, no leftover expected 1.
- cat /etc/pve/.members: the IPs shown are the expected primary IPs.
- pvesr run --id <job> --verbose: SSH targets the new IP and the job succeeds.
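The .members item of the checklist lends itself to a small script. The sketch below assumes the file exposes each node address as an "ip": "..." field, as described above; members_ips and check_members are hypothetical helper names.

```shell
# members_ips FILE: print every "ip" value found in a .members-style file.
members_ips() {
    grep -o '"ip"[[:space:]]*:[[:space:]]*"[^"]*"' "$1" | sed 's/.*"\([^"]*\)"$/\1/'
}

# check_members FILE IP...: fail if an expected primary IP is not listed.
check_members() {
    file=$1; shift
    for ip in "$@"; do
        members_ips "$file" | grep -qxF "$ip" || { echo "missing: $ip" >&2; return 1; }
    done
}
```

With placeholder addresses, the check would read: check_members /etc/pve/.members 192.0.2.10 198.51.100.20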
Network / firewall prerequisites
- corosync / knet = UDP 5405 (up to 5412). Open it between node IPs, restricted to the peer (not the world) if going over public.
- If node IPs changed: pvecm updatecerts to regenerate the ssh_known_hosts.
- Running quorum over the Internet is a temporary degraded mode: variable latency, exposure. Keep it only while the vRack is being repaired, then move the private link back to ring0.
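The "restricted to the peer" rule can be sketched as two filter rules. Plain iptables is an assumption here, to be adapted to whatever firewall the nodes already use; corosync_fw_rules is a hypothetical helper and the peer address is a placeholder. The function only prints the rules, it applies nothing.

```shell
# corosync_fw_rules PEER: print rules that accept corosync/knet UDP from PEER
# only, then drop the same ports for everyone else. Nothing is applied.
corosync_fw_rules() {
    peer=$1
    echo "iptables -A INPUT -p udp -s $peer --dport 5405:5412 -j ACCEPT"
    echo "iptables -A INPUT -p udp --dport 5405:5412 -j DROP"
}

corosync_fw_rules 198.51.100.20
```

Rule order matters: the ACCEPT for the peer has to come before the generic DROP, otherwise the peer's traffic is dropped too.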
This scenario, an OVH node suddenly losing its vRack link, is exactly the kind of incident we handle on-call for our clients. We operate Proxmox clusters hosted on OVHcloud under management: ring failover, ZFS replication, follow-up on the hardware repair with the datacenter, and the return to nominal.
A Proxmox cluster on OVHcloud to harden?
We design and operate redundant Proxmox clusters (multiple rings, ZFS replication, HA), and respond on-call to network and quorum failures, OVH vRack included. Let's talk about your infrastructure.