Technical Guide · June 26, 2026 · 13 min read

    Change or Add a Corosync Ring on a Proxmox Cluster

    A Proxmox cluster relies on Corosync for its quorum. When the network carrying the ring goes down (typically an OVH vRack link disappearing because of a hardware fault), the cluster wobbles. Here is the procedure we ran in production to add a second ring, move ZFS replication onto the public IP and restore service in degraded mode while waiting for the hardware to be repaired.

    The setup: 2 OVH nodes, ring over the vRack

    The starting architecture is a classic OVH one: two Proxmox servers interconnected by a vRack (OVH's private L2 network between servers). Corosync uses that vRack as its single ring (link 0), and ZFS replication between the two nodes runs over those same private IPs. The public interface carries no ring: it serves the VMs' traffic, not the cluster.

    Before / after the vRack failure

    Nominal

    Ring 0 = vRack (private network)
    ZFS replication over the vRack
    Public interface = VM traffic only
    Quorum: 2/2 nodes

    vRack card failed

    Ring 0 down on one node
    Quorum lost → /etc/pve read-only
    ZFS replication blocked
    Public interface still up

    Goal: add the public interface as a ring and bring ZFS replication onto it, to get the service back without waiting for the vRack card repair.

    Two files, two readers: the basics

    This is what wastes the most time. Corosync on Proxmox handles two distinct files, read by different components:

    File                        | Read by                   | Role
    /etc/corosync/corosync.conf | corosync daemon           | Quorum, knet transport, rings
    /etc/pve/corosync.conf      | pmxcfs master + PVE tools | Source of truth, propagated to the cluster; remote_node_ip, replication, migration

    Golden rule: as long as the cluster is quorate, you edit only /etc/pve/corosync.conf (the master) and pmxcfs automatically propagates it to each node's local file. Without quorum, /etc/pve is read-only: you must edit the local file and copy it by hand (see the "degraded mode" step).

    The gotchas that cost you 2 hours

    • Replication / migration follows the primary IP (ring0_addr), not the Corosync runtime state or /etc/hosts.
    • The live reload only applies if config_version increases. When in doubt: restart.
    • After changing ring0_addr, pmxcfs keeps the old nodelist cached: you must restart pve-cluster to refresh /etc/pve/.members.
    • The migration network in datacenter.cfg only works if the target IPs fit in a single CIDR. Public IPs in disjoint /24s make it unusable, which is why the desired network goes into ring0_addr instead.

    Step 1: Edit the configuration

    Edit /etc/pve/corosync.conf if you are quorate, otherwise /etc/corosync/corosync.conf locally (see degraded mode below). Principles:

    • ring0_addr = primary link (the one replication will follow).
    • ring1_addr = backup link.
    • Declare one interface block per linknumber used (0 and 1).
    • Increment config_version on every edit.

    Before picking the new version, check the one actually loaded by the daemon:

    corosync-cmapctl -g totem.config_version
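The loaded version can then be turned into the next value to write. A minimal sketch, assuming the command prints a line of the form `totem.config_version (u64) = 4`; the helper name `next_config_version` is hypothetical and not part of any Proxmox or Corosync tool:

```shell
#!/bin/sh
# Sketch: compute the next config_version from the value the daemon has loaded.
# Assumption: corosync-cmapctl prints "totem.config_version (u64) = <number>".
next_config_version() {
    # $1: one line of corosync-cmapctl output
    current=$(printf '%s\n' "$1" | awk -F' = ' '{ print $2 }')
    case "$current" in
        ''|*[!0-9]*)
            # Refuse to guess if the line does not end in a plain integer.
            echo "cannot parse config_version from: $1" >&2
            return 1
            ;;
    esac
    echo $((current + 1))
}
```

On a node this would be called as `next_config_version "$(corosync-cmapctl -g totem.config_version)"`, and the printed number goes into the `config_version` field of the file you edit.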

    In our case, we go from a single ring (vRack) to two rings by adding the public IP. To move replication onto the public network, we set the public IP as ring0_addr and the vRack (still alive on the other node) as backup:
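As an illustration only, a two-ring file of this kind could look like the sketch below. The node names, cluster name, `config_version` and all addresses are placeholders (documentation ranges for the public IPs, 10.0.0.x for the vRack), not the values from the incident; the layout follows the usual Proxmox-generated corosync.conf.

```
nodelist {
  node {
    name: pve1                  # placeholder node name
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 203.0.113.10    # public IP = primary link, followed by replication
    ring1_addr: 10.0.0.1        # vRack IP = backup link
  }
  node {
    name: pve2                  # placeholder node name
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 198.51.100.20   # public IP, in a different /24 than pve1
    ring1_addr: 10.0.0.2
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: example-cluster # placeholder
  config_version: 5             # must be higher than the loaded version
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
```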

    For reference, here is the starting configuration (single ring over the vRack), handy to compare:
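A sketch of such a single-ring starting point, using the same placeholder names and addresses as assumptions (only the `nodelist` and `totem` sections are shown; they are the ones that differ):

```
nodelist {
  node {
    name: pve1                  # placeholder node name
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.1        # vRack IP, the only ring
  }
  node {
    name: pve2                  # placeholder node name
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.0.0.2
  }
}

totem {
  cluster_name: example-cluster # placeholder
  config_version: 4             # placeholder, one below the edited file
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
```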

    Step 2: Restart Corosync

    The live reload is finicky; we prefer a full restart which reloads the file regardless of version. Do it on both nodes:

    systemctl restart corosync          # on BOTH nodes
    corosync-cfgtool -s                 # LINK 0 = primary, status: connected
    pvecm status                        # Quorate: Yes, without 'expected 1'

    Step 3: Refresh the pmxcfs nodelist

    Essential after changing ring0_addr: otherwise PVE tools keep the old IP cached and replication keeps targeting the dead vRack.

    systemctl restart pve-cluster       # on BOTH nodes
    cat /etc/pve/.members               # the "ip" fields must show the new primary

    Then restart the other PVE services so they reload the topology:

    systemctl restart pvestatd pvedaemon pveproxy
    # if HA is enabled:
    systemctl restart pve-ha-lrm pve-ha-crm

    Step 4: ZFS replication switches to the public IP

    Why that's enough

    Proxmox ZFS replication (pvesr) opens an SSH session to the target node's primary IP = its ring0_addr. By setting the public IP as ring0 and restarting pve-cluster, the zfs send/recv jobs resume automatically over the public network, without touching the job definitions.

    If the node IPs changed, refresh the cluster SSH host keys, then force a job to validate:

    pvecm updatecerts                       # refreshes the cluster ssh_known_hosts
    pvesr run --id <vmid-job> --verbose     # ssh must target the new public IP

    Degraded case: no quorum at all

    If the primary link is dead on both sides (or the failure already dropped the quorum before you intervened), /etc/pve is read-only and cluster propagation no longer works. You must push the config by hand:
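One way to script that manual push is sketched below. It assumes you have already edited the local /etc/corosync/corosync.conf (new ring addresses, higher config_version) and that the peer is still reachable over SSH on some link. `run`, `push_corosync_conf` and `DRY_RUN` are hypothetical names for this example, and dry-run is the default so nothing is changed until you opt in.

```shell
#!/bin/sh
# Sketch: push a hand-edited corosync.conf to the peer when /etc/pve is read-only.
# DRY_RUN and the function names are illustrative, not Proxmox settings.

run() {
    # Print the command in dry-run mode (the default), execute it otherwise.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

push_corosync_conf() {
    peer=$1                               # IP of the other node, still reachable over SSH
    conf=/etc/corosync/corosync.conf
    # Copy the locally edited file to the same path on the peer.
    run scp "$conf" "root@${peer}:${conf}"
    # Restart the daemon on both sides so each one loads the new rings.
    run systemctl restart corosync
    run ssh "root@${peer}" systemctl restart corosync
    # Inspect link state and quorum afterwards.
    run corosync-cfgtool -s
    run pvecm status
}
```

Calling `push_corosync_conf 198.51.100.20` (a placeholder address) prints the five commands; setting `DRY_RUN=0` in the environment executes them.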

    Once the new ring is established, quorum returns (no more expected 1). Then run steps 2 to 4 (restart corosync, pve-cluster and services) so that pmxcfs takes back control and replication resumes.

    Final verification

    Exit checklist

    • corosync-cfgtool -s: LINK 0 connected on the right network, LINK 1 listed.
    • pvecm status: Quorate: Yes, no leftover expected 1.
    • cat /etc/pve/.members: the IPs shown are the expected primary IPs.
    • pvesr run --id <job> --verbose: SSH targets the new IP and the job succeeds.
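The .members check in this list can be scripted. A minimal sketch, assuming the file is JSON with one `"ip": "<address>"` field per node; `members_has_ip` is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: succeed if a .members file lists the given IP in an "ip" field.
# Assumption: each node entry contains a field written as "ip": "<address>".
members_has_ip() {
    file=$1
    # Escape the dots so the address is matched literally, not as a regex wildcard.
    ip_re=$(printf '%s' "$2" | sed 's/\./\\./g')
    grep -Eq "\"ip\"[[:space:]]*:[[:space:]]*\"${ip_re}\"" "$file"
}
```

For example, `members_has_ip /etc/pve/.members 203.0.113.10 && echo "primary IP OK"`, where the address is a placeholder for your new ring0_addr.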

    Network / firewall prerequisites

    • Corosync / knet uses UDP 5405 (up to 5412). Open it between node IPs and, when going over the public network, restrict it to the peer rather than the whole world.
    • If node IPs changed: pvecm updatecerts to regenerate the ssh_known_hosts.
    • Running quorum over the Internet is a temporary degraded mode: variable latency, exposure. Keep it only while the vRack is being repaired, then move the private link back to ring0.
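The peer-only restriction can be expressed as plain iptables rules. The sketch below only prints them so they can be reviewed first. It assumes you manage the host firewall with iptables directly rather than through the Proxmox firewall, and `corosync_fw_rules` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print iptables rules that accept corosync/knet traffic from one peer only.
# Assumption: plain iptables on the host; adapt if the Proxmox firewall is in use.
corosync_fw_rules() {
    peer=$1   # public IP of the other node
    # Accept UDP 5405-5412 from the peer...
    echo "iptables -A INPUT -p udp -s ${peer} --dport 5405:5412 -j ACCEPT"
    # ...and drop the same ports from everyone else (rule order matters).
    echo "iptables -A INPUT -p udp --dport 5405:5412 -j DROP"
}
```

Run `corosync_fw_rules 198.51.100.20` (placeholder address) on each node with the other node's IP, review the output, then apply the rules by hand.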

    This scenario, an OVH node suddenly losing its vRack link, is exactly the kind of incident we handle on-call for our clients. We operate managed Proxmox clusters hosted on OVHcloud: ring failover, ZFS replication, follow-up on the hardware repair with the datacenter, and the return to nominal.

    Need to harden a Proxmox cluster on OVHcloud?

    We design and operate redundant Proxmox clusters (multiple rings, ZFS replication, HA), and respond on-call to network and quorum failures, OVH vRack included. Let's talk about your infrastructure.