Enable Warm Standby Replication (WSR) on RabbitMQ

Implementing a Resilient Messaging Framework using WSR on RabbitMQ

Posted by Alfus Jaganathan on Monday, June 24, 2024

Introduction

Warm Standby Replication is a resilience and continuity strategy for systems built on RabbitMQ clusters, particularly those deployed on Kubernetes. It continuously replicates data, including schema definitions and messages, from a primary (upstream) RabbitMQ cluster to a standby (downstream) cluster. The core aim of Warm Standby Replication is to minimize downtime and data loss in the event of a failure of the primary cluster.

Note: The WSR feature is available only in the commercial (Tanzu) edition of RabbitMQ.

For the most up-to-date information and advanced configuration, refer to Warm Standby Replication-Tanzu Docs.

Understanding Warm Standby Replication

Let’s visualize a RabbitMQ environment with two clusters: Cluster-A (active, primary, or upstream), bustling with activity, and Cluster-B (passive, standby, or downstream), prepared for disaster recovery. This setup ensures business continuity and data resilience through warm replication and synchronization processes.

Behind-the-Scenes Connection and Synchronization

Cluster-B

Discreetly linked to Cluster-A, Cluster-B is an unseen yet crucial component of the disaster recovery strategy. It silently maintains a state of readiness by mirroring crucial schema and data from Cluster-A, and is tasked with two primary synchronization duties.

(1) Schema Replication

It mirrors all changes made to the cluster’s schema, which includes configured virtual hosts, their related users, permissions, queues, and bindings. Notably, message contents are excluded from this replication.

(2) Selective Message Replication

Messages published to a specifically configured set of quorum queues are replicated, but are stored separately, outside the queues themselves.

Note: As of the GA releases dated February 16, 2024, message replication works exclusively with quorum queues.

Disaster Recovery and Cluster Promotion

(1) Promotion of Cluster-B

In the event of Cluster-A facing a catastrophic failure, the responsibility falls to a human operator to designate Cluster-B as the new primary cluster by following Promoting the Downstream (Standby) Cluster for Disaster Recovery. This crucial action ensures the continuation of services with minimal disruption.

(2) Configurable Recovery

The promotion process involves applying predetermined configuration parameters, such as specific time-frame and any excluded virtual hosts. This ensures that messages from the configured queues within the defined time-frame are accurately replayed and republished into the replicated queues of Cluster-B, now the new primary.

Post-Disaster Continuity

(1) Re-establishing the Standby Setup

Once Cluster-B ascends to the primary position, the focus shifts to creating and configuring new standby cluster(s) as downstream backups, ensuring the resilience cycle is perpetuated.

(2) Restoration and Role Reversal

When Cluster-A is brought back online and restored to operational status, it can seamlessly transition into the standby role, ready to step in should the need arise again or optionally, can be restored as primary cluster.

Getting Started with Configuration

Prerequisites

  • At least two RabbitMQ clusters deployed using the Cluster Operator, where one serves as the upstream and the other(s) as downstream(s). For detailed installation instructions, refer to the RabbitMQ Cluster Operator documentation.

    Note: For effective disaster recovery, it is advisable to install the standby RabbitMQ cluster in a Kubernetes cluster that is located in a different zone or location. This geographic distribution helps ensure that the system remains operational even if one location experiences an outage or other disaster scenarios. By doing so, you enhance the resilience and reliability of your infrastructure.

  • Required operator privileges on the Kubernetes cluster for installation.

  • Cluster Essentials for VMware Tanzu installed in the Kubernetes cluster.

  • Carvel Tools installed

  • kubectl installed

  • kapp installed

Assumptions

Before going further, let’s assume that the clusters are named rabbitmq-a (primary) and rabbitmq-b (standby), within namespaces rabbitmq-clusters-a and rabbitmq-clusters-b respectively.

Summary

Below is a summary of what we will do next to configure replication on each cluster (primary and standby). These steps are repeated on both clusters, each with its respective configuration.

  1. Configure additional plugins and additional config
  2. Create Replication User and Permissions
  3. Configure Schema Replication
  4. Configure Standby Replication

Finally, perform testing to make sure that replication works as expected.

Configure Primary Cluster for Warm Standby Replication

Let’s configure the Primary Cluster first, verify and then configure the Standby Cluster.

Configure additional plugins and additional config

Add the below plugins to the RabbitMQ cluster.

spec:
  rabbitmq:
    additionalPlugins:
      - rabbitmq_stream
      - rabbitmq_stream_management
      - rabbitmq_schema_definition_sync
      - rabbitmq_schema_definition_sync_prometheus
      - rabbitmq_standby_replication

Similarly, add the configuration below to the additionalConfig section.

spec:
  rabbitmq:
    additionalConfig: |
      schema_definition_sync.operating_mode = upstream
      standby.replication.operating_mode = upstream
      standby.replication.retention.size_limit.messages = 5000000000
      schema_definition_sync.downstream.default_amqps_port = 5672
      standby.replication.downstream.default_stream_protocol_port_without_tls = 5552

Make sure to apply these changes and wait for all the RabbitMQ nodes to be configured.
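One way to wait for the nodes to pick up the change is sketched below; it assumes the cluster and namespace names from the Assumptions section and the `rabbitmq-a-server-0` pod naming convention used by the Cluster Operator, so adapt as needed:

```shell
# Wait until the operator reports all replicas ready after the config change
kubectl -n rabbitmq-clusters-a wait rabbitmqcluster/rabbitmq-a \
  --for=condition=AllReplicasReady --timeout=10m

# Spot-check that the WSR-related plugins are now enabled on a node
kubectl -n rabbitmq-clusters-a exec rabbitmq-a-server-0 -- rabbitmq-plugins list
```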

Create Replication User & Permissions

The command below applies a combined configuration: it creates the Kubernetes Secret that stores the RabbitMQ replicator user’s credentials, defines the RabbitMQ user itself, and establishes the necessary permissions for both schema and standby replication for each vhost in the Primary Cluster (namespace rabbitmq-clusters-a). To learn more about these users and permissions, scroll to the Additional Info section at the end.

Note (1): It’s crucial to ensure that all these resources are created within the namespace where your RabbitMQ cluster is located.

Note (2): Please make sure to replace <rabbitmq replicator password> with the right password.

Note (3): Add a separate Permission for each vhost. In this example, only the default vhost / is considered.

kapp deploy -a rabbitmq-replicator-a -y -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: rabbitmq-replicator-secret
  namespace: rabbitmq-clusters-a
type: Opaque
stringData:
  username: rabbitmq-replicator-user
  password: <rabbitmq replicator password>
---      
apiVersion: rabbitmq.com/v1beta1
kind: User
metadata:
  name: rabbitmq-replicator-user
  namespace: rabbitmq-clusters-a
spec:
  rabbitmqClusterReference:
    name: rabbitmq-a
    namespace: rabbitmq-clusters-a
  importCredentialsSecret:
    name: rabbitmq-replicator-secret
    namespace: rabbitmq-clusters-a
---
apiVersion: rabbitmq.com/v1beta1
kind: Permission
metadata:
  name: rabbitmq-replicator.rabbitmq-schema-definition-sync.all 
  namespace: rabbitmq-clusters-a
spec:
  vhost: rabbitmq_schema_definition_sync
  userReference:
    name: rabbitmq-replicator-user 
    namespace: rabbitmq-clusters-a
  permissions:
    write: ".*"
    configure: ".*"
    read: ".*"
  rabbitmqClusterReference:
    name: rabbitmq-a
    namespace: rabbitmq-clusters-a
---
apiVersion: rabbitmq.com/v1beta1
kind: Permission
metadata:
  name: rabbitmq-replicator.default.all
  namespace: rabbitmq-clusters-a
spec:
  vhost: "/"
  userReference:
    name: rabbitmq-replicator-user
    namespace: rabbitmq-clusters-a
  permissions:
    write: ".*"
    configure: ".*"
    read: ".*"
  rabbitmqClusterReference:
    name: rabbitmq-a
    namespace: rabbitmq-clusters-a
EOF

Now that the user and permissions are configured for the primary cluster, we need to configure the replicators.

Configure Schema Replication

Create the SchemaReplication resource by running the command below against the Primary Cluster. Before executing, replace the placeholder IP address 1.0.0.0 with the external IP of the primary cluster’s Kubernetes service. You can obtain it with kubectl get svc rabbitmq-a -n rabbitmq-clusters-a -o jsonpath="{.status.loadBalancer.ingress[0].ip}"

kapp deploy -a rabbitmq-schema-replication-a -y -f - <<EOF
apiVersion: rabbitmq.com/v1beta1
kind: SchemaReplication
metadata:
  name: rabbitmq-schema-replication
  namespace: rabbitmq-clusters-a
spec:
  endpoints: "1.0.0.0:5672"
  upstreamSecret:
    name: rabbitmq-replicator-secret
    namespace: rabbitmq-clusters-a
  rabbitmqClusterReference:
    name: rabbitmq-a
    namespace: rabbitmq-clusters-a
EOF

Now that Schema Replication is configured, we have the option to create the users and vhosts for fine-grained access using the Topology operator. The example below creates the default vhost / using the Topology operator.

An important thing to note here is the tag named standby_replication, which is the identifier the schema replicator uses to pick the vhost for replication.

kapp deploy -a rabbitmq-vhosts-a -y -f - <<EOF
apiVersion: rabbitmq.com/v1beta1
kind: Vhost
metadata:
  name: default
  namespace: rabbitmq-clusters-a
spec: 
  name: "/"
  defaultQueueType: quorum
  tags: 
    - default
    - "standby_replication"
  rabbitmqClusterReference:
    name: rabbitmq-a
    namespace: rabbitmq-clusters-a
EOF

Make sure that the default vhost / and any others you added are created. This also confirms that the Topology operator is working as expected.
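To double-check from inside RabbitMQ, the vhost tags can be listed directly with rabbitmqctl. This is a sketch; the pod name rabbitmq-a-server-0 follows the Cluster Operator’s naming convention and may differ in your environment:

```shell
# The "/" vhost should appear with the standby_replication tag
kubectl -n rabbitmq-clusters-a exec rabbitmq-a-server-0 -- \
  rabbitmqctl list_vhosts name tags
```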

Configure Standby Replication

Create the StandbyReplication resource by running the below command in the Primary Cluster.

Note (1): The operating mode is set to upstream for the primary cluster.

Note (2): Replication policies must be configured for each vhost that needs message replication.

kapp deploy -a rabbitmq-standby-replication-a -y -f - <<EOF
apiVersion: rabbitmq.tanzu.vmware.com/v1beta1
kind: StandbyReplication
metadata:
  name: rabbitmq-standby-replication
  namespace: rabbitmq-clusters-a
spec:
  operatingMode: "upstream" 
  upstreamModeConfiguration: 
    replicationPolicies:
      - name: all-quorum-queues-in-default 
        pattern: "^.*" 
        vhost: "/"
  rabbitmqClusterReference:
    name: rabbitmq-a
    namespace: rabbitmq-clusters-a
EOF

Now that both schema and standby replication are configured on the Primary Cluster, we can start configuring the Standby Cluster.

Configure Standby Cluster for Warm Standby Replication

Let’s configure the Standby Cluster by following the instructions below.

Configure additional plugins and additional config

Add the below plugins to the RabbitMQ cluster.

spec:
  rabbitmq:
    additionalPlugins:
      - rabbitmq_stream
      - rabbitmq_stream_management
      - rabbitmq_schema_definition_sync
      - rabbitmq_schema_definition_sync_prometheus
      - rabbitmq_standby_replication

Similarly, add the configuration below to the additionalConfig section.

spec:
  rabbitmq:
    additionalConfig: |
      schema_definition_sync.operating_mode = downstream
      standby.replication.operating_mode = downstream
      schema_definition_sync.downstream.locals.users = ^default_user_
      schema_definition_sync.downstream.locals.global_parameters = ^standby
      schema_definition_sync.downstream.minimum_sync_interval = 15
      standby.replication.retention.size_limit.messages = 5000000000

Make sure to apply these changes and wait for all the RabbitMQ nodes to be configured.
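As with the primary cluster, you can wait for the standby nodes to settle; a sketch using the assumed cluster and namespace names:

```shell
# Wait until all replicas of the standby cluster are ready again
kubectl -n rabbitmq-clusters-b wait rabbitmqcluster/rabbitmq-b \
  --for=condition=AllReplicasReady --timeout=10m
```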

Create Replication Users & Permissions

This is very similar to the primary cluster setup above, where we created the users and permissions, except for the namespace and RabbitMQ cluster name.

The command below applies a combined configuration: it creates the Kubernetes Secret that stores the RabbitMQ replicator user’s credentials, defines the RabbitMQ user itself, and establishes the necessary permissions for both schema and standby replication for each vhost in the Standby Cluster (namespace rabbitmq-clusters-b).

Note (1): It’s crucial to ensure that all these resources are created within the namespace where your RabbitMQ cluster is located.

Note (2): Please make sure to replace <rabbitmq replicator password> with the right password.

Note (3): Add a separate Permission for each vhost. In this example, only the default vhost / is considered.

kapp deploy -a rabbitmq-replicator-b -y -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: rabbitmq-replicator-secret
  namespace: rabbitmq-clusters-b
type: Opaque
stringData:
  username: rabbitmq-replicator-user
  password: <rabbitmq replicator password>
---      
apiVersion: rabbitmq.com/v1beta1
kind: User
metadata:
  name: rabbitmq-replicator-user
  namespace: rabbitmq-clusters-b
spec:
  rabbitmqClusterReference:
    name: rabbitmq-b
    namespace: rabbitmq-clusters-b
  importCredentialsSecret:
    name: rabbitmq-replicator-secret
    namespace: rabbitmq-clusters-b
---
apiVersion: rabbitmq.com/v1beta1
kind: Permission
metadata:
  name: rabbitmq-replicator.rabbitmq-schema-definition-sync.all 
  namespace: rabbitmq-clusters-b
spec:
  vhost: rabbitmq_schema_definition_sync
  userReference:
    name: rabbitmq-replicator-user 
    namespace: rabbitmq-clusters-b
  permissions:
    write: ".*"
    configure: ".*"
    read: ".*"
  rabbitmqClusterReference:
    name: rabbitmq-b
    namespace: rabbitmq-clusters-b
---
apiVersion: rabbitmq.com/v1beta1
kind: Permission
metadata:
  name: rabbitmq-replicator.default.all
  namespace: rabbitmq-clusters-b
spec:
  vhost: "/"
  userReference:
    name: rabbitmq-replicator-user
    namespace: rabbitmq-clusters-b
  permissions:
    write: ".*"
    configure: ".*"
    read: ".*"
  rabbitmqClusterReference:
    name: rabbitmq-b
    namespace: rabbitmq-clusters-b
EOF

Now that the user and permissions are configured for the standby cluster, we need to configure the replicators.

Configure Schema Replication

Create the SchemaReplication resource by running the command below against the Standby Cluster. Before executing, replace the placeholder IP address 1.0.0.0 with the external IP of the active (primary) cluster’s Kubernetes service. You can obtain it with kubectl get svc rabbitmq-a -n rabbitmq-clusters-a -o jsonpath="{.status.loadBalancer.ingress[0].ip}"

kapp deploy -a rabbitmq-schema-replication-b -y -f - <<EOF
apiVersion: rabbitmq.com/v1beta1
kind: SchemaReplication
metadata:
  name: rabbitmq-schema-replication
  namespace: rabbitmq-clusters-b
spec:
  endpoints: "1.0.0.0:5672"
  upstreamSecret:
    name: rabbitmq-replicator-secret
    namespace: rabbitmq-clusters-b
  rabbitmqClusterReference:
    name: rabbitmq-b
    namespace: rabbitmq-clusters-b
EOF

Configure Standby Replication

Create the StandbyReplication resource by running the command below against the Standby Cluster. Before executing, replace the placeholder IP address 1.0.0.0 with the external IP of the active (primary) cluster’s Kubernetes service. You can obtain it with kubectl get svc rabbitmq-a -n rabbitmq-clusters-a -o jsonpath="{.status.loadBalancer.ingress[0].ip}"

Note (1): Operating Mode is set as downstream for the standby cluster

kapp deploy -a rabbitmq-standby-replication-b -y -f - <<EOF
apiVersion: rabbitmq.tanzu.vmware.com/v1beta1
kind: StandbyReplication
metadata:
  name: rabbitmq-standby-replication
  namespace: rabbitmq-clusters-b
spec:
  operatingMode: "downstream" 
  downstreamModeConfiguration: 
    endpoints: "1.0.0.0:5552"
    upstreamSecret:
      name: rabbitmq-replicator-secret
      namespace: rabbitmq-clusters-b
  rabbitmqClusterReference:
    name: rabbitmq-b
    namespace: rabbitmq-clusters-b
EOF

Now that all the configurations are complete, check the RabbitMQ logs for errors and fix any that appear.

Testing

Schema Replication

Simply create a queue in the primary cluster, under a configured vhost (the default / in this case), and check that it appears in the standby cluster. Allow a few seconds for the replicated schema to appear in the standby cluster.
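For example, a test queue can be declared declaratively through the Topology operator, in the same kapp style used above. The queue name wsr-test is illustrative; remember that only quorum queues participate in message replication:

```shell
kapp deploy -a rabbitmq-test-queue -y -f - <<EOF
apiVersion: rabbitmq.com/v1beta1
kind: Queue
metadata:
  name: wsr-test
  namespace: rabbitmq-clusters-a
spec:
  name: wsr-test      # queue name inside RabbitMQ
  vhost: "/"          # a vhost tagged with standby_replication
  type: quorum        # WSR replicates messages only for quorum queues
  durable: true
  rabbitmqClusterReference:
    name: rabbitmq-a
    namespace: rabbitmq-clusters-a
EOF
```

You can then confirm the queue shows up on the standby, for example with kubectl -n rabbitmq-clusters-b exec rabbitmq-b-server-0 -- rabbitmqctl list_queues.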

Message Replication

Once schema replication is verified, publish some messages to the queue. Then follow the instructions in Verifying Warm Standby Replication is Configured Correctly to verify that standby replication succeeds.
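As a sketch for publishing test messages, rabbitmqadmin (shipped with the management plugin) can publish through the default exchange. The queue name wsr-test, the availability of rabbitmqadmin inside the container image, and the credential placeholders are all assumptions you may need to adapt:

```shell
# Publish three test messages to a quorum queue in the "/" vhost via the
# default exchange (routing key = queue name). Replace the credential
# placeholders with the values from the cluster's default-user secret.
for i in 1 2 3; do
  kubectl -n rabbitmq-clusters-a exec rabbitmq-a-server-0 -- \
    rabbitmqadmin -u "<username>" -p "<password>" \
    publish exchange=amq.default routing_key=wsr-test payload="test message $i"
done
```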

Additional Info (Optional Read)

Users & Permissions

As seen above, to set up replication between a primary and a standby cluster, we need to ensure that certain secrets, users, and permissions are properly established on both clusters. Below is a detailed walkthrough of each.

v1/Secret

To configure a RabbitMQ user with the required permissions for replication tasks within a Kubernetes environment, the first step is creating a Kubernetes Secret. This Secret securely stores the replication username and password, ensuring that sensitive information is handled in line with recommended practices for managing credentials in Kubernetes deployments. Refer to the earlier yaml configuration with apiVersion: v1 and kind: Secret for the exact definition.

rabbitmq.com/v1beta1/User

The next step is defining a RabbitMQ user specifically for replication tasks across RabbitMQ clusters. This user authenticates with the previously created Secret, ensuring secure and efficient replication operations. The earlier configuration with apiVersion: rabbitmq.com/v1beta1 and kind: User outlines how to set up this user, which is created by the Messaging Topology Operator.

rabbitmq.com/v1beta1/Permission for Schema Replication

Next is to assign the user the required schema replication permissions, which enable it to manage and synchronize schema definitions. The first Permission configuration shown earlier, with apiVersion: rabbitmq.com/v1beta1, kind: Permission, and name: rabbitmq-replicator.rabbitmq-schema-definition-sync.all, establishes the necessary permissions for the replicator user to perform schema replication tasks. It does so by granting full (.*) write, configure, and read permissions on the rabbitmq_schema_definition_sync virtual host.

rabbitmq.com/v1beta1/Permission for Standby Replication

Next is to assign the user the required standby replication permissions. These permissions are critical for enabling the user to replicate data across clusters, targeting a particular virtual host. The second Permission configuration shown earlier, with apiVersion: rabbitmq.com/v1beta1, kind: Permission, and name: rabbitmq-replicator.default.all, sets up these essential permissions (full (.*) write, configure, and read) for the replicator user on the default virtual host /. If you have additional virtual hosts to replicate, you must create a kind: Permission for each, specifying the appropriate spec.vhost. This, too, leverages the Messaging Topology Operator to configure the permissions in the cluster.

Github

TODO: Will be publishing a github repo soon.

Hope you had fun coding!