RabbitMQ service degradation

Incident Report for Axinom

Postmortem

Context:

It was noticed that in new Mosaic environments, a couple of shovels failed to start up in RabbitMQ. Because the issue could be reproduced in any new Mosaic environment, the RabbitMQ cluster was checked immediately.

Initial Investigation:

It was identified that the RabbitMQ plugin responsible for running shovels, rabbitmq_shovel, was not performing as it should. While the majority of existing shovels were running as usual, new ones would not start up properly.
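As an illustration of how such a check might look, the sketch below filters a list of shovel status records and returns those that are not running. The record shape (`vhost`, `name`, `state`) is an assumption modelled on the shovel listing exposed by the RabbitMQ management interface, and the sample names are hypothetical. It is not the tooling that was used during the incident.

```python
def shovels_not_running(shovels):
    """Return (vhost, name, state) for every shovel whose state is not 'running'.

    `shovels` is assumed to be a list of dicts with 'vhost', 'name' and
    'state' keys, for example the decoded JSON of a shovel status listing.
    """
    return sorted(
        (s.get("vhost", ""), s.get("name", ""), s.get("state", "unknown"))
        for s in shovels
        if s.get("state") != "running"
    )


# Hypothetical sample data: one long-lived shovel and one from a new environment.
sample = [
    {"vhost": "env-old", "name": "outbox-shovel", "state": "running"},
    {"vhost": "env-new", "name": "outbox-shovel", "state": "starting"},
]
stuck = shovels_not_running(sample)
```

A result that contains only shovels from recently created environments would match the pattern described above, where existing shovels keep running and new ones never start.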

Remediation: 

The standard recovery procedure when a plugin is misbehaving is to disable it on the affected node and then enable it again, akin to restarting the plugin. However, the plugin did not finish enabling within an acceptable time (15 minutes).
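The disable-then-enable procedure can be sketched as follows, using the `rabbitmq-plugins` CLI with a 15-minute limit per step. This is a minimal sketch and not the exact procedure that was run: the function names are hypothetical, and the `run` parameter exists only so the logic can be exercised without a broker.

```python
import subprocess


def plugin_restart_commands(plugin, node=None):
    """Build the disable/enable command pair for one plugin."""
    base = ["rabbitmq-plugins"]
    if node:
        # -n targets a specific node; the node name is supplied by the caller.
        base += ["-n", node]
    return [base + ["disable", plugin], base + ["enable", plugin]]


def restart_plugin(plugin, node=None, timeout_s=15 * 60, run=subprocess.run):
    """Disable, then re-enable a plugin.

    Returns False if either step exits non-zero or exceeds timeout_s seconds.
    """
    for cmd in plugin_restart_commands(plugin, node):
        try:
            run(cmd, check=True, timeout=timeout_s)
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            return False
    return True
```

Calling `restart_plugin("rabbitmq_shovel")` on the affected node would return `False` in the situation described above, where the enable step did not complete within the time limit.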

Therefore, a RabbitMQ service restart was attempted on the same node. While everything restarted without an issue, the shovel plugin continued to error out and did not stabilize even after the restart.

Even adding an entirely new node failed whenever the shovel plugin was enabled on that node. Thus, it became apparent that the shovel configuration within the cluster had drifted into a corrupted state. Therefore, the only option was to migrate the RabbitMQ definitions and messages to an entirely new cluster.

Conclusion: 

On Apr 08, 2025, at 04:44 UTC, a new cluster was formed, with double the CPU and memory specifications, and all definitions were transferred to the new cluster, effectively fixing the issue.
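One way to sanity-check such a transfer is to compare definition exports taken from the old and new clusters, for example files produced by `rabbitmqctl export_definitions`. The sketch below is an assumption about how that comparison could be done, not a description of the actual migration. It covers definitions only (how messages were moved is not described in this report), and it relies on dynamic shovels being stored as runtime parameters with the component `shovel`. All function names are hypothetical.

```python
import json

# Top-level sections of a definitions export that this sketch compares.
SECTIONS = ("vhosts", "users", "permissions", "parameters",
            "policies", "queues", "exchanges", "bindings")


def load_definitions(path):
    """Read a definitions export (JSON) from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def definition_counts(defs):
    """Count the entries in each known section of an export."""
    return {s: len(defs.get(s, [])) for s in SECTIONS}


def shovel_names(defs):
    """List (vhost, name) for every dynamic shovel in an export."""
    return sorted(
        (p.get("vhost", "/"), p["name"])
        for p in defs.get("parameters", [])
        if p.get("component") == "shovel"
    )


def migration_diff(old, new):
    """Return {section: (old_count, new_count)} for sections whose counts differ."""
    old_c, new_c = definition_counts(old), definition_counts(new)
    return {s: (old_c[s], new_c[s]) for s in SECTIONS if old_c[s] != new_c[s]}
```

An empty result from `migration_diff` together with identical `shovel_names` output for both exports would suggest that nothing was dropped. Matching counts do not prove the entries are identical, so this is a quick check rather than a full verification.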

Lessons Learned:

  1. It is essential to conduct periodic cleanups so that the number of moving parts stays at precisely what we need.
  2. It is relatively easy and quick to migrate our RabbitMQ setup from one cluster to another without losing configuration or messages.
Posted May 15, 2025 - 07:40 UTC

Resolved

This incident has been resolved and only affected the new environments created after 2025-04-07 15:59 UTC.
Posted Apr 08, 2025 - 04:44 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Apr 07, 2025 - 15:59 UTC
This incident affected: Mosaic RabbitMQ Cluster.