Context:
It was noticed that in new Mosaic environments, a couple of shovels failed to start up in RabbitMQ. This issue could be easily recreated with any new Mosaic environment, thus the RabbitMQ cluster was immediately checked.
Initial Investigation:
It was identified that the RabbitMQ plugin responsible for running shovels, called rabbitmq_shovels was not performing as it should. While the majority of existing shovels were running as usual, new ones would not start up properly.
Remediation:
The standard recovery procedure when a plugin is misbehaving is to disable it on the node where it's misbehaving and then enable it again, akin to restarting the plugin. However, it appeared that the plugin was not enabled within an acceptable time (15 minutes).
Therefore, a RabbitMQ service restart was attempted using the same node. While everything was restarted without an issue, the shovel plugin continued to error out and not stabilize even after the restart.
Even adding an entirely new node would fail if the shovel plugin is enabled in the new node. Thus, it became apparent that the shovel configuration within the cluster has drifted into a corrupted state. Therefore, the only option was to migrate the RabbitMQ definitions and messages to an entirely new cluster.
Conclusion:
On Apr 08, 2025, at 04:44 UTC, a new cluster was formed, with double the CPU and memory specifications, and all definitions were transferred to the new cluster, effectively fixing the issue.
Lessons Learned: