SSR RabbitMQ¶
Archived (pre-2022)
Preserved for reference only -- likely outdated. Last updated: July 2019
Description of the component¶
RabbitMQ is message-queueing software, also called a message broker or queue manager. Put simply, it is software in which queues can be defined; applications connect to a queue and transfer messages onto it.
In order to make the Fairbid system compatible with server-side rewarding capability, we need a RabbitMQ cluster.
Links¶
| Resource | Description |
|---|---|
| Terraform Production Environment | The module "rabbitmq_cluster" is initialized here for the "inneractive" infrastructure |
| Terraform Staging Environment | The module "rabbitmq_cluster" is initialized here for the "ia-staging" infrastructure |
| Terraform RabbitMQ module | The infrastructure for the RabbitMQ cluster is described here |
| Chef Inneractive RabbitMQ cookbook | |
| Consul Inneractive | |
| Grafana Dashboard | Monitoring Dashboard for both staging and production clusters |
Resources¶
Terraform¶
In order to update the infrastructure, make your changes in the tf-rabbitmq repo and then run `terraform apply` on the corresponding module in tf-staging_env and/or tf-production_env.
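The rollout described above can be sketched as follows; the repo layout and module name are taken from the links above, but the exact paths and targeting are illustrative:

```shell
# 1. Make and commit the module change in the tf-rabbitmq repo.
# 2. Roll it out to staging first:
cd tf-staging_env
terraform init -upgrade                           # pick up the new module version
terraform plan  -target=module.rabbitmq_cluster   # review the diff
terraform apply -target=module.rabbitmq_cluster
# 3. Repeat the plan/apply in tf-production_env once staging looks healthy.
```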
Applying changes in INN infrastructure for the first time
If you have never worked with the inneractive infrastructure before, you need to configure your environment first. Please get familiar with this manual. If you have any questions, ask for assistance in the devops channel.
Chef¶
We are using the community supermarket cookbook for RabbitMQ, which is supported by its maintainer: chef-cookbook (Github)
Inneractive's DevOps team created a wrapper cookbook with some basic configuration that allows the connection to Consul: master (Bitbucket)
This role is used to configure the RabbitMQ nodes of the SSR-RabbitMQ cluster:
- Production: SSR-rabbitmq-cluster.json (Bitbucket)
- Staging: https://bitbucket.org/inneractive-ondemand/system-common/src/master/chef/roles/SSR-rabbitmq-cluster-ia-staging.json
Consul¶
- Staging: http://consul.staging.inner-active.mobi:8500/ui/ia-staging/services/ssr_rabbitmq
- Production: http://consul.staging.inner-active.mobi:8500/ui/inneractive/services/ssr_rabbitmq
RabbitMQ Management Interface¶
The management interface is available at each node's IP address on port 15672. You can find the IP addresses of the SSR RabbitMQ cluster nodes in Consul.
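The same port also serves the management HTTP API, which is handy for scripted checks. A minimal sketch, assuming a node IP taken from Consul and credentials exported in the environment (the address below is hypothetical):

```shell
# Query the cluster overview via the management API (port 15672 per above).
NODE_IP=10.0.0.12                                   # hypothetical node address
curl -s -u "$RABBIT_USER:$RABBIT_PASS" "http://$NODE_IP:15672/api/overview"
# The web UI is served from the same endpoint: http://$NODE_IP:15672/
```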
Data Storage¶
Queue messages are stored on a separate EBS volume which is created and mounted during the instance's bootstrap.
Data persistence is achieved by mirroring the queues.
Mirroring to all nodes is the most conservative option. It will put additional strain on all cluster nodes, including network I/O, disk I/O and disk space usage. Having a replica on every node is unnecessary in most cases.
For clusters of 3 and more nodes, it is recommended to replicate to a quorum (the majority) of nodes, e.g. 2 nodes in a 3 node cluster or 3 nodes in a 5 node cluster.
Since some data can be inherently transient or very time sensitive, it can be perfectly reasonable to use a lower number of mirrors for some queues (or even not use any mirroring).
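Mirroring to a quorum, as recommended above, is configured through a policy. A minimal sketch using classic mirrored-queue policies (current at the time this page was written); the policy name and queue pattern are illustrative, not the real cluster settings:

```shell
# Mirror queues matching the pattern to exactly 2 nodes --
# a quorum in a 3-node cluster, per the guidance above.
rabbitmqctl set_policy ha-two "^ssr\." \
  '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic"}' \
  --apply-to queues
```

Queues not matched by any ha policy are left unmirrored, which fits the note above about transient, time-sensitive data.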
Definitions¶
Nodes and clusters store information that can be thought of as schema, metadata or topology. Users, vhosts, queues, exchanges, bindings, and runtime parameters all fall into this category.
Definitions are stored in an internal database and replicated across all cluster nodes. Every node in a cluster has its own replica of all definitions. When a part of definitions changes, the update is performed on all nodes in a single transaction. In the context of backups, this means that in practice definitions can be exported from any cluster node with the same result.
How are we managing definitions?
Currently, the definitions containing the users and vhosts are stored in a JSON file. This JSON file is placed in the RabbitMQ data directory during Terraform provisioning, and RabbitMQ is configured to load the definitions from this file during boot.
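A minimal sketch of what such a definitions file looks like; the user name, vhost, password hash, and file path below are illustrative placeholders, not the real cluster values:

```shell
# Write a minimal definitions file with one user, one vhost, and permissions.
cat > definitions.json <<'EOF'
{
  "users": [
    {"name": "ssr_app", "password_hash": "REDACTED", "tags": ""}
  ],
  "vhosts": [
    {"name": "/ssr"}
  ],
  "permissions": [
    {"user": "ssr_app", "vhost": "/ssr",
     "configure": ".*", "write": ".*", "read": ".*"}
  ]
}
EOF
# RabbitMQ then loads it at boot via a config entry such as:
#   management.load_definitions = /var/lib/rabbitmq/definitions.json
```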
User Management¶
The information about users is stored in the definitions file, which is loaded during boot.
Cluster Tuning¶
Disk Space¶
We are using a disk free limit configuration based on system memory. The recommended relative value is 1.5, which means that on a RabbitMQ node with 4GB of memory, if available disk space drops below 6GB, all new messages will be blocked until the disk alarm clears. If RabbitMQ needs to flush 4GB worth of data to disk, as can sometimes be the case during shutdown, there will be sufficient disk space available for RabbitMQ to start again. In this specific example, RabbitMQ will start and immediately block all publishers, since 2GB is well under the required 6GB.
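The arithmetic above is just RAM times the relative factor; a quick sketch (the 4GB figure is the example from the paragraph, not the real instance size):

```shell
# disk_free_limit.relative = 1.5 means: limit = RAM * 1.5
MEM_GB=4                                            # example node memory
LIMIT_GB=$(awk -v m="$MEM_GB" 'BEGIN { printf "%g", m * 1.5 }')
echo "disk alarm clears above ${LIMIT_GB}GB free"   # 6GB for a 4GB node
```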
Open Files Handles Limit¶
Operating systems limit a maximum number of concurrently open file handles, which includes network sockets. Make sure that you have limits set high enough to allow for an expected number of concurrent connections and queues.
Make sure your environment allows for at least 50K open file descriptors for the effective RabbitMQ user, including in development environments.
As a rule of thumb, multiply the 95th percentile number of concurrent connections by 2 and add the total number of queues to calculate the recommended open file handle limit.
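The rule of thumb above, with hypothetical workload figures plugged in:

```shell
# recommended limit = p95 concurrent connections * 2 + total queues
P95_CONNECTIONS=20000                         # hypothetical workload figure
QUEUE_COUNT=500                               # hypothetical workload figure
echo $(( P95_CONNECTIONS * 2 + QUEUE_COUNT )) # prints 40500
# Compare against the current soft limit for the RabbitMQ user:
ulimit -n
```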
Monitoring¶
Monitoring is currently implemented through Telegraf's RabbitMQ plugin. It collects all the metrics and exposes them on a Prometheus endpoint. Telegraf is registered as a service in Consul.
I've created the Grafana dashboard on inneractive side: SSR - RabbitMQ Cluster
There are two circumstances under which RabbitMQ will stop reading from client network sockets, in order to avoid being killed by the OS (out-of-memory killer):
- Memory use goes above the configured limit
- Free disk space drops below the configured limit
When running RabbitMQ in a cluster, the memory and disk alarms are cluster-wide; if one node goes over the limit then all nodes will block connections.
Both alerts should be configured based on instance properties such as RAM and EBS volume size.
There is also an important metric which we should monitor: File Descriptors Used
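Both the alarm state and file descriptor usage can be inspected from a node's shell; a sketch assuming `rabbitmqctl` is available on the node (output format varies by RabbitMQ version):

```shell
# Check current alarms and file descriptor usage on a node.
rabbitmqctl status | grep -A2 -E 'alarms|file_descriptors'
# Any entry under "alarms" means publishers are blocked cluster-wide,
# per the note above; file_descriptors shows used vs. available handles.
```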
Info
We need to configure these alerts before going to production.
Production Checklist¶
Source: Production Checklist
| Topic | Implementation |
|---|---|
| Virtual Hosts, Users, Permission | We are using separate vhosts/users per application/project. |
| Monitoring and Resource Limits | The recommended limits are configured through chef role. Grafana is being used for monitoring. |
| Log Collection | See the Log Collection section below. |
| Security | not implemented |
| Clustering | Clustering via Consul |
| Partition Handling Strategy | pause-minority mode |
| Node Time Synchronization | ntpd |
Log Collection¶
Logs are collected on the EBS data volume and aggregated according to the policy specified in the Chef role for the SSR RabbitMQ cluster.
DEVOPSBLN-730
Troubleshooting¶
Recovering from a split-brain: Partitions