Skip to content

Offerwall MySQL in GCP

Imported from Confluence

Content may be outdated. Verify before following any procedures. View original | Last updated: December 2023

Imported from Edit

Offerwall MySQL in GCP

Regular flow

worddav6f3e97307f5601d60d64d247325a6f9e.png

Database servers

There are three database servers, set up as virtual machines:

  • mysql-core-0.vm-offerwall-prod-useast1.fyber.com
  • mysql-core-1.vm-offerwall-prod-useast1.fyber.com
  • mysql-core-2.vm-offerwall-prod-useast1.fyber.com

Data is replicated between these servers as described below and each server assumes a specific role.

Server roles

Server roles are designated by DNS:

  • master: the only server with writes enabled!
  • standby: read-only, used for long running queries and also as backup master
  • standby-users: read-only, used for manual queries

Replication

  • Data is replicated from master to standby to standby-users
  • GTID (Global Transaction ID) is enabled on all servers, allowing for replication to run in auto-position mode (it is not necessary to specify binary log names and positions)
  • There's two-way replication between mysql-core-0 and mysql-core-1, enabling easy master switch!

Common Tasks

Master-switch

Due to existing two-replication it should not be necessary to change anything in the replication when switching master between mysql-core-0 and mysql-core-1. Still, for safety check if replication is running and not delayed on the new master server!
Steps:

  1. Check that replication is running without delay on the new master
  2. Make old master read-only
  3. Make new master read-write
  4. Point mysql-core-master dns to the new master
  5. Restart/Reconnect all applications

Make sure to only enable writes on the new master after the old master has been set to read-only!

Standby-switch

Standby switch can be done anytime simply by pointing mysql-core-standby to the new server.

  1. Check that replication is running without delay on the new standby
  2. Point mysql-core-standby dns to the new standby

Change replication flow

To point replication to a different server, execute these steps on the slave:

  1. Stop replication
  2. Point replication to new master
  3. Start replication

Due to GTID it's not necessary to fiddle with binlog positions anymore.

Example

For example to replicate to mysql-core-2 from mysql-core-0 instead of mysql-core-1, run these commands on mysql-core-2:
mysql> stop slave;
Query OK, 0 rows affected (0.00 sec)
mysql> change master to master_host='mysql-core-0.vm-offerwall-prod-useast1.fyber.com';
Query OK, 0 rows affected (0.01 sec)
mysql> start slave;
Query OK, 0 rows affected (0.00 sec)

Skip transaction

Skipping a transaction works differently with GTID enabled: We need to find out the UUID of the transaction we want to skip, then create an empty transaction with that UUID. For details see {+}https://dev.mysql.com/doc/mysql-replication-excerpt/5.7/en/replication-administration-skip-gtid.html+
mysql> select * from performance_schema.replication_applier_status_by_worker\G

                                                        1. row ***************************
                                                          CHANNEL_NAME:
                                                          WORKER_ID: 0
                                                          THREAD_ID: 1022
                                                          SERVICE_STATE: ON
                                                          LAST_SEEN_TRANSACTION: a41645e6-9720-11ed-91aa-42010aba0016:59639792
                                                          LAST_ERROR_NUMBER: 1062
                                                          LAST_ERROR_MESSAGE: Could not execute Write_rows_v1 event on table cms_db_production.application_installations; Duplicate entry '1907130520' for key 'PRIMARY'
                                                          LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
                                                          1 row in set (0.00 sec)
                                                          mysql> set gtid_next='a41645e6-9720-11ed-91aa-42010aba0016:59639792'; begin; commit; set gtid_next='AUTOMATIC'; start slave;

                            # Grafana Dashboards and Alerts

                            ## Grafana Dashboards - MySQL Overview: this is the most important dashboards: it gives an overview of all MySQL servers and their status. See the notes on each panel for a detailed description. This dashboard also holds all alerts in the last row (see below) - MySQL Server Details: holds general information about each individual server. - MySQL Replication: shows the replication status on each individual server.

Alerts

Alerts are configured in the last row of the MySQL Overview dashboard.
Currently all alerts go to the bln-escalation-muted channel in Slack. This will most likely have to be tuned!**

  • MySQL Uptime: alerts when a MySQL process has been recently restarted. If no planned restart or reboot of the server is known, it should be investigated why that happened. It might be that some MySQL settings need tuning or that the server needs more memory.
  • Replication running: alerts when replication has stopped on one of the servers. A possible source might be a replication error. See above on how to skip offending transactions.
  • Replication delay: alerts when replication lag on one of the servers exceeds 2 hours. It should be investigated why the server is slow. A possible source for delay might be long running queries blocking parts of the database.
  • /storage Filesystem space: alerts when disk space on the main database disk of one of the servers becomes low. Either the disk needs to be enlarged, or data has to be cleaned up. Please note that when deleting large amounts of data from a table, it's usually necessary to run OPTIMIZE TABLE to actually regain the space. Running OPTIMIZE TABLE might block writes to the table for a considerable amount of time though!
  • /dblogs Filesystem space: alerts when the space for replication and binary logs becomes low. Either the disk needs to be enlarged or some old logs can be cleared using PURGE BINLOGS BEFORE. If there is no known reason why there were too many logs it might also be necessary to tune EXPIRE_LOGS_DAYS.