
Aerospike in AWS

Archived (pre-2022)

Preserved for reference only -- likely outdated. Last updated: September 2018

General Overview

The current AES (Aerospike) cluster running in AWS is an extension of the AES cluster running in the on-premise data centre (DC) (TASK).

The AES cluster in AWS runs in a single region: aws-production-aerospike-core → eu-west-1


Setup and Configuration

As mentioned in the previous section, the AES core cluster in AWS extends the cluster running in the on-premise data centre. The cluster uses the service bundled with Aerospike EE called "Cross Datacenter Replication", or XDR for short. This feature replicates each namespace to geographically diverse clusters with zero downtime.

XDR also tolerates remote cluster failures: shipping to a failed remote cluster is suspended and resumes once that cluster is available again. The feature is commonly used during service migrations and between data centres with different cluster setup modes (a mix of mesh and multicast layouts).

To read more about XDR, refer to the official Aerospike documentation found here →

The AES configuration for the AWS production environment must use the same namespace names and configuration as on-premise, namely:

  • counters
  • advanced_counters
  • netscores
  • cookies
  • fyber_test
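
Namespace parity between the two clusters can be checked with `asinfo -v namespaces`, which returns a semicolon-separated list of namespaces on a node. The sketch below uses hypothetical captured output (matching the list above) so the comparison logic itself is visible; in practice each variable would be filled from `asinfo` against a node in the respective cluster.

```shell
# Hypothetical captured output of: asinfo -h <dc-node> -v namespaces
dc_ns="counters;advanced_counters;netscores;cookies;fyber_test"
# Hypothetical captured output of: asinfo -h <aws-node> -v namespaces
aws_ns="counters;advanced_counters;netscores;cookies;fyber_test"

# Compare the two lists order-insensitively.
if [ "$(printf '%s' "$dc_ns" | tr ';' '\n' | sort)" = \
     "$(printf '%s' "$aws_ns" | tr ';' '\n' | sort)" ]; then
  result="match"
else
  result="differ"
fi
echo "namespace lists $result"
```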

The following section demonstrates how to enable XDR and start replicating a given namespace from the on-premise data centre to AWS.

XDR - DC && AWS

  1. Make a backup of one of the namespaces in the DC using the following command line with at least the following flags:
$ nohup asbackup -h 127.0.0.1 -n cookies -d /tmp/backup/

Where:

  • asbackup: An AES built-in command-line tool for backing up namespaces in a given data centre
  • -h: The target AES instance containing the target namespace (default: localhost)
  • -n: The name of the namespace
  • -d: The directory where the backup files of the target namespace will be stored
  • Note: Use "nohup" or "screen" to prevent hangups from interrupting asbackup; it can take a long time depending on the namespace size.
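
The nohup pattern from the note above can be sketched as follows. A short `sleep` stands in for the long-running `asbackup` so the mechanics are runnable anywhere; the real invocation (hypothetical host and path, matching the example above) is shown commented out.

```shell
mkdir -p /tmp/backup

# Detach the long-running job from the terminal and capture its output.
# 'sleep 1' stands in for asbackup in this sketch.
nohup sh -c 'sleep 1' > /tmp/backup/asbackup.log 2>&1 &
backup_pid=$!

# The job now survives terminal hangups; monitor progress with:
tail -n 5 /tmp/backup/asbackup.log

wait "$backup_pid"   # for the sketch only; normally you would log out instead

# Real invocation, same pattern (hypothetical host/paths):
# nohup asbackup -h 127.0.0.1 -n cookies -d /tmp/backup/ > /tmp/backup/asbackup.log 2>&1 &
```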
  2. Restore the namespace on one of the instances in AWS (make sure it has a replication factor of at least 2) using the following command line:
$ nohup asrestore -h 10.37.120.25 -d /tmp/backup/

2018-09-14 14:31:08 GMT [INF] [ 5498] Expired 9801 : skipped 0 : inserted 978151 : failed 0 (existed 0, fresher 0)
2018-09-14 14:31:08 GMT [INF] [ 5498] 2% complete, ~20h5m32s remaining
2018-09-14 14:31:18 GMT [INF] [ 5498] 2 UDF file(s), 0 secondary index(es), 994495 record(s) (105 KiB/s, 652 rec/s, 166 B/rec, backed off: 0)
2018-09-14 14:31:18 GMT [INF] [ 5498] Expired 9907 : skipped 0 : inserted 984588 : failed 0 (existed 0, fresher 0)
2018-09-14 14:31:18 GMT [INF] [ 5498] 2% complete, ~20h36m34s remaining
2018-09-14 14:31:28 GMT [INF] [ 5498] 2 UDF file(s), 0 secondary index(es), 1000867 record(s) (103 KiB/s, 641 rec/s, 164 B/rec, backed off: 0)

Where:

  • asrestore: An AES built-in command-line tool to restore namespaces in the target data centre (AWS in our example).
  • -h: The target AES instance to which the namespace will be restored. Note that it is sufficient to provide a single host of the cluster (given a replication factor greater than 1): the specified host acts as an entry point to the target cluster (AWS in our example), the remaining AES nodes are discovered automatically, and the namespace is replicated across all nodes.
  • -d: The directory containing the backup files (generated by the asbackup command in the previous step).
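
After the restore finishes, the object count on the AWS cluster can be spot-checked with `asinfo -v "namespace/<name>"`, which returns the namespace statistics as a semicolon-separated list. The sketch below parses a hypothetical sample of that output (the record count is taken from the asrestore log above); in practice the variable would be filled from `asinfo` against the AWS node.

```shell
# Hypothetical sample of: asinfo -h 10.37.120.25 -v "namespace/cookies"
ns_stats="ns_cluster_size=2;objects=994495;master_objects=497247;replication-factor=2"

# Pull out the total object count.
restored=$(printf '%s' "$ns_stats" | tr ';' '\n' | sed -n 's/^objects=//p')
echo "restored objects: $restored"
```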

3. Enable XDR in the namespace configuration stanza on the original DC instance node:

$ vim /etc/aerospike/aerospike.conf

...
namespace cookies {
  replication-factor 2
  memory-size 15G
  default-ttl 2d
  enable-xdr true
  xdr-remote-datacenter AWS
  storage-engine device {
    device /dev/sdg1
    scheduler-mode noop
    write-block-size 1M
  }

}
...
xdr {
        enable-xdr true
        xdr-digestlog-path /opt/aerospike/xdr/digestlog 5G

        datacenter AWS {
                dc-node-address-port 10.37.133.194 3000
                dc-node-address-port 10.37.120.25 3000
        }
}

...

In the namespace stanza, the following configuration lines are needed:

  • enable-xdr: Enables XDR for the specific namespace
  • xdr-remote-datacenter: The name of the target data centre (different namespaces can target different data centres)

Add a new xdr stanza with the following basic configuration options:

  • enable-xdr: Enables the XDR service itself
  • xdr-digestlog-path: The path and size of the digest log file that XDR writes to. Make sure the file and its directory path have the right permissions, and size the digest log generously enough to handle the logged namespaces.
  • datacenter: The name of the target data centre referenced from the replicated namespace.
  • dc-node-address-port: A sub-stanza of the datacenter option providing at least one entry with the IP address and service port of a node in the remote cluster (AWS in our case).
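
The effective XDR configuration can be read back from a running node with `asinfo -v "get-config:context=xdr"` (the info command of 3.x-era AES), which returns the settings as a semicolon-separated list. The sketch below parses a hypothetical sample of that output matching the stanza above.

```shell
# Hypothetical sample of: asinfo -v "get-config:context=xdr"
xdr_conf="enable-xdr=true;xdr-digestlog-path=/opt/aerospike/xdr/digestlog;xdr-max-ship-throughput=0"

# Confirm XDR is enabled on this node.
printf '%s' "$xdr_conf" | tr ';' '\n' | grep '^enable-xdr='
enabled=$(printf '%s' "$xdr_conf" | tr ';' '\n' | sed -n 's/^enable-xdr=//p')
```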

Important Note: For each configuration parameter, check whether updating it requires a node restart (static) or can be applied dynamically here → . The linked page gives details on each parameter; make sure to verify the AES version first. The configuration above requires a rolling restart of the AES cluster to take effect, because `xdr-digestlog-path` is a static and required parameter.

To read more about advanced XDR configuration for AES, please refer to the official AES website here → .

  4. Verify that XDR is working by checking the replication activity in the AES log file on the AES DC cluster as follows:
$ tail -f /var/log/aerospike/aerospike.log | grep xdr

...
Sep 17 2018 10:43:34 GMT: INFO (xdr): (xdr_dlog.c:92) dlog: free-pct 100 reclaimed 44600 glst 1537181013834 (2018-09-17 10:43:33 GMT)
Sep 17 2018 10:43:34 GMT: INFO (xdr): (xdr.c:610) [AWS]: dc-state CLUSTER_UP timelag-sec 0 lst 1537181013834 mlst 1537181013834 (2018-09-17 10:43:33 GMT) fnlst 0 (-) wslst 0 (-) shlat-ms 34 rsas-ms 0.000 rsas-pct 0.0 con 128 errcl 1185 errsrv 978 sz 2
...

Important Note: Rolling restart is only needed when adding XDR config for the first time.
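
The remote DC health seen in the log line above (`dc-state CLUSTER_UP`) can also be queried directly with `asinfo -v "dc/<name>"` on a source-cluster node, which returns the per-DC statistics as a semicolon-separated list. The sketch below parses a hypothetical sample of that output (values taken from the statistics shown later in this page).

```shell
# Hypothetical sample of: asinfo -v "dc/AWS" on a DC-cluster node
dc_stats="dc_state=CLUSTER_UP;dc_timelag=0;dc_ship_success=313323047"

# Extract the link state of the remote datacenter.
state=$(printf '%s' "$dc_stats" | tr ';' '\n' | sed -n 's/^dc_state=//p')
echo "AWS DC state: $state"
```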

XDR - Monitoring && Troubleshooting

  1. How to check XDR status in the target AWS cluster:
Admin> show stat xdr
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~XDR Statistics~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                               :   ip-10-37-120-25.eu-west-1.compute.internal:3000   ip-10-37-133-194.eu-west-1.compute.internal:3000
dlog_free_pct                      :   100                                               100
dlog_logged                        :   0                                                 0
dlog_overwritten_error             :   0                                                 0
dlog_processed_link_down           :   0                                                 0
dlog_processed_main                :   0                                                 0
dlog_processed_replica             :   0                                                 0
dlog_relogged                      :   0                                                 0
dlog_used_objects                  :   0                                                 0
xdr_active_failed_node_sessions    :   0                                                 0
xdr_active_link_down_sessions      :   0                                                 0
xdr_global_lastshiptime            :   18446744073709551615                              18446744073709551615
xdr_hotkey_fetch                   :   0                                                 0
.....
  2. How to granularly check all metrics by namespace in the original cluster:
Admin> show stat AWS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~netscores Namespace Statistics~~~~~~~~~~~~~~~~~
NODE                                    :   10.99.36.97:3000                    10.99.36.98:3000                    aes001.prd.fyber.com:3000
allow-nonxdr-writes                     :   true                                true                                true
allow-xdr-writes                        :   true                                true                                true
available_bin_names                     :   32760                               32760                               32760
batch_sub_proxy_complete                :   0                                   0                                   0
batch_sub_proxy_error                   :   0                                   0                                   0
batch_sub_proxy_timeout                 :   0                                   0                                   0
batch_sub_read_error                    :   0                                   0                                   0
batch_sub_read_not_found                :   0                                   0                                   0
batch_sub_read_success                  :   0                                   0                                   0
batch_sub_read_timeout                  :   0                                   0                                   0
batch_sub_tsvc_error                    :   0                                   0                                   0
batch_sub_tsvc_timeout                  :   0                                   0                                   0
client_delete_error                     :   0                                   0                                   0
client_delete_not_found                 :   0                                   0                                   0
client_delete_success                   :   0                                   0                                   0
client_delete_timeout                   :   0                                   0                                   0
client_lang_delete_success              :   0                                   0                                   0
client_lang_error                       :   0                                   0                                   0
client_lang_read_success                :   0                                   0                                   0
client_lang_write_success               :   0                                   0                                   0
client_proxy_complete                   :   36                                  0                                   8
client_proxy_error                      :   0                                   0                                   0
client_proxy_timeout                    :   1                                   0                                   0
client_read_error                       :   0                                   0                                   0
client_read_not_found                   :   90529538                            74713542                            61851130
client_read_success                     :   821838850                           747114969                           726169201
client_read_timeout                     :   0                                   0                                   0
client_tsvc_error                       :   0                                   0                                   0
client_tsvc_timeout                     :   0                                   0                                   0
client_udf_complete                     :   0                                   0                                   0
client_udf_error                        :   0                                   0                                   0
client_udf_timeout                      :   0                                   0                                   0
client_write_error                      :   0                                   0                                   0
client_write_success                    :   18529414                            18454711                            17637322
client_write_timeout                    :   0                                   0                                   0
cold-start-evict-ttl                    :   4294967295                          4294967295                          4294967295
conflict-resolution-policy              :   generation                          generation                          generation
current_time                            :   274886012                           274886012                           274886012
....
  3. How to get a brief status of the remote DC (AWS) and details of new-write shipments:
Admin> show stat dc
~~~~~~~~~~~~~~~~~~AWS DC Statistics~~~~~~~~~~~~~~~~~~~
NODE                     :   aes001.prd.fyber.com:3000
dc_open_conn             :   128
dc_ship_attempt          :   313325210
dc_ship_bytes            :   372698884639
dc_ship_delete_success   :   0
dc_ship_destination_error:   978
dc_ship_idle_avg         :   0.000
dc_ship_idle_avg_pct     :   0.0
dc_ship_inflight_objects :   0
dc_ship_latency_avg      :   43
dc_ship_source_error     :   1185
dc_ship_success          :   313323047
dc_size                  :   2
dc_state                 :   CLUSTER_UP
dc_timelag               :   0
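
The `dc_ship_*` counters above give a quick health figure: the ratio of `dc_ship_success` to `dc_ship_attempt` is the shipping success rate (the remaining 2,163 attempts match the source and destination error counters). A small awk sketch with those exact values:

```shell
# Ship success rate from the dc_ship_* counters above:
# 313,323,047 successes out of 313,325,210 attempts.
rate=$(awk 'BEGIN { printf "%.4f", 100 * 313323047 / 313325210 }')
echo "ship success rate: ${rate}%"
```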
  4. How to check which nodes are involved in shipping new writes from the original cluster, and their success/error rates:
Admin> info xdr
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~XDR Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                     Node      Build         Data    Free        Lag           Req         Req       Req          Cur       Avg
                        .          .      Shipped   Dlog%      (sec)   Outstanding     Shipped   Shipped   Throughput   Latency
                        .          .            .       .          .             .     Success    Errors            .      (ms)
aes001.prd.fyber.com:3000   3.13.0.8   347.224 GB     100   00:00:00       0.000     313.435 M   2.163 K          364        42
Number of rows: 1