
Investigation of server-side callback failures

Archived (pre-2022)

Preserved for reference only -- likely outdated. Last updated: December 2018

Root Cause

If only some specific applications are affected, the most probable root causes are:

1) The publisher has not whitelisted the IP address we use to send callbacks

2) The publisher has not set up a service on their server to respond to the callbacks we send

Threat

First of all, it causes problems for the affected applications because their callbacks are not handled properly.

As for the infrastructure, this can cause disk space issues on the Cache Cluster servers (ccs001, ccs002, ccs003). Disk space is usually cleared automatically, but at some point this process can be blocked by a stuck beanstalkd job. Since there is no way to figure out which job is causing the problem, we need to kick all delayed jobs.

CheckMK Alert

Host                  Service           Status
ccs001.prd.fyber.com  Filesystem /data  CRIT - 88.4% used (648.71 of 733.52 GB), (warn/crit at 70.81/77.13%), trend: +101.65 GB / 24 hours - only 20 hours until disk full

Investigation

You can check the current status of the beanstalkd tubes using this link: Ams - Tubes

Pay attention to these tubes: publisher_callback-X, engine_publisher_callback_X, bre_callback_dispatch-X. During an incident, a lot of delayed jobs pile up in these tubes.
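If you prefer to inspect the tubes from a shell instead of the AMS page, the beanstalkd text protocol's `stats-tube` command returns a YAML body that includes a `current-jobs-delayed` counter. Below is a minimal sketch of parsing that body and flagging a backlog; the threshold value is an arbitrary example, not an official limit.

```python
# Sketch: parse the YAML body of a beanstalkd `stats-tube` reply and flag
# tubes whose delayed-job count looks abnormal.

def delayed_jobs(stats_body: str) -> int:
    """Extract `current-jobs-delayed` from a stats-tube YAML body."""
    for line in stats_body.splitlines():
        if line.strip().startswith("current-jobs-delayed:"):
            return int(line.split(":", 1)[1])
    raise KeyError("current-jobs-delayed not found")

def tube_is_backed_up(stats_body: str, threshold: int = 1000) -> bool:
    """Arbitrary example threshold -- tune it to what is normal for the tube."""
    return delayed_jobs(stats_body) >= threshold

# Example stats-tube body (illustrative values):
sample = """---
name: publisher_callback-1
current-jobs-ready: 3
current-jobs-delayed: 48231
total-jobs: 901223
"""
print(tube_is_backed_up(sample))  # True -- 48231 delayed jobs is a backlog
```

During an incident the delayed counter keeps climbing, so comparing two readings a minute apart is often more telling than any absolute threshold.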

The current rate of failing publisher callbacks can be checked here: Grafana : mRV User Rewarding

[Screenshot: Grafana graph of failed publisher callbacks, 2018-08-29]

On the graph above, the red layer represents the rate of failed publisher callbacks per minute. A spike here is a clear sign that some applications are misbehaving.

To figure out which exact application is causing the trouble, we need to run a query via Impala: Impala Query Editor

Checking via Impala:

SELECT hour, application_id, COUNT(*)
FROM data_platform_warehouse.publisher_callback_failure_v1
WHERE `year` = 2018 AND `month` = 9 AND `day` = 19 AND hour = 17
GROUP BY 1, 2
ORDER BY COUNT(*) DESC

Response:

[Screenshot: Impala query result, 2018-08-29]

Here we can see the three application IDs causing the greatest number of failed callbacks. You can find out who the publisher of each application is here: AMS - Publishers. This information will be useful for the technical services team.

Another way is to check the messages in the publisher_callback_failure_v1 Kafka topic:

docker run desimic/tailtopic -b kfk001.prd.fyber.com -d avro publisher_callback_failure_v1 | jq . | grep application_id
  "application_id": 34326,
  "application_id": 34326,
...
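Instead of eyeballing the grep output, you can tally the failing application IDs from the decoded message stream. A minimal sketch, assuming each message is a JSON object with an `application_id` field (the field names and example values below are illustrative):

```python
# Sketch: count failing application_ids from a stream of decoded
# callback-failure messages (one JSON object per line, as jq would emit
# in compact mode).
import json
from collections import Counter

def top_failing_apps(lines, n=3):
    """Return the n application_ids with the most failure messages."""
    counts = Counter()
    for line in lines:
        msg = json.loads(line)
        counts[msg["application_id"]] += 1
    return counts.most_common(n)

# Illustrative messages -- the real schema may carry more fields.
stream = [
    '{"application_id": 34326, "response_code": 403}',
    '{"application_id": 34326, "response_code": 403}',
    '{"application_id": 39905, "response_code": 500}',
]
print(top_failing_apps(stream))  # [(34326, 2), (39905, 1)]
```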

Collect all the information and reach out to technical services. They will get in contact with the publisher of the application to solve the issue. Involve the on-call developer if assistance is required from the Fyber side.

UPDATE:

Query to check applications with failed callbacks:

SELECT application_id, COUNT(*)
FROM data_platform_warehouse.publisher_callback_failure_v1
WHERE year = 2018 AND month = 12 AND day = 6
GROUP BY 1
ORDER BY COUNT(*) DESC

Result:

[Screenshot: query result]

Check the response codes and messages for the failing applications:

SELECT *
FROM data_platform_warehouse.publisher_callback_failure_v1
WHERE application_id IN (39905, 46180, 39903)
AND year = 2018 AND month = 12 AND day = 6
ORDER BY callback_timestamp

Resolution

The temporary solution for this problem is to kick the delayed jobs from the tubes. You can use the beanstalkd_tubes_kicker script to kick the jobs.

IMPORTANT:

Contact solution engineering about this problem and, if needed, kick the beanstalkd tubes.

Start removing jobs from the last tube: jobs in that tube have already had 8 processing attempts, so kicking them causes no contract violations on our side.
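Under the hood, what the beanstalkd_tubes_kicker script has to do per tube maps onto two beanstalkd protocol commands: `use <tube>` to select the tube, then `kick <bound>` to move up to `bound` buried/delayed jobs back to the ready queue; the server answers `KICKED <count>`. A minimal sketch of the protocol framing (the bound value and reply below are illustrative):

```python
# Sketch of the beanstalkd `kick` protocol exchange that a tube-kicker
# script performs after selecting a tube with `use <tube>`.

def kick_command(bound: int) -> bytes:
    """Build the protocol line that kicks up to `bound` delayed/buried jobs."""
    return f"kick {bound}\r\n".encode()

def parse_kicked(reply: bytes) -> int:
    """Parse the `KICKED <count>` reply; raise on anything else."""
    status, _, count = reply.decode().strip().partition(" ")
    if status != "KICKED":
        raise RuntimeError(f"unexpected reply: {reply!r}")
    return int(count)

print(kick_command(100000))               # b'kick 100000\r\n'
print(parse_kicked(b"KICKED 48231\r\n"))  # 48231
```

In practice these bytes go over a TCP socket to the beanstalkd port on the affected ccs host, one `use`/`kick` pair per tube, which is exactly what the kicker script automates.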

Additional information

Grafana graphs for monitoring callback failures across all products:

Grafana - Technical Services Main Cockpit

It is also good practice to check that callbacks are not piling up:
Grafana - Technical Services Main Cockpit

Example Incidents