Ragged Grafana graphs because of kafka-events-metrics in Mesos continuously restarting

Archived (pre-2022)

Preserved for reference only -- likely outdated. Last updated: August 2018

CheckMk alarm: sys_kafka_events_metrics_consumers – CRIT: No instances found for app_id /kafka-events-metrics (also escalated via OpsGenie)
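To confirm the alarm by hand, the app's state can be read from Marathon. The sketch below only classifies an already-fetched response; it assumes the JSON shape returned by Marathon's `GET /v2/apps/{appId}` (an `app` object with `instances` and `tasksRunning`). The function names, the WARN level and the sample payload are hypothetical, and the real check may work differently.

```python
import json


def running_instances(app_response: dict) -> int:
    """Return the number of running tasks reported for a Marathon app."""
    app = app_response.get("app", {})
    return int(app.get("tasksRunning", 0))


def check_state(app_response: dict) -> str:
    """Classify the app roughly the way the alarm above does.

    CRIT: no running instances; WARN: fewer running than requested; OK otherwise.
    The WARN level is an assumption, the alarm text only shows CRIT.
    """
    running = running_instances(app_response)
    wanted = int(app_response.get("app", {}).get("instances", 0))
    if running == 0:
        return "CRIT"
    if running < wanted:
        return "WARN"
    return "OK"


# Hypothetical sample of what Marathon might return while the app is down.
sample = json.loads(
    '{"app": {"id": "/kafka-events-metrics", "instances": 1, "tasksRunning": 0}}'
)
print(check_state(sample))
```

In practice the response would come from the Marathon API of the cluster rather than from a hard-coded string.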

The kafka-events-metrics application running in the Mesos cluster went down because of out-of-memory (OOM) issues.

This affected many graphs on the Team OFW Grafana dashboard:
https://grafana.prd.fyber.com:3000/dashboard/db/team-ofw?refresh=5m&panelId=22&fullscreen&orgId=1
https://grafana.prd.fyber.com:3000/dashboard/db/team-ofw?refresh=5m&panelId=6&fullscreen&orgId=1

and so on.

Every time the application went down, Mesos tried to redeploy it. That helped only for a while: the app ran for 2 minutes and then went down again.

After the timeout configuration for the app was increased to 20 seconds, it stopped failing, but the metrics it produced were not accurate. (Configs are changed in the Marathon web UI.)

One more config change, "CONSUMER_GROUP_ID=production-kafka-events-metrics-v7" (the version suffix of the consumer group ID was increased by 1), finally solved the issue.
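A likely explanation, not confirmed in the original notes: a new consumer group has no committed offsets in Kafka, so the app starts from the position set by its offset reset policy instead of resuming the old group's position. The bump itself can be sketched as below. The helper names are hypothetical, the previous value `-v6` is inferred from "increased by 1", and the `{"env": ...}` payload assumes the change is sent to Marathon's `PUT /v2/apps/{appId}` rather than made in the web UI.

```python
import json
import re


def bump_group_id(group_id: str) -> str:
    """Increase the trailing -v<N> suffix of a consumer group ID by 1."""
    match = re.fullmatch(r"(.*-v)(\d+)", group_id)
    if match is None:
        raise ValueError(f"no -v<N> suffix in {group_id!r}")
    prefix, version = match.groups()
    return f"{prefix}{int(version) + 1}"


def with_new_group(env: dict) -> dict:
    """Return a copy of the app environment with CONSUMER_GROUP_ID bumped."""
    updated = dict(env)
    updated["CONSUMER_GROUP_ID"] = bump_group_id(env["CONSUMER_GROUP_ID"])
    return updated


# Assumed previous value; only the v7 value appears in the notes above.
old_env = {"CONSUMER_GROUP_ID": "production-kafka-events-metrics-v6"}

# Body for a Marathon app update (assumption: sent via the REST API).
payload = {"env": with_new_group(old_env)}
print(json.dumps(payload))
```

An app update is expected to replace the whole `env` map, so `old_env` would need to hold every existing variable of the app, not just the group ID.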

This could also have been caused by a lack of resources in the Mesos cluster.