Prometheus Main

Archived (pre-2022)

Preserved for reference only -- likely outdated. Last updated: April 2020

Metrics

To collect metrics from infrastructure components, we deploy the official Prometheus exporters to the instances we want to monitor. Node Exporter with the SystemD collector enabled is the one mandatory exporter; it provides basic system-level metrics such as CPU, memory, filesystems, and so on.
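As an illustration, the systemd collector is an opt-in flag on Node Exporter (a minimal sketch; in our setup the flags are rendered by the Chef recipe rather than typed by hand):

```shell
# Start Node Exporter with the opt-in systemd collector enabled;
# the listen address shown is the default.
node_exporter \
  --collector.systemd \
  --web.listen-address=":9100"
```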

Each exporter has a corresponding Chef recipe, written so that it can be customized to the needs of a specific instance, system, or cluster. You can check the list of available exporters here: Exporters

Healthchecks

We are using the official Blackbox Exporter, which can perform HTTP, HTTPS, TCP, DNS, and ICMP checks. It is installed and configured on the Prometheus Main server.

The current health checks are listed here - GitHub. The Blackbox Exporter's working configuration and logs are available here - Prometheus BlackBox UI.
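To sanity-check a probe by hand, you can call the exporter's /probe endpoint directly. A sketch, assuming the exporter's default port and the conventional `http_2xx` example module name (substitute a module from our actual config):

```shell
# Ask the Blackbox Exporter (default port 9115) to probe a target.
# The response is plain Prometheus metrics, including probe_success (0/1)
# and probe_duration_seconds.
curl 'http://localhost:9115/probe?target=https://example.com&module=http_2xx'
```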

Alerting

The official Alertmanager is configured to send alerts to Slack and to OpsGenie for the on-call engineers. The Alertmanager configuration can be found here - GitHub.

For each system and/or cluster we keep a separate file of rules describing the conditions and severity of its alerts. This makes it easy to customize specific alerts for different clusters. You can find all the alert rules here - GitHub.
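A rule file follows this general shape (the alert name, expression, and threshold below are made-up examples, not rules from our repo); promtool can validate the file before you deploy it:

```shell
# Write a sample alert rule file and validate it with promtool.
cat > example_rules.yml <<'EOF'
groups:
  - name: example
    rules:
      - alert: InstanceDown        # hypothetical alert for illustration
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
EOF
promtool check rules example_rules.yml
```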

Storage

We are using a persistent 200 GB EBS volume. So far this has been more than enough.

Metric retention is set to the default of 15 days. In practice we have never needed metrics data older than that. If you want to adjust this strategy, please refer to Prometheus's storage guide.
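Retention is controlled by a startup flag; a minimal sketch, assuming Prometheus 2.x flag names and a standard config path:

```shell
# 15d is the default retention window; pass the flag explicitly
# if you ever want a different one.
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=15d
```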

Scenarios

Introducing changes in configuration, updating and deploying to the Prometheus main

  1. Clone the aws-infrastructure-code repository.
  2. Create a branch for your change.
  3. Adjust the Chef configuration for Prometheus in accordance with your plan.
  4. Testing: if your change is more than a simple alert-rule adjustment, use the kitchen tests to catch simple mistakes. For that you need Test Kitchen configured, and Docker Desktop and Vagrant installed on your laptop. Then run the following command from the cookbook directory:
kitchen test

(....)

  Service node_exporter
     ✔  should be running
  Service prometheus
     ✔  should be running
  Service alertmanager
     ✔  should be running
  Service consul
     ✔  should be running
  Service blackbox_exporter
     ✔  should be running
Test Summary: 5 successful, 0 failures, 0 skipped
  5. When you are satisfied with your changes, don't forget to increment the cookbook version in the metadata.
  6. Go to the Chef policies directory and run the following:
chef update

chef push production_eu-west-1 Policyfile.rb
  7. Log in to Prometheus Main and run:
chef-client -l error
  8. Once you have tested all your changes and made sure everything works as expected, push the code to the repo and create a PR so your colleagues can enjoy your work too.

Troubleshooting & Tips

Prometheus is unbearably slow; it's impossible to run a query

We have run into issues with the Prometheus UI several times. The cause was an extremely large number of metrics produced by misconfigured exporters. That's why it is very important to carefully read an exporter's documentation before deploying it to production.

How do you figure out that there is a problem with a large number of metrics?

Usually you can tell something is fishy when a lot of metrics get skipped, for instance when there are duplicates or Prometheus is simply overwhelmed (link to the Graph):


Next, you can count how many metric names Prometheus currently holds:

curl 'http://prometheus.production.fyber.com:9090/api/v1/label/__name__/values' > output.txt
tr '",' '\n' < output.txt | sort -u > results.txt
wc -l < results.txt
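Since the response is JSON, parsing it properly is less fragile than splitting on quotes and commas; a small sketch (the sample response below is fabricated for illustration, not real output from our server):

```shell
# Fabricated sample of a /api/v1/label/__name__/values response:
printf '%s' '{"status":"success","data":["up","node_load1","node_load1"]}' > output.txt
# Parse the JSON and count distinct metric names:
count=$(python3 -c "import json; print(len(set(json.load(open('output.txt'))['data'])))")
echo "$count"   # prints 2
```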

As per an official comment (check out a nice issue I've created based on this problem - issue):

Ah yeah, Prometheus is not made for hundreds of thousands of metric names - normally you'd have a couple thousand at most :)

You shouldn't have more than a couple of thousand metric names. We had something like 600k, and that was what caused the UI slowness.

Info

Worth mentioning that this did not affect the performance of the Prometheus server itself.

To solve this, figure out which exporter is the source of the huge number of metrics and fix it.

Reloading the configuration without a server restart

curl -X POST http://localhost:9090/-/reload

Delete metric series

curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~"metric_you_hate_so_much(.*)"}'
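Note that these management endpoints only respond if Prometheus was started with the matching flags: `--web.enable-lifecycle` for `/-/reload` and `--web.enable-admin-api` for the TSDB delete endpoint. Deleting a series only tombstones it; a follow-up call actually reclaims the disk space:

```shell
# Free the disk space previously used by deleted (tombstoned) series;
# requires Prometheus to run with --web.enable-admin-api.
curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'
```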
Service links: Prometheus UI, Grafana, Chef Code