
OFW Druid Architecture

Imported from Confluence

Content may be outdated. Verify before following any procedures. Last updated: November 2023

Druid components:

  • Master nodes have 2 services:
      • Coordinator processes manage data availability on the cluster.
      • Overlord processes control the assignment of data ingestion workloads.
  • Query nodes have 2 services:
      • Broker processes handle queries from external clients.
      • Router processes are optional processes that route requests to Brokers, Coordinators, and Overlords.
  • Data nodes have 2 services:
      • Historical processes store queryable data.
      • MiddleManager processes are responsible for ingesting data.
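For reference, external clients talk to the Brokers (or Routers) over HTTP. A minimal Druid SQL query is a JSON body POSTed to the Broker's /druid/v2/sql/ endpoint; the datasource name below is a hypothetical example:

```json
{
  "query": "SELECT COUNT(*) AS cnt FROM \"events\" WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR"
}
```

The Broker fans the query out to the Historicals (and MiddleManagers for realtime data) that serve the relevant segments and merges their partial results.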

Dependencies:

  • Deep storage - stores all data that has been ingested into the system; Druid uses it as a backup and as a way to transfer data in the background between Druid processes. Druid never needs to access deep storage during a query.
  • Metadata storage - holds various shared system metadata such as segment usage information and task information. We use AWS RDS MySQL for it.
  • Zookeeper - used for internal service discovery, coordination, and leader election. We use the Exhibitor distribution.
  • Memcached - level 2 cache, used as a shared cache for Historicals and Brokers in addition to the local caches on every node. We use AWS ElastiCache here.
  • Consul - we use it in the Druid deployment for the following:
      • Discovering Zookeeper nodes.
      • Every Druid parameter can be configured to be propagated from the Consul KV store. This is useful during initial setup, when you make many changes while tuning the cluster. After tuning is done, this approach is not recommended, because changing a parameter value in Consul triggers a service restart on all nodes at almost the same time, which is dangerous. Parameters are written to Consul from Terraform code.
      • Health checks - all services are registered in Consul with health checks, so you can see their statuses under Consul - Services.
      • Consul can be used to discover service endpoints by DNS name if you want to send a request directly, bypassing the Load Balancer. For example: imply-query-production-1.service.consul instead of the Load Balancer http://imply.prd-aws.fyber.com:8080. This is useful for troubleshooting if you suspect problems with the load balancer.
      • We use it to discover NodeExporters and automatically add metric scraping targets to Prometheus.
  • Prometheus - metric collection service.
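Tying the first four dependencies together, a sketch of the relevant common.runtime.properties entries is below. The property names are standard Druid configuration keys; the bucket, host, and endpoint values are placeholders (in our setup the actual values are propagated from the Consul KV store via Terraform, as described above):

```properties
# Deep storage on S3 (bucket and prefix are placeholders)
druid.storage.type=s3
druid.storage.bucket=my-druid-deep-storage
druid.storage.baseKey=druid/segments

# Metadata storage on AWS RDS MySQL (host is a placeholder)
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://my-rds-host:3306/druid
druid.metadata.storage.connector.user=druid

# Zookeeper ensemble used for discovery, coordination, and leader election
druid.zk.service.host=zk1.example:2181,zk2.example:2181,zk3.example:2181

# Memcached (ElastiCache) as the shared level 2 cache
druid.cache.type=memcached
druid.cache.hosts=my-elasticache-endpoint:11211
```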
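The NodeExporter discovery mentioned above is done with Prometheus's Consul service discovery. A minimal scrape-config sketch is below; the Consul service name and server address are assumptions, not our exact values:

```yaml
scrape_configs:
  - job_name: node-exporter
    consul_sd_configs:
      - server: localhost:8500        # local Consul agent (placeholder address)
        services: [node-exporter]     # Consul service name is a placeholder
    relabel_configs:
      # Label each target with the Consul node name instead of ip:port
      - source_labels: [__meta_consul_node]
        target_label: instance
```

With this in place, registering a new node in Consul automatically adds it as a scrape target, with no Prometheus config change.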