OFW Druid Upgrade¶
Archived (pre-2022)
Preserved for reference only -- likely outdated. Last updated: August 2021
To upgrade Apache Druid to the next version, perform the following steps.
Pre-requirements and links¶
- Download the Druid archive for the target version from - Druid
- Upload the archive to the S3 bucket as apache-druid-#{druid_version}-bin.tar.gz
- Create an RDS snapshot of the Druid metadata database
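The pre-requirement steps above can be sketched with the AWS CLI. This is a sketch only: the S3 bucket name, RDS instance identifier, and snapshot name below are placeholders, not our real values.

```shell
# Sketch of the pre-upgrade steps. Bucket, DB instance identifier, and
# snapshot name are placeholders -- substitute the real values.
set -euo pipefail

DRUID_VERSION="0.21.1"
ARCHIVE="apache-druid-${DRUID_VERSION}-bin.tar.gz"

# run() only echoes the command when DRY_RUN=1 (the default), so the sketch
# can be exercised without AWS credentials or network access.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# 1. Download the release archive from the Apache archive mirror
run curl -fSLO "https://archive.apache.org/dist/druid/${DRUID_VERSION}/${ARCHIVE}"

# 2. Upload it to the S3 bucket the cookbook downloads from (bucket is a placeholder)
run aws s3 cp "${ARCHIVE}" "s3://druid-artifacts/${ARCHIVE}"

# 3. Snapshot the RDS metadata store before touching the cluster
#    (RDS snapshot identifiers may not contain dots, hence the substitution)
run aws rds create-db-snapshot \
    --db-instance-identifier druid-metadata \
    --db-snapshot-identifier "druid-pre-upgrade-${DRUID_VERSION//./-}"
```

Flip DRY_RUN to 0 only after the placeholders are replaced with the real bucket and instance names.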
Testing¶
- Before running the Druid cookbook, run the ZooKeeper kitchen converge first in the cookbook - fyber_core_exhibitor
- Check that ZooKeeper is registered in Consul - Consul - Services
- Update the Druid version for kitchen - default.rb (Github)
- Run kitchen converge to spin up a Druid cluster on an EC2 instance for testing purposes (tips for fixing common issues below)
- To run the EC2 instance with started services, set the key/value to packer: false; to test the image without started services, leave packer: true
- Update kitchen.yml with the correct path to the SSH key, e.g.

      transport:
        connection_timeout: 10
        connection_retries: 5
        username: ubuntu
        ssh_key: /Users/mguk/.ssh/fyber_core_prod
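For the packer toggle mentioned above, a sketch of where the attribute could sit in kitchen.yml; the suite name and attribute nesting here are assumptions, only the packer flag itself comes from the step above:

```yaml
# kitchen.yml (sketch) -- suite name and attribute path are assumptions
suites:
  - name: druid
    attributes:
      packer: false   # false = start services on converge; true = bake-only image test
```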
- Run chef install to generate Policyfile.lock.json
- Check that the Druid services are registered in Consul - Consul - Services
- Check that the Druid services are running properly on the EC2 instance (systemctl/journalctl/logs)
- Check the Druid UI via the following Consul URL - druid-master-test-1.service.core-production-1.consul:8080
- Check the Druid version in kitchen -
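The version check can also be done without opening the UI: Druid's /status endpoint reports the running version as JSON. The helper below is a sketch; the hostname in the commented curl comes from the Consul URL above.

```shell
# Extract the "version" field from the JSON that Druid's /status endpoint returns.
extract_version() { python3 -c 'import json,sys; print(json.load(sys.stdin)["version"])'; }

# Against the live test node (hostname from the Consul check above):
#   curl -s "http://druid-master-test-1.service.core-production-1.consul:8080/status" | extract_version

# Offline example with a canned response:
echo '{"version":"0.21.1","modules":[]}' | extract_version   # prints 0.21.1
```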
Upgrade with Chef¶
- Update the Druid version in the Chef Policyfile.rb - Policyfile.rb (Github)
- Generate Policyfile.lock.json with chef install/chef update and push your changes with chef push
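The two bullets above, in order, as a sketch; the policy group name passed to chef push is a placeholder, not necessarily ours:

```shell
# run() echoes instead of executing (DRY_RUN=1 default), so the sketch is safe to test.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# 1. After bumping the druid version in Policyfile.rb, re-resolve the lock file
run chef update Policyfile.rb

# 2. Push the updated policy to the policy group (group name is a placeholder)
run chef push production Policyfile.lock.json
```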
Upgrade with Chef/Packer¶
- Create a new AMI for Druid (from the aws-infrastructure-code repo)
aws-infrastructure-code
> ./scripts/packer/packer_chef_zero.sh -p druid_cluster_1 -c druid --update-chef yes --skip-packer no
- After the AMI is created, Consul will be updated with the new AMI ID - Consul - Edit
- The Spotinst agent service was moved to user-data due to a bug in the installation script: druid_userdata.sh.tmpl (Github)
Upgrade with Terraform¶
- Apply Terraform to all Druid services; the new AMI will be taken from Consul
aws-infrastructure-code/terraform/states/imply_cluster_1
> bundle exec rake "terraform:plan_and_apply[imply_cluster_1,production-eu-west-1]"
Services Rollout¶
Roll out service updates according to Rolling Updates.
The best way to roll out the services is:
- Roll out all MiddleManagers via Spot.io
- Add the same number of Historical servers and wait until all segments are rebalanced and replicated (depends on the amount of data; 12 hours last time). Then remove the old servers one by one and monitor that datasource availability stays full - Druid - Unified Console
- Roll out the Broker and Master servers via Spot.io
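The "availability is full" check during the Historical swap can be scripted against the Coordinator's /druid/coordinator/v1/loadstatus API, which reports the percentage of segments loaded per datasource. This is a sketch; the coordinator hostname in the commented loop is a placeholder.

```shell
# all_loaded succeeds only when every datasource in the Coordinator's
# /druid/coordinator/v1/loadstatus response reports 100% segment availability.
all_loaded() {
  python3 -c '
import json, sys
status = json.load(sys.stdin)
sys.exit(0 if status and all(v == 100.0 for v in status.values()) else 1)'
}

# Poll before removing each old Historical (coordinator host is a placeholder):
#   until curl -s "http://druid-master-1:8081/druid/coordinator/v1/loadstatus" | all_loaded; do
#     sleep 60
#   done

# Offline example with a canned response:
echo '{"events":100.0,"sessions":100.0}' | all_loaded && echo "safe to remove the next server"
```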
Change list for 0.21.0 upgrade¶
- The most detailed list of changes for Druid can be found here: druid (Github). We upgraded from 0.18.1 to 0.21.1.
- Each release brings around 200 new features, bug fixes, performance enhancements, etc. A condensed list:
- Druid native batch support for Apache Avro Object Container Files
- New in Druid 0.19.0, native batch indexing now supports Apache Avro Object Container Format encoded files, allowing batch ingestion of Avro data without needing an external Hadoop cluster.
- Updated Druid native batch support for SQL databases
- The SQL input source is used to read data directly from RDBMS
- Apache Ranger based authorization
- A new extension in Druid 0.19.0 adds an Authorizer which implements access control for Druid
- REGEXP_LIKE
- A new REGEXP_LIKE function has been added to Druid SQL and native expressions, which behaves similar to LIKE, except using regular expressions for the pattern.
- Web console lookup management improvements
- The Druid 0.19 web console also includes some useful improvements to the lookup table management interface. Creating and editing lookups is now done with a form that accepts user input, rather than a raw text editor for entering the JSON spec.
- Combining InputSource - allowing the user to combine multiple input sources during ingestion
- Automatically determine numShards for parallel ingestion hash partitioning
- New metrics for ingestion
- Support for all partitioning schemes for auto-compaction
- A partitioning spec can now be defined for auto-compaction, allowing users to repartition their data at compaction time. Please see the documentation for the new partitionsSpec property in the compaction tuningConfig for more details:
- Query segment pruning with hash partitioning
- Vectorization support for expression virtual columns
- More vectorization support for aggregators
- offset parameter for GroupBy and Scan queries - It is now possible to set an offset parameter for GroupBy and Scan queries, which tells Druid to skip a number of rows when returning results
- OFFSET clause for SQL queries
- Substring search operators - 2.5x performance improvement in some cases by using these functions instead of STRPOS
- Druid SQL queries now support the UNION ALL operator, which fuses the results of multiple queries together
- Improved retention rules UI
- The retention rules UI in the web console has been improved. It now provides suggestions and basic validation in the period dropdown, shows the cluster default rules, and makes editing the default rules more accessible.
- Redis cache extension enhancements
- ZOOKEEPER DEPRECATION! - we still use it, but we plan to test how to remove it from our deployment
- Service discovery and leader election based on Kubernetes - Druid is actively adding features for Kubernetes deployments!
- New grouping aggregator function - You can use the new grouping aggregator SQL function with GROUPING SETS or CUBE to indicate which grouping dimensions are included in the current grouping set
- Improved missing argument handling in expressions and functions - Expression processing can now be vectorized when inputs are missing, for example a non-existent column. When an argument is missing in an expression, Druid can now infer the proper result type based on the non-null arguments. For instance, for longColumn + nonExistentColumn, nonExistentColumn is treated as (long) 0 instead of (double) 0.0. Finally, in default null handling mode, math functions can produce output properly by treating missing arguments as zeros.
- Allow zero period for TIMESTAMPADD - TIMESTAMPADD function now allows zero period. This functionality is required for some BI tools such as Tableau.
- Native parallel ingestion no longer requires explicit intervals - Parallel task no longer requires you to set explicit intervals in granularitySpec. If intervals are missing, the parallel task executes an extra step for input sampling which collects the intervals to index.
- Old Kafka version support
- Multi-phase segment merge for native batch ingestion - A new tuningConfig, maxColumnsToMerge, controls how many segments can be merged at the same time in the task. This configuration can be useful to avoid high memory pressure during the merge.
- Native re-ingestion is less memory intensive
- Updated and improved web console styles - check it out at druid.prd-aws.fyber.com
- WebUI - Partitioning information is available in the web console