OFW Druid Upgrade¶
Archived (pre-2022)
Preserved for reference only -- likely outdated. Last updated: August 2021
To upgrade Apache Druid to the next version, perform the following steps.
Pre-requirements and links¶
- Download the Druid archive for the target version from - Druid
- Upload the archive to the S3 bucket as apache-druid-#{druid_version}-bin.tar.gz
- Create an RDS snapshot of the Druid metadata database
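The pre-requirement steps above can be sketched with the AWS CLI. This is a sketch only: the S3 bucket name, RDS instance identifier, and snapshot name below are placeholders, not our real values.

```shell
# Sketch of the pre-upgrade steps. Bucket, DB instance identifier, and
# snapshot name are placeholders -- substitute the real values.
set -euo pipefail

DRUID_VERSION="0.21.1"
ARCHIVE="apache-druid-${DRUID_VERSION}-bin.tar.gz"

# run() only echoes the command when DRY_RUN=1 (the default), so the sketch
# can be exercised without AWS credentials or network access.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# 1. Download the release archive from the Apache archive mirror
run curl -fSLO "https://archive.apache.org/dist/druid/${DRUID_VERSION}/${ARCHIVE}"

# 2. Upload it to the S3 bucket the cookbook downloads from (bucket is a placeholder)
run aws s3 cp "${ARCHIVE}" "s3://druid-artifacts/${ARCHIVE}"

# 3. Snapshot the RDS metadata store before touching the cluster
#    (RDS snapshot identifiers may not contain dots, hence the substitution)
run aws rds create-db-snapshot \
    --db-instance-identifier druid-metadata \
    --db-snapshot-identifier "druid-pre-upgrade-${DRUID_VERSION//./-}"
```

Flip DRY_RUN to 0 only after the placeholders are replaced with the real bucket and instance names.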
Testing¶
- Before running the Druid cookbook, run the ZooKeeper kitchen converge first in the cookbook - fyber_core_exhibitor
- Check that ZooKeeper is registered in Consul - Consul - Services
- Update the Druid version for kitchen - default.rb (Github)
- Run kitchen converge to spin up a Druid cluster on an EC2 instance for testing purposes (tips for fixing common issues below)
- To run the EC2 instance with started services, set the key/value to packer: false; to test the image without started services, leave packer: true
- Update kitchen.yml with the correct path to the SSH key, e.g.

      transport:
        connection_timeout: 10
        connection_retries: 5
        username: ubuntu
        ssh_key: /Users/mguk/.ssh/fyber_core_prod
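For the packer toggle mentioned above, a sketch of where the attribute could sit in kitchen.yml; the suite name and attribute nesting here are assumptions, only the packer flag itself comes from the step above:

```yaml
# kitchen.yml (sketch) -- suite name and attribute path are assumptions
suites:
  - name: druid
    attributes:
      packer: false   # false = start services on converge; true = bake-only image test
```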
- Run chef install to generate Policyfile.lock.json
- Check that the Druid services are registered in Consul - Consul - Services
- Check that the Druid services are running properly on the EC2 instance (systemctl/journalctl/logs)
- Check the Druid UI via the following Consul URL - druid-master-test-1.service.core-production-1.consul:8080
- Check the Druid version in kitchen -
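The version check can also be done without opening the UI: Druid's /status endpoint reports the running version as JSON. The helper below is a sketch; the hostname in the commented curl comes from the Consul URL above.

```shell
# Extract the "version" field from the JSON that Druid's /status endpoint returns.
extract_version() { python3 -c 'import json,sys; print(json.load(sys.stdin)["version"])'; }

# Against the live test node (hostname from the Consul check above):
#   curl -s "http://druid-master-test-1.service.core-production-1.consul:8080/status" | extract_version

# Offline example with a canned response:
echo '{"version":"0.21.1","modules":[]}' | extract_version   # prints 0.21.1
```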
Upgrade with Chef¶
- Update the Druid version in the Chef Policyfile.rb - Policyfile.rb (Github)
- Generate Policyfile.lock.json with chef install/chef update and push your changes with chef push
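The two bullets above, in order, as a sketch; the policy group name passed to chef push is a placeholder, not necessarily ours:

```shell
# run() echoes instead of executing (DRY_RUN=1 default), so the sketch is safe to test.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# 1. After bumping the druid version in Policyfile.rb, re-resolve the lock file
run chef update Policyfile.rb

# 2. Push the updated policy to the policy group (group name is a placeholder)
run chef push production Policyfile.lock.json
```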
Upgrade with Chef/Packer¶
- Create a new AMI for Druid (from the aws-infrastructure-code repo)
aws-infrastructure-code
> ./scripts/packer/packer_chef_zero.sh -p druid_cluster_1 -c druid --update-chef yes --skip-packer no
- After the AMI is created, Consul will be updated with the new AMI ID - Consul - Edit
- The Spotinst agent service was moved to user-data due to a bug in the installation script: druid_userdata.sh.tmpl (Github)
Upgrade with Terraform¶
- Apply Terraform to all Druid services; the new AMI will be taken from Consul
aws-infrastructure-code/terraform/states/imply_cluster_1
> bundle exec rake "terraform:plan_and_apply[imply_cluster_1,production-eu-west-1]"
Services Rollout¶
Roll out service updates according to Rolling Updates.
The best way to roll out the services is:
- Roll out all MiddleManagers via Spot.io
- Add the same number of Historical servers and wait until all segments are rebalanced and replicated (depends on the amount of data; 12 hours last time). Then remove the old servers one by one and monitor that datasource availability stays full - Druid - Unified Console
- Roll out the Broker and Master servers via Spot.io
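The "availability is full" check during the Historical swap can be scripted against the Coordinator's /druid/coordinator/v1/loadstatus API, which reports the percentage of segments loaded per datasource. This is a sketch; the coordinator hostname in the commented loop is a placeholder.

```shell
# all_loaded succeeds only when every datasource in the Coordinator's
# /druid/coordinator/v1/loadstatus response reports 100% segment availability.
all_loaded() {
  python3 -c '
import json, sys
status = json.load(sys.stdin)
sys.exit(0 if status and all(v == 100.0 for v in status.values()) else 1)'
}

# Poll before removing each old Historical (coordinator host is a placeholder):
#   until curl -s "http://druid-master-1:8081/druid/coordinator/v1/loadstatus" | all_loaded; do
#     sleep 60
#   done

# Offline example with a canned response:
echo '{"events":100.0,"sessions":100.0}' | all_loaded && echo "safe to remove the next server"
```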
Change list for 0.21.0 upgrade¶
- The most detailed list of changes for Druid can be found here: druid (Github). We upgraded from 0.18.1 to 0.21.1.
- Each release brings around 200 new features, bug fixes, performance enhancements, etc. A condensed list:
- Druid native batch support for Apache Avro Object Container Files
- New in Druid 0.19.0, native batch indexing now supports Apache Avro Object Container Format encoded files, allowing batch ingestion of Avro data without needing an external Hadoop cluster.
- Updated Druid native batch support for SQL databases
- The SQL input source is used to read data directly from RDBMS
- Apache Ranger based authorization
- A new extension in Druid 0.19.0 adds an Authorizer which implements access control for Druid
- REGEXP_LIKE
- A new REGEXP_LIKE function has been added to Druid SQL and native expressions, which behaves similar to LIKE, except using regular expressions for the pattern.
- Web console lookup management improvements
- The Druid 0.19 web console also includes some useful improvements to the lookup table management interface. Creating and editing lookups is now done with a form that accepts user input, rather than a raw text editor for entering the JSON spec.
- Combining InputSource - allowing the user to combine multiple input sources during ingestion
- Automatically determine numShards for parallel ingestion hash partitioning
- New metrics for ingestion
- Support for all partitioning schemes for auto-compaction
- A partitioning spec can now be defined for auto-compaction, allowing users to repartition their data at compaction time. Please see the documentation for the new partitionsSpec property in the compaction tuningConfig for more details:
- Query segment pruning with hash partitioning
- Vectorization support for expression virtual columns
- More vectorization support for aggregators
- offset parameter for GroupBy and Scan queries - It is now possible to set an offset parameter for GroupBy and Scan queries, which tells Druid to skip a number of rows when returning results
- OFFSET clause for SQL queries
- Substring search operators - 2.5x performance improvement in some cases by using these functions instead of STRPOS
- Druid SQL queries now support the UNION ALL operator, which fuses the results of multiple queries together
- Improved retention rules UI
- The retention rules UI in the web console has been improved. It now provides suggestions and basic validation in the period dropdown, shows the cluster default rules, and makes editing the default rules more accessible.
- Redis cache extension enhancements
- ZOOKEEPER DEPRECATION! - we still use it, but we plan to test how to remove it from our deployment
- Service discovery and leader election based on Kubernetes - Druid is actively adding features for Kubernetes deployments!
- New grouping aggregator function - You can use the new grouping aggregator SQL function with GROUPING SETS or CUBE to indicate which grouping dimensions are included in the current grouping set
- Improved missing argument handling in expressions and functions - Expression processing can now be vectorized when inputs are missing, for example a non-existent column. When an argument is missing in an expression, Druid can now infer the proper result type based on the non-null arguments. For instance, for longColumn + nonExistentColumn, nonExistentColumn is treated as (long) 0 instead of (double) 0.0. Finally, in default null handling mode, math functions can produce output properly by treating missing arguments as zeros.
- Allow zero period for TIMESTAMPADD - TIMESTAMPADD function now allows zero period. This functionality is required for some BI tools such as Tableau.
- Native parallel ingestion no longer requires explicit intervals - Parallel task no longer requires you to set explicit intervals in granularitySpec. If intervals are missing, the parallel task executes an extra step for input sampling which collects the intervals to index.
- Old Kafka version support
- Multi-phase segment merge for native batch ingestion - A new tuningConfig, maxColumnsToMerge, controls how many segments can be merged at the same time in the task. This configuration can be useful to avoid high memory pressure during the merge.
- Native re-ingestion is less memory intensive
- Updated and improved web console styles - check it out at druid.prd-aws.fyber.com
- WebUI - Partitioning information is available in the web console