Expand Druid historicals data disks¶
Imported from Confluence
Content may be outdated. Verify before following any procedures. Last updated: January 2026
Expand GKE Druid disk¶
PR example: DEVOPSBLN-6911
In GKE, Druid is deployed via a Helm chart with a StatefulSet, so the following steps are needed to perform a zero-downtime disk extension:
1. Extend PVC volumes manually using kubectl¶
for pvc in $(kubectl -n druid get pvc -o name | grep -i historical); do
kubectl -n druid patch "$pvc" -p '{"spec":{"resources":{"requests":{"storage":"250Gi"}}}}'
done
2. Delete the STS while leaving orphaned pods running¶
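A sketch of this step using kubectl's orphan cascade mode; the StatefulSet name below is an assumption, check yours with `kubectl -n druid get sts`:

```shell
# --cascade=orphan removes only the StatefulSet object; its pods (and their
# PVCs) keep running, so serving continues until Helm recreates the STS
# with the new storage size in the next step.
kubectl -n druid delete sts druid-historical-default --cascade=orphan
```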
3. Update helm values to match PVC size and run apply¶
4. Validation of change¶
k df-pv -n druid   # k = kubectl; df-pv is the krew "df-pv" plugin
PV NAME PVC NAME NAMESPACE NODE NAME POD NAME VOLUME MOUNT NAME SIZE USED AVAILABLE %USED IUSED IFREE %IUSED
pvc-1e317538-e9eb-46e1-851b-dcfe6958cbda data-druid-historical-default-3 druid gke-gke-core-dsp-pro-ccc-druid-histor-1d1fb832-rnlr druid-historical-default-3 data 245Gi 157Gi 88Gi 64.15 15710 16368290 0.10
pvc-08243eb4-a28e-49b9-8979-c0837f5afefb data-druid-historical-default-4 druid gke-gke-core-dsp-pro-ccc-druid-histor-1d1fb832-b6bc druid-historical-default-4 data 245Gi 167Gi 78Gi 68.21 17159 16366841 0.10
pvc-70b88639-28fd-4f4e-921a-bb78baafc9cf data-druid-historical-default-0 druid gke-gke-core-dsp-pro-ccc-druid-histor-1d1fb832-d2bs druid-historical-default-0 data
....
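The new size can also be confirmed straight from the PVC objects, with plain kubectl and no plugin:

```shell
# CAPACITY should show the new size (250Gi) once the filesystem resize is done.
kubectl -n druid get pvc | grep -i historical
```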
Info
Based on DEVOPSBLN-2121
1. Extending current EBS volumes¶
Volume expansion cannot be done via Terraform itself, but the new size must still be updated in Terraform to keep the "code = real state" policy. First check the previous expansion size via the AWS console or in the code (ideally both):
~/repos/aws-infrastructure-code/ [master] grep -r 1200 . | grep -i historical
./terraform/states/imply_cluster_1/production-eu-west-1.tfvars:historical_spotinst_group_storage_volume_size = 1200
So the previous size was 1200 GiB; now we can fetch the EBS VolumeIds:
~/repos/aws-infrastructure-code/ [master] OLDSIZE=1200; aws ec2 describe-volumes --region=eu-west-1 --filters Name=size,Values=$OLDSIZE Name=tag:Name,Values=druid-historical-production-1 | grep -o "VolumeId.*" | sort -u | cut -d'"' -f3
vol-032bf923c88f5015d
vol-051d82227ae27caf8
vol-05560a586ef03cc04
vol-06720b509a4e2a73b
vol-06ec856762fe71f4d
vol-079941aab34777b51
vol-08343a71dd2186286
vol-0b3b4f1e235d668cb
vol-0cd0c939fe0e2e2da
vol-0d8421e8e00b71820
vol-0df8e1c7a7ea2c6bd
vol-0e13424534757862b
vol-0ed3e946492c20233
vol-0edcc443c333395af
vol-0fd16340b0b534636
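The grep/sort/cut pipeline above is just extracting the VolumeId values from the describe-volumes JSON; its behaviour can be reproduced locally on a trimmed sample payload:

```shell
# Trimmed sample of the "aws ec2 describe-volumes" JSON output shape.
payload='            "VolumeId": "vol-032bf923c88f5015d",
            "VolumeId": "vol-051d82227ae27caf8",'

# Same extraction as the runbook command: keep everything from "VolumeId",
# de-duplicate, then take the third double-quote-delimited field (the ID).
echo "$payload" | grep -o "VolumeId.*" | sort -u | cut -d'"' -f3
```

This prints one bare volume ID per line, which is exactly the shape the modify-volume loop below consumes.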
And finally try the update (note the --dry-run option):
~/repos/aws-infrastructure-code/ [master] OLDSIZE=1200; NEWSIZE=1500; REGION=eu-west-1; for ebs in $(aws ec2 describe-volumes --region=$REGION --filters Name=size,Values=$OLDSIZE Name=tag:Name,Values=druid-historical-production-1 | grep -o "VolumeId.*" | sort -u | cut -d'"' -f3); do aws ec2 modify-volume --region=$REGION --volume-id $ebs --size $NEWSIZE --dry-run; done
An error occurred (DryRunOperation) when calling the ModifyVolume operation: Request would have succeeded, but DryRun flag is set.
<...>
Final application:
~/repos/aws-infrastructure-code/ [master] OLDSIZE=1200; NEWSIZE=1500; REGION=eu-west-1; for ebs in $(aws ec2 describe-volumes --region=$REGION --filters Name=size,Values=$OLDSIZE Name=tag:Name,Values=druid-historical-production-1 | grep -o "VolumeId.*" | sort -u | cut -d'"' -f3); do aws ec2 modify-volume --region=$REGION --volume-id $ebs --size $NEWSIZE &>/dev/null; done
And the result:
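The expansion progress can also be watched from the CLI; modify-volume is asynchronous, and volumes go through modifying -> optimizing -> completed states:

```shell
# Lists volumes whose modification is still in flight.
aws ec2 describe-volumes-modifications --region=eu-west-1 \
  --filters Name=modification-state,Values=modifying,optimizing
```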
2. Updating spot.io¶
PR example: aws-infrastructure-code (GitHub)

Applying the code:
~/repos/aws-infrastructure-code/ [DEVOPSBLN-2121] bundle exec rake 'terraform:plan_and_apply[imply_cluster_1,production-eu-west-1]'
3. Updating Druid itself (consul key + default value)¶
PR example: aws-infrastructure-code (GitHub)
Info
Why update the default? To prevent new instances from restarting right after the initial start because the default value differs from what is in Consul.
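For illustration only, a sketch of updating the Consul key; the key path here is an assumption, check the real one in the PR and in data.rb:

```shell
# Hypothetical key path; verify the real path before running.
consul kv put apps/druid/historical/storage_volume_size 1500
```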

4. Rebuilding image¶
Info
Why rebuild? We are preparing an image whose baked-in config equals the Consul template, so that if Consul is down or unreachable, new instances are not spawned with an outdated (not-synced-with-Consul) config in the AMI.
~/repos/aws-infrastructure-code/ [DEVOPSBLN-2121] ./scripts/packer/packer_chef_zero.sh -p druid_cluster_1 -c druid --update-chef yes --skip-packer no
Applying the code:
~/repos/aws-infrastructure-code/ [DEVOPSBLN-2121] bundle exec rake 'terraform:plan_and_apply[imply_cluster_1,production-eu-west-1]'
Checking results:
Notes¶
- On a Druid instance, the template can be found at /opt/druid/conf/historical/runtime.properties.tmpl
- Check aws-infrastructure-code/cookbooks/fyber_apache_druid/attributes/data.rb to verify whether the value comes from Consul
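For illustration, a consul-template line inside runtime.properties.tmpl typically renders a Consul key into the corresponding Druid property; both the key path and the path/size values below are assumptions, not taken from the real template:

```
# Hypothetical fragment of runtime.properties.tmpl (consul-template syntax).
druid.segmentCache.locations=[{"path":"/opt/druid/var/segments","maxSize":{{ key "apps/druid/historical/segment_cache_max_size" }}}]
```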
5. Possible issues¶
As we can see from the picture below, some historical servers are fine (~60% disk used) while others still have >80% used, which is not great given we just added more space.
To fix this, restart druid-coordinator (the service in charge of segment <-> S3 management/balancing):
~/ ssh ubuntu@druid-master-production-1.service.core-production-1.consul
<...>
ubuntu@ip-10-37-134-2:~$ sudo systemctl restart druid-coordinator.service
ubuntu@ip-10-37-134-2:~$ systemctl status druid-coordinator.service
● druid-coordinator.service - Druid coordinator
Loaded: loaded (/etc/systemd/system/druid-coordinator.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2021-09-14 15:08:07 UTC; 5s ago
Main PID: 598 (java)
Tasks: 74 (limit: 4915)
CGroup: /system.slice/druid-coordinator.service
└─598 /usr/bin/java -server -XX:+UseG1GC -XX:+ExitOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplication
Sep 14 15:08:07 ip-10-37-134-2 systemd[1]: Started Druid coordinator.
Sep 14 15:08:08 ip-10-37-134-2 run-druid[598]: 2021-09-14 15:08:08,836 main INFO Registered Log4j as the java.util.logging.LogManager.
Check the Services -> "Details" field for the ongoing processing status ("segments to drop/load" means it is rebalancing):
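The same progress can be polled from the coordinator API; /druid/coordinator/v1/loadstatus is a standard Druid endpoint (8081 is Druid's default coordinator port, adjust if yours differs):

```shell
# Shows the percentage of segments loaded per datasource; 100.0 everywhere
# means the rebalancing triggered by the restart has finished.
curl -s http://druid-master-production-1.service.core-production-1.consul:8081/druid/coordinator/v1/loadstatus
```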
And later verify that everything is fixed (approximately the same usage percentage on all servers):