Enable Autoscaling for Indexing Task

Hello,

I want to enable autoscaling for indexing tasks. I have set the configuration in the Overlord properties file, and I have also made a POST call to /druid/indexer/v1/worker with the autoscaling config.

But I am not sure how it will launch the instances. It must need an access key and ID to launch them. Also, how will Druid be set up on the new instances? Is that done by creating a custom AMI?

Thanks,

Navneet

Hi Navneet,

Although I have not personally tried out the built-in autoscaler, the following links might help you figure out how to set it up.

The EC2 autoscaler takes the following config object:

https://github.com/druid-io/druid/blob/master/indexing-service/src/main/java/io/druid/indexing/overlord/autoscaling/ec2/EC2EnvironmentConfig.java

The config subdivides into “nodeData” and “userData”. You can see which config settings can be set within these subsections here:

https://github.com/druid-io/druid/blob/master/indexing-service/src/main/java/io/druid/indexing/overlord/autoscaling/ec2/EC2NodeData.java

https://github.com/druid-io/druid/blob/master/indexing-service/src/main/java/io/druid/indexing/overlord/autoscaling/ec2/EC2UserData.java

https://github.com/druid-io/druid/blob/master/indexing-service/src/main/java/io/druid/indexing/overlord/autoscaling/ec2/StringEC2UserData.java
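
To make the shape of that config a bit more concrete, here is a rough Python sketch that assembles an autoscaler config and wraps it in a POST request for /druid/indexer/v1/worker. Treat it as a sketch under assumptions, not a tested setup: the field names are how I read the classes linked above, so please verify them against the Druid version you run. The AMI ID, security group, key name, startup script and Overlord URL are all made-up placeholders.

```python
import json
import urllib.request


def build_worker_config(ami_id, instance_type, security_group_ids, key_name,
                        startup_script, min_workers=0, max_workers=4,
                        availability_zone="us-east-1a"):
    """Assemble a worker config dict with an EC2 autoscaler section.

    Field names follow my reading of EC2EnvironmentConfig, EC2NodeData and
    StringEC2UserData; check them against your Druid version.
    """
    return {
        "autoScaler": {
            "type": "ec2",
            "minNumWorkers": min_workers,
            "maxNumWorkers": max_workers,
            "envConfig": {
                "availabilityZone": availability_zone,
                # nodeData: which machine image and instance type to launch
                "nodeData": {
                    "amiId": ami_id,
                    "instanceType": instance_type,
                    "minInstances": 1,
                    "maxInstances": 1,
                    "securityGroupIds": list(security_group_ids),
                    "keyName": key_name,
                },
                # userData: a script handed to the new instance at boot
                "userData": {
                    "impl": "string",
                    "data": startup_script,
                    "versionReplacementString": ":VERSION:",
                    "version": None,
                },
            },
        }
    }


def build_post_request(overlord_url, config):
    """Build (but do not send) the POST request carrying the config."""
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url=overlord_url.rstrip("/") + "/druid/indexer/v1/worker",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# All values below are placeholders for illustration only.
config = build_worker_config(
    ami_id="ami-00000000",
    instance_type="m4.large",
    security_group_ids=["sg-00000000"],
    key_name="my-key-pair",
    startup_script="#!/bin/bash\n# start the middle manager here\n",
)
request = build_post_request("http://overlord.example.com:8090", config)
# To actually send it: urllib.request.urlopen(request)
```

The request is only built here, not sent, so you can print request.data and inspect the JSON before pointing it at a real Overlord.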

Personally, I did not like having the autoscaling logic hardcoded within the Druid codebase, so I developed an extension that delegates Druid’s autoscaling decisions to a simple external web service. I hope to get clearance from the company I work for to contribute this extension soon.

This way, you could, for instance, quickly develop a Python web app that takes care of provisioning and terminating clusters in an environment of your choice. If you are on AWS, you could implement the autoscaling web service with Lambda functions and Amazon API Gateway. Cloud providers that offer cluster solutions usually have an easy-to-use API, such as boto3 for Amazon EMR, the Databricks API, or the Qubole API.
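
To give an idea of how small such a web service can be, here is a minimal Python sketch using only the standard library. Since the extension is not published, everything about the protocol is an assumption made up for illustration: the /provision and /terminate paths, the "count" and "workerIds" JSON fields, and the in-memory backend that stands in for real calls to a cloud API.

```python
import json
import uuid
from http.server import BaseHTTPRequestHandler, HTTPServer


class InMemoryBackend:
    """Hypothetical stand-in for a real provisioning backend."""

    def __init__(self):
        self.workers = set()

    def provision(self, count):
        # A real backend would launch instances here and return their IDs.
        new_ids = ["worker-" + uuid.uuid4().hex[:8] for _ in range(count)]
        self.workers.update(new_ids)
        return new_ids

    def terminate(self, worker_ids):
        # Only IDs this backend knows about are reported as terminated.
        removed = [w for w in worker_ids if w in self.workers]
        self.workers.difference_update(removed)
        return removed


def handle_scaling_request(backend, path, payload):
    """Route a decoded JSON payload; returns (status code, response dict)."""
    if path == "/provision":
        ids = backend.provision(int(payload.get("count", 1)))
        return 200, {"workerIds": ids}
    if path == "/terminate":
        ids = backend.terminate(payload.get("workerIds", []))
        return 200, {"workerIds": ids}
    return 404, {"error": "unknown path"}


def make_handler(backend):
    """Create an HTTP handler class bound to the given backend."""

    class ScalingHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length) or b"{}")
            status, body = handle_scaling_request(backend, self.path, payload)
            data = json.dumps(body).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    return ScalingHandler


def serve(port=8080):
    """Run the service locally until interrupted."""
    server = HTTPServer(("127.0.0.1", port), make_handler(InMemoryBackend()))
    server.serve_forever()
```

Calling serve() starts the service on localhost. Swapping InMemoryBackend for a class with the same two methods that talks to your cloud provider would be the only change needed to make it do real work.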