Can't configure Druid EC2 autoscaling

Hi guys!

I'm new to Druid and I have the task of deploying a Druid cluster with EC2 autoscaling.

Right now I have 3 EC2 instances for master, data, and query. In the Druid console I found the overlord dynamic config, which can be configured as described here: https://druid.apache.org/docs/latest/configuration/index.html#overlord-dynamic-configuration

So, following this documentation, I've inserted the following config into the Auto Scaler field:
{
  "type": "ec2",
  "minNumWorkers": 1,
  "maxNumWorkers": 12,
  "envConfig": {
    "availabilityZone": "us-east-2a",
    "nodeData": {
      "amiId": "ami-0d03add87774b12c5",
      "instanceType": "i3.4xlarge",
      "minInstances": 1,
      "maxInstances": 10,
      "securityGroupIds": [
        "sg-XXXX"
      ],
      "keyName": "XXXX",
      "subnetId": null,
      "iamProfile": null,
      "associatePublicIpAddress": null
    },
    "userData": null
  }
}

In the common properties for all servers I added druid-ec2-extensions to the druid.extensions.loadList variable.
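For reference, that line ends up looking roughly like this (a sketch; the other extensions listed are just assumed examples, keep whatever you already load and append druid-ec2-extensions):

# assumed example list; druid-ec2-extensions is the relevant addition
druid.extensions.loadList=["druid-hdfs-storage", "druid-kafka-indexing-service", "druid-ec2-extensions"]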

Also, my coordinator-overlord config looks like this with the autoscale settings:

druid.service=druid/coordinator
druid.plaintextPort=8081

druid.coordinator.startDelay=PT10S
druid.coordinator.period=PT5S

# Run the overlord service in the coordinator process

druid.coordinator.asOverlord.enabled=true
druid.coordinator.asOverlord.overlordService=druid/overlord

druid.indexer.queue.startDelay=PT5S

druid.indexer.runner.type=remote
druid.indexer.storage.type=metadata

druid.indexer.autoscale.strategy=ec2
druid.indexer.autoscale.doAutoscale=true
druid.indexer.autoscale.provisionPeriod=PT30S

But this does not work for me.
I do not see new nodes being launched when I create many tasks on my cluster.

I also found the following error in the coordinator-overlord log after applying the autoscaler settings:
2019-11-18T09:07:37,863 INFO [DatabaseRuleManager-Exec--0] org.apache.druid.metadata.SQLMetadataRuleManager - Polled and found 1 rule(s) for 1 datasource(s)
2019-11-18T09:07:37,883 ERROR [SimpleResourceManagement-manager--0] org.apache.druid.indexing.overlord.autoscaling.AbstractWorkerProvisioningStrategy - Uncaught exception.
org.apache.druid.java.util.common.ISE: No minVersion found! It should be set in your runtime properties or configuration database.
at org.apache.druid.indexing.overlord.autoscaling.ProvisioningUtil$1.apply(ProvisioningUtil.java:39) ~[druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
at org.apache.druid.indexing.overlord.autoscaling.ProvisioningUtil$1.apply(ProvisioningUtil.java:33) ~[druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
at com.google.common.collect.Iterators$7.computeNext(Iterators.java:647) ~[guava-16.0.1.jar:?]
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.1.jar:?]
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.1.jar:?]
at com.google.common.collect.Iterators.size(Iterators.java:186) ~[guava-16.0.1.jar:?]
at com.google.common.collect.Collections2$FilteredCollection.size(Collections2.java:211) ~[guava-16.0.1.jar:?]
at org.apache.druid.indexing.overlord.autoscaling.SimpleWorkerProvisioningStrategy$SimpleProvisioner.doProvision(SimpleWorkerProvisioningStrategy.java:130) ~[druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
at org.apache.druid.indexing.overlord.autoscaling.AbstractWorkerProvisioningStrategy$WorkerProvisioningService$1.run(AbstractWorkerProvisioningStrategy.java:75) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_131]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_131]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_131]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]

But I can't understand what this setting is. Could this setting be the cause of all the problems?

I would be grateful for any help!
Thank you all and have a nice day!

What does your runtime.properties on the middle manager (data node) look like?

Eric Graham

Solutions Engineer - Imply

cell: 303-589-4581

email: eric.graham@imply.io

www.imply.io

druid.service=druid/middleManager
druid.plaintextPort=8091

# Number of tasks per middleManager

druid.worker.capacity=4

# Task launch parameters

druid.indexer.runner.javaOpts=-server -Xms1g -Xmx1g -XX:MaxDirectMemorySize=1g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+ExitOnOutOfMemoryError -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
druid.indexer.task.baseTaskDir=var/druid/task

# HTTP server threads

druid.server.http.numThreads=60

# Processing threads and buffers on Peons

druid.indexer.fork.property.druid.processing.numMergeBuffers=2
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=100000000
druid.indexer.fork.property.druid.processing.numThreads=1

# Hadoop indexing

druid.indexer.task.hadoopWorkingPath=var/druid/hadoop-tmp

Hi Alex,

Can you try setting the following in your middle manager runtime.properties?

druid.worker.version
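For example, something like this (a sketch; the value 1 is just an assumed example, it only needs to be a non-empty string that matches the minimum worker version the overlord expects):

# assumed example value; must match the overlord's minimum worker version
druid.worker.version=1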

Also you may need to define the following in your autoscaler field for userData.

"userData": {
  "versionReplacementString": ":VERSION:",
  "version": null
}

Eric Graham

Solutions Engineer - Imply

cell: 303-589-4581

email: eric.graham@imply.io

www.imply.io

Hi Alex,

I looked through the code and it looks like this needs to be set.

Autoscaling-based replacement

If autoscaling is enabled on your Overlord, then Overlord processes can launch new Middle Manager processes en masse and then gracefully terminate old ones as their tasks finish. This process is configured by setting druid.indexer.runner.minWorkerVersion=#{VERSION}. Each time you update your Overlord process, the VERSION value should be increased, which will trigger a mass launch of new Middle Managers.

The config druid.indexer.autoscale.workerVersion=#{VERSION} also needs to be set.

So please set this as well.
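As a sketch, that would mean something like this in the coordinator-overlord runtime.properties (the value 1 is only an assumed example; both settings should carry the same version string and match druid.worker.version on the middle managers):

# assumed example values; keep in sync with druid.worker.version on the workers
druid.indexer.runner.minWorkerVersion=1
druid.indexer.autoscale.workerVersion=1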

Eric Graham

Solutions Engineer - Imply

cell: 303-589-4581

email: eric.graham@imply.io

www.imply.io

I've added the following line to my coordinator-overlord properties:

druid.indexer.autoscale.workerVersion=1

And now my Auto Scaler config looks like this:

{ "type": "ec2", "minNumWorkers": 5, "maxNumWorkers": 12, "envConfig": { "availabilityZone": "us-east-2b", "nodeData": { "amiId": "ami-0d03add87774b12c5", "instanceType": "i3.4xlarge", "minInstances": 2, "maxInstances": 10, "securityGroupIds": [ "sg-XXXX" ], "keyName": "XXXX", "subnetId": null, "iamProfile": null, "associatePublicIpAddress": null }, "userData": { "version": "1", } } }

But now I have the following error:

2019-11-18T16:00:07,189 INFO [SimpleResourceManagement-manager--0] org.apache.druid.indexing.overlord.autoscaling.SimpleWorkerProvisioningStrategy - Our target is 5 workers, and I'm okay with that (current = 0, min = 5, max = 12).
2019-11-18T16:00:07,361 ERROR [SimpleResourceManagement-manager--0] org.apache.druid.indexing.overlord.autoscaling.ec2.EC2AutoScaler - Unable to provision any EC2 instances.
com.amazonaws.services.ec2.model.AmazonEC2Exception: Invalid availability zone: [us-east-2b] (Service: AmazonEC2; Status Code: 400; Error Code: InvalidParameterValue; Request ID: 40e93594-9192-40b8-bc49-6972ed18c7f0)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1638) ~[aws-java-sdk-core-1.11.199.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1303) ~[aws-java-sdk-core-1.11.199.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1055) ~[aws-java-sdk-core-1.11.199.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743) ~[aws-java-sdk-core-1.11.199.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717) ~[aws-java-sdk-core-1.11.199.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699) ~[aws-java-sdk-core-1.11.199.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667) ~[aws-java-sdk-core-1.11.199.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649) ~[aws-java-sdk-core-1.11.199.jar:?]
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513) ~[aws-java-sdk-core-1.11.199.jar:?]
at com.amazonaws.services.ec2.AmazonEC2Client.doInvoke(AmazonEC2Client.java:14078) ~[aws-java-sdk-ec2-1.11.199.jar:?]
at com.amazonaws.services.ec2.AmazonEC2Client.invoke(AmazonEC2Client.java:14054) ~[aws-java-sdk-ec2-1.11.199.jar:?]
at com.amazonaws.services.ec2.AmazonEC2Client.executeRunInstances(AmazonEC2Client.java:13446) ~[aws-java-sdk-ec2-1.11.199.jar:?]
at com.amazonaws.services.ec2.AmazonEC2Client.runInstances(AmazonEC2Client.java:13423) ~[aws-java-sdk-ec2-1.11.199.jar:?]
at org.apache.druid.indexing.overlord.autoscaling.ec2.EC2AutoScaler.provision(EC2AutoScaler.java:151) [druid-ec2-extensions-0.16.0-incubating.jar:0.16.0-incubating]
at org.apache.druid.indexing.overlord.autoscaling.SimpleWorkerProvisioningStrategy$SimpleProvisioner.doProvision(SimpleWorkerProvisioningStrategy.java:153) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
at org.apache.druid.indexing.overlord.autoscaling.AbstractWorkerProvisioningStrategy$WorkerProvisioningService$1.run(AbstractWorkerProvisioningStrategy.java:75) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_131]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_131]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_131]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]


My Druid servers are in the us-east-2b zone...

Can you try adding aws.region to your config? It's explained here:

https://druid.apache.org/docs/latest/development/extensions-core/s3.html
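One way to do that (a sketch, assuming the cluster really runs in us-east-2) is to pass the AWS SDK's region system property to each Druid JVM, e.g. by adding this line to jvm.config:

-Daws.region=us-east-2

and appending the same flag to druid.indexer.runner.javaOpts so the peons pick it up too.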

Eric Graham

Solutions Engineer - Imply

cell: 303-589-4581

email: eric.graham@imply.io

www.imply.io

I have added the following line to all jvm.config files:

-Daws.region=us-east-2

I also changed the middleManager runtime.properties:

druid.indexer.runner.javaOpts=-server -Xms1g -Xmx1g -XX:MaxDirectMemorySize=1g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+ExitOnOutOfMemoryError -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -Daws.region=us-east-2

I also set the AWS_REGION and AWS_DEFAULT_REGION environment variables to us-east-2.

But this didn’t help =(

Hi Alex,

In the error I noticed us-east-2b is referenced, but you have us-east-2 in the config. Can you make sure you have the right availability zone and have access to it? This looks like an AWS SDK problem.

Take a look here

https://github.com/coreos/coreos-kubernetes/issues/442

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-regions-availability-zones

Eric

Eric Graham

Solutions Engineer - Imply

cell: 303-589-4581

email: eric.graham@imply.io

www.imply.io

I've changed it:

2019-11-18T17:25:25,530 INFO [SimpleResourceManagement-manager--0] org.apache.druid.indexing.overlord.autoscaling.SimpleWorkerProvisioningStrategy - Our target is 5 workers, and I'm okay with that (current = 1, min = 5, max = 12).
2019-11-18T17:25:25,758 ERROR [SimpleResourceManagement-manager--0] org.apache.druid.indexing.overlord.autoscaling.ec2.EC2AutoScaler - Unable to provision any EC2 instances.
com.amazonaws.services.ec2.model.AmazonEC2Exception: Invalid availability zone: [us-east-2] (Service: AmazonEC2; Status Code: 400; Error Code: InvalidParameterValue; Request ID: dff37c20-cd30-489d-b8d2-c8d4a7eddcf3)

Please review the links that I sent and verify you have access to that availability zone.

Eric Graham

Solutions Engineer
Cell: +1-303-589-4581
egraham@imply.io

Try the zone as us-east-2. I don't think you can specify the a/b/c bit.

Based on this ticket https://github.com/apache/incubator-druid/issues/5383, I managed to get autoscaling running when I changed the region from us-east-2 to us-east-1.

So, I think this is it
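For anyone hitting this later, the envConfig that ended up working looks roughly like this (a sketch with placeholders; the zone and AMI below are assumptions, and since AMI IDs are region-specific the amiId must be one that actually exists in the region you point the autoscaler at):

"envConfig": {
  "availabilityZone": "us-east-1a",
  "nodeData": {
    "amiId": "ami-XXXX",
    "instanceType": "i3.4xlarge",
    "minInstances": 2,
    "maxInstances": 10,
    "securityGroupIds": ["sg-XXXX"],
    "keyName": "XXXX"
  },
  "userData": {
    "version": "1"
  }
}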

Thank you, Eric!