Performance question for ingestion and selection query (groupBy)

Hi Druid,
Thank you in advance for your help.

I am using the latest Druid version, installed as a single-server deployment following https://druid.apache.org/docs/latest/operations/single-server.html

Here is the data volume I ingested.
[Screenshot: size and row count of the ingested datasource]

select count(distinct a) from "search" --> 5
select count(distinct b) from "search" --> 4,000,000
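(A side note on why the high-cardinality b column worries me: an exact distinct count has to materialize every distinct value, while a fixed-size sketch like the one behind Druid's approximate distinct counting does not. This is a toy KMV estimator I wrote to convince myself of the idea; it is NOT Druid's actual HLL implementation.)

```python
import hashlib

def kmv_estimate(values, k=256):
    """Toy KMV (k minimum values) distinct-count sketch.

    Keep the k smallest hash values seen; the k-th smallest of n uniform
    hashes is ~k/n, so n ~ (k-1)/h_k. Memory stays O(k) no matter how many
    distinct values arrive. Illustration only, not Druid's HLL code.
    """
    smallest = set()
    for v in values:
        digest = hashlib.md5(str(v).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big") / 2.0**64  # uniform in [0, 1)
        smallest.add(h)
        if len(smallest) > k:
            smallest.discard(max(smallest))  # drop the largest, keep k smallest
    if len(smallest) < k:
        return len(smallest)  # fewer than k distinct values seen: exact count
    return int(round((k - 1) / max(smallest)))
```

With k=256 the estimate is typically within a few percent, using constant memory instead of one set entry per distinct value.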

Here is the ingestion spec.

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "search1",
      "dimensionsSpec": {
        "dimensions": ["a", "b", "c", "d", "e", "f"]
      },
      "timestampSpec": {
        "column": "dt",
        "format": "iso",
        "missingValue": "2020-08-22T00:00:00.000"
      },
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "rollup": false
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "prefixes": ["s3://bucket_name/"]
      },
      "inputFormat": {
        "type": "parquet"
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxNumConcurrentSubTasks": 4
    }
  }
}

Note on maxNumConcurrentSubTasks: four index runner processes are always working even when I change this value. I think it is because the S3 bucket has 31 files, so each processor handles around 10 files.
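For reference, this is how I submit the spec: a plain HTTP POST to the task endpoint, going through the default single-server router on port 8888 (the helper below is just my own wrapper, not a Druid API):

```python
import json

DRUID_ROUTER = "http://localhost:8888"  # default single-server router port

def build_task_request(spec):
    """Build the (url, body) pair for submitting an ingestion spec.

    The task API is POST /druid/indexer/v1/task with the spec as the JSON
    body (Content-Type: application/json); the router proxies it to the
    Overlord. Serializing here also catches a malformed spec early.
    """
    url = DRUID_ROUTER + "/druid/indexer/v1/task"
    body = json.dumps(spec).encode("utf-8")
    return url, body
```

I then send it with any HTTP client, e.g. `curl -X POST -H 'Content-Type: application/json' -d @spec.json http://localhost:8888/druid/indexer/v1/task`.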

Question 1 for ingestion performance:

  • In terms of ingestion performance, there is not a big difference between the medium (i3.4xlarge) and the xlarge (i3.16xlarge).

Ingestion test on a single i3.4xlarge:

  • took 7.16 min

Ingestion test on a single i3.16xlarge:

  • took 7.02 min

I changed only this config from the default:

  • druid.indexer.runner.javaOpts=-server -Xms3g -Xmx3g (the default is 1g)

Is this reasonable performance, or is there more room to improve?

Question 2 for selection(groupBy) performance:

When I ran this query, it took more than 30 seconds on both the medium (i3.4xlarge) and the xlarge (i3.16xlarge):

select a, b, count(distinct c)
from "search"
group by a, b
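One variant I am considering (not yet benchmarked): Druid SQL's APPROX_COUNT_DISTINCT, which keeps per-query state bounded and should avoid the merging-dictionary pressure, at the cost of a small approximation error. A sketch of the SQL API payload, assuming the default router on localhost:8888:

```python
import json

SQL_ENDPOINT = "http://localhost:8888/druid/v2/sql"  # default router port

# Swap the exact count(distinct c) for an approximate distinct count.
query = (
    'SELECT a, b, APPROX_COUNT_DISTINCT(c) AS c_count '
    'FROM "search" GROUP BY a, b'
)

payload = json.dumps({"query": query})
# Sent as: POST to SQL_ENDPOINT with Content-Type: application/json
```

This is the standard Druid SQL HTTP API shape; only the column names above come from my schema.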

I changed only these configs from the defaults.
For broker:

  • druid.processing.buffer.sizeBytes=2000000000

For historical:

  • druid.processing.buffer.sizeBytes=2000000000

  • druid.query.groupBy.maxOnDiskStorage=10000000000

  • druid.query.groupBy.maxMergingDictionarySize=300000000

I changed the above options to avoid the following error:
Resource limit exceeded / Not enough dictionary space to execute this query. Try increasing druid.query.groupBy.maxMergingDictionarySize or enable disk spilling by setting druid.query.groupBy.maxOnDiskStorage to a positive number. / org.apache.druid.query.ResourceLimitExceededException / on host localhost:8083

Do you have any hints for improving ingestion and groupBy performance?

I also tried this ingestion spec to partition by a, but it did not improve performance.

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "search",
      "dimensionsSpec": {
        "dimensions": ["a", "b", "c", "d", "e", "f"]
      },
      "timestampSpec": {
        "column": "dt",
        "format": "iso",
        "missingValue": "2020-08-22T00:00:00.000"
      },
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "rollup": false,
        "intervals": ["20200822/20200823"]
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "prefixes": ["s3://bucket_name/"]
      },
      "inputFormat": {
        "type": "parquet"
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxNumConcurrentSubTasks": 4,
      "partitionsSpec": {
        "type": "single_dim",
        "partitionDimension": "a",
        "maxRowsPerSegment": 10000000
      },
      "forceGuaranteedRollup": true
    }
  }
}