Druid 0.12.1: Datasource load fails in Coordinator due to change in implementation from Set to Map

Hi Team,

We have close to 500,000 active data segments in the metadata store (Postgres).

The Coordinator is running on a 4-CPU CentOS server.

We have upgraded from Druid 0.10.0 to 0.12.1. Since then, when we bring up the Coordinator, we see the behaviour below.

The datasource loading keeps running and then hangs inside the poll() function in SQLMetadataSegmentManager.java.

On further debugging, we see that the portion below is the one taking the time:

if (!dataSource.getSegments().contains(segment)) {
  dataSource.addSegment(segment);
}

The main reason it is slow appears to be the change in DruidDataSource.java from a ConcurrentSkipListSet (and a HashMap) to a ConcurrentSkipListMap.
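One possible explanation, offered as an assumption rather than a confirmed diagnosis: if getSegments() returns the values() view of the ConcurrentSkipListMap, then contains(segment) has to scan the values one by one, whereas contains() on a ConcurrentSkipListSet is an ordered lookup. The sketch below counts equals() and compareTo() calls for one lookup of an absent element in each structure. FakeSegment and ContainsCostSketch are hypothetical names used only for illustration; they are not Druid classes.

```java
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a segment; NOT Druid's DataSegment.
final class FakeSegment implements Comparable<FakeSegment> {
    static final AtomicLong EQUALS_CALLS = new AtomicLong();
    static final AtomicLong COMPARE_CALLS = new AtomicLong();
    final String id;

    FakeSegment(String id) {
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        EQUALS_CALLS.incrementAndGet();
        return o instanceof FakeSegment && ((FakeSegment) o).id.equals(id);
    }

    @Override
    public int hashCode() {
        return id.hashCode();
    }

    @Override
    public int compareTo(FakeSegment other) {
        COMPARE_CALLS.incrementAndGet();
        return id.compareTo(other.id);
    }
}

final class ContainsCostSketch {
    /** Returns {equals() calls for map.values().contains, compareTo() calls for set.contains}. */
    static long[] measure(int n) {
        ConcurrentSkipListMap<String, FakeSegment> byId = new ConcurrentSkipListMap<>();
        ConcurrentSkipListSet<FakeSegment> set = new ConcurrentSkipListSet<>();
        for (int i = 0; i < n; i++) {
            FakeSegment s = new FakeSegment(String.format("seg-%08d", i));
            byId.put(s.id, s);
            set.add(s);
        }
        FakeSegment absent = new FakeSegment("seg-absent");

        // Membership test on the values view: walks every stored value.
        FakeSegment.EQUALS_CALLS.set(0);
        boolean inValues = byId.values().contains(absent);
        long equalsCalls = FakeSegment.EQUALS_CALLS.get();

        // Membership test on the sorted set: ordered search via compareTo().
        FakeSegment.COMPARE_CALLS.set(0);
        boolean inSet = set.contains(absent);
        long compareCalls = FakeSegment.COMPARE_CALLS.get();

        if (inValues || inSet) {
            throw new IllegalStateException("absent element unexpectedly found");
        }
        return new long[] {equalsCalls, compareCalls};
    }

    public static void main(String[] args) {
        long[] calls = measure(100_000);
        System.out.println("values().contains equals() calls: " + calls[0]);
        System.out.println("set.contains compareTo() calls:   " + calls[1]);
    }
}
```

If the assumption about getSegments() holds, each check in the poll() loop would cost time proportional to the number of segments already loaded, which would match per-insert times that grow as the map fills up.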

We added logging statements to time the above section in **SQLMetadataSegmentManager.java**. Early in the loop each iteration takes less than a millisecond, but as more records are inserted into the ConcurrentSkipListMap the insertions slow down: ~8 ms per insertion at ~50K records, rising to ~300 ms by 300K records.
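For anyone who wants to reproduce this kind of measurement outside Druid, here is a minimal sketch of a per-batch timer around a check-then-add loop. It uses plain strings in place of segments, and it assumes the membership check goes through the map's values() view, which this thread does not confirm. PollTimingSketch and timeBatches are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

final class PollTimingSketch {
    /** Inserts n string "segments" with a check-then-add step; returns elapsed nanos per batch. */
    static List<Long> timeBatches(int n, int batchSize) {
        ConcurrentSkipListMap<String, String> segments = new ConcurrentSkipListMap<>();
        List<Long> nanosPerBatch = new ArrayList<>();
        long batchStart = System.nanoTime();
        for (int i = 1; i <= n; i++) {
            String id = "seg-" + i;
            // Same shape as the quoted check; going through values() is an assumption.
            if (!segments.values().contains(id)) {
                segments.put(id, id);
            }
            if (i % batchSize == 0) {
                long now = System.nanoTime();
                nanosPerBatch.add(now - batchStart);
                batchStart = now;
            }
        }
        return nanosPerBatch;
    }

    public static void main(String[] args) {
        List<Long> batches = timeBatches(20_000, 5_000);
        for (int b = 0; b < batches.size(); b++) {
            System.out.printf("batch %d: %d ms%n", b + 1, batches.get(b) / 1_000_000);
        }
    }
}
```

Absolute numbers will differ from the Coordinator's, since real segments are larger objects with more expensive equality checks; the point is only to see whether later batches take longer than earlier ones.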

We also added the same timers to the older version of Druid. With the **ConcurrentSkipListSet** implementation, the loop finishes processing the 500K records in 5 minutes.

Also, when we try a higher-spec machine with 32 CPUs, we still see the same behaviour.

In summary, it seems that ConcurrentSkipListMap is slower than ConcurrentSkipListSet here, and this results in some sort of timeout in version 0.12.1, whereas the same number of segments loads without issues in under 10 minutes in version 0.10.0. Also, when we check the code, 0.11.0 seems identical to 0.10.0 in this area; only 0.12.1 has this change.
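As an illustration only, one way to keep a check-then-add step cheap on a ConcurrentSkipListMap is to test membership by key instead of searching the values. The sketch below keys the map by a segment identifier and uses putIfAbsent, so the check and the insert are a single ordered lookup. KeyedAddSketch and addIfAbsent are hypothetical names, strings stand in for segments, and this is not a description of how Druid resolved the issue.

```java
import java.util.concurrent.ConcurrentSkipListMap;

final class KeyedAddSketch {
    // Hypothetical id -> segment map; strings stand in for real segment objects.
    private final ConcurrentSkipListMap<String, String> idToSegment = new ConcurrentSkipListMap<>();

    /** Adds the segment unless its id is already present; returns true if it was added. */
    boolean addIfAbsent(String segmentId, String segment) {
        // putIfAbsent returns null only when no mapping existed for the key.
        return idToSegment.putIfAbsent(segmentId, segment) == null;
    }

    int size() {
        return idToSegment.size();
    }

    public static void main(String[] args) {
        KeyedAddSketch dataSource = new KeyedAddSketch();
        System.out.println(dataSource.addIfAbsent("seg-1", "payload")); // first add of this id
        System.out.println(dataSource.addIfAbsent("seg-1", "payload")); // duplicate id, not added
        System.out.println(dataSource.size());
    }
}
```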

Regards,

Venu

Adding the dev email group.

We are currently hitting this problem in our environment too: loading 200K segments is taking forever, whereas on 0.10.1 the load completed in less than 5 minutes. I see a pull request (https://github.com/druid-io/druid/pull/5878), already merged to master, that potentially fixes this issue. I believe this fix will be part of the 0.12.2 release whenever it comes out.

Yes, this should be out soon. We regret the regression! If you are comfortable patching #5878 into your build, that should fix it. It will also be included in 0.12.2.