Hi Team,
We have close to 500,000 active data segments in the metadata store (Postgres).
The Coordinator is running on a 4-CPU CentOS server.
We upgraded from Druid 0.10.0 to 0.12.1. Since the upgrade, we see the following behaviour when we bring up the Coordinator:
The datasource loading keeps running and then hangs inside the poll() function in SQLMetadataSegmentManager.java.
On further debugging, we see that the portion below is where the time is spent:
    if (!dataSource.getSegments().contains(segment)) {
      dataSource.addSegment(segment);
    }
It seems the main reason it is slow is the change in DruidDataSource.java from a ConcurrentSkipListSet (and a HashMap) to a ConcurrentSkipListMap.
We added logging statements to time the above section in SQLMetadataSegmentManager.java. As the loop runs collecting segments, the time taken is initially less than a millisecond, but as more records are inserted into the ConcurrentSkipListMap, each insertion takes ~8 ms at ~50k records and then increases all the way to ~300 ms by the time we reach 300k records.
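The growth pattern above is what one would expect if the membership check, rather than the insertion itself, became a linear scan. The following is a minimal standalone sketch, not Druid code. It assumes (we have not confirmed this against the 0.12.1 source) that getSegments() now returns the values() view of the ConcurrentSkipListMap. In the JDK, values().contains(x) on a ConcurrentSkipListMap walks the entries one by one, whereas ConcurrentSkipListSet.contains(x) is a logarithmic lookup. The class name, method names and String segment ids are hypothetical stand-ins.

```java
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ConcurrentSkipListSet;

public class SegmentLookupSketch {

    // Hypothetical stand-in for a segment identifier (real Druid uses DataSegment objects).
    static String segmentId(int i) {
        return String.format("segment_%08d", i);
    }

    // Assumed 0.12.1-like pattern: membership is checked through the map's values() view.
    // Each contains() call scans the existing entries, so the whole loop is O(n^2).
    static int loadViaValuesView(int n) {
        ConcurrentSkipListMap<String, String> idToSegment = new ConcurrentSkipListMap<>();
        for (int i = 0; i < n; i++) {
            String id = segmentId(i);
            if (!idToSegment.values().contains(id)) {
                idToSegment.put(id, id);
            }
        }
        return idToSegment.size();
    }

    // 0.10.0-like pattern: membership is checked on a sorted set, O(log n) per call.
    static int loadViaSet(int n) {
        ConcurrentSkipListSet<String> segments = new ConcurrentSkipListSet<>();
        for (int i = 0; i < n; i++) {
            String id = segmentId(i);
            if (!segments.contains(id)) {
                segments.add(id);
            }
        }
        return segments.size();
    }

    // Possible alternative: keep the map but look up by key, which is also O(log n) per call.
    static int loadViaKeyLookup(int n) {
        ConcurrentSkipListMap<String, String> idToSegment = new ConcurrentSkipListMap<>();
        for (int i = 0; i < n; i++) {
            String id = segmentId(i);
            idToSegment.putIfAbsent(id, id);
        }
        return idToSegment.size();
    }

    public static void main(String[] args) {
        int n = 20_000; // kept small so the quadratic variant finishes quickly
        long t0 = System.nanoTime();
        loadViaValuesView(n);
        long t1 = System.nanoTime();
        loadViaSet(n);
        long t2 = System.nanoTime();
        loadViaKeyLookup(n);
        long t3 = System.nanoTime();
        System.out.printf("values().contains: %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("set.contains:      %d ms%n", (t2 - t1) / 1_000_000);
        System.out.printf("map.putIfAbsent:   %d ms%n", (t3 - t2) / 1_000_000);
    }
}
```

If the assumption holds, the first variant should slow down steadily as the map grows, much like the per-insertion timings we logged, while the other two stay roughly flat. That would point at the contains() call on the values view rather than at ConcurrentSkipListMap insertions as such.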
We also added the same timers to the older version of Druid, and with the ConcurrentSkipListSet implementation the loop finishes processing the 500k records in 5 minutes.
When we try a higher-spec machine with 32 CPUs, we still see the same behaviour.
In summary, it seems like ConcurrentSkipListMap is slower than ConcurrentSkipListSet, and this results in some sort of timeout in version 0.12.1, whereas the same number of segments load without issues in under 10 minutes in version 0.10.0. Also, when we check the code, 0.11.0 seems identical to 0.10.0 in this area, whereas 0.12.1 has this change.
Regards,
Venu