Unbalanced cluster boot

Hi all,

today I had a weird issue with a new set of historical nodes coming up, and I wanted to ask whether you have an idea why it happened:

So in our company we are testing different (AWS) instance types to see how they affect cluster performance. We did a test with 15 machines (40 cores, 160 GB mem) and have now switched to 38 machines (16 cores, 64 GB mem), i.e. similar total resources. (With 15 machines we had no issues and the cluster came up relatively fast, say an hour, so we also wanted to see whether the startup time could be improved with more, smaller machines.)

What's odd is that all machines came up at the same time and the coordinator initially assigned segments equally to all 38 machines, but after a while one, and only one, machine had downloaded way more segments than the others, and I want to understand why.

Here is what it looked like

Usually I wouldn't bother, as this should balance out eventually, but this one instance still had 2000+ segments to download while all the other 37 historicals had already finished downloading.

But(!), even with a replication factor of 2, the datasource was marked as not available, and I don't get this. Shouldn't another instance hold a replica of those 2000+ pending segments, making the datasource available?

Here is the relevant dynamic config of the coordinator:

{
  "mergeSegmentsLimit": 100,
  "maxSegmentsToMove": 3005,
  "replicantLifetime": 15,
  "replicationThrottleLimit": 3000,
  "balancerComputeThreads": 1,
  "emitBalancingStats": false
}

So this run was done with maxSegmentsToMove at 3000+ and a replicationThrottleLimit of 3000.
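For reference, this is roughly how I set that dynamic config, via the Coordinator's /druid/coordinator/v1/config endpoint (a minimal sketch; the Coordinator host/port below is a placeholder for our actual setup):

```python
import json
import requests

COORDINATOR = "http://localhost:8081"  # placeholder for your Coordinator host:port

dynamic_config = {
    "mergeSegmentsLimit": 100,
    "maxSegmentsToMove": 3005,
    "replicantLifetime": 15,
    "replicationThrottleLimit": 3000,
    "balancerComputeThreads": 1,
    "emitBalancingStats": False,
}

# Push the dynamic coordinator config...
resp = requests.post(
    f"{COORDINATOR}/druid/coordinator/v1/config",
    data=json.dumps(dynamic_config),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()

# ...and read it back to confirm what was actually applied.
print(requests.get(f"{COORDINATOR}/druid/coordinator/v1/config").json())
```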

Cheers,

André


Upon thinking about it:

Are segments in the load queue ever evicted?

Then my only explanation (and I would like to hear your ideas about it) is that this one machine was simply the first one the coordinator recognized, so it got assigned maxSegmentsToMove (3000+) segments; then all the others came up, and in the next round of segment assignment they each got an equal share of, say, 85… Is that how it happens?

That would at least explain this weird unbalanced segment loading.
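To make that concrete, here is a tiny toy model of what I suspect is happening (purely illustrative, not the real balancer logic; the server names and segment counts are made up):

```python
# Toy model: each coordinator run hands out at most `per_run_limit` segments,
# split evenly across the historicals visible in that run.
def assign_runs(total_segments, per_run_limit, servers_visible_per_run):
    loads = {}
    remaining = total_segments
    for visible in servers_visible_per_run:
        if remaining <= 0:
            break
        batch = min(per_run_limit, remaining)
        share = batch // len(visible)
        for server in visible:
            loads[server] = loads.get(server, 0) + share
        remaining -= share * len(visible)
    return loads, remaining

# Run 1: only historical-0 has announced itself yet; runs 2+: all 38 are up.
all_servers = [f"historical-{i}" for i in range(38)]
runs = [["historical-0"]] + [all_servers] * 5
loads, remaining = assign_runs(20000, 3000, runs)
print(loads["historical-0"], loads["historical-1"], remaining)
# historical-0 ends up with ~3000 more queued segments than everyone else.
```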

But why would the datasource still be marked as not available if replication is set to 2?

Thanks in advance,

André

So I just took down the cluster, since it was taking ages to load the 2000+ segments, changed the config to 200 segments, and restarted the historicals, and I'm still getting similar behaviour. What is going on??

Hi André,

This behavior is strange for sure. You could be hitting some weird problem with the cost function. I have seen something similar to this before, but only rarely, and I'm not sure what causes it. It might be interesting to dump a list of all your segment metadata and run some simulations with it to see if they reveal anything. If you are really feeling adventurous you could do this yourself (check out CostBalancerStrategy and the classes that call it).

If you are not feeling that adventurous, would you consider dumping your segment metadata and attaching it to a GitHub issue in case someone is willing to try this? You can change datasource names to protect the innocent if you want.
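If it helps, something like this would do for the dump (a rough sketch against the Coordinator's /druid/coordinator/v1/metadata/segments endpoint, assuming each returned segment object carries a dataSource field; the host/port, output file name, and renaming scheme are just placeholders):

```python
import json
import requests

COORDINATOR = "http://localhost:8081"  # placeholder for your Coordinator host:port

# Pull the segment metadata for the whole cluster from the Coordinator.
segments = requests.get(
    f"{COORDINATOR}/druid/coordinator/v1/metadata/segments"
).json()

# Optionally rename datasources before attaching the dump to a GitHub issue.
aliases = {}
for seg in segments:
    name = seg["dataSource"]
    seg["dataSource"] = aliases.setdefault(name, f"datasource_{len(aliases)}")

with open("segment-metadata-dump.json", "w") as f:
    json.dump(segments, f, indent=2)

print(f"dumped {len(segments)} segments from {len(aliases)} datasources")
```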

Hi! How are you guys?

Wanted to jump in here because we are having a similar issue:

Many instances have 0 segments to load, but the ones holding less data have a lot of segments queued to load. However, they load very few of them, and they throw the following error:

"Balancer move segments queue has a segment stuck"

Were you able to solve this problem?

Any insight will be much appreciated!

Best,

Hi Federico:

Your error message points to a segment loading failure. Can you find in your coordinator log which segment it is and which historical node is acting up? We may be able to reset it if you have that info.

Thanks

Ming

Hi Ming! Thanks for your quick answer :slight_smile:

Yes, that log shows us both the segment name and the historical IP, for example:

2019-01-20T21:08:12,015 ERROR [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorBalancer - [_default_tier]: Balancer move segments queue has a segment stuck: {class=io.druid.server.coordinator.helper.DruidCoordinatorBalancer, segment=sourceData_2018-06-07T22:00:00.000Z_2018-06-07T23:00:00.000Z_2018-06-08T00:00:04.196Z_13, server=DruidServerMetadata{name='172.18.50.181:8083', hostAndPort='172.18.50.181:8083', hostAndTlsPort='null', maxSize=1840000000000, tier='_default_tier', type=historical, priority=0}}

In this case:

segment ID: sourceData_2018-06-07T22:00:00.000Z_2018-06-07T23:00:00.000Z_2018-06-08T00:00:04.196Z_13

Historical IP: 172.18.50.181

Right?

If I go to that server, it has the segment indeed:

/mnt/disk3/druid/historical/sourceData/2018-06-07T22:00:00.000Z_2018-06-07T23:00:00.000Z/2018-06-08T00:00:04.196Z/13

$ ll
total 687456
drwxr-xr-x 2 x x      4096 Jan 12 10:50 ./
drwxr-xr-x 3 x x      4096 Jan 12 10:50 ../
-rw-r--r-- 1 x x 703928015 Jan 12 10:50 00000.smoosh
-rw-r--r-- 1 x x        29 Jan 12 10:50 factory.json
-rw-r--r-- 1 x x      2892 Jan 12 10:50 meta.smoosh
-rw-r--r-- 1 x x         4 Jan 12 10:50 version.bin

What does it mean that the segment is “stuck”?

Does it want to move the segment from this server to another? If so, how does that migration happen? I was wondering whether it removes the segment, changes the coordinator's assignment for this segment (assigns another historical), and then tells that other historical to grab it from deep storage.

Best,

Yes, Federico, that's what I would do too. Try deleting this segment from the historical node and let the Coordinator reload it from deep storage; hopefully that unsticks it :slight_smile:
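Afterwards you can watch whether it gets picked up again, e.g. by polling the Coordinator's load queue and load status endpoints (a small sketch; the host/port is a placeholder):

```python
import requests

COORDINATOR = "http://localhost:8081"  # placeholder for your Coordinator host:port

# Per-historical counts of segments still queued to load/drop.
print(requests.get(f"{COORDINATOR}/druid/coordinator/v1/loadqueue?simple").json())

# Per-datasource percentage of segments that are loaded and available.
print(requests.get(f"{COORDINATOR}/druid/coordinator/v1/loadstatus").json())
```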

Ming