Today I ran into a weird issue with a new set of historical nodes coming up, and I wanted to ask whether anyone has an idea why it happened:
At our company we are testing how different AWS instance types affect cluster performance. We first ran a test with 15 machines (40 cores, 160 GB memory each) and have now switched to 38 machines (16 cores, 64 GB memory each), i.e. roughly the same total resources (15 × 40 = 600 cores and 15 × 160 GB = 2,400 GB vs. 38 × 16 = 608 cores and 38 × 64 GB = 2,432 GB). With the 15 machines we had no issues and the cluster came up relatively fast, say within an hour, so we also wanted to see whether the startup time could be improved with more, smaller machines.
What’s odd is that all machines came up at the same time and the coordinator initially assigned segments equally across all 38 machines, but after a while one, and only one, machine had downloaded way more segments than the rest, and I want to understand why.
Here is what it looked like:
Usually I wouldn’t bother, as this should balance out eventually, but this one instance still had 2000+ segments to download while all the other 37 historicals had already finished downloading.
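For anyone who wants to watch the same thing without the console, here is a minimal sketch that reads the per-historical load queues through the coordinator's loadqueue API. The coordinator address is a hypothetical placeholder (adjust it for your cluster), and the exact response fields may vary by Druid version:

```python
import requests

COORDINATOR = "http://coordinator:8081"  # hypothetical address; adjust for your cluster

# "simple" mode returns per-server counters instead of full segment lists.
resp = requests.get(f"{COORDINATOR}/druid/coordinator/v1/loadqueue",
                    params={"simple": ""})
resp.raise_for_status()

# The response maps each historical to its load/drop queue counters,
# e.g. {"host:8083": {"segmentsToLoad": 2000, "segmentsToDrop": 0, ...}}.
# Sorting by pending loads makes the straggler easy to spot.
for server, queue in sorted(resp.json().items(),
                            key=lambda kv: kv[1]["segmentsToLoad"],
                            reverse=True):
    print(f"{server}: {queue['segmentsToLoad']} segments still to load")
```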
But(!), even with a replication factor of 2, the datasource was marked as not available, and I don’t get this. Shouldn’t another instance already be serving a replica of those 2000+ pending segments, making the datasource available?
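For reference, the availability number the console shows should correspond to the coordinator's loadstatus endpoint; a minimal sketch for checking it directly (same hypothetical coordinator address as above):

```python
import requests

COORDINATOR = "http://coordinator:8081"  # hypothetical address

# Returns, per datasource, the percentage of used segments that are
# actually being served by historicals; "available" means 100%.
status = requests.get(f"{COORDINATOR}/druid/coordinator/v1/loadstatus").json()
for datasource, pct in sorted(status.items()):
    print(f"{datasource}: {pct}% of segments available")
```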
Here is the relevant dynamic config of the coordinator:
So this was run with maxSegmentsToMove = 3000 and replicationThrottleLimit = 3000.
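For completeness, a sketch of how those two values can be set through the coordinator's dynamic-config API (again using the hypothetical coordinator address; reading the current config first avoids accidentally resetting other fields):

```python
import requests

COORDINATOR = "http://coordinator:8081"  # hypothetical address

# Read the current dynamic config, override the two values in question,
# and post the full object back so other fields keep their current values.
cfg = requests.get(f"{COORDINATOR}/druid/coordinator/v1/config").json()
cfg["maxSegmentsToMove"] = 3000         # segments the balancer may move per coordinator run
cfg["replicationThrottleLimit"] = 3000  # max replicas assigned per coordinator run
requests.post(f"{COORDINATOR}/druid/coordinator/v1/config", json=cfg).raise_for_status()
```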