question regarding uneven distribution of segments across historicals

We have several weeks' worth of data residing on 15 historical nodes, and when they were running full, I removed one week of data using a drop rule.
Before dropping the data, the coordinator console showed all historicals as roughly equally “full”, at around 98%.
Immediately after the deletion, I’m seeing the following in the coordinator console:

I was surprised to see that most of that data seemed to have resided on a single historical, and that some of the data also seemed to have been on all but one of the others. Is this how things should be, or should I be concerned? We are having performance issues, and I was wondering whether they might be due to this uneven distribution.

We have also set the replication level to 2, so I was thinking that maybe one replica of each segment ended up on the same historical while the other replica got distributed among the others.

However, prior to the deletion the cluster had had days to balance itself, and I did see the cluster rebalance the segments after the deletion.

Is there a quick and easy way to see whether the data is not only spread out evenly across the historicals in terms of volume, but also distributed so that a query can be processed with good parallelism?
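
For the volume side, I imagine something like the rough sketch below against the coordinator API would do (the endpoint and field names are my assumption and may differ between Druid versions), but I don't know how to judge the parallelism side:

    # Rough sketch, not tested against any specific Druid version -- the
    # ?simple servers endpoint and the currSize/maxSize fields are assumptions.
    import requests

    COORDINATOR = "http://coordinator-host:8081"  # placeholder

    servers = requests.get(COORDINATOR + "/druid/coordinator/v1/servers?simple").json()
    for s in servers:
        if s.get("type") != "historical":
            continue
        used = s.get("currSize", 0)
        cap = s.get("maxSize") or 1
        print(f"{s['host']}: {used / cap:.1%} full "
              f"({used / 2**30:.1f} GiB of {cap / 2**30:.1f} GiB)")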

thanks

Sascha

We also see things like this:

A query spanning 4 weeks of data generates a segment scan time of allegedly 20 seconds on one node, while the other nodes show a segment scan time of zero.
When the same or similar queries are rerun, the scan times are then evenly distributed, as can also be seen in the graphs.
I also don't know whether to trust the metrics, as we are using the statsd-emitter and I'm not sure whether individual metrics emitted by Druid are perhaps summed up, such that if a node scans 20 segments during one metric emission interval and takes 1 second for each, the emitted value would be 20 seconds instead of 1 second. The segment scan metric is of type statsd timer and the dashboard takes the mean over those values, so to my understanding the metrics are being handled correctly. However, in load tests I sometimes see scan times of 200 ms, as they ought to be, and sometimes I see scan times of 10 ms.
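
To make the concern concrete, here is the toy arithmetic I have in mind (purely hypothetical numbers, not taken from our dashboard):

    # Hypothetical example: within one emission interval a node scans 20 segments,
    # each taking 1 second (1000 ms).
    scan_times_ms = [1000] * 20

    # What a counter-style rollup would report for the interval:
    summed = sum(scan_times_ms)                      # 20000 ms
    # What a statsd timer aggregated with a mean should report:
    mean = sum(scan_times_ms) / len(scan_times_ms)   # 1000 ms

    print(summed, mean)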

The only metric I see in our dashboard that seems to correlate with the spike in the segment scan time is the CollectD CPU System metric, middle row on the left:

There is some evidence pointing towards an uneven distribution of segments across historical nodes. We don't know for sure, but as a working hypothesis we currently assume that the data isn't spread evenly across the historicals because the coordinator might assign segments based on which historical has the most free local disk space, or something along those lines. As we grew the number of historicals over time, it seems that a small number of new historicals received all of the new incoming segments, and now, when a query comes in, there is not enough parallelism because all of its segments reside on one or two nodes.

Can somebody shed some light on what the coordinator's distribution strategy is, or point me to where within the code base I'd need to look to get more insight myself?

The strategy applied by the coordinator doesn’t seem to be configurable, or is it?

Thanks
Sascha

I think this PR explains the current strategy and new strategy pretty well:
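
For completeness, the runtime-adjustable balancing knobs (such as maxSegmentsToMove) live in the coordinator dynamic config, and in newer Druid releases the strategy itself is selected via the coordinator runtime property druid.coordinator.balancer.strategy. The exact names are version-dependent, so treat this sketch as something to verify against your docs:

    # Rough sketch: inspect the coordinator's dynamic configuration, which holds
    # runtime-adjustable balancing knobs such as maxSegmentsToMove. The endpoint
    # and field names may vary by Druid version.
    import json
    import requests

    COORDINATOR = "http://coordinator-host:8081"  # placeholder

    config = requests.get(COORDINATOR + "/druid/coordinator/v1/config").json()
    print(json.dumps(config, indent=2))
    print("maxSegmentsToMove =", config.get("maxSegmentsToMove"))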

WOOOW!!! That is an awesome pull request. I don’t understand how I could have overlooked it.

Thanks so much for sharing.

The visualizations look very nice. May I ask how they were made? Everybody here would like to be able to compute these histograms themselves.

Maybe a dumb question to ask, but I didn’t understand the following statement in the pull request:

Ensuring the unit of computation is constant is also good practice to reduce variance in segment scan times across the cluster.

How can “ensuring the unit of computation is constant” be achieved exactly? Does it mean that we should ensure that all segments of all datasources and all segment granularities have the same number of records per segment?

We have two datasources. One has 5 million records per partition and a partition size of 700 MB; the other has 10 million records per partition and a partition size of 200 MB.
For this second datasource we are currently wondering whether we should reduce the number of records per partition to 5 million as well, even though that would lead to even smaller partition sizes.

> The visualizations look very nice. May I ask how they were made? Everybody here would like to be able to compute these histograms themselves.

Thank you, I'll have a more extensive blog post about the new balancing algorithm coming out soon. To keep it short: I have some R code that pulls segment data from the coordinator endpoints and generates a bunch of still plots using ggplot2, which I then combine using ImageMagick.
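
If you want a rough version of those histograms without R, something along these lines should get you the raw numbers (sketch only; I'm assuming the ?full form of the servers endpoint returns a per-server segments map, which may differ across versions):

    # Sketch: count segments per historical and plot a simple bar chart.
    # Assumes /druid/coordinator/v1/servers?full returns each server with a
    # "segments" map; adjust if your Druid version returns a different shape.
    import requests
    import matplotlib.pyplot as plt

    COORDINATOR = "http://coordinator-host:8081"  # placeholder

    servers = requests.get(COORDINATOR + "/druid/coordinator/v1/servers?full").json()
    historicals = [s for s in servers if s.get("type") == "historical"]

    hosts = [s["host"] for s in historicals]
    counts = [len(s.get("segments", {})) for s in historicals]

    plt.bar(range(len(hosts)), counts)
    plt.xticks(range(len(hosts)), hosts, rotation=90)
    plt.ylabel("segments loaded")
    plt.title("Segments per historical")
    plt.tight_layout()
    plt.savefig("segments_per_historical.png")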

> Maybe a dumb question to ask, but I didn't understand the following statement in the pull request:
>
>> Ensuring the unit of computation is constant is also good practice to reduce variance in segment scan times across the cluster.

How can "ensuring the unit of computation is constant" be achieved
exactly? Does it mean that we should ensure that all segments of all
datasources and all segment granularities have the same number of records
per segment?

It depends largely on the nature of your data as well as the types of queries you are running. You can monitor segment scan times across the various datasources to tune your shardSpec.

> We have two datasources. One has 5 million records per partition and a partition size of 700 MB; the other has 10 million records per partition and a partition size of 200 MB.
> For this second datasource we are currently wondering whether we should reduce the number of records per partition to 5 million as well, even though that would lead to even smaller partition sizes.

Number of records will affect scan time, but so will the number and types of metrics you query. You might want to shard more, unless your scan times are comparable already.
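
If you do end up sharding more, the knob is the partitionsSpec in the ingestion spec; sketched below as a Python dict because the exact field name is version-dependent (older hashed specs used targetPartitionSize, newer releases call it targetRowsPerSegment), so please verify against your Druid docs:

    # Sketch of a hashed partitionsSpec aiming for ~5 million rows per segment.
    # Field names are version-dependent: older releases used "targetPartitionSize",
    # newer ones use "targetRowsPerSegment" -- check your Druid docs.
    partitions_spec = {
        "type": "hashed",
        "targetPartitionSize": 5_000_000,  # rows-per-segment target (older name)
        # "targetRowsPerSegment": 5_000_000,  # equivalent knob in newer releases
    }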