Both Druid and Hadoop are getting data from Kafka, and both consume hundreds of topics reliably. This particular topic was also fine until we added a new dimension that holds URL values. The topic already had around 100 dimensions. Here is what we have tried so far:
We pulled all URLs for a particular minute from both Druid and Hadoop, and the count of nulls in Druid was much higher than in Hadoop. The counts for the remaining values were correspondingly lower in Druid, so the total count matched. We then picked out all URLs that were present in Hadoop but missing from Druid in this set, to see whether there was something special about those URLs.
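The comparison step can be sketched as a plain set difference; the URL values below are made-up placeholders, not our real exports:

```python
# Hypothetical sketch of the per-minute comparison. In reality each set
# would be loaded from a Druid query result and a Hadoop job output.
hadoop_urls = {"http://a.example/x", "http://b.example/y", "http://c.example/z"}
druid_urls = {"http://a.example/x"}  # the rest showed up as null in Druid

# URLs present in Hadoop but absent from Druid for this minute
missing_in_druid = hadoop_urls - druid_urls
```

These `missing_in_druid` values are the set we then searched for elsewhere in Druid and replayed through Kafka.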
It turns out that many of these URLs do appear as dimension values in Druid at other points in time, just not in this particular minute. So nothing seems special about the URLs themselves; something is special about this batch. We then produced all of these URLs through Kafka to a separate Druid real-time index task to see whether the null count stayed the same. Unfortunately the problem is not that simple: the new datasource indexed all of the URLs successfully. At this moment I am not sure whether the issue is in Tranquility (less likely), in the index task, or at merge time. The problem might be related to large values in this particular dimension, the overall number of dimensions, or the message size.
Also, we are not really sure whether there is a (practical) limit on the size of a dimension value or of the JSON message.
Our next step is to consume this topic into a different datasource with fewer dimensions, to rule out message size or total dimension count as the cause. After that we will narrow down to the specific part of the Druid indexing pipeline.
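For that experiment, the trimmed datasource would keep the URL dimension but drop most of the others via the `dimensionsSpec`. A rough sketch of what we have in mind (field names are illustrative, not our real schema):

```json
{
  "dataSchema": {
    "dataSource": "topic_debug_few_dims",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "timestamp", "format": "auto" },
        "dimensionsSpec": {
          "dimensions": ["url", "country", "device"]
        }
      }
    }
  }
}
```

If the nulls disappear with this spec, dimension count or message size is implicated; if they persist, the URL dimension itself is the likelier culprit.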
Any suggestions on how to approach debugging this are welcome.