Druid broker seems to be occasionally missing a segment

Hi,

I’m looking for some help in debugging an issue I seem to be having with our broker on a simple cluster.

A full log of configuration/commands/etc is pasted in a gist at https://gist.github.com/qix/7b8122539c9b4969c35b ; let me know if anything else is particularly relevant.

The issue we’re having is that around half the queries to our Druid broker return significantly lower results. I’m assuming that it is failing to query a single segment, but since our cluster is so simple (1 realtime + 1 historical) that seems strange. The problem also reproduces fairly reliably.

Any idea what could be going on? Where should I look next?

Regards,

Josh (qixxiq in #druid-dev)

Josh,

That is very... concerning.

Can you try setting day granularity and see if you can track down if
it is happening on a specific day (like the most recent) or if it's
always happening on an older day, or if there is any sort of pattern
there?

Also, could you try turning off caching? Add:

"context": { "useCache": false, "populateCache": false }

to your query and see if you still get discrepancies?
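
For anyone following along, here is a rough sketch of both checks in
one script: it builds a timeseries query with day granularity and the
cache disabled, then compares repeated runs to find the buckets that
change. The broker URL, data source name, interval and the simple
count aggregation are all placeholders I made up, so swap in your own.

```python
import json
import urllib.request

# Hypothetical values, not taken from the thread: adjust to your cluster.
BROKER_URL = "http://localhost:8082/druid/v2/"
DATA_SOURCE = "events"
INTERVAL = "2014-01-01/2014-02-01"


def build_query(granularity="day"):
    """Build a timeseries query with caching switched off."""
    return {
        "queryType": "timeseries",
        "dataSource": DATA_SOURCE,
        "granularity": granularity,
        "intervals": [INTERVAL],
        # Placeholder aggregation; use whichever ones show the discrepancy.
        "aggregations": [{"type": "count", "name": "rows"}],
        "context": {"useCache": False, "populateCache": False},
    }


def run_query(query, url=BROKER_URL):
    """POST the query to the broker and return the decoded JSON result."""
    request = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def unstable_buckets(runs):
    """Return timestamps whose result differs, or is absent, in some runs."""
    seen = {}
    for index, run in enumerate(runs):
        for row in run:
            seen.setdefault(row["timestamp"], {})[index] = row["result"]
    unstable = []
    for timestamp in sorted(seen):
        per_run = seen[timestamp]
        # Serialise each run's result so dicts can be compared as strings;
        # a bucket missing from a run shows up as "null".
        values = {
            json.dumps(per_run.get(index), sort_keys=True)
            for index in range(len(runs))
        }
        if len(values) > 1:
            unstable.append(timestamp)
    return unstable
```

Calling run_query(build_query("day")) ten or so times and passing the
list of results to unstable_buckets should point at the day or days
that flip. Re-running with build_query("hour") over just that day
narrows it down further.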

--Eric

Hi Eric,

Turning off caching didn’t seem to make any difference, but I did manage to isolate it to a single day. With ‘hour’ granularity on that day, the hours from 5am through to midnight are missing about half the time. This would explain why the counts change that much.

The day I isolated it to (which might explain things) also happens to be the day where we did a large import. Since the data was old I configured rejectionPolicy=none, and then afterwards set the rejectionPolicy back to serverTime. Finally I replaced the firehose with one using a time interval in the past, which I believed would cause the realtime node to build its segments and upload them.

I’ll look into the proper bulk indexing service in future, but I’d like to somehow solve the problem now.

Is it possible we have two segment files being served for the same day/partition/etc, with one having missing data?

How is that even possible with only one historical node?

Is there an easy way to list/delete the bad segment?

Regards,

Josh

Ok, my guess as to what is happening is that your realtime node is
still serving the "rejectionPolicy": "none" segment and your
historical is serving a different one, but they have the same version.
You can probably verify that by looking in the coordinator console and
filtering the interval down to the day in question.

If that is indeed the case, I think that something might be getting in
the way of the hand-off from the real-time. If you go into the DB and
set the "used" flag on the segment served from the historical to
false, hopefully it'll clean itself up. Can you try that and see?
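
A rough sketch of that DB step, assuming the default metadata table
name druid_segments with id, dataSource and used columns. It uses
Python's sqlite3 purely as a stand-in so the snippet runs on its own;
a real metadata store is typically MySQL or PostgreSQL, where the
driver and the placeholder style ("?" here) will differ. The LIKE
match on the day also assumes the interval timestamps appear in the
segment id, so check the ids it returns before changing anything.

```python
import sqlite3  # stand-in driver; swap for your metadata store's client


def list_segments(conn, data_source, day):
    """List (id, used) for segments whose id mentions the given day.

    Assumes ids embed the interval timestamps, so a segment that merely
    ends on that day will match too.
    """
    cursor = conn.cursor()
    cursor.execute(
        "SELECT id, used FROM druid_segments "
        "WHERE dataSource = ? AND id LIKE ? ORDER BY id",
        (data_source, "%" + day + "%"),
    )
    return cursor.fetchall()


def mark_segment_unused(conn, segment_id):
    """Set used = false for one segment; returns the number of rows changed."""
    cursor = conn.cursor()
    cursor.execute(
        "UPDATE druid_segments SET used = ? WHERE id = ?",
        (False, segment_id),
    )
    conn.commit()
    return cursor.rowcount
```

Listing first and then flipping a single id keeps the change small
and easy to undo (set used back to true) if the guess turns out to
be wrong.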

Playing with rejection policies to load up old data is definitely not
something that I think we've seen before. I'm remembering the IRC
conversation where I told you that I thought it *should* work, but I
guess we are learning that there are bugs down that road. Sorry about
that.

--Eric