Bug in index task with ingestSegment firehose

We are experiencing a strange bug with Druid 0.10, though I believe it was also present before.

Our Kafka indexing service produces “way too many segments” on an hourly basis, and every night we run additional tasks to reindex/compact the previous day’s segments. We don’t use hadoop indexing, and so we simply submit an “index” task to the overlord with an “ingestSegment” firehose defined that pulls in all the previous day’s segments. Most of this works fine.

We have some rows that use multi-valued fields to keep a tag list. For these, we write a boolean flag for the presence of any tags, and a multi valued dimension for the list of tags. For example, the input rows look like this:

{

“version” : “v1”,

“timestamp” : “2018-01-06T00:00:00.000Z”,

“event” : {

“has_tags” : “0”,

“tags” : ,

“event_count” : 1

}

},{

“version” : “v1”,

“timestamp” : “2018-01-06T00:00:00.000Z”,

“event” : {

“has_tags” : “1”,

“tags” : [“tag1”, “tag2”]

“event_count” : 1

}

}

After we have reindexed, the compacted segments contain “impossible” rows in which the boolean dimension is false, but some tags from other rows have somehow crept into the wrong rows. For example:

{

“version” : “v1”,

“timestamp” : “2018-01-06T00:00:00.000Z”,

“event” : {

“has_tags” : “0”,

“tags” : “tag1”

“event_count” : 1

}

}

So far, I have only observed this behavior with multi-valued dimension fields.

Has anyone seen something like this, or bugs in this area? We have tried messing around with the index specs we give the jobs to no avail. Is there a bad spec we could give that would result in this behavior? Possibly, we can get a test case together to reproduce.

Thanks for any tips.

Hi Max,

This sounds like https://github.com/druid-io/druid/pull/5012 which was a bug in 0.10.x (not 0.9.x and not 0.11.x). Please try upgrading to 0.11.0 and seeing if that helps.

That looks like it!

Actually, I misspoke and we are running 0.11.0 RC which appears to not have this fix yet. We can update to the final release.