Segment Data Counts Doubled

Hi,

We are running Druid 0.6.171 and seeing doubled counts in one of our segments. The realtime node on our cluster runs on Java 1.7.0_60 and is started with the following command:

java -Xmx4096m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Ddruid.realtime.specFile=/usr/local/druid/config/realtime/volagg_kafka8_realtime_task.json -classpath /usr/local/druid/lib/*:/usr/local/druid/config/realtime io.druid.cli.Main server realtime

Other segments' data looks good, and we were able to re-parse the file that (mostly) produced the data in the bad segment on a separate cluster with correct counts, so we know it is not our data. From the log, it looks as if maxRowsInMemory was hit and the same data was flushed twice. Please advise.
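
For reference, the relevant pieces of our realtime spec look roughly like this (trimmed and from memory, so names and values are approximate; aggregators, parser, and Kafka consumer properties omitted; the 500K maxRowsInMemory matches the flush sizes we see in the log):

[
  {
    "schema": {
      "dataSource": "volagg",
      "shardSpec": { "type": "none" }
    },
    "config": {
      "maxRowsInMemory": 500000,
      "intermediatePersistPeriod": "PT10m"
    },
    "firehose": {
      "type": "kafka-0.8",
      "feed": "volagg_events"
    },
    "plumber": {
      "type": "realtime",
      "windowPeriod": "PT10m",
      "segmentGranularity": "hour"
    }
  }
]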

-drew

Realtime log pertaining to the segment

Hi Drew, where do you see the double flush happening? I looked through the logs and didn’t see it but might’ve missed it.

It was something of a leap, but the flushes of 500K rows within about a second seemed suspicious. I think we have solved the problem: it appears to have been an under-resourced realtime node. I doubled the heap size and added a processing thread, and now the counts are consistently correct, whereas before we had periods where they were generally larger than expected. I am surprised, however, that Druid would silently corrupt data without printing some kind of warning in the logs. Is it possible we just weren't seeing it because we didn't have the -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager option set? Also, are there key JVM metrics to watch, such as GC pause times and heap usage, to make sure we don't run into this again?
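
For anyone following along, the adjusted startup command looks roughly like this: heap doubled from 4 GB to 8 GB, one extra processing thread via the druid.processing.numThreads property (shown here as a -D, though it can equally live in runtime.properties), and basic HotSpot GC logging added so we can watch pause times and heap usage. The GC log path and the thread count of 2 are just what we happened to pick.

java -Xmx8192m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/druid/realtime-gc.log -Ddruid.processing.numThreads=2 -Ddruid.realtime.specFile=/usr/local/druid/config/realtime/volagg_kafka8_realtime_task.json -classpath /usr/local/druid/lib/*:/usr/local/druid/config/realtime io.druid.cli.Main server realtime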

-drew

Drew,

That is quite concerning actually. From looking at the logs I think
some lines are missing. That is, I see the incremental persist of #1
and some of the persist of #3, but it looks like #2 is not included in
the logs (when it does the persist-n-merge it says it is combining 4
segments). Given that it's not all there, it's hard to figure out
exactly what might have caused this.

You said that another process was able to handle the same "file",
but this node appears to be connected to a Kafka topic rather than
a file. Are you piping the file into a Kafka topic and ingesting
from there? Did the other cluster read the exact same Kafka topic?
Part of me is wondering whether something started pushing the file
into Kafka, got killed, and then started again, leaving duplicate
data in that one topic.
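
If the topic were still retained, one rough way to check would be to
dump it and look for repeated messages, something like this (topic
name and ZooKeeper address made up, and this only catches
byte-identical events):

kafka-console-consumer.sh --zookeeper zk1:2181 --topic volagg_events --from-beginning | sort | uniq -d | head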

--Eric

Yes, I grepped the logs for only the 2015-06-08T23:27 timestamp, so there may well be lines missing. Our process works by sending files to a parser, which generates events on a Kafka topic for Druid to ingest. I was able to track down the file that generated the majority of the data in the segment where counts were duplicated, spin up another parser and Druid cluster with similar settings, and re-parse it; the counts came out as expected. I don't think we pushed duplicate data into Kafka, as there were no errors from our parser. The periods where data was being inflated lined up roughly with periods of high traffic on our Druid cluster, and once I adjusted our realtime process settings (more threads and heap), the counts started lining up more or less as expected.

-drew

Hrm, the topics aren't still around such that you could verify that
there isn't duplicate data in them, correct?

This just shouldn't happen... Is it possible to share the full log
file from when the indexer started on the segment to when it finished?

The topics and logs have rolled already, unfortunately.

-drew

If it is easy to reproduce, or if you see this again, please send
along the logs. This is the first report of this sort I've heard.
Well, I have heard something similar, but that turned out to be
multiple machines ingesting the same topic with differing
shardSpecs. It looks like you are using the NoneShardSpec, though,
which shouldn't have that problem...
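
For reference, the setup that avoids that problem is for nodes that
split a topic (same Kafka consumer group) to each get their own
linear partitionNum in the spec's schema section, roughly

  "shardSpec": { "type": "linear", "partitionNum": 0 }

on one node and partitionNum 1 on the next, while replicas that are
meant to see the exact same data keep an identical shardSpec.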

--Eric