Troubleshooting a real-time task issue

Hi all,

I am experiencing a weird issue with one of my real-time pipelines involving Druid.

I am using Kafka + Storm + Druid as a real-time pipeline. The process is simple: I parse logs coming from Kafka in a Storm topology and send a set of dimensions and metrics directly to Druid using Tranquility (no aggregation is performed in Storm).

For some reason, we ended up with a crazy number for our revenue metric for hour 20 today ($1,435,636,203,520.00, which is obviously wrong). We have one replica for this real-time task, and our index and segment granularity is set to one hour.

I’m guessing the problem came from one of the middle managers aggregating events coming from Storm, but I don’t see anything useful in my middle manager logs that could help me troubleshoot this issue.

Do you have any recommendations or ideas about what could have happened? Have you ever heard of such a problem before? Is it possible to pinpoint an issue with middle manager aggregation from the Druid logs?

I have attached my real-time task specification; I am using Druid 0.7.0. I hope we can find out what happened, as Druid is becoming very critical for our real-time data.

Guillaume

real_time_task.json (6.57 KB)

Here are the logs of the two real-time tasks involved in this issue (one replica).

index_realtime_ad_events_2015-06-29T20:00:00.000-07:00_0_0_lhekbpok (2.62 MB)

index_realtime_ad_events_2015-06-29T20:00:00.000-07:00_0_1_ioikfkde (2.64 MB)

Hi Torche,

The task logs seem fine.

It looks like it might be a data issue, where the data being ingested is itself incorrect.

If you happen to have your raw data, can you validate your input data?

Hi Nishant,

I can confirm that the data was properly logged. My raw data looks fine; I pulled the logs involved in this issue and there is nothing wrong with them.

We also have another Storm topology that parses the same logs, performs aggregations on the same metrics, and updates those metrics in MySQL. We didn’t have any problem with the revenue in MySQL.

Now, this MySQL topology performs the revenue aggregation using BigDecimal variables directly in Storm.

For our Druid topology it’s a bit different. We just parse every single log (event), load the revenue into a BigDecimal variable for each event, serialize it, and send it to Druid for aggregation.

I know that Druid uses doubles for the aggregation part; I looked at the code, and it seems like Druid deserializes my BigDecimal revenue with a call to Double.parseDouble(String).
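
To make that concrete, here is roughly what the path looks like for one event (a simplified sketch, not the actual topology code):

import java.math.BigDecimal;

public class RevenueRoundTrip {
    public static void main(String[] args) {
        // In the Storm bolt: the revenue parsed from the raw log line.
        BigDecimal revenue = new BigDecimal("12.3456");

        // What we send: the BigDecimal serialized as a string.
        String serialized = revenue.toString();          // "12.3456"

        // What Druid appears to do on its side: parse it back as a double.
        double parsed = Double.parseDouble(serialized);

        // Aggregation then happens on doubles (doubleSum-style).
        double hourlyTotal = 0.0;
        hourlyTotal += parsed;

        System.out.println(serialized + " -> " + hourlyTotal);
    }
}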

I’m wondering if there could be an issue here. An overflow problem maybe?

Hi Torche, is it possible you’re sending any numbers to Druid as strings with commas instead of dots as the decimal mark, like “10,34”? I think that’d be interpreted by Druid as one thousand thirty-four instead of ten and thirty-four hundredths.
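
One quick way to check on your side (just a standalone sketch, nothing Druid-specific): BigDecimal.toString() always uses a dot, but a locale-sensitive formatter somewhere in the pipeline could introduce a comma.

import java.math.BigDecimal;
import java.text.NumberFormat;
import java.util.Locale;

public class DecimalMarkCheck {
    public static void main(String[] args) {
        BigDecimal revenue = new BigDecimal("10.34");

        // BigDecimal.toString() always uses '.' as the decimal mark.
        System.out.println(revenue.toString());                  // prints 10.34

        // A locale-sensitive formatter can swap in a comma instead.
        NumberFormat french = NumberFormat.getInstance(Locale.FRANCE);
        System.out.println(french.format(revenue));              // prints 10,34
    }
}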

I think in general it’s best to convert the BigDecimals to doubles before sending them off to Druid. At least that way, Druid won’t be parsing them as strings, and so things will be unambiguous. If for some reason that doesn’t help, I think the best thing to do would be to run some Druid queries to try to nail down what slice of the data is messed up. Some topNs should be useful here. It’s possible that there’s a single bogus event, or a small number of them. If you can find one and correlate it to a row in your input data, that’d help figure out what’s going wrong with the processing or parsing.
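
Something along these lines on the sending side, assuming you build each event as a Map before handing it to Tranquility (the field names here are made up):

import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;

public class EventBuilder {
    // Builds the map handed to Tranquility for one parsed log line.
    static Map<String, Object> buildEvent(long timestampMillis, String advertiser, BigDecimal revenue) {
        Map<String, Object> event = new HashMap<>();
        event.put("timestamp", timestampMillis);
        event.put("advertiser", advertiser);   // example dimension

        // Send a primitive double rather than the BigDecimal itself,
        // so nothing gets serialized as a string and re-parsed by Druid.
        event.put("revenue", revenue.doubleValue());
        return event;
    }
}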

Thanks for your answer, Gian.

I think you are right; I shouldn’t send BigDecimals directly to Druid. I don’t think I’m sending commas instead of dots, though. I’m going to try converting my BigDecimals to doubles and see how it goes. I will keep you updated if I see any problems.