MessageDroppedException: Causes?

All,

We are running Tranquility with HOUR granularity. At the top of the hour, presumably when a new RT task spins up, we receive an increased number of MessageDroppedExceptions (MDEs). The timestamps on the events appear to be well within the windowPeriod. We have replication = 3, so three tasks actually kick off every hour.

A couple questions:

  1. Do we receive a MessageDroppedException if any one of those tasks fails to receive the event, or only if all tasks fail to receive it?

  2. Are there any other known causes of MDEs?

  3. Is there any way to get cause information out of the MDE? (The message doesn’t seem to be very useful.)

thanks again,

brian

Hey Brian,

You get MDEs if all tasks fail to receive the event.

The two most common causes are: (1) events outside the windowPeriod, (2) task or overlord communications not succeeding within your firehoseRetryPeriod or indexRetryPeriod (you should generally see errors in the logs if this is happening).

Is it possible that (2) is happening because your tasks are taking a long time to start up?
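
For cause (1), a quick way to rule it out is to check event timestamps against the window yourself. Below is a minimal Python sketch, assuming the window is symmetric around the current time (an event is accepted when its timestamp is within plus or minus windowPeriod of "now"). The function name `within_window` and the 10-minute value are hypothetical and only for illustration; this is not Tranquility's own code, and the exact boundary handling there may differ.

```python
from datetime import datetime, timedelta, timezone


def within_window(event_time, now, window_period):
    """Return True if event_time is within +/- window_period of now.

    Assumption: the window is symmetric and inclusive at the edges.
    """
    return abs(now - event_time) <= window_period


# Hypothetical values: "now" is just after the top of the hour,
# and windowPeriod is assumed to be 10 minutes (PT10M).
now = datetime(2015, 1, 1, 12, 0, 30, tzinfo=timezone.utc)
window = timedelta(minutes=10)

print(within_window(now - timedelta(minutes=5), now, window))   # True
print(within_window(now - timedelta(minutes=15), now, window))  # False
```

If events pass a check like this and are still dropped at the top of the hour, cause (2) becomes the more likely explanation.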

OK, we fixed this. We use the taskWarmingPeriod. (great feature!)
That setting starts the tasks ahead of schedule so they are fired up and ready to go by the time we get there.

We also had to make a small tweak to the cycleBucket to allow for overlapping tasks (given that we were firing them up early). We didn’t want the new tasks sitting in the pending queue.
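
To illustrate the timing, here is a rough Python sketch of how the warming period shifts task startup with HOUR granularity. It assumes the task for the next hour is created one warming period before the hour boundary; the helper names and the 5-minute value are hypothetical, and this is not how Tranquility itself is implemented.

```python
from datetime import datetime, timedelta, timezone


def next_hour_boundary(now):
    """Start of the next hourly segment after `now`."""
    return now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)


def task_start_time(now, warming_period):
    """Assumed startup time: one warming period before the next boundary."""
    return next_hour_boundary(now) - warming_period


now = datetime(2015, 1, 1, 12, 40, tzinfo=timezone.utc)
warming = timedelta(minutes=5)  # hypothetical taskWarmingPeriod of PT5M

start = task_start_time(now, warming)
print(start.isoformat())  # 2015-01-01T12:55:00+00:00
```

In this example the 13:00 tasks would start at 12:55 while the 12:00 tasks are still running, which is the overlap we had to allow for.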

fwiw,

https://github.com/druid-io/tranquility/pull/169

-brian

Cool, good to know. That’s what the feature is there for :)

Will take a look at that PR in a bit.