MessageDroppedException is thrown when only one replicant is down

I was reading https://groups.google.com/forum/#!topic/druid-user/4OVoJPQgGdE, where
Gian mentioned that “You get MDEs if all tasks fail to receive the event.”

My setup: Druid 0.8.3, Tranquility 0.7.3.
I set “task.replicants”: “2”, and I can see two tasks running for each partition, say partition_0_0 and partition_0_1. I have seen the following behavior:

  • When I killed one middle manager (the one running partition_0_1), MessageDroppedException was thrown. After 15 minutes the task for partition_0_1 was marked FAILED, and there were no more MessageDroppedExceptions.
  • After partition_0_1 FAILED, there seemed to be no data loss for that 15-minute period, and indexing kept running with a single replicant.
  • When I then killed the middle manager for partition_0_0, both replicants were down; MessageDroppedException was thrown and messages actually started to be lost.

My questions are:
1. Why was an MDE thrown even though a replicant was still available?
2. How can I properly distinguish a message that is "REALLY" dropped from one where just one replicant is unavailable?

Code:

    public void pushMessage(Map<String, Object> message, AtomicLong failedCounter, AtomicLong droppedMessages) {
        tranquilizer.send(message).addEventListener(
            new FutureEventListener<BoxedUnit>() {
                @Override
                public void onSuccess(BoxedUnit value) {
                }

                @Override
                public void onFailure(Throwable e) {
                    if (e instanceof MessageDroppedException) {
                        // Only when all replicas of a partition failed to index
                        // then a message is considered to be dropped
                        droppedMessages.incrementAndGet();
                    }

                    // As long as one replica is available, indexing should be available;
                    // increment failedCounter as an indicator for every failed send
                    failedCounter.incrementAndGet();
                }
            });
    }
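
One way to keep the two counters disjoint, so that "dropped" and "other failure" can be graphed separately, is sketched below. It is a minimal, self-contained illustration only: `FailureTally` and its methods are hypothetical names, and the nested `MessageDroppedException` is a stand-in so the snippet compiles without Tranquility on the classpath (in real code it would be Tranquility's own exception class). Given the behavior described above, an MDE by itself does not prove that every replica missed the event, so the "dropped" count should be read as an upper bound.

```java
import java.util.concurrent.atomic.AtomicLong;

public class FailureTally {
    /** Stand-in for Tranquility's MessageDroppedException (assumption: swap in the real import). */
    public static class MessageDroppedException extends RuntimeException {
        public MessageDroppedException(String message) {
            super(message);
        }
    }

    private final AtomicLong dropped = new AtomicLong();
    private final AtomicLong otherFailures = new AtomicLong();

    /** Call from onFailure(Throwable); each failure lands in exactly one bucket. */
    public void record(Throwable e) {
        if (e instanceof MessageDroppedException) {
            dropped.incrementAndGet();
        } else {
            otherFailures.incrementAndGet();
        }
    }

    public long dropped() {
        return dropped.get();
    }

    public long otherFailures() {
        return otherFailures.get();
    }

    /** Total failed sends, equivalent to the single failedCounter in the snippet above. */
    public long total() {
        return dropped.get() + otherFailures.get();
    }
}
```

Inside the listener this would be used as `tally.record(e)` in `onFailure`, with `tally.total()` playing the role of `failedCounter`.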

Any thoughts would be appreciated!

Hi Shuai, I don’t believe there is currently a way, with the existing metrics, to distinguish messages that are really dropped from failures caused by one replicant being down. I think it would be a useful feature to have, though. Is your metrics collector able to locate tasks and their replicas right now?

Thanks Fangjin,

“Is your metrics collector able to locate tasks and their replicas right now?”
For now, all the metrics I collect come from querying the ZooKeeper /druid path, where I can get the number of running tasks. I could probably query MySQL for failed tasks, but that seems a bit hacky. Is there an HTTP GET API (something I can curl) for getting the number of tasks, or the number of failed tasks, for a certain segment?
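
On the HTTP question: the Overlord console appears to populate its task tables from endpoints such as `/druid/indexer/v1/runningTasks` and `/druid/indexer/v1/completeTasks`, which can be fetched with a plain GET. The sketch below is a rough illustration, not a confirmed recipe. It assumes the response is a JSON array with one object per task and a `statusCode` field on completed tasks; that shape, the `overlord-host:8090` address, and the `TaskStatusCount` class are all assumptions to verify against your own cluster. The regex count is a stand-in for a real JSON parser.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TaskStatusCount {
    // Matches "statusCode": "FAILED" with optional whitespace around the colon.
    private static final Pattern FAILED =
        Pattern.compile("\"statusCode\"\\s*:\\s*\"FAILED\"");

    /** Counts entries marked FAILED in the raw JSON body (assumed response shape). */
    public static int countFailed(String json) {
        Matcher m = FAILED.matcher(json);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    /** Issues a plain GET and returns the response body as text (Java 11+ HttpClient). */
    public static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical Overlord address; pass the real one as the first argument.
        String base = args.length > 0 ? args[0] : "http://overlord-host:8090";
        String body = fetch(base + "/druid/indexer/v1/completeTasks");
        System.out.println("failed tasks: " + countFailed(body));
    }
}
```

This counts failed tasks overall rather than per segment; narrowing it down would mean filtering on the task id, which is not shown here.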

As far as I understand the Druid metric emitter, Druid requires an endpoint to which it posts metrics as events. Is that correct?

Hi Shuai, have you had a chance to read:
http://druid.io/docs/0.9.0/operations/metrics.html

Druid emits metrics over HTTP, to stdout, or, using extensions, to a variety of different collection services.