Tranquility task failing while ingesting data

I'm getting this exception after some data has been ingested.

First, this occurs:

Emitting alert: [anomaly] Loss of Druid redundancy: user-persona

Then

com.metamx.tranquility.beam.DefunctBeamException: Tasks are all gone: index_realtime_user-persona_2016-12-05T13:00:00.000Z_0_0

This is a single-machine setup.

More errors like this keep appearing. Now I'm seeing this:

c.m.tranquility.beam.ClusteredBeam - Emitting alert: [anomaly] Failed to propagate events: druid:overlord/user-persona
{
"eventCount" : 1,
"timestamp" : "2016-12-13T10:00:00.000Z",
"beams" : "MergingPartitioningBeam(DruidBeam(interval = 2016-12-13T10:00:00.000Z/2016-12-13T11:00:00.000Z, partition = 0, tasks = [index_realtime_user-persona_2016-12-13T10:00:00.000Z_0_0/user-persona-010-0000-0000]))"
}
com.twitter.finagle.GlobalRequestTimeoutException: exceeded 1.minutes+30.seconds to druidTask!druid:overlord!index_realtime_user-persona_2016-12-13T10:00:00.000Z_0_0 while waiting for a response for the request, including retries (if applicable)
at com.twitter.finagle.NoStacktrace(Unknown Source) ~[na:na]
2016-12-13 10:49:36,644 [Hashed wheel timer #1] INFO c.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2016-12-13T10:49:36.644Z","service":"tranquility","host":"localhost","severity":"anomaly","description":"Failed to propagate events: druid:overlord/user-persona","data":{"exceptionType":"com.twitter.finagle.GlobalRequestTimeoutException","exceptionStackTrace":"com.twitter.finagle.GlobalRequestTimeoutException: exceeded 1.minutes+30.seconds to druidTask!druid:overlord!index_realtime_user-persona_2016-12-13T10:00:00.000Z_0_0 while waiting for a response for the request, including retries (if applicable)\n\tat com.twitter.finagle.NoStacktrace(Unknown Source)\n","timestamp":"2016-12-13T10:00:00.000Z","beams":"MergingPartitioningBeam(DruidBeam(interval = 2016-12-13T10:00:00.000Z/2016-12-13T11:00:00.000Z, partition = 0, tasks = [index_realtime_user-persona_2016-12-13T10:00:00.000Z_0_0/user-persona-010-0000-0000]))","eventCount":1,"exceptionMessage":"exceeded 1.minutes+30.seconds to druidTask!druid:overlord!index_realtime_user-persona_2016-12-13T10:00:00.000Z_0_0 while waiting for a response for the request, including retries (if applicable)"}}]


Hey Shantanu,

Were you able to ingest (and query) any data before this error happened?

Could you post your overlord logs, indexing task logs, and the full Tranquility logs?

Yes, I was able to query data before this error occurred.
Please find logs attached.

druid-indexing-task2.log (716 KB)

tran.log (187 KB)

Hey Shantanu,

The problem is that your indexing task is running out of memory (JVM heap). You can either:

  • increase the heap size for the peon processes: raise the -Xmx value in the druid.indexer.runner.javaOpts property of the MiddleManager's runtime.properties, or
  • decrease maxRowsInMemory in the Tranquility tuningConfig
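For reference, the two changes might look something like this. The heap size, row count, and persist period below are illustrative assumptions, not recommendations; tune them to your machine:

```properties
# MiddleManager runtime.properties: raise -Xmx so each peon
# (indexing task) gets more heap. 2g is an illustrative value.
druid.indexer.runner.javaOpts=-server -Xmx2g
```

```json
"tuningConfig" : {
  "type" : "realtime",
  "maxRowsInMemory" : 50000,
  "intermediatePersistPeriod" : "PT10M"
}
```

Lowering maxRowsInMemory trades heap usage for more frequent persists to disk, so it can slow ingestion slightly but keeps the task within its heap.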

Ohh, that was a noob mistake. I didn't notice the out-of-heap-memory error.

But then the question is: I previously loaded a data source whose data was twice as big as this one, and it still loaded completely. It also had more columns and more metrics. This one has only 3 columns and a single metric, yet it runs out of memory! How can that be explained?

I will try your suggestions. Thanks a lot.

Hmm, indexing tasks flush to disk based both on the number of rows ingested and periodically every few minutes, so if your ingestion rate was higher, the task may not have had a chance to flush. Also, the cardinality of your data is a significant factor in memory usage and segment size.
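The interaction between ingestion rate and the two flush triggers can be sketched with a toy model (an illustrative simplification, not Druid's actual implementation; the numbers are made up):

```python
# Toy model of a realtime task's in-heap row buffer: rows are persisted
# to disk when either the row limit (maxRowsInMemory) or the periodic
# persist timer fires, whichever comes first. Peak heap usage therefore
# grows with ingestion rate until the row limit caps it.

def peak_rows_in_heap(rows_per_sec, max_rows_in_memory, persist_period_sec):
    """Peak number of rows buffered in heap before a persist triggers."""
    rows_before_periodic_persist = rows_per_sec * persist_period_sec
    return min(max_rows_in_memory, rows_before_periodic_persist)

# Slow stream: the periodic persist fires first, so the heap stays small.
slow = peak_rows_in_heap(rows_per_sec=10,
                         max_rows_in_memory=75_000,
                         persist_period_sec=600)

# Fast stream: the row limit is what bounds heap usage.
fast = peak_rows_in_heap(rows_per_sec=5_000,
                         max_rows_in_memory=75_000,
                         persist_period_sec=600)

print(slow, fast)  # 6000 75000
```

So a dataset with fewer columns can still exhaust the heap if its events arrive faster, and high-cardinality dimension values further inflate the per-row memory footprint.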