Kafka Indexing Tasks OOM

Hi there,

I'm having memory issues trying to recover from a long Druid outage. A fair bit of unprocessed data has piled up in my Kafka topic, and with the default configuration I get out-of-memory errors and my indexing tasks fail.

I have tried to mitigate this by setting maxRowsInMemory to 50000 in the tuningConfig section of the spec, and I am trying to limit the amount of data pulled by the Kafka consumer with "max.partition.fetch.bytes": "134217728" (I have 8 partitions). In my middle manager config I have:

druid.indexer.runner.javaOpts=-server -Xmx3g -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -Dhadoop.mapreduce.job.user.classpath.first=true
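
For context, the relevant pieces of my supervisor spec look roughly like this (everything else is omitted, and the topic name and bootstrap.servers shown here are placeholders rather than my real values):

"tuningConfig" : {
  "type" : "kafka",
  "maxRowsInMemory" : 50000
},
"ioConfig" : {
  "topic" : "my_topic",
  "consumerProperties" : {
    "bootstrap.servers" : "kafka001:9092",
    "max.partition.fetch.bytes" : "134217728"
  }
}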

My rows have 132 columns, but even conservatively guesstimating 128 bytes per column (they require less), that still comes in at a little over 800 MB for 50k rows, which doesn't seem excessive given the 3 GB configured for the JVM.
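
(The rough arithmetic, in case I'm off somewhere: 132 columns × 128 bytes × 50,000 rows ≈ 845,000,000 bytes, i.e. a bit over 800 MB.)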

Am I missing a setting? Bad math somewhere?

Thanks,

–Ben

Hey Ben,

Could you try adding -XX:+HeapDumpOnOutOfMemoryError and looking at the heap dump when you get an OOME to see if there's anything suspicious in there?
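
Something like this appended to your existing druid.indexer.runner.javaOpts should do it (the dump path is just an example, point it anywhere with enough free disk):

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/services/druid/data2/heapdumps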

One thing that is intentional is that maxRowsInMemory is not really the max; it’s actually maxRowsInMemory * (2 + maxPendingPersists) since you might have some indexes persisting while another is filling up. So you do need to account for that too (plus some extra space for handling queries).
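
For example, assuming the default maxPendingPersists of 0, your setting of 50,000 really means up to about 50,000 × (2 + 0) = 100,000 rows held in memory at once, before counting query-serving overhead.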

Another thing that might be going on, and would be unintentional if so, is that if multiple segments get generated, maxRowsInMemory may be applied to each segment individually rather than to the task as a whole.

Hey Ben,

I think what’s going on is that the maxRowsInMemory actually applies to each generated segment, and since there’s one per time range per Kafka partition, that could potentially be a lot more rows in memory than you expect if you have a lot of Kafka partitions. I filed an issue for changing this behavior in the future: https://github.com/druid-io/druid/issues/3271

But for now, you should get the behavior you’re expecting if you divide maxRowsInMemory by the number of Kafka partitions you expect each task to handle.
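
For example, if a single task ends up reading all 8 of your partitions, dropping maxRowsInMemory from 50,000 to something around 50,000 / 8 ≈ 6,000 should give you roughly the memory behavior you were originally aiming for.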

Thank you, that was very helpful information. I greatly reduced maxRowsInMemory and that no longer appears to be a problem. However, my indexing tasks still do not succeed; now I am getting the following:

2016-07-21T02:38:50,750 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[KafkaIndexTask{id=index_kafka_business_events_test_191cda0fb891da5_nmiakddb, type=index_kafka, dataSource=business_events_test}]

java.lang.RuntimeException: java.lang.InterruptedException

    at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]

    at io.druid.indexing.common.actions.RemoteTaskActionClient.submit(RemoteTaskActionClient.java:110) ~[druid-indexing-service-0.9.1.jar:0.9.1]

    at io.druid.indexing.appenderator.ActionBasedSegmentAllocator.allocate(ActionBasedSegmentAllocator.java:52) ~[druid-indexing-service-0.9.1.jar:0.9.1]

    at io.druid.segment.realtime.appenderator.FiniteAppenderatorDriver.getSegment(FiniteAppenderatorDriver.java:341) ~[druid-server-0.9.1.jar:0.9.1]

    at io.druid.segment.realtime.appenderator.FiniteAppenderatorDriver.add(FiniteAppenderatorDriver.java:195) ~[druid-server-0.9.1.jar:0.9.1]

    at io.druid.indexing.kafka.KafkaIndexTask.run(KafkaIndexTask.java:417) ~[?:?]

    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.9.1.jar:0.9.1]

    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.9.1.jar:0.9.1]

    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_91]

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_91]

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_91]

    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]

Caused by: java.lang.InterruptedException

    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1302) ~[?:1.8.0_91]

    at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285) ~[guava-16.0.1.jar:?]

    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[guava-16.0.1.jar:?]

    at io.druid.indexing.common.actions.RemoteTaskActionClient.submit(RemoteTaskActionClient.java:105) ~[druid-indexing-service-0.9.1.jar:0.9.1]

    ... 10 more

2016-07-21T02:38:50,772 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_kafka_business_events_test_191cda0fb891da5_nmiakddb] status changed to [FAILED].

2016-07-21T02:38:50,779 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {

"id" : "index_kafka_business_events_test_191cda0fb891da5_nmiakddb",

"status" : "FAILED",

"duration" : 1954254

}

Also, despite having only 2 indexing tasks running (according to the web console) I still have a whole bunch of peon processes:

[hdfs@druid-data002 services]$ jcmd

55777 io.druid.cli.Main internal peon /services/druid/data2/task/index_kafka_business_events_test_191cda0fb891da5_njpngnnb/task.json /services/druid/data2/task/index_kafka_business_events_test_191cda0fb891da5_njpngnnb/4db94321-10c2-4165-9c2c-fb30ee7200a0/status.json

48290 io.druid.cli.Main internal peon /services/druid/data2/task/index_kafka_sor_business_event_success_v1_bc9136377a73b79_lippeobc/task.json /services/druid/data2/task/index_kafka_sor_business_event_success_v1_bc9136377a73b79_lippeobc/c1d6e5aa-bd2c-4d28-82e9-1596a5769946/status.json

47524 io.druid.cli.Main internal peon /services/druid/data2/task/index_kafka_business_events_test_dbdc87080a0fca8_phgdbnbn/task.json /services/druid/data2/task/index_kafka_business_events_test_dbdc87080a0fca8_phgdbnbn/f03b47d3-cdb4-4b61-b616-c7e9e1ed9542/status.json

51653 io.druid.cli.Main internal peon /services/druid/data2/task/index_kafka_sor_business_event_success_v1_ceccd9ff870f2b1_ekhfhomh/task.json /services/druid/data2/task/index_kafka_sor_business_event_success_v1_ceccd9ff870f2b1_ekhfhomh/e95e4dfc-ce8b-48f5-92ce-29f89cb940ce/status.json

58508 io.druid.cli.Main internal peon /services/druid/data2/task/index_kafka_sor_business_event_success_v1_6326aca7c5a400e_khnbjkig/task.json /services/druid/data2/task/index_kafka_sor_business_event_success_v1_6326aca7c5a400e_khnbjkig/8be4cf00-77bb-4f8d-ab76-a05ca04d4608/status.json

28562 io.druid.cli.Main internal peon /services/druid/data2/task/index_kafka_sor_business_event_success_v1_3bc7e5362bc9b22_odfhmkal/task.json /services/druid/data2/task/index_kafka_sor_business_event_success_v1_3bc7e5362bc9b22_odfhmkal/2fc0f8a7-e1a0-480a-ab79-c30edaeb8052/status.json

16343 io.druid.cli.Main internal peon /services/druid/data2/task/index_kafka_sor_business_event_success_v1_5b8922454a6ba85_djjhelld/task.json /services/druid/data2/task/index_kafka_sor_business_event_success_v1_5b8922454a6ba85_djjhelld/07da9147-2f39-4d53-85f6-8512c62c9796/status.json

25851 io.druid.cli.Main server historical

25852 io.druid.cli.Main server middleManager

61021 io.druid.cli.Main internal peon /services/druid/data2/task/index_kafka_sor_business_event_success_v1_590eacdd7bca8fb_gjpebjpc/task.json /services/druid/data2/task/index_kafka_sor_business_event_success_v1_590eacdd7bca8fb_gjpebjpc/24ad0c3e-525f-401d-8db2-a0fbea6ee3ad/status.json

39965 io.druid.cli.Main internal peon /services/druid/data2/task/index_kafka_sor_business_event_success_v1_1795277419cc5ef_jbbdoalg/task.json /services/druid/data2/task/index_kafka_sor_business_event_success_v1_1795277419cc5ef_jbbdoalg/19a1a0cc-920f-448c-83d5-60ce32e98035/status.json

57790 io.druid.cli.Main internal peon /services/druid/data2/task/index_kafka_business_events_test_bb8b4037b1ecca7_dihbeaff/task.json /services/druid/data2/task/index_kafka_business_events_test_bb8b4037b1ecca7_dihbeaff/0e205d2b-43a0-4d46-9333-0035be34eef1/status.json

41182 io.druid.cli.Main internal peon /services/druid/data2/task/index_kafka_business_events_test_a6e41386ef80784_obdopepl/task.json /services/druid/data2/task/index_kafka_business_events_test_a6e41386ef80784_obdopepl/a73a7ccd-2913-4302-8529-1246137c769d/status.json

This is what happened over the weekend as well. I believe that after hitting an OOM error these peon processes are not being killed.

–Ben

I tried dropping my MySQL DB, cleaning out ZooKeeper and the local data, and re-submitting the indexing specs. Unfortunately I am now getting a couple of new errors.

I see this in an indexing log:

2016-07-21T04:44:07,964 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[KafkaIndexTask{id=index_kafka_business_events_test_ad1379bb84a7460_pgeigkmd, type=index_kafka, dataSource=business_events_test}]

com.metamx.common.ISE: Missing hydrant [2] in sinkDir [/services/druid/data2/task/index_kafka_business_events_test_ad1379bb84a7460_pgeigkmd/work/persist/business_events_test_2016-07-15T15:00:00.000Z_2016-07-15T16:00:00.000Z_2016-07-21T03:34:14.236Z_13].

    at io.druid.segment.realtime.appenderator.AppenderatorImpl.bootstrapSinksFromDisk(AppenderatorImpl.java:960) ~[druid-server-0.9.1.jar:0.9.1]

    at io.druid.segment.realtime.appenderator.AppenderatorImpl.startJob(AppenderatorImpl.java:197) ~[druid-server-0.9.1.jar:0.9.1]

    at io.druid.segment.realtime.appenderator.FiniteAppenderatorDriver.startJob(FiniteAppenderatorDriver.java:136) ~[druid-server-0.9.1.jar:0.9.1]

    at io.druid.indexing.kafka.KafkaIndexTask.run(KafkaIndexTask.java:275) ~[?:?]

    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.9.1.jar:0.9.1]

    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.9.1.jar:0.9.1]

    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_91]

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_91]

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_91]

    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]

    Suppressed: java.lang.NullPointerException

            at io.druid.segment.realtime.appenderator.AppenderatorImpl.mergeBarrier(AppenderatorImpl.java:652) ~[druid-server-0.9.1.jar:0.9.1]

            at io.druid.segment.realtime.appenderator.AppenderatorImpl.abandonSegment(AppenderatorImpl.java:1021) ~[druid-server-0.9.1.jar:0.9.1]

            at io.druid.segment.realtime.appenderator.AppenderatorImpl.close(AppenderatorImpl.java:758) ~[druid-server-0.9.1.jar:0.9.1]

            at io.druid.indexing.kafka.KafkaIndexTask.run(KafkaIndexTask.java:526) ~[?:?]

            at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.9.1.jar:0.9.1]

            at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.9.1.jar:0.9.1]

            at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_91]

            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_91]

            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_91]

            at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]

2016-07-21T04:44:07,975 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_kafka_business_events_test_ad1379bb84a7460_pgeigkmd] status changed to [FAILED].

2016-07-21T04:44:07,980 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {

"id" : "index_kafka_business_events_test_ad1379bb84a7460_pgeigkmd",

"status" : "FAILED",

"duration" : 3297

}

And I found a couple of these in the overlord log:

2016-07-21T04:31:44,986 INFO [qtp690051320-215] io.druid.indexing.overlord.TaskLockbox - Task[index_kafka_business_events_test_ad1379bb84a7460_pgeigkmd] already present in TaskLock[index_kafka_business_events_test]

2016-07-21T04:31:44,998 INFO [qtp690051320-215] io.druid.metadata.IndexerSQLMetadataStorageCoordinator - Allocated pending segment [business_events_test_2016-07-14T11:00:00.000Z_2016-07-14T12:00:00.000Z_2016-07-21T03:30:30.914Z_9] for sequence[index_kafka_business_events_test_ad1379bb84a7460_5] (previous = [business_events_test_2016-07-14T10:00:00.000Z_2016-07-14T11:00:00.000Z_2016-07-21T03:30:28.742Z_9]) in DB

2016-07-21T04:31:45,093 WARN [KafkaSupervisor-sor_business_event_success_v1-0] io.druid.indexing.kafka.supervisor.KafkaSupervisor - Task [index_kafka_sor_business_event_success_v1_feffecbde70f303_bcjmfelf] failed to return start time, killing task

java.lang.RuntimeException: java.net.ConnectException: Connection refused

    at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]

    at io.druid.indexing.kafka.KafkaIndexTaskClient.submitRequest(KafkaIndexTaskClient.java:328) ~[druid-kafka-indexing-service-0.9.1.jar:0.9.1]

    at io.druid.indexing.kafka.KafkaIndexTaskClient.submitRequest(KafkaIndexTaskClient.java:233) ~[druid-kafka-indexing-service-0.9.1.jar:0.9.1]

    at io.druid.indexing.kafka.KafkaIndexTaskClient.getStartTime(KafkaIndexTaskClient.java:156) ~[druid-kafka-indexing-service-0.9.1.jar:0.9.1]

    at io.druid.indexing.kafka.supervisor.KafkaSupervisor.getTaskStartTime(KafkaSupervisor.java:1218) ~[druid-kafka-indexing-service-0.9.1.jar:0.9.1]

    at io.druid.indexing.kafka.supervisor.KafkaSupervisor.updateTaskStatus(KafkaSupervisor.java:708) [druid-kafka-indexing-service-0.9.1.jar:0.9.1]

    at io.druid.indexing.kafka.supervisor.KafkaSupervisor.runInternal(KafkaSupervisor.java:457) [druid-kafka-indexing-service-0.9.1.jar:0.9.1]

    at io.druid.indexing.kafka.supervisor.KafkaSupervisor$RunNotice.handle(KafkaSupervisor.java:409) [druid-kafka-indexing-service-0.9.1.jar:0.9.1]

    at io.druid.indexing.kafka.supervisor.KafkaSupervisor$2.run(KafkaSupervisor.java:261) [druid-kafka-indexing-service-0.9.1.jar:0.9.1]

    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_91]

    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_91]

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_91]

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_91]

    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]

Caused by: java.net.ConnectException: Connection refused

    at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_91]

    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_91]

    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_91]

    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_91]

    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_91]

    at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_91]

    at java.net.Socket.connect(Socket.java:538) ~[?:1.8.0_91]

    at java.net.Socket.<init>(Socket.java:434) ~[?:1.8.0_91]

    at java.net.Socket.<init>(Socket.java:211) ~[?:1.8.0_91]

    at io.druid.indexing.kafka.KafkaIndexTaskClient.checkConnection(KafkaIndexTaskClient.java:228) ~[?:?]

    at io.druid.indexing.kafka.KafkaIndexTaskClient.submitRequest(KafkaIndexTaskClient.java:264) ~[?:?]

    ... 12 more

Hey Ben,

I think this issue: com.metamx.common.ISE: Missing hydrant [2] in sinkDir [/services/druid/data2/task/index_kafka_business_events_test_ad1379bb84a7460_pgeigkmd/work/persist/business_events_test_2016-07-15T15:00:00.000Z_2016-07-15T16:00:00.000Z_2016-07-21T03:34:14.236Z_13].

is caused by a consistency check failing between the catalog of persist directories and the actual persisted data. I'm not sure which files you removed during your cleanup, but if you haven't already, try removing the directory pointed to by druid.indexer.task.baseTaskDir and everything underneath it. The suppressed NPE is a red herring; it happened because an implicit finally block in a try-with-resources tried to use something that hadn't been initialized (since the missing-hydrant exception was thrown halfway through init).
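
(Judging by the peon command lines you pasted, that directory is probably something like druid.indexer.task.baseTaskDir=/services/druid/data2/task on your data nodes, but double-check your middle manager runtime.properties rather than taking my guess for it.)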

The logs on the overlord are because the supervisor tries to contact a task it thinks exists but gets no response, so it terminates it. When cleaning up the metadata store, the tables I would remove for a fresh start are: [druid_dataSource, druid_pendingSegments, druid_supervisors, druid_tasklocks, druid_tasks] (or whatever you have named them).
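
For example, assuming the default druid_ table prefix, something along these lines against the metadata database would do it (double-check the names against your druid.metadata.storage.tables settings first):

DROP TABLE druid_dataSource;
DROP TABLE druid_pendingSegments;
DROP TABLE druid_supervisors;
DROP TABLE druid_tasklocks;
DROP TABLE druid_tasks;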

Thanks David. I dropped the whole Druid database, cleaned up all of the local directories on each node, and cleaned up ZooKeeper, and the missing hydrant problem went away. The other problem persists, but is not consistent. Now I have also hit this problem: https://github.com/druid-io/druid/issues/3274

Caused by: org.skife.jdbi.v2.exceptions.UnableToExecuteStatementException: com.mysql.jdbc.PacketTooBigException: Packet for query is too large (1201867 > 1048576). You can change this value on the server by setting the max_allowed_packet’ variable. [statement:“INSERT INTO druid_tasklogs (task_id, log_payload) VALUES (:entryId, :payload)”, located:“INSERT INTO druid_tasklogs (task_id, log_payload) VALUES (:entryId, :payload)”, rewritten:“INSERT INTO druid_tasklogs (task_id, log_payload) VALUES (?, ?)”, arguments:{ positional:{}, named:{payload:[123, 34, 116, 121, 112, 101, 34, 58, 34, 115, 101, 103, 109, 101, 110, 116, 84, 114, 97, 110, 115, 97, 99, 116, 105, 111, 110, 97, 108, 73, 110, 115, 101, 114, 116, 34, 44, 34, 115, 101, 103, 109, 101, 110, 116, 115, 34, 58, 91, 123, 34, 100, 97, 116, 97, 83, 111, 117, 114, 99, 101, 34, 58, 34, 115, 111, 114, 95, 98, 117, 115, 105, 110, 101, 115, 115, 95, 101, 118, 101, 110, 116, 95, 115, 117, 99, 99, 101, 115, 115, 95, 118, 49, 34, 44, 34, 105, 110, 116, 101, 114, 118, 97, 108, 34, 58, 34, 50, 48, 49, 54, 45, 48, 55, 45, 49, 53, 84, 50, 50, 58, 48, 48, 58, 48, 48, 46, 48, 48, 48, 90, 47, 50, 48, 49, 54, 45, 48, 55, 45, 49, 53, 84, 50, 51, 58, 48, 48, 58, 48, 48, 46, 48, 48, 48, 90, 34, 44, 34, 118, 101, 114, 115, 105, 111, 110, 34, 58, 34, 50, 48, 49, 54, 45, 48, 55, 45, 50, 49, 84, 48, 51, 58, 52, 50, 58, 48, 48, 46, 51, 52, 57, 90, 34, 44, 34, 108, 111, 97, 100, 83, 112, 101, 99, 34, 58, 123, 34, 116, 121, 112, 101, 34, 58, 34, 104, 100, 102, 115, 34, 44, 34, 112, 97, 116, 104, 34, 58, 34, 104, 100, 102, 115, 58, 47, 47, 104, 100, 102, 115, 48, 48, 49, 58, 56, 48, 50, 48, 47, 100, 114, 117, 105, 100, 50, 47, 115, 101, 103, 109, 101, 110, 116, 115, 47, 115, 111, 114, 95, 98, 117, 115, 105, 110, 101, 115, 115, 95, 101, 118, 101, 110, 116, 95, 115, 117, 99, 99, 101, 115, 115, 95, 118, 49, 47, 50, 48, 49, 54, 48, 55, 49, 53, 84, 50, 50, 48, 48, 48, 48, 46, 48, 48, 48, 90, 95, 50, 48, 49, 54, 48, 55, 49, 53, 84, 50, 51, 48, 48, 48, 48, 46, 48, 48, 48, 90, 47, 50, 48, 49, 54, 45, 48, 55, 45, 50, 49, 84, 48, 51, 95, 52, 50, 95, 48, 48, 46, 51, 52, 57, 90, 47, 57, 47, 105, 110, 100, 101, 120, 46, 122, 105, 112, 34, 125, 44, 34, 100, 105, 109, 101, 110, 115, 105, 111, 110, 115, 34, 58, 34, 117, 115, 101, 114, 95, 103, 101, 110, 100, 101, 114, 44, 117, 115, 101, 114, 95, 116, 111, 100, 97, 121, 95, 105, 115, 95, 98, 97, 100, 103, 101, 100, 44, 117, 115, 101, 114, 95, 115, 111, 99, 105, 97, 108, 95, 102, 97, 99, 101, 98, 111, 111, 107, 44, 117, 115, 101, 114, 95, 97, 103, 101, 44, 117, 115, 101, 114, 95, 97, 103, 101, 95, 98, 114, 97, 99, 107, 101, 116, 44, 117, 115, 101, 114, 95, 114, 101, 103, 105, 115, 116, 114, 97, 116, 105, 111, 110, 95, 97, 103, 101, 95, 98, 114, 97, 99, 107, 101, 116, 44, 117, 115, 101, 114, 95, 114, 101, 103, 105, 115, 116, 114, 97, 116, 105, 111, 110, 95, 97, 103, 101, 95, 100, 97, 121, 44, 117, 115, 101, 114, 95, 114, 101, 103, 105, 115, 116, 114, 97, 116, 105, 111, 110, 95, 97, 103, 101, 95, 119, 101, 101, 107, 44, 101, 118, 101, 110, 116, 95, 116, 121, 112, 101, 44, 101, 118, 101, 110, 116, 95, 115, 116, 97, 116, 117, 115, 44, 101, 118, 101, 110, 116, 95, 105, 115, 95, 111, 102, 102, 108, 105, 110, 101, 44, 101, 118, 101, 110, 116, 95, 99, 105, 116, 121, 44, 101, 118, 101, 110, 116, 95, 114, 101, 103, 105, 111, 110, 44, 101, 118, 101, 110, 116, 95, 99, 111, 117, 110, 116, 114, 121, 44, 117, 115, 101, 114, 95, 99, 111, 117, 110, 116, 114, 121, 44, 99, 104, 97, 105, 110, 95, 110, 97, 109, 101, 44, 99, 108, 105, 101, 110, 116, 95, 
97, 112, 112, 95, 118, 101, 114, 115, 105, 111, 110, 44, 99, 108, 105, 101, 110, 116, 95, 105, 112, 95, 97, 100, 100, 114, 101, 115, 115, 44, 99, 108, 105, 101, 110, 116, 95, 108, 111, 99, 97, 108, 101, 44, 99, 108, 105, 101, 110, 116, 95, 111, 115, 95, 110, 97, 109, 101, 44, 99, 108, 105, 101, 110, 116, 95, 117, 115, 101, 114, 95, 97, 103, 101, 110, 116, 44, 100, 101, 118, 105, 99, 101, 95, 112, 108, 97, 116, 102, 111, 114, 109, 95, 116, 121, 112, 101, 44, 100, 101, 118, 105, 99, 101, 95, 110, 97, 109, 101, 44, 108, 111, 99, 97, 116, 105, 111, 110, 95, 110, 97, 109, 101, 44, 108, 111, 99, 97, 116, 105, 111, 110, 95, 112, 114, 111, 118, 105, 100, 101, 114, 95, 105, 100, 44, 108, 111, 99, 97, 116, 105, 111, 110, 95, 112, 114, 111, 118, 105, 100, 101, 114, 95, 110, 97, 109, 101, 44, 117, 115, 101, 114, 95, 107, 105, 99, 107, 115, 95, 98, 97, 108, 97, 110, 99, 101, 44, 97, 112, 112, 95, 105, 100, 44, 112, 97, 114, 116, 110, 101, 114, 95, 105, 100, 44, 108, 111, 99, 97, 116, 105, 111, 110, 95, 105, 100, 44, 99, 104, 97, …

Caused by: com.mysql.jdbc.PacketTooBigException: Packet for query is too large (1201867 > 1048576). You can change this value on the server by setting the 'max_allowed_packet' variable.
at com.mysql.jdbc.MysqlIO.send(MysqlIO.java:3583) ~[?:?]
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2460) ~[?:?]
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2625) ~[?:?]
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2551) ~[?:?]
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861) ~[?:?]
at com.mysql.jdbc.PreparedStatement.execute(PreparedStatement.java:1192) ~[?:?]
at org.apache.commons.dbcp2.DelegatingPreparedStatement.execute(DelegatingPreparedStatement.java:198) ~[commons-dbcp2-2.0.1.jar:2.0.1]
at org.apache.commons.dbcp2.DelegatingPreparedStatement.execute(DelegatingPreparedStatement.java:198) ~[commons-dbcp2-2.0.1.jar:2.0.1]
at org.skife.jdbi.v2.SQLStatement.internalExecute(SQLStatement.java:1328) ~[jdbi-2.63.1.jar:2.63.1]
at org.skife.jdbi.v2.Update.execute(Update.java:56) ~[jdbi-2.63.1.jar:2.63.1]
at io.druid.metadata.SQLMetadataStorageActionHandler$10.withHandle(SQLMetadataStorageActionHandler.java:359) ~[druid-server-0.9.1.jar:0.9.1]
at io.druid.metadata.SQLMetadataStorageActionHandler$10.withHandle(SQLMetadataStorageActionHandler.java:347) ~[druid-server-0.9.1.jar:0.9.1]
at org.skife.jdbi.v2.DBI.withHandle(DBI.java:281) ~[jdbi-2.63.1.jar:2.63.1]
… 64 more

This list pretty much goes on forever. I upped max_allowed_packet to 16MB, but that failed to resolve the issue; the payload is ridiculously large.

–Ben

Hey Ben,

Glad to hear things are starting to move again. It's odd that the exception seems to indicate a ~1.2MB payload (1201867 bytes) but setting max_allowed_packet to 16MB doesn't work. Apparently you can also modify that value on the SQL server dynamically by issuing, e.g. for 16MB:

SET GLOBAL max_allowed_packet = 1024*1024*16;

Maybe give that a try and see if it helps at all?

If not, one thing you could try is increasing the number of tasks to equal the number of partitions you have in Kafka, which will reduce the number of segments created by any one task and should hopefully get things rolling. You could also reduce the task duration in the supervisor config so that each task reads less data and hopefully generates fewer segments.
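
For example, in the supervisor spec's ioConfig that would look roughly like this (the taskDuration value here is just an illustration, and the rest of your ioConfig stays as you already have it):

"ioConfig" : {
  "taskCount" : 8,
  "taskDuration" : "PT30M",
  ...
}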

The task log payload stores an entry for each segment created, which includes the names of all the dimensions in that segment plus other metadata. Kafka indexing tasks need to create a segment per Kafka partition to guarantee deterministic behavior, and depending on your segment granularity and the spread of timestamps in your event stream, a task could be creating a whole ton of segments, each of which stores 132 dimension strings; that can easily take up a lot of space.

Thanks for filing the issue. Now that Kafka indexing tasks can handle arbitrary timestamps, which might potentially generate lots of segments, we may want to reconsider our task log schema and split the log across multiple entries to minimize the chance of hitting this exception. Hope one of these workarounds helps you get past this issue.

Apparently the setting is available on both the client and the server. I had made the change on the server (and queried it to make sure that it took), but the problem still occurred, and I do not know where to set the client-side value. Rather than chase that, I am now trying to run 8 tasks in my supervisor spec as you suggested. This seems to make a lot more sense to me given how memory allocation and segments are implemented.

Thanks for your help.