Can't query when the Kafka indexing service switches tasks at the end of the task duration

Hi everyone,
I have an issue: when the Kafka indexing service switches tasks at the end of the task duration, queries become very slow and almost time out (>15s), but after that queries go back to normal.

What I see in the console: my supervisor is set to 5 tasks and 2 replicas, plus 3-5 tasks from the previous batch (the last hour) are still around. New peons are created and old ones are stopped.

Around the 15th minute of each hour, queries are very slow (real-time queries).

Is that the handoff period?

How can I prevent that?

+1, we have the same problem. We also see the data reduced in Pivot during the handoff period (when the query doesn't time out), and the results come back to the correct values later (after the handoff period).

Hey Chanh, Linbo,

I wonder if what’s going on is that handoff is taking longer than the default completionTimeout (see http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html) and so the data is unavailable for some amount of time. Or maybe system resource usage is quite high during the persist, and that’s making queries slow down too.
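
If it does turn out that handoff is taking longer than that, completionTimeout lives in the Kafka supervisor spec's ioConfig and can be raised there. A minimal sketch, with placeholder values rather than recommendations:

"ioConfig": {
  "topic": "your_topic",
  "taskCount": 5,
  "replicas": 2,
  "taskDuration": "PT1H",
  "completionTimeout": "PT60M"
}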

Hi Gian,

According to the default configuration, completionTimeout is PT30M, and I haven't changed it.

For more detail on my data: the current taskDuration is PT1H, and the data is quite big, around 1GB for each task.

Do I need to reduce the taskDuration?

During the switching time, do I need to limit the number of tasks that start in the new interval? And how would I do that? For example, I have 20 workers available and the current task spec is 5 tasks and 2 replicas.

Regards,

Chanh

Screen Shot 2016-09-29 at 10.41.55 AM.png

The main thing to do first is check out the Kafka task logs to see what’s going on… is it taking a really long time to persist? Are there gc problems? Is it being stopped early? That kind of thing.


Hi Gian,

I tracked it, but does "duration" : 3840401 mean 64 minutes?

How do I track the persist?

2016-09-29T04:26:13,867 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_kafka_ad_statistics_hourly_realtime_ae96c9038371fdf_endnkhkb",
  "status" : "SUCCESS",
  "duration" : 3840401
}

I also attached the log and GC log of an index task.

gclog.txt (187 KB)

log.log.txt.zip (1.68 MB)

Hey Chanh,
I took a look at the logs and wanted to mention a few things that might help us track down what’s going on:

  • First, it doesn't look like the slowdown is happening during the publishing phase, and the task seems to be publishing pretty normally. Yes, 3840401 ms is about 64 minutes (3,840,401 / 60,000 ≈ 64); your logs show that the task runs for 60 minutes and then takes 4 minutes to build the final segments and push them to deep storage, which is great, so the problem isn't related to handoff.

  • The slowdown is happening after the task has been running for about 7 minutes. These log messages are pretty interesting; they show that for some reason there’s no logged activity for about 40 seconds, and then the next logs are the ZK client timing out after 47 seconds and a query/wait/time metric of 40 seconds (which explains why the query was so slow):
    2016-09-29T03:29:37,949 INFO [qtp397639322-173] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-09-29T03:29:37.949Z","service":"druid/middlemanager","host":"adx-pro-vdchcm-ants19-ip-10-199-0-19:8109","metric":"query/bytes","value":84,"context":"{\"finalize\":false,\"queryId\":\"890f77a5-cc1b-4fbb-ba44-d2a5a6911c81\",\"timeout\":40000}","dataSource":"ad_statistics_hourly_realtime","duration":"PT3660S","hasFilters":"false","id":"890f77a5-cc1b-4fbb-ba44-d2a5a6911c81","interval":["2016-09-29T02:00:00.000Z/2016-09-29T03:01:00.000Z"],"remoteAddress":"10.199.0.19","type":"topN","version":"0.9.1.1"}]
    2016-09-29T03:30:18,417 WARN [main-SendThread(adx-pro-vdchcm-ants19-ip-10-199-0-19:2181)] org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 47129ms for sessionid 0x15765b79e4c0803
    2016-09-29T03:30:18,417 INFO [main-SendThread(adx-pro-vdchcm-ants19-ip-10-199-0-19:2181)] org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 47129ms for sessionid 0x15765b79e4c0803, closing socket connection and attempting reconnect
    2016-09-29T03:30:18,426 INFO [topN_ad_statistics_hourly_realtime_[2016-09-29T02:00:00.000Z/2016-09-29T03:00:00.000Z]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-09-29T03:30:18.426Z","service":"druid/middlemanager","host":"adx-pro-vdchcm-ants19-ip-10-199-0-19:8109","metric":"query/partial/time","value":4,"dataSource":"ad_statistics_hourly_realtime","dimension":"inmarket_id","duration":"PT3660S","hasFilters":"false","id":"c898c3ba-dc7d-4e1a-9ab2-4ac964cd48dc","interval":["2016-09-29T02:00:00.000Z/2016-09-29T03:01:00.000Z"],"numComplexMetrics":"0","numMetrics":"4","threshold":"1000","type":"topN","version":"0.9.1.1"}]
    2016-09-29T03:30:18,426 INFO [topN_ad_statistics_hourly_realtime_[2016-09-29T02:00:00.000Z/2016-09-29T03:00:00.000Z]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-09-29T03:30:18.426Z","service":"druid/middlemanager","host":"adx-pro-vdchcm-ants19-ip-10-199-0-19:8109","metric":"query/wait/time","value":40245,"dataSource":"ad_statistics_hourly_realtime","dimension":"inmarket_id","duration":"PT3660S","hasFilters":"false","id":"c898c3ba-dc7d-4e1a-9ab2-4ac964cd48dc","interval":["2016-09-29T02:00:00.000Z/2016-09-29T03:01:00.000Z"],"numComplexMetrics":"0","numMetrics":"4","threshold":"1000","type":"topN","version":"0.9.1.1"}]

  • This seems symptomatic of GC issues, but I can't really tell because the GC data isn't timestamped, so it can't be correlated with the logs (and I don't think it covers this point in time either). Setting -XX:+PrintGCDetails and -XX:+PrintGCTimeStamps (or -XX:+PrintGCDateStamps) in druid.indexer.runner.javaOpts would help with this; see the example snippet after this list.

  • Also curious how much RAM you have on that machine; it seems like a pretty powerful machine, but based on your settings, if you have less than 90GB or so you might be overcommitted on memory and not have enough left for paging spilled segments into memory, which would affect query performance (though not by 40 seconds)…
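
A minimal sketch of what that GC logging change might look like in the middleManager's runtime.properties (keep your existing opts; the log path here is only an example):

druid.indexer.runner.javaOpts=-server -Xms5G -Xmx5G -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/peon-gc.log -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager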

Hope this helps.

Hi David,

Thanks for replying.

  • Also curious how much RAM you have on that machine; it seems like a pretty powerful machine, but based on your settings, if you have less than 90GB or so you might be over committed on memory and not have enough for paging spilled segments into memory which would affect query performance (but not 40 seconds affected)…

Currently we run our Druid cluster on 2 physical servers in total, each with 128 CPU cores, 760GB RAM, and 2TB SSD.

We deployed:

  • 8 historical nodes (-Xms24g -Xmx24g -XX:NewSize=8g -XX:MaxNewSize=8g -XX:MaxDirectMemorySize=12g)
  • 2 middle manager nodes
  • 1 overlord
  • 1 broker (-Xms24g -Xmx24g -XX:NewSize=8g -XX:MaxNewSize=8g -XX:MaxDirectMemorySize=16g)
  • 1 pivot
  • 2 hdfs nodes
  • 1 zookeeper node
  • 1 memcached 16GB
  • 22 peons (-Xms5G -Xmx5G)
  • 4 Kafka brokers (16 partitions)
  • Ingesting 20 million events per minute

Regarding RAM, we still have a lot of free space; we're more concerned about CPU.

  • The slowdown is happening after the task has been running for about 7 minutes. These log messages are pretty interesting; they show that for some reason there’s no logged activity for about 40 seconds, and then the next logs are the ZK client timing out after 47 seconds and a query/wait/time metric of 40 seconds (which explains why the query was so slow):

Yes, I think maybe we should move ZooKeeper to a redundant setup.

  • This seems symptomatic of GC issues, but I can’t really tell because the GC data isn’t timestamped so it can’t be correlated with the logs (also I don’t think it covers this point in time either). Setting -XX:+PrintGCDetails and -XX:+PrintGCTimeStamps (or -XX:+PrintGCDateStamps) in druid.indexer.runner.javaOpts would help with this.

Yeah, we do think it's a GC issue in the peons. We tried following the recommended settings from Druid, but it didn't help much; CPU usage in some peons was too high (>1000%, i.e. ~10 CPUs).

We also increased the number of threads to 4, but there are still some lags when querying during handoff time.

We currently run on Java version 1.8.0_77. We tried G1 GC, but it doesn't seem as good as CMS for us.

Before that, we tried increasing the peon heap size to 10GB, but that didn't help either.

Right now we are trying the following configuration, running 8 tasks:

druid.indexer.runner.javaOpts=-server -Xms5G -Xmx5G -XX:ParallelGCThreads=5 -XX:ConcGCThreads=2 -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -XX:+ScavengeBeforeFullGC -XX:+CMSScavengeBeforeRemark -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

Right now we just don't know how to estimate how many tasks are enough to query in real time and ingest 20 million events/minute.

We tried running with 4 tasks, but then we can't query at all, or it's extremely slow.
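
For a rough sense of scale, just dividing the numbers above: 20 million events/minute is about 333,000 events/second, so with 8 tasks each peon ingests roughly 42,000 events/second, and with 4 tasks roughly 83,000 events/second.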

Regards,

Chanh

Hi David,
I tracked it and found that during handoff it takes a lot of time to allocate buffer memory. Am I right?

2016-10-06T08:35:39,281 INFO [qtp1679022055-158] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-10-06T08:35:39.281Z","service":"druid/middlemanager","host":"adx-pro-vdchcm-ants19-ip-10-199-0-19:8104","metric":"query/time","value":113,"context":"{\"finalize\":false,\"queryId\":\"a902c022-7277-42d0-99b6-d5bc5361af0a\",\"timeout\":40000}","dataSource":"ad_statistics_hourly_realtime","duration":"PT60S","hasFilters":"false","id":"a902c022-7277-42d0-99b6-d5bc5361af0a","interval":["2016-10-06T08:00:00.000Z/2016-10-06T08:01:00.000Z"],"remoteAddress":"10.199.0.19","success":"true","type":"timeseries","version":"0.9.1.1"}]
2016-10-06T08:35:39,282 INFO [qtp1679022055-158] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-10-06T08:35:39.281Z","service":"druid/middlemanager","host":"adx-pro-vdchcm-ants19-ip-10-199-0-19:8104","metric":"query/bytes","value":146,"context":"{\"finalize\":false,\"queryId\":\"a902c022-7277-42d0-99b6-d5bc5361af0a\",\"timeout\":40000}","dataSource":"ad_statistics_hourly_realtime","duration":"PT60S","hasFilters":"false","id":"a902c022-7277-42d0-99b6-d5bc5361af0a","interval":["2016-10-06T08:00:00.000Z/2016-10-06T08:01:00.000Z"],"remoteAddress":"10.199.0.19","type":"timeseries","version":"0.9.1.1"}]
Total time for which application threads were stopped: 0.0005963 seconds, Stopping threads took: 0.0001487 seconds
Total time for which application threads were stopped: 0.0003115 seconds, Stopping threads took: 0.0000461 seconds
Total time for which application threads were stopped: 0.0003206 seconds, Stopping threads took: 0.0000310 seconds
Total time for which application threads were stopped: 0.0004690 seconds, Stopping threads took: 0.0000485 seconds
Total time for which application threads were stopped: 0.0004448 seconds, Stopping threads took: 0.0000441 seconds
Total time for which application threads were stopped: 0.0006129 seconds, Stopping threads took: 0.0000942 seconds
Total time for which application threads were stopped: 0.0005040 seconds, Stopping threads took: 0.0000575 seconds
2016-10-06T08:35:39,524 INFO [topN_ad_statistics_hourly_realtime_[2016-10-06T08:00:00.000Z/2016-10-06T08:01:00.000Z]] io.druid.offheap.OffheapBufferPool - Allocating new intermediate processing buffer[0] of size[1,036,870,912]
Total time for which application threads were stopped: 6.6925143 seconds, Stopping threads took: 6.6910491 seconds
Total time for which application threads were stopped: 0.0006422 seconds, Stopping threads took: 0.0001926 seconds
2016-10-06T08:35:46,363 INFO [timeBoundary_ad_statistics_hourly_realtime_[2016-10-06T08:00:00.000Z/2016-10-06T09:00:00.000Z]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-10-06T08:35:46.362Z","service":"druid/middlemanager","host":"adx-pro-vdchcm-ants19-ip-10-199-0-19:8104","metric":"query/partial/time","value":1,"dataSource":"ad_statistics_hourly_realtime","id":"b24a47bd-b52c-4512-b962-d285f4f443e9","type":"timeBoundary"}]
2016-10-06T08:35:46,363 INFO [timeBoundary_ad_statistics_hourly_realtime_[2016-10-06T08:00:00.000Z/2016-10-06T09:00:00.000Z]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-10-06T08:35:46.363Z","service":"druid/middlemanager","host":"adx-pro-vdchcm-ants19-ip-10-199-0-19:8104","metric":"query/wait/time","value":0,"dataSource":"ad_statistics_hourly_realtime","id":"b24a47bd-b52c-4512-b962-d285f4f443e9","type":"timeBoundary"}]
Total time for which application threads were stopped: 0.0005853 seconds, Stopping threads took: 0.0001108 seconds
Total time for which application threads were stopped: 0.0006089 seconds, Stopping threads took: 0.0002154 seconds
2016-10-06T08:35:46,368 INFO [qtp1679022055-160] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-10-06T08:35:46.368Z","service":"druid/middlemanager","host":"adx-pro-vdchcm-ants19-ip-10-199-0-19:8104","metric":"query/time","value":6704,"context":"{\"finalize\":false,\"queryId\":\"b24a47bd-b52c-4512-b962-d285f4f443e9\",\"timeout\":40000}","dataSource":"ad_statistics_hourly_realtime","duration":"PT3600S","hasFilters":"false","id":"b24a47bd-b52c-4512-b962-d285f4f443e9","interval":["2016-10-06T08:00:00.000Z/2016-10-06T09:00:00.000Z"],"remoteAddress":"10.199.0.19","success":"true","type":"timeBoundary","version":"0.9.1.1"}]
2016-10-06T08:35:46,369 INFO [qtp1679022055-160] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-10-06T08:35:46.369Z","service":"druid/middlemanager","host":"adx-pro-vdchcm-ants19-ip-10-199-0-19:8104","metric":"query/bytes","value":85,"context":"{\"finalize\":false,\"queryId\":\"b24a47bd-b52c-4512-b962-d285f4f443e9\",\"timeout\":40000}","dataSource":"ad_statistics_hourly_realtime","duration":"PT3600S","hasFilters":"false","id":"b24a47bd-b52c-4512-b962-d285f4f443e9","interval":["2016-10-06T08:00:00.000Z/2016-10-06T09:00:00.000Z"],"remoteAddress":"10.199.0.19","type":"timeBoundary","version":"0.9.1.1"}]
2016-10-06T08:35:46,440 INFO [topN_ad_statistics_hourly_realtime_[2016-10-06T08:00:00.000Z/2016-10-06T08:01:00.000Z]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-10-06T08:35:46.440Z","service":"druid/middlemanager","host":"adx-pro-vdchcm-ants19-ip-10-199-0-19:8104","metric":"query/partial/time","value":6928,"dataSource":"ad_statistics_hourly_realtime","dimension":"gender_id","duration":"PT60S","hasFilters":"false","id":"2b64e708-d01c-4644-a83f-56a270f2c001","interval":["2016-10-06T08:00:00.000Z/2016-10-06T08:01:00.000Z"],"numComplexMetrics":"0","numMetrics":"4","threshold":"1000","type":"topN","version":"0.9.1.1"}]
2016-10-06T08:35:46,440 INFO [topN_ad_statistics_hourly_realtime_[2016-10-06T08:00:00.000Z/2016-10-06T08:01:00.000Z]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-10-06T08:35:46.440Z","service":"druid/middlemanager","host":"adx-pro-vdchcm-ants19-ip-10-199-0-19:8104","metric":"query/wait/time","value":1,"dataSource":"ad_statistics_hourly_realtime","dimension":"gender_id","duration":"PT60S","hasFilters":"false","id":"2b64e708-d01c-4644-a83f-56a270f2c001","interval":["2016-10-06T08:00:00.000Z/2016-10-06T08:01:00.000Z"],"numComplexMetrics":"0","numMetrics":"4","threshold":"1000","type":"topN","version":"0.9.1.1"}]
Total time for which application threads were stopped: 0.0006346 seconds, Stopping threads took: 0.0001335 seconds
Total time for which application threads were stopped: 0.0005245 seconds, Stopping threads took: 0.0001554 seconds
Total time for which application threads were stopped: 0.0003763 seconds, Stopping threads took: 0.0000998 seconds
2016-10-06T08:35:46,449 INFO [qtp1679022055-159] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-10-06T08:35:46.449Z","service":"druid/middlemanager","host":"adx-pro-vdchcm-ants19-ip-10-199-0-19:8104","metric":"query/time","value":6969,"context":"{\"finalize\":false,\"queryId\":\"2b64e708-d01c-4644-a83f-56a270f2c001\",\"timeout\":40000}","dataSource":"ad_statistics_hourly_realtime","duration":"PT60S","hasFilters":"false","id":"2b64e708-d01c-4644-a83f-56a270f2c001","interval":["2016-10-06T08:00:00.000Z/2016-10-06T08:01:00.000Z"],"remoteAddress":"10.199.0.19","success":"true","type":"topN","version":"0.9.1.1"}]
2016-10-06T08:35:46,450 INFO [qtp1679022055-159] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-10-06T08:35:46.450Z","service":"druid/middlemanager","host":"adx-pro-vdchcm-ants19-ip-10-199-0-19:8104","metric":"query/bytes","value":266,"context":"{\"finalize\":false,\"queryId\":\"2b64e708-d01c-4644-a83f-56a270f2c001\",\"timeout\":40000}","dataSource":"ad_statistics_hourly_realtime","duration":"PT60S","hasFilters":"false","id":"2b64e708-d01c-4644-a83f-56a270f2c001","interval":["2016-10-06T08:00:00.000Z/2016-10-06T08:01:00.000Z"],"remoteAddress":"10.199.0.19","type":"topN","version":"0.9.1.1"}]
Total time for which application threads were stopped: 0.0197579 seconds, Stopping threads took: 0.0190897 seconds
Total time for which application threads were stopped: 0.0120070 seconds, Stopping threads took: 0.0113326 seconds
Total time for which application threads were stopped: 0.0007714 seconds, Stopping threads took: 0.0002170 seconds
Total time for which application threads were stopped: 0.0007229 seconds, Stopping threads took: 0.0001717 seconds
Total time for which application threads were stopped: 0.0057598 seconds, Stopping threads took: 0.0028854 seconds
Total time for which application threads were stopped: 0.0006208 seconds, Stopping threads took: 0.0000913 seconds
Total time for which application threads were stopped: 0.0010569 seconds, Stopping threads took: 0.0003147 seconds
Total time for which application threads were stopped: 0.0039626 seconds, Stopping threads took: 0.0031511 seconds
Total time for which application threads were stopped: 0.0036775 seconds, Stopping threads took: 0.0008415 seconds
Total time for which application threads were stopped: 0.0010646 seconds, Stopping threads took: 0.0002936 seconds
2016-10-06T08:35:46,694 INFO [topN_ad_statistics_hourly_realtime_[2016-10-06T08:00:00.000Z/2016-10-06T08:01:00.000Z]] io.druid.offheap.OffheapBufferPool - **Allocating new intermediate processing buffer[1] of size[1,036,870,912]**
2016-10-06T08:35:46,695 INFO [topN_ad_statistics_hourly_realtime_[2016-10-06T08:00:00.000Z/2016-10-06T08:01:00.000Z]] io.druid.offheap.OffheapBufferPool - **Allocating new intermediate processing buffer[2] of size[1,036,870,912]**
2016-10-06T08:35:47,035 INFO [topN_ad_statistics_hourly_realtime_[2016-10-06T08:00:00.000Z/2016-10-06T08:01:00.000Z]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-10-06T08:35:47.035Z","service":"druid/middlemanager","host":"adx-pro-vdchcm-ants19-ip-10-199-0-19:8104","metric":"query/partial/time","value":358,"dataSource":"ad_statistics_hourly_realtime","dimension":"browser_id","duration":"PT60S","hasFilters":"true","id":"78ce4c6f-e2be-4da6-89fc-4f4e7ec3bc39","interval":["2016-10-06T08:00:00.000Z/2016-10-06T08:01:00.000Z"],"numComplexMetrics":"0","numMetrics":"4","threshold":"1000","type":"topN","version":"0.9.1.1"}]
2016-10-06T08:35:47,035 INFO [topN_ad_statistics_hourly_realtime_[2016-10-06T08:00:00.000Z/2016-10-06T08:01:00.000Z]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-10-06T08:35:47.035Z","service":"druid/middlemanager","host":"adx-pro-vdchcm-ants19-ip-10-199-0-19:8104","metric":"query/wait/time","value":1,"dataSource":"ad_statistics_hourly_realtime","dimension":"browser_id","duration":"PT60S","hasFilters":"true","id":"78ce4c6f-e2be-4da6-89fc-4f4e7ec3bc39","interval":["2016-10-06T08:00:00.000Z/2016-10-06T08:01:00.000Z"],"numComplexMetrics":"0","numMetrics":"4","threshold":"1000","type":"topN","version":"0.9.1.1"}]
Total time for which application threads were stopped: **14.1567895 seconds**, Stopping threads took: **14.1553502** seconds
Total time for which application threads were stopped: 0.0006413 seconds, Stopping threads took: 0.0002452 seconds

Hey Chanh,

My suspicion is that it’s not the time it took to allocate the buffer directly that’s causing you problems, but more that the allocation triggered a stop-the-world full GC that took a long time to progress and halted everything. If you add the -XX:+PrintGCDetails and -XX:+PrintGCTimeStamps flags to the peon javaOpts, you might get more insight into what’s happening in the GC. Based on what you were saying about a peon using 1000% CPU, it seems possible that you’re over capacity for processing power so that when the GC kicked in it took forever to progress.

Hey David,

My suspicion is that it’s not the time it took to allocate the buffer directly that’s causing you problems, but more that the allocation triggered a stop-the-world full GC that took a long time to progress and halted everything

Yes, it is. Our current peon processing settings are:

druid.indexer.fork.property.druid.processing.buffer.sizeBytes: 2000000000
druid.indexer.fork.property.druid.processing.numThreads: 3

and the safepoint log shows timeouts (full trace at the end of this post):
# SafepointSynchronize::begin: Timeout detected:

Based on what you were saying about a peon using 1000% CPU, it seems possible that you’re over capacity for processing power so that when the GC kicked in it took forever to progress.

I changed the peon config to:

druid.indexer.runner.javaOpts: -server -Xmx5g -Xms5g -XX:NewSize=1g -XX:MaxDirectMemorySize=10g -XX:+UseConcMarkSweepGC -XX:+UseStringDeduplication -XX:MaxGCPauseMillis=300 -XX:ParallelGCThreads=4 -XX:ConcGCThreads=2 -XX:InitiatingHeapOccupancyPercent=85 -verbose:gc -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCApplicationStoppedTime -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=500 -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

It's quite good now.
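
(If I understand the sizing rule of thumb correctly, a peon needs direct memory of at least (druid.processing.numThreads + 1) x druid.processing.buffer.sizeBytes, so with numThreads=3 and sizeBytes=2000000000 that is about 4 x 2GB = 8GB, which fits under the -XX:MaxDirectMemorySize=10g above.)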

One more thing: I only tested on 1 hour of data, so I think I still have enough resources to handle it.

2016-10-10T05:26:24,529 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[group_id] in 151 millis.
2016-10-10T05:26:24,549 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[network_id] in 20 millis.
2016-10-10T05:26:24,553 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[advertiser_id] in 3 millis.
2016-10-10T05:26:24,554 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[lineitem_id] in 1 millis.
2016-10-10T05:26:24,558 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[campaign_id] in 4 millis.
2016-10-10T05:26:24,560 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[creative_id] in 2 millis.
2016-10-10T05:26:24,581 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[payment_model] in 21 millis.
2016-10-10T05:26:24,605 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[is_ad_default] in 24 millis.
2016-10-10T05:26:24,607 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[website_id] in 2 millis.
2016-10-10T05:26:24,609 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[section_id] in 2 millis.
2016-10-10T05:26:24,615 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[channel_id] in 6 millis.
2016-10-10T05:26:24,617 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[zone_id] in 2 millis.
2016-10-10T05:26:24,620 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[placement_id] in 3 millis.
2016-10-10T05:26:24,636 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[template_id] in 16 millis.
2016-10-10T05:26:24,648 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[zone_format] in 12 millis.
2016-10-10T05:26:24,651 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[topic_id] in 3 millis.
2016-10-10T05:26:24,654 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[interest_id] in 3 millis.
2016-10-10T05:26:24,657 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[topic_curr_id] in 3 millis.
2016-10-10T05:26:24,669 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[interest_curr_id] in 12 millis.
2016-10-10T05:26:24,679 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[inmarket_curr_id] in 10 millis.
2016-10-10T05:26:24,687 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[audience_id] in 8 millis.
2016-10-10T05:26:24,698 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[location_id] in 11 millis.
2016-10-10T05:26:24,715 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[os_id] in 17 millis.
2016-10-10T05:26:24,722 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[os_version_id] in 7 millis.
2016-10-10T05:26:24,739 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[browser_id] in 17 millis.
2016-10-10T05:26:24,756 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[device_type] in 17 millis.
2016-10-10T05:26:24,769 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[device_id] in 13 millis.
2016-10-10T05:26:24,780 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[carrier_id] in 11 millis.
2016-10-10T05:26:24,795 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[age_range_id] in 14 millis.
2016-10-10T05:26:24,820 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension column[gender_id] in 25 millis.
2016-10-10T05:26:24,820 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed dimension columns in 442 millis.
2016-10-10T05:26:24,820 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.IndexMergerV9 - Completed index.drd in 0 millis.
2016-10-10T05:26:24,832 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.ReferenceCountingSegment - Closing ad_statistics_hourly_realtime_2016-10-10T04:00:00.000Z_2016-10-10T05:00:00.000Z_2016-10-10T04:00:07.494Z_59
2016-10-10T05:26:24,832 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.ReferenceCountingSegment - Closing ad_statistics_hourly_realtime_2016-10-10T04:00:00.000Z_2016-10-10T05:00:00.000Z_2016-10-10T04:00:07.494Z_59, numReferences: 0
2016-10-10T05:26:24,834 INFO [ad_statistics_hourly_realtime-incremental-persist] io.druid.segment.realtime.appenderator.AppenderatorImpl - Committing metadata[FiniteAppenderatorDriverMetadata{activeSegments={index_kafka_ad_statistics_hourly_realtime_1b895bbe3f4f95f_15=[ad_statistics_hourly_realtime_2016-10-09T22:00:00.000Z_2016-10-09T23:00:00.000Z_2016-10-09T22:00:16.726Z_156, ad_statistics_hourly_realtime_2016-10-09T23:00:00.000Z_2016-10-10T00:00:00.000Z_2016-10-09T23:00:25.051Z_140, ad_statistics_hourly_realtime_2016-10-10T00:00:00.000Z_2016-10-10T01:00:00.000Z_2016-10-10T00:00:09.443Z_124, ad_statistics_hourly_realtime_2016-10-10T01:00:00.000Z_2016-10-10T02:00:00.000Z_2016-10-10T01:00:08.686Z_104, ad_statistics_hourly_realtime_2016-10-10T02:00:00.000Z_2016-10-10T03:00:00.000Z_2016-10-10T02:00:07.499Z_95, ad_statistics_hourly_realtime_2016-10-10T03:00:00.000Z_2016-10-10T04:00:00.000Z_2016-10-10T03:00:14.836Z_77, ad_statistics_hourly_realtime_2016-10-10T04:00:00.000Z_2016-10-10T05:00:00.000Z_2016-10-10T04:00:07.494Z_59]}, lastSegmentIds={index_kafka_ad_statistics_hourly_realtime_1b895bbe3f4f95f_15=ad_statistics_hourly_realtime_2016-10-10T02:00:00.000Z_2016-10-10T03:00:00.000Z_2016-10-10T02:00:07.499Z_95}, callerMetadata={nextPartitions=KafkaPartitions{topic='ad_statistic_hourly_replica_1', partitionOffsetMap={15=185108958}}}}] for sinks[ad_statistics_hourly_realtime_2016-10-09T22:00:00.000Z_2016-10-09T23:00:00.000Z_2016-10-09T22:00:16.726Z_156:6, ad_statistics_hourly_realtime_2016-10-10T03:00:00.000Z_2016-10-10T04:00:00.000Z_2016-10-10T03:00:14.836Z_77:6, ad_statistics_hourly_realtime_2016-10-10T02:00:00.000Z_2016-10-10T03:00:00.000Z_2016-10-10T02:00:07.499Z_95:6, ad_statistics_hourly_realtime_2016-10-10T01:00:00.000Z_2016-10-10T02:00:00.000Z_2016-10-10T01:00:08.686Z_104:6, ad_statistics_hourly_realtime_2016-10-10T00:00:00.000Z_2016-10-10T01:00:00.000Z_2016-10-10T00:00:09.443Z_124:6, ad_statistics_hourly_realtime_2016-10-10T04:00:00.000Z_2016-10-10T05:00:00.000Z_2016-10-10T04:00:07.494Z_59:8, ad_statistics_hourly_realtime_2016-10-09T23:00:00.000Z_2016-10-10T00:00:00.000Z_2016-10-09T23:00:25.051Z_140:6].
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
53.738: RevokeBias                       [     202          0              2    ]      [     0     0     0     0     0    ]  0
Total time for which application threads were stopped: 0.0006284 seconds, Stopping threads took: 0.0001531 seconds
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
53.740: RevokeBias                       [     202          1              3    ]      [     0     0     0     0     0    ]  0
Total time for which application threads were stopped: 0.0005048 seconds, Stopping threads took: 0.0001056 seconds
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
53.740: RevokeBias                       [     202          1              2    ]      [     0     0     0     0     0    ]  0
Total time for which application threads were stopped: 0.0004418 seconds, Stopping threads took: 0.0001306 seconds
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
53.741: RevokeBias                       [     202          1              2    ]      [     0     0     0     0     0    ]  0
Total time for which application threads were stopped: 0.0002987 seconds, Stopping threads took: 0.0001002 seconds
2016-10-10T05:26:25,251 INFO [timeseries_ad_statistics_hourly_realtime_[2016-10-10T03:01:00.000Z/2016-10-10T04:00:00.000Z]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-10-10T05:26:25.250Z","service":"druid/middlemanager","host":"adx-pro-vdchcm-ants20-ip-10-199-0-20:8105","metric":"query/partial/time","value":2,"dataSource":"ad_statistics_hourly_realtime","duration":"PT3600S","hasFilters":"false","id":"08d68a3c-55ed-4c80-9360-eee0fb474442","interval":["2016-10-10T03:01:00.000Z/2016-10-10T04:01:00.000Z"],"numComplexMetrics":"0","numMetrics":"2","type":"timeseries","version":"0.9.1.1"}]
2016-10-10T05:26:25,251 INFO [timeseries_ad_statistics_hourly_realtime_[2016-10-10T03:01:00.000Z/2016-10-10T04:00:00.000Z]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-10-10T05:26:25.251Z","service":"druid/middlemanager","host":"adx-pro-vdchcm-ants20-ip-10-199-0-20:8105","metric":"query/wait/time","value":0,"dataSource":"ad_statistics_hourly_realtime","duration":"PT3600S","hasFilters":"false","id":"08d68a3c-55ed-4c80-9360-eee0fb474442","interval":["2016-10-10T03:01:00.000Z/2016-10-10T04:01:00.000Z"],"numComplexMetrics":"0","numMetrics":"2","type":"timeseries","version":"0.9.1.1"}]
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
53.744: RevokeBias                       [     202          1              4    ]      [     0     0     0     0     0    ]  0
Total time for which application threads were stopped: 0.0006463 seconds, Stopping threads took: 0.0003044 seconds
2016-10-10T05:26:25,296 INFO [timeseries_ad_statistics_hourly_realtime_[2016-10-10T04:00:00.000Z/2016-10-10T04:01:00.000Z]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-10-10T05:26:25.296Z","service":"druid/middlemanager","host":"adx-pro-vdchcm-ants20-ip-10-199-0-20:8105","metric":"query/partial/time","value":48,"dataSource":"ad_statistics_hourly_realtime","duration":"PT3600S","hasFilters":"false","id":"08d68a3c-55ed-4c80-9360-eee0fb474442","interval":["2016-10-10T03:01:00.000Z/2016-10-10T04:01:00.000Z"],"numComplexMetrics":"0","numMetrics":"2","type":"timeseries","version":"0.9.1.1"}]
2016-10-10T05:26:25,297 INFO [timeseries_ad_statistics_hourly_realtime_[2016-10-10T04:00:00.000Z/2016-10-10T04:01:00.000Z]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-10-10T05:26:25.296Z","service":"druid/middlemanager","host":"adx-pro-vdchcm-ants20-ip-10-199-0-20:8105","metric":"query/wait/time","value":0,"dataSource":"ad_statistics_hourly_realtime","duration":"PT3600S","hasFilters":"false","id":"08d68a3c-55ed-4c80-9360-eee0fb474442","interval":["2016-10-10T03:01:00.000Z/2016-10-10T04:01:00.000Z"],"numComplexMetrics":"0","numMetrics":"2","type":"timeseries","version":"0.9.1.1"}]
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
53.790: RevokeBias                       [     202          0              2    ]      [     0     0     0     0     0    ]  0
Total time for which application threads were stopped: 0.0007456 seconds, Stopping threads took: 0.0001693 seconds
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
53.791: RevokeBias                       [     202          2              2    ]      [     0     0     0     0     0    ]  0
Total time for which application threads were stopped: 0.0004037 seconds, Stopping threads took: 0.0001033 seconds
2016-10-10T05:26:25,300 INFO [qtp1679022055-166] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-10-10T05:26:25.299Z","service":"druid/middlemanager","host":"adx-pro-vdchcm-ants20-ip-10-199-0-20:8105","metric":"query/time","value":55,"context":"{\"finalize\":false,\"queryId\":\"08d68a3c-55ed-4c80-9360-eee0fb474442\",\"timeout\":40000}","dataSource":"ad_statistics_hourly_realtime","duration":"PT3600S","hasFilters":"false","id":"08d68a3c-55ed-4c80-9360-eee0fb474442","interval":["2016-10-10T03:01:00.000Z/2016-10-10T04:01:00.000Z"],"remoteAddress":"10.199.0.19","success":"true","type":"timeseries","version":"0.9.1.1"}]
2016-10-10T05:26:25,300 INFO [qtp1679022055-166] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-10-10T05:26:25.300Z","service":"druid/middlemanager","host":"adx-pro-vdchcm-ants20-ip-10-199-0-20:8105","metric":"query/bytes","value":89,"context":"{\"finalize\":false,\"queryId\":\"08d68a3c-55ed-4c80-9360-eee0fb474442\",\"timeout\":40000}","dataSource":"ad_statistics_hourly_realtime","duration":"PT3600S","hasFilters":"false","id":"08d68a3c-55ed-4c80-9360-eee0fb474442","interval":["2016-10-10T03:01:00.000Z/2016-10-10T04:01:00.000Z"],"remoteAddress":"10.199.0.19","type":"timeseries","version":"0.9.1.1"}]
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
54.063: RevokeBias                       [     202          0              1    ]      [     0     0     0     0     0    ]  0
Total time for which application threads were stopped: 0.0006254 seconds, Stopping threads took: 0.0001507 seconds
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
54.064: RevokeBias                       [     202          1              1    ]      [     0     0     0     0     0    ]  0
Total time for which application threads were stopped: 0.0002923 seconds, Stopping threads took: 0.0000580 seconds
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
54.082: RevokeBias                       [     202          0              2    ]      [     0     0     0     0     0    ]  0
Total time for which application threads were stopped: 0.0007730 seconds, Stopping threads took: 0.0001604 seconds
2016-10-10T05:26:25,608 INFO [topN_ad_statistics_hourly_realtime_[2016-10-10T03:01:00.000Z/2016-10-10T04:00:00.000Z]] io.druid.offheap.OffheapBufferPool - Allocating new intermediate processing buffer[0] of size[2,000,000,000]
2016-10-10T05:26:25,608 INFO [topN_ad_statistics_hourly_realtime_[2016-10-10T04:00:00.000Z/2016-10-10T04:01:00.000Z]] io.druid.offheap.OffheapBufferPool - Allocating new intermediate processing buffer[1] of size[2,000,000,000]

**# SafepointSynchronize::begin: Timeout detected:
# SafepointSynchronize::begin: Timed out while waiting for threads to stop.
# SafepointSynchronize::begin: Threads which did not reach the safepoint:
# "topN_ad_statistics_hourly_realtime_[2016-10-10T04:00:00.000Z/2016-10-10T04:01:00.000Z]" #207 daemon prio=5 os_prio=0 tid=0x00007fa4fc014800 nid=0x24c04 runnable [0x00007fa499454000]
   java.lang.Thread.State: RUNNABLE
# "topN_ad_statistics_hourly_realtime_[2016-10-10T03:01:00.000Z/2016-10-10T04:00:00.000Z]" #206 daemon prio=5 os_prio=0 tid=0x00007fa508007800 nid=0x249d9 runnable [0x00007fa49a11f000]
   java.lang.Thread.State: RUNNABLE
# "qtp1679022055-154-acceptor-0@20b570aa-ServerConnector@2e728acc{HTTP/1.1}{0.0.0.0:8105}" #154 daemon prio=4 os_prio=0 tid=0x00007fa7ce651000 nid=0x2467e runnable [0x00007fa58a4e7000]
   java.lang.Thread.State: RUNNABLE
# "qtp1679022055-153-selector-ServerConnectorManager@6b4b9aa1/3" #153 daemon prio=5 os_prio=0 tid=0x00007fa7ce64f000 nid=0x2467d runnable [0x00007fa58a5e8000]
   java.lang.Thread.State: RUNNABLE
# "qtp1679022055-152-selector-ServerConnectorManager@6b4b9aa1/2" #152 daemon prio=5 os_prio=0 tid=0x00007fa7ce64d800 nid=0x2467c runnable [0x00007fa58a6e9000]
   java.lang.Thread.State: RUNNABLE
# "qtp1679022055-151-selector-ServerConnectorManager@6b4b9aa1/1" #151 daemon prio=5 os_prio=0 tid=0x00007fa7ce64b800 nid=0x2467b runnable [0x00007fa58a7ea000]
   java.lang.Thread.State: RUNNABLE
# "qtp1679022055-150-selector-ServerConnectorManager@6b4b9aa1/0" #150 daemon prio=5 os_prio=0 tid=0x00007fa7ce64b000 nid=0x2467a runnable [0x00007fa58a8eb000]
   java.lang.Thread.State: RUNNABLE
# "parent-monitor-0" #147 daemon prio=5 os_prio=0 tid=0x00007fa7ce621800 nid=0x24677 runnable [0x00007fa58aaed000]
   java.lang.Thread.State: RUNNABLE
# "main-SendThread(adx-pro-vdchcm-ants19-ip-10-199-0-19:2181)" #141 daemon prio=5 os_prio=0 tid=0x00007fa7ce5ef000 nid=0x245fb runnable [0x00007fa58b3f4000]
   java.lang.Thread.State: RUNNABLE
# SafepointSynchronize::begin: (End of list)**