Urgently need help getting my historical nodes up

Hi all,

I have had another Event (capital E) with my Druid cluster. I am not entirely sure what happened, but there is a problem with the communication between nodes. It appears that the coordinator is having trouble with the historical nodes.

The cluster is small: I have a single master node running the coordinator and overlord, 2 data nodes running middle managers and historical nodes, and 2 query nodes running brokers. The middle managers are spawning indexing tasks and the overlord can see them fine. However, there are many errors in the coordinator log: first about not having any servers at all, then about not having enough servers for the configured replication factor (2).

I do have 2 historical nodes, but tailing the logs, one appears to do nothing (and this is the one currently seen by the coordinator), while the other is in an endless loop: it loads all of its segment cache files, then it announces all of the segments, then it starts over.

So a bunch of these:

2016-09-09T17:39:02,221 INFO [main] io.druid.server.coordination.ZkCoordinator - Loading segment cache file [7552/64589][/services/druid/data/segment-cache/info_dir/sor_business_event_success_v1_2016-08-14T23:00:00.000Z_2016-08-15T00:00:00.000Z_2016-08-14T

Followed by a bunch of these:

2016-09-09T17:41:10,427 INFO [main] io.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing segment[business_events_test_2016-09-04T01:00:00.000Z_2016-09-04T02:00:00.000Z_2016-09-03T01:54:31.884Z_11] at path[/druid/segments/druid-data001.internal.shopkick.com:8083/druid-data001.internal.shopkick.com:8083_historical__default_tier_2016-09-09T17:41:10.381Z_e7b20ac9d0e544c187cb1187da8ac68b0]

Then:

2016-09-09T17:45:08,939 INFO [main] io.druid.server.coordination.ZkCoordinator$BackgroundSegmentAnnouncer - Completed background segment announcing

2016-09-09T17:45:08,940 INFO [main] io.druid.server.coordination.ZkCoordinator - Cache load took 63,227 ms

2016-09-09T17:45:08,946 INFO [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking start method[public void io.druid.query.lookup.LookupReferencesManager.start()] on object[io.druid.query.lookup.LookupReferencesManager@7135ce0a].

2016-09-09T17:45:08,946 INFO [main] io.druid.query.lookup.LookupReferencesManager - Started lookup factory references manager

2016-09-09T17:45:08,948 INFO [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking start method[public void io.druid.server.listener.announcer.ListenerResourceAnnouncer.start()] on object[io.druid.query.lookup.LookupResourceListenerAnnouncer@2e590b].

2016-09-09T17:45:08,968 INFO [main] io.druid.server.listener.announcer.ListenerResourceAnnouncer - Announcing start time on [/druid/listeners/lookups/__default/druid-data001.internal.shopkick.com:8083]

2016-09-09T17:45:08,969 INFO [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking start method[public void io.druid.server.coordination.AbstractDataSegmentAnnouncer.start()] on object[io.druid.server.coordination.BatchDataSegmentAnnouncer@2c18a3ea].

2016-09-09T17:45:08,969 INFO [main] io.druid.server.coordination.AbstractDataSegmentAnnouncer - Announcing self[DruidServerMetadata{name=‘druid-data001.internal.shopkick.com:8083’, host=‘druid-data001.internal.shopkick.com:8083’, maxSize=130000000000, tier=’_default_tier’, type=‘historical’, priority=‘0’}] at [/druid/announcements/druid-data001.internal.shopkick.com:8083]

2016-09-09T17:45:09,018 INFO [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - New request[LOAD: business_events_test_2016-09-11T02:00:00.000Z_2016-09-11T03:00:00.000Z_2016-09-07T02:08:30.019Z_8] with zNode[/druid/loadQueue/druid-data001.internal.shopkick.com:8083/business_events_test_2016-09-11T02:00:00.000Z_2016-09-11T03:00:00.000Z_2016-09-07T02:08:30.019Z_8].

2016-09-09T17:45:09,019 INFO [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Loading segment business_events_test_2016-09-11T02:00:00.000Z_2016-09-11T03:00:00.000Z_2016-09-07T02:08:30.019Z_8

2016-09-09T17:45:09,024 WARN [ZkCoordinator-0] io.druid.server.coordination.ServerManager - Told to load a adapter for a segment[business_events_test_2016-09-11T02:00:00.000Z_2016-09-11T03:00:00.000Z_2016-09-07T02:08:30.019Z_8] that already exists

2016-09-09T17:45:09,038 INFO [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Completed request [LOAD: business_events_test_2016-09-11T02:00:00.000Z_2016-09-11T03:00:00.000Z_2016-09-07T02:08:30.019Z_8]

2016-09-09T17:45:09,038 INFO [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - zNode[/druid/loadQueue/druid-data001.internal.shopkick.com:8083/business_events_test_2016-09-11T02:00:00.000Z_2016-09-11T03:00:00.000Z_2016-09-07T02:08:30.019Z_8] was removed

This is followed by a bunch of metrics, and then it looks like the process restarts: it logs all of the initialization stuff and starts over.

Help would be greatly appreciated.

Hey Ben,

Any idea why the process dies and restarts? That seems like the real problem. Some stuff to check for:

  1. Exceptions/errors in the historical jvm logs

  2. Errors in the logs of whatever is launching the historical (if you’re using Imply, this is bin/supervise)

  3. Errors from your OS (if Linux, one big one to look for is the oom-killer; Linux may be torching your process unceremoniously. Check dmesg / the kernel logs, e.g. with the commands below.)
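Something along these lines usually turns up oom-killer activity (the exact message text varies by kernel version):

  dmesg | grep -i -E 'killed process|out of memory'
  # on systemd machines the kernel ring buffer is also available via:
  journalctl -k | grep -i 'out of memory'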

Hi Gian,

Thanks for your help.

The only error in the historical jvm log is this one, and it doesn’t appear to trigger the shutdown as the process continues for a good while afterwards:

2016-09-09T17:06:50,714 ERROR [main] io.druid.server.coordination.ZkCoordinator - Failed to load segment from segmentInfo file: {class=io.druid.server.coordination.ZkCoordinator, exceptionType=class com.fasterxml.jackson.databind.JsonMappingException, exceptionMessage=No content to map due to end-of-input

at [Source: /services/druid/data/segment-cache/info_dir/business_events_success_2016-09-09T14:00:00.000Z_2016-09-09T15:00:00.000Z_2016-09-09T13:30:28.843Z; line: 1, column: 1], file=/services/druid/data/segment-cache/info_dir/business_events_success_2016-09-09T14:00:00.000Z_2016-09-09T15:00:00.000Z_2016-09-09T13:30:28.843Z}

com.fasterxml.jackson.databind.JsonMappingException: No content to map due to end-of-input

at [Source: /services/druid/data/segment-cache/info_dir/business_events_success_2016-09-09T14:00:00.000Z_2016-09-09T15:00:00.000Z_2016-09-09T13:30:28.843Z; line: 1, column: 1]

at com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:148) ~[jackson-databind-2.4.6.jar:2.4.6]

at com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3110) ~[jackson-databind-2.4.6.jar:2.4.6]

at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3051) ~[jackson-databind-2.4.6.jar:2.4.6]

at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2115) ~[jackson-databind-2.4.6.jar:2.4.6]

at io.druid.server.coordination.ZkCoordinator.loadLocalCache(ZkCoordinator.java:258) [druid-server-0.9.1.1.jar:0.9.1.1]

at io.druid.server.coordination.ZkCoordinator.start(ZkCoordinator.java:133) [druid-server-0.9.1.1.jar:0.9.1.1]

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_91]

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_91]

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_91]

at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_91]

at com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:350) [java-util-0.27.9.jar:?]

at com.metamx.common.lifecycle.Lifecycle.start(Lifecycle.java:259) [java-util-0.27.9.jar:?]

at io.druid.guice.LifecycleModule$2.start(LifecycleModule.java:155) [druid-api-0.9.1.1.jar:0.9.1.1]

at io.druid.cli.GuiceRunnable.initLifecycle(GuiceRunnable.java:91) [druid-services-0.9.1.1.jar:0.9.1.1]

at io.druid.cli.ServerRunnable.run(ServerRunnable.java:40) [druid-services-0.9.1.1.jar:0.9.1.1]

at io.druid.cli.Main.main(Main.java:105) [druid-services-0.9.1.1.jar:0.9.1.1]

2016-09-09T17:06:50,747 INFO [main] com.metamx.emitter.core.LoggingEmitter - Event [{“feed”:“alerts”,“timestamp”:“2016-09-09T17:06:50.725Z”,“service”:“druid/historical”,“host”:“druid-data002.internal.shopkick.com:8083”,“severity”:“component-failure”,“description”:“Failed to load segment from segmentInfo file”,“data”:{“class”:“io.druid.server.coordination.ZkCoordinator”,“exceptionType”:“com.fasterxml.jackson.databind.JsonMappingException”,“exceptionMessage”:“No content to map due to end-of-input\n at [Source: /services/druid/data/segment-cache/info_dir/business_events_success_2016-09-09T14:00:00.000Z_2016-09-09T15:00:00.000Z_2016-09-09T13:30:28.843Z; line: 1, column: 1]”,“exceptionStackTrace”:“com.fasterxml.jackson.databind.JsonMappingException: No content to map due to end-of-input\n at [Source: /services/druid/data/segment-cache/info_dir/business_events_success_2016-09-09T14:00:00.000Z_2016-09-09T15:00:00.000Z_2016-09-09T13:30:28.843Z; line: 1, column: 1]\n\tat com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:148)\n\tat com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3110)\n\tat com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3051)\n\tat com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2115)\n\tat io.druid.server.coordination.ZkCoordinator.loadLocalCache(ZkCoordinator.java:258)\n\tat io.druid.server.coordination.ZkCoordinator.start(ZkCoordinator.java:133)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:350)\n\tat com.metamx.common.lifecycle.Lifecycle.start(Lifecycle.java:259)\n\tat io.druid.guice.LifecycleModule$2.start(LifecycleModule.java:155)\n\tat io.druid.cli.GuiceRunnable.initLifecycle(GuiceRunnable.java:91)\n\tat io.druid.cli.ServerRunnable.run(ServerRunnable.java:40)\n\tat io.druid.cli.Main.main(Main.java:105)\n”,“file”:"/services/druid/data/segment-cache/info_dir/business_events_success_2016-09-09T14:00:00.000Z_2016-09-09T15:00:00.000Z_2016-09-09T13:30:28.843Z"}}]

I am using supervise, but I'm not sure where to look for its logs. Can you point me to the right place? In the sv directory I see only the historical and middle manager logs.

supervise logs to stdout/stderr, so where its own output ends up depends on whatever runs supervise itself.

That error doesn’t look like anything that should kill the node, btw.
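Also, "No content to map due to end-of-input" generally means the segment info file is empty (zero bytes). If you'd rather clear out just the bad entries than wipe the whole cache, something like this should list them (path taken from your log):

  find /services/druid/data/segment-cache/info_dir -type f -size 0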

Just to follow up. There was no sign of the oom-killer, and I still do not know why the process was dying since it did not log an error. However, I deleted the segment-cache on all of the historical nodes, restarted the processes, and they came back healthy.

–Ben

It appears that fix was only temporary. The supervise script is definitely showing historical node restarts:

[Sun Sep 11 19:36:48 2016] Running command[historical], logging to[/services/druid/druid-20160817.01/…/data/sv/historical.log]: /etc/druid/current/run-druid historical /services/druid/conf

[Sun Sep 11 19:36:48 2016] Running command[middleManager], logging to[/services/druid/druid-20160817.01/…/data/sv/middleManager.log]: /etc/druid/current/run-druid middleManager /services/druid/conf

[Sun Sep 11 21:25:41 2016] Command[historical] exited (pid = 47468, exited = 1)

[Sun Sep 11 21:25:41 2016] Command[historical] failed, see logfile for more details: /services/druid/druid-20160817.01/…/data/sv/historical.log

[Sun Sep 11 21:25:44 2016] Running command[historical], logging to[/services/druid/druid-20160817.01/…/data/sv/historical.log]: /etc/druid/current/run-druid historical /services/druid/conf

[Sun Sep 11 21:41:39 2016] Command[historical] exited (pid = 9191, exited = 1)

[Sun Sep 11 21:41:39 2016] Command[historical] failed, see logfile for more details: /services/druid/druid-20160817.01/…/data/sv/historical.log

[Sun Sep 11 21:41:42 2016] Running command[historical], logging to[/services/druid/druid-20160817.01/…/data/sv/historical.log]: /etc/druid/current/run-druid historical /services/druid/conf

[Sun Sep 11 22:00:55 2016] Command[historical] exited (pid = 24161, exited = 1)

[Sun Sep 11 22:00:55 2016] Command[historical] failed, see logfile for more details: /services/druid/druid-20160817.01/…/data/sv/historical.log

[Sun Sep 11 22:00:58 2016] Running command[historical], logging to[/services/druid/druid-20160817.01/…/data/sv/historical.log]: /etc/druid/current/run-druid historical /services/druid/conf

[Sun Sep 11 22:02:14 2016] Command[historical] exited (pid = 44688, exited = 1)

[Sun Sep 11 22:02:14 2016] Command[historical] failed, see logfile for more details: /services/druid/druid-20160817.01/…/data/sv/historical.log

[Sun Sep 11 22:02:17 2016] Running command[historical], logging to[/services/druid/druid-20160817.01/…/data/sv/historical.log]: /etc/druid/current/run-druid historical /services/druid/conf

[Sun Sep 11 22:03:33 2016] Command[historical] exited (pid = 47175, exited = 1)

[Sun Sep 11 22:03:33 2016] Command[historical] failed, see logfile for more details: /services/druid/druid-20160817.01/…/data/sv/historical.log

[Sun Sep 11 22:03:36 2016] Running command[historical], logging to[/services/druid/druid-20160817.01/…/data/sv/historical.log]: /etc/druid/current/run-druid historical /services/druid/conf

But the historical node log isn't very helpful. The only exceptions I see are the ones below; they occur for different data sources and for different segments within each data source:

2016-09-12T01:05:51,134 ERROR [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Failed to load segment for dataSource: {class=io.druid.server.coordination.ZkCoordinator, exceptionType=class io.druid.segment.loading.SegmentLoadingException, exceptionMessage=Exception loading segment[business_events_test_2016-10-24T15:00:00.000Z_2016-10-24T16:00:00.000Z_2016-09-11T15:12:33.459Z], segment=DataSegment{size=17665, shardSpec=NumberedShardSpec{partitionNum=0, partitions=0}, metrics=[count, transaction_kicks, transaction_cost, transaction_revenue, users, user_devices, locations], dimensions=[user_gender, user_social_facebook, user_registration_age_bracket, user_registration_age_day, user_registration_age_week, event_type, event_status, event_is_offline, user_country, client_app_version, client_ip_address, client_locale, client_os_name, client_user_agent, device_platform_type, device_name, app_id, user_id, device_id, client_event_time, client_request_time, server_request_time, user_registration_time, event_coordinates], version=‘2016-09-11T15:12:33.459Z’, loadSpec={type=hdfs, path=hdfs://hdfs001:8020/druid/segments/business_events_test/20161024T150000.000Z_20161024T160000.000Z/2016-09-11T15_12_33.459Z/0/index.zip}, interval=2016-10-24T15:00:00.000Z/2016-10-24T16:00:00.000Z, dataSource=‘business_events_test’, binaryVersion=‘9’}}

io.druid.segment.loading.SegmentLoadingException: Exception loading segment[business_events_test_2016-10-24T15:00:00.000Z_2016-10-24T16:00:00.000Z_2016-09-11T15:12:33.459Z]

at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:309) ~[druid-server-0.9.1.1.jar:0.9.1.1]

at io.druid.server.coordination.ZkCoordinator.addSegment(ZkCoordinator.java:350) [druid-server-0.9.1.1.jar:0.9.1.1]

at io.druid.server.coordination.SegmentChangeRequestLoad.go(SegmentChangeRequestLoad.java:44) [druid-server-0.9.1.1.jar:0.9.1.1]

at io.druid.server.coordination.ZkCoordinator$1.childEvent(ZkCoordinator.java:152) [druid-server-0.9.1.1.jar:0.9.1.1]

at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:522) [curator-recipes-2.10.0.jar:?]

at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:516) [curator-recipes-2.10.0.jar:?]

at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93) [curator-framework-2.10.0.jar:?]

at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) [guava-16.0.1.jar:?]

at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85) [curator-framework-2.10.0.jar:?]

at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:514) [curator-recipes-2.10.0.jar:?]

at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35) [curator-recipes-2.10.0.jar:?]

at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:772) [curator-recipes-2.10.0.jar:?]

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_91]

at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_91]

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_91]

at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_91]

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_91]

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_91]

at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]

Caused by: io.druid.segment.loading.SegmentLoadingException: /services/druid/data/segment-cache/business_events_test/2016-10-24T15:00:00.000Z_2016-10-24T16:00:00.000Z/2016-09-11T15:12:33.459Z/0/index.drd (No such file or directory)

at io.druid.segment.loading.MMappedQueryableIndexFactory.factorize(MMappedQueryableIndexFactory.java:52) ~[druid-server-0.9.1.1.jar:0.9.1.1]

at io.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegment(SegmentLoaderLocalCacheManager.java:96) ~[druid-server-0.9.1.1.jar:0.9.1.1]

at io.druid.server.coordination.ServerManager.loadSegment(ServerManager.java:152) ~[druid-server-0.9.1.1.jar:0.9.1.1]

at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:305) ~[druid-server-0.9.1.1.jar:0.9.1.1]

… 18 more

Caused by: java.io.FileNotFoundException: /services/druid/data/segment-cache/business_events_test/2016-10-24T15:00:00.000Z_2016-10-24T16:00:00.000Z/2016-09-11T15:12:33.459Z/0/index.drd (No such file or directory)

at java.io.FileInputStream.open0(Native Method) ~[?:1.8.0_91]

at java.io.FileInputStream.open(FileInputStream.java:195) ~[?:1.8.0_91]

at java.io.FileInputStream.<init>(FileInputStream.java:138) ~[?:1.8.0_91]

at io.druid.segment.SegmentUtils.getVersionFromDir(SegmentUtils.java:43) ~[druid-api-0.9.1.1.jar:0.9.1.1]

at io.druid.segment.IndexIO.loadIndex(IndexIO.java:211) ~[druid-processing-0.9.1.1.jar:0.9.1.1]

at io.druid.segment.loading.MMappedQueryableIndexFactory.factorize(MMappedQueryableIndexFactory.java:49) ~[druid-server-0.9.1.1.jar:0.9.1.1]

at io.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegment(SegmentLoaderLocalCacheManager.java:96) ~[druid-server-0.9.1.1.jar:0.9.1.1]

at io.druid.server.coordination.ServerManager.loadSegment(ServerManager.java:152) ~[druid-server-0.9.1.1.jar:0.9.1.1]

at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:305) ~[druid-server-0.9.1.1.jar:0.9.1.1]

… 18 more

2016-09-12T01:07:21,065 ERROR [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Failed to load segment for dataSource: {class=io.druid.server.coordination.ZkCoordinator, exceptionType=class io.druid.segment.loading.SegmentLoadingException, exceptionMessage=Exception loading segment[business_events_2016-10-24T15:00:00.000Z_2016-10-24T16:00:00.000Z_2016-09-11T15:12:32.883Z], segment=DataSegment{size=18044, shardSpec=NumberedShardSpec{partitionNum=0, partitions=0}, metrics=[count, transaction_kicks, transaction_cost, transaction_revenue, user_id_sketch, unique_user_id, unique_user_device_id, unique_location_id], dimensions=[event_is_offline, user_social_facebook, user_gender, user_registration_age_bracket, event_type, event_status, user_registration_age_day, user_registration_age_week, user_country, client_app_version, client_locale, client_os_name, client_user_agent, device_platform_type, device_name, app_id, client_ip_address, user_id, device_id, client_event_time, client_request_time, server_request_time, user_registration_time, event_coordinates], version=‘2016-09-11T15:12:32.883Z’, loadSpec={type=hdfs, path=hdfs://hdfs001:8020/druid/segments/business_events/20161024T150000.000Z_20161024T160000.000Z/2016-09-11T15_12_32.883Z/0/index.zip}, interval=2016-10-24T15:00:00.000Z/2016-10-24T16:00:00.000Z, dataSource=‘business_events’, binaryVersion=‘9’}}

io.druid.segment.loading.SegmentLoadingException: Exception loading segment[business_events_2016-10-24T15:00:00.000Z_2016-10-24T16:00:00.000Z_2016-09-11T15:12:32.883Z]

at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:309) ~[druid-server-0.9.1.1.jar:0.9.1.1]

at io.druid.server.coordination.ZkCoordinator.addSegment(ZkCoordinator.java:350) [druid-server-0.9.1.1.jar:0.9.1.1]

at io.druid.server.coordination.SegmentChangeRequestLoad.go(SegmentChangeRequestLoad.java:44) [druid-server-0.9.1.1.jar:0.9.1.1]

at io.druid.server.coordination.ZkCoordinator$1.childEvent(ZkCoordinator.java:152) [druid-server-0.9.1.1.jar:0.9.1.1]

at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:522) [curator-recipes-2.10.0.jar:?]

at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:516) [curator-recipes-2.10.0.jar:?]

at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93) [curator-framework-2.10.0.jar:?]

at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) [guava-16.0.1.jar:?]

at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85) [curator-framework-2.10.0.jar:?]

at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:514) [curator-recipes-2.10.0.jar:?]

at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35) [curator-recipes-2.10.0.jar:?]

at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:772) [curator-recipes-2.10.0.jar:?]

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_91]

at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_91]

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_91]

at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_91]

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_91]

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_91]

at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]

Caused by: io.druid.segment.loading.SegmentLoadingException: /services/druid/data/segment-cache/business_events/2016-10-24T15:00:00.000Z_2016-10-24T16:00:00.000Z/2016-09-11T15:12:32.883Z/0/index.drd (No such file or directory)

at io.druid.segment.loading.MMappedQueryableIndexFactory.factorize(MMappedQueryableIndexFactory.java:52) ~[druid-server-0.9.1.1.jar:0.9.1.1]

at io.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegment(SegmentLoaderLocalCacheManager.java:96) ~[druid-server-0.9.1.1.jar:0.9.1.1]

at io.druid.server.coordination.ServerManager.loadSegment(ServerManager.java:152) ~[druid-server-0.9.1.1.jar:0.9.1.1]

at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:305) ~[druid-server-0.9.1.1.jar:0.9.1.1]

… 18 more

Caused by: java.io.FileNotFoundException: /services/druid/data/segment-cache/business_events/2016-10-24T15:00:00.000Z_2016-10-24T16:00:00.000Z/2016-09-11T15:12:32.883Z/0/index.drd (No such file or directory)

at java.io.FileInputStream.open0(Native Method) ~[?:1.8.0_91]

at java.io.FileInputStream.open(FileInputStream.java:195) ~[?:1.8.0_91]

at java.io.FileInputStream.<init>(FileInputStream.java:138) ~[?:1.8.0_91]

at io.druid.segment.SegmentUtils.getVersionFromDir(SegmentUtils.java:43) ~[druid-api-0.9.1.1.jar:0.9.1.1]

at io.druid.segment.IndexIO.loadIndex(IndexIO.java:211) ~[druid-processing-0.9.1.1.jar:0.9.1.1]

at io.druid.segment.loading.MMappedQueryableIndexFactory.factorize(MMappedQueryableIndexFactory.java:49) ~[druid-server-0.9.1.1.jar:0.9.1.1]

at io.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegment(SegmentLoaderLocalCacheManager.java:96) ~[druid-server-0.9.1.1.jar:0.9.1.1]

at io.druid.server.coordination.ServerManager.loadSegment(ServerManager.java:152) ~[druid-server-0.9.1.1.jar:0.9.1.1]

at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:305) ~[druid-server-0.9.1.1.jar:0.9.1.1]

… 18 more

I have plenty of disk space on the host, but memory is running low.

My Historical jvm config is:

[ben@druid-data001 druid]$ cat /services/druid/conf/historical/jvm.config

-server

-Xms8g

-Xmx12g

-XX:MaxDirectMemorySize=4096m

-Duser.timezone=UTC

-Dfile.encoding=UTF-8

-Djava.io.tmpdir=/services/druid/data/tmp

-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

-Dservicename=historical

Although, according to top, the historical Java process is using 17GB of RAM and 80GB of virtual memory.

The historical process using a lot of virtual memory is typical, but most of that should be disk-backed and not really problematic from an OOM perspective.
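If you want to sanity-check that, comparing the resident set against the total mapped size for the historical pid shows how much is actually sitting in RAM, for example (<historical-pid> is a placeholder for the actual pid):

  grep -E 'VmRSS|VmSize' /proc/<historical-pid>/status
  # or: pmap -x <historical-pid> | tail -n 1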

This sort of “caused by” is pretty weird though: “Caused by: java.io.FileNotFoundException: /services/druid/data/segment-cache/business_events/2016-10-24T15:00:00.000Z_2016-10-24T16:00:00.000Z/2016-09-11T15:12:32.883Z/0/index.drd (No such file or directory)”.

Is it possible that you have multiple historical node processes running on the same machine or filesystem, and they are clobbering each other's files?

Or is it possible that one of your mount points is full even though the disk overall has free space?
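A quick way to check that per mount point:

  df -h     # free space per filesystem
  df -i     # free inodes per filesystem (exhausting inodes can also make file writes fail)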

What you're describing sounds a lot like either the oom-killer or a malloc failure, both of which can happen in memory-constrained environments. How much RAM do you have in the box? With the config you have, -Xmx12g -XX:MaxDirectMemorySize=4096m, you need 16-20GB free for the historical process to be able to stay up.
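Roughly, that estimate breaks down as (the overhead figure is a ballpark, not an exact number):

  12 GB heap (-Xmx12g)
  + 4 GB direct memory (-XX:MaxDirectMemorySize=4096m)
  + a few GB of general JVM overhead (metaspace, thread stacks, GC bookkeeping)
  = roughly 16-20 GB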

If those are what’s happening, I’m surprised you don’t see evidence of either of them in the logs, but maybe they’re happening and getting logged in strange places, or maybe they aren’t being logged for some reason (?).

I just realized from another thread you posted that you have "-XX:OnOutOfMemoryError=kill -9 %p" in your jvm opts. This would probably chomp logs about an OOME that occurred. So I think you are experiencing some form of OOME or malloc failure, and your historical is killing itself without logging that it's doing so.

I would suggest using -XX:+ExitOnOutOfMemoryError rather than "-XX:OnOutOfMemoryError=kill -9 %p". This will exit more cleanly and not chomp logs. It's only available in the last couple of JDK 8 releases, so you might have to update first. After that, the logs showing the OOME condition will help debug further.
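Concretely, in whichever jvm.config or javaOpts currently carries the kill flag, the change would look like this (if I remember right, the new flag arrived in 8u92):

  replace:  -XX:OnOutOfMemoryError=kill -9 %p
  with:     -XX:+ExitOnOutOfMemoryError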

I will upgrade my version of java and see if I can get some more useful logs. I have also reduced the number of workers to 8 to see if the cluster stabilizes.

I am running Java 1.8.0_91, so it appears I don't have the ExitOnOutOfMemoryError option. However, I am not using OnOutOfMemoryError=kill -9 on the historical command line - only for my peons. The host has 64GB of RAM, but I am running the middle manager there as well - I was running 10 slots with Xmx 4g. That should still have left ~24GB for the historical process, as the middle manager uses little memory.

If those peons are running realtime tasks, watch out for their processing buffers too. They are off-heap direct memory buffers, and can’t be paged out.
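If I remember the 0.9.x config layout right, those buffers are sized by the properties the middle manager forwards to its peons, e.g. in middleManager/runtime.properties (property names as in the docs; the values here are only illustrative):

  druid.indexer.fork.property.druid.processing.numThreads=2
  druid.indexer.fork.property.druid.processing.buffer.sizeBytes=536870912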

Also, you might want to check whether swap space is set up on the machine. If it is not, the Java process may be killed by the OS when it memory-maps files that do not fit in memory.
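A quick check (output format varies a bit by distro):

  swapon -s     # lists active swap devices; no output means no swap configured
  free -m       # the last row shows total/used/free swap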

Hi Gian,

I upgraded Java and now I am getting crash dumps every 2 minutes (see attached). I also reduced the RAM allocation for the historical process from 12g to 8g and reduced the number of middle manager workers to 8. The host still has free memory and swap, but the historical process is constantly hitting OOM. What is the expected behavior of the historical service? Do I need enough memory to keep all of my data sources in RAM in their entirety? My data sources are very small at the moment, but I have only 2 historical nodes in this cluster. I was expecting that anything the historical node couldn't keep in memory would just be cached on disk.

Thanks for your help.

–Ben

historical_pid_5091.log.gz (1.88 MB)

You don't need memory to store your data; it's memory-mapped and the OS can page it in and out as necessary. But you do need enough memory for your heaps plus any off-heap direct memory buffers on your data nodes. In total, for each process, that's going to be:

heap (all nodes) = Xmx

offheap buffers (historical, peons) = druid.processing.buffer.sizeBytes * (druid.processing.numThreads + 1)

It sounds like you do have enough memory for your heaps, but you might not have enough memory for your offheap buffers. Please double-check that.

On a 64g host I have:

Historical xmx 8g + 4g = 12GB

Middle manager xmx 64mb

peons xmx 4g * 8 workers = 32 GB

peons buffer 512mb * 2 threads * 8 workers = 8GB

Total: 52GB

That matches what I see in top: I see 12GB free, but I am constantly seeing OOM errors.

–Ben

Actually, I just re-read your email and I see that it should be the number of processing threads + 1 for peons, which would mean 12GB allocated to them. That would still leave 8GB free if all the worker slots were occupied - they aren't right now, which is why I see 12GB free.

Another data point is that the coordinator process is going crazy CPU-wise. It is using 3000-4000% CPU. It has also run into an OOM error at least once since I upgraded to java 1.8.0_101 (even though the host had 25GB RAM free, see attached log). I presume it is because the historical nodes are bouncing up and down so it is trying to rebalance segments.

From the coordinator log:

2016-09-13T10:01:13,106 INFO [Coordinator-Exec–0] io.druid.server.coordinator.LoadQueuePeon - Asking server peon[/druid/loadQueue/druid-data001.internal.shopkick.com:8083] to load segment[sor_business_event_success_v1_2016-08-30T01:00:00.000Z_2016-08-30T02:00:00.000Z_2016-08-30T00:53:27.087Z_14]

2016-09-13T10:01:13,108 INFO [Coordinator-Exec–0] io.druid.server.coordinator.LoadQueuePeon - Server[/druid/loadQueue/druid-data001.internal.shopkick.com:8083] skipping doNext() because something is currently loading[business_events_2016-09-13T05:00:00.000Z_2016-09-13T06:00:00.000Z_2016-09-13T02:03:21.100Z_1].

2016-09-13T10:01:31,450 INFO [Coordinator-Exec–0] io.druid.server.coordinator.LoadQueuePeon - Asking server peon[/druid/loadQueue/druid-data001.internal.shopkick.com:8083] to load segment[sor_business_event_success_v1_2016-08-30T01:00:00.000Z_2016-08-30T02:00:00.000Z_2016-08-30T00:53:27.087Z_12]

2016-09-13T10:01:31,450 INFO [Coordinator-Exec–0] io.druid.server.coordinator.LoadQueuePeon - Server[/druid/loadQueue/druid-data001.internal.shopkick.com:8083] skipping doNext() because something is currently loading[business_events_2016-09-13T05:00:00.000Z_2016-09-13T06:00:00.000Z_2016-09-13T02:03:21.100Z_1].

2016-09-13T10:01:41,377 INFO [Coordinator-Exec–0] io.druid.server.coordinator.LoadQueuePeon - Asking server peon[/druid/loadQueue/druid-data001.internal.shopkick.com:8083] to load segment[sor_business_event_success_v1_2016-08-30T01:00:00.000Z_2016-08-30T02:00:00.000Z_2016-08-30T00:53:27.087Z_11]

2016-09-13T10:01:41,377 INFO [Coordinator-Exec–0] io.druid.server.coordinator.LoadQueuePeon - Server[/druid/loadQueue/druid-data001.internal.shopkick.com:8083] skipping doNext() because something is currently loading[business_events_2016-09-13T05:00:00.000Z_2016-09-13T06:00:00.000Z_2016-09-13T02:03:21.100Z_1].

Aborting due to java.lang.OutOfMemoryError: GC overhead limit exceeded

coordinator_pid_41024.log.gz (32.2 KB)

historical_pid_58328.log.gz (1.88 MB)

Hi Gian,

Is there anything else I can get you to help diagnose this? I tried shutting down the middle manager so that the only thing running was the historical node and I upped the Xmx to 16g with no change in behavior (still crashing with OOM error). My data source is small, only 90GB, and there is 130GB available on each historical node. I find it hard to believe that it is stressing out the cluster to this extent.

Thanks for your help,
Ben

Hi, I didn't read the whole thread, but I have a small suggestion:
have you tried increasing MaxDirectMemorySize to something much larger, e.g. -XX:MaxDirectMemorySize=10240g?