Segments are written to HDFS, but the indexing tasks continue to run even after handoff

I'm running Tranquility to feed a Druid cluster, and I've followed all the recommended configurations.

Segments are written to HDFS after segmentGranularity + windowPeriod + a few minutes.

But the indexing tasks (peons) never stop, even after days. At some point, all workers are in the running state and no new tasks are created.

Can anyone help?

Hi

As you know, an indexing task should finish after (segmentGranularity + windowPeriod + some margin time).

And a new task will be created for each segmentGranularity interval.

So: what is your segmentGranularity value?

Are there any pending tasks?

And what do the MiddleManager logs show?

Thanks.
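One quick way to answer the pending/running-task questions is the Overlord's HTTP API. A sketch, assuming the Overlord listens on localhost:8090 (adjust host/port for your cluster):

```shell
# List tasks the Overlord has queued but not yet assigned to a worker
curl -s http://localhost:8090/druid/indexer/v1/pendingTasks

# List tasks currently running on MiddleManager peons
curl -s http://localhost:8090/druid/indexer/v1/runningTasks

# Status of a single task, using a task id as reported by Tranquility
curl -s "http://localhost:8090/druid/indexer/v1/task/<task-id>/status"
```

The `<task-id>` is a placeholder; use one of the `index_realtime_...` ids from the Overlord console or logs.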

On Wednesday, March 7, 2018 at 10:41:55 PM UTC+9, Cheikh Sow wrote:

Hi,
I know what you said about indexing tasks; my segmentGranularity is 1H and my windowPeriod is PT10M.
New tasks are created, and at times there are around 30 pending tasks.
Here is the important information I get from the logs.
Historical nodes at startup:

```
/opt/application/sdr/logs/druid/historical.8.log:15764: 2018-03-07T15:28:50,156 WARN [main] org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
/opt/application/sdr/logs/druid/historical.8.log:15769: 2018-03-07T15:28:51,143 WARN [main] org.apache.curator.retry.ExponentialBackoffRetry - maxRetries too large (30). Pinning to 29
/opt/application/sdr/logs/druid/historical.8.log:15960: 2018-03-07T15:29:00,137 INFO [main-SendThread(xxxxxx:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server xxxxxx/xx.xxx.xxx.xx:2181. Will not attempt to authenticate using SASL (unknown error)
/opt/application/sdr/logs/druid/historical.8.log:15965: 2018-03-07T15:29:00,139 INFO [main] io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking start method[public void io.druid.server.coordination.ZkCoordinator.start() throws java.io.IOException] on object[io.druid.server.coordination.ZkCoordinator@601d6622].
/opt/application/sdr/logs/druid/historical.8.log:15978: 2018-03-07T15:29:00,313 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Registering io.druid.server.initialization.jetty.CustomExceptionMapper as a provider class
/opt/application/sdr/logs/druid/historical.8.log:15979: 2018-03-07T15:29:00,313 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Registering io.druid.server.initialization.jetty.ForbiddenExceptionMapper as a provider class
/opt/application/sdr/logs/druid/historical.8.log:15982: 2018-03-07T15:29:00,368 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Binding io.druid.server.initialization.jetty.CustomExceptionMapper to GuiceManagedComponentProvider with the scope "Singleton"
/opt/application/sdr/logs/druid/historical.8.log:15983: 2018-03-07T15:29:00,370 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Binding io.druid.server.initialization.jetty.ForbiddenExceptionMapper to GuiceManagedComponentProvider with the scope "Singleton"
/opt/application/sdr/logs/druid/historical.8.log:15994: 2018-03-07T15:29:00,630 WARN [main] com.sun.jersey.spi.inject.Errors - The following warnings have been detected with resource and/or provider classes:
/opt/application/sdr/logs/druid/historical.8.log:15995:  WARNING: A HTTP GET method, public void io.druid.server.http.SegmentListerResource.getSegments(long,long,long,javax.servlet.http.HttpServletRequest) throws java.io.IOException, MUST return a non-void type.
/opt/application/sdr/logs/druid/historical.8.log:16004: 2018-03-07T15:29:00,650 INFO [main] io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking start method[public void io.druid.curator.discovery.ServerDiscoverySelector.start() throws java.lang.Exception] on object[io.druid.curator.discovery.ServerDiscoverySelector@510da778].
/opt/application/sdr/logs/druid/historical.8.log:16035: 2018-03-07T15:29:00,784 WARN [main] io.druid.query.lookup.LookupReferencesManager - No lookups found for tier [__default], response [com.metamx.http.client.response.FullResponseHolder@15d65fcf]
```

For the peon logs, here is everything from a task launching to the alertTimeout I've set (5400 seconds = 1 h 30 min). The segment for that interval (22:00:00 - 23:00:00) is correctly handed off to HDFS, as are the segments for the other intervals:

```
/opt/application/sdr/logs/druid/peons.1.log:1418: 2018-03-07T22:00:03,237 WARN [main] org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
/opt/application/sdr/logs/druid/peons.1.log:1424: 2018-03-07T22:00:04,276 WARN [main] org.apache.curator.retry.ExponentialBackoffRetry - maxRetries too large (30). Pinning to 29
/opt/application/sdr/logs/druid/peons.1.log:1608: 2018-03-07T22:00:05,384 INFO [main] io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking start method[public void io.druid.curator.discovery.ServerDiscoverySelector.start() throws java.lang.Exception] on object[io.druid.curator.discovery.ServerDiscoverySelector@d36c1c3].
/opt/application/sdr/logs/druid/peons.1.log:1609: 2018-03-07T22:00:05,384 INFO [main-SendThread(xxxxxx:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server xxxxxx/xx.xx.xx.xx:2181. Will not attempt to authenticate using SASL (unknown error)
/opt/application/sdr/logs/druid/peons.1.log:1641: 2018-03-07T22:00:05,578 WARN [main] io.druid.query.lookup.LookupReferencesManager - No lookups found for tier [__default], response [com.metamx.http.client.response.FullResponseHolder@64a1116a]
/opt/application/sdr/logs/druid/peons.1.log:1647: 2018-03-07T22:00:05,581 INFO [main] io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking start method[public void io.druid.curator.discovery.ServerDiscoverySelector.start() throws java.lang.Exception] on object[io.druid.curator.discovery.ServerDiscoverySelector@1544ded3].
/opt/application/sdr/logs/druid/peons.1.log:1653: 2018-03-07T22:00:05,589 INFO [main] io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking start method[public void io.druid.indexing.worker.executor.ExecutorLifecycle.start() throws java.lang.InterruptedException] on object[io.druid.indexing.worker.executor.ExecutorLifecycle@5203c80f].
/opt/application/sdr/logs/druid/peons.1.log:1765:  "reportParseExceptions" : false,
/opt/application/sdr/logs/druid/peons.1.log:1790: 2018-03-07T22:00:05,739 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Registering io.druid.server.initialization.jetty.CustomExceptionMapper as a provider class
/opt/application/sdr/logs/druid/peons.1.log:1791: 2018-03-07T22:00:05,739 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Registering io.druid.server.initialization.jetty.ForbiddenExceptionMapper as a provider class
/opt/application/sdr/logs/druid/peons.1.log:1794: 2018-03-07T22:00:05,839 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Binding io.druid.server.initialization.jetty.CustomExceptionMapper to GuiceManagedComponentProvider with the scope "Singleton"
/opt/application/sdr/logs/druid/peons.1.log:1795: 2018-03-07T22:00:05,841 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Binding io.druid.server.initialization.jetty.ForbiddenExceptionMapper to GuiceManagedComponentProvider with the scope "Singleton"
/opt/application/sdr/logs/druid/peons.1.log:1814: 2018-03-07T22:00:06,120 WARN [main] com.sun.jersey.spi.inject.Errors - The following warnings have been detected with resource and/or provider classes:
/opt/application/sdr/logs/druid/peons.1.log:1815:  WARNING: A HTTP GET method, public void io.druid.server.http.SegmentListerResource.getSegments(long,long,long,javax.servlet.http.HttpServletRequest) throws java.io.IOException, MUST return a non-void type.
/opt/application/sdr/logs/druid/peons.1.log:16254: 2018-03-07T23:30:05,647 ERROR [RealtimeIndexTask-Timer] io.druid.indexing.common.task.RealtimeIndexTask - RealtimeIndexTask for dataSource [wassup-logs] hasn't finished in configured time [5400000] ms.: {class=io.druid.indexing.common.task.RealtimeIndexTask}
```

I also have these logs from Tranquility:

```
/opt/application/sdr/logs/druid/tranquility.log:30952:2018-03-08 05:01:03,040 [ClusteredBeam-ZkFuturePool-cb964f2f-6dc9-4f62-8d91-732da8cc10f8] WARN c.m.tranquility.druid.DruidBeamMaker - DruidTuning key[rejectionPolicy] for task[index_realtime_wassup-logs_2018-03-08T05:00:00.000Z_2_0] overridden from[Map(type -> serverTime)] to[Map(type -> none)].
/opt/application/sdr/logs/druid/tranquility.log:30953:2018-03-08 05:01:03,040 [ClusteredBeam-ZkFuturePool-cb964f2f-6dc9-4f62-8d91-732da8cc10f8] WARN c.m.tranquility.druid.DruidBeamMaker - DruidTuning key[windowPeriod] for task[index_realtime_wassup-logs_2018-03-08T05:00:00.000Z_2_0] overridden from[PT10m] to[PT10M].
/opt/application/sdr/logs/druid/tranquility.log:31071:2018-03-08 05:01:04,970 [Hashed wheel timer #1] WARN c.m.tranquility.beam.ClusteredBeam - Emitting alert: [anomaly] Failed to propagate events: druid:overlord/wassup-logs
/opt/application/sdr/logs/druid/tranquility.log:31080:2018-03-08 05:01:05,540 [Hashed wheel timer #1] WARN c.m.tranquility.beam.ClusteredBeam - Emitting alert: [anomaly] Failed to propagate events: druid:overlord/wassup-logs
/opt/application/sdr/logs/druid/tranquility.log:31089:2018-03-08 05:01:08,852 [Hashed wheel timer #1] WARN c.m.tranquility.beam.ClusteredBeam - Emitting alert: [anomaly] Failed to propagate events: druid:overlord/wassup-logs
/opt/application/sdr/logs/druid/tranquility.log:31098:2018-03-08 05:01:15,410 [Hashed wheel timer #1] WARN c.m.tranquility.beam.ClusteredBeam - Emitting alert: [anomaly] Failed to propagate events: druid:overlord/wassup-logs
```

And on the Overlord, when I manually stop the stuck tasks, I see these logs:

```
/opt/application/sdr/logs/druid/overlord.log:90: 2018-03-08T08:08:03,252 WARN [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Unknown task completed: index_realtime_wassup-logs_2018-03-07T17:00:00.000Z_2_0
/opt/application/sdr/logs/druid/overlord.log:296: 2018-03-08T08:08:07,205 WARN [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Unknown task completed: index_realtime_wassup-logs_2018-03-07T18:00:00.000Z_0_0
/opt/application/sdr/logs/druid/overlord.log:496: 2018-03-08T08:08:10,671 WARN [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Unknown task completed: index_realtime_wassup-logs_2018-03-07T18:00:00.000Z_1_0
/opt/application/sdr/logs/druid/overlord.log:785: 2018-03-08T08:08:15,781 WARN [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Unknown task completed: index_realtime_wassup-logs_2018-03-07T18:00:00.000Z_2_0
/opt/application/sdr/logs/druid/overlord.log:937: 2018-03-08T08:08:19,689 WARN [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Unknown task completed: index_realtime_wassup-logs_2018-03-07T19:00:00.000Z_0_0
/opt/application/sdr/logs/druid/overlord.log:1179: 2018-03-08T08:08:24,206 WARN [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Unknown task completed: index_realtime_wassup-logs_2018-03-07T19:00:00.000Z_1_0
/opt/application/sdr/logs/druid/overlord.log:1368: 2018-03-08T08:08:28,703 WARN [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Unknown task completed: index_realtime_wassup-logs_2018-03-07T19:00:00.000Z_2_0
/opt/application/sdr/logs/druid/overlord.log:1427: 2018-03-08T08:08:33,313 WARN [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Unknown task completed: index_realtime_wassup-logs_2018-03-07T20:00:00.000Z_0_0
/opt/application/sdr/logs/druid/overlord.log:1478: 2018-03-08T08:08:39,173 WARN [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Unknown task completed: index_realtime_wassup-logs_2018-03-07T20:00:00.000Z_1_0
/opt/application/sdr/logs/druid/overlord.log:1498: 2018-03-08T08:08:44,542 WARN [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Unknown task completed: index_realtime_wassup-logs_2018-03-07T20:00:00.000Z_2_0
/opt/application/sdr/logs/druid/overlord.log:1517: 2018-03-08T08:08:49,268 WARN [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Unknown task completed: index_realtime_wassup-logs_2018-03-07T21:00:00.000Z_0_0
/opt/application/sdr/logs/druid/overlord.log:1536: 2018-03-08T08:08:53,264 WARN [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Unknown task completed: index_realtime_wassup-logs_2018-03-07T21:00:00.000Z_1_0
/opt/application/sdr/logs/druid/overlord.log:1556: 2018-03-08T08:09:03,707 WARN [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Unknown task completed: index_realtime_wassup-logs_2018-03-07T21:00:00.000Z_2_0
/opt/application/sdr/logs/druid/overlord.log:1635: 2018-03-08T08:09:08,534 WARN [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Unknown task completed: index_realtime_wassup-logs_2018-03-07T22:00:00.000Z_0_0
/opt/application/sdr/logs/druid/overlord.log:1654: 2018-03-08T08:09:14,863 WARN [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Unknown task completed: index_realtime_wassup-logs_2018-03-07T22:00:00.000Z_1_0
/opt/application/sdr/logs/druid/overlord.log:1674: 2018-03-08T08:09:23,913 WARN [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Unknown task completed: index_realtime_wassup-logs_2018-03-07T22:00:00.000Z_2_0
```

To repeat: the running tasks hand their segments off to HDFS correctly. The problem is that they are never killed, and new tasks keep piling up until all the workers are occupied and the number of pending tasks grows and grows.
I suspect the problem may come from Curator, with communication between the services failing.
Here are the relevant properties for ZooKeeper and Curator service communication. These are set in common.runtime.properties on each node:

```
druid.zk.paths.base=/opt/application/sdr/run/druid
druid.discovery.curator.path=/opt/application/sdr/run/druid/discovery
```
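With those settings, the corresponding znodes should actually exist in ZooKeeper. A quick way to verify from any node is the ZooKeeper CLI; a sketch, with `<zk-host>` as a placeholder for one of your quorum servers:

```shell
# List the znodes Druid should be writing under the configured base paths
zkCli.sh -server <zk-host>:2181 <<'EOF'
ls /opt/application/sdr/run/druid
ls /opt/application/sdr/run/druid/discovery
EOF
```

If nothing exists under those paths but the cluster otherwise works, one possibility is that some services are not reading the same common.runtime.properties and are falling back to the default `/druid` base path.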

I'm using druid-0.11.0 and tranquility-distribution-0.8.2. I have also noticed that nothing is written under the configured zk.paths.base and discovery.curator.path on any node, even though the applications certainly have the rights to access and write there. I'm also running the applications as the root user.

I really need help with this.

Thanks,

Cheikh Sow

Hi

I think you should check the data flow from realtime ingestion to the Historical nodes, because there are a lot of checkpoints along the way.

First of all, while a task is unfinished, I think you cannot expect query results for timestamps between its start time and end time, because the full ingestion pipeline (from realtime ingestion to the Historical node) has not completed for that task.

You said the segments were stored in HDFS (deep storage). As you know, after that the segment metadata is stored in the metadata DB (MySQL or similar), and the next checkpoint is the Coordinator, and so on.

Can you check further?

Thanks.

Hi Josh,

Yes, I know all of that, and I'll check further.

But yesterday I noticed that I can only query segments whose timestamps fall within a running realtime task.

When I kill a realtime task (its segments are in HDFS), I can no longer query the corresponding data.

What I mean is that historical data is no longer queryable, so the problem should come from the Historical nodes, not the realtime side.

Thanks

Hi

While segments are held by a realtime task they are queryable there, and once they are loaded on a Historical node they are queryable there too.

I think that in your case the segments from the realtime tasks are not being loaded onto the Historical nodes.

So if you kill a realtime task, the segments it held are no longer queryable.

A segment only reaches a Historical node after going through: realtime task > HDFS > MySQL > Coordinator > ZooKeeper > [Historical node].

And only once its segments are loaded on a Historical node does the realtime task finish.

So I think we should check [MySQL > Coordinator > ZooKeeper].

Thanks.

Josh.
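A concrete way to check the MySQL step of that chain is to look at the druid_segments table in the metadata store. A sketch, assuming the default table name, a MySQL user `druid`, and a database `druid` (adjust credentials/host to your setup):

```shell
# Has segment metadata for the handed-off intervals been written, and is it
# still marked as used? (used = 0 means the Coordinator will never tell a
# Historical node to load that segment)
mysql -u druid -p druid <<'SQL'
SELECT id, `start`, `end`, used, created_date
FROM druid_segments
WHERE dataSource = 'wassup-logs'
ORDER BY created_date DESC
LIMIT 20;
SQL
```

If rows exist but `used` is 0 for everything, the problem is on the metadata/Coordinator side rather than in the realtime tasks themselves.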

On Friday, March 9, 2018 at 4:08:30 PM UTC+9, Cheikh Sow wrote:

Ok, thanks!

Then how can we do that? Any hint I should follow?

Oh, I just checked the coordinator console, and it shows that my datasource is disabled.

A few days ago I created a rule to drop segments older than 15 days (drop by period). Maybe that is the reason?

Any idea?

Thanks

Hi,

I have finally solved the problem.

It was due to my datasource being in the disabled state because of a drop-by-period rule I had set a few days earlier.

To solve the problem, I enabled the datasource and quickly deleted the rule.

I also ran tests by creating a drop-by-period rule again (two tests: P1M and P30D, i.e. 1 month and 30 days), and each time my datasource became disabled, the realtime indexing tasks got stuck again and could not stop after handing their segments off to HDFS, and the Historical nodes stopped loading segments.

Each time I had to enable the datasource in the old coordinator console and quickly delete the rule (still in the old coordinator console) to make the datasource available again; then the realtime indexing tasks finished and everything went back to normal, including the Historical nodes (they loaded the segments again).

I use the coordinator console to create rules.

Do you have any idea why this happens, and how to safely create a rule that deletes segments older than one month?

Thanks!

Cheikh Sow
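In case it helps others reading this thread: if I read Druid's rule semantics correctly, this behavior is expected. Rules are evaluated top to bottom, and a dropByPeriod rule matches segments *inside* the given period counted back from now, so a lone dropByPeriod(P15D) rule targets the most recent 15 days of data rather than data older than 15 days, which can leave the whole datasource unused/disabled. The usual way to keep one month and drop everything older is a loadByPeriod rule followed by dropForever. A sketch, posting to the Coordinator's rules API with a placeholder host and the default replicant settings as assumptions:

```shell
# Keep the last month loaded on Historicals, mark everything older unused.
# Rules are evaluated top-down; the first matching rule wins.
curl -s -X POST \
  -H 'Content-Type: application/json' \
  -d '[
        { "type": "loadByPeriod", "period": "P1M",
          "tieredReplicants": { "_default_tier": 2 } },
        { "type": "dropForever" }
      ]' \
  "http://<coordinator-host>:8081/druid/coordinator/v1/rules/wassup-logs"
```

Note that dropped segments are only marked unused in the metadata store; actually removing them from deep storage requires a separate kill task.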

