Data ingestion on NFS storage

Hi Team,
I have set up all Druid components on a single server, and when I ingest data the task completes fine.

Later, I set up a Druid cluster, with the Coordinator and Overlord on one server, the Historical and MiddleManager on another, and the Broker on a third. I am using NFS as deep storage in this setup. When I try to ingest data in this setup, it keeps saying “Task index_stats_2019-01-14T04:16:59.697Z still running…”; the task neither completes with success or failure nor throws any error in the log.

Please suggest how to get this working.

I suspect NFS may not meet the requirements for deep storage.

Thanks!

I am having the same issue with S3 too.

Task index_stats_2019-01-14T05:44:39.627Z still running…

(the same message repeats indefinitely)

The interesting thing is that when I put the MiddleManager together with the Coordinator and Overlord, the task completes successfully. If anyone is using a cluster setup, can you let me know what you did to get it working?

Guys, can someone please respond to this?

Can you post your machine specs here, along with the size of the data you are ingesting?

Rommel Garcia
Director, Field Engineering
rommel.garcia@imply.io
404.502.9672

Hi Nazeer

NFS can be used as deep storage. You can set druid.storage.type to local and point the storage/indexing locations at the NFS mount.

Do you see data being added to the locations pointed to by druid.storage.storageDirectory or druid.indexer.logs.directory?

Maybe you could change the location to a local disk and see if the job completes; then ensure that the druid user has the right permissions to write to the NFS storage.
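For reference, a minimal sketch of the relevant deep-storage settings in common.runtime.properties when an NFS mount is used as local-type storage (the /mnt/nfs paths below are placeholders, not your actual mount points):

druid.storage.type=local
druid.storage.storageDirectory=/mnt/nfs/druid/segments
druid.indexer.logs.type=file
druid.indexer.logs.directory=/mnt/nfs/druid/indexing-logs

With local-type deep storage, the same mount must be visible at the same path on every node that runs tasks or loads segments; otherwise the peons cannot publish segments.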

Thanks & Rgds

Venkat

Hi Venkat,

Thanks for the reply.

Yes, I have kept druid.storage.type as local. The issue only occurs when the MiddleManager is on a separate server; when the MiddleManager runs alongside the Coordinator and Overlord, data ingests fine.

When the MiddleManager is on a separate server, the logs keep showing the messages below for hours.

2019-01-16T03:21:47,785 INFO [MonitorScheduler-0] org.apache.druid.java.util.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2019-01-16T03:21:47.785Z","service":"druid/middleManager","host":"p1-lax-hismid-1.advdc.com:8100","version":"0.13.0-incubating","metric":"jvm/gc/mem/init","value":248512520,"dataSource":["stats"],"gcGen":["old"],"gcGenSpaceName":"sun.gc.generation.1.space.0.name: space\u0000 string [internal]","gcName":["g1"],"id":["index_stats_2019-01-16T03:00:37.253Z"]}]
2019-01-16T03:21:47,813 INFO [MonitorScheduler-0] org.apache.druid.java.util.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2019-01-16T03:21:47.813Z","service":"druid/middleManager","host":"p1-lax-hismid-1.advdc.com:8100","version":"0.13.0-incubating","metric":"jetty/numOpenConnections","value":0,"dataSource":["stats"],"id":["index_stats_2019-01-16T03:00:37.253Z"]}]
2019-01-16T03:22:13,342 INFO [MonitorScheduler-0] org.apache.druid.java.util.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2019-01-16T03:22:13.342Z","service":"druid/middleManager","host":"p1-lax-hismid-1.advdc.com:8100","version":"0.13.0-incubating","metric":"ingest/events/thrownAway","value":0,"dataSource":"stats","taskId":["index_stats_2019-01-16T03:00:37.253Z"],"taskType":["index"]}]
2019-01-16T03:22:13,342 INFO [MonitorScheduler-0] org.apache.druid.java.util.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2019-01-16T03:22:13.342Z","service":"druid/middleManager","host":"p1-lax-hismid-1.advdc.com:8100","version":"0.13.0-incubating","metric":"ingest/events/unparseable","value":0,"dataSource":"stats","taskId":["index_stats_2019-01-16T03:00:37.253Z"],"taskType":["index"]}]
2019-01-16T03:22:13,342 INFO [MonitorScheduler-0] org.apache.druid.java.util.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2019-01-16T03:22:13.342Z","service":"druid/middleManager","host":"p1-lax-hismid-1.advdc.com:8100","version":"0.13.0-incubating","metric":"ingest/events/processedWithError","value":0,"dataSource":"stats","taskId":["index_stats_2019-01-16T03:00:37.253Z"],"taskType":["index"]}]
2019-01-16T03:22:13,343 INFO [MonitorScheduler-0] org.apache.druid.java.util.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2019-01-16T03:22:13.342Z","service":"druid/middleManager","host":"p1-lax-hismid-1.advdc.com:8100","version":"0.13.0-incubating","metric":"ingest/events/processed","value":0,"dataSource":"stats","taskId":["index_stats_2019-01-16T03:00:37.253Z"],"taskType":["index"]}]
2019-01-16T03:22:13,343 INFO [MonitorScheduler-0] org.apache.druid.java.util.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2019-01-16T03:22:13.343Z","service":"druid/middleManager","host":"p1-lax-hismid-1.advdc.com:8100","version":"0.13.0-incubating","metric":"ingest/events/duplicate","value":0,"dataSource":"stats","taskId":["index_stats_2019-01-16T03:00:37.253Z"],"taskType":["index"]}]
2019-01-16T03:22:13,343 INFO [MonitorScheduler-0] org.apache.druid.java.util.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2019-01-16T03:22:13.343Z","service":"druid/middleManager","host":"p1-lax-hismid-1.advdc.com:8100","version":"0.13.0-incubating","metric":"ingest/rows/output","value":0,"dataSource":"stats","taskId":["index_stats_2019-01-16T03:00:37.253Z"],"taskType":["index"]}]
2019-01-16T03:22:13,343 INFO [MonitorScheduler-0] org.apache.druid.java.util.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2019-01-16T03:22:13.343Z","service":"druid/middleManager","host":"p1-lax-hismid-1.advdc.com:8100","version":"0.13.0-incubating","metric":"ingest/persists/count","value":0,"dataSource":"stats","taskId":["index_stats_2019-01-16T03:00:37.253Z"],"taskType":["index"]}]
2019-01-16T03:22:13,343 INFO [MonitorScheduler-0] org.apache.druid.java.util.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2019-01-16T03:22:13.343Z","service":"druid/middleManager","host":"p1-lax-hismid-1.advdc.com:8100","version":"0.13.0-incubating","metric":"ingest/persists/time","value":0,"dataSource":"stats","taskId":["index_stats_2019-01-16T03:00:37.253Z"],"taskType":["index"]}]
2019-01-16T03:22:13,343 INFO [MonitorScheduler-0] org.apache.druid.java.util.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2019-01-16T03:22:13.343Z","service":"druid/middleManager","host":"p1-lax-hismid-1.advdc.com:8100","version":"0.13.0-incubating","metric":"ingest/persists/cpu","value":0,"dataSource":"stats","taskId":["index_stats_2019-01-16T03:00:37.253Z"],"taskType":["index"]}]
2019-01-16T03:22:13,343 INFO [MonitorScheduler-0] org.apache.druid.java.util.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2019-01-16T03:22:13.343Z","service":"druid/middleManager","host":"p1-lax-hismid-1.advdc.com:8100","version":"0.13.0-incubating","metric":"ingest/persists/backPressure","value":0,"dataSource":"stats","taskId":["index_stats_2019-01-16T03:00:37.253Z"],"taskType":["index"]}]
2019-01-16T03:22:13,343 INFO [MonitorScheduler-0] org.apache.druid.java.util.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2019-01-16T03:22:13.343Z","service":"druid/middleManager","host":"p1-lax-hismid-1.advdc.com:8100","version":"0.13.0-incubating","metric":"ingest/persists/failed","value":0,"dataSource":"stats","taskId":["index_stats_2019-01-16T03:00:37.253Z"],"taskType":["index"]}]
2019-01-16T03:22:13,343 INFO [MonitorScheduler-0] org.apache.druid.java.util.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2019-01-16T03:22:13.343Z","service":"druid/middleManager","host":"p1-lax-hismid-1.advdc.com:8100","version":"0.13.0-incubating","metric":"ingest/handoff/failed","value":0,"dataSource":"stats","taskId":["index_stats_2019-01-16T03:00:37.253Z"],"taskType":["index"]}]
2019-01-16T03:22:13,344 INFO [MonitorScheduler-0] org.apache.druid.java.util.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2019-01-16T03:22:13.343Z","service":"druid/middleManager","host":"p1-lax-hismid-1.advdc.com:8100","version":"0.13.0-incubating","metric":"ingest/merge/time","value":0,"dataSource":"stats","taskId":["index_stats_2019-01-16T03:00:37.253Z"],"taskType":["index"]}]
2019-01-16T03:22:13,344 INFO [MonitorScheduler-0] org.apache.druid.java.util.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2019-01-16T03:22:13.344Z","service":"druid/middleManager","host":"p1-lax-hismid-1.advdc.com:8100","version":"0.13.0-incubating","metric":"ingest/merge/cpu","value":0,"dataSource":"stats","taskId":["index_stats_2019-01-16T03:00:37.253Z"],"taskType":["index"]}]
2019-01-16T03:22:13,344 INFO [MonitorScheduler-0] org.apache.druid.java.util.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2019-01-16T03:22:13.344Z","service":"druid/middleManager","host":"p1-lax-hismid-1.advdc.com:8100","version":"0.13.0-incubating","metric":"ingest/handoff/count","value":0,"dataSource":"stats","taskId":["index_stats_2019-01-16T03:00:37.253Z"],"taskType":["index"]}]
2019-01-16T03:22:13,344 INFO [MonitorScheduler-0] org.apache.druid.java.util.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2019-01-16T03:22:13.344Z","service":"druid/middleManager","host":"p1-lax-hismid-1.advdc.com:8100","version":"0.13.0-incubating","metric":"ingest/sink/count","value":50,"dataSource":"stats","taskId":["index_stats_2019-01-16T03:00:37.253Z"],"taskType":["index"]}]
2019-01-16T03:22:13,344 INFO [MonitorScheduler-0] org.apache.druid.java.util.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2019-01-16T03:22:13.344Z","service":"druid/middleManager","host":"p1-lax-hismid-1.advdc.com:8100","version":"0.13.0-incubating","metric":"ingest/events/messageGap","value":-1732423066658,"dataSource":"stats","taskId":["index_stats_2019-01-16T03:00:37.253Z"],"taskType":["index"]}]

My cluster runs on 3 servers with the specs below, and I am only ingesting sample data, probably less than a few MB.

Server 1 (16 GB RAM): Coordinator, Overlord

Server 2 (32 GB RAM): MiddleManager, Historical

Server 3 (32 GB RAM): Broker

Your component distribution looks OK. Most likely there is an issue with the configuration.

Please review the configurations and ensure they are correct.

A typical clustered setup is described here; cross-verify the configuration parameters (ZooKeeper, etc.):

http://druid.io/docs/latest/tutorials/cluster.html
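In particular, these settings in common.runtime.properties must be consistent across all three servers (the hostnames and the metadata store below are placeholders; a MiddleManager pointing at the wrong ZooKeeper or metadata store is a common reason tasks never report completion):

druid.zk.service.host=zk-host.example.com
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://db-host.example.com:3306/druid
druid.storage.type=local
druid.storage.storageDirectory=/mnt/nfs/druid/segments

Also check that druid.host on each node is set to a hostname the other nodes can actually reach.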

Thanks & Rgds

Venkat

Hi,
Attaching the configs and the logs I get while running the task with the MiddleManager on a separate server. Please review whether I am missing something to make this work with a separate MiddleManager.

druid.zip (8.54 KB)

separate_middlemanager_log.txt (886 KB)

“The messages you see are log emitters and not from the task.”

Venkat,

They are just emitting those logs; the task neither completes nor throws any error.

Can someone please have a look into this?

Can someone please look into this?

Guys, can someone please look into this?

“I have set up all Druid components on a single server, and when I ingest data the task completes fine.”

When you had everything on one server, did you successfully ingest the exact same data that you’re trying now?

From your task log, it looks like your task is running, albeit slowly:

2019-01-16T04:43:39,353 INFO [stats-incremental-persist] org.apache.druid.segment.IndexMergerV9 - Starting persist for interval[2018-12-10T08:00:00.000Z/2018-12-10T09:00:00.000Z], rows[74,949]

2019-01-16T04:43:39,557 INFO [stats-incremental-persist] org.apache.druid.segment.IndexMergerV9 - Using SegmentWriteOutMediumFactory[TmpFileSegmentWriteOutMediumFactory]

2019-01-16T04:43:39,564 INFO [stats-incremental-persist] org.apache.druid.segment.IndexMergerV9 - Completed version.bin in 4 millis.

2019-01-16T04:43:39,568 INFO [stats-incremental-persist] org.apache.druid.segment.IndexMergerV9 - Completed factory.json in 4 millis

2019-01-16T04:43:40,011 INFO [stats-incremental-persist] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[ver] conversions with cardinality[86] in 443 millis.

2019-01-16T04:43:40,180 INFO [stats-incremental-persist] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[product] conversions with cardinality[34] in 161 millis.

2019-01-16T04:43:40,210 INFO [stats-incremental-persist] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[action] conversions with cardinality[0] in 14 millis.

2019-01-16T04:43:40,249 INFO [stats-incremental-persist] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[affiliate] conversions with cardinality[2] in 26 millis.

2019-01-16T04:51:01,180 INFO [stats-incremental-persist] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[user] conversions with cardinality[74,282] in 440,836 millis.

2019-01-16T04:51:31,157 INFO [stats-incremental-persist] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[source] conversions with cardinality[5,275] in 29,963 millis.

2019-01-16T04:51:31,218 INFO [stats-incremental-persist] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[vert] conversions with cardinality[4] in 47 millis.

I’m guessing the browserId dim, which comes after vert, also has high cardinality for this interval and is taking a while here.

Maybe try:

  • Check that interval in your input data and see if there’s anything interesting going on there; it looks unusual compared to the other intervals

  • Try giving your peon more CPUs/memory (see the config sketch after this list)

  • Ingest much less data on the cluster with separate middleManager (e.g., just a single hour) and see if that works
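For the peon resources point, these MiddleManager runtime.properties control how many peons run and how much memory each one gets; the values below are only illustrative, so tune them to your 32 GB server:

druid.worker.capacity=2
druid.indexer.runner.javaOpts=-server -Xms2g -Xmx2g -XX:MaxDirectMemorySize=4g -Duser.timezone=UTC -Dfile.encoding=UTF-8
druid.indexer.fork.property.druid.processing.numThreads=2
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=268435456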

2019-01-16T04:44:01,569 ERROR [MonitorScheduler-0] org.apache.druid.indexing.common.stats.TaskRealtimeMetricsMonitor - [6,047] Unparseable events! Turn on debug logging to see exception stack trace.

This might be worth looking into as well; you can log the parse exceptions by setting logParseExceptions (http://druid.io/docs/latest/ingestion/native_tasks.html).
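For the native index task this goes in the tuningConfig section of your ingestion spec, roughly like the following sketch (the limits shown are arbitrary examples):

"tuningConfig": {
  "type": "index",
  "logParseExceptions": true,
  "maxParseExceptions": 100,
  "maxSavedParseExceptions": 10
}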

Hi Jonathan,
Thanks for the reply,

“When you had everything on one server, did you successfully ingest the exact same data that you’re trying now?”

Yes, I am able to ingest data successfully when all components are on one server.

I will try the options you suggested.