Issues with multiple data nodes - causes many errors

Hi, I am doing a POC on Druid and trying to run multiple data nodes in different setups:

Setup 1. 3 nodes with the cluster configuration: 2 data nodes, 1 master & query node.

Setup 2. 3 nodes with the single-server small configuration: 2 data nodes, 1 all-in-one node.

Setup 3. 3 nodes with the single-server small configuration: 3 all-in-one nodes.

I defined an ingestion spec that reads from a Kafka topic, starting from the earliest offset and ingesting about 200 million messages.
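For reference, the supervisor spec looks roughly like this (a sketch only: the topic name, broker address, and schema details here are placeholders, not my actual values):

```json
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "user_action_cluster3",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "timestamp", "format": "auto" },
        "dimensionsSpec": { "dimensions": [] }
      }
    },
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "NONE"
    }
  },
  "ioConfig": {
    "topic": "user_actions",
    "consumerProperties": { "bootstrap.servers": "kafka-broker:9092" },
    "useEarliestOffset": true
  },
  "tuningConfig": { "type": "kafka" }
}
```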

In all setups I get errors on one of the data nodes:

```
2019-12-02T08:12:23,799 ERROR [ZKCoordinator-0] org.apache.druid.server.coordination.SegmentLoadDropHandler - Failed to load segment for dataSource: {class=org.apache.druid.server.coordination.SegmentLoadDropHandler, exceptionType=class org.apache.druid.segment.loading.SegmentLoadingException, exceptionMessage=Exception loading segment[user_action_cluster3_2019-11-14T12:00:00.000Z_2019-11-14T13:00:00.000Z_2019-12-02T06:45:39.461Z], segment=DataSegment{binaryVersion=9, id=user_action_cluster3_2019-11-14T12:00:00.000Z_2019-11-14T13:00:00.000Z_2019-12-02T06:45:39.461Z, loadSpec={type=>local, path=>/data/apache-druid-0.16.0-incubating/var/druid/segments/user_action_cluster3/2019-11-14T12:00:00.000Z_2019-11-14T13:00:00.000Z/2019-12-02T06:45:39.461Z/0/b8f53d19-5099-4e35-a411-d070aa2d269e/index.zip}, dimensions=[…], shardSpec=NumberedShardSpec{partitionNum=0, partitions=0}, size=19728120}}
org.apache.druid.segment.loading.SegmentLoadingException: Exception loading segment[user_action_cluster3_2019-11-14T12:00:00.000Z_2019-11-14T13:00:00.000Z_2019-12-02T06:45:39.461Z]
	at org.apache.druid.server.coordination.SegmentLoadDropHandler.loadSegment(SegmentLoadDropHandler.java:263) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.server.coordination.SegmentLoadDropHandler.addSegment(SegmentLoadDropHandler.java:307) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.server.coordination.SegmentChangeRequestLoad.go(SegmentChangeRequestLoad.java:49) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.server.coordination.ZkCoordinator.lambda$childAdded$2(ZkCoordinator.java:148) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_222]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_222]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_222]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_222]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]
Caused by: java.lang.IllegalArgumentException: Instantiation of [simple type, class org.apache.druid.segment.loading.LocalLoadSpec] value failed: [/data/apache-druid-0.16.0-incubating/var/druid/segments/user_action_cluster3/2019-11-14T12:00:00.000Z_2019-11-14T13:00:00.000Z/2019-12-02T06:45:39.461Z/0/b8f53d19-5099-4e35-a411-d070aa2d269e/index.zip] does not exist
	at com.fasterxml.jackson.databind.ObjectMapper._convert(ObjectMapper.java:3459) ~[jackson-databind-2.6.7.jar:2.6.7]
	at com.fasterxml.jackson.databind.ObjectMapper.convertValue(ObjectMapper.java:3378) ~[jackson-databind-2.6.7.jar:2.6.7]
	at org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager.loadInLocation(SegmentLoaderLocalCacheManager.java:235) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager.loadInLocationWithStartMarker(SegmentLoaderLocalCacheManager.java:224) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager.loadSegmentWithRetry(SegmentLoaderLocalCacheManager.java:185) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegmentFiles(SegmentLoaderLocalCacheManager.java:164) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegment(SegmentLoaderLocalCacheManager.java:131) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.server.SegmentManager.getAdapter(SegmentManager.java:196) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.server.SegmentManager.loadSegment(SegmentManager.java:155) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
	at org.apache.druid.server.coordination.SegmentLoadDropHandler.loadSegment(SegmentLoadDropHandler.java:259) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
	... 8 more
Caused by: com.fasterxml.jackson.databind.JsonMappingException: Instantiation of [simple type, class org.apache.druid.segment.loading.LocalLoadSpec] value failed: [/data/apache-druid-0.16.0-incubating/var/druid/segments/user_action_cluster3/2019-11-14T12:00:00.000Z_2019-11-14T13:00:00.000Z/2019-12-02T06:45:39.461Z/0/b8f53d19-5099-4e35-a411-d070aa2d269e/index.zip] does not exist
```

The segment exists on one of the nodes, and the other node throws this error.

Any ideas why this happens? What is the impact, and how should I handle it?

Thanks

Hi Michael,
What is your deep storage? Is it HDFS or S3?
By any chance, are you using local as deep storage in a multi-node cluster?
Though I am not 100% sure, I suspect this can happen in a multi-node cluster with local deep storage.

If your deep storage is HDFS or S3, do you find that segment ( 2019-12-02T06:45:39.461Z/0/b8f53d19-5099-4e35-a411-d070aa2d269e/index.zip ) in your deep storage (HDFS or S3)?
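One quick way to check is to run something like this on each data node (a sketch; the default path below is the loadSpec path from your error, so adjust it as needed). With local deep storage, only the node that built the segment will report it as present:

```shell
#!/bin/sh
# Check whether the segment file from the error's loadSpec exists on this node.
SEGMENT_PATH="${1:-/data/apache-druid-0.16.0-incubating/var/druid/segments/user_action_cluster3/2019-11-14T12:00:00.000Z_2019-11-14T13:00:00.000Z/2019-12-02T06:45:39.461Z/0/b8f53d19-5099-4e35-a411-d070aa2d269e/index.zip}"
if [ -f "$SEGMENT_PATH" ]; then
  echo "present: $SEGMENT_PATH"
else
  echo "missing: $SEGMENT_PATH"
fi
```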

Thanks,

–siva

Hi, thanks for the reply.
Yes, I am using local as deep storage in a multi-node cluster. Is that not supported?

I am doing a POC on Druid using 3 nodes with **local storage**.

Is this not supported for production environments? We will need a solution for our on-premise product, which cannot use deep storage in the cloud.

The setups I am testing are:

  1. 2 data nodes & 1 node with query & master

  2. 3 all-in-one nodes (all processes on each node)

Do you know why I get these errors? How can I solve this, and what is the impact?

Thanks

Since you are using Druid as a cluster (more than one Historical node), you need a common deep storage location; typically this is HDFS or S3.
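For example, pointing every node at the same HDFS location would look roughly like this in common.runtime.properties (a sketch; the storage directory and the rest of your loadList are placeholders):

```properties
# Load the HDFS deep storage extension on every node
druid.extensions.loadList=["druid-hdfs-storage", "druid-kafka-indexing-service"]

# All nodes write and read segments from the same HDFS directory
druid.storage.type=hdfs
druid.storage.storageDirectory=/druid/segments
```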

Eric Graham

Solutions Engineer - Imply

cell: 303-589-4581

email: eric.graham@imply.io

www.imply.io

Thanks.
So if I am working on a cluster that is on-premise and S3 is not an option, can I use Cassandra?

Thanks

Hi Michal,

You need a shared file system to store the Druid segments. Here is additional information, including other possibilities:

https://druid.apache.org/docs/latest/dependencies/deep-storage.html
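If neither HDFS nor S3 is available on-premise, one option those docs describe is mounting the same network filesystem (e.g. NFS) at an identical path on every node and keeping local deep storage pointed at it. A sketch, with the mount point as a placeholder:

```properties
# common.runtime.properties on every node; /mnt/druid-nfs must be the
# same shared mount, at the same path, on all of them
druid.storage.type=local
druid.storage.storageDirectory=/mnt/druid-nfs/segments
```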

Eric


There is an extension for Cassandra deep storage:

`druid-cassandra-storage`
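Loading it follows the usual extension mechanism, roughly as below (a sketch; I have not verified the storage type value or connection settings this extension expects, so check its documentation for those):

```properties
# Pull the extension in on every node (contrib extensions may need to be
# fetched first with the pull-deps tool)
druid.extensions.loadList=["druid-cassandra-storage"]
# druid.storage.type and the Cassandra connection properties come from
# the extension's documentation
```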

Eric Graham