Using MapR-FS as local deep storage in Druid

I am trying to set up a Druid cluster with MapR-FS as my local deep storage. For this I used the mapr-loopbacknfs service to create an NFS mount on each server. All the services were up and running, but when I try to ingest data using

bin/post-index-task --url http://osdev5.mycluster.com:8090/ --file retail.json

  "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/opt/imply/sample-data",
        "filter": "retail*"
      }
    },

If I set baseDir to a local path like /opt/imply/sample-data, I get the exception below.

2016-08-12T11:20:33,260 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_retail_2016-08-12T11:20:28.522Z] status changed to [RUNNING].
2016-08-12T11:20:33,261 INFO [task-runner-0-priority-0] io.druid.indexing.common.actions.RemoteTaskActionClient - Performing action for task[index_retail_2016-08-12T11:20:28.522Z]: LockListAction{}
2016-08-12T11:20:33,269 INFO [task-runner-0-priority-0] io.druid.indexing.common.actions.RemoteTaskActionClient - Submitting action for task[index_retail_2016-08-12T11:20:28.522Z] to overlord[http://osdev5.neterra.moneybookers.net:8090/druid/indexer/v1/action]: LockListAction{}
2016-08-12T11:20:33,276 INFO [main] org.eclipse.jetty.server.Server - jetty-9.2.5.v20141112
2016-08-12T11:20:33,333 INFO [task-runner-0-priority-0] io.druid.segment.realtime.firehose.LocalFirehoseFactory - Searching for all [retail*] in and beneath [/opt/imply/sample-data]
2016-08-12T11:20:33,345 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[IndexTask{id=index_retail_2016-08-12T11:20:28.522Z, type=index, dataSource=retail}]
java.lang.IllegalArgumentException: Parameter 'directory' is not a directory
  at org.apache.commons.io.FileUtils.validateListFilesParameters(FileUtils.java:545) ~[commons-io-2.4.jar:2.4]
  at org.apache.commons.io.FileUtils.listFiles(FileUtils.java:521) ~[commons-io-2.4.jar:2.4]
  at io.druid.segment.realtime.firehose.LocalFirehoseFactory.connect(LocalFirehoseFactory.java:93) ~[druid-server-0.9.1.1.jar:0.9.1.1]
  at io.druid.segment.realtime.firehose.LocalFirehoseFactory.connect(LocalFirehoseFactory.java:46) ~[druid-server-0.9.1.1.jar:0.9.1.1]
  at io.druid.indexing.common.task.IndexTask.getDataIntervals(IndexTask.java:242) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
  at io.druid.indexing.common.task.IndexTask.run(IndexTask.java:200) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
  at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
  at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
  at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_91]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_91]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_91]
  at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]

First, does the directory /opt/imply/sample-data actually exist? Try /opt/imply/sample-data/ (with a trailing slash) instead?
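Also keep in mind the task runs on the overlord / middle manager node (osdev5 here), not necessarily on the machine where you ran post-index-task, so the directory has to exist there. A quick sanity check on that node might look like this (just a sketch):

ls -ld /opt/imply/sample-data
ls /opt/imply/sample-data/retail*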

Second, are you planning to index all your data on one single machine? This might work for a dev cycle, but for production you would need to submit such a task to a Hadoop cluster as a MapReduce batch task. I am pretty sure that will not work either, since a MapR cluster uses a proprietary file system rather than HDFS, so the best way to go is to implement the Druid interfaces to talk to the MapR file system.
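For reference, the Hadoop batch task swaps the local firehose for a "hadoop" ioConfig, roughly like the sketch below (the task type becomes "index_hadoop" instead of "index", and the path is only a placeholder; whether a MapR-FS path works there depends on the Hadoop client on your classpath):

  "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "/path/to/retail.json"
      }
    },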

Good luck!

Hi Charan, can you try loading static files following this tutorial? https://imply.io/docs/latest/ingestion-batch

In general I think it will be easier than debugging the local firehose.

Hi Charan, my apologies, I misread and realized you’ve already done the local quickstart. It seems like the ingestion found the file, but isn’t able to actually read it. I’ll dig a bit more into this, but for starters make sure you have the correct permissions to access the file.
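One way to check that (a sketch; "druid" below stands for whatever user your indexing service actually runs as, which is an assumption on my side):

sudo -u druid ls -l /opt/imply/sample-data/
sudo -u druid head -1 /opt/imply/sample-data/retail.json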

Hi,

I think there is an issue with the configuration or the network on that MapR cluster. I tried the steps below on another cluster and it worked fine.

Install the mapr-loopbacknfs client on the nodes:

yum install mapr-loopbacknfs

cp /opt/mapr/conf/nfsserver.conf /usr/local/mapr-loopbacknfs/conf/

cp /opt/mapr/conf/mapr-clusters.conf /usr/local/mapr-loopbacknfs/conf/

service mapr-loopbacknfs start

mkdir /mapr

mount localhost:/mapr /mapr

df -P
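To verify the loopback NFS mount is really serving the cluster (the cluster name comes from mapr-clusters.conf; I am using aws-qa.paysafe.com below only because it matches the Druid paths further down), something like:

mount | grep /mapr

ls /mapr/aws-qa.paysafe.com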

Property changes on the Druid end:

druid.zk.service.host=172.29.33.15:5181,172.29.33.16:5181,172.29.33.17:5181

druid.indexer.logs.directory=/mapr/aws-qa.paysafe.com/druid/indexing-logs

druid.storage.storageDirectory=/mapr/aws-qa.paysafe.com/druid/segments

druid.metadata.storage.type=derby

druid.metadata.storage.connector.connectURI=jdbc:derby://172.29.33.15:1527/var/druid/metadata.db;create=true

druid.metadata.storage.connector.host=172.29.33.15

druid.metadata.storage.connector.port=1527
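One thing not shown above: the storageDirectory and log directory only take effect with the matching storage and log types, which if I remember right are the defaults (local and file). Spelling them out as a sketch for clarity:

druid.storage.type=local

druid.storage.storageDirectory=/mapr/aws-qa.paysafe.com/druid/segments

druid.indexer.logs.type=file

druid.indexer.logs.directory=/mapr/aws-qa.paysafe.com/druid/indexing-logs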

But one doubt: with local storage, will ingestion still get the power of MapReduce (assuming we deploy multiple MiddleManagers), since it is not going via the Hadoop route?