Native Batch Ingestion from HDFS

Hi,

I’m a new user of Druid going through the docs. I’ve set up the Druid tutorial cluster as described in the quickstart section on my MapR 6.0.1 Sandbox Docker container. I’ve loaded the sample Wikipedia file with native ingestion and everything works fine. Now I’ve put the same file on HDFS under "/druid/tutorial-data/wikiticker-2015-09-12-sampled.json.gz". I was looking for a simple way to load this file, but I can’t find a way to configure the firehose to ingest data from HDFS instead of the local file system (sadly, adding hdfs:// in front of the file path doesn’t work :( ). Is this even possible with native ingestion? Is there some other approach I should look into?

My lightly modified spec:

```
{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia-from-hdfs",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : [
              "channel",
              "cityName",
              "comment",
              "countryIsoCode",
              "countryName",
              "isAnonymous",
              "isMinor",
              "isNew",
              "isRobot",
              "isUnpatrolled",
              "metroCode",
              "namespace",
              "page",
              "regionIsoCode",
              "regionName",
              "user",
              { "name": "added", "type": "long" },
              { "name": "deleted", "type": "long" },
              { "name": "delta", "type": "long" }
            ]
          },
          "timestampSpec": {
            "column": "time",
            "format": "iso"
          }
        }
      },
      "metricsSpec" : [],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2015-09-12/2015-09-13"],
        "rollup" : false
      }
    },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "local",
        "baseDir" : "hdfs://druid/tutorial-data/",
        "filter" : "wikiticker-2015-09-12-sampled.json.gz"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index",
      "maxRowsPerSegment" : 5000000,
      "maxRowsInMemory" : 25000,
      "forceExtendableShardSpecs" : true
    }
  }
}
```

Hey Szymon,

Today this feature (native batch reading from HDFS) isn’t supported, but it is something we plan to add soon, over the next couple of releases. It would probably be in the form of a Hadoop firehose. Since you’re using MapR, you might be able to work around this by mounting MapR-FS as a local mount point and reading from it using the local firehose.
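For example, if you NFS-mount the cluster filesystem (on the MapR sandbox this is usually exposed somewhere under /mapr/&lt;cluster-name&gt;/, but the exact mount point depends on your setup), you could keep the local firehose and just point baseDir at the mounted path instead of the hdfs:// URI. A rough sketch, not tested on your sandbox, with &lt;your-cluster-name&gt; as a placeholder:

```
"ioConfig" : {
  "type" : "index",
  "firehose" : {
    "type" : "local",
    "baseDir" : "/mapr/<your-cluster-name>/druid/tutorial-data/",
    "filter" : "wikiticker-2015-09-12-sampled.json.gz"
  },
  "appendToExisting" : false
}
```

The rest of your spec would stay the same; only the baseDir changes.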

Gian

Hey,

Thanks for the answer :slight_smile: