Setting up batch ingestion from Hadoop

The docs seem to indicate that I can run a single overlord process for small data loads; for example, the batch ingestion doc page only shows an overlord being started. Is this correct?

Also, let’s say I have my data stored by hour in HDFS, something like this:

hdfs://ingest/some_data_source/2015091406/*.json

The docs seem to indicate that I can use inputSpec of type “granularity” to point to a custom path on HDFS. Something like this?

"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "granularity",
    "inputPath": "/ingest/some_data_source",
    "filePattern": "*.json",
    "pathFormat": "'dt'=yyyyMMddHH",
    "dataGranularity": ???
  }
}

I’m not really sure what to set dataGranularity to above, because the docs indicate it’s an object, but there is no documentation on what that object format is. The only hint I have is that "hour means to expect directories y=XXXX/m=XX/d=XX/H=XX".

Am I on the right track, and if so, what should I put for dataGranularity?

Also, I assume the inputPath is expected to be an HDFS path, but the examples show it without the hdfs://host:port/ prefix. Is that an oversight in the docs, or are the NameNode host and port configured somewhere else?

One more question:

Is it possible in the granularitySpec to specify an interval that does not match the segmentGranularity? For example:

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "HOUR",
  "queryGranularity": "NONE",
  "interval": "2015-09-15/P1D"
}

Somewhat related: is it possible to provide an interval that is only a duration, where the start time is implicit? E.g.

"granularitySpec": {
  "interval": "PT1H"
}

So I think I have a few answers from browsing the code.

The dataGranularity is an enum and can have values like “SECOND”, “MINUTE”, “FIFTEEN_MINUTE”, “HOUR”, …

It appears that the Granularity enum can have a formatter specified, and in particular there is a hive formatter that is pretty close to what I want:

https://github.com/metamx/java-util/blob/master/src/main/java/com/metamx/common/Granularity.java#L300

But it's not possible to actually set it from the druid code:

https://github.com/druid-io/druid/blob/master/indexing-hadoop/src/main/java/io/druid/indexer/path/GranularityPathSpec.java#L143

So the only option is to set pathFormat if I don’t like the default.

Inline.

The docs seem to indicate that I can run a single overlord process for small data loads; for example, the batch ingestion doc page only shows an overlord being started. Is this correct?

The overlord acts as a driver for Hadoop-based batch ingestion and kicks off a job to a remote Hadoop cluster, so you can use a single overlord in local mode for Hadoop batch ingestion.
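As a rough sketch, a single overlord configured with the local task runner is enough to drive the Hadoop indexing job. Something along these lines in the overlord's runtime.properties (the service name, port, and storage settings here are placeholders, not from any particular deployment):

druid.service=overlord
druid.port=8090

# run tasks inside the overlord process instead of dispatching them to middle managers
druid.indexer.runner.type=local

# keep task state in the metadata store (optional, but survives restarts)
druid.indexer.storage.type=metadata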

Also, let’s say I have my data stored by hour in HDFS, something like this:

hdfs://ingest/some_data_source/2015091406/*.json

The docs seem to indicate that I can use inputSpec of type “granularity” to point to a custom path on HDFS. Something like this?

"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "granularity",
    "inputPath": "/ingest/some_data_source",
    "filePattern": "*.json",
    "pathFormat": "'dt'=yyyyMMddHH",
    "dataGranularity": ???
  }
}

If you have data in a pre-defined directory structure with the raw data stored in hourly buckets, you can use “HOUR” as the granularity. If the raw data is in daily folders, you can use “DAY”.
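So a filled-in version of your inputSpec for hourly data would look roughly like this (just a sketch based on the paths in your message, not something I have run against your layout). One thing to note: pathFormat is a Joda-style pattern, so for directories named like 2015091406 you probably want plain "yyyyMMddHH"; the quoted 'dt' in your example would be emitted literally as a dt= prefix in the path.

"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "granularity",
    "dataGranularity": "HOUR",
    "inputPath": "/ingest/some_data_source",
    "filePattern": "*.json",
    "pathFormat": "yyyyMMddHH"
  }
}

If you leave pathFormat out entirely, it expects the default y=XXXX/m=XX/d=XX/H=XX layout mentioned in the docs.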

Yes, you can have an interval in the granularity spec. It does not have to exactly match segment intervals.
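For example, something like this should be fine (the values are just your example; note that in the specs I've seen the field is the plural "intervals" and takes a list of ISO-8601 intervals):

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "HOUR",
  "queryGranularity": "NONE",
  "intervals": ["2015-09-15/P1D"]
}

That would produce up to 24 hourly segments covering the day.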

So I think I have a few answers from browsing the code.

The dataGranularity is an enum and can have values like “SECOND”, “MINUTE”, “FIFTEEN_MINUTE”, “HOUR”, …

It appears that the Granularity enum can have a formatter specified, and in particular there is a hive formatter that is pretty close to what I want:

https://github.com/metamx/java-util/blob/master/src/main/java/com/metamx/common/Granularity.java#L300

But it’s not possible to actually set it from the druid code:

https://github.com/druid-io/druid/blob/master/indexing-hadoop/src/main/java/io/druid/indexer/path/GranularityPathSpec.java#L143

So the only option is to set pathFormat if I don’t like the default.

Yeah. You can also extend the codebase with other options as well.

Hi,
Thanks for the replies. So just one clarification:

“The overlord acts as a driver for Hadoop-based batch ingestion and kicks off a job to a remote Hadoop cluster, so you can use a single overlord in local mode for Hadoop batch ingestion.”

So even in my production environment, given that I have a Hadoop cluster that can do the batch indexing, I should be able to forgo the middle managers and simply run one overlord + my Hadoop cluster? I.e., the overlord is simply submitting the real work to Hadoop and also providing a basic console to see job statuses?

And the middle managers are more for if I have a Druid setup without Hadoop?

This all makes sense to me, but I just wanted to be sure my understanding is correct.

Inline.

Hi,
Thanks for the replies. So just one clarification:

“The overlord acts as a driver for Hadoop-based batch ingestion and kicks off a job to a remote Hadoop cluster, so you can use a single overlord in local mode for Hadoop batch ingestion.”

So even in my production environment, given that I have a Hadoop cluster that can do the batch indexing, I should be able to forgo the middle managers and simply run one overlord + my Hadoop cluster? I.e., the overlord is simply submitting the real work to Hadoop and also providing a basic console to see job statuses?

Yes. If you don’t need realtime ingestion, you can simply use the overlord as a driver to start your Hadoop ingestion job. The Hadoop cluster does all the real work.
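Concretely, submitting the batch job is just a matter of POSTing a task of type "index_hadoop" (wrapping the ioConfig and granularitySpec discussed above) to the overlord's task endpoint, for example (host and port are placeholders):

curl -X POST -H 'Content-Type: application/json' \
  -d @hadoop_index_task.json \
  http://overlord-host:8090/druid/indexer/v1/task

The overlord console then shows the task status while the actual MapReduce work runs on your Hadoop cluster.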

And the middle managers are more for if I have a Druid setup without Hadoop?

They are mainly used for doing realtime ingestion.