Hadoop indexing issue

Hi!

While evaluating Druid, I’m trying to load three weeks of data in order to see what the query performance would be.

I’ve set up an EMR Hadoop cluster as explained in the documentation, and then started the indexing job.

According to Hadoop, the job has succeeded:

User: ubuntu
Name: inappevents-determine_partitions_hashed-Optional.of([2015-11-01T00:00:00.000Z/2015-11-24T00:00:00.000Z])
Application Type: MAPREDUCE
Application Tags:
State: FINISHED
FinalStatus: SUCCEEDED
Started: 29-Nov-2015 07:34:38
Elapsed: 6hrs, 38mins, 41sec
Tracking URL: History
Diagnostics:

However, the task submission command has failed with the following message:

```
Task index_hadoop_inappevents_2015-11-29T07:16:47.249Z still running...
Task index_hadoop_inappevents_2015-11-29T07:16:47.249Z still running...
Task index_hadoop_inappevents_2015-11-29T07:16:47.249Z still running...
Task index_hadoop_inappevents_2015-11-29T07:16:47.249Z still running...
Traceback (most recent call last):
  File "bin/post-index-task", line 89, in <module>
    main()
  File "bin/post-index-task", line 85, in main
    task_status = await_task_completion(args, task_id, complete_timeout_at)
  File "bin/post-index-task", line 66, in await_task_completion
    raise Exception("Task {0} did not finish in time!".format(task_id))
Exception: Task index_hadoop_inappevents_2015-11-29T07:16:47.249Z did not finish in time!
```
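
When the wrapper script gives up like this, the underlying Druid task may still be running; the timeout only stops the client-side polling. Here is a minimal sketch of checking the task status directly against the overlord, assuming it is reachable at localhost:8090 (the default port); the exact shape of the response JSON can differ between Druid versions:

```python
import json
import urllib2  # Python 2 stdlib; on Python 3 use urllib.request instead

# Assumed values -- substitute your own overlord host/port and task id.
OVERLORD = "http://localhost:8090"
TASK_ID = "index_hadoop_inappevents_2015-11-29T07:16:47.249Z"

# The overlord reports task status at /druid/indexer/v1/task/<id>/status.
url = "{0}/druid/indexer/v1/task/{1}/status".format(OVERLORD, TASK_ID)
response = json.load(urllib2.urlopen(url))

# Typically prints RUNNING, SUCCESS, or FAILED; the exact JSON layout
# may vary slightly between Druid versions.
print(response["status"]["status"])
```

That at least tells you whether the task is still alive after post-index-task has stopped waiting for it.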

Moreover, I’m not seeing any data in Druid.

And the craziest thing is that it seems Druid has submitted another job to Hadoop without asking for permission :)

Application Overview
User: ubuntu
Name: inappevents-index-generator-Optional.of([2015-11-01T00:00:00.000Z/2015-11-24T00:00:00.000Z])
Application Type: MAPREDUCE
Application Tags:
State: RUNNING
FinalStatus: UNDEFINED
Started: 29-Nov-2015 14:30:41
Elapsed: 3mins, 55sec
Tracking URL: ApplicationMaster
Diagnostics:

Below is the indexing spec file:

```
{
  "type" : "index_hadoop",
  "spec" : {
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "granularity",
        "inputPath" : "s3n://af-druid/input/inappevents",
        "dataGranularity" : "day",
        "filePattern" : "..gz",
        "pathFormat" : "'dt'=yyyy-MM-dd"
      }
    },
    "dataSchema" : {
      "dataSource" : "inappevents",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2015-11-01/2015-11-24"]
      },
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : ["app_id", "media_source", "campaign", "partner", "fb_adgroup", "fb_adset", "af_siteid", "af_sub1", "af_sub2", "af_sub3", "af_sub4", "af_sub5", "country", "region", "city", "ip", "platform", "device_type", "event_name", "sdk_version"]
          },
          "timestampSpec" : {
            "format" : "auto",
            "column" : "timestamp"
          }
        }
      },
      "metricsSpec" : [
        {
          "name" : "count",
          "type" : "count"
        },
        {
          "name" : "monetary",
          "type" : "longSum",
          "fieldName" : "monetary"
        }
      ]
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000
      },
      "jobProperties" : {
        "fs.s3.awsAccessKeyId" : "***",
        "fs.s3.awsSecretAccessKey" : "***",
        "fs.s3.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3n.awsAccessKeyId" : "***",
        "fs.s3n.awsSecretAccessKey" : "***",
        "fs.s3n.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "io.compression.codecs" : "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"
      }
    }
  }
}
```
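
For context, each input line is a JSON event roughly like the following, which is what the parseSpec and timestampSpec above expect (the values here are made up, and only a handful of the dimensions are shown):

```
{
  "timestamp": "2015-11-03T12:34:56Z",
  "app_id": "com.example.app",
  "media_source": "organic",
  "country": "US",
  "platform": "android",
  "event_name": "purchase",
  "monetary": 499
}
```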

Can anyone tell me what’s happening here?

Thanks,

Michael

OK, found the "problem": two jobs are submitted to Hadoop, and the first one can be skipped by specifying the number of shards in the spec file.

As you’ve discovered, Druid may submit more than one Hadoop job depending on the ingestion spec. If numShards is not specified, Druid first runs a determine_partitions job before the index-generator job, in order to work out how to split the data into segments of roughly [targetPartitionSize] rows each.
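
For example, a tuningConfig along these lines should make Druid skip the determine_partitions job and go straight to index-generator (the shard count of 10 is only illustrative; pick a value that yields segments in the recommended size range):

```
"tuningConfig" : {
  "type" : "hadoop",
  "partitionsSpec" : {
    "type" : "hashed",
    "numShards" : 10
  }
}
```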

If you’re looking to benchmark query performance, you’ll want to check that your partitions are between 500 MB and 1 GB in size for optimal performance. You can see this easily in the coordinator console.

BTW, this error:
```
Traceback (most recent call last):
  File "bin/post-index-task", line 89, in <module>
    main()
  File "bin/post-index-task", line 85, in main
    task_status = await_task_completion(args, task_id, complete_timeout_at)
  File "bin/post-index-task", line 66, in await_task_completion
    raise Exception("Task {0} did not finish in time!".format(task_id))
Exception: Task index_hadoop_inappevents_2015-11-29T07:16:47.249Z did not finish in time!
```

is fixed in the latest version of the IAP. For IAP questions, the user groups are a good place to get a fast response to your issues.