HadoopDruidIndexer failed with wikipedia demo

Hi Guys,

I am running the HadoopDruidIndexer to do a batch ingest, and I get the following failure message:

spec_file.json (2.52 KB)

runtime.properties (2.15 KB)

Can you confirm how you are running the ingestion, and that
“ls /wikipedia_input/wikipedia_data.json” finds the file from the directory you are running this job from?

Hi Fangjin,

The command:

java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:/home/ubuntu/hadoop-2.3.0/etc/hadoop io.druid.cli.Main index hadoop spec_file.json


The input file does exist:

ubuntu@master:~$ hdfs dfs -ls /wikipedia_input

Found 1 items

-rw-r--r-- 1 ubuntu supergroup 1678 2015-10-11 06:32 /wikipedia_input/wikipedia_data.json


The index log is also attached.

Thank you for your reply.


On Monday, October 12, 2015 at 4:08:36 AM UTC+8, Fangjin Yang wrote:

hadoop_indexer.log (49.5 KB)

I switched to running the Hadoop index task through the overlord service, and it worked.
What is the difference between the two approaches, and why is the Hadoop index task the recommended one?

On Monday, October 12, 2015 at 12:15:20 PM UTC+8, 芦康平 wrote:

We are thinking about using the indexing service long term mainly for realtime ingestion. There’s a long thread about it here:

For hadoop based batch ingestion, the indexing service provides locks so that if you are doing both realtime and batch ingestion, the jobs won’t clobber each other and overwrite data.
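To make the locking idea concrete, here is a minimal sketch of a lock table keyed by datasource and time interval. It is only an illustration under assumptions: `ToyLockbox`, `Interval`, and the task names are hypothetical and are not Druid's actual classes or API. The point is simply that a second task asking for an overlapping interval on the same datasource has to wait.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Interval:
    start: datetime
    end: datetime  # treated as exclusive

    def overlaps(self, other: "Interval") -> bool:
        # Two half-open intervals overlap if each starts before the other ends.
        return self.start < other.end and other.start < self.end


class ToyLockbox:
    """Hypothetical in-memory lock table; not Druid's real implementation."""

    def __init__(self):
        self._locks = {}  # task_id -> (datasource, interval)

    def try_lock(self, task_id, datasource, interval):
        # Refuse the lock if another task holds an overlapping interval
        # on the same datasource.
        for holder, (ds, held) in self._locks.items():
            if holder != task_id and ds == datasource and held.overlaps(interval):
                return False
        self._locks[task_id] = (datasource, interval)
        return True

    def unlock(self, task_id):
        self._locks.pop(task_id, None)


hour = Interval(datetime(2015, 10, 11, 6), datetime(2015, 10, 11, 7))
box = ToyLockbox()

realtime_ok = box.try_lock("realtime_task", "wikipedia", hour)   # granted
batch_ok = box.try_lock("batch_task", "wikipedia", hour)         # refused
box.unlock("realtime_task")
batch_retry_ok = box.try_lock("batch_task", "wikipedia", hour)   # granted now
```

In this sketch the batch task only gets the interval once the other task has released it, which is the "won't clobber each other" behaviour described above.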

That makes sense. Thank you, Fangjin.


On Thursday, October 15, 2015 at 7:08:31 AM UTC+8, Fangjin Yang wrote:

Hi Fangjin,

A further question: I am really confused by the TaskLock. When would the overlap happen?

Usually, we run batch ingestion over the same interval about 1~2 hours behind realtime consumption. From my understanding, the batch version will be newer than the realtime one, so nothing should get overwritten.

What is the overwriting scenario, and what is the TaskLock for?


On Thursday, October 15, 2015 at 7:08:31 AM UTC+8, Fangjin Yang wrote:


Task locks prevent multiple tasks for the same datasource and interval from running at the same time.
There are several scenarios in which multiple tasks on the same interval can overwrite changes made by one of them.


Let's consider a case where you have a segment with version V1 in Druid,

and you run a mergeTask and a batch index task in parallel, at the same time, to fix your data.

Without any locks, it might happen that the batch index task creates a newer version V2 from the new, fixed data,

but then the mergeTask, which was running in parallel and merging segments from the older version, generates newer segments with version V3 and overwrites the fixes made by the batch index task.
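The V1/V2/V3 scenario above can be replayed in a few lines. This is a toy model, assuming only that the segment with the highest version is the one queries see; the integers stand in for real version strings, and `visible` and `run_serialized` are hypothetical helpers, not Druid functions.

```python
def visible(segments):
    # Assumption: for a given interval, the highest version wins.
    return max(segments, key=lambda s: s["version"])


# --- Without locks: both tasks start while only V1 exists ---
segments = [{"version": 1, "data": "original"}]

# The merge task reads its input now, from V1.
merge_input = visible(segments)["data"]

# The batch index task finishes first and publishes V2 with the fixed data.
segments.append({"version": 2, "data": "fixed"})

# The merge task finishes later and publishes V3 built from its stale input.
segments.append({"version": 3, "data": merge_input})

result_without_lock = visible(segments)["data"]  # the fix in V2 is hidden


# --- With a lock: the merge task cannot start until the batch task is done ---
def run_serialized():
    segs = [{"version": 1, "data": "original"}]
    segs.append({"version": 2, "data": "fixed"})        # batch task runs alone
    later_merge_input = visible(segs)["data"]           # merge now reads V2
    segs.append({"version": 3, "data": later_merge_input})
    return visible(segs)["data"]


result_with_lock = run_serialized()
```

Without the lock the newest segment carries the old data again; with the tasks serialized, the merge works from the fixed data, so the fix survives.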