Why are Map input records not equal to Map output records?

Hi,

When I use the indexing service to ingest data with the Hadoop index task, the job completes successfully, but the map input record count does not equal the map output record count — some records appear to have been discarded.

See the job log below:

2015-07-28T12:15:20,039 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job - map 100% reduce 96%

2015-07-28T12:15:24,059 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job -  map 100% reduce 97%
2015-07-28T12:16:59,417 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job -  map 100% reduce 98%
2015-07-28T12:21:00,535 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job -  map 100% reduce 99%
2015-07-28T12:24:24,513 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job -  map 100% reduce 100%
2015-07-28T13:11:01,180 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job - Job job_1431571906396_22852 completed successfully
2015-07-28T13:11:01,383 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job - Counters: 53
	File System Counters
		FILE: Number of bytes read=27922640633
		FILE: Number of bytes written=55943546674
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=25971081731
		HDFS: Number of bytes written=2618327862
		HDFS: Number of read operations=1191
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=236
	Job Counters
		Failed map tasks=10
		Killed reduce tasks=9
		Launched map tasks=298
		Launched reduce tasks=109
		Other local map tasks=10
		Data-local map tasks=278
		Rack-local map tasks=10
		Total time spent by all maps in occupied slots (ms)=64678934
		Total time spent by all reduces in occupied slots (ms)=280668248
		Total time spent by all map tasks (ms)=32339467
		Total time spent by all reduce tasks (ms)=70167062
		Total vcore-seconds taken by all map tasks=32339467
		Total vcore-seconds taken by all reduce tasks=70167062
		Total megabyte-seconds taken by all map tasks=132462456832
		Total megabyte-seconds taken by all reduce tasks=574808571904
	Map-Reduce Framework
		Map input records=36490889
		Map output records=36490840
		Map output bytes=27776676541
		Map output materialized bytes=27922812701
		Input split bytes=38016
		Combine input records=0
		Combine output records=0
		Reduce input groups=9
		Reduce shuffle bytes=27922812701
		Reduce input records=36490840
		Reduce output records=0
		Spilled Records=72981680
		Shuffled Maps =28800
		Failed Shuffles=0
		Merged Map outputs=28800
		GC time elapsed (ms)=6669226
		CPU time spent (ms)=115836130
		Physical memory (bytes) snapshot=492926922752
		Virtual memory (bytes) snapshot=1436702986240
		Total committed heap usage (bytes)=817816928256
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=25971043715
	File Output Format Counters
		Bytes Written=0
2015-07-28T13:11:01,525 INFO [task-runner-0] io.druid.indexer.IndexGeneratorJob - Adding segment logdata_format_2015-01-11T00:00:00.000+08:00_2015-01-12T00:00:00.000+08:00_2015-07-28T12:02:24.665+08:00 to the list of published segments
2015-07-28T13:11:01,531 INFO [task-runner-0] io.druid.indexer.IndexGeneratorJob - Adding segment logdata_format_2015-01-12T00:00:00.000+08:00_2015-01-13T00:00:00.000+08:00_2015-07-28T12:02:24.665+08:00 to the list of published segments
2015-07-28T13:11:01,535 INFO [task-runner-0] io.druid.indexer.IndexGeneratorJob - Adding segment logdata_format_2015-01-13T00:00:00.000+08:00_2015-01-14T00:00:00.000+08:00_2015-07-28T12:02:24.665+08:00 to the list of published segments
2015-07-28T13:11:01,540 INFO [task-runner-0] io.druid.indexer.IndexGeneratorJob - Adding segment logdata_format_2015-01-14T00:00:00.000+08:00_2015-01-15T00:00:00.000+08:00_2015-07-28T12:02:24.665+08:00 to the list of published segments.

     As the counters above show, Map input records (36490889) does not equal Map output records (36490840), so 49 records were dropped. Why does this happen, and what would you advise here? Thanks.

Hi,

Do you have "ignoreInvalidRows" (http://druid.io/docs/0.8.0/ingestion/batch-ingestion.html#tuningconfig) set to true? Can you share the full task log?
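For reference, a minimal sketch of where that flag lives in the Hadoop task spec (field names per the 0.8.0 batch-ingestion docs linked above; all surrounding values are illustrative, not taken from your job):

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "ignoreInvalidRows": true
  }
}
```

When `ignoreInvalidRows` is true, rows that fail to parse are skipped silently instead of failing the job, which would show up as exactly this kind of gap between the map input and output counters.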

– Himanshu

There may also be some records whose timestamps fall outside the "intervals" in your job JSON; those will be ignored.
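For example, a sketch of the relevant part of the spec (a hypothetical granularitySpec; the dates are chosen to match the segments in the log above, not copied from the actual job file):

```json
{
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "intervals": ["2015-01-11/2015-01-15"]
  }
}
```

If any of the 49 missing rows carry timestamps outside the configured intervals, the mapper drops them without raising an error, so they never reach the output counter.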