druid-parquet extension, SegmentDescriptorInfo is not found

Hey guys,

We really need to use Parquet ingestion and tried the new contrib extension, but we can’t get it to work.

We run a remote Hadoop cluster (EMR). It works perfectly with other, non-Parquet tasks.

We did the following:

  • ran pull-deps for the druid-parquet extension

  • added “druid-parquet-extensions” to druid.extensions.loadList on the middleManager and overlord

  • sent the following task:

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "s3://"
      }
    },
    "metadataUpdateSpec": {},
    "dataSchema": {
      "dataSource": "parquetTest",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "<timestamp_dim>",
            "format": "yyyyMMdd"
          },
          "dimensionsSpec": {
            "dimensions": [<some_dims>],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [
        { "type": "count", "name": "visits" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "queryGranularity": "day",
        "intervals": ["2016-04-06/2016-04-07"]
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "jobProperties": {
        <s3_access_key_properties>,
        "io.compression.codecs": "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"
      },
      "partitionsSpec": {
        "type": "hashed",
        "numShards": 1,
        "assumeGrouped": true
      }
    }
  }
}


The indexing task goes to the Hadoop cluster and succeeds there, but then fails on the Druid side:

2016-07-28T12:57:11,864 ERROR [task-runner-0-priority-0] io.druid.indexer.IndexGeneratorJob - [File /tmp/druid-indexing/parquetTest/2016-07-28T125526.206Z/ad78ef9e4d534e368a06b725d07d7807/segmentDescriptorInfo does not exist.] SegmentDescriptorInfo is not found usually when indexing process did not produce any segments meaning either there was no input data to process or all the input events were discarded due to some error

"meaning either there was no input data to process or all the input events were discarded due to some error" seems important because no error detected on hadoop cluster and in the task logs we can see:

parquet
bytesread=22918190
bytestotal=685943714
timeread=73696

Any idea or help is really appreciated!

Ben

Can you post the task logs and also the logs from Hadoop around the actual indexing process? They should contain the error.

I finally found where the problem was caught:

Aug 1, 2016 9:29:44 AM WARNING: org.apache.parquet.CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr version 1.6.0
org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) )?\(build ?(.*)\)
	at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
	at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
	at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
	at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:567)
	at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:544)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:431)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
	at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
	at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
	at org.apache.hadoop.mapreduce.lib.input.DelegatingRecordReader.initialize(DelegatingRecordReader.java:84)
	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:557)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:795)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

This comes from the task log on the Hadoop cluster.
Any idea?

Hi, Benjamin

This is not a fatal problem; there may be many, many WARNING lines like this in your log, as I have seen before. But it is not fatal. Fatal problems are what should be looked into more carefully. Do you have any more detailed logs?

On Monday, August 1, 2016 at 5:40:34 PM UTC+8, Benjamin Angelaud wrote:

Hi Niglin,
Thanks for your answer!

The only fatal log I came across was the one in my first post:

2016-07-28T12:57:11,864 ERROR [task-runner-0-priority-0] io.druid.indexer.IndexGeneratorJob - [File /tmp/druid-indexing/parquetTest/2016-07-28T125526.206Z/ad78ef9e4d534e368a06b725d07d7807/segmentDescriptorInfo does not exist.] SegmentDescriptorInfo is not found usually when indexing process did not produce any segments meaning either there was no input data to process or all the input events were discarded due to some error

Sorry, are there any more logs? It is a bit hard for me to know what is happening from this.

On Thursday, August 4, 2016 at 6:47:43 PM UTC+8, Benjamin Angelaud wrote:

It seems like all the lines are being discarded?!

stdout (170 KB)

I don’t think this WARNING will bring the job down; they are just duplicate warnings in the logs. I have used this extension for months now, and in my experience it does not make the Parquet read fail. Logs other than this warning are what should be looked into more carefully.

On Tuesday, August 9, 2016 at 6:37:09 PM UTC+8, Benjamin Angelaud wrote:

The most common reason for all the lines being discarded is a timestamp issue. Perhaps your timestamps are not in the right format or column (based on your timestampSpec), or else they are in the right format/column but outside the range of your job’s “intervals”.
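For illustration, here is a minimal sketch (using the placeholder column name and the original interval from the task above) of how the timestampSpec and the granularitySpec intervals have to line up:

{
  "timestampSpec": {
    "column": "<timestamp_dim>",
    "format": "yyyyMMdd"
  },
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "day",
    "queryGranularity": "day",
    "intervals": ["2016-04-06/2016-04-07"]
  }
}

With a spec like this, a row whose <timestamp_dim> value parses to 2016-04-06 is kept, while a row that parses to any day outside 2016-04-06/2016-04-07 is silently discarded rather than reported as an error.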

I can’t find any more logs…
The timestamp field is correct and in the correct format. The interval matches the period of the data being ingested…

I don’t know where to look anymore…

Hi Ben, can you rerun the job and post the entire task log?

Hey Fangjin,

Here’s the task log! I didn’t find anything in it…

TaskLogParquetIngestion.txt (467 KB)

Hi, Benjamin. You can find your Hadoop logs through something like this (from your log): org.apache.hadoop.mapreduce.Job - The url to track the job: http://ip-10-22-3-141.eu-west-1.compute.internal:20888/proxy/application_1471336224220_0004/, and then check the logs of the map containers to get more info. If there are no privacy issues with your data, you can send one file to me (with your task JSON) so I can do some tests.

On Tuesday, August 16, 2016 at 5:04:51 PM UTC+8, Benjamin Angelaud wrote:

Here’s a map log. I don’t know if you need more?! I can send you the JSON task or more, just tell me!

mapLogsParquet.txt (440 KB)

Please send me your JSON task and a piece of a Parquet file; I will try it on my side.

On Tuesday, August 16, 2016 at 8:16:39 PM UTC+8, Benjamin Angelaud wrote:

I have checked your Parquet file and JSON config file and found that the time dimension, stored as an Integer, is not correctly parsed into a DateTime. Changing the column (dateenr in your case) to a string will make it run normally. Lastly, you should take more care with your date interval: I found that the date interval you provided is not consistent with your data.
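To make the first point concrete: if your Parquet files are generated from an Avro schema (an assumption on my side; adjust this to however you actually write them), the fix amounts to declaring the time field as a string rather than an int, along these lines (the record name and single-field list here are hypothetical):

{
  "type": "record",
  "name": "Visit",
  "fields": [
    { "name": "dateenr", "type": "string" }
  ]
}

The “yyyyMMdd” timestampSpec format can then parse the string value (e.g. "20160404") into a DateTime, which does not happen correctly when the field arrives as an int.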

On Tuesday, August 16, 2016 at 8:16:39 PM UTC+8, Benjamin Angelaud wrote:

Hey Ninglin,

Thanks for all your support!

I will try it ASAP.

What do you mean by “I found your date interval provided is not consistent with your data”?

Ben

The date interval provided in your JSON file is 20160406, but it is 20160404 in the Parquet file.
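So, assuming all of the data in that file really is on 2016-04-04, the intervals in your granularitySpec would need to cover that day instead, something like:

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "day",
  "queryGranularity": "day",
  "intervals": ["2016-04-04/2016-04-05"]
}

Otherwise every row is dropped for being outside the requested intervals, which matches the SegmentDescriptorInfo error above.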

On Monday, August 22, 2016 at 3:23:31 PM UTC+8, Benjamin Angelaud wrote:

The problem was indeed that the timestamp was an Integer and not a string!
Could you add this information to the readme.md on git?! It would be great for other people.

And thanks again! Awesome help! I hope it will become a core extension :wink:

Ben