Problem with delta ingestion, bug?

Hello,
I am trying to do delta ingestion. I thought everything was going fine until we checked the data, and it looks like rows are getting dropped.

I cannot see any warnings or errors that would suggest that anything is wrong.

Here is my ingestionSpec. Can someone look at it and let me know if anything stands out?

I am using 0.8.1-rc2

{
  "dataSchema" : {
    "dataSource" : "events",
    "parser" : {
      "type" : "string",
      "parseSpec" : {
        "format" : "json",
        "timestampSpec" : {
          "column" : "event_time_registered",
          "format" : "posix"
        },
        "dimensionsSpec" : {
          "dimensions": [
            "app_guid",
            "app_id",
            "request_ip",
            "request_ua"
          ],
          "dimensionExclusions" : [],
          "spatialDimensions" : []
        }
      }
    },
    "metricsSpec" : [
      {
        "type" : "count",
        "name" : "count"
      }
    ],
    "granularitySpec" : {
      "type" : "uniform",
      "segmentGranularity" : "HOUR",
      "queryGranularity" : "MINUTE",
      "intervals" : [ "2015-09-13T03:00:00/2015-09-13T06:00:00" ]
    }
  },
  "ioConfig" : {
    "type" : "hadoop",
    "inputSpec" : {
      "type" : "multi",
      "children": [
        {
          "type" : "dataSource",
          "ingestionSpec" : {
            "dataSource": "events",
            "interval": "2015-09-13T03:00:00/PT3H"
          }
        },
        {
          "type" : "static",
          "paths": "hdfs://ipaddress/file1.gz,hdfs://ipaddress/file2.gz,hdfs://ipaddress/file3.gz"
        }
      ]
    },
    "metadataUpdateSpec" : {
      "type" : "mysql",
      "connectURI" : "jdbc:mysql://ipaddress/druid",
      "password" : "druid",
      "segmentTable" : "druid_segments",
      "user" : "druid"
    },
    "segmentOutputPath" : "hdfs://ipaddress/druid/deepStorage/"
  },
  "tuningConfig" : {
    "type" : "hadoop",
    "workingPath": "/tmp",
    "partitionsSpec" : {
      "type" : "dimension",
      "partitionDimension" : "app_id",
      "targetPartitionSize" : 5000000,
      "maxPartitionSize" : 7500000,
      "assumeGrouped" : false,
      "numShards" : -1
    },
    "shardSpecs" : { },
    "leaveIntermediate" : false,
    "cleanupOnFailure" : true,
    "overwriteFiles" : false,
    "ignoreInvalidRows" : false,
    "jobProperties" : { },
    "combineText" : false,
    "persistInHeap" : false,
    "ingestOffheap" : false,
    "bufferSize" : 134217728,
    "aggregationBufferRatio" : 0.5,
    "rowFlushBoundary" : 2000000
  }
}

I tried it with the 0.8.1 release as well, and the same issue appears.
I am not seeing any errors in the logs.

It would appear that the Druid segments are simply being overwritten rather than appended to.
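
One way to check that, assuming the coordinator metadata API is available on this version (the host and port below are placeholders; the dataSource is the one from the spec above), is to list the segments for the interval before and after the delta run and compare their version and size fields:

http://<coordinator-host>:8081/druid/coordinator/v1/metadata/datasources/events/segments?full

If the delta run produces a new version for the interval whose segments contain only the newly added files, that would explain the drop.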

Hi,

Can you file an issue on GitHub so we can track this problem?

Will update the thread with more findings.

Hi,

I don’t see anything wrong at first glance. Can you describe in a bit more detail which events are dropped? For example, do they all come from the HDFS paths, or from the existing data read from Druid? If you don’t use delta ingestion, does the problem go away?

– Himanshu

Also, can you please give us the full task log?

Hi,
I just emailed you guys.

Let me know if you want me to file a bug after reading it.

Thanks!

Hi guys,
Any chance you could look at the delta ingestion bug?

I reproduced the issue with a smaller data set, using the current stable release 0.8.1.

Basically, the counts do not come out as one would expect: batch-ingesting all 10 files gives a different total than batch-ingesting 7 files and then delta-ingesting the remaining 3.
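
For reference, the comparison is just a sum of the rollup count metric over the ingestion interval, run against each result; something along these lines, POSTed to the broker (the dataSource and interval here are the ones from the spec earlier in the thread, adjust for the smaller data set):

{
  "queryType": "timeseries",
  "dataSource": "events",
  "granularity": "all",
  "intervals": ["2015-09-13T03:00:00/2015-09-13T06:00:00"],
  "aggregations": [
    { "type": "longSum", "name": "total_rows", "fieldName": "count" }
  ]
}

Since the datasource rolls up at MINUTE granularity, summing the count metric gives the number of raw input rows, which is what should match between the two runs.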

Please let me know.

Thanks,

Johnny

Thanks Johnny. If you have a reproducible data set, do you mind filing a GitHub issue and providing the steps to reproduce? Thanks!

Xavier

Hi

I am getting the issue below while reading the existing segments.

I am trying to do delta ingestion, but the job is failing because it cannot read the existing segments.

"ioConfig" : {
  "type" : "hadoop",
  "inputSpec" : {
    "type" : "dataSource",
    "ingestionSpec" : {
      "dataSource": "count",
      "intervals": ["2015-09-16/2015-09-17"]
    }
  }
}

I have checked that the data exists by querying it, and the partition is created on disk.
But I get this error:

DatasourceInputFormat - Exception thrown finding location of splits
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2015-09-16T00:00:00.000Z_2015-09-17T00:00:00.000Z

I have checked the segments using

http://localhost:8081/druid/coordinator/v1/metadata/datasources/wikiticker/segments?full

and the segment exists:

[{"dataSource":"wikiticker","interval":"2015-09-16T00:00:00.000Z/2015-09-17T00:00:00.000Z",
"version":"2016-11-10T10:54:31.814Z","loadSpec":
{"type":"local","path":"/root/Documents/druid-0.9.1.1/var/druid/segments/wikiticker/wikiticker/2015-09-16T00:00:00.000Z_2015-09-17T00:00:00.000Z/2016-11-10T10:54:31.814Z/0/index.zip"},
"dimensions":"channel,cityName,comment,countryIsoCode,countryName,isAnonymous,isMinor,isNew,isRobot,isUnpatrolled,metroCode,namespace,page,regionIsoCode,regionName,user",
"metrics":"count,added,deleted,user_unique","shardSpec":{"type":"none"},
"binaryVersion":9,"size":8471,
"identifier":"wikiticker_2015-09-16T00:00:00.000Z_2015-09-17T00:00:00.000Z_2016-11-10T10:54:31.814Z"}]

Here is the full ioConfig for the delta ingestion that fails because it cannot read the existing segments:

 "ioConfig":{

         "type":"hadoop",

         "inputSpec":{

            "type":"multi",

            "children":[

               {

                  "type":"dataSource",

                  "ingestionSpec":{

                     "dataSource":"wikiticker",

                     "intervals":["2015-09-16/2015-09-17"]

                  }

               },

               {

                  "type":"static",

                  "paths":"quickstart/delta_ingest_data.json"

               }

            ]

         }

      }

regards
Satish S

Can you try updating to 0.9.2-rc2? There was a bug related to relative Hadoop paths that was fixed in this version. You can get it here: http://druid.io/downloads.html