"intervals" mismatching the "timestampSpec" column specified

Hey everyone,

I'm using the following JSON spec for batch ingestion from an HDFS SequenceFile:

{
  "type" : "index_hadoop",
  "spec" : {
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "inputFormat" : "org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat",
        "paths" : "/druid/26b75-3-a7f6-873c2220e-u017afc3f38a5f4924e-2_10"
      }
    },
    "dataSchema" : {
      "dataSource" : "subscription_static_single",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none"
      },
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "tsv",
          "delimiter" : "\uFFFD",
          "dimensionsSpec" : {
            "dimensions" : ["subscriber_id","circle_id","msisdn","temporary_block_flag","service_offering_value","account_group_id","community_id_1","community_id_2","community_id_3","activation_date","service_offering_bit","global_id","account_prepaid_empty_limit","data_block_status_3g","data_block_status_4g","field3","field4","field5","status","balance_date","file_date","file_id","load_date","load_user"]
          },
          "columns" : ["subscriber_id","circle_id","msisdn","temporary_block_flag","refill_failed_counter","main_account_balance","life_cycle_notification_report","service_offering_value","account_group_id","community_id_1","community_id_2","community_id_3","activation_date","service_offering_bit","global_id","account_prepaid_empty_limit","data_block_status_3g","data_block_status_4g","field3","field4","field5","status","balance_date","file_date","file_id","load_date","load_user"],
          "timestampSpec" : {
            "format" : "yyyy-mm-dd",
            "column" : "file_date"
          }
        }
      },
      "metricsSpec" : [
        {
          "name" : "count",
          "type" : "count"
        },
        {
          "fieldName" : "refill_failed_counter",
          "type" : "doubleSum",
          "name" : "refill_failed_counter"
        },
        {
          "fieldName" : "main_account_balance",
          "type" : "doubleSum",
          "name" : "main_account_balance"
        },
        {
          "fieldName" : "life_cycle_notification_report",
          "type" : "doubleSum",
          "name" : "life_cycle_notification_report"
        }
      ]
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000
      },
      "jobProperties" : {}
    }
  }
}

The file_date column contains only the value 2017-09-22, but Druid determines the intervals to be "2017-01-21T00:00:00.000Z/2017-01-22T00:00:00.000Z".

Is there any explanation for this odd scheme Druid uses to determine intervals? How can there be a gap of roughly eight months?

Hi Akul,

I think you meant to use yyyy-MM-dd for your format. Lowercase mm means minutes.
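So only the format string needs to change; something like this should do it:

"timestampSpec" : {
  "format" : "yyyy-MM-dd",
  "column" : "file_date"
}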

Hey Gian,

Thanks for the input. That works well, but I still have a doubt about the behaviour.

Now I'm getting the intervals ["2017-09-21T00:00:00.000Z/2017-09-22T00:00:00.000Z"].

Isn't the left side ("2017-09-21T00:00:00.000Z") inclusive and the right side ("2017-09-22T00:00:00.000Z") exclusive of the date range? If so, why doesn't the interval cover 2017-09-22, when that is the only value in the dataset being ingested?

When your format was set to yyyy-mm-dd, Druid interpreted the string "2017-09-22" as (probably) 2017-01-22 00:09:00, treating 09 as minutes and leaving every unspecified field at its lowest possible value. Now that you've fixed it, the dates should parse correctly.

Are you saying that you get the interval "2017-09-21T00:00:00.000Z/2017-09-22T00:00:00.000Z" even though the value is 2017-09-22? That might be a time zone issue. If so, the behavior will change in Druid 0.12.0 to use UTC for parsing rather than the local time of your Hadoop cluster (which is probably what's being used right now). You can get that behavior today by adding -Duser.timezone=UTC to the Java opts of your Hadoop cluster.
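If you only want to set it for the ingestion jobs, one place to put it is the jobProperties of your tuningConfig, along these lines (the heap sizes here are just placeholders, adjust them for your cluster):

"tuningConfig" : {
  "type" : "hadoop",
  "partitionsSpec" : {
    "type" : "hashed",
    "targetPartitionSize" : 5000000
  },
  "jobProperties" : {
    "mapreduce.map.java.opts" : "-Xmx1g -Duser.timezone=UTC -Dfile.encoding=UTF-8",
    "mapreduce.reduce.java.opts" : "-Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8"
  }
}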

Okay, cool.

But can I do it the other way round? That is, can I use the Hadoop cluster's time zone for ingestion when submitting the task to the Druid overlord?

In general, Druid is moving towards interpreting all strings as UTC unless they embed time zone information (for example, "2017-09-22T00:00:00-08:00" would be UTC-8). I think it would be reasonable to add a "timeZone" parameter to the timestampSpec, although it's not currently supported. If you're inclined to write a patch to add this feature, it would be welcome.

Hey Gian,

Will definitely try implementing the patch. Thanks a lot for your advice on the issue.

It would be great if you could help me with some of the following use cases for Druid:

  1. I've tried ingesting a complete Hive table (consisting of 1273 SequenceFiles) into Druid. The job was successful, but the task took approximately 1.5 hours to complete. The total size of the Hive table is 18 GB, and the segments created in HDFS come to 41 GB. Is this the usual time taken by Druid's MapReduce job for this amount of data, or am I getting some Druid configuration wrong?

  2. Is there any configuration to change the segment file format, which is .zip by default? Can I further reduce the size of the index.zip files?

  3. Can I fire ad hoc update/delete queries on specific rows in the dataSource?

Hi Akul,

For Hadoop-based ingestion you can try the usual map/reduce tunings to improve the time. Check which phase your job spends most of its time in (map, shuffle, or reduce) and tune accordingly. If it's map, you could try splitting big files or combining small files. If it's reduce, you could try slightly reducing targetPartitionSize, or switching to a finer segmentGranularity, to increase the number of reducers. Just be careful: you don't want your segments to be too small, or query performance can suffer.
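For example, to get more reducers you could try lowering targetPartitionSize a bit in your tuningConfig; the number below is just an illustrative starting point, not a recommendation:

"partitionsSpec" : {
  "type" : "hashed",
  "targetPartitionSize" : 2500000
}

Switching segmentGranularity from "day" to "hour" in your granularitySpec would have a similar effect, provided your data is spread across enough hours.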

Druid has just one segment format; what you see is what you get. It does have compression options, but by default it uses maximum compression, so there is nothing to change there if that's what you're after. (However, there are plans in the works to explore changing the default to something more balanced.)
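For reference, the compression-related knobs live in the indexSpec of the tuningConfig. The values below are meant to be roughly the current defaults, shown only so you can see where they would go:

"tuningConfig" : {
  "type" : "hadoop",
  "indexSpec" : {
    "bitmap" : { "type" : "concise" },
    "dimensionCompression" : "lz4",
    "metricCompression" : "lz4"
  }
}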

Druid doesn't support ad hoc updates and deletes, although you can update and delete entire time ranges using batch ingestion. And in some cases you can use the "lookups" feature to avoid needing updates at all.
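For example, re-running a batch ingestion with explicit intervals in the granularitySpec replaces whatever segments currently cover those intervals, which is how you would effectively update or delete a whole time range (the interval below is just an example):

"granularitySpec" : {
  "type" : "uniform",
  "segmentGranularity" : "day",
  "queryGranularity" : "none",
  "intervals" : ["2017-09-22/2017-09-23"]
}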