Hey everyone,
I’m using the following JSON spec for Hadoop batch ingestion from an HDFS SequenceFile:
{
  "type" : "index_hadoop",
  "spec" : {
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "inputFormat" : "org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat",
        "paths" : "/druid/26b75-3-a7f6-873c2220e-u017afc3f38a5f4924e-2_10"
      }
    },
    "dataSchema" : {
      "dataSource" : "subscription_static_single",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none"
      },
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "tsv",
          "delimiter" : "\uFFFD",
          "dimensionsSpec" : {
            "dimensions" : ["subscriber_id", "circle_id", "msisdn", "temporary_block_flag", "service_offering_value", "account_group_id", "community_id_1", "community_id_2", "community_id_3", "activation_date", "service_offering_bit", "global_id", "account_prepaid_empty_limit", "data_block_status_3g", "data_block_status_4g", "field3", "field4", "field5", "status", "balance_date", "file_date", "file_id", "load_date", "load_user"]
          },
          "columns" : ["subscriber_id", "circle_id", "msisdn", "temporary_block_flag", "refill_failed_counter", "main_account_balance", "life_cycle_notification_report", "service_offering_value", "account_group_id", "community_id_1", "community_id_2", "community_id_3", "activation_date", "service_offering_bit", "global_id", "account_prepaid_empty_limit", "data_block_status_3g", "data_block_status_4g", "field3", "field4", "field5", "status", "balance_date", "file_date", "file_id", "load_date", "load_user"],
          "timestampSpec" : {
            "format" : "yyyy-mm-dd",
            "column" : "file_date"
          }
        }
      },
      "metricsSpec" : [
        {
          "name" : "count",
          "type" : "count"
        },
        {
          "fieldName" : "refill_failed_counter",
          "type" : "doubleSum",
          "name" : "refill_failed_counter"
        },
        {
          "fieldName" : "main_account_balance",
          "type" : "doubleSum",
          "name" : "main_account_balance"
        },
        {
          "fieldName" : "life_cycle_notification_report",
          "type" : "doubleSum",
          "name" : "life_cycle_notification_report"
        }
      ]
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000
      },
      "jobProperties" : {}
    }
  }
}
The file_date column contains only the value 2017-09-22, yet Druid determines the interval to be "2017-01-21T00:00:00.000Z/2017-01-22T00:00:00.000Z".
Is there any explanation for how Druid determines intervals here? Where does the nine-month gap come from?
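
In case it helps, here is a minimal stand-alone sketch of how I would check what my format string does to the value outside of Druid. I'm assuming the timestampSpec format follows Joda-Time pattern syntax; the class name TimestampFormatCheck is just for illustration.

import org.joda.time.DateTime;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

public class TimestampFormatCheck {
    public static void main(String[] args) {
        // Pattern copied from the timestampSpec above.
        // In Joda-Time patterns, lowercase "mm" means minute-of-hour, while "MM" means month-of-year.
        DateTimeFormatter configured = DateTimeFormat.forPattern("yyyy-mm-dd");
        DateTimeFormatter monthBased = DateTimeFormat.forPattern("yyyy-MM-dd");

        DateTime a = configured.parseDateTime("2017-09-22");
        DateTime b = monthBased.parseDateTime("2017-09-22");

        // With "yyyy-mm-dd" the "09" is read as minutes and the month defaults to January,
        // giving something like 2017-01-22T00:09:00.000 in the JVM's default time zone.
        System.out.println("yyyy-mm-dd -> " + a);
        // With "yyyy-MM-dd" the "09" is read as the month: 2017-09-22T00:00:00.000.
        System.out.println("yyyy-MM-dd -> " + b);
    }
}

If the lowercase mm really is being treated as minutes, that would at least explain why everything collapses into January, though I'd appreciate confirmation that this is what Druid does with the format string.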