I am starting a thread here to see if anyone have any experience in reading data from s3 bucket and batch ingesting into druid.
Within the bucket, there are directories, the batch automatically will traverse the whole bucket by default. Inside the bottom of all directory structures there will be list of files, There will be numerous json data objects which can be ingested into the druid.
Here are some problem I am seeing:
- The data don’t rollup. i.e. The count is always 1, even there could be multiple entry of same time chunk, dimensions. I tried the forceGuranteeRollup but did not help.
- The job only read 1 entry per file in s3. Given there could be multiple json in the file, it tends to only count one.
Is it something wrong with my ingestion spec? Or there are some other settings I need to configured outside of the spec in order to fix the 1 and 2 above.
Here is how the job being called:
curl -X ‘POST’ -H ‘Content-Type:application/json’ -d @firehose.json http://{druid}:8090/druid/indexer/v1/task
The ingestion spec (firehose.json) looks like this:
"type": "index_parallel",
"spec": {
"dataSchema": {
"dataSource": "batch_event_stats",
"metricsSpec": [
{
"type": "count",
"name": "count"
}
],
"granularitySpec": {
"segmentGranularity": "hour",
"queryGranularity": "hour",
"rollup": true
},
"parser": {
"parseSpec": {
"format" : "json",
"flattenSpec": {
"useFieldDiscovery": true,
"fields": [
{
"type": "jq",
"name": "eventName",
"expr": ".payload.events[0].eventName"
},
{
"type": "jq",
"name": "eventTime",
"expr": ".payload.events[0].eventTime"
}
]
},
"timestampSpec": {
"column": "eventTime",
"format": "posix"
},
"dimensionsSpec": {
"dimensions": ["carrier", "eventName", "scope", "source"]
}
}
}
},
"ioConfig": {
"type": "index_parallel",
"firehose": {
"type": "static-s3",
"prefixes": ["s3://ccapp-druid-raw-qa"]
},"appendToExisting": false
},
"tuningconfig": {
"type": "index_parallel",
"maxNumSubTasks": 2
}
}
}