Hi All,
I'm new to Druid and have been trying to load data from Google Cloud Storage. Here is my task configuration:
```
{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : {
      "dataSource" : …,
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "csv",
          "timestampSpec" : {
            "column" : "timestamp",
            "format" : "auto"
          },
          "columns" : […],
          "dimensionsSpec" : {
            "dimensions" : […]
          }
        }
      },
      "metricsSpec" : […],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "hour",
        "intervals" : ["2014-02-01/2014-05-01"]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "granularity",
        "dataGranularity" : "DAY",
        "inputPath" : "gs://some-bucket/dump",  // contains gs://some-bucket/dump/y=yyyy/m=MM/d=dd
        "filePattern" : ".*"
      }
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "partitionsSpec" : {
        "type" : "dimension",
        "targetPartitionSize" : 5000000,
        "partitionDimension" : "someDimension",
        "assumeGrouped" : true
      },
      "jobProperties" : {},
      "combineText" : true,
      "useCombiner" : true,
      "numBackgroundPersistThreads" : 10
    }
  }
}
```
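For what it's worth, jobProperties is empty because the GCS connector is already configured on my Hadoop cluster. If it weren't, I believe something like the following would be needed there (the project id is a placeholder, not my real value):

```
"jobProperties" : {
  // assumed setup: route the gs:// scheme through the GCS Hadoop connector,
  // which must be on the indexing task's classpath
  "fs.gs.impl" : "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
  "fs.AbstractFileSystem.gs.impl" : "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
  "fs.gs.project.id" : "my-project-id"  // placeholder
}
```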
There are about 4000 files for each day, each roughly 1 MB in size, and the batch ingestion took more than 9 hours to finish. After checking the log, I suspect the 'event' part below is where most of the time went:
```
2017-03-01T09:40:29,685 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2017-03-01T09:40:29.685Z","service":"druid/middleManager","host":"some-host","version":"0.9.3-SNAPSHOT","metric":"jvm/bufferpool/capacity","value":0,"bufferpoolName":"mapped","dataSource":["somedatasource"],"id":["index_hadoop_somedatasource_2017-03-01T06:42:20.393Z"]}]
2017-03-01T09:40:29,685 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2017-03-01T09:40:29.685Z","service":"druid/middleManager","host":"some-host","version":"0.9.3-SNAPSHOT","metric":"jvm/bufferpool/used","value":0,"bufferpoolName":"mapped","dataSource":["somedatasource"],"id":["index_hadoop_somedatasource_2017-03-01T06:42:20.393Z"]}]
2017-03-01T09:40:29,685 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2017-03-01T09:40:29.685Z","service":"druid/middleManager","host":"some-host","version":"0.9.3-SNAPSHOT","metric":"jvm/bufferpool/count","value":0,"bufferpoolName":"mapped","dataSource":["somedatasource"],"id":["index_hadoop_somedatasource_2017-03-01T06:42:20.393Z"]}]
```
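One thing I'm wondering: since the input is thousands of ~1 MB files, would tuning the combined split size help? A sketch of what I mean (this assumes combineText=true makes the task combine small files into larger splits; the 256 MB value is just a guess, not something I've tested):

```
"jobProperties" : {
  // assumption: with "combineText" : true many small files are packed into
  // one split; capping each combined split at 256 MB would spread the work
  // over more mappers (value for illustration only)
  "mapreduce.input.fileinputformat.split.maxsize" : "268435456"
}
```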
Is it possible to load data from Google Cloud Storage in production?
*For additional information, I set druid.indexer.runner.type=remote in the Overlord runtime properties.
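In other words, my Overlord runtime.properties includes the line below (the rest of the file is omitted):

```
druid.indexer.runner.type=remote
```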
Thank you