Loading data from Google Cloud Storage

Hi All,

I'm new to Druid and have been trying to load data from Google Cloud Storage. Here is my task configuration:

```json
{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : {
      "dataSource" : …,
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "csv",
          "timestampSpec" : {
            "column" : "timestamp",
            "format" : "auto"
          },
          "columns" : […],
          "dimensionsSpec" : {
            "dimensions" : […]
          }
        }
      },
      "metricsSpec" : […],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "hour",
        "intervals" : ["2014-02-01/2014-05-01"]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "granularity",
        "dataGranularity" : "DAY",
        "inputPath" : "gs://some-bucket/dump",
        "filePattern" : ".*"
      }
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "partitionsSpec" : {
        "type" : "dimension",
        "targetPartitionSize" : 5000000,
        "partitionDimension" : "someDimension",
        "assumeGrouped" : true
      },
      "jobProperties" : {},
      "combineText" : true,
      "useCombiner" : true,
      "numBackgroundPersistThreads" : 10
    }
  }
}
```

(The inputPath directory contains day-partitioned subdirectories of the form gs://some-bucket/dump/y=yyyy/m=MM/d=dd.)
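A note on the empty jobProperties above: if the GCS connector has to be configured per job (rather than through the cluster's core-site.xml), my understanding from the gcs-connector documentation is that the properties would look roughly like the sketch below; the project ID and keyfile path are placeholders, not my real values:

```json
"jobProperties" : {
  "fs.gs.impl" : "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
  "fs.AbstractFileSystem.gs.impl" : "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
  "fs.gs.project.id" : "my-project-id",
  "google.cloud.auth.service.account.enable" : "true",
  "google.cloud.auth.service.account.json.keyfile" : "/path/to/service-account-key.json"
}
```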

There are about 4,000 files per day, roughly 1 MB each. The batch ingestion took more than 9 hours to finish. Checking the log, it looks to me like the phase that kept emitting these 'Event' lines is where most of that time went:

```
2017-03-01T09:40:29,685 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2017-03-01T09:40:29.685Z","service":"druid/middleManager","host":"some-host","version":"0.9.3-SNAPSHOT","metric":"jvm/bufferpool/capacity","value":0,"bufferpoolName":"mapped","dataSource":["somedatasource"],"id":["index_hadoop_somedatasource_2017-03-01T06:42:20.393Z"]}]
2017-03-01T09:40:29,685 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2017-03-01T09:40:29.685Z","service":"druid/middleManager","host":"some-host","version":"0.9.3-SNAPSHOT","metric":"jvm/bufferpool/used","value":0,"bufferpoolName":"mapped","dataSource":["somedatasource"],"id":["index_hadoop_somedatasource_2017-03-01T06:42:20.393Z"]}]
2017-03-01T09:40:29,685 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2017-03-01T09:40:29.685Z","service":"druid/middleManager","host":"some-host","version":"0.9.3-SNAPSHOT","metric":"jvm/bufferpool/count","value":0,"bufferpoolName":"mapped","dataSource":["somedatasource"],"id":["index_hadoop_somedatasource_2017-03-01T06:42:20.393Z"]}]
```
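To put numbers on the input: roughly 4,000 files × 1 MB is only about 4 GB per day, and the 2014-02-01/2014-05-01 interval spans about 89 days, so on the order of 356,000 small files if the whole interval ran as one task. My understanding is that, without combining, each small file can become its own map task, which is why I set combineText to true. In case the combine step also needs explicit split sizes, I believe the standard Hadoop 2.x properties would go into jobProperties along these lines (the 256 MB / 128 MB values are untested guesses on my part):

```json
"jobProperties" : {
  "mapreduce.input.fileinputformat.split.maxsize" : "268435456",
  "mapreduce.input.fileinputformat.split.minsize.per.node" : "134217728",
  "mapreduce.input.fileinputformat.split.minsize.per.rack" : "134217728"
}
```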

Is loading data from Google Cloud Storage like this viable for production use?

*For additional information: I have druid.indexer.runner.type=remote set in the Overlord runtime properties.

Thank you