Compacting segments using the Hadoop indexer

Hi,

We’re having trouble using the Hadoop Druid indexer to compact existing small segments into larger ones. We are using version 0.12.0 of the indexing package.

We have a number of small segments that we want to compact into larger ones so there are fewer segments to keep track of.

We execute the compact command like this:

```
/opt/druid-0.12.0; java -Xmx512m -Ddruid.storage.storageDirectory=hdfs://10.2.19.128:8020/druid_indexed_data_dev \
  -Ddruid.storage.type=hdfs -Dfile.encoding=UTF-8 -classpath \
  extensions/druid-parquet-extensions/:extensions/druid-avro-extensions:extensions/druid-hdfs-storage:lib/:/opt/druid-0.12.0/config/_common:/etc/hadoop \
  io.druid.cli.Main index hadoop \
  /var/log/company/druid_index_spec_publisher_v2_dev_2018-08-06T000000_1537472792.85.json
```

It would be really helpful to see an example of a working compaction task spec for Hadoop indexing.

The one we have come up with isn’t working:

```
{
  "spec": {
    "tuningConfig": {
      "forceExtendableShardSpecs": true,
      "type": "index_hadoop",
      "maxRowsInMemory": 25000,
      "targetPartitionSize": 5000000
    },
    "ioConfig": {
      "type": "hadoop",
      "segmentOutputPath": "hdfs://10.2.19.128:8020/druid_indexed_data_dev",
      "metadataUpdateSpec": {
        "connectURI": "jdbc:mysql://10.2.19.24:3306/druid?characterEncoding=UTF-8",
        "password": "NOPE",
        "type": "mysql",
        "user": "druid",
        "segmentTable": "druid_segments"
      }
    },
    "interval": "2018-08-06T00/2018-08-06T01",
    "dataSource": "publisher_v2_dev",
    "type": "compact",
    "id": "druid_test_compaction"
  }
}
```

Generally we index the segments using this:

```
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {...},
    "hadoopDependencyCoordinates": [
      "org.apache.hadoop:hadoop-client:2.7.3"
    ],
    "ioConfig": {
      "type": "hadoop",
      "segmentOutputPath": "hdfs://10.2.19.128:8020/druid_indexed_data_dev",
      "inputSpec": {
        "filter": "part-",
        "paths": "hdfs://pipeline/2018-08-06/17/publisher/druid/part-00099.gz",
        "type": "static"
      },
      "metadataUpdateSpec": {
        "connectURI": "jdbc:mysql://10.2.19.24:3306/druid?characterEncoding=UTF-8",
        "password": "NOPE",
        "type": "mysql",
        "user": "druid",
        "segmentTable": "druid_segments"
      }
    },
    "tuningConfig": {
      "ingestOffheap": false,
      "rowFlushBoundary": 500000,
      "jobProperties": {
        "mapreduce.reduce.memory.mb": "8192",
        "mapreduce.reduce.java.opts": "-Xmx6144m -XX:+UseG1GC",
        "mapreduce.job.user.classpath.first": "true"
      },
      "combineText": false,
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000
      },
      "cleanupOnFailure": true,
      "indexSpec": {
        "bitmap": {
          "type": "roaring"
        }
      },
      "persistInHeap": false,
      "aggregationBufferRatio": 0.5,
      "leaveIntermediate": false,
      "overwriteFiles": true,
      "workingPath": "/var/druid/hadoop-tmp",
      "shardSpecs": {},
      "bufferSize": 5000000,
      "type": "hadoop",
      "useCombiner": false,
      "maxRowsInMemory": 500000,
      "ignoreInvalidRows": false
    }
  }
}
```

I’m unsure how to translate the index task into a compaction task that runs via the Hadoop indexer. The documentation at http://druid.io/docs/latest/ingestion/compaction.html says that some sections (tuningConfig, context, and id) are not required, but if we submit the spec without a tuningConfig it throws a NullPointerException.
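
For reference, this is roughly the shape the compaction docs describe, with our dataSource and interval filled in for illustration. The tuningConfig here is only a guess at a minimal native index-task config (the docs don’t show one), included because omitting it is what triggers the NullPointerException:

```
{
  "type": "compact",
  "id": "druid_test_compaction",
  "dataSource": "publisher_v2_dev",
  "interval": "2018-08-06T00/2018-08-06T01",
  "tuningConfig": {
    "type": "index",
    "maxRowsInMemory": 25000,
    "targetPartitionSize": 5000000
  }
}
```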

Any help greatly appreciated.

Hey Asher,

At the moment I don’t think the compact task can index using Hadoop; it uses Druid’s native indexing.

You can, however, create an index_hadoop task that uses existing segments as its input. That’s a little more complex, but it’s capable of doing the same thing. I’ve included the structure of what that might look like below, and there are more examples here: http://druid.io/docs/latest/ingestion/update-existing-data.html

```
{
    "type": "index_hadoop",
    "spec": {
        "ioConfig": {
            "type": "hadoop",
            "inputSpec": {
                "type": "dataSource",
                "ingestionSpec": {
                    "dataSource": "{replace_me}",
                    "intervals": [
                        "{replace_me}"
                    ]
                }
            }
        },
        "dataSchema": {},
        "tuningConfig": {
            "type": "hadoop",
            "partitionsSpec": {}
        }
    }
}
```
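
As a rough, untested sketch only: filling in the placeholders with the values from your post might look something like the following. The dataSchema still needs to be copied from your original index_hadoop spec, and I’ve kept your metadataUpdateSpec and hashed partitionsSpec as-is:

```
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {...},
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "dataSource",
        "ingestionSpec": {
          "dataSource": "publisher_v2_dev",
          "intervals": ["2018-08-06T00/2018-08-06T01"]
        }
      },
      "segmentOutputPath": "hdfs://10.2.19.128:8020/druid_indexed_data_dev",
      "metadataUpdateSpec": {
        "type": "mysql",
        "connectURI": "jdbc:mysql://10.2.19.24:3306/druid?characterEncoding=UTF-8",
        "user": "druid",
        "password": "NOPE",
        "segmentTable": "druid_segments"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000
      }
    }
  }
}
```

Since this is still an index_hadoop task, it should run the same way as your normal ingestion jobs; the dataSource inputSpec reads the existing segments for that interval and re-indexes them into fewer, larger segments. One caveat I’m not certain about: when running through the standalone CLI Hadoop indexer (rather than submitting the task to the overlord), you may need to check whether the dataSource inputSpec can discover the segment list through the metadata store or whether the segments have to be listed explicitly in the spec.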