Question about appendToExisting in batch indexing

Hi! I'm new to Druid, and I have a question about the "appendToExisting" property in batch indexing.

My understanding is that when "appendToExisting" is on, new segments are appended to the existing segments instead of replacing them.

So I tested this in Druid v0.10 and found that the previous segment is gone after the new segment finishes loading.

Below are my two ingestion specs.

-- 2016-01-03 --

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "hdfs://hadoop_host:8020/store_info/ymd=2016-01-03"
      },
      "appendToExisting": "true"
    },
    "dataSchema": {
      "dataSource": "test-datasource",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "ymd",
            "format": "yyyy-MM-dd",
            "missingValue": "2016-01-03"
          },
          "dimensionsSpec": {
            "dimensions": ["product_name"]
          }
        }
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "count"
        },
        {
          "type": "longSum",
          "name": "cntSum",
          "fieldName": "cnt"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "year",
        "queryGranularity": "day",
        "intervals": ["2016-01-03/2016-01-04"]
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "jobProperties": {
        "mapreduce.job.classloader": "true",
        "mapreduce.job.classloader.system.classes": "-javax.validation.,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop.",
        "mapreduce.map.memory.mb": 4096,
        "mapreduce.map.java.opts": "-server -Xmx4096m -Duser.timezone=UTC -Dfile.encoding=UTF-8",
        "mapreduce.reduce.memory.mb": 20480,
        "mapreduce.reduce.java.opts": "-server -Xmx20g -Duser.timezone=UTC -Dfile.encoding=UTF-8"
      },
      "partitionsSpec": {
        "type": "dimension",
        "partitionDimension": "product_name"
      },
      "buildV9Directly": "true",
      "forceExtendableShardSpecs": "true"
    }
  },
  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.6.0"]
}

-- 2016-01-04 --

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "hdfs://hadoop_host:8020/store_info/ymd=2016-01-04"
      },
      "appendToExisting": "true"
    },
    "dataSchema": {
      "dataSource": "test-datasource",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "ymd",
            "format": "yyyy-MM-dd",
            "missingValue": "2016-01-04"
          },
          "dimensionsSpec": {
            "dimensions": ["product_name"]
          }
        }
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "count"
        },
        {
          "type": "longSum",
          "name": "cntSum",
          "fieldName": "cnt"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "year",
        "queryGranularity": "day",
        "intervals": ["2016-01-04/2016-01-05"]
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "jobProperties": {
        "mapreduce.job.classloader": "true",
        "mapreduce.job.classloader.system.classes": "-javax.validation.,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop.",
        "mapreduce.map.memory.mb": 4096,
        "mapreduce.map.java.opts": "-server -Xmx4096m -Duser.timezone=UTC -Dfile.encoding=UTF-8",
        "mapreduce.reduce.memory.mb": 20480,
        "mapreduce.reduce.java.opts": "-server -Xmx20g -Duser.timezone=UTC -Dfile.encoding=UTF-8"
      },
      "partitionsSpec": {
        "type": "dimension",
        "partitionDimension": "product_name"
      },
      "buildV9Directly": "true",
      "forceExtendableShardSpecs": "true"
    }
  },
  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.6.0"]
}

I ingested the first (2016-01-03) and then the second (2016-01-04) sequentially.

Is it a segment granularity issue?

I expected the second segment to be appended to the old one (the first).

But as soon as the second one loaded, the old one was removed from the historical node.

In my use case, I don't do real-time ingestion, only batch Hadoop ingestion.

I also read the delta indexing document, but unfortunately I cannot use it.
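For reference, the delta indexing I read about uses a "multi" inputSpec that combines the existing datasource with new files. A rough sketch based on my reading of the docs, reusing my own datasource and paths (the interval here is just illustrative):

"inputSpec": {
  "type": "multi",
  "children": [
    {
      "type": "dataSource",
      "ingestionSpec": {
        "dataSource": "test-datasource",
        "intervals": ["2016-01-03/2016-01-05"]
      }
    },
    {
      "type": "static",
      "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
      "paths": "hdfs://hadoop_host:8020/store_info/ymd=2016-01-04"
    }
  ]
}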

Q1. Why is the old segment gone after the new segment is loaded, even though "appendToExisting" is set to "true"?

Q2. Can you explain the "appendToExisting" use case?
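For context, the place I found "appendToExisting" documented is the ioConfig of the native (non-Hadoop) index task, roughly the sketch below (the firehose values are placeholders); I could not tell from the docs whether the "hadoop" ioConfig honors it the same way.

"ioConfig": {
  "type": "index",
  "firehose": {
    "type": "local",
    "baseDir": "/path/to/data",
    "filter": "*.json"
  },
  "appendToExisting": true
}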

Thanks a lot.

Have a nice day! :)

Hi,

But as soon as the second one loaded, the old one was removed from the historical node.

How did you check that the old segments were removed from the historicals? If you saw segments removed from some specific historicals, they might have been moved to other historicals. You can see the segment distribution in the overlord or coordinator consoles.

Jihoon

On Fri, Jul 21, 2017 at 11:09 AM, 기준 <0ctopus13prime@gmail.com> wrote:

I checked the segment status in three ways:

1. In MySQL: the segment's "used" flag was 0.

2. In the coordinator UI: both segments were shown together for a few seconds (about 30s), then the old one was suddenly gone.

3. By submitting a select query (roughly the one sketched below): '2016-01-01' returned nothing; only the '2016-01-02' interval had data.
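The select query was roughly of this shape (the threshold value is arbitrary):

{
  "queryType": "select",
  "dataSource": "test-datasource",
  "intervals": ["2016-01-01/2016-01-02"],
  "granularity": "all",
  "pagingSpec": {
    "pagingIdentifiers": {},
    "threshold": 100
  }
}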

BTW, it's nice to meet you online!

I hope to see you at the next meetup!

On Friday, July 21, 2017 at 11:44:26 AM UTC+9, Jihoon Son wrote:

It’s nice to see you here too!

That's weird. Would you check your logs to see if there is anything strange?

On Fri, Jul 21, 2017 at 3:25 PM, 기준 <0ctopus13prime@gmail.com> wrote: