Re-ingestion vs Re-indexing

I have been using Druid for more than a week now.
Our use case is such that the metrics data is going to change often (the dimensions will mostly stay constant), maybe once a month for some interval of time.
We are ingesting the data into Druid using Hadoop-based batch ingestion.
The ioConfig and dataSchema for the initial ingest are below:
"ioConfig" : {
  "type" : "hadoop",
  "inputSpec" : {
    "type" : "static",
    "paths" : "rawData.json"
  }
},
"dataSchema" : {
  "dataSource" : "dataSource",
  "granularitySpec" : {
    "type" : "uniform",
    "segmentGranularity" : "year",
    "queryGranularity" : "month",
    "intervals" : ["2015-01-01/2018-01-01"]
  },
For the update, I was trying to re-ingest the data for some time interval, say "2016-01-01/2016-06-01", using the spec below:

"ioConfig" : {
  "type" : "hadoop",
  "inputSpec" : {
    "type" : "static",
    "paths" : "rawDataUpdated.json"
  }
},
"dataSchema" : {
  "dataSource" : "dataSource",
  "granularitySpec" : {
    "type" : "uniform",
    "segmentGranularity" : "year",
    "queryGranularity" : "month",
    "intervals" : ["2016-01-01/2016-06-01"]
  }

I could see the updates getting reflected. I also read somewhere that Druid uses MVCC, which means it creates new segments and discards the older ones.

My question is: is my assumption above right? Also, is re-ingestion recommended for my use case, or should I consider re-indexing?

Yes, by default it will replace the data in any segments that intersect with the given interval. So be careful: if your update interval is smaller than the segment granularity you can lose data. For example, if your segment granularity is one year and you update data for a one-day interval, then the other 364 days in the segment will be empty. So you probably need to update the full one-year segment, or you will need delta ingestion (see the "multi" inputSpec section of http://druid.io/docs/latest/ingestion/update-existing-data.html) to combine the existing data with the new data; a sketch follows below.
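As a rough sketch only (not a complete spec), a Hadoop delta-ingestion ioConfig along the lines of that docs page could combine what is already in the dataSource with the new file. The dataSource name, interval, and path here are just the placeholders used earlier in this thread and may need adjusting:

"ioConfig" : {
  "type" : "hadoop",
  "inputSpec" : {
    "type" : "multi",
    "children" : [
      {
        "type" : "dataSource",
        "ingestionSpec" : {
          "dataSource" : "dataSource",
          "intervals" : ["2016-01-01/2017-01-01"]
        }
      },
      {
        "type" : "static",
        "paths" : "rawDataUpdated.json"
      }
    ]
  }
},

The "dataSource" child reads back the rows already stored for the 2016 segment, while the "static" child supplies the updated file, so the rewritten segment contains both.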

You can also merge segments.

Thanks Remizov for your reply.
So if I go by the re-ingest example spec:
"ioConfig" : {
  "type" : "hadoop",
  "inputSpec" : {
    "type" : "static",
    "paths" : "rawDataUpdated.json"
  }
},
"dataSchema" : {
  "dataSource" : "dataSource",
  "granularitySpec" : {
    "type" : "uniform",
    "segmentGranularity" : "year",
    "queryGranularity" : "month",
    "intervals" : ["2016-01-01/2016-06-01"]
  }
Suppose at the initial ingest one of the segments created covers 2016-2017; with the above re-ingestion spec I would lose the data for the rest of that year's segment and only retain data for the period "2016-01-01/2016-06-01". Is this right?
How about if I go with the index task?

"granularitySpec" : {
  "type" : "uniform",
  "segmentGranularity" : "year",
  "queryGranularity" : "month",
  "intervals" : ["2016-01-01/2016-06-01"]
},
"ioConfig" : {
  "type" : "index",
  "firehose" : {
    "type" : "local",
    "baseDir" : "someLocation",
    "filter" : "rawDataUpdated.json"
  }
},

Will I lose the data in this scenario too, or will it re-index only this period "2016-01-01/2016-06-01" and retain all the other days of the year?

Yes, in both cases it should replace the segment data, losing the rest of the segment.
In your case you need to combine the data already in the dataSource with the new data from the batch file.

Here is an example of the combining firehose ioConfig:
"ioConfig" : {
  "type" : "index",
  "firehose" : {
    "type" : "combining",
    "delegates" : [
      {
        "type" : "local",
        "baseDir" : "someLocation",
        "filter" : "rawDataUpdated.json"
      },
      {
        "type" : "ingestSegment",
        "dataSource" : "yourDataSourceName",
        "interval" : "2016-01-01/2017-01-01"
      }
    ]
  }
},


The ingestSegment delegate should use the interval from the dataSource that matches the segment intersecting with the interval covered by rawDataUpdated.json. With such a config you won't lose the data for the rest of the segment, but you should make sure that rawDataUpdated.json contains only new data that is not in the dataSource yet, so that rows do not get duplicated. A combined sketch follows below.
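For completeness, here is a rough end-to-end sketch of how the granularitySpec and the combining firehose could fit together for this thread's example. I am assuming that rows falling outside the granularitySpec intervals are dropped during indexing, so the intervals are widened to the full 2016 segment; the names and paths are the placeholders used earlier in the thread:

"granularitySpec" : {
  "type" : "uniform",
  "segmentGranularity" : "year",
  "queryGranularity" : "month",
  "intervals" : ["2016-01-01/2017-01-01"]
},
"ioConfig" : {
  "type" : "index",
  "firehose" : {
    "type" : "combining",
    "delegates" : [
      {
        "type" : "local",
        "baseDir" : "someLocation",
        "filter" : "rawDataUpdated.json"
      },
      {
        "type" : "ingestSegment",
        "dataSource" : "yourDataSourceName",
        "interval" : "2016-01-01/2017-01-01"
      }
    ]
  }
},

With this layout the ingestSegment delegate pulls back the untouched months of 2016 while the local delegate supplies the updated rows, and the whole year segment is rewritten as a new version, in line with the MVCC behavior mentioned above.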