Updates & Dynamic dimensions and metrics

My source dataset (on HDFS) is organized in a YYYY/MM/DD directory layout and stored in Avro format. The data is updated every day over a rolling 10-day window. For example: on Day 11, data for Days 1 to 10 is generated; on Day 12, data for Days 2 to 12 is updated. Hence my source data keeps getting updated. Over time, new dimensions and metrics will also be added.

Questions

  1. Is there a data loader that can load an Avro dataset directly into Druid? (Currently, I convert the data into JSON before ingesting.)

  2. How do I update a dataset in Druid as my source dataset keeps getting updated? (The obvious solution is to drop the last 10 days of data and insert them again, but that means customer downtime. How do I avoid that?)

How will data be read while the new dataset is being computed for the same time period? Will there be any downtime?

  3. What will happen, and how, when I change the schema and ingest data with it?

Regards,

Deepak

#2. I had 2 records with metrics (m1 = 100, m1 = 100). I ingested them into Druid and it showed the aggregated value m1 = 200.
I did not change the timestamp but updated the metric value for both records (m1 = 300, m1 = 300), re-ingested, and after a while (once the data was loaded onto the historicals) I see m1 = 600.

So I am assuming that re-ingestion is the solution: it does not corrupt the dataset, it only updates it, and there was no need to stop any services.

#3. Same goes for this. I was able to add new dimensions without any downtime.

I am using Imply, and I had to bring the services down and start them again because the UI (Pivot) was not able to pick up the newly added dimensions.

Hello,

Druid uses MVCC, so newly indexed data overshadows the previous segments and there should be no downtime in serving queries. So yes, data can be re-indexed whenever you need to change the schema, roll-up, etc. You can read more about re-indexing use cases here: http://druid.io/docs/latest/ingestion/faq.html
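
For example, overwriting the trailing 10-day window is just a normal batch ingestion whose granularitySpec intervals cover that window; once the new segments are published and loaded, they overshadow the older versions for those intervals, and queries keep being served from the old segments until then. A rough sketch of such a Hadoop index task (the dataSource, metric, dates and HDFS paths here are placeholders, not from your setup):

{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "my_datasource",
      "parser" : { ... },
      "metricsSpec" : [
        { "type" : "doubleSum", "name" : "m1", "fieldName" : "m1" }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "intervals" : [ "2016-01-02/2016-01-12" ]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "hdfs://<namenode>/<base-path>/2016/01/02,hdfs://<namenode>/<base-path>/2016/01/03"
      }
    }
  }
}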

For Avro you can look into https://github.com/druid-io/druid/tree/master/extensions/avro-extensions, and for some related docs see https://github.com/druid-io/druid/pull/1858
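
One more note: the druid-avro-extensions extension has to be loaded by the indexing services before the avro_hadoop parser type is recognized. Assuming you are running a recent build from master (0.9.x-style config), that is a line like the following in common.runtime.properties; older releases use the druid.extensions.coordinates mechanism instead:

druid.extensions.loadList=["druid-avro-extensions"]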

Not sure about the Pivot issue though; maybe someone from Imply can answer that question. If not, you can try posting your issue on the Imply user group.

Thanks for the response.

I could not understand what I need to fill in for:

"parser" : {
  "type" : "avro_hadoop",
  "parseSpec" : {
    "format" : "timeAndDims",
    "timestampSpec" : {},
    "dimensionsSpec" : {}
  }
}

For JSON, I had:

"parser" : {
  "type" : "string",
  "parseSpec" : {
    "format" : "json",
    "dimensionsSpec" : {
      "dimensions" : [
        "qualifiedTreatmentId",
        "treatmentVersion",
        "browser",
        "classifier",
        "browserVersion",
        "isGeo",
        "geoBuyerCountryId",
        "vertical",
        "operatingSystem",
        "siteId",
        "experimentChannel",
        "finalGroupId"
      ]
    },
    "timestampSpec" : {
      "format" : "yyyy-MM-DD",
      "column" : "endDate"
    }
  }
},

However, as I want to use Avro ingestion directly, I now have:

"parser" : {
  "type" : "avro_hadoop",
  "parseSpec" : {
    "format" : "json",
    "dimensionsSpec" : {
      "dimensions" : [
        "qualifiedTreatmentId",
        "treatmentVersion",
        "browser",
        "classifier",
        "browserVersion",
        "isGeo",
        "geoBuyerCountryId",
        "vertical",
        "operatingSystem",
        "siteId",
        "experimentChannel",
        "finalGroupId"
      ]
    },
    "timestampSpec" : {
      "format" : "yyyy-MM-DD",
      "column" : "endDate"
    }
  }
},

What should the value of "format" be?

I had "json" earlier, and the commit page you pointed to uses "timeAndDims".

Regards,

Deepak

"timeAndDims" seems to be a valid parseSpec type. Is it not working? Note that this code was only recently merged and is not part of any stable release; as of now it is only in the master branch. Also, just to be clear, I have not personally used this feature. Thanks
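
In case it helps, here is a sketch of what your spec could look like with "timeAndDims" plugged in, simply reusing the timestampSpec and dimensions from your json parseSpec above; again, I have not tested this myself since the feature is only in master:

"parser" : {
  "type" : "avro_hadoop",
  "parseSpec" : {
    "format" : "timeAndDims",
    "timestampSpec" : {
      "format" : "yyyy-MM-DD",
      "column" : "endDate"
    },
    "dimensionsSpec" : {
      "dimensions" : [
        "qualifiedTreatmentId",
        "treatmentVersion",
        "browser",
        "classifier",
        "browserVersion",
        "isGeo",
        "geoBuyerCountryId",
        "vertical",
        "operatingSystem",
        "siteId",
        "experimentChannel",
        "finalGroupId"
      ]
    }
  }
}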

My question was: "type" : "avro_hadoop" signifies that the data to be read is in Avro format. What does "format" : "timeAndDims" imply?

The parseSpec for Avro is the same as for any other format. Follow the docs for the other formats.

I see. So, as per my understanding, the parser defines how the ingested data is parsed so that it can be converted into a Druid-specific row called InputRow, and the parseSpec supplies the information needed to parse the timestamp and dimensions used to create that InputRow.
