How to roll up CSV data

Hi, I am facing a problem in Druid. I have a CSV file, and when I ingest more rows with the same values in a column, Druid creates two separate records for the same key instead of combining them.

Example: if I have a row `{ "date": "17.03.2020", "data": 20 }` and then ingest `{ "date": "17.03.2020", "data": 30 }`, the count should be 2 and the sum should be 50 for that day. Instead, Druid is creating two different rows for the same day.
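In other words, I expect the two rows above to roll up into a single row, roughly like the hypothetical output below (`count` and `sum_data` are just names I made up for a count metric and a sum over `data`):

```json
{ "date": "17.03.2020", "count": 2, "sum_data": 50 }
```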

How can I do roll-up here? I am unable to figure it out, since Druid is new to me.

**My spec:**

```json
{
  "type": "index_parallel",
  "ioConfig": {
    "type": "index_parallel",
    "inputSource": {
      "type": "local",
      "baseDir": "/Users/rg/Downloads/",
      "filter": "newsqllab_untitled_query_2_06T072452.csv"
    },
    "inputFormat": {
      "type": "csv",
      "findColumnsFromHeader": true
    },
    "appendToExisting": true
  },
  "tuningConfig": {
    "type": "index_parallel",
    "partitionsSpec": {
      "type": "dynamic"
    }
  },
  "dataSchema": {
    "dataSource": "test1",
    "granularitySpec": {
      "type": "uniform",
      "queryGranularity": "HOUR",
      "rollup": true,
      "segmentGranularity": "DAY"
    },
    "timestampSpec": {
      "column": "created_at",
      "format": "auto"
    },
    "dimensionsSpec": {
      "dimensions": [
        "app_type",
        "attestation_url",
        "date",
        "deleted",
        "device_id",
        "device_type",
        "event_type",
        "ip_address",
        "month",
        "country",
        "state",
        "district",
        "location",
        "city",
        "net_work_provider",
        "offline_to_online_sync",
        "os_type",
        "program_name",
        "role",
        "scan_in_time",
        "scan_out_time",
        "session_end_date",
        "session_name",
        "session_start_date",
        "timestamp",
        "topic_name",
        "traineerole",
        "updated_at",
        "user_id"
      ]
    },
    "metricsSpec": [
      {
        "name": "count",
        "type": "count"
      },
      {
        "name": "sum_event_id",
        "type": "longSum",
        "fieldName": "event_id"
      },
      {
        "name": "sum_no_of_participants",
        "type": "longSum",
        "fieldName": "no_of_participants"
      },
      {
        "name": "sum_program_id",
        "type": "longSum",
        "fieldName": "program_id"
      },
      {
        "name": "sum_session_id",
        "type": "longSum",
        "fieldName": "session_id"
      },
      {
        "name": "sum_topic_id",
        "type": "longSum",
        "fieldName": "topic_id"
      }
    ]
  }
}
```


How can I query this with aggregation by date and state, with roll-up? Kindly help.
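For instance, I imagine something like the native groupBy query below (just my rough sketch against my `test1` datasource, reusing the `count` and `sum_no_of_participants` metrics from my spec; I am not sure whether this is the right approach):

```json
{
  "queryType": "groupBy",
  "dataSource": "test1",
  "granularity": "day",
  "dimensions": ["date", "state"],
  "aggregations": [
    { "type": "longSum", "name": "total_count", "fieldName": "count" },
    { "type": "longSum", "name": "total_participants", "fieldName": "sum_no_of_participants" }
  ],
  "intervals": ["2020-03-01/2020-04-01"]
}
```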

Hi Rohit,

In Druid, rows are rolled up based on queryGranularity. In your case, queryGranularity is set to HOUR, so rows won't be rolled up on a daily basis. You can set queryGranularity to DAY and re-run the ingestion.
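With that change, the granularitySpec in your spec would become:

```json
"granularitySpec": {
  "type": "uniform",
  "queryGranularity": "DAY",
  "rollup": true,
  "segmentGranularity": "DAY"
}
```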

If all the dimension values are the same for a given queryGranularity period, the rows will be rolled up and the metrics will be aggregated.

Please go through the links below to understand more about rollup in Druid:
https://druid.apache.org/docs/latest/tutorials/tutorial-rollup.html
https://www.youtube.com/watch?v=u551R7voe7w

Thanks and Regards,

Vaibhav

Thanks, but even when I set it to DAY, it was not rolling up.

I didn't understand this part: "If all the dimension values are the same for a given queryGranularity period, the rows will be rolled up and the metrics will be aggregated."

Hi Rohit,

I think you are missing some of the basics here. I am typing this message on my mobile, so there may be some mistakes (pardon me for that).

You need to understand the items below first in order to understand roll-up in Druid:

1. Dimensions
2. Metrics
3. queryGranularity

- **Dimensions** in Druid are stored as-is, so they can be filtered on, grouped by, or aggregated at query time.

- **Metric** columns are stored pre-aggregated, so they can only be aggregated at query time (not filtered on or grouped by). They are often stored as numbers (integers or floats) but can also be stored as complex objects like HyperLogLog sketches or quantile sketches. Metrics can be configured at ingestion time even when rollup is disabled, but they are most useful when rollup is enabled (see the metricsSpec sketch just after this list).

- The **queryGranularity** field determines how data gets bucketed across the time dimension, i.e. how it gets aggregated: by hour, day, minute, and so on.
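To illustrate the metrics point above, a metricsSpec might look roughly like this. This is a sketch only: the column names are borrowed from your spec, and the HLLSketchBuild aggregator assumes the druid-datasketches extension is loaded.

```json
"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "longSum", "name": "sum_no_of_participants", "fieldName": "no_of_participants" },
  { "type": "HLLSketchBuild", "name": "distinct_users", "fieldName": "user_id" }
]
```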

For example, suppose you have set queryGranularity to DAY. Let's understand what that means with an example. Say we have these three input rows:

{“timestamp”: "2013-09-01T14:32:00,“event”:“click”,“noOfTimes”:“1” }
{“timestamp”: "2013-09-01T23:32:00,“event”:“click”,“noOfTimes”:“1”}
{“timestamp”: “2013-09-01T03:32:45Z”,“event”:“click”,“noOfTimes”:"1}

And let's say timestamp and event are defined as dimensions and noOfTimes as a metric in your ingestion spec, with queryGranularity equal to DAY.
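In spec terms, the relevant fragments for this toy example would look roughly like the sketch below (treating timestamp as the time column, which Druid stores as __time):

```json
"timestampSpec": { "column": "timestamp", "format": "auto" },
"dimensionsSpec": { "dimensions": ["event"] },
"metricsSpec": [
  { "type": "longSum", "name": "noOfTimes", "fieldName": "noOfTimes" }
],
"granularitySpec": { "type": "uniform", "queryGranularity": "DAY", "rollup": true }
```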

Then you would see that you get one row ingested, as below:

__time: "2013-09-01T00:00:00,

“event”:“click”

“noOfTimes”:“3”

Why so? All three rows are rolled up into one row because you have specified, via queryGranularity, that the time chunk is bucketed by DAY (you can think of it as your input data no longer having hour/minute/second-level details, only day-level data).

Also, the one other dimension in our example, event, has the same value (click) for all three rows, so the metrics are aggregated and the input rows are rolled up, resulting in the single row above.

So if you have set queryGranularity to DAY, your data after ingestion won't have hour/minute/second-level details in it; you can only query the data at the day level. But if that fits your requirement, roll-up will significantly reduce the data size/volume.

But roll-up will only take place if your dimension column values are the same across rows, and that makes sense: if the dimension column values are not the same, they are different rows.

That means if you have even a single dimension in your data that is a sort of primary key (all unique values), roll-up will NOT happen, irrespective of queryGranularity.
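Looking at your spec, columns such as timestamp, scan_in_time, scan_out_time, updated_at, ip_address, and device_id could well be near-unique per row; as long as such columns stay in dimensionsSpec, roll-up will have nothing to combine. A sketch of a trimmed dimensionsSpec (which columns you can actually drop depends entirely on your data and query needs):

```json
"dimensionsSpec": {
  "dimensions": [
    "date",
    "state",
    "district",
    "event_type",
    "program_name",
    "role"
  ]
}
```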

Hope this helps.

Thanks and regards,

Vaibhav



I got your point. Thank you so much for explaining.
One more doubt (what I want to achieve):

Suppose I have columns 'count' and 'no_of_participants'. The metrics are summing 'no_of_participants', and the count and roll-up happen correctly when the same timestamp appears with multiple 'no_of_participants' values in one file.

If I put the same data in another file, what happens is that the previous data gets replaced by the newer file's data. I want it so that if I pass a second file with the same timestamp, it gets rolled up and the 'no_of_participants' field is added to the previous data. Is that possible? (for local ingestion)