Index exclusion of a dimension

Hi,
I’m ingesting data into Druid through Kafka and everything is working fine. As traffic grows, the number of rows is increasing, and I’m now getting multiple rows with the same field values within the same second.

In my spec file I have:

"timestampSpec" : {
    "column" : "insert_datetime",
    "format" : "yyyy-MM-dd HH:mm:ss"
},

So at the Druid level multiple rows collapse into one. For this reason I’ve added a new field called session_id that is unique for each row.

In this way all my rows are kept in Druid, but the size of the segments has increased a lot (about 5 times).

session_id is a field with very high cardinality (too high!), and in the end I don’t need it in any query.

So is there any way to exclude session_id from the indexing process?

Thanks

Maurizio

Hi Maurizio,

It is possible to exclude the session_id field from the indexing process, but then you would end up with rows for the same timestamp getting rolled up (aggregated) again, like you had before. Your index size increase is likely due to two factors: 1) the number of rows grew because events no longer get rolled up, and 2) the additional session_id column requires extra storage. Do you know how many more rows the new index has? That might give you a clue as to which of the two is the bigger factor.

You could also try reducing the length of your session_id to keep the strings as short as possible; that may help as well.

Alternatively, you can have Druid aggregate at millisecond granularity if your timestamps are fine-grained enough. That would only combine rows that share the same millisecond timestamp; would that be acceptable?
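For example, something like this in the spec, assuming your upstream system can actually emit millisecond timestamps (the granularitySpec shown here is only illustrative; "NONE" keeps full millisecond precision for rollup):

"timestampSpec" : {
    "column" : "insert_datetime",
    "format" : "yyyy-MM-dd HH:mm:ss.SSS"
},
"granularitySpec" : {
    "type" : "uniform",
    "segmentGranularity" : "HOUR",
    "queryGranularity" : "NONE"
}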

Alternatively, you can use “dimensionExclusions” to exclude a dimension from being ingested:
http://druid.io/docs/latest/ingestion/index.html
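As a rough sketch, assuming a JSON parseSpec like yours, the dimensionsSpec would look something like this (an empty "dimensions" list tells Druid to treat every non-excluded, non-metric field as a dimension):

"dimensionsSpec" : {
    "dimensions" : [],
    "dimensionExclusions" : ["session_id"],
    "spatialDimensions" : []
}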

Hi Fangjin,
as I understand it, dimensionExclusions removes the field from the dimensions, but if I do that, is my row still unique?
Isn’t excluding a dimension comparable to omitting the field from the JSON input?

Thanks
Maurizio

Hi Maurizio, adding a dimension to dimensionExclusions excludes that dimension from being indexed. It can still be a part of your raw data. Perhaps I misunderstood your initial question; what exactly are you trying to do?

Hi Fangjin,
the traffic I’m ingesting is increasing, and some of the rows I ingest are identical (same field values and same timestamp).

As I understand it, Druid merges these multiple rows into one, so I added a unique session id.

Obviously my segment size increased a lot because session_id has very high cardinality (every row has a different value).

The solutions Xavier suggested were reducing the length of session_id and moving to a timestamp with milliseconds.

So my question was whether there is a way to define an ingested field that is then skipped by the indexing process (I don’t know if this is possible).

What I’ve currently done is replace the session_id with a counter that resets every second. This gives me a counter that repeats every second (so low cardinality) while still guaranteeing the uniqueness of rows.

Any other suggestion that can help me?

Thanks

Maurizio

Maurizio, what exactly is wrong with the “dimensionExclusions” solution? Based on your description, it seems to do exactly what you need.

Hi Fangjin,
there is something I’ve probably misunderstood.

This is an example of data ingested from Kafka

{"advertiser_id":"514","banner_size":"320x480","bid":"1","bid_price":"3.000000","browser_id":"1","campaign_id":"3185","carrier_id":"583","company_id":"2","city_id":"10358","domain":"0","insert_datetime":"2015-10-19 09:21:21","offer_id":"2414","session_bid_id":"O_7244_0001_001445260881416094030728391807"}

{"advertiser_id":"550","banner_size":"320x250","bid":"1","bid_price":"2.000000","browser_id":"2","campaign_id":"3186","carrier_id":"0","company_id":"2","city_id":"10358","domain":"0","insert_datetime":"2015-10-19 09:21:21","offer_id":"2414","session_bid_id":"O_7244_0001_001445260881416094030728391807"}

{"advertiser_id":"521","banner_size":"320x480","bid":"1","bid_price":"1.000000","browser_id":"1","campaign_id":"3185","carrier_id":"583","company_id":"2","city_id":"10358","domain":"0","insert_datetime":"2015-10-19 09:21:21","offer_id":"2414","session_bid_id":"O_7244_0001_001445260881416094030728391807"}

{"advertiser_id":"514","banner_size":"320x480","bid":"1","bid_price":"3.000000","browser_id":"1","campaign_id":"3185","carrier_id":"583","company_id":"2","city_id":"10358","domain":"0","insert_datetime":"2015-10-19 09:21:21","offer_id":"2414","session_bid_id":"O_7244_0001_001445260881416094030728391200"}


Here we have 4 rows with the same insert_datetime; as you can see, the first row and the last one are identical except for the session_bid_id.

The session_bid_id value changes for every row, and it is inflating the segment size considerably.

Now the main question is:

  • if I put insert_datetime into dimensionExclusions, will these two rows be joined into one or not?

Thanks

Maurizio

Sorry, I mistyped.

The main question is:

  • if I put session_bid_id into dimensionExclusions, will these two rows be joined into one or not?

Hi Maurizio, if you put session_bid_id into dimensionExclusions, the dimension will be excluded from indexing, so the two rows will be joined into one.

Ok, so I understood correctly.
Since I need those rows to be counted as two and not joined into one, I need something that allows Druid to treat them as two distinct rows.

The solution I’ve currently implemented is to use a smaller unique key. My timestamps have second-level precision, so every second I create a counter that resets at the start of the next second.

In this way I have a key that keeps rows unique but whose values repeat every second.

This is the best solution I’ve found.

Regards,

Maurizio

Hi Maurizio,

If you want to count the number of rows before rollup, you can add a count aggregator at ingestion time and do a longSum over it at query time to get the number of rows in your raw dataset.

That way you can still roll up your data, keep the savings in space, and still know how many raw events were rolled up into each Druid row.
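For example, something along these lines (the datasource name and the interval below are just placeholders).

In the ingestion spec’s metricsSpec:

"metricsSpec" : [
    { "type" : "count", "name" : "count" }
]

And a query that sums it back up:

{
    "queryType" : "timeseries",
    "dataSource" : "your_datasource",
    "granularity" : "all",
    "aggregations" : [
        { "type" : "longSum", "name" : "raw_rows", "fieldName" : "count" }
    ],
    "intervals" : [ "2015-10-19T00:00:00/2015-10-20T00:00:00" ]
}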