Complex queries and schemaless metrics

Hi All,

I am new to Druid and so far it looks great! So thank you for the great work! :slight_smile:

We are evaluating Druid's suitability and performance, since our use case involves collecting analytics from our users, which can reach several TB per day.

Before deciding to use Druid, we need to answer the following questions, and I would really appreciate your help with them:

  1. Some of our clients are e-commerce firms, and we would like to be able to issue a performant query that answers the following question on cart abandonment:
  • Which distinct users (and how many of them) have left an abandoned cart in the last x minutes?

Our data model is modelled around events, so when a user adds something to the cart during session x, we store an "ADDED_TO_CART" event with the sessionId as a dimension, and when a purchase is completed, we store a "PURCHASED" event which also contains that sessionId as a point of reference.
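For example, two events from the same session might look like this (timestamps, identifiers, and field names are just illustrative):

```
{"timestamp": "2016-05-10T12:00:00Z", "eventName": "ADDED_TO_CART", "sessionId": "abc-123", "userId": "user-42"}
{"timestamp": "2016-05-10T12:05:00Z", "eventName": "PURCHASED", "sessionId": "abc-123", "userId": "user-42"}
```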

If we were doing this in SQL, we would issue a self-join on the same table to check that an "ADDED_TO_CART" event has occurred in the last x minutes AND there is NO "PURCHASED" event that has occurred after the "ADDED_TO_CART" event for that sessionId.

I am struggling to find a way to express this in the Druid query language.

Would it be more performant to split it into two queries and merge the results in another application layer?

What do you think would be the best way to write a query like this?

  2. We are looking for a system with a very flexible schema, so that we can allow our clients to create their own events. (Just like you can do in Google Analytics, for example, by posting your own events for tracking.)

I can see that Druid offers schemaless dimensions, which is great. From what I understand, though, the metrics must be specified up front in the ingestion schema. Is this correct?

We have some metrics that we know beforehand, but how can we offer aggregations on custom attributes that our users specify using our library?

If this is not supported, what do you think would be the best way to work around this issue?

As an example, a client might choose to send the following custom event: {"eventName": "holiday_package_selected", "location": "Bahamas", "price": 2000, "people": 2}

We would like them to be able to do aggregations on "price" and "people", but since we cannot know beforehand what a client will want to track, and there are no schemaless metrics, we are stuck…

One thought we had is to dynamically modify our schema when an event with a new numeric value comes through, set that field as a metric, and re-index our data.

Do you think re-indexing the data to include these numeric values as metrics would cause performance issues for all other clients of the application?

Apologies for any obvious questions and thank you in advance for the response. :slight_smile:

Regards,

Petros

Hey Petros,

For (1), Druid can help you with some things, especially when they involve counting. You can get an approximate count of the users in a particular time window who added to cart but did not purchase in that window by using theta sketches (http://druid.io/docs/latest/development/extensions-core/datasketches-aggregators.html) to compute "A intersect (not B)".
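A minimal sketch of a timeseries query that does this, assuming a "userId" dimension, an "eventName" dimension, and a datasource called "events" (adjust the names and intervals to your schema):

```
{
  "queryType": "timeseries",
  "dataSource": "events",
  "granularity": "all",
  "intervals": ["2016-05-10T12:00/2016-05-10T13:00"],
  "aggregations": [
    {
      "type": "filtered",
      "filter": { "type": "selector", "dimension": "eventName", "value": "ADDED_TO_CART" },
      "aggregator": { "type": "thetaSketch", "name": "added_users", "fieldName": "userId" }
    },
    {
      "type": "filtered",
      "filter": { "type": "selector", "dimension": "eventName", "value": "PURCHASED" },
      "aggregator": { "type": "thetaSketch", "name": "purchased_users", "fieldName": "userId" }
    }
  ],
  "postAggregations": [
    {
      "type": "thetaSketchEstimate",
      "name": "abandoned_cart_user_count",
      "field": {
        "type": "thetaSketchSetOp",
        "name": "added_not_purchased",
        "func": "NOT",
        "fields": [
          { "type": "fieldAccess", "fieldName": "added_users" },
          { "type": "fieldAccess", "fieldName": "purchased_users" }
        ]
      }
    }
  ]
}
```

The two filtered aggregators build one sketch of users who added to cart and one of users who purchased, and the post-aggregator estimates the size of the first set minus the second.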

Getting the full list of users is tougher. I think what you want is to group by user, check whether there was an add-to-cart, check whether there was a purchase, and then do a "HAVING added_to_cart > 0 AND purchased = 0". You can do this with a groupBy query, but it will break down if a lot of users match your query, since groupBy has a limit on the size of the result set that can be returned. Even if you raised Druid's limits, or if it didn't have limits at all, this kind of query is a big self-join and would take a while to return.
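Roughly, and again just a sketch under the same assumptions as above:

```
{
  "queryType": "groupBy",
  "dataSource": "events",
  "granularity": "all",
  "intervals": ["2016-05-10T12:00/2016-05-10T13:00"],
  "dimensions": ["userId"],
  "aggregations": [
    {
      "type": "filtered",
      "filter": { "type": "selector", "dimension": "eventName", "value": "ADDED_TO_CART" },
      "aggregator": { "type": "count", "name": "added_to_cart" }
    },
    {
      "type": "filtered",
      "filter": { "type": "selector", "dimension": "eventName", "value": "PURCHASED" },
      "aggregator": { "type": "count", "name": "purchased" }
    }
  ],
  "having": {
    "type": "and",
    "havingSpecs": [
      { "type": "greaterThan", "aggregation": "added_to_cart", "value": 0 },
      { "type": "equalTo", "aggregation": "purchased", "value": 0 }
    ]
  }
}
```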

IMO, if you want to do arbitrary session-oriented queries (not just counting things with sketches), you're better off pre-processing your data; your queries will be much faster. You can do this by having a stream processor transform the stream of events into a stream of sessions, and then loading that session stream into Druid. You could also load both the event and session streams (as separate datasources) and choose which one to use based on the query you're doing.
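A hypothetical sessionized row produced by such a stream processor might look like this (field names are just illustrative):

```
{"timestamp": "2016-05-10T12:00:00Z", "sessionId": "abc-123", "userId": "user-42", "addedToCart": 3, "purchased": 0}
```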

For (2), the simplest thing to do is to put the values you want to aggregate into fields like m01, m02, m03 and then define metrics like sum01, min01, max01, etc. So in your example, the thing that is emitting to Druid should emit {"eventName": "holiday_package_selected", "location": "Bahamas", "m01": 2000, "m02": 2}. If you want those values to be available as dimensions too, you can also include them under their original names.
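The corresponding part of the ingestion schema's metricsSpec would then look something like this (just a sketch; extend the pattern for as many m-fields as you need):

```
"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "doubleSum", "name": "sum01", "fieldName": "m01" },
  { "type": "doubleMin", "name": "min01", "fieldName": "m01" },
  { "type": "doubleMax", "name": "max01", "fieldName": "m01" },
  { "type": "doubleSum", "name": "sum02", "fieldName": "m02" }
]
```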

Hi Gian,

Thank you for your prompt response.

1a) You mentioned there is a limit on the groupBy query. What is this limit?

2a) For allowing arbitrary metrics, you suggested defining sum01, sum02, etc. as metrics. This means we basically have to pre-allocate, from the start, the x custom metrics a client could potentially use. So, for example, 5,000 clients times a limit of, say, 50 custom metrics per client = 250,000 pre-allocated metrics. Wouldn't a schema with 250k pre-allocated metric definitions heavily affect overall performance? It seems like a lot of wasted space to pre-allocate.

Is my understanding correct or did I misinterpret your suggestion and how Druid handles storage?

2b) Do you know if there is a way to change the real-time ingestion definition dynamically, without having to restart the Tranquility server?

2c) We have read the documentation on multitenancy, and since we have a lot of clients that will each only ever access their own data, we were thinking of using single-dimension partitioning by clientId. Is there a way to update the metrics definition only for data where clientId=X? Would something like that help in our use case?

2d) From my understanding, if a property is not declared as a metric in the schema, it is treated as a dimension and stored as a string value, even if it is a number in the original event payload. Our ultimate goal is to allow our clients to query their custom metrics within 1-2 seconds.

Do you think leaving our schema with only the original metrics, but writing a JavaScript aggregator at query time that does the actual sum over a dimension (e.g. m02) to get the total, would be more performant than storing 250k metrics?
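Something like the following is what we had in mind (just a sketch, assuming the custom value ends up in a string dimension called m02):

```
{
  "type": "javascript",
  "name": "custom_sum",
  "fieldNames": ["m02"],
  "fnAggregate": "function(current, m02) { return current + (parseFloat(m02) || 0); }",
  "fnCombine": "function(partialA, partialB) { return partialA + partialB; }",
  "fnReset": "function() { return 0; }"
}
```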

Giving our clients the opportunity to define their own attributes and slice-and-dice their data is very important to us, and it will be a decisive point in whether we end up using Druid in our architecture. That's why we are really keen on finding a solution.

Thanks again for your response!

Petros

Hi Gian/others,

Could you please follow up on this one?

If you require more information, please let me know and I’d be happy to provide.

Really appreciate your help,

Petros

Inline.

Hi Gian,

Thank you for your prompt response.

1a) You mentioned there is a limit on the groupBy query. What is this limit?

The limits are configurable; see the groupBy settings in http://druid.io/docs/0.9.0/configuration/historical.html

2a) For allowing arbitrary metrics, you suggested defining sum01, sum02, etc. as metrics. This means we basically have to pre-allocate, from the start, the x custom metrics a client could potentially use. So, for example, 5,000 clients times a limit of, say, 50 custom metrics per client = 250,000 pre-allocated metrics. Wouldn't a schema with 250k pre-allocated metric definitions heavily affect overall performance? It seems like a lot of wasted space to pre-allocate.

Is my understanding correct or did I misinterpret your suggestion and how Druid handles storage?

For metrics, you should really be finding a common schema for them. For example, why not just have one metric that represents all doubleSums, instead of one metric column per client?
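One way to apply that idea (just a sketch, not the only option): give every custom number a shared "value" field plus a "metricName" dimension saying what it is, so the ingestion schema only ever needs one doubleSum column. An event carrying several custom numbers would then be split into one row per metric:

```
{"eventName": "holiday_package_selected", "clientId": "X", "metricName": "price", "value": 2000}

"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "doubleSum", "name": "valueSum", "fieldName": "value" }
]
```

At query time, filtering on metricName = "price" (and on clientId) and aggregating valueSum gives that client's custom aggregation without per-client metric columns.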

2b) Do you know if there is a way to change the real-time ingestion definition dynamically, without having to restart the Tranquility server?

No, there is not.

2c) We have read the documentation on multitenancy, and since we have a lot of clients that will each only ever access their own data, we were thinking of using single-dimension partitioning by clientId. Is there a way to update the metrics definition only for data where clientId=X? Would something like that help in our use case?

Just have your application filter on clientId=X for client X's view of the data.
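For example, by adding a filter like this to every query the application issues on behalf of client X:

```
{ "type": "selector", "dimension": "clientId", "value": "X" }
```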

2d) From my understanding, if a property is not declared as a metric in the schema, it is treated as a dimension and stored as a string value, even if it is a number in the original event payload. Our ultimate goal is to allow our clients to query their custom metrics within 1-2 seconds.

Do you think leaving our schema with only the original metrics, but writing a JavaScript aggregator at query time that does the actual sum over a dimension (e.g. m02) to get the total, would be more performant than storing 250k metrics?

Storing 250k metrics makes no sense.

Giving our clients the opportunity to define their own attributes and slice-and-dice their data is very important to us, and it will be a decisive point in whether we end up using Druid in our architecture. That's why we are really keen on finding a solution.

Please understand we try to answer community questions when we are able. If you need dedicated help, please try http://imply.io/