Druid core developers,
I’m in the process of evaluating Druid for a project that requires giving our users the ability to cut/slice real-time event data on any available field. The core challenge is generating aggregate views similar to the one below from the streaming events:
field groupings     |   team1   |   team2   |   team3   |
--------------------|-----------|-----------|-----------|
- countryA          |    200    |    180    |    130    |
  - stateA          |    110    |     90    |     70    |
      eventType1    |     30    |     20    |     20    |
      eventType2    |     70    |     40    |     30    |
      eventType3    |     10    |     30    |     20    |
  - stateB          |     90    |     90    |     60    |
      eventType1    |     30    |     25    |     10    |
      eventType2    |     20    |     25    |     25    |
      eventType3    |     40    |     40    |     25    |
- countryB          |    155    |    121    |    210    |
  - stateA          |     90    |     70    |    100    |
      eventType1    |     45    |     20    |     55    |
      eventType2    |     20    |     15    |     20    |
      eventType3    |     25    |     35    |     25    |
  - stateB          |     65    |     51    |    110    |
      eventType1    |     15    |     11    |     10    |
      eventType2    |     40    |     30    |     35    |
      eventType3    |     10    |     10    |     65    |
- Aggregates must be generated on demand by the user and delivered in under 2s.
- Aggregate counts need to be accurate as of that moment and cannot be pre-computed.
- Aggregation consists of basic counts (group-by) with additional filter criteria specified.
- Users have access to multiple views and can switch between views while using the application.
- Each view represents a different set of aggregated data (e.g. country->state->city or website->page->user).
- Users can create new views on the fly based on existing dimensions.
- Views contain aggregated data for 'open' events only.
- The system will have up to 3000 concurrent users, all querying to generate various aggregate views.
- Events have a timestamp field (i.e. creation time), but users will query based on events with status='open'.
- Events can remain open for days/weeks, possibly months.
- New events initially arrive with status='open'. Once the user takes action on the event (on a separate screen), the event is marked 'closed'.
- The corresponding event in Druid needs to be updated to 'closed' when the update arrives so the data can be excluded from the user's view.
- Closed events should be moved out of the main store into historical storage.
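For concreteness, here's a minimal Python sketch of the aggregation semantics we need (the event shape and field names are illustrative only, not a Druid API): filter to open events, group by the view's dimensions, and pivot counts by team.

```python
from collections import defaultdict

def aggregate_open_events(events, group_fields, pivot_field):
    """Count events with status='open', grouped by group_fields,
    with counts pivoted by pivot_field (e.g. team)."""
    counts = defaultdict(lambda: defaultdict(int))
    for e in events:
        if e.get("status") != "open":
            continue  # views must exclude closed events
        key = tuple(e.get(f) for f in group_fields)
        counts[key][e.get(pivot_field)] += 1
    return {k: dict(v) for k, v in counts.items()}

# Hypothetical events; real ones would carry many more fields.
events = [
    {"status": "open", "country": "countryA", "state": "stateA",
     "eventType": "eventType1", "team": "team1"},
    {"status": "open", "country": "countryA", "state": "stateA",
     "eventType": "eventType1", "team": "team2"},
    {"status": "closed", "country": "countryA", "state": "stateA",
     "eventType": "eventType1", "team": "team1"},  # must not be counted
]

view = aggregate_open_events(events, ["country", "state", "eventType"], "team")
```

The point is that the grouping fields vary per view and are chosen at query time, so nothing can be pre-rolled-up per view.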
- The complete universe of dimensions isn't known up front; new dimensions need to be added as new fields are introduced.
- As new event types are added to the system, they can introduce new fields users will want to group on. As a result, new dimensions need to be added to the live cluster.
- When a new dimension is added, the data needs to be re-indexed so that the new dimension is available for querying both new and historical data.
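To illustrate the scenario (plain Python, with a made-up "browser" field standing in for a newly introduced dimension): once a new event type carries the field, users would group on it across all data, and older events without the field would need some null/"unknown" bucket.

```python
from collections import Counter

# Older events were ingested before the "browser" field existed;
# a hypothetical new event type introduces it.
events = [
    {"eventType": "eventType1", "country": "countryA"},
    {"eventType": "eventType4", "country": "countryA", "browser": "firefox"},
    {"eventType": "eventType4", "country": "countryB", "browser": "chrome"},
]

# Grouping on a dimension that predates some of the data: rows
# without the field fall into a None bucket.
by_browser = Counter(e.get("browser") for e in events)
```

The open question for us is whether achieving this in Druid requires re-indexing existing segments or whether older rows can simply report null for the new dimension.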
Given the core requirements above, my open questions are:
- How can new dimensions be added on the fly, and how does re-indexing happen?
- How can event data be updated, and how is re-indexing of that event performed?
- If we have a high-volume stream of updates (500 msgs/sec), is re-indexing that stream data an expensive operation, and would Druid be able to keep up?
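To clarify the second and third questions, this is the update semantics we're after (a plain-Python sketch of the desired behavior, not a claim about how Druid implements updates): each event carries a stable id, and a later message for the same id supersedes the earlier record.

```python
def apply_updates(messages):
    """Fold a stream of event messages into latest-state-per-event."""
    latest = {}
    for msg in messages:              # messages arrive in stream order
        latest[msg["eventId"]] = msg  # last write wins per event id
    return latest

# Hypothetical update stream: e1 is opened and later closed.
stream = [
    {"eventId": "e1", "status": "open", "country": "countryA"},
    {"eventId": "e2", "status": "open", "country": "countryB"},
    {"eventId": "e1", "status": "closed", "country": "countryA"},  # update
]

state = apply_updates(stream)
open_events = [e for e in state.values() if e["status"] == "open"]
print(len(open_events))  # 1 (only e2 is still open)
```

At 500 msgs/sec, the question is what this last-write-wins behavior costs on the Druid side, i.e. whether each update implies re-indexing work.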
Appreciate there’s a lot of detail to cover here. I am available to discuss offline if that is easier.