Druid evaluation

Druid core developers,
I’m in the process of evaluating Druid for a project that requires us to give our users the ability to slice and dice real-time event data by any available field. The core challenge is generating aggregate views like the one below from the streaming events:

  field groupings     team1       team2       team3
------------------|-----------|-----------|-----------|
- countryA        |    200    |    180    |    130    |
  - stateA        |    110    |     90    |     70    |
      eventType1  |     30    |     20    |     20    |
      eventType2  |     70    |     40    |     30    |
      eventType3  |     10    |     30    |     20    |
  - stateB        |     90    |     90    |     60    |
      eventType1  |     30    |     25    |     10    |
      eventType2  |     20    |     25    |     25    |
      eventType3  |     40    |     40    |     25    |
- countryB        |    155    |    121    |    210    |
  - stateA        |     90    |     70    |    100    |
      eventType1  |     45    |     20    |     55    |
      eventType2  |     20    |     15    |     20    |
      eventType3  |     25    |     35    |     25    |
  - stateB        |     65    |     51    |    110    |
      eventType1  |     15    |     11    |     10    |
      eventType2  |     40    |     30    |     35    |
      eventType3  |     10    |     10    |     65    |

The core requirements are as follows:

- Aggregates must be generated on-demand by the user and delivered under 2s.
- Aggregate counts need to be accurate as of that moment and cannot be pre-computed.
- Aggregation consists of basic counts (group-by) with additional filter criteria specified.
- Users have access to multiple views and can switch between views while using the application.
- Each view represents a different set of aggregated data (e.g. country->state->city or website->page->user).
- Users can create new views on the fly based on existing dimensions.
- Views contain aggregated data for only 'open' events.
- The system will have up to 3000 concurrent users all querying to generate various aggregate views.
- Events have a time-stamp field (i.e. creation time), but users will only query events with status='open':
  - Events can remain open for days or weeks, possibly months.
  - New events initially arrive with status='open'. Once the user takes action on the event (a separate screen), the event is marked 'closed'.
  - The corresponding event in Druid needs to be updated to 'closed' when the update arrives so the data can be excluded from the user's view.
  - Closed events should be moved out of the main store into historical storage.
- The complete universe of dimensions isn't known up front; new dimensions need to be added as new fields are introduced.
- As new event types are added to the system, this can introduce new fields users will want to group on. As a result, new dimensions need to be added to the live cluster.
- When a new dimension is added, the data needs to be re-indexed to make the new dimension available for querying new as well as historical data.

Open questions:

- How can new dimensions be added on the fly, and how does re-indexing happen?
- How can event data be updated, and how is re-indexing of that event performed?
- - If we have a high-volume stream of updates (500 msgs/sec), is the re-indexing of this stream data an expensive operation and would Druid be able to keep up?

Appreciate there’s a lot of detail to cover here. I am available to discuss offline if that is easier.

Thanks,

Fred

Hi Fred,

I do not have answers to all your questions (who ever does?) but I can shed light on a few things. Most of the things that you specified sound great and very easy to achieve.
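One note on the "new dimensions on the fly" requirement: Druid ingestion can be schemaless. If I remember correctly, leaving the dimensions list empty in the dimensionsSpec makes Druid treat every non-excluded string field in incoming events as a dimension, so newly introduced fields become queryable going forward without a schema change (historical data would still need a re-index to pick them up). A rough sketch of the relevant part of an ingestion spec (data source and field names are illustrative):

```json
{
  "dataSchema": {
    "dataSource": "events",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "timestamp", "format": "auto" },
        "dimensionsSpec": {
          "dimensions": [],
          "dimensionExclusions": ["timestamp", "value"]
        }
      }
    },
    "metricsSpec": [
      { "type": "count", "name": "count" }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "NONE"
    }
  }
}
```

With an empty "dimensions" array, any field not listed in "dimensionExclusions" or used as a metric is indexed as a dimension automatically.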

The only thing that might require some creativity is the way you want to do updates (I'll let someone more knowledgeable than me go into it).

What caught my eye about your post was the nested table that you are trying to get as the end result. You might want to have a look at Plywood ( https://github.com/implydata/plywood ) and the Pivot UI ( https://github.com/implydata/pivot ), two open-source projects that aim to make Druid more accessible.

Plywood is a smart query layer on top of Druid built (in part) to be able to generate nested queries like you have shown.

Pivot is a UI built using Plywood and thus can do this with ease.

If nothing else you can run Pivot with the --verbose flag to see how to most efficiently make the query that you desire.

Best regards,

Vadim

Thanks Vadim. This is very close to what I need. The only missing bit is having the columns be a dimension for a specific field. For example, using the 'team' field would produce a set of columns left-to-right of distinct teams, i.e. team A, team B, etc., with an aggregate count provided for each. Can that be done?

That is on the roadmap for Pivot and it is possible in Plywood.
For simplicity I will post the PlyQL query to demonstrate how it might work below.

I am only posting it because if you run plyql with the --verbose flag you can see how it structures its Druid queries and it might save you a couple of hours of work.

If you have any questions about Plywood, PlyQL, or Pivot or are interested in making use of those projects please post questions on their mailing list: https://groups.google.com/forum/#!forum/imply-user-group

Here is a query:

I made a nested table of Country (with limit) > State (with limit) > Language (with limit) by PageType (no limit), with the idea that Language is like your eventType and PageType is like your "team".

If "team" is a very high-cardinality dimension, you might want to apply a filter of WHERE team IN ('team1', 'team2', ..., 'teamN').

plyql --host localhost:8082 --data-source wikipedia --interval P1W -q "
SELECT
  country as 'Country',
  COUNT(*) as 'Count',
  (
    SELECT
      region as 'State',
      COUNT(*) as 'Count',
      (
        SELECT
          language as 'Language',
          COUNT(*) as 'Count',
          (
            SELECT
              newPage as 'NewPage',
              COUNT(*) as 'Count'
            GROUP BY 1
            ORDER BY NewPage ASC
          ) as 'PageType'
        GROUP BY 1
        ORDER BY Count DESC
        LIMIT 3
      ) as 'Languages'
    GROUP BY 1
    ORDER BY Count DESC
    LIMIT 3
  ) as 'States'
GROUP BY 1
ORDER BY Count DESC
LIMIT 3
"