Druid as "BIG DATA" solution - particularly supporting unique metrics

Hi Guys,

I'm hoping you can help me with the research I have started for my project on Druid.

We are talking about a system with a retention of 12 months and
around 100 billion rows of data per year.

The objective is to run analytic queries on a web service;
therefore there are a few items I need to clarify, please see the list below:

  • From a user/concurrency point of view:

  • Is there a limit on the number of
    concurrent users running queries?

  • In case I need to add more
    users, what is the cost? (More hardware/servers/memory?)

  • From a data point of view:

  • Is there a limit on how
    many dimensions/lookup tables I can design? Are there any
    performance implications?

  • What will be the cost of
    adding another dimension/lookup table? (More
    hardware/servers/memory or redesign of the solution?)

  • What about the history? As I
    mentioned before, we are talking about one year of retention, but if I need
    to extend the retention time, for example to 3 years, what will the implications be?

  • From a programming point of view:

  • Are window functions
    supported?

  • If
    not, is there a workaround?

  • If
    partially, which ones are supported and which are not?

  • Are joins between tables
    supported? (Is there a limit on table size?)

  • Are nested queries supported?

  • Are functions like count, sum,
    avg, max, min, etc. supported? How do they work with uniqueness? (Count
    (distinct X), Sum (distinct X), etc.)

  • Are ranking functions, like in
    MSSQL, supported? (row_number, rank, dense_rank, ntile)

  • How does Druid
    support comparisons between populations? For example, I need to get the
    population of users who didn’t purchase in the last 3 days but
    clicked on my site at least once (where purchase and click are different
    event types in the system).

  • Flexibility from an integration
    point of view (can it work with Java/Python/etc.?)

A few words about the company I’m working for: it is a personalized
retargeting specialist providing web and app advertisers with display ads
(banners), in real time, for visitors who have left their sites without
completing a purchase. These users are served ads as they continue surfing the
web or browsing other apps. Personalized retargeting is a form of online
targeted advertising, in which online advertising is delivered to consumers
based on their previous actions (such as pages browsed, products added to
basket) on a company’s website or app.

I will appreciate any help you can provide, from answers based
on your experience to pointers to documents/tutorials/white papers/etc.

If any additional details are needed please let me know.

Thanks!
Guy

Druid is designed to handle slice and dice analytics on more data than that per DAY.

One of the advantages it has for slice and dice analytics (other than its speed at massive scale) is its ability to store more columns. Many DB solutions limit the number of columns per table (PostgreSQL, for example, caps it at 1600), so if you need more columns than that, you have to use a join or do something crazy within the table. Druid has no such fundamental limit.

Lookups are Druid’s current way of doing small-table-to-large-table joins, where “small” is on the order of tens to thousands of rows, and “large” is on the order of billions (or millions if you want).
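
To make the lookup idea concrete, here is a rough conceptual sketch in plain Python. It is not Druid's API: the table `COUNTRY_NAMES`, the field names, and the helper functions are all hypothetical. It only illustrates the shape of the operation, where a small key/value map is applied to each row of a large table before grouping.

```python
from collections import Counter

# Hypothetical "small" table: tens to thousands of key/value pairs.
COUNTRY_NAMES = {"US": "United States", "FR": "France", "IL": "Israel"}


def apply_lookup(rows, lookup, key, out, default="unknown"):
    """Yield each row with lookup[row[key]] attached under the name `out`."""
    for row in rows:
        yield {**row, out: lookup.get(row[key], default)}


def count_by(rows, dimension):
    """Group rows by a single dimension and count them."""
    return Counter(row[dimension] for row in rows)


# Stand-in for the "large" table (billions of rows in practice).
events = [
    {"user": "u1", "country_code": "US"},
    {"user": "u2", "country_code": "FR"},
    {"user": "u3", "country_code": "US"},
    {"user": "u4", "country_code": "ZZ"},  # no entry in the lookup
]

joined = apply_lookup(events, COUNTRY_NAMES, "country_code", "country")
counts = count_by(joined, "country")
```

Because the small side fits in memory, the large table is scanned once and no shuffle is needed, which is why this style of join only works when one side really is small.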

Hi Charles,

Can I do nested queries in Druid?

For example, if I have two groups of data and I want to check which users are in group A but not in group B.

Are ranking functions, like in MSSQL, supported? (row_number, rank, dense_rank, ntile)

Thanks!

You can issue nested groupBy queries. Not all ranking functions are supported right now in Druid.
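
As a rough sketch of what a nested groupBy could look like for the "clicked but did not purchase" case, the Python below builds a native JSON query body. The datasource name `events`, the columns `user_id` and `event_type`, and the interval are assumptions made up for illustration. The query fields follow Druid's native query format as I understand it, so please check them against the documentation for your version.

```python
import json


def filtered_count(name, dimension, value):
    """Count aggregator restricted to rows where `dimension` equals `value`."""
    return {
        "type": "filtered",
        "filter": {"type": "selector", "dimension": dimension, "value": value},
        "aggregator": {"type": "count", "name": name},
    }


def clicked_but_not_purchased(datasource, interval):
    """Build a nested groupBy: inner query per user, outer query counts users."""
    # Inner query: one row per user, keeping users with clicks and no purchases.
    inner = {
        "queryType": "groupBy",
        "dataSource": datasource,
        "granularity": "all",
        "intervals": [interval],
        "dimensions": ["user_id"],  # hypothetical column name
        "aggregations": [
            filtered_count("clicks", "event_type", "click"),
            filtered_count("purchases", "event_type", "purchase"),
        ],
        "having": {
            "type": "and",
            "havingSpecs": [
                {"type": "greaterThan", "aggregation": "clicks", "value": 0},
                {"type": "equalTo", "aggregation": "purchases", "value": 0},
            ],
        },
    }
    # Outer query: uses the inner query as its datasource and counts its rows.
    return {
        "queryType": "groupBy",
        "dataSource": {"type": "query", "query": inner},
        "granularity": "all",
        "intervals": [interval],
        "dimensions": [],
        "aggregations": [{"type": "count", "name": "users"}],
    }


# Hypothetical 3-day window and datasource name.
query = clicked_but_not_purchased("events", "2016-03-01/2016-03-04")
body = json.dumps(query, indent=2)
```

The resulting `body` is what would be sent to a Druid broker over HTTP. The same pattern covers "in group A but not in group B": the inner query computes per-user membership in each group, and the having clause keeps only users that are in A and not in B.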