Hi Gian,
That’s an interesting idea, thanks for the writeup!
It definitely could work for a lot of cases. But I believe the performance of that solution would be inferior to a case-optimized system, which I would like to build. One of the reasons is that the actual_time column would be very high cardinality (a different value for each row) with an enormous dictionary, so dictionary compression of it would be terrible. The right encoding for that column would use advanced integer compression ideas, significantly lowering its size. Of course, the ongoing work to support int/long dimensions in Druid might include optimizing columns like that.
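To make the integer-compression point concrete, here is a minimal sketch in plain Java (my own illustration, not Druid's actual encoding; it assumes timestamps stored as millisecond longs) of delta encoding a monotonically increasing time column. The first entry keeps the base value; every later entry becomes a small gap that a varint or bit-packing scheme could store in far fewer than 8 bytes:

```java
import java.util.Arrays;

// Illustration only: delta encoding of a sorted long column.
public class DeltaEncodingSketch {
    // Encode a sorted timestamp column as a base value followed by deltas.
    static long[] encode(long[] sortedTimestamps) {
        long[] out = new long[sortedTimestamps.length];
        long prev = 0;
        for (int i = 0; i < sortedTimestamps.length; i++) {
            out[i] = sortedTimestamps[i] - prev; // out[0] is the base; the rest are small gaps
            prev = sortedTimestamps[i];
        }
        return out;
    }

    // Decode by prefix-summing the deltas.
    static long[] decode(long[] deltas) {
        long[] out = new long[deltas.length];
        long prev = 0;
        for (int i = 0; i < deltas.length; i++) {
            prev += deltas[i];
            out[i] = prev;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] ts = {1473984000000L, 1473984001200L, 1473984005800L};
        long[] deltas = encode(ts); // {1473984000000, 1200, 4600}
        System.out.println(Arrays.equals(decode(deltas), ts)); // true
    }
}
```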
So I assume that Druid in its current shape, even with workarounds like yours, would not be optimal for my case. Let me know if you disagree.
But let’s continue brainstorming:
I see and appreciate Druid from two different angles:
1. As a very well designed distributed system with no-compromise fault tolerance and scalability, real-time ingestion, tiered historical storage, caching, and much more.
2. As a database optimized for OLAP-like processing and extremely good at it.
Now, what I would like to get is a system that fulfills 1), but instead of 2) is optimized for behavior analysis / pattern matching. That would probably mean redesigning or extending the segment format so that (a toy sketch of this layout follows the list):
a) events are ordered first by actor_id, then by time. Each actor's series is analyzed independently (well, in most cases; let's stick to that).
b) data is very well compressed, leveraging the properties of the case, especially for the time column (per-actor monotonically increasing integers). Keeping the size down matters for reading data faster and for keeping more data in memory and/or in cache.
c) storage is column-oriented: I bet columnar storage would outperform row-oriented storage for wide events.
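Here is the promised toy sketch of points a)-c) in plain Java. All names are mine and this is only an illustration of the intended layout, not a concrete segment-format proposal:

```java
import java.util.Arrays;
import java.util.Comparator;

// Sort events by (actor_id, time) so each actor's series is contiguous,
// then lay the sorted data out one column at a time.
public class ActorOrderedSegmentSketch {
    static final class Event {
        final int actorId;
        final long time;
        final String eventType;
        Event(int actorId, long time, String eventType) {
            this.actorId = actorId;
            this.time = time;
            this.eventType = eventType;
        }
    }

    public static void main(String[] args) {
        Event[] events = {
            new Event(2, 1000L, "view"),
            new Event(1, 2000L, "click"),
            new Event(1, 1000L, "view"),
            new Event(2, 3000L, "purchase"),
        };

        // a) order first by actor_id, then by time
        Arrays.sort(events, Comparator.<Event>comparingInt(e -> e.actorId)
                                      .thenComparingLong(e -> e.time));

        // c) columnar layout: one array per column, in the sorted row order
        int[] actorIds = new int[events.length];
        long[] times = new long[events.length];
        String[] types = new String[events.length];
        for (int i = 0; i < events.length; i++) {
            actorIds[i] = events[i].actorId;
            times[i] = events[i].time;
            types[i] = events[i].eventType;
        }

        // b) within each actor's run, `times` is monotonically increasing and
        // delta-encodes into small integers (see the earlier sketch), while the
        // sorted `actorIds` column is runs of repeats that RLE-compress very well.
        System.out.println(Arrays.toString(actorIds)); // [1, 1, 2, 2]
        System.out.println(Arrays.toString(times));    // [1000, 2000, 1000, 3000]
    }
}
```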
I understand that Druid is not there yet on all of these points at once. Now I see two options:
1. Implement my own database, reimplementing the features of Druid I appreciate in angle 1 above.
2. Gut Druid's storage layer and implement a specialized version for my type of analyses. I see that there is already work planned to generalize Druid's storage layer and support this option [1](https://github.com/druid-io/druid/issues/2965).
What do you think? Is it feasible to extend Druid to fit the path-analysis case? Do you think it's a lot of work? Has anyone done such a thing in the past (maybe for another use case)?
Thanks a lot for any advice,
Krzysztof
P.S. From the usability POV, to better understand what type of workload / functionality I would like to get, I think it's good to compare it to existing systems. Some of them are: the "Pattern Matching" feature of Oracle DB [2](https://docs.oracle.com/database/121/DWHSG/pattern.htm#DWHSG8963) and Teradata Aster nPath [3](https://developer.teradata.com/aster/articles/aster-npath-guide).
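For readers unfamiliar with those systems, here is a toy Java sketch (my own, purely illustrative) of the kind of per-actor computation such pattern queries express declaratively, e.g. finding actors whose time-ordered events contain a "view+ then purchase" run:

```java
import java.util.List;

// Hand-rolled version of a tiny path pattern (view+ purchase) over one
// actor's time-ordered event types; MATCH_RECOGNIZE / nPath let you state
// such patterns declaratively instead.
public class PathMatchSketch {
    static boolean matchesViewsThenPurchase(List<String> orderedTypes) {
        int views = 0;
        for (String t : orderedTypes) {
            if (t.equals("view")) {
                views++;
            } else if (t.equals("purchase") && views > 0) {
                return true; // one or more views immediately followed by a purchase
            } else {
                views = 0; // any other event breaks the run
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matchesViewsThenPurchase(
            List.of("view", "view", "purchase"))); // true
        System.out.println(matchesViewsThenPurchase(
            List.of("view", "click", "purchase"))); // false
    }
}
```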
P.P.S. I have intentionally not mentioned anything about a language for querying behavioral data, and the lack of one in Druid, as that problem is secondary to the right storage design and can be solved later.
On Fri, Sep 16, 2016 at 00:12, Gian Merlino gian@imply.io wrote: