Questions about druid-hadoop-inputformat

Hi,

My 2 cents:

1. This code actually loads Druid segments from HDFS, so your Druid Historical nodes are not in the picture anyway.

2. If you trace your code at a lower level, I guess you will find that most of that time is spent reading from HDFS or S3 plus unzipping raw Druid segments (you will probably see that ~90% of the time is I/O and unzip).

I am not sure what your use case is, but you can work around this by querying the Druid Historicals rather than cold data from deep storage. That is how Hive 2 queries Druid and gets sub-second query response times: https://hortonworks.com/blog/sub-second-analytics-hive-druid/. A rough sketch of that setup is below.
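Roughly, the setup from that post looks like this (a minimal sketch; the datasource, table, and column names are made up for illustration): you register an existing Druid datasource as an external Hive table through the Druid storage handler, and Hive/Calcite rewrites eligible queries into Druid queries that run against the Broker and the Historicals.

  -- Broker address (usually set in hive-site.xml; value here is an example).
  SET hive.druid.broker.address.default=broker-host:8082;

  -- Register an existing Druid datasource as an external Hive table;
  -- the schema is discovered from Druid, so no column list is needed.
  CREATE EXTERNAL TABLE pageviews_druid
  STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
  TBLPROPERTIES ("druid.datasource" = "pageviews");

  -- A query like this is rewritten into a Druid groupBy query and
  -- answered by the Broker/Historicals, which is where the
  -- sub-second latency comes from.
  SELECT country, COUNT(*) AS views
  FROM pageviews_druid
  WHERE site = 'shop'
  GROUP BY country;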

Hi Slim,

Thank you for your reply.

  1. You’re right. The code reads segments from HDFS. What I mentioned about the Historicals was my mistake.

My purpose in using druid-hadoop-inputformat is to use Druid’s segment files as a replacement for Parquet, to take advantage of the inverted index.

  2. As far as I know, and as described here, the Hive/Druid integration is built on top of the Druid Broker.

To build my web analytics, I need to run complex queries (INNER JOIN, LEFT JOIN, sub-queries). Druid doesn’t have this feature, and when I tested Hive/Druid, it did not support JOINs/sub-queries over Druid either. The Hive/Druid integration just converts a SELECT statement into Druid’s query DSL and executes that DSL against a Broker, so if the SELECT statement is too complex to be converted into the DSL, the execution fails. (Please let me know if I am wrong.)
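To make that concrete, here is the kind of difference I saw (table and column names below are hypothetical, just for illustration): a simple filtered aggregation over a single Druid-backed table converts cleanly to the Druid DSL, while a statement with a JOIN and a sub-query has no single-query Druid equivalent, and that is the shape of statement that failed in my tests.

  -- Converts to one Druid groupBy query and runs via the Broker.
  SELECT country, COUNT(*) AS pageviews
  FROM pageviews_druid
  WHERE site = 'shop'
  GROUP BY country;

  -- Cannot be expressed as a single Druid query: the JOIN against a
  -- dimension table and the sub-query are the parts that failed for me.
  SELECT u.segment_name, COUNT(*) AS pageviews
  FROM pageviews_druid p
  INNER JOIN user_segments u ON p.user_id = u.user_id
  WHERE p.user_id IN (SELECT user_id FROM active_users)
  GROUP BY u.segment_name;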

I think that if druid-hadoop-inputformat supported high read throughput and used the inverted index for pushed-down filters, Druid segments would be a perfect columnar storage format in many cases, for both analytic OLAP and filtered OLTP-style queries.

That’s why I asked two questions in this post.

Regards,

Jason

On Monday, December 11, 2017 at 11:56:04 AM UTC+9, Slim Bouguerra wrote: