Query Druid from Spark

Hello everyone!

I need to query Druid periodically (every night, for instance). Thus, I am wondering what would be the best approach to do this.

Do you think the best approach is to query Druid from a Spark job? If so, could you please give me some guidelines on how to do this?

Thanks in advance.

Best regards,

José Correia

Hello again!

In terms of performance, is it better to query Druid from Spark using an HTTP POST with the native Druid query language, or using the SQL support with Calcite and Avatica (http://druid.io/docs/latest/querying/sql.html)?

The Druid docs say “Druid SQL translates SQL into native Druid queries on the query broker”. Because of this, I think there can be some delay when using SQL compared with the native Druid query language. Am I right?

Best regards,

José Correia

On Monday, 14 May 2018 at 15:56:44 UTC+1, José Correia wrote:

I need to query Druid periodically (every night, for instance). Thus, I am wondering what would be the best approach to do this. Do you think the best approach is to query Druid from a Spark job?

That depends on your requirements; for the most basic use case that fits that description, you could probably set up a cron job that sends a JSON query to Druid periodically.
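As an illustration, here is a minimal sketch of such a periodic job in Python. The broker URL, datasource name, and interval are hypothetical placeholders; the payload follows Druid's native timeseries query format.

```python
import json
from urllib import request

# Hypothetical broker endpoint -- adjust for your cluster.
BROKER_URL = "http://localhost:8082/druid/v2"

# A native Druid timeseries query counting rows for one day
# (datasource name and interval are placeholders).
native_query = {
    "queryType": "timeseries",
    "dataSource": "my_datasource",
    "granularity": "all",
    "intervals": ["2018-05-13/2018-05-14"],
    "aggregations": [{"type": "count", "name": "rows"}],
}

def post_query(query: dict) -> bytes:
    """POST the native JSON query to the broker and return the raw response."""
    req = request.Request(
        BROKER_URL,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.read()

# Scheduled nightly, e.g. via cron: 0 2 * * * python query_druid.py
```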

Because of this, I think there can be some delay when using SQL compared with the native Druid query language. Am I right?

Yes, there would be some overhead from the SQL planning. If it’s a big concern, you could prepend "EXPLAIN PLAN FOR " to the SQL query you’re issuing to get the equivalent native Druid query, and submit that native query for future requests.
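For example, a sketch of issuing such an EXPLAIN query against the SQL endpoint (the broker URL and datasource name are hypothetical):

```python
import json
from urllib import request

# Hypothetical broker SQL endpoint -- adjust for your cluster.
SQL_ENDPOINT = "http://localhost:8082/druid/v2/sql"

sql = "SELECT COUNT(*) FROM my_datasource"

# Prepending EXPLAIN PLAN FOR asks the broker to return the translated
# native query instead of executing it.
explain_payload = json.dumps({"query": "EXPLAIN PLAN FOR " + sql})

def fetch_plan() -> bytes:
    """POST the EXPLAIN query; the response body describes the native query."""
    req = request.Request(
        SQL_ENDPOINT,
        data=explain_payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.read()
```

The returned plan can then be submitted directly to the native query endpoint for future requests, skipping the SQL planning step.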

Thanks,

Jon

Hello Jon,

thanks for the reply.

I’m querying Druid from Spark because I need to process and use the data returned by the query. So now I’m evaluating whether it’s better to use a direct HTTP POST query or the JDBC connector. What’s your opinion?

Best regards,

José Correia

On Tuesday, 15 May 2018 at 22:02:37 UTC+1, Jonathan Wei wrote:

The question is whether you really need the flexibility of a declarative language like SQL as your base query language, in which case you will have to pay the cost of the lexer, then the cost of the optimizer, then the cost of serializing/deserializing data over the JDBC connection.

Or are you okay with using Druid’s native query language (a kind of imperative language) and avoiding all the cost added by the lexer/optimizer/SerDe? (FYI, you can query Druid using the Smile format if you are interested in minimizing IO.)

Hope this makes it clearer.

Fwiw I measured the overhead of Druid SQL over Druid native queries, and included a slide about it in a talk earlier this year: https://speakerdeck.com/implydatainc/nosql-no-more-sql-on-druid-with-apache-calcite-strata-sj-2018?slide=54. It was maybe 20–40ms. However, that was a local benchmark. In practice I’ve noticed serde overhead can be lower with SQL, since it usually transfers less data than Druid’s native API (especially if you use resultFormat “array”), so it is possible to actually come out ahead with SQL.
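To illustrate why the “array” resultFormat transfers less data, here is a small sketch (the SQL query and datasource name are hypothetical; the sample rows just show the shape of each format):

```python
import json

# A SQL payload for Druid's /druid/v2/sql endpoint; "my_datasource" is
# a placeholder.
sql_payload = {
    "query": "SELECT page, COUNT(*) AS cnt FROM my_datasource GROUP BY page",
    "resultFormat": "array",  # rows come back as plain arrays
}

# What the two resultFormats return for the same two result rows:
object_rows = [{"page": "A", "cnt": 10}, {"page": "B", "cnt": 7}]  # "object"
array_rows = [["A", 10], ["B", 7]]                                 # "array"

# "array" omits the repeated column names in every row, so the serialized
# response is smaller.
assert len(json.dumps(array_rows)) < len(json.dumps(object_rows))
```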

Thanks, Slim,

do you have any example using Smile format?

I couldn’t find any information about that on Druid docs.

Best regards,

José Correia

On Thursday, 17 May 2018 at 15:45:26 UTC+1, Slim Bouguerra wrote: