Looking to query Druid daily from a Spark cluster for further analytical processing (and then stream the results back into another Druid datastore).
I was wondering about the best approach for this. I have seen a couple of older posts that mention HTTP POST or JDBC - will both do the job, and is one better than the other? Other methods are also welcome. I would really appreciate any guidance or resources on how to do this!
Manu was spot-on (as always).
One thing I can say is that we use the JDBC option (https://druid.apache.org/docs/latest/querying/sql.html#jdbc) via periodic Spark jobs in order to query Druid and perform various ETLs.
It basically allows you to run a query on top of Druid (via JDBC) from a Spark application, get the result as a DataFrame, and process it as you normally do in Spark.
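To make the JDBC option concrete, here is a minimal sketch. The broker host, port, and datasource name ("events") are placeholders, not anything from this thread; the Spark read itself is shown in comments since it needs a live SparkSession and a broker. The Avatica URL format is the one from the Druid SQL docs.

```python
# Sketch of reading Druid query results into a Spark DataFrame over JDBC.
# Assumptions: broker host/port and datasource name ("events") are placeholders.

def avatica_jdbc_url(broker_host: str, port: int = 8082) -> str:
    """Build the Avatica JDBC URL for Druid's SQL endpoint."""
    return f"jdbc:avatica:remote:url=http://{broker_host}:{port}/druid/v2/sql/avatica/"

url = avatica_jdbc_url("broker.example.com")

# With a SparkSession in scope, the read would look roughly like:
#   df = (spark.read.format("jdbc")
#         .option("url", url)
#         .option("driver", "org.apache.calcite.avatica.remote.Driver")
#         .option("query", "SELECT channel, COUNT(*) AS cnt FROM events GROUP BY channel")
#         .load())

print(url)
```

Note that the Avatica JDBC driver jar has to be on the Spark executors' classpath (e.g. via `--jars`).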
This works OK, but it has two cons that immediately come to mind (as opposed to the Spark connector proposal Manu mentioned above, https://github.com/apache/druid/issues/9780):
- If you need to extract a really large dataset from Druid (as opposed to running some kind of aggregation query), you might encounter scalability issues (e.g. your broker might crash with an OOM).
- It only allows you to QUERY Druid, not INSERT the results back to Druid.
Let me know if that helps.
Thank you for the link, very informative! It’s a shame that Spark is not currently supported.
Do you by any chance have recommendations for other big-data analysis and machine-learning tools that have read/write support for Druid? (Spark was just the go-to option, given its popularity.)
Thanks for the reply! This is interesting and something I will definitely explore further.
1. How large do you mean by a really large dataset? I was wondering whether the subset of data I would be reading falls into that category.
2. Could you write the data back to Druid using a streaming tool like Kafka? Or would you advise against this for performance reasons, etc.?
Also, if you have any recommendations for other big-data ML tools with good Druid support, I’d be happy to look into those too.
To answer your questions:
- It’s hard to say in advance how large the resultset can be, since it depends on several factors (e.g. your brokers’ spec/config).
Theoretically, it can be millions of rows.
I would suggest first running your Spark job with the relevant query against some kind of sandbox Druid cluster, since it can potentially crash your brokers. Alternatively, limit your query to a small subset of the required data (e.g. by selecting only one time period) and, if that works, gradually expand the query by removing filters.
- You can write the results to Kafka from your Spark job, and then use Druid’s Kafka ingestion capabilities (I personally haven’t tried it, so I can’t say how well it works).
- As for ML tools, we use a few proprietary ones, but I have come across https://github.com/yahoo/sherlock in the past, if you want to take a look.
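The "start small, then widen" suggestion above can be sketched as building the same Druid SQL query over progressively larger `__time` windows. The table name ("events") and the window boundaries here are made-up placeholders:

```python
# Sketch: build a Druid SQL query restricted to one __time window, so a first
# run against a sandbox cluster touches only a small slice of the data.
# Table name "events" and the timestamps are placeholders.

def windowed_query(table: str, start: str, end: str) -> str:
    """Druid SQL limited to one __time window (ISO-style timestamps)."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE __time >= TIMESTAMP '{start}' AND __time < TIMESTAMP '{end}'"
    )

# First run: a single day. If the brokers handle it, widen the window
# (a week, a month, ...) while watching broker memory, before attempting
# the full extract.
q = windowed_query("events", "2020-06-01 00:00:00", "2020-06-02 00:00:00")
print(q)
```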
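For the Kafka hand-off mentioned above, one common shape (a sketch, not something confirmed in this thread) is to serialize each result row as JSON - which Druid's Kafka indexing service can parse - and write it with Spark's Kafka sink. The topic name and bootstrap servers below are placeholders:

```python
import json

# Sketch: encode one output row as a JSON string for a Kafka topic that a
# Druid Kafka ingestion supervisor reads from. Druid needs a timestamp
# column in each record; "druid-ingest" and "kafka:9092" are placeholders.

def to_kafka_record(row: dict) -> str:
    """JSON-encode one result row for Druid's Kafka ingestion."""
    return json.dumps(row)

record = to_kafka_record({"__time": "2020-06-01T00:00:00Z", "channel": "en", "cnt": 42})

# With a DataFrame `df` holding a string `value` column, the Spark write
# would look roughly like:
#   (df.selectExpr("CAST(value AS STRING) AS value")
#      .write.format("kafka")
#      .option("kafka.bootstrap.servers", "kafka:9092")
#      .option("topic", "druid-ingest")
#      .save())

print(record)
```

On the Druid side you would then point a Kafka ingestion supervisor spec at the same topic.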