How can we query Druid from a Spark distributed cluster?
Can we use Druid as the data store for Spark?
Right now I am experimenting with Spark and Druid. I am querying a single broker via HTTP. Am I going in the right direction?
Any suggestions would help.
I guess it depends on what you want to do.
If you just want to pull out some data or aggregates, the best thing would be to make Druid queries.
If you want all the data then it’s probably going to be faster to read the segments directly off deep storage.
I need the whole dataset matching a particular spatial query, and then I will analyze that data in Spark.
If I use deep storage, how would that work?
I am using S3 for deep storage. Do I need to change deep storage?
If you need all the data matching a spatial filter, then IMO, it still makes sense to use Druid to do that search. When I said try reading directly from deep storage if you need “all the data” I really meant all the data (no filter).
The best way to get out the raw data matching a filter is to use the “scan” query (http://druid.io/docs/latest/development/extensions-contrib/scan-query.html) which is currently a community-contributed extension. I think it might make its way into core at some point though.
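A scan query is just a JSON payload POSTed to the broker. A minimal sketch for the spatial-filter case above (the datasource name, interval, and the spatial dimension `coordinates` are placeholders for your schema):

```python
import json

# Sketch of a Druid "scan" query with a rectangular spatial filter.
# Datasource, interval, and dimension names below are assumptions.
scan_query = {
    "queryType": "scan",
    "dataSource": "my_datasource",          # placeholder datasource
    "intervals": ["2017-01-01/2017-02-01"],
    "resultFormat": "compactedList",
    "batchSize": 20480,
    "filter": {
        "type": "spatial",                  # filter over a spatially indexed dimension
        "dimension": "coordinates",
        "bound": {
            "type": "rectangular",
            "minCoords": [48.0, 2.0],
            "maxCoords": [49.0, 3.0],
        },
    },
}

# This is the body you would POST to the broker.
payload = json.dumps(scan_query)
```

Remember the scan-query extension must be loaded on the nodes serving the query, or the broker will reject the payload as an unknown query type.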
Or, if you want aggregated data, use one of Druid’s aggregation queries like topN or groupBy.
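For the aggregated route, a groupBy payload has the same shape; this sketch sums a metric per dimension value (dimension and metric names are placeholders):

```python
import json

# Sketch of a Druid "groupBy" query; all field names here are assumptions.
group_by_query = {
    "queryType": "groupBy",
    "dataSource": "my_datasource",          # placeholder datasource
    "granularity": "day",
    "dimensions": ["country"],              # placeholder dimension
    "aggregations": [
        # longSum aggregator over a hypothetical "events" metric
        {"type": "longSum", "name": "total_events", "fieldName": "events"}
    ],
    "intervals": ["2017-01-01/2017-02-01"],
}

body = json.dumps(group_by_query, indent=2)
```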
Thank you very much.
I will definitely try all the options and let you know.
When I use a groupBy or select query from Spark over HTTP, it takes time to get the data from the broker, and I am expecting millions of rows in one query result.
I am thinking of implementing pagination in the select HTTP query across multiple brokers, so the load of merging the data from 15 historical nodes does not fall on one broker.
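The select query pages through results via its pagingSpec: you send an empty pagingIdentifiers map for the first page, then echo back the identifiers from each response (with each offset advanced) to fetch the next page. A sketch of building successive page requests (datasource and threshold are placeholders):

```python
def build_select_page(paging_identifiers, threshold=1000):
    """Build one page of a Druid 'select' query.

    paging_identifiers: {} for the first page; for later pages, pass the
    pagingIdentifiers map returned in the previous response, with each
    offset advanced as described in the select-query docs.
    """
    return {
        "queryType": "select",
        "dataSource": "my_datasource",       # placeholder datasource
        "intervals": ["2017-01-01/2017-02-01"],
        "granularity": "all",
        "dimensions": [],                    # empty = all dimensions
        "metrics": [],                       # empty = all metrics
        "pagingSpec": {
            "pagingIdentifiers": paging_identifiers,
            "threshold": threshold,          # max rows per page
        },
    }

first_page = build_select_page({})
```

Each page request is independent, so different pages could in principle be sent to different brokers, though every select query still forces the serving broker to merge results from the historical nodes it fans out to.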
I am Abhishek. I tried connecting Spark and Druid using Sparkline but failed. Would you please tell me how you connected them?
I am not using Sparkline.
I am just querying the Druid broker through an HTTP POST request from a Spark worker.
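Roughly like this; a minimal sketch using only the standard library, assuming a broker at broker-host:8082 (host and datasource are placeholders). Inside Spark you would call this helper from a mapPartitions closure so each worker posts its own query:

```python
import json
from urllib import request

# Placeholder broker address; 8082 is the broker's default port and
# /druid/v2 is the query endpoint.
BROKER_URL = "http://broker-host:8082/druid/v2"

def druid_post(query):
    """POST a Druid query (a dict) to the broker and return the parsed
    JSON result. Uses urllib so Spark workers need no extra dependencies.
    """
    req = request.Request(
        BROKER_URL,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Network call: requires a live broker to actually run.
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example payload (a cheap timeBoundary query; datasource is a placeholder).
sample_query = {"queryType": "timeBoundary", "dataSource": "my_datasource"}
```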
Yes, you need to add the extension for the scan query on the broker and historical nodes.