Druid deep storage and indexer options


I would like to ask whether it's possible to configure HDFS as Druid's deep storage but still be able to load (index) data from S3. Do I need to add the S3 extension for that?

Also, I would like to ask whether it's possible to define a different deep storage for specific indexing tasks.

For example, I would like to use HDFS as the default deep storage, but for the new Kafka indexer I would like to use S3 as the deep storage.


You should be able to have HDFS as deep storage but load data from wherever you want. If you're loading data from S3 without using M/R jobs (i.e., an index_hadoop task in remote mode), then you need the S3 extension as well.
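As a rough sketch, the common runtime properties might look something like this — HDFS set as deep storage, with both the HDFS and S3 extensions loaded so S3 can still serve as an input source (the storage directory path here is just a placeholder):

```properties
# Load both extensions: HDFS for deep storage, S3 for reading input data
druid.extensions.loadList=["druid-hdfs-storage", "druid-s3-extensions"]

# Deep storage is HDFS (example path -- adjust to your cluster)
druid.storage.type=hdfs
druid.storage.storageDirectory=/druid/segments
```

With this setup, segments are written to HDFS regardless of where the input data is read from.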

Deep storage is meant to be a cluster-wide config, though, so it's not straightforward to define different ones for different tasks. You might be able to do it somehow (you'd need some way of overriding the JVM system properties passed down to the tasks), but I'm not totally sure whether it would work.