We have set up real-time ingestion in our Druid cluster through Tranquility, and we are using Azure Blob Storage for deep storage - everything works fine so far.
We now want to set up batch ingestion through Hadoop and are stuck. Our process is as follows:
- We post the indexing task JSON using the 'post-index-task' script (a trimmed version of the spec we post is right after this list).
- We have successfully set up the corresponding extensions, and we are actually reading the data from Azure Data Lake, stored in Parquet format.
- The Map/Reduce job actually completes successfully and then tries to push the segments to Azure Blob Storage.
- At that point we get the following error:
- java.lang.NullPointerException: segmentOutputPath
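For completeness, here is roughly what we post (heavily trimmed; the dataSource, dimensions, metric, interval, and ADL path are placeholders standing in for our real values):

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "our_datasource",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": { "column": "timestamp", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["dim1", "dim2"] }
        }
      },
      "metricsSpec": [ { "type": "count", "name": "count" } ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2017-01-01/2017-02-01"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "adl://our-account.azuredatalakestore.net/path/to/data"
      }
    },
    "tuningConfig": { "type": "hadoop" }
  }
}
```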
It appears that druid-azure-extensions does not support batch ingestion through the Hadoop job (https://groups.google.com/forum/#!topic/druid-user/FxINZsIcZS4, https://groups.google.com/forum/#!topic/druid-user/c3xVhPjFUPg).
We completely understand that this is a community extension, so it has its limitations. Our question is: what are our current options for batch ingestion? E.g.:
- Try a simple index task (http://druid.io/docs/latest/ingestion/tasks.html#index-task) - will this work? Our dataset is not very large. (A sketch of what we would try is at the end of this question.)
- Use Spark through this plugin: https://github.com/metamx/druid-spark-batch
- Move away from Azure Blob Storage altogether and try other providers
Would you suggest any of the above, or any other solution?
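To make the first option concrete, here is a minimal sketch of the index task we would try. It rests on two assumptions on our side: since the Parquet extension appears to work only as a Hadoop InputFormat, we would first export the data to line-delimited JSON that a firehose can read, and we assume the plain index task pushes segments through the regular deep storage path rather than the Hadoop-specific one that fails for us. All names and paths below are placeholders:

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "our_datasource",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "timestamp", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["dim1", "dim2"] }
        }
      },
      "metricsSpec": [ { "type": "count", "name": "count" } ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2017-01-01/2017-02-01"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/data/export",
        "filter": "*.json"
      }
    },
    "tuningConfig": { "type": "index" }
  }
}
```

If the second assumption is wrong and the plain index task hits the same segment-push code path, then presumably option 1 is off the table as well.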