Hadoop batch ingestion on Azure Blob Storage


We have set up real-time ingestion in our Druid cluster through Tranquility, and we are using Azure Blob Storage for deep storage - everything works fine so far.

We now want to set up batch ingestion through Hadoop and are stuck. Our process is as follows:

  • We post the indexing task JSON using the ‘post-index-task’ script.
  • We have successfully set up the corresponding extensions and are reading the data from Azure Data Lake, stored in Parquet format.
  • The MapReduce job completes successfully and then tries to store the segments in Azure Blob Storage.
  • At that point we get the following error:

    java.lang.NullPointerException: segmentOutputPath
        at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:229)
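
For context, the failing task follows the standard Hadoop index task layout; a minimal sketch of what ours roughly looks like (datasource name, paths, intervals, and schema are placeholders, not our actual values):

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "example_datasource",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": { "column": "ts", "format": "auto" },
          "dimensionsSpec": { "dimensions": [] }
        }
      },
      "metricsSpec": [ { "type": "count", "name": "count" } ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2018-01-01/2018-01-02"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "adl://example.azuredatalakestore.net/path/to/data"
      }
    },
    "tuningConfig": { "type": "hadoop" }
  }
}
```

The NullPointerException is thrown when the job tries to derive the segment output path from the deep storage configuration, which is where the Azure extension seems to fall short.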

Apparently druid-azure-extensions does not support batch ingestion through the Hadoop indexing task (https://groups.google.com/forum/#!topic/druid-user/FxINZsIcZS4, https://groups.google.com/forum/#!topic/druid-user/c3xVhPjFUPg).
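
One workaround we have seen mentioned (we have not verified it ourselves) is to bypass druid-azure-extensions for segment storage and instead use druid-hdfs-storage with the wasb:// filesystem from hadoop-azure, so the Hadoop job receives a filesystem path it understands. A sketch, with the account and container names as placeholders:

```
# common.runtime.properties (sketch; account/container are placeholders)
druid.extensions.loadList=["druid-hdfs-storage"]
druid.storage.type=hdfs
druid.storage.storageDirectory=wasbs://druid-container@myaccount.blob.core.windows.net/segments
```

This would also require the hadoop-azure jars on the classpath and the storage account key configured in core-site.xml (fs.azure.account.key.<account>.blob.core.windows.net).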

We understand completely that this is a community extension, so it has its limitations. Our question is: what are our current options for batch ingestion? E.g.:

  1. Try a simple index task (http://druid.io/docs/latest/ingestion/tasks.html#index-task) - will this work? Our dataset is not very large.
  2. Use Spark through this plugin: https://github.com/metamx/druid-spark-batch
  3. Move away from Azure Blob Storage altogether and try other providers.
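
Regarding option 1: since real-time ingestion already pushes segments to Azure successfully, our understanding is that the native index task uses the same segment pusher rather than the Hadoop output path, so it may sidestep the segmentOutputPath problem entirely. A minimal sketch of such a spec (datasource, paths, intervals, and schema are placeholders):

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "example_datasource",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "ts", "format": "auto" },
          "dimensionsSpec": { "dimensions": [] }
        }
      },
      "metricsSpec": [ { "type": "count", "name": "count" } ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2018-01-01/2018-02-01"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/tmp/data",
        "filter": "*.json"
      }
    },
    "tuningConfig": { "type": "index" }
  }
}
```

The trade-off, as we understand it, is that the index task runs single-node on a middle manager, so it is only suitable while our dataset stays small.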

Would you suggest any of the above, or any other solution?