Druid Segment data format

Hi all,

I have a question about the segment storage format. Is there a way for Druid to store data in deep storage in some custom format?

For example, I want to use Druid for my analytics needs and use HDFS to store the data (segments in Druid) generated by real-time nodes. However, say I also want to take the same data stored on HDFS and run MapReduce jobs directly on it for some insights. Is it possible using the current segment storage format to support such a




I believe there were some folks working on this. I’ll let Himanshu respond with his thoughts here.


Yes, you can use druid-mr in https://github.com/himanshug/druid-hadoop-utils to do that.
See the sample application at https://github.com/himanshug/druid-hadoop-utils/blob/master/druid-mr/src/main/java/com/yahoo/druid/hadoop/example/SamplePrintMRJob.java

Please let me know if you find any issues.

I’m thinking of moving the DruidInputFormat, with some finishing touches, into Druid itself so that it is easily accessible.

– Himanshu

Thanks, Himanshu.



Hi Himanshu, we have a similar requirement: we want to use Druid as a component to store data for a specific application, and keep the deep storage copy of that data for analytics use.

For that, we need to read the data present in HDFS (Druid's smoosh files). Is there any update on this topic you can share?


Arpan Khagram
