Batch indexing without Hadoop

Hi all,

I’d like to introduce Druid into our data stack, but the dependency on Hadoop is a real show-stopper for me.

I have a GlusterFS cluster, which I’d like to use as deep storage for Druid, and a Spark + Tachyon cluster with GlusterFS used as an underfs for Tachyon.

I’d like to use the Indexing Service for real-time ingestion, but I’ll also need to periodically do some batch ingestion of large Parquet files (> 1 GB) stored in GlusterFS.

What would be the best way (if there is one) of accomplishing this without adding Hadoop to the stack?

We’re working on https://github.com/druid-io/druid/issues/1642 and that work should be completed in a month or so; afterwards you’ll be able to stream files, data in Kafka, etc. directly into Druid without any dependency on Hadoop. The slower way of indexing static batch data is to use “index tasks”, which don’t require Hadoop: http://druid.io/docs/latest/misc/tasks.html (search for Index Task)
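
For reference, an index task is just a JSON spec POSTed to the Overlord’s task endpoint. Below is a rough sketch in Python of what submitting one could look like; the datasource name, dimensions, interval, and file paths are made up for illustration, so double-check the exact fields against the tasks doc above for your Druid version.

```python
# Sketch: submit a plain (non-Hadoop) "index" task to the Overlord.
# Datasource name, dimensions, interval, and baseDir are hypothetical --
# verify field names against the tasks documentation for your Druid version.
import json
import requests  # assumes the requests library is installed

index_task = {
    "type": "index",
    "spec": {
        "dataSchema": {
            "dataSource": "example_datasource",
            "parser": {
                "type": "string",
                "parseSpec": {
                    "format": "json",
                    "timestampSpec": {"column": "timestamp", "format": "auto"},
                    "dimensionsSpec": {"dimensions": ["dim1", "dim2"]}
                }
            },
            "metricsSpec": [{"type": "count", "name": "count"}],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "DAY",
                "queryGranularity": "NONE",
                "intervals": ["2015-09-01/2015-09-02"]
            }
        },
        "ioConfig": {
            "type": "index",
            # The "local" firehose reads flat files from a path visible to the
            # middle managers, e.g. a GlusterFS mount.
            "firehose": {
                "type": "local",
                "baseDir": "/mnt/glusterfs/druid-input",
                "filter": "*.json"
            }
        }
    }
}

resp = requests.post(
    "http://overlord-host:8090/druid/indexer/v1/task",
    data=json.dumps(index_task),
    headers={"Content-Type": "application/json"},
)
print(resp.status_code, resp.text)
```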

Nice plan.
After reading the docs more thoroughly, I still have one unanswered question, though, related to the input format for both real-time and batch ingestion.

We’re pushing protobuf-formatted messages to Kafka; how can I configure ingestion so that Druid understands this protobuf input?

Also, I have a lot of big Parquet files on GlusterFS; how can I make Druid batch ingest from these files? Is it possible right now?

Just filed this, because it seems there’s enough interest from the community to get it out as swiftly as possible.

Here’s the status of the project so far:

  • Memory problems… Since the Druid indexing task is very memory intensive, it is very easy to run out of heap memory. There are many tweaks in how Spark operates, depending on whether you’re running a standalone cluster, YARN, or Mesos, that greatly affect how to mitigate this risk. There are also a few tweaks in how Druid handles memory pressure (though only at a pretty high level and in a “please let this work” kind of way). In general, though, Spark batch indexing seems to take up notably more heap memory than the Hadoop reducers.
  • Limited input format support. Right now the project only supports string input (no Druid segment input). It also only supports passing an explicit list of files instead of the multitude of file specifications supported by the Hadoop version (see the sketch after this list for one way to get Parquet data into a string-friendly form).
  • It can produce binary-identical output files compared to the Hadoop job. (This should be a no-duh statement, but it is a testament to the versatility of the basic indexing paradigm.)
  • Currently uses RDDs; another community member is investigating DataFrame support, but the first version will most likely just use RDDs.
  • Assembly… I had to package the whole lot as an assembly because of horrible library/version problems between Druid/Spark/Hadoop (Spark depends on Hadoop) and getting classloader craziness straightened out. And because of that…
  • It’s “big” (for a jar)… about 100 MB that needs to be distributed. If you’re running in a fine-grained mode where each task gets a copy of the jars for the job, you can fill up a drive much faster than you’d expect.
  • Limited version support - it requires some patches not yet in Druid stable (should be in 0.8.3) and some Spark patches not yet in Spark stable (should be in 1.6.0).

The project is open source but not yet public. If you would like access to it for testing/development purposes (aka an alpha version), let me know and I can get you access to what is out there.
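
Until richer input formats are supported, one possible workaround for the big Parquet files on GlusterFS mentioned earlier is to flatten them to newline-delimited JSON with a small Spark job and then point a plain index task (or the Spark indexer’s string input) at the output. A rough sketch, with all paths and the app name made up:

```python
# Sketch: flatten Parquet files into newline-delimited JSON that a plain
# "index" task (or any string-based ingestion) can read.
# Paths and the app name are hypothetical.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .setAppName("parquet-to-json-for-druid")
        # heap sizing matters here too; adjust for your cluster
        .set("spark.executor.memory", "4g"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Read the Parquet files from the GlusterFS mount (visible on every worker).
df = sqlContext.read.parquet("/mnt/glusterfs/parquet/events/")

# toJSON() gives an RDD of JSON strings, one row per line, which the
# "local" firehose of an index task can then pick up.
df.toJSON().saveAsTextFile("/mnt/glusterfs/druid-input/events-json")

sc.stop()
```

The trade-off is an extra pass over the data and an extra JSON copy on disk, but it keeps Hadoop out of the stack.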

Cheers,

Charles Allen