Invoking druid-spark-batch

Hello all,

Since I’ve hit a wall running the Druid Hadoop indexing service (our cloud provider limits the size of the intermediate objects that get uploaded to HDFS during the indexing run), I was looking for other ways of doing this. druid-spark-batch seems like a promising approach, except for one minor thing: I can’t figure out how to run it.

The project’s README talks about the various things needed but does not have any example invocations. I’ve built the jar (for spark-1.6.0 and druid-0.9.0) and put it under the extensions dir, where `io.druid.cli.Main` can find it, but I don’t know how to invoke it. Does it supplant the `index hadoop` job, or should I provide a different name on the command line?

Thanks,
Vassilis

On a tangential note,

I’d like to ask if it’s sensible/feasible to connect spark and druid using tranquility, but for a batch job.
I’m wondering because I have not seen any examples of this use case in the wild or on this list.

Thanks,
Vassilis

Hi Vassilis!

Thanks for the interest in the spark batch indexer. I recommend using the `pull-deps` command with the options `-h org.apache.spark:spark-core_2.10:1.5.2-mmx4` and `-c io.druid.extensions:druid-spark-batch_2.10:0.9.0-0` (with the appropriate versions, of course), in addition to any other extensions you may want.
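For anyone following along, a sketch of what that `pull-deps` invocation might look like (the classpath and version numbers here are illustrative assumptions; substitute your own build, as noted above):

```shell
# Fetch the Spark dependency (-h) and the druid-spark-batch extension (-c)
# into the local extensions/hadoop-dependencies directories.
# Paths and versions are placeholders -- adjust to your deployment.
java -cp "druid-0.9.0/lib/*" io.druid.cli.Main tools pull-deps \
  -h org.apache.spark:spark-core_2.10:1.5.2-mmx4 \
  -c io.druid.extensions:druid-spark-batch_2.10:0.9.0-0
```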

This should set up your extension directories. Then you have to make sure the extension `druid-spark-batch_2.10` is in the extension loadList.
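The loadList lives in the common runtime properties; a minimal fragment, assuming the extension directory is named after the artifact, might look like:

```
# common.runtime.properties (directory name is an assumption --
# it should match the folder pull-deps created under extensions/)
druid.extensions.loadList=["druid-spark-batch_2.10"]
```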

Let me know if that helps,

Charles Allen

Hey Vassilis,

Tranquility’s windowPeriod concept makes most batch loads unworkable (unless you are loading really recent micro-batches, as with Spark Streaming).

Hey Charles, thanks for the prompt response.

On Wednesday, May 4, 2016 at 3:12:08 AM UTC+3, charles.allen wrote:

Hi Vassilis!

Thanks for the interest in the spark batch indexer. I recommend using the `pull-deps` command with the options `-h org.apache.spark:spark-core_2.10:1.5.2-mmx4` and `-c io.druid.extensions:druid-spark-batch_2.10:0.9.0-0` (with the appropriate versions, of course), in addition to any other extensions you may want.

This should set up your extension directories. Then you have to make sure the extension `druid-spark-batch_2.10` is in the extension loadList.

I’ve set up the extension loadList as suggested, but as I said, I don’t have a clue how to invoke the indexer. Running the default Hadoop indexer is as simple as:

```shell
java -DotherArgs -cp druid-0.9.0:druid-0.9.0/lib/*:$HADOOP_CONF_DIR io.druid.cli.Main index hadoop --no-default-hadoop $daily.spec
```

What’s the appropriate invocation for the spark indexer?

Vassilis

Hi Gian,

On Wednesday, May 4, 2016 at 3:23:39 AM UTC+3, Gian Merlino wrote:

Hey Vassilis,

Tranquility’s windowPeriod concept makes most batch loads unworkable (unless you are loading really recent micro-batches, as with Spark Streaming).

Thanks for clearing up that bit.

Vassilis

The spark indexer only works as an indexing service task, so you have to have the overlord set up (and probably a middle manager).

The spark indexer extension should be on the load list for the overlord (so it can parse the task) and for the middle managers (so they can run the spark driver).
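In other words, instead of invoking `io.druid.cli.Main` directly, you submit a task spec to the overlord’s task endpoint. A sketch, assuming the overlord runs on its default port 8090 and that `spark_index_task.json` is a spec written against the druid-spark-batch README’s task schema (the file name and the spec contents are assumptions; check the README for the exact `type` and fields):

```shell
# Submit the spark batch indexing task to the overlord,
# which will hand it to a middle manager to run.
curl -X POST -H 'Content-Type: application/json' \
  -d @spark_index_task.json \
  http://overlord-host:8090/druid/indexer/v1/task
```

The overlord responds with a task id, which you can then follow in the overlord console or task logs.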

I see now. I’ll have to try deploying this setup and will let you know.

Thanks for helping out,
Vassilis