Druid batch indexing

Hi there,

Following up on the question on StackOverflow:

A noob question: what are the trade-offs of running Druid batch indexing on an external cluster vs. the Druid cluster itself (apart from the obvious ACL concerns)? Is there any documentation that better describes the internals of what happens during batch indexing? Also, just wondering: is it possible to build indexes offline (similar to EmbeddedSolrServer in Solr) and point to an S3 location for deep storage?

Hi,
Do you mean the difference between the Hadoop index task and the index task? This document might be helpful: http://druid.io/docs/latest/ingestion/tasks.html

The main difference is that the Hadoop index task works in a distributed manner, while the index task is a single-process job. We are currently improving the index task.
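To make the distinction concrete, here is a minimal sketch of a Hadoop index task spec; the data source name, columns, interval, and input path are illustrative placeholders, not taken from this thread:

  {
    "type": "index_hadoop",
    "spec": {
      "dataSchema": {
        "dataSource": "example",
        "parser": {
          "type": "hadoopyString",
          "parseSpec": {
            "format": "json",
            "timestampSpec": {"column": "timestamp", "format": "auto"},
            "dimensionsSpec": {"dimensions": ["page", "user"]}
          }
        },
        "metricsSpec": [{"type": "count", "name": "count"}],
        "granularitySpec": {
          "type": "uniform",
          "segmentGranularity": "DAY",
          "queryGranularity": "NONE",
          "intervals": ["2017-06-01/2017-06-02"]
        }
      },
      "ioConfig": {
        "type": "hadoop",
        "inputSpec": {"type": "static", "paths": "hdfs://namenode/path/to/data.json"}
      },
      "tuningConfig": {"type": "hadoop"}
    }
  }

The single-process index task uses "type": "index" with a firehose-based ioConfig instead (see the S3 sketch further down); the dataSchema is largely the same.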

Also, just wondering: is it possible to build indexes offline (similar to EmbeddedSolrServer in Solr) and point to an S3 location for deep storage?

I’m not familiar with what EmbeddedSolrServer is, but yes, it’s possible. You can set your deep storage to S3 (http://druid.io/docs/latest/development/extensions-core/s3.html) and use the Hadoop index task or the index task for offline indexing (http://druid.io/docs/latest/ingestion/tasks.html). If your data is in S3, you can use the index task with the StaticS3Firehose, or the Hadoop index task on EMR.
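For reference, a minimal sketch of the S3 deep storage setup in common.runtime.properties; the bucket name and credentials are placeholders:

  druid.extensions.loadList=["druid-s3-extensions"]

  druid.storage.type=s3
  druid.storage.bucket=your-bucket
  druid.storage.baseKey=druid/segments
  druid.s3.accessKey=YOUR_ACCESS_KEY
  druid.s3.secretKey=YOUR_SECRET_KEY

And a sketch of the index task's ioConfig using the StaticS3Firehose to read input directly from S3 (again, the URI is a placeholder):

  "ioConfig": {
    "type": "index",
    "firehose": {
      "type": "static-s3",
      "uris": ["s3://your-bucket/path/to/data.json"]
    }
  }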

Jihoon

On Fri, Jun 2, 2017 at 8:07 AM, Ananth Durai vananth22@gmail.com wrote:

Thanks, Jihoon. The explanation is super helpful. I have one follow-up question. Please correct me if I’m wrong; my understanding after reading the code is:

Each HadoopIndexTask publishes segments with a RemoteTaskActionClient, which calls the Overlord to index the tasks. The Overlord will again call the middle manager, just like the real-time indexing path. If that’s correct, won’t it create a skewed load on the middle manager peons?

The background of my question is: I have S3 as my deep storage. Batch indexing in most cases is either a backfill or a deduplication task that doesn’t need to be served from real-time nodes. I’m wondering whether there is any way to bypass the whole indexing process and publish the segments directly to S3, deleting the old segments. This would help create a predictable load on the middle manager nodes. (I don’t know how the index is partitioned and stored on S3, or how the broker refreshes the metadata internally, so the whole concept is a little blurry to me.)

Hi Ananth,

I’m not sure why you want to bypass the indexing process, but Druid’s segments are not just raw data. They contain data stored in Druid’s own columnar format, along with dictionaries and bitmap indexes. This (http://druid.io/docs/latest/design/segments.html) will help you understand the Druid segment format.
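To make that concrete, here is a rough sketch of what Druid builds for a single string column, following the "page" example in the segments documentation linked above:

  rows:          ["Justin Bieber", "Justin Bieber", "Ke$ha", "Ke$ha"]

  dictionary:    {"Justin Bieber": 0, "Ke$ha": 1}
  column data:   [0, 0, 1, 1]
  bitmap indexes:
    "Justin Bieber" -> [1, 1, 0, 0]
    "Ke$ha"         -> [0, 0, 1, 1]

None of this structure exists in the raw input files, which is why publishing raw data directly to S3 would not produce queryable segments.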

If you’re concerned about load balancing on the peons, you can set the worker select strategy in the overlord dynamic configuration (http://druid.io/docs/latest/configuration/indexing-service.html). There are several options available, such as equalDistribution.
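For example, the worker select strategy is part of the overlord dynamic configuration, which you can POST to the overlord at /druid/indexer/v1/worker; a minimal sketch:

  {
    "selectStrategy": {
      "type": "equalDistribution"
    }
  }

With equalDistribution, the overlord assigns tasks so that the running tasks are spread roughly evenly across the available middle managers.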

Jihoon

On Tue, Jun 6, 2017 at 7:43 AM, Ananth Durai vananth22@gmail.com wrote: