Question about indexing process

I’ve been playing with Druid a fair amount over the last two days, and I’m very impressed by the scaling potential of the system. I’m doing a small proof-of-concept now with a single node, and I’ve seen sub-50ms performance on aggregate queries with a dataset of around 1 million entries. I know that dataset size is still in the competitive range of SQL-based databases, but our production dataset is orders of magnitude larger.

In working with the system, I’ve gotten fairly familiar with the architecture, but I’m not entirely clear on how the indexing system actually works under the hood. I’m not referring to the overlord-manager-peon bit; I’m more curious about what’s going on during an indexing task execution. I’m using the standard indexing task (i.e., not Hadoop), and I’ve seen indexing time during a task grow non-linearly with progressively larger datasets. A set with 500k entries is indexed completely in around 3 minutes, but a set with 5 million takes hours to index.

I know that’s probably not enough information to identify potential bottlenecks, but that’s not what I’m looking for from this post; rather, I’d like to understand more about how the indexer works so I can rule out any obvious issues. Could anyone explain at a high level what the index task is doing when it runs?

Hi Taylor,

The indexing task is really inefficient with larger datasets and is generally not recommended for production. The main inefficiency is that it first figures out how many segments it wants to create, then re-scans all your data for each segment, with a different filter each time. The Hadoop indexer uses Hadoop’s shuffle to avoid that excessive re-scanning, and is generally going to be a lot faster.

Btw, just to be more clear: the thing that is not very efficient is the “index” task. The “index_hadoop” task uses Hadoop and is generally a better choice if you have a Hadoop cluster available.
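To illustrate why that re-scanning makes cost grow non-linearly, here is a minimal sketch (not actual Druid code; all names and the row layout are made-up assumptions) of a task that re-reads the entire input once per target interval:

```python
def naive_index(rows, intervals):
    """Build one 'segment' per interval by re-scanning the full dataset
    each time, the way the plain "index" task does."""
    segments = {}
    rows_read = 0
    for lo, hi in intervals:
        # Full pass over ALL input rows, keeping only those in this interval.
        segment = [r for r in rows if lo <= r["timestamp"] < hi]
        rows_read += len(rows)
        segments[(lo, hi)] = segment
    return segments, rows_read

rows = [{"timestamp": t} for t in range(100)]
intervals = [(0, 25), (25, 50), (50, 75), (75, 100)]
segments, rows_read = naive_index(rows, intervals)
print(rows_read)  # 400: 100 input rows scanned once per each of 4 intervals
```

With S segments this reads the input S times, so total work is roughly O(S × N) rather than O(N), which matches the hours-for-5-million behavior described above. A shuffle-based indexer instead partitions the data in a single pass.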

Ah, that makes sense. So when I see lines like this in the log:

2015-04-15T22:00:02,733 INFO [task-runner-0] io.druid.indexing.common.task.IndexTask - Determining partitions for interval[2015-04-15T00:00:00.000Z/2015-04-15T01:00:00.000Z] with targetPartitionSize[5000000]
2015-04-15T22:00:02,733 INFO [task-runner-0] io.druid.segment.realtime.firehose.LocalFirehoseFactory - Searching for all [*.json] in [/tmp/output]
2015-04-15T22:00:41,191 INFO [task-runner-0] io.druid.indexing.common.task.IndexTask - Estimated approximately [37,226.533930] rows of data.

it’s actually looking through all data in the provided set for data in the interval it’s currently on?

Yeah, it is.

Cool. I have one other question that may or may not be related. What does Druid actually store in those segments? Are all of the records that were indexed saved, or does Druid only persist key metrics (cardinality, count, etc.) to its persistent store?

One addendum to that question: if it is keeping all of the data somehow, is it possible to retrieve the individual records for each segment? I’m only curious about how this data is stored, I don’t actually intend to use the system to start pulling out individual records like that.

Hi Taylor, if you are interested in individual Druid records, you can try out the experimental select query:

Druid indexes are a column-oriented form of what this blog post calls “beta” data: it’s the table you would get if you grouped all your raw records by their dimensions and aggregated each set of records along all metrics.