I’ve been playing with Druid a fair amount over the last two days, and I’m very impressed by the scaling potential of the system. I’m doing a small proof-of-concept now with a single node, and I’ve seen sub-50ms performance on aggregate queries over a dataset of around 1 million entries. I know that at that size a traditional SQL database would still be competitive, but our production dataset is orders of magnitude larger.
In working with the system, I’ve gotten fairly familiar with the architecture, but I’m not entirely clear on how the indexing system actually works under the hood. I’m not referring to the overlord/middle manager/peon hierarchy; I’m more curious about what’s going on during the execution of an indexing task itself. I’m using the standard indexing task (i.e., not Hadoop), and I’ve seen indexing time grow non-linearly with progressively larger datasets. A set with 500k entries is indexed completely in around 3 minutes, but a set with 5 million takes hours to index.
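For reference, my task spec looks roughly like the sketch below. This is an illustrative, stripped-down native `index` task; the data source name, paths, and the `maxRowsInMemory` value are placeholders, not my actual settings:

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "events",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "timestamp", "format": "iso" },
          "dimensionsSpec": { "dimensions": ["dim1", "dim2"] }
        }
      },
      "granularitySpec": { "segmentGranularity": "DAY", "queryGranularity": "NONE" }
    },
    "ioConfig": {
      "type": "index",
      "firehose": { "type": "local", "baseDir": "/data", "filter": "*.json" }
    },
    "tuningConfig": {
      "type": "index",
      "maxRowsInMemory": 75000
    }
  }
}
```

Nothing exotic, as far as I can tell, which is why I’d like to understand what the task actually does with this spec internally.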
I know that’s probably not enough information to identify specific bottlenecks, but that’s not what I’m looking for from this post; rather, I’d like to understand how the indexer works so I can rule out any obvious issues myself. Could anyone explain, at a high level, what the index task is doing when it runs?