I was reading through the white paper (http://static.druid.io/docs/druid.pdf), and in the section on real-time nodes, the following is stated:
“A single ingestion endpoint also allows for data streams
to be partitioned such that multiple real-time nodes each ingest a
portion of a stream. This allows additional real-time nodes to be
seamlessly added. In practice, this model has allowed one of the
largest production Druid clusters to be able to consume raw data at
approximately 500 MB/s (150,000 events/s or 2 TB/hour)”.
I assume the partitioning happens via Kafka partitions? If so, how does that change the process of hand-off to deep storage? It seems there would need to be communication between real-time nodes that have each locally persisted partitions of the same data source (some kind of distributed merge). I was just wondering how that works, since the white paper doesn't describe it.
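To make my question concrete, here is a rough sketch (in Python, with entirely made-up names; `hand_off`, `deep_storage`, and the segment-id format are illustrative, not Druid's actual API) of the alternative I can imagine: each real-time node labels its segments with a partition number and pushes them to deep storage independently, so no cross-node merge would be needed. I'm asking whether hand-off works more like this, or whether a merge step really is required.

```python
from collections import defaultdict

# Stand-in for deep storage (S3/HDFS): (dataSource, interval) -> list of segment shards.
deep_storage = defaultdict(list)

def hand_off(data_source, interval, partition_num, rows):
    """Hypothetical: each node pushes only its own partition's segment,
    tagging it with a partition number, with no cross-node coordination."""
    segment_id = f"{data_source}_{interval}_partition{partition_num}"
    deep_storage[(data_source, interval)].append((segment_id, rows))

# Two real-time nodes, each having consumed one Kafka partition of the
# same data source for the same hour, hand off independently.
hand_off("events", "2014-01-01T00/PT1H", 0, ["row-a", "row-b"])
hand_off("events", "2014-01-01T00/PT1H", 1, ["row-c"])

# At query time, all shards of the interval would presumably be scanned together.
shards = deep_storage[("events", "2014-01-01T00/PT1H")]
all_rows = [row for _, rows in shards for row in rows]
```

Under this model the "merge" only happens at query time, across shards, rather than at hand-off time between nodes. Is that roughly what happens in practice?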