Loading pre-aggregated data using the batch indexer

Hi,

I wonder whether the following scheme would work:

  1. Load raw data in real-time, using the following configuration:

```
"dimensionsSpec" : {
  "dimensions" : ["dim1", "dim2", "dim3"]
}

"metricsSpec" : [{ "type" : "count", "name" : "count" }]
```

The raw data contains the following fields:

```
timestamp,dim1,dim2,dim3
```

  2. Then, load pre-aggregated data via the batch indexer into the same data source with the following configuration:

```
"dimensionsSpec" : {
  "dimensions" : ["dim1", "dim2", "dim3"]
}

"metricsSpec" : [{ "type" : "longSum", "name" : "count", "fieldName" : "count" }]
```

The pre-aggregated data in this case consists of:

```
timestamp(truncated),dim1,dim2,dim3,count
```

Do you see anything wrong with such an approach? Will the segments created by real-time ingestion be replaced by the batch indexing process, given that the data source name remains the same?

PS: The whole purpose of loading pre-aggregated data is to speed up the Hadoop indexing process.
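The speed-up comes from row reduction: grouping raw events by truncated timestamp and dimension values before handing them to the Hadoop indexer. A minimal sketch of that pre-aggregation step (the sample events, dimension values, and hour-level truncation are all hypothetical, not from the original post):

```python
from collections import Counter

# Hypothetical raw events: (timestamp, dim1, dim2, dim3).
raw_events = [
    ("2015-06-01T10:05:00", "a", "x", "p"),
    ("2015-06-01T10:20:00", "a", "x", "p"),
    ("2015-06-01T10:45:00", "b", "y", "q"),
]

def truncate_to_hour(ts):
    # Truncate an ISO-8601 timestamp to hour granularity
    # (standing in for whatever query granularity is configured).
    return ts[:13] + ":00:00"

# Pre-aggregate: one output row per (truncated timestamp, dim1, dim2, dim3),
# with an explicit "count" column — the shape fed to the batch indexer.
pre_aggregated = Counter(
    (truncate_to_hour(ts), d1, d2, d3) for ts, d1, d2, d3 in raw_events
)

for (ts, d1, d2, d3), count in sorted(pre_aggregated.items()):
    print(ts, d1, d2, d3, count)
```

Here three raw rows collapse into two pre-aggregated rows; on real data with many events per dimension combination, the reduction is what makes batch indexing cheaper.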

Thanks!

You can use the exact same dataSchema for realtime and batch. Druid segments use MVCC, so any segments created by batch ingestion after realtime processing is done will automatically replace the segments generated by realtime ingestion.

Hello Fangjin,

What I meant is that the schema differs slightly between batch and real-time. In the batch case the data contains a "count" field and a metric that sums over that field (to produce the total count of events), while in the real-time case the data is raw, and the count is produced by the special "count" aggregator type (see my original question for more detail).
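The equivalence of the two metric specs can be sketched as follows (the sample rows are hypothetical; the point is that a per-row count over raw data and a sum over a pre-computed count column roll up to the same totals):

```python
from collections import defaultdict

# Raw rows, as seen by realtime ingestion (hypothetical sample data).
raw = [("a", "x"), ("a", "x"), ("b", "y")]

# Realtime side: a "count" aggregator adds 1 for each input row.
count_agg = defaultdict(int)
for dims in raw:
    count_agg[dims] += 1

# Batch side: the same data arrives pre-aggregated with an explicit
# count column, and a "longSum" aggregator sums that column instead.
pre_aggregated = [("a", "x", 2), ("b", "y", 1)]
longsum_agg = defaultdict(int)
for d1, d2, count in pre_aggregated:
    longsum_agg[(d1, d2)] += count

# Both roll-ups produce identical totals per dimension combination.
assert count_agg == longsum_agg
print(dict(count_agg))
```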

Can this work?

Thanks!

Yes.

Druid segments are all versioned, and Druid uses MVCC (https://en.wikipedia.org/wiki/Multiversion_concurrency_control).