Use Multiple Datasources for different sensors?

Hi

We need to ingest various metrics from a lot of different sensors (possibly hundreds of thousands of sensors pretty soon).

Some questions we would appreciate some help with:

  1. Datasources

Would the best approach be to link each sensor to a separate datasource in Druid, or would it be better to use a single datasource and define the sensorID as a dimension field?
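To make the second option concrete, here is a minimal sketch of what a single-datasource schema with the sensor ID as a dimension might look like. It follows the general layout of Druid's JSON ingestion spec as I understand it, so please check the field names against the docs for your version. The datasource name `sensor_metrics`, the columns `sensorId`, `sensorType` and `value`, and the helper `build_data_schema` are all made up for illustration.

```python
import json


def build_data_schema(datasource, dimensions):
    """Build a dataSchema-style dict for one shared datasource.

    Assumption: the key names mirror Druid's JSON ingestion spec;
    verify them against the documentation for your Druid version.
    """
    return {
        "dataSource": datasource,
        "parser": {
            "type": "string",
            "parseSpec": {
                "format": "json",
                "timestampSpec": {"column": "timestamp", "format": "auto"},
                # The sensor ID is just another dimension, so every sensor
                # lands in the same datasource and can be filtered on.
                "dimensionsSpec": {"dimensions": list(dimensions)},
            },
        },
        "metricsSpec": [
            {"type": "count", "name": "count"},
            # "value" is a hypothetical input column holding the reading.
            {"type": "doubleSum", "name": "value_sum", "fieldName": "value"},
        ],
        "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "DAY",
            "queryGranularity": "NONE",
        },
    }


# Hypothetical names, for illustration only.
schema = build_data_schema("sensor_metrics", ["sensorId", "sensorType"])
print(json.dumps(schema, indent=2))
```

With this layout a query for one sensor would filter on the `sensorId` dimension instead of selecting a per-sensor datasource.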

  2. Realtime vs batch ingestion, or something in between?

Our ingestion is semi-realtime: about 80% of the sensors would upload their data once or twice a day, i.e. within a 24-hour window, but about 20% of the sensors might only upload once every few days. As a result we find that:

a) if we use realtime nodes, the window period and segment granularity get too big if we want to support this scenario (so we are thinking of running differently configured realtime nodes, some with smaller window periods and some with bigger ones)

b) if we use batch ingestion and later ingest data into the same datasource for an earlier time period that was already ingested, the existing data gets overwritten (i.e. the new data is seen as duplicates, even though the dimension field values are different)

It seems we need something between realtime and batch ingestion. Any approach/config/pattern you can recommend?
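To illustrate the behaviour in (b), here is a toy model in plain Python. It is not Druid code and makes no claim about how Druid implements anything; it only assumes that each day's data lives in one segment and that a batch run rebuilds whole segments. The function names and sample rows are made up.

```python
from collections import defaultdict


def group_by_day(rows):
    """Group rows by the date part (YYYY-MM-DD) of their ISO timestamp."""
    by_day = defaultdict(list)
    for row in rows:
        by_day[row["timestamp"][:10]].append(row)
    return by_day


def batch_overwrite(segments, new_rows):
    """Rebuild each affected day from the new rows only (old rows are lost)."""
    result = dict(segments)
    for day, rows in group_by_day(new_rows).items():
        result[day] = rows
    return result


def batch_append(segments, new_rows):
    """Rebuild each affected day from the existing rows plus the new rows."""
    result = dict(segments)
    for day, rows in group_by_day(new_rows).items():
        result[day] = segments.get(day, []) + rows
    return result


# Sensor A uploaded on time; sensor B's data for the same day arrives late.
existing = {
    "2015-06-01": [
        {"timestamp": "2015-06-01T08:00:00Z", "sensorId": "A", "value": 1.5}
    ]
}
late_rows = [{"timestamp": "2015-06-01T09:00:00Z", "sensorId": "B", "value": 2.0}]

overwritten = batch_overwrite(existing, late_rows)
appended = batch_append(existing, late_rows)

print([r["sensorId"] for r in overwritten["2015-06-01"]])
print([r["sensorId"] for r in appended["2015-06-01"]])
```

The first call keeps only sensor B for that day, which is the overwrite we are seeing; the second keeps both sensors, which is the behaviour we are after.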

Kind regards,

Herman

Hi,

Please see inline.

Hi

Thanks for the feedback. I saw the post where you discussed the “delta ingestion” approach and various alternative sharding specs etc. It sounds ideal for our use case.

What do you think are realistic timelines for the delta ingestion related features?

Hi,

A PR implementing delta ingestion is up for review (https://github.com/druid-io/druid/pull/1374). I think it might be available in 0.8.1, or maybe 0.8.2 in the worst case.

– Himanshu