We need to ingest various metrics from a large number of different sensors (possibly hundreds of thousands of sensors fairly soon).
Some questions we would appreciate some help with:
- Would the best approach be to link each sensor to a separate datasource in Druid, or would it be better to use a single datasource and define the sensorID as a dimension field?
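To make the second option concrete, here is roughly what we have in mind for a single shared datasource with sensorID as a dimension. This is only a sketch; the datasource name, column names, and metric are placeholders, not our actual schema:

```json
{
  "dataSchema": {
    "dataSource": "sensor-metrics",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "timestamp", "format": "iso" },
        "dimensionsSpec": {
          "dimensions": ["sensorID", "metricName"]
        }
      }
    },
    "metricsSpec": [
      { "type": "doubleSum", "name": "value", "fieldName": "value" }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "DAY",
      "queryGranularity": "NONE"
    }
  }
}
```

With this layout every query would filter or group on the sensorID dimension, rather than each sensor getting its own set of segments.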
- Realtime vs batch ingestion, or something in between?
Our workload is also only semi-realtime: about 80% of the sensors' data is uploaded once or twice a day, i.e. within a 24-hour window, but about 20% of the sensors may only upload once every few days. So we find that:
a) if we use realtime nodes, the window period and segment granularity get too big if we want to support this scenario (so we are considering differently configured realtime nodes, some with smaller window periods and some with bigger ones);
b) if we use batch ingestion and later ingest data into the same datasource for an earlier time period that was already ingested, the existing data gets overwritten (the new segments replace the old ones for that interval, even though the dimension field values are different).
It seems we need something between realtime and batch ingestion. Is there an approach/config/pattern you can recommend?
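For point (b), one pattern we have been looking at is Druid's Hadoop batch "multi" inputSpec, which appears to union the already-indexed data for an interval with the newly arrived files, so the rebuilt segments contain both rather than the new data replacing the old. A rough sketch of the ioConfig (the datasource name, interval, and path are hypothetical placeholders):

```json
"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "multi",
    "children": [
      {
        "type": "dataSource",
        "ingestionSpec": {
          "dataSource": "sensor-metrics",
          "intervals": ["2015-01-01/2015-01-02"]
        }
      },
      {
        "type": "static",
        "paths": "/data/late-arriving/2015-01-01.json"
      }
    ]
  }
}
```

Is this kind of delta reindexing the recommended way to handle late-arriving sensor data, or is there a better pattern?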