I have a Druid job that ingests from Kinesis. The ingestion task duration is 1 hour and the query granularity is 1 minute. Each record has a couple of dimensions and a single boolean field (represented as 1/0). Some of the records in the stream are duplicates (i.e., they have the same dimension values and timestamp). These duplicates may appear anywhere from 45 to 60 minutes apart, and they transition the boolean field from 1 to 0. Because the duplicates can arrive so late, it appears that they can end up in different segments. At query time I deal with this using a nested query of the form SELECT … FROM (SELECT MIN(field), dim1, dim2, dim3 FROM … GROUP BY dim1, dim2, dim3…)
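To make the intent of that inner query concrete, here is a minimal Python sketch of the same MIN-per-key deduplication over in-memory rows. The row layout, the sample values, and the `dedup_min` helper are hypothetical and only for illustration; this is not how Druid executes the query.

```python
# Hypothetical rows: (minute_timestamp, dim1, dim2, dim3, field).
rows = [
    ("2024-01-01T00:00", "a", "x", "p", 1),
    ("2024-01-01T00:01", "b", "y", "q", 1),
    ("2024-01-01T00:00", "a", "x", "p", 0),  # late duplicate, flips 1 -> 0
]


def dedup_min(rows):
    """Keep the minimum field value per (timestamp, dim1, dim2, dim3) key."""
    result = {}
    for ts, d1, d2, d3, field in rows:
        key = (ts, d1, d2, d3)
        # If the key was already seen, keep the smaller value (0 wins over 1).
        result[key] = min(field, result.get(key, field))
    return result


deduped = dedup_min(rows)
```

Since the duplicate only ever moves the field from 1 to 0, taking the minimum per key always yields the latest state regardless of which segment each copy landed in.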
However, this GROUP BY appears to be pretty expensive (is the map dim1 * dim2 * dim3 * (endTime - startTime) entries big?), and it has a noticeable effect on query time. Any suggestions on how to deal with this kind of situation?
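For a rough sense of the sizes involved, this is the back-of-the-envelope estimate I have in mind, written as a small Python sketch. It assumes the grouping key effectively includes the timestamp truncated to the 1-minute query granularity; the function name and the example cardinalities are made up for illustration.

```python
def max_groups(card_dim1, card_dim2, card_dim3, interval_minutes):
    """Upper bound on distinct group keys for the inner GROUP BY.

    Assumes one group per (minute, dim1, dim2, dim3) combination. The real
    count is at most the number of distinct combinations actually present
    in the data, which may be far smaller than this bound.
    """
    return card_dim1 * card_dim2 * card_dim3 * interval_minutes


# Hypothetical cardinalities over a one-hour query interval.
estimate = max_groups(card_dim1=10, card_dim2=5, card_dim3=4, interval_minutes=60)
```

With these made-up numbers the bound is 12,000 keys for a single hour, and it grows linearly with the length of the query interval.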