Druid queries return stale data

Hi,

So I’ve got a Python script producing data for ingestion into a Kafka topic, and a Kafka console consumer echoing the data on screen.

I’ve got an ingestion spec for Druid which matches the data the script pumps out.

My Kafka producer shows the following:

{"st_id": 12, "store_id": 84, "part_sku": "M-960", "rep_id": 1, "qty": 1, "cust_id": 4945, "discount_amt": 20, "trxn_amt": 62.26, "trxn_time": "2019-04-09 06:25:23.437"}

My Kafka consumer echoes the following:

{"st_id": 12, "store_id": 84, "part_sku": "M-960", "rep_id": 1, "qty": 1, "cust_id": 4945, "discount_amt": 20, "trxn_amt": 62.26, "trxn_time": "2019-04-09 06:25:23.437"}

As you can see below, my scan query drops trxn_time from its output in favour of __time… not sure why.

My major concern is that when I hit the Druid datasource repeatedly, with either a scan or select query, I get the exact same data set returned to me over and over again.

The producer is running, and when I check the Druid coordinator UI I see lots of shards and segments, and all tasks show as successful in the overlord UI.

Under these conditions I would expect the results of the scan query to constantly be changing to the latest record in the db… instead it is giving me a record from last night.

My scan query looks like:

{
  "queryType": "scan",
  "dataSource": "d3",
  "resultFormat": "list",
  "columns": [],
  "intervals": ["2019-04-08/2020-04-10"],
  "batchSize": 20480,
  "limit": 1
}

Every time I run it I get the following result:

[ {
  "segmentId" : "d3_2019-04-08T18:30:00.000Z_2019-04-08T18:45:00.000Z_2019-04-08T22:55:08.848Z",
  "columns" : [ "__time", "store_id", "part_sku", "rep_id", "qty", "cust_id", "discount_amt", "trxn_amt" ],
  "events" : [ {
    "__time" : 1554748740000,
    "store_id" : 3,
    "part_sku" : "I-983",
    "rep_id" : 12,
    "qty" : 3,
    "cust_id" : 8901,
    "discount_amt" : 25,
    "trxn_amt" : 77
  } ]
} ]

The simplest explanation is that my query is incorrect, but I am at a loss as to how it needs to be modified.

Any ideas?

You have a limit of 1 without an order by time, so Druid will give you the first record it sees, in whatever order it happens to scan the segments.
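If your Druid version supports it, the scan query takes an "order" property that sorts results by __time. A sketch of your query with descending time order added (untested, reusing your datasource and interval), which should return the most recent row rather than an arbitrary one:

{
  "queryType": "scan",
  "dataSource": "d3",
  "resultFormat": "list",
  "columns": [],
  "intervals": ["2019-04-08/2020-04-10"],
  "order": "descending",
  "batchSize": 20480,
  "limit": 1
}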

I would suggest trying a timeseries query with a count aggregator on the number of rows present in Druid.

You should see the count being increased as the ingestion happens.
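A minimal sketch of such a timeseries query, reusing the "d3" datasource and interval from your scan query (the aggregator name "rows" is just an arbitrary label):

{
  "queryType": "timeseries",
  "dataSource": "d3",
  "granularity": "all",
  "aggregations": [ { "type": "count", "name": "rows" } ],
  "intervals": ["2019-04-08/2020-04-10"]
}

Run it repeatedly while the producer is going; if ingestion is working, the "rows" count should keep climbing.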