Select Query Pagination Sending Back Duplicates

I am trying to use the select query to paginate results. I noticed that when I paginate for more than 2 pages, I’m getting back duplicate entries. Here are a few scenarios where the threshold is 20:

35 total results

First Request

Sends back 20 results with the following pagingIdentifiers:

events_2015-10-05T17:05:00.000Z_2015-10-05T17:10:00.000Z_2015-10-05T17:05:00.000Z=0

events_2015-10-06T15:45:00.000Z_2015-10-06T15:50:00.000Z_2015-10-06T15:45:00.000Z=0

events_2015-10-07T19:15:00.000Z_2015-10-07T19:20:00.000Z_2015-10-07T19:15:00.000Z=0

events_2015-10-07T21:25:00.000Z_2015-10-07T21:30:00.000Z_2015-10-07T21:25:00.000Z=0

events_2015-10-08T18:20:00.000Z_2015-10-08T18:25:00.000Z_2015-10-08T18:20:00.000Z=0

events_2015-10-09T15:25:00.000Z_2015-10-09T15:30:00.000Z_2015-10-09T15:25:00.000Z=0

events_2015-10-09T15:30:00.000Z_2015-10-09T15:35:00.000Z_2015-10-09T15:30:00.000Z=0

events_2015-10-09T15:50:00.000Z_2015-10-09T15:55:00.000Z_2015-10-09T15:50:00.000Z=0

events_2015-10-09T16:20:00.000Z_2015-10-09T16:25:00.000Z_2015-10-09T16:20:00.000Z=0

events_2015-10-09T21:35:00.000Z_2015-10-09T21:40:00.000Z_2015-10-09T21:35:00.000Z=0

events_2015-10-09T21:40:00.000Z_2015-10-09T21:45:00.000Z_2015-10-09T21:40:00.000Z=0

events_2015-10-09T22:50:00.000Z_2015-10-09T22:55:00.000Z_2015-10-09T22:50:00.000Z=0

events_2015-10-10T01:20:00.000Z_2015-10-10T01:25:00.000Z_2015-10-10T01:20:00.000Z=0

events_2015-10-10T04:05:00.000Z_2015-10-10T04:10:00.000Z_2015-10-10T04:05:00.000Z=0

events_2015-10-12T04:30:00.000Z_2015-10-12T04:35:00.000Z_2015-10-12T04:30:00.000Z=0

events_2015-10-13T15:05:00.000Z_2015-10-13T15:10:00.000Z_2015-10-13T15:05:00.000Z=0

events_2015-10-13T20:00:00.000Z_2015-10-13T20:05:00.000Z_2015-10-13T20:00:00.000Z=0

events_2015-10-16T05:55:00.000Z_2015-10-16T06:00:00.000Z_2015-10-16T05:55:00.000Z=0

events_2015-10-16T15:20:00.000Z_2015-10-16T15:25:00.000Z_2015-10-16T15:20:00.000Z=0

events_2015-10-16T15:25:00.000Z_2015-10-16T15:30:00.000Z_2015-10-16T15:25:00.000Z=0

Second Request

Using the following pagingIdentifiers:

events_2015-10-05T17:05:00.000Z_2015-10-05T17:10:00.000Z_2015-10-05T17:05:00.000Z=1

events_2015-10-06T15:45:00.000Z_2015-10-06T15:50:00.000Z_2015-10-06T15:45:00.000Z=1

events_2015-10-07T19:15:00.000Z_2015-10-07T19:20:00.000Z_2015-10-07T19:15:00.000Z=1

events_2015-10-07T21:25:00.000Z_2015-10-07T21:30:00.000Z_2015-10-07T21:25:00.000Z=1

events_2015-10-08T18:20:00.000Z_2015-10-08T18:25:00.000Z_2015-10-08T18:20:00.000Z=1

events_2015-10-09T15:25:00.000Z_2015-10-09T15:30:00.000Z_2015-10-09T15:25:00.000Z=1

events_2015-10-09T15:30:00.000Z_2015-10-09T15:35:00.000Z_2015-10-09T15:30:00.000Z=1

events_2015-10-09T15:50:00.000Z_2015-10-09T15:55:00.000Z_2015-10-09T15:50:00.000Z=1

events_2015-10-09T16:20:00.000Z_2015-10-09T16:25:00.000Z_2015-10-09T16:20:00.000Z=1

events_2015-10-09T21:35:00.000Z_2015-10-09T21:40:00.000Z_2015-10-09T21:35:00.000Z=1

events_2015-10-09T21:40:00.000Z_2015-10-09T21:45:00.000Z_2015-10-09T21:40:00.000Z=1

events_2015-10-09T22:50:00.000Z_2015-10-09T22:55:00.000Z_2015-10-09T22:50:00.000Z=1

events_2015-10-10T01:20:00.000Z_2015-10-10T01:25:00.000Z_2015-10-10T01:20:00.000Z=1

events_2015-10-10T04:05:00.000Z_2015-10-10T04:10:00.000Z_2015-10-10T04:05:00.000Z=1

events_2015-10-12T04:30:00.000Z_2015-10-12T04:35:00.000Z_2015-10-12T04:30:00.000Z=1

events_2015-10-13T15:05:00.000Z_2015-10-13T15:10:00.000Z_2015-10-13T15:05:00.000Z=1

events_2015-10-13T20:00:00.000Z_2015-10-13T20:05:00.000Z_2015-10-13T20:00:00.000Z=1

events_2015-10-16T05:55:00.000Z_2015-10-16T06:00:00.000Z_2015-10-16T05:55:00.000Z=1

events_2015-10-16T15:20:00.000Z_2015-10-16T15:25:00.000Z_2015-10-16T15:20:00.000Z=1

events_2015-10-16T15:25:00.000Z_2015-10-16T15:30:00.000Z_2015-10-16T15:25:00.000Z=1

Sends back the remaining 15 results and everything is working as expected.

74 Total Results

First request

Sends back 20 results with the following pagingIdentifiers:

events_2015-10-05T17:55:00.000Z_2015-10-05T18:00:00.000Z_2015-10-05T17:55:00.000Z=0

events_2015-10-05T20:40:00.000Z_2015-10-05T20:45:00.000Z_2015-10-05T20:40:00.000Z=0

events_2015-10-06T03:45:00.000Z_2015-10-06T03:50:00.000Z_2015-10-06T03:45:00.000Z=0

events_2015-10-06T03:50:00.000Z_2015-10-06T03:55:00.000Z_2015-10-06T03:50:00.000Z=0

events_2015-10-06T06:05:00.000Z_2015-10-06T06:10:00.000Z_2015-10-06T06:05:00.000Z=2

events_2015-10-06T15:05:00.000Z_2015-10-06T15:10:00.000Z_2015-10-06T15:05:00.000Z=0

events_2015-10-08T13:05:00.000Z_2015-10-08T13:10:00.000Z_2015-10-08T13:05:00.000Z=0

events_2015-10-09T15:25:00.000Z_2015-10-09T15:30:00.000Z_2015-10-09T15:25:00.000Z=10

Second request

Using the following pagingIdentifiers:

events_2015-10-05T17:55:00.000Z_2015-10-05T18:00:00.000Z_2015-10-05T17:55:00.000Z=1

events_2015-10-05T20:40:00.000Z_2015-10-05T20:45:00.000Z_2015-10-05T20:40:00.000Z=1

events_2015-10-06T03:45:00.000Z_2015-10-06T03:50:00.000Z_2015-10-06T03:45:00.000Z=1

events_2015-10-06T03:50:00.000Z_2015-10-06T03:55:00.000Z_2015-10-06T03:50:00.000Z=1

events_2015-10-06T06:05:00.000Z_2015-10-06T06:10:00.000Z_2015-10-06T06:05:00.000Z=3

events_2015-10-06T15:05:00.000Z_2015-10-06T15:10:00.000Z_2015-10-06T15:05:00.000Z=1

events_2015-10-08T13:05:00.000Z_2015-10-08T13:10:00.000Z_2015-10-08T13:05:00.000Z=1

events_2015-10-09T15:25:00.000Z_2015-10-09T15:30:00.000Z_2015-10-09T15:25:00.000Z=11

Sends back 20 results with the following pagingIdentifiers:

events_2015-10-09T15:55:00.000Z_2015-10-09T16:00:00.000Z_2015-10-09T15:55:00.000Z=1

events_2015-10-09T16:20:00.000Z_2015-10-09T16:25:00.000Z_2015-10-09T16:20:00.000Z=0

events_2015-10-09T21:30:00.000Z_2015-10-09T21:35:00.000Z_2015-10-09T21:30:00.000Z=1

events_2015-10-09T21:40:00.000Z_2015-10-09T21:45:00.000Z_2015-10-09T21:40:00.000Z=0

events_2015-10-09T23:55:00.000Z_2015-10-10T00:00:00.000Z_2015-10-09T23:55:00.000Z=0

events_2015-10-10T03:00:00.000Z_2015-10-10T03:05:00.000Z_2015-10-10T03:00:00.000Z=0

events_2015-10-10T04:05:00.000Z_2015-10-10T04:10:00.000Z_2015-10-10T04:05:00.000Z=4

events_2015-10-11T01:20:00.000Z_2015-10-11T01:25:00.000Z_2015-10-11T01:20:00.000Z=3

events_2015-10-11T01:25:00.000Z_2015-10-11T01:30:00.000Z_2015-10-11T01:25:00.000Z=0

events_2015-10-11T15:45:00.000Z_2015-10-11T15:50:00.000Z_2015-10-11T15:45:00.000Z=0

events_2015-10-11T17:50:00.000Z_2015-10-11T17:55:00.000Z_2015-10-11T17:50:00.000Z=0

Third request

Using the following pagingIdentifiers:

events_2015-10-09T15:55:00.000Z_2015-10-09T16:00:00.000Z_2015-10-09T15:55:00.000Z=2

events_2015-10-09T16:20:00.000Z_2015-10-09T16:25:00.000Z_2015-10-09T16:20:00.000Z=1

events_2015-10-09T21:30:00.000Z_2015-10-09T21:35:00.000Z_2015-10-09T21:30:00.000Z=2

events_2015-10-09T21:40:00.000Z_2015-10-09T21:45:00.000Z_2015-10-09T21:40:00.000Z=1

events_2015-10-09T23:55:00.000Z_2015-10-10T00:00:00.000Z_2015-10-09T23:55:00.000Z=1

events_2015-10-10T03:00:00.000Z_2015-10-10T03:05:00.000Z_2015-10-10T03:00:00.000Z=1

events_2015-10-10T04:05:00.000Z_2015-10-10T04:10:00.000Z_2015-10-10T04:05:00.000Z=5

events_2015-10-11T01:20:00.000Z_2015-10-11T01:25:00.000Z_2015-10-11T01:20:00.000Z=4

events_2015-10-11T01:25:00.000Z_2015-10-11T01:30:00.000Z_2015-10-11T01:25:00.000Z=1

events_2015-10-11T15:45:00.000Z_2015-10-11T15:50:00.000Z_2015-10-11T15:45:00.000Z=1

events_2015-10-11T17:50:00.000Z_2015-10-11T17:55:00.000Z_2015-10-11T17:50:00.000Z=1

Sends back 20 results, but some of them are duplicates.

If I keep paginating past this point, the same results occur, i.e. I get back duplicate entries. After multiple requests, I stop getting unique results. If I set a larger threshold (for example 100 or 1000), no duplicate results occur.

When looking at the docs:

http://druid.io/docs/latest/development/select-query.html

The last code block states to increment the offset by 1. However, when there is more than one pagingIdentifier, does that mean every pagingIdentifier needs to be incremented within each subsequent request until no results are found? Also, how should this be treated on each subsequent request?

Unfortunately, I can’t show the actual results because this work is behind a firewall. Let me know if any other information is needed or if anything needs further clarification.

  • Geoff

Hi Geoff, we should improve this API, but right now, for every pagingIdentifier you get back, you have to increase its value by 1 and send the query again. With 0.9.0 coming out soon, this may be a good chance to increase the paging identifier by 1 right now…

Ohhh, I see what you mean. So for each pagingIdentifier you get back, you need to append it to the existing pagingIdentifiers and then continue to increment each. So something like this:

request 1

sent pagingIdentifiers

none

returned pagingIdentifiers

foo=1

bar=1

request 2

sent pagingIdentifiers

foo=2

bar=2

returned pagingIdentifiers

baz=1

bat=1

request 3

sent pagingIdentifiers

foo=3

bar=3

baz=2

bat=2

returned pagingIdentifiers

duh=1

meh=1

and so forth. Now that I think of it, that makes more sense. I saw you posted some additional requirements for the select query (getting #1 would be awesome), but what is here definitely works. Thanks again for your help!

  • Geoff
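For anyone following along, the bookkeeping Geoff describes above could be sketched roughly like this in Python. The function name `next_paging_identifiers` and the plain-dict representation are assumptions made for illustration; they are not part of Druid or any client library.

```python
def next_paging_identifiers(previously_sent, returned):
    """Hypothetical helper: merge the identifiers sent last time with the
    ones just returned, then bump every offset by 1."""
    merged = dict(previously_sent)
    merged.update(returned)  # newly returned offsets take precedence
    return {segment: offset + 1 for segment, offset in merged.items()}


# Replaying the foo/bar/baz/bat example from the post above.
sent_for_request_2 = next_paging_identifiers({}, {"foo": 1, "bar": 1})
sent_for_request_3 = next_paging_identifiers(
    sent_for_request_2, {"baz": 1, "bat": 1}
)
print(sent_for_request_2)
print(sent_for_request_3)
```

This only mirrors the scheme as written in the post; whether identifiers from earlier pages still need to be carried forward is not something the thread confirms.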

Yeah, the idea is that there is some work that needs to be done on the client side, which should probably be improved.

You use the pagingIdentifiers you get back, increment the values you get back by 1, and send the new pagingIdentifiers with the next request.
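A minimal Python sketch of that client-side step, in case it helps. It assumes the query carries a `pagingSpec` with `pagingIdentifiers` and `threshold`, and that the response is a list whose first element holds `result.pagingIdentifiers`, as in the select query docs linked earlier. The helper names `increment_paging_identifiers` and `build_next_query` are made up for this example.

```python
def increment_paging_identifiers(returned):
    """Add 1 to every offset returned by the previous select query."""
    return {segment: offset + 1 for segment, offset in returned.items()}


def build_next_query(query, response):
    """Hypothetical helper: return a copy of `query` prepared for the next
    page, or None when the response is empty (assumed response shape)."""
    if not response:
        return None
    returned = response[0]["result"]["pagingIdentifiers"]
    next_query = dict(query)
    next_query["pagingSpec"] = {
        "pagingIdentifiers": increment_paging_identifiers(returned),
        "threshold": query["pagingSpec"]["threshold"],
    }
    return next_query


# Two identifiers taken from the 74-result scenario in this thread.
seg_a = "events_2015-10-06T06:05:00.000Z_2015-10-06T06:10:00.000Z_2015-10-06T06:05:00.000Z"
seg_b = "events_2015-10-09T15:25:00.000Z_2015-10-09T15:30:00.000Z_2015-10-09T15:25:00.000Z"

query = {
    "queryType": "select",
    "dataSource": "events",
    "pagingSpec": {"pagingIdentifiers": {}, "threshold": 20},
}
response = [{"result": {"pagingIdentifiers": {seg_a: 2, seg_b: 10}, "events": []}}]

next_query = build_next_query(query, response)
print(next_query["pagingSpec"]["pagingIdentifiers"])
```

The original `query` dict is left untouched, so the same base query can be reused for every page while only the `pagingSpec` changes.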