Plywood pagination on Druid datasource

Hi everyone!

I’m not sure where to ask this question, but since Plywood is so closely tied to Druid I thought someone here could help me.

I need to use the powerful Druid index to filter events on some value and then extract user IDs from it with pagination. I could do it natively with a POST request to the brokers, but I’m trying to use Plywood in a back-end Node.js app.

Following the Plywood web page example:

var ex = ply()
  .apply("wiki",
    $('wiki').filter($("time").in({
      start: new Date("2015-09-12T00:00:00Z"),
      end: new Date("2015-09-13T00:00:00Z")
    }).and($('channel').is('en')))
  )
  .apply('select', $('wiki').select('user'));

That works as expected and I get all the users that match the filter criteria. However, if the dataset is too large the query times out. So what I need is Druid’s pagination on the select query, with a limit on the number of records returned per query, which is described in the documentation: http://druid.io/docs/latest/querying/select-query.html

Is there any way I can achieve this with plywood?

Thanks!

Hi, the Druid pagination feature works by passing an offset token back to you that you then pass in with the next query.

You can do essentially the same thing in Plywood, but at a higher level. The results will be produced in increasing time order, so start with the query:

var LAST_TIMESTAMP = new Date("2000-01-01T00:00:00Z");

var ex = ply()
  .apply("wiki",
    $('wiki').filter(YOUR_FILTER.and($('time').greaterThan(LAST_TIMESTAMP)))
  )
  .apply('select', $('wiki').select('user').limit(PAGE_SIZE));

and in the next query set LAST_TIMESTAMP to the last timestamp you got and run the same query again.
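As a sketch of the loop described above (plain Node.js, no Plywood needed; `fetchPage` stands in for executing the Plywood expression with the updated LAST_TIMESTAMP filter):

```javascript
// "Last timestamp" pagination: each page is fetched with time > lastTimestamp
// and limited to PAGE_SIZE, then lastTimestamp is advanced to the final row.
const PAGE_SIZE = 2;

// Mock event stream, sorted by time (as Druid select results are).
const events = [
  { time: new Date('2015-09-12T01:00:00Z'), user: 'a' },
  { time: new Date('2015-09-12T02:00:00Z'), user: 'b' },
  { time: new Date('2015-09-12T03:00:00Z'), user: 'c' },
  { time: new Date('2015-09-12T04:00:00Z'), user: 'd' },
  { time: new Date('2015-09-12T05:00:00Z'), user: 'e' }
];

function fetchPage(lastTimestamp) {
  // Stand-in for running the Plywood query shown above.
  return events.filter(e => e.time > lastTimestamp).slice(0, PAGE_SIZE);
}

function fetchAllUsers() {
  let lastTimestamp = new Date('2000-01-01T00:00:00Z');
  const users = [];
  for (;;) {
    const page = fetchPage(lastTimestamp);
    if (page.length === 0) break;
    page.forEach(e => users.push(e.user));
    lastTimestamp = page[page.length - 1].time; // advance the "token"
  }
  return users;
}

console.log(fetchAllUsers()); // logs: [ 'a', 'b', 'c', 'd', 'e' ]
```

One caveat: because the filter is a strict `greaterThan`, rows that share the exact timestamp of the last row on a page boundary would be skipped, so this works cleanly only when timestamps at page boundaries are unique.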

There are plans to add ‘proper’ (offset based) pagination to Plywood, but nothing will ever be as efficient as the ‘last token’ pagination that is done by Druid / described above.

It should be noted that if you are trying to extract user IDs for the interval, it would be much, much faster to extract them as distinct values.

var ex = ply()
  .apply("wiki",
    $('wiki').filter($("time").in({
      start: new Date("2015-09-12T00:00:00Z"),
      end: new Date("2015-09-13T00:00:00Z")
    }).and($('channel').is('en')))
  )
  .apply('select', $('wiki').split('$user', 'user'));

This will do a groupBy query, will probably be MUCH faster, and will give you distinct users.

If you have more than 500k users and are not running Druid 0.9.2 (which has a new, better groupBy engine), the query will fail.

If you find yourself in that situation you could add a limit:

var ex = ply()
  .apply("wiki",
    $('wiki').filter($("time").in({
      start: new Date("2015-09-12T00:00:00Z"),
      end: new Date("2015-09-13T00:00:00Z")
    }).and($('channel').is('en')))
  )
  .apply('select', $('wiki').split('$user', 'user').limit(10000));

And paginate it the same way as in my earlier reply.

Hi Vadim.

I used the same approach that you suggested, including the split part, because I noticed that getting uniques is much faster than getting all the events.

Because there are more than 500k users, I have to use a limit, otherwise the query could time out. I then de-duplicate the users in the app that receives the user IDs.

Luckily we use version 0.9.2, which is really fast on the groupBy query.

Thanks a lot for your detailed explanation, it was really helpful!

Regards,

Vadim, one question.

For the last example you provided:

var ex = ply()
  .apply("wiki",
    $('wiki').filter($("time").in({
      start: new Date("2015-09-12T00:00:00Z"),
      end: new Date("2015-09-13T00:00:00Z")
    }).and($('channel').is('en')))
  )
  .apply('select', $('wiki').split('$user', 'user').limit(10000));

How do I grab the timestamp? Only the user is retrieved in that response, and I can’t find a way to return the last timestamp.

Thanks!

@Frederico
Did you find an answer to your last question?

“How do I grab the timestamp? Only the user is retrieved in that response, and I can’t find a way to return the last timestamp.”

That’s my question as well: how would you grab the last timestamp when you split?

Thanks

Hi there xyz,

Nope, we ended up writing our own query parser because Plywood was really hard for us to understand. I believe Druid has launched a SQL-like query layer which could help you with those select queries:

http://druid.io/docs/latest/querying/sql.html

About your last question: I googled a bit about “last/first” aggregators and it seems they are already implemented:

http://druid.io/docs/latest/querying/aggregations.html

But I never tested them, so I’m not sure they achieve exactly what you need.
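For what it’s worth, the result you’d want from such an aggregator is just “each user with the max of time over their rows”. Here is a plain-JavaScript sketch of that computation (the commented Plywood form is an assumption I have not tested, shown only to illustrate the idea):

```javascript
// Sketch of "last timestamp per user", i.e. a groupBy on user with a max
// over time. In Plywood this would presumably be an extra .apply() inside
// the split (untested assumption), something like:
//   $('wiki').split('$user', 'user')
//     .apply('lastTime', $('wiki').max($('time')))

const events = [
  { time: new Date('2015-09-12T01:00:00Z'), user: 'a' },
  { time: new Date('2015-09-12T02:00:00Z'), user: 'b' },
  { time: new Date('2015-09-12T03:00:00Z'), user: 'a' }
];

function splitWithLastTime(rows) {
  // Group rows by user, keeping only the greatest timestamp seen per user.
  const byUser = new Map();
  for (const { time, user } of rows) {
    const prev = byUser.get(user);
    if (!prev || time > prev) byUser.set(user, time);
  }
  return [...byUser].map(([user, lastTime]) => ({ user, lastTime }));
}

console.log(splitWithLastTime(events));
// logs: user 'a' with lastTime 03:00Z, user 'b' with lastTime 02:00Z
```

With that per-group last timestamp in hand you could in principle paginate the split results the same way as the select query, though I have not verified how Druid orders groupBy pages.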

Regards,