Timeseries query: how to filter events by a secondary time dimension at the bucket level

Hi all,

I have a bit of a funny use case that I’m struggling to find a solution to. Let me explain. My schema includes a secondary time dimension, with values that are millis since epoch.

I am trying to build a timeseries query that does the following:

  1. Aggregate data with a granularity of daily buckets

  2. Filter out any events that their secondary timestamp is not within the current bucket time range

In the doc, I saw the option to use the special __time dimension (http://druid.io/docs/latest/querying/dimensionspecs.html) to get the current bucket’s time according to the query granularity.

I was thinking to use a Javascript function or filter to compare the __time dimension with my secondary time dimension, but there doesn’t seem to be any way to do that.

Am I missing something? Is there a solution to this use case or is it not supported at the moment?

Thanks in advance,

Amir

Hi Amir, I believe you should be able to do that with the javascript filter. What is the SQL query you are trying issue? You might want to try https://github.com/implydata/plyql to answer your query.

Is this a one-time "ad-hoc" use case or are you consistently going to
be doing this query of "include only events that happened in the
current time bucket". If you are going to do it consistently, I
highly recommend either adding a new column which is essentially a
boolean of "in current time bucket" and then using that to filter your
queries. You might also consider completely eliminating the secondary
time column in favor of this annotation.

If this is an ad hoc thing and you just want to be able to do it, then
yeah, your approach should work.

--Eric

Hi Eric,

Thanks for your reply.

I understand your suggestion, but the problem is that the query granularity is user-specified, and can be hourly, daily, weekly etc.

In that case I will have to define a boolean dimension for each granularity that I want to support.

I am looking for a way to filter those values by comparing the __time dimension with the secondary time dimension,

something along the lines of “where secondary_time between __time and (__time + bucket_length)” (bucket_length = query granularity in millis)

Unfortunately, javascript filters only take one argument, so I can’t see how I can compare the __time dimension with the secondary time dimension.

Thanks,

Amir

Hi Fangjin,

I’m not sure how it’s possible to do that using javascript filter, since the filter takes only one argument and I am trying to compare values of two different dimensions (__time and the secondary time dimension).

I looked at plyql but it doesn’t seem to cover this scenario.

I think that a javascript filter that takes a list of arguments can solve the problem here, I think I’ll write one :slight_smile:

Thanks,

Amir

Hi Amir, what is the SQL query you are trying to issue? I think https://github.com/implydata/plyql, leveraging Plywood, will be able to support your use case.

Javascript aggregators should be able to take multiple arguments. You
specify multiple columns and then define your function with multiple
arguments and it should just work.

--Eric

Yes, but I’m referring to javascript filters, not aggregators. I am trying to filter based on two dimensions, and a javascript filter will only accept one argument as described in http://druid.io/docs/latest/querying/filters.html
“The function takes a single argument, the dimension value, and returns either true or false”

Another problem is that __time dimension doesn’t seem to get through to the javascript filter. I added debug printing to the filter, and when I use “__time” the filter function won’t even get called.

Filter definision is:

“filter”: {

"dimension": "__time", 

"function": "function(t) { java.lang.System.out.println(t); return true; }", 

"type": "javascript"

}

How can I use __time in javascript filters?

Hrm, for what you are trying to do, the way indexes are applied won't
gain you any benefit anyway (they are applied once for the whole
thing, where you need to apply a different bitmap for different time
points), so doing it as a filter will only be of limited benefit.

Try implementing it as logic on the aggregator, basically, have an if
statement in a javascript aggregator that checks the predicate and
only then actually aggregates. That should work and give you some
feel for run times.

If it turns out that this needs to be run a lot, though, the
javascript aggregator can be problematic (it generates new classes in
the perm gen and can result in lengthy pauses to clean up the perm
gen), so you might need to consider implementing a native aggregator
that does the same logic (and it could do it in a slightly more
optimized fashion).

--Eric

Just for the record Plywood + PlyQL can not help in this case. What is needed for that support is to have javascript filters (or dim extraction functions) that can take multiple columns just like Tal keeps saying.

Please +1 this issue: https://github.com/druid-io/druid/issues/1885

Also filed this issue: https://github.com/druid-io/druid/issues/2187

Guys, seriously, if this was possible PlyQL could pretty much support full on SQL filters for Druid.

Ok, got some progress here - I’m able to use the javascript aggregator for comparing the __time dimension with my secondary time dimension.
However, the __time dimension doesn’t work as expected. The doc says that __time is the current time bucket according to the query aggregation granularity.

But the results show that __time takes the values of the minimum query granularity from ingestion spec (namely, “queryGranularity” property), regardless of the actual query granularity.

What is the correct behaviour then?

Thanks,

Amir

The __time dimension is the time column in Druid. The entries are the rolled up rows in Druid at ingestion time, not query time. Where in the docs does it say what __time is representing? It should definitely be fixed if that’s the case.

Hi Fangjin,
I was referring to http://druid.io/docs/0.8.2/querying/dimensionspecs.html under extraction functions:

Hi Amir, thanks for this catch.

BTW, the feature you require of supporting multiple dims in js filter is in progress:https://github.com/druid-io/druid/pull/2217

Hi Amit, do you mind submitting a PR to fix the description of the __time dimension?
https://github.com/druid-io/druid/blob/master/docs/content/querying/dimensionspecs.md