Is it not possible to use limitSpec on a postAggregation?

I’m trying to add support for post-aggregations in Panoramix (https://github.com/mistercrunch/panoramix) and hit a problem when trying to use limitSpec on a postAggregation metric. The documentation is unclear as to whether this is possible. Is it? Any suggestions for a workaround?

Here’s the error message and the query:

Druid Error: Unknown column in order clause[OrderByColumnSpec{dimension='bookings_per_guest', direction=DESCENDING}] Query is:

{
  "dimensions": [
    "dim_destination_market"
  ],
  "aggregations": [
    {
      "type": "count",
      "name": "count"
    },
    {
      "filter": {
        "type": "selector",
        "dimension": "dim_metric_name",
        "value": "m_bookings"
      },
      "aggregator": {
        "type": "longSum",
        "fieldName": "m_value",
        "name": "m_bookings"
      },
      "type": "filtered",
      "name": "m_bookings"
    }
  ],
  "intervals": "2015-04-29T07:01:45/2015-07-29T07:01:45",
  "limitSpec": {
    "limit": 5,
    "type": "default",
    "columns": [
      {
        "direction": "descending",
        "dimension": "bookings_per_guest"
      }
    ]
  },
  "granularity": "all",
  "postAggregations": [
    {
      "fields": [
        {
          "fieldName": "m_bookings",
          "type": "fieldAccess"
        },
        {
          "fieldName": "count",
          "type": "fieldAccess"
        }
      ],
      "type": "arithmetic",
      "name": "bookings_per_guest",
      "fn": "/"
    }
  ],
  "queryType": "groupBy",
  "dataSource": "fct_kpis"
}

Hrm, I want to think that we fixed that in a recent version, though it might’ve been something else groupBy- and limitSpec-related. What version of Druid are you running?

--Eric

Hey Eric,

Remember me? We spoke for a moment along with the Presto guys before giving our respective talks at the Hadoop Summit. Did you have time to take a look at Panoramix? I’m still early on in the project but it’s really promising! Here: https://github.com/mistercrunch/panoramix

I was just told that we are running Druid 0.7.3.

I’ll dig through the PR history to see if I can find the one you are talking about.

Max

Is this the PR? https://github.com/druid-io/druid/pull/1361

Hrm, yeah, looks like that PR was done after we cut 0.7.3, so it should be in 0.8. The upgrade to 0.8 should be seamless from the query side, Max, so if you can upgrade, it should resolve the problem.

I do remember the conversation and I have tried Panoramix :). It's quite cool, we were playing around with it just the other day. It's cool that it does user management; none of the stuff that has been set up so far does that, I don't think. Also, the ability to register and save aggregators is cool; it took us a little bit to find out how to do it, but once we found it we were pretty happy :).

--Eric

Very nice! We’ll upgrade asap then.

Panoramix is still pretty rough around the edges but things are falling into place nicely. I’m also learning Druid internals and the API as I go along, so there’s room for improvement.

Related question: is there any way I can retrieve the aggregator info along with the segment metadata? I’m also wondering what happens when aggregators are heterogeneous over time (on different segments); do queries raise errors?

Max

is there any way I can retrieve the aggregator info along with the segment metadata?

Currently this information is not present in the segment; users have to remember the type of aggregator used while ingesting the data. If I am not wrong, there are efforts in that direction to preserve this information.
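For reference, a segmentMetadata query today only returns per-column information (type, size, cardinality), nothing about the aggregators. A sketch, reusing the datasource from your query:

{
  "queryType": "segmentMetadata",
  "dataSource": "fct_kpis",
  "intervals": ["2015-04-29/2015-07-29"]
}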

I’m also wondering what happens when aggregators are heterogeneous over time (on different segments); do queries raise errors?

As far as I understand, it will just try to aggregate. If it succeeds, the query will return just fine (for example, aggregating longs with doubles works), but it may give incorrect results. So it will depend on the aggregator specified while querying: if that aggregator can handle the type of data present in that column, then all is fine; otherwise it may throw an error, for example when using longSum for some complex metric. In the end it is on the user to make sure the aggregator specified is compatible with the data present in the segments.

-Parag

Does this info exist anywhere in the metadata database? Since I’m building a UI, it would be nice for the UI to expose what’s there instead of having the user input that configuration a second time. I’ll dig to see if there’s a GitHub issue for it and create one if there isn’t.

What if I add a new aggregator “sum(moneyz)” but earlier segments don’t have it, and I scan over a period covering both before and after? Will it raise an error, or just show me the aggregation for the part of the period after it was added?

I’ve seen the video where the Netflix folks change their configuration in real time, so I’m guessing heterogeneous/evolving metadata (adding/removing dimensions/aggregations over time) works pretty well. Any caveats there? I’m asking to figure out whether the UI or querying layer would have to somehow handle this, by, say, issuing different queries for different segment ranges and merging the results somehow.

Max

If the column you are asking for in an aggregator doesn't exist, you'll generally get "0"s or some other aggregator-specific default value. So, in the case of a sum aggregator on a column that didn't exist and now does: while it didn't exist, it will generate 0s; now that it does exist, it will generate numbers.
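To make that concrete with your sum(moneyz) example, a query like the sketch below (dates and names are made up), spanning the date the column first appeared, would return 0 for the earlier buckets and real sums for the later ones, rather than erroring out:

{
  "queryType": "timeseries",
  "dataSource": "fct_kpis",
  "granularity": "day",
  "intervals": "2015-05-01/2015-07-01",
  "aggregations": [
    {"type": "longSum", "fieldName": "moneyz", "name": "moneyz_sum"}
  ]
}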

The information about which aggregator was used is completely thrown away right now. Parag is right that in a future (soonish) version, we do expect to have that information exposed via some mechanism.

We don't automatically peg the aggregator because ingestion is actually somewhat similar to generating an aggregate table in an RDBMS. I.e., if you generate a table with

INSERT INTO aggregate_table
SELECT hour(timestamp), dim1, dim2, dim3,
       SUM(moneyz) AS monets, COUNT(*) AS records
FROM raw_table  -- raw_table standing in for wherever the raw events live
GROUP BY hour(timestamp), dim1, dim2, dim3

You now have the ability to run the expected queries like

SELECT day(timestamp), SUM(monets) AS art
FROM aggregate_table
GROUP BY day(timestamp)

But you can technically also run queries like

SELECT day(timestamp), MAX(monets) AS max_monets_in_post_aggregated_row
FROM aggregate_table
GROUP BY day(timestamp)

Whether that MAX actually makes sense depends on what dimensions were included when the aggregation was done and/or what kind of information you are trying to get out. However, just 'cause we don't peg the aggregator doesn't mean that we can't expose what was done, such that it is easy for a UI/the user to know and just carry it through.

--Eric

This really clarifies things for me. I didn’t understand that the ingestion engine was actually performing aggregations; I somehow thought that at ingest time I was specifying how people could aggregate later when querying, as if some sort of indexing or cubing were happening at load time. Definitely my fault, as I haven’t been systematic about reading the whole documentation.

Now (tell me if I’m right) I’m clearer on the fact that I may want to load metrics as dimensions if the grain I want in Druid is the grain of the source. Then at query time I can aggregate these numeric “dimensions” however I want.

If I load atomic data as per the source, or, say, pre-aggregate before batch loading (see the batch Airflow operator I wrote: http://pythonhosted.org/airflow/code.html#airflow.operators.HiveToDruidTransfer), I probably should load everything as dimensions. I’ll then be able to define my actual aggregations at query time.

Minor correction on my previous post: Qi Wang actually wrote most of the HiveToDruidTransfer Airflow operator. It makes it real easy for us to load data from Hive to Druid. Thanks, Qi!

Hello,

You cannot aggregate the values of a column specified as a dimension during ingestion; it will give you a wrong result (most probably 0 as the result of the aggregation). Druid will not treat a dimension column as a metric column and perform aggregation on its values.

While ingesting data, Druid will aggregate all events that have exactly the same values for the timestamp and dimension columns within each segment. On top of that, if a query granularity is defined, Druid will truncate the timestamp of each event to that granularity and aggregate the events having the same combination of time and dimensions.

If I am not wrong, the finest query granularity is one second, so if all of your events have timestamps at least a second apart, then all of the original events will be preserved and not aggregated.
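To illustrate with a fragment of an ingestion spec (a sketch; the aggregators here are just examples): with queryGranularity set to HOUR, all events sharing the same truncated hour and the same dimension values get rolled up into a single row by the metricsSpec aggregators.

"metricsSpec": [
  {"type": "count", "name": "count"},
  {"type": "longSum", "fieldName": "m_value", "name": "m_value"}
],
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "HOUR"
}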

I hope this makes it more clear.


Yes, you are correct. I was confusing it with something else. Thanks for the clarification, Nishant!