Druid Metrics

Hello,

I have been using Druid for two days and have worked my way through the wikipedia and twitter examples. Now, I am interested in batch ingesting my own data. However, I am a bit confused about this process, specifically the definition and usage of metrics.

Dimensions, as I understand them, are used to slice and group your data. If I’m not mistaken, they are usually strings (do they have to be strings?). Metrics, on the other hand, provide quantitative information about those groupings of dimensions - they seem to be derived from aggregation functions (min, max, avg, sum) applied to doubles or integers.

Let’s say I have data in the following form (just an example to work with):

String high_school, String city, String state, int num_students, int num_teachers, int num_rooms, double tuition

Clearly high_school, city, and state are dimensions I would like to be able to slice and group by. The other numerical data will be used to compute aggregates.
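
To make this concrete, here is roughly how I would have guessed the relevant pieces of the ingestion spec should look for this schema (the aggregator types and names here are my own guesses, which is really what I am asking about):

  "dimensions": ["high_school", "city", "state"],
  "metricsSpec": [
    { "type": "longSum", "name": "num_students", "fieldName": "num_students" },
    { "type": "longSum", "name": "num_teachers", "fieldName": "num_teachers" },
    { "type": "doubleSum", "name": "total_tuition", "fieldName": "tuition" }
  ]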

First question: when defining my spec file for batch ingestion, should I include the numerical fields as dimensions, even though they are not dimensions? You seem to have done this in the Twitter example:

"dimensions": [ …, "retweet_count", "follower_count", "friendscount", …, "statuses_count", … ],

I do not understand why you include these as dimensions.

Second question: Why are metrics defined in the spec file? This makes no sense to me. Does this mean Druid is pre-aggregating values? Does this mean I need to know what types of queries I will run before ingesting data? For instance, initially I might think I would like to see the total number of high school students in a certain city so I would define a metric to longSum the num_students field. However, what if after I ingest the data, I decide I’d rather know the max number of students at a high school in each city? I originally thought you could just define whatever aggregate you want in a query, but are you constrained to the metrics you defined in your ingestion spec file? If not, then what is the point of the metric in the spec file at all?
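
For concreteness, the metric I had in mind for the longSum case would be something like

  { "type": "longSum", "name": "num_students", "fieldName": "num_students" }

and my worry is whether, to later ask for a max instead, I would have needed to also define something like

  { "type": "longMax", "name": "max_students", "fieldName": "num_students" }

at ingestion time (these names are just placeholders I made up).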

To extend this:

I am still working on batch ingestion and I defined a metric in my spec file (I’m still not sure why) as follows:

{
  "fieldName": "amount",
  "name": "total_amount",
  "type": "doubleSum"
}

I did not add amount as a dimension - amount is a double that is present in every record. My goal is to be able to execute queries where I can group by certain dimensions and observe the total amount (sum).

After ingesting the data, I executed the following query:

{
  "queryType": "groupBy",
  "dataSource": "ebay_accounts5",
  "granularity": "all",
  "dimensions": ["city"],
  "aggregations": [
    { "type": "doubleSum", "name": "sum", "fieldName": "total_amount" }
  ],
  "intervals": ["2012-01-01/2016-01-03"]
}

Surprisingly, it actually returned what I wanted: a record for each unique city with the sum of the amounts for that city.

Surely this cannot be how Druid is designed to handle this problem. It makes no sense to me. You define a metric in the spec file, which creates a field (total_amount) that is the sum of individual amounts. It would then make sense for the query to simply list the metrics you want, for example:

"dimensions": ["city"],
"metrics": ["total_amount"]

but instead you have to define a new aggregate that reads the total_amount field:

"aggregations": [
  { "type": "doubleSum", "name": "sum", "fieldName": "total_amount" }
],

^^^ This is very confusing, because total_amount was already defined as a doubleSum in the spec file. This aggregation makes it seem like sum would be some sort of sum of sums… but no, sum turns out to be exactly what I wanted originally.

I hope what I am saying makes sense. If I am approaching this problem wrong, I would love to hear a better way.

Austin,

It might make more sense to think of it in terms of what you would do
in an RDBMS. At ingestion time, we do a "summarization" which is a
form of pre-aggregation. In an RDBMS, you might do something like

INSERT INTO ebay_accounts5 (SELECT granularity(timestamp, 'minute'),
product, state, city, SUM(amount) AS total_amount FROM kafka_feed_x
GROUP BY granularity(timestamp, 'minute'), product, state, city);

Then, you might do a query like

SELECT city, SUM(total_amount) AS sum FROM ebay_accounts5 GROUP BY city;

Or you could do

SELECT granularity(timestamp, 'day'), SUM(total_amount) AS sum FROM
ebay_accounts5 GROUP BY granularity(timestamp, 'day');

In these cases, the aggregator on the ingestion side (the INSERT
statement) and the aggregator on the query side (the SELECT statement)
are potentially doing two different things.

On the INSERT statement, it is doing an initial summarization to the
(minute_timestamp, product, state, city) level. Then on the query it
is doing another aggregation to just the city level. In the ingestion
case, you specify the aggregates to tell the system what to do when
you have collisions in the (minute_timestamp, product, state, city)
space. On the query, you specify it to tell the system what to do
when you have collisions in the (city) space.
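
In Druid terms, that pairing maps onto the two pieces you already have
(same names as in your example, just shown side by side):

ingestion spec (metricsSpec):
  { "type": "doubleSum", "name": "total_amount", "fieldName": "amount" }

query (aggregations):
  { "type": "doubleSum", "name": "sum", "fieldName": "total_amount" }

The first resolves collisions among raw rows that land in the same
(timestamp bucket, dimension values) combination; the second resolves
collisions among the already rolled-up rows that share the same city.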

Does that make more sense?

Also, yes, it's best not to include the "metric" columns that you use
in aggregators as dimensions.
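So in your high_school example, the dimensions list would be just
something like ["high_school", "city", "state"], with the numeric
columns appearing only in the metricsSpec.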

--Eric

That makes a lot more sense.

Thanks a lot!