Reduced Output on Query

Hi, I have been poking around in Druid, and one concern I have is the amount of data it outputs from a query.

From the example:

[ {
  "timestamp" : "2015-08-03T22:35:00.000Z",
  "result" : {
    "chars_added" : 23367.0,
    "edit_count" : 95
  }
}, {
  "timestamp" : "2015-08-03T22:36:00.000Z",
  "result" : {
    "chars_added" : 23690.0,
    "edit_count" : 93
  }
}, {
  "timestamp" : "2015-08-03T22:37:00.000Z",
  "result" : {
    "chars_added" : 54091.0,
    "edit_count" : 114
  }
} ]

This is a small sample, but there are a lot of repeated words, and on a much larger result this can easily add up. Is there any way to reduce the output size, or any plans to reduce the data output size? Or is there anything I could look at to work on as a plugin of sorts?

Thanks,

Scott

Scott,

Can I ask for a bit more information on why you are worried about the
output size? Do you expect it to be a primary bottleneck for your
workload? (if so, can you provide some context on why that might
be?).

There is definitely room to optimize the result size and the JSON
itself does have some repetition in it, but aside from using
compression on the way out of the broker we've not seen an indication
that the size of the serialized form is a bottleneck (versus what it
takes to process and generate a result set of that size, for example).
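For reference, getting the compressed form out of the broker is just
ordinary HTTP content negotiation. A rough sketch in Java (the broker
URL/port and the query file name are placeholders, and it assumes
response compression is enabled on your broker):

// Sketch only: ask the broker for a gzip-compressed response and
// decompress it on the client. The JSON inside is unchanged.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

public class GzipQueryExample {
    public static void main(String[] args) throws Exception {
        // Whatever query you would normally POST (hypothetical file name).
        byte[] queryJson = Files.readAllBytes(Paths.get("query.json"));

        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://broker-host:8082/druid/v2/").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setRequestProperty("Accept-Encoding", "gzip"); // ask for a compressed response

        try (OutputStream out = conn.getOutputStream()) {
            out.write(queryJson);
        }

        // Unzip on the way in and print the (single-line) JSON result.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(conn.getInputStream()), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
    }
}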

A bit more context on why this is a big concern could help us in
figuring out possible resolutions.

--Eric

Ya, a typical query for us is one day's worth of minute-level sensor data, and we have pages with as few as 1 sensor up to pages with 8000+ sensors.

So a large dataset case would be about 800 minutes * 8000 readings ~= 6.4 million samples.

I did the calculation in the format you have specified:

[{"timestamp":"2015-08-03T22:35:00.000Z","result":{"pin":232}}]

The size came out to around 377M, which compresses to about 14M.

Our current format comes out like:

{"o":[1,2,3,4,5,6,7,8,9],"d":[{"t":"12:04","d":[1,2,3,4,5,6,7,8,9]},{"t":"12:05","d":[1,2,3,4,5,6,7,8,9]}]}

"o" is basically the header and "d" is the time data, basically how a CSV would do it.

This comes out to around 22M, and with gzip compression it gets to the client at about 4M.

We would probably have to have something in between to do the conversion, but either way it is a lot of memory and CPU just to do the conversion.
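Something along these lines (just a rough sketch with Jackson; the "o"/"t"/"d" names are our own, not anything Druid provides) is roughly what that in-between piece would have to do:

// Sketch only: collapse a Druid timeseries result array of
// {"timestamp": ..., "result": {...}} objects into the compact
// {"o": [...], "d": [{"t": ..., "d": [...]}]} shape described above.
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class CompactResult {
    public static ObjectNode compact(ArrayNode druidRows, ObjectMapper mapper) {
        ObjectNode out = mapper.createObjectNode();
        ArrayNode header = out.putArray("o");   // metric names, written once
        ArrayNode data = out.putArray("d");     // one entry per timestamp

        List<String> metrics = new ArrayList<>();
        for (JsonNode row : druidRows) {
            JsonNode result = row.get("result");
            if (metrics.isEmpty()) {            // take the header from the first row
                Iterator<String> names = result.fieldNames();
                while (names.hasNext()) {
                    String name = names.next();
                    metrics.add(name);
                    header.add(name);
                }
            }
            ObjectNode entry = data.addObject();
            entry.put("t", row.get("timestamp").asText());
            ArrayNode values = entry.putArray("d");
            for (String metric : metrics) {
                values.add(result.get(metric)); // values only, no repeated keys
            }
        }
        return out;
    }
}

The header is written once and each row only carries a timestamp plus the values, which is where the savings over repeating the field names per row comes from.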

Thanks,

Scott

Hrm, your calculation seems to indicate that you aren't doing any
aggregation with Druid. With minutely data, I would expect you to have
more like 60*24 = 1440 data points, which is much smaller.

Fwiw, it's actually significantly more difficult (and slower) to get
at the raw data with Druid than with other systems. Druid is
well-suited to an aggregation-based use case. Without knowing the
details of how and why you are using Druid I cannot say if it is a
good fit, but if your primary need is to store and access raw data it
might not be the best choice.

On your specific question, if you are using Java and have the Jackson
library, you can do the query asking for the "application/smile"
content type, and you should be able to get data in Jackson's Smile
format, which is a binary JSON format that avoids replication of field
names and stuff like that. It should be smaller...
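Roughly along these lines with Jackson's Smile module
(jackson-dataformat-smile); the broker URL/port, the query file name,
and the exact content-type handling here are assumptions you would
want to verify against your Druid version:

// Sketch only: send a query to the broker and read a Smile-encoded
// response with an ObjectMapper built on SmileFactory.
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.smile.SmileFactory;

import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SmileQueryExample {
    public static void main(String[] args) throws Exception {
        ObjectMapper jsonMapper = new ObjectMapper();
        // A mapper built on SmileFactory speaks Smile instead of JSON text.
        ObjectMapper smileMapper = new ObjectMapper(new SmileFactory());

        // Whatever query you would normally POST as JSON (hypothetical file name).
        JsonNode query = jsonMapper.readTree(new File("query.json"));

        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://broker-host:8082/druid/v2/").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/smile");
        conn.setRequestProperty("Accept", "application/smile");

        // Send the query encoded as Smile...
        try (OutputStream out = conn.getOutputStream()) {
            smileMapper.writeValue(out, query);
        }

        // ...and decode the Smile-encoded response with the same mapper.
        try (InputStream in = conn.getInputStream()) {
            JsonNode rows = smileMapper.readTree(in);
            System.out.println(rows.size() + " result rows");
        }
    }
}

Smile back-references repeated field names by default, which is what
gets rid of most of the repetition you pointed out.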

--Eric

Ya, it is solar data, so we average about 800 minutes in a day instead of the 1440, which helps since it is smaller.

I guess it depends on your definition of aggregation, because if I said I had 1-second data, that would be a different story about what counts as aggregation versus raw data. In the end I think that point is not relevant to whether it is aggregated or not. If I look at how an RDBMS works, even if I do have a large query, yes, it is still doing a lot of work on the backend, but I can query rows one by one and process a large amount of data with lower memory. What I think you are ultimately saying is that Druid is meant to get a few facts out, not a giant flood of aggregated data. If that is the case, maybe it is not the best solution; it is hard to say.

Thanks for your help; I got what I was asking for.

Scott

If you pull the aggregation down to seconds and you query for a day,
yes, you will get a lot more points out, but that's still only 86,400
points for a day, which is roughly two orders of magnitude less than
6 million.

For what it's worth, you're gonna be hard pressed to get many
visualization engines to do much with 6.4 million points of data and
once you take that data and start munging it somewhere downstream in
some other system you are going to face memory costs on those systems
as well (and, if all you want is access to the raw data, all of the
stuff that Druid does to optimize for aggregation is just overhead).

In general, the places where Druid works best are things that require
lots of interactivity, like interactive dashboards, visualization and
the like. I've yet to see somebody provide an interactive experience
visualizing 6.4 million data points in a client of any data system
(honestly, once you start visualizing more than 2k data points,
visualizations tend to slow down a lot). Generally, you want to allow
Druid, or whatever your storage system is, to do the aggregation as
close to the base data as possible and return the smallest result set
possible.

I'm not trying to push you away from Druid; while I'd love to think
that Druid is a great fit for everybody, I honestly don't have enough
information to know if Druid is a good fit for you or not. I'm trying
to help as much as I can with the limited information I have. That
means I'm providing extra information in the hopes that it is helpful
(but it might be totally misleading). Would you be willing to
describe what you are trying to do with the data a bit more? That
would likely make it easier to both (1) help us understand what we
need to do and *why* we need to do it in order to improve the system
and (2) help us provide you with better guidance on whether the system
has the potential to be a good fit for what you are hoping to achieve.

It's totally possible to optimize the return format of Druid, and if
that's the only detractor for the system, I'd encourage you to do it.
The one caveat is that each query type can define its own result
format (they are not necessarily tabular), so it would need to be done
in some fashion that allows it to be applied to the queries you care
about. That said, I think that Jackson's Smile format will actually
be very close to what you have described already (perhaps even better
because it'll be a binary format, which shrinks numbers down a lot
more than JSON).

--Eric