Accessing Pre-aggregated data

Hi -

Is it possible to access pre-aggregated (roll-up) data from Druid before it is transformed into the compressed segment format described here:

http://druid.io/docs/latest/design/segments.html

I am trying to see if it is possible to access the rolled-up data as-is.

Thanks!

Hi Jagdeesh,
Could you explain more about your use case requirements?

If you just want to iterate over the pre-aggregated rows present in Druid segments, you could either use a Select query with pagination to access row-wise data from the segments, or have a look at DatasourceInputFormat, which can be used as an InputFormat for MR jobs to iterate over the rows present in a segment.
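For the Select route, here is a minimal sketch that posts a paginated query to the broker. The datasource name, interval, and broker address are placeholders, and the threshold caps the number of rows returned per page:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class SelectPager {
  public static void main(String[] args) throws Exception {
    // Select query with a pagingSpec; empty pagingIdentifiers means "first page".
    // Empty dimensions/metrics arrays mean "return all of them".
    String query = "{"
        + "\"queryType\": \"select\","
        + "\"dataSource\": \"my_datasource\","            // placeholder name
        + "\"intervals\": [\"2016-01-01/2016-01-02\"],"   // placeholder interval
        + "\"granularity\": \"all\","
        + "\"dimensions\": [], \"metrics\": [],"
        + "\"pagingSpec\": {\"pagingIdentifiers\": {}, \"threshold\": 100}"
        + "}";

    URL broker = new URL("http://localhost:8082/druid/v2/"); // placeholder broker
    HttpURLConnection conn = (HttpURLConnection) broker.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    try (OutputStream os = conn.getOutputStream()) {
      os.write(query.getBytes(StandardCharsets.UTF_8));
    }

    // Print this page of rows.
    try (Scanner s = new Scanner(conn.getInputStream(), "UTF-8")) {
      while (s.hasNextLine()) {
        System.out.println(s.nextLine());
      }
    }
  }
}

Each response carries pagingIdentifiers; copying them into the next query’s pagingSpec advances the scan one page at a time.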

Hi Nishant,

My use case is simple: we want to store pre-aggregated data in Druid as well as in another data store, perhaps another database. This is to avoid losing the human-readable pre-aggregated data. When Druid stores the data in deep storage, it is compressed with several algorithms. Storing it in an additional data store gives us more flexibility and access to the pre-aggregated data.

Thanks!

How about anything like an interceptor to grab the pre-aggregated data so I don’t have to query it? Select Query with pagination works, but framing the input criteria is going to be challenging.

Any more advice?

Hey Jagadeesh,

By “pre-aggregated” do you mean your original, raw data? Or do you mean the data that has been aggregated by Druid at ingestion time?

If it’s the original, that’s not accessible, as it isn’t stored by Druid.

If it’s the latter, your two best bets are probably either the “select” query or the “DatasourceInputFormat” from druid-indexing-hadoop. The latter is currently only packaged for internal use (there aren’t user-facing docs) but you could probably use it anyway.

Hi Gian,

Thanks for the response. Pardon my limited insight into the Druid code base; I haven’t had a chance to dig into it much.

Yes, I meant the latter. Essentially, I want to access the rolled-up data from Druid before it gets persisted to deep storage, so I can store it in yet another store (perhaps a database).

I have trouble understanding how the DatasourceInputFormat class would help in this case. Would you mind elaborating on it?

Thanks,

Jagadeesh.

Gian - will you be able to elaborate on how to use the DatasourceInputFormat class for my use case?

Thanks!

Hey Jagadeesh,

I was thinking you could use that to read Druid segments out of deep storage into Hadoop jobs, and then do something with that data. If you just want to send the Druid data somewhere else, that could be a map-only Hadoop job.

Any hints on how that map-only Hadoop job can be set up? And you are right, we don’t want to send the data to deep storage and then read it back.

Thanks!

Oh, I was talking about reading segments after they’re written to deep storage. Is there a particular reason you want to siphon off the data before it gets to deep storage?

Yes, once it gets to deep storage, the only way we can access the underlying aggregated data is via the Druid JSON API.

Like I mentioned in my other posts, I want to store the pre-aggregated data in another data store, like a database, or convert the aggregated rows into JSON to feed into another system.

Essentially, I am evaluating a use case where Druid can be used purely as a sliding-window aggregation component. The pre-aggregation that Druid does is a great fit for this. Please let me know.

Thanks!

Hey Jagadeesh,

Okay, that makes sense. I think you have two good options right now.

  1. DatasourceInputFormat may really be what you’re looking for. You can use it to read the aggregated data out of Druid segments without going through the Druid JSON API; this InputFormat works by reading segments directly off deep storage into your MR job. You’d set it as the InputFormat of an MR job, and that job would then get “InputRow” objects, one for each pre-aggregated row. You could then do whatever you want with those InputRow objects. If you just want to dump the data somewhere else, probably the best thing to do is convert it in the mapper to whatever format you want (see the sketch after this list).

  2. You can read the segment files directly off disk. 0.9.2 will include a DumpSegment tool that can be used to dump segments as JSON (although see the docs for caveats). Docs for that are here: https://github.com/gianm/druid/blob/master/docs/content/operations/dump-segment.md. You can get access to that now by building a snapshot from master.
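Here is a minimal sketch of option 1 as a map-only Hadoop job. The key/value types emitted by DatasourceInputFormat (NullWritable keys, InputRow values) and the way segments are registered in the job configuration are assumptions on my part, since the class is internal and undocumented; treat this as a starting point rather than a recipe.

import java.io.IOException;

import io.druid.data.input.InputRow;
import io.druid.indexer.hadoop.DatasourceInputFormat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SegmentExportJob {

  // Assumed key/value types: NullWritable in, InputRow in, text out.
  public static class ExportMapper
      extends Mapper<NullWritable, InputRow, NullWritable, Text> {
    @Override
    protected void map(NullWritable key, InputRow row, Context ctx)
        throws IOException, InterruptedException {
      // Each call receives one pre-aggregated row; serialize it however you
      // like (JSON, CSV, a database write). toString() is just a placeholder.
      ctx.write(NullWritable.get(), new Text(row.toString()));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The segments to read would need to be registered in conf here, using
    // DatasourceInputFormat's internal configuration keys (undocumented).
    Job job = Job.getInstance(conf, "druid-segment-export");
    job.setJarByClass(SegmentExportJob.class);
    job.setInputFormatClass(DatasourceInputFormat.class);
    job.setMapperClass(ExportMapper.class);
    job.setNumReduceTasks(0); // map-only: no shuffle, no reduce
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Setting the reduce count to zero makes the job map-only, so each mapper streams rows straight from a segment to the output (or to whatever external store you write to in map()).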

Hi Gian,
I am interested in something similar. Thanks for pointing to DatasourceInputFormat. Is there some documentation on this class? What kind of output format do we expect from this reader in Hadoop?

Hello guys,

I would like to know if it’s possible to get HyperUnique bitmaps in any way?!

Select queries return a string like this: “AQAAAQAAAAKkgA==”

Thanks,

Ben

That’s the base64 encoding of an HLL object. You could do a base64 decode followed by HyperLogLogCollector.makeCollector on the returned buffer to get the HLL object.
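In Java that could look something like the sketch below. It assumes the druid-processing artifact is on the classpath and that the io.druid package prefix matches the Druid version of this era:

import java.nio.ByteBuffer;
import java.util.Base64;

import io.druid.query.aggregation.hyperloglog.HyperLogLogCollector;

public class HllDecode {
  public static void main(String[] args) {
    // The base64 string returned by the select query for a hyperUnique column.
    String encoded = "AQAAAQAAAAKkgA==";
    byte[] bytes = Base64.getDecoder().decode(encoded);

    // Wrap the raw bytes and rebuild the HLL collector from them.
    HyperLogLogCollector collector =
        HyperLogLogCollector.makeCollector(ByteBuffer.wrap(bytes));

    System.out.println("Estimated cardinality: " + collector.estimateCardinality());
  }
}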