How to tell when the query result ends?

I am using "pagingIdentifiers" and the offset number to iterate through the results returned by the query,
and stop when the "events" object is empty.

Is there a better way to tell when the result has been completely returned?

Thanks!

Jing

Hi Jing, are you asking how to tell when the experimental select query ends, or about queries in general?

Queries in general.

I want to measure query timing.
I am also facing a problem here.
My schema is one dimension (asset id) with some metrics.
There are 78,031 assets of daily data, one month in total.
The query doesn't stop, because the next pagingIdentifier & interval give me the wrong result and it keeps iterating.
Example:
Fetch all (78,031 assets for 2 days):
{"queryType":"select","dataSource":"marketdata441_1m","metrics":[],"pagingSpec":{"pagingIdentifiers":{},"threshold":1000},"granularity":"all","dimensions":[],"intervals":["2015-01-28/2015-01-30"]}
I passed in a new pagingSpec on each iteration. When the first interval (1-28 to 1-29) completed, the second interval (1-29 to 1-30) started. In the query object, the pagingSpec was updated.
Log:

Next pagingIdentifiers:
marketdata23_1m_2015-01-28T00:00:00.000Z_2015-01-29T00:00:00.000Z_2015-03-24T03:00:25.414Z
70000

Hi Jing,

Queries return results in the form of JSON objects and should complete when the connection terminates.

Druid is a data store that is very much designed to do analytics (OLAP queries, groupBys, and aggregates). It appears you are evaluating Druid using the experimental select query. I don't know of any production deployments that actually use this query, and I don't know the performance implications of it. Just to give you some context, I wrote the select query in a single day as part of a hackathon.

If you are interested in benchmarking Druid, a good place to start is here:
http://druid.io/blog/2014/03/17/benchmarking-druid.html

I don’t think the select query is in any state where it can be officially benchmarked. I would strongly recommend looking into timeseries and topN queries to get started with Druid.
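For context, here is a minimal sketch of what such a timeseries query might look like when posted to a broker from Python (using the requests package). The broker URL, the metric name "price", and the aggregators are illustrative assumptions, not taken from this thread:

import json
import requests  # assumes the requests package is installed

BROKER_URL = "http://localhost:8080/druid/v2/"  # placeholder broker endpoint

timeseries_query = {
    "queryType": "timeseries",
    "dataSource": "marketdata441_1m",
    "granularity": "day",
    "aggregations": [
        {"type": "count", "name": "rows"},  # number of Druid rows per day
        # "price" is a made-up metric name for illustration:
        {"type": "doubleSum", "fieldName": "price", "name": "price_sum"},
    ],
    "intervals": ["2015-01-28/2015-01-30"],
}

response = requests.post(
    BROKER_URL,
    data=json.dumps(timeseries_query),
    headers={"Content-Type": "application/json"},
)
print(response.json())  # one aggregated entry per day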

The reason to use the select query is because I only want to get raw data, based on the interval and dimension filters, without any aggregation.

It seems that topN returns only one dimension per query, and timeseries requires an aggregator.

Hi Jing,

The select query will return Druid rows and not raw data, unless you've chosen not to roll up your data at ingestion time and your raw data's granularity is millisecond or greater.

I am curious to understand your use case and what you are trying to measure a bit better. Druid is a data store where the result set is typically smaller than the input set, as Druid is primarily designed for aggregates. Druid is not particularly great if your output set is the same size as your input set (you might as well use Hadoop at that point). Although we do have this experimental select query, I suspect you will not get any interesting results by trying to benchmark it. What is the end use case you are trying to accomplish?

Thanks,

FJ

The granularity of the data is by day.

I want to use Druid as a caching layer right now, without the analytic features. Running the select query with filters on dimensions/intervals returns each row with all the dimension values pretty fast. My experiment is to return all columns based on an interval or dimension values.

The input data will be 80k assets' daily data, 1 row per day, with around 1k columns and 100 dimensions. The whole data set covers 10 years; ingestion can be done all at once and then fed monthly. Different users will query the application to get a subset of the data and do their analytics in their own way.

Do you have any suggestions for how Druid can help in this case?

Thanks!

Hi Jing,

Thanks for the clarification. This is an interesting use case. To tell when the select query ends: as you page through the results, you'll eventually get back empty results. When this occurs, the query has ended.

For additional info, see: http://druid.io/docs/latest/SelectQuery.html
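To make the termination check concrete, here is a minimal sketch of such a paging loop in Python. The broker URL is a placeholder, and the offset handling is an assumption: depending on your Druid version you may need to advance the returned offsets yourself so the next page does not re-read the last row of the previous one.

import json
import requests

BROKER_URL = "http://localhost:8080/druid/v2/"  # placeholder broker endpoint

query = {
    "queryType": "select",
    "dataSource": "marketdata441_1m",
    "granularity": "all",
    "dimensions": [],
    "metrics": [],
    "intervals": ["2015-01-28/2015-01-30"],
    "pagingSpec": {"pagingIdentifiers": {}, "threshold": 1000},
}

while True:
    response = requests.post(
        BROKER_URL,
        data=json.dumps(query),
        headers={"Content-Type": "application/json"},
    )
    results = response.json()

    # Each result block contains "pagingIdentifiers" and a list of "events".
    events = [event for block in results for event in block["result"]["events"]]
    if not events:
        break  # an empty page means the result set is exhausted

    # ... process the events for this page here ...

    # Feed the returned pagingIdentifiers back in for the next page,
    # advancing each offset past the last row already read.
    next_identifiers = {}
    for block in results:
        for segment_id, offset in block["result"]["pagingIdentifiers"].items():
            next_identifiers[segment_id] = offset + 1
    query["pagingSpec"] = {"pagingIdentifiers": next_identifiers, "threshold": 1000}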

Thanks Fangjin!

Is it possible not to roll up raw data at ingestion? And how do I do that?

I don't need aggregation for the metrics, since I want to persist raw data, but I want the entire raw row to be returned by the query.
Right now the returned metrics are null since no aggregation is done.

Hi Jing,

You can set queryGranularity to 'NONE' at ingestion time to prevent any truncation of timestamps during ingestion.
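For illustration, the relevant fragment of an ingestion spec might look like the sketch below (written as a Python dict to match the earlier examples); the segmentGranularity and intervals are placeholders, and the line that matters is queryGranularity:

granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "DAY",             # placeholder; choose what fits your data
    "queryGranularity": "NONE",              # keep timestamps untruncated at ingestion
    "intervals": ["2015-01-01/2015-02-01"],  # placeholder interval
}
# This fragment sits under dataSchema -> granularitySpec in the full ingestion spec.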

Hi Jing,
I am not sure I understand your question. Are you trying to figure out a way to query Druid in the form 'SELECT * FROM data_source' in order to get all the indexed data?

Is that what you mean?

Thanks!

Jing,

It seems that, as Fangjin already said, you can use the select query: http://druid.io/docs/latest/SelectQuery.html

I hope this answers your question.