Under-count result in Druid datasource through Hive

Hi Druid experts,

Our Hive table backed by a Druid datasource consistently under-counts in our setup, and we would like to bring it up with the folks here.

We created two Hive tables, druid_sourced and control. The druid_sourced table has an underlying Druid datasource, while control is a regular Hive table whose data is stored as a text file in HDFS. Using the same data file with 1386725 rows, we indexed it into the druid_sourced datasource and then loaded it into the control table.

We first checked the count in druid_sourced through druid query and the count is matching:

curl -X 'POST' -H 'Content-Type:application/json' -d @query_druid.json '<broker_host>:8082/druid/v2/?pretty'

[ {
  "timestamp" : "2017-11-01T00:00:00.000Z",
  "result" : {
    "count" : 1386275
  }
} ]
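For context, the query_druid.json used above was not shared in the thread; a sketch of what it might look like, given that the result is a single timeseries bucket and (per the discussion that follows) the query aggregates the rollup "count" metric with longSum, is below. The datasource name and interval are assumptions:

```json
{
  "queryType": "timeseries",
  "dataSource": "druid_sourced",
  "granularity": "all",
  "aggregations": [
    { "type": "longSum", "name": "count", "fieldName": "count" }
  ],
  "intervals": [ "2017-11-01T00:00:00.000Z/2017-12-01T00:00:00.000Z" ]
}
```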

Then we checked the count in the control table through Beeline, and that count also matches.

SELECT COUNT(*) FROM control;

INFO : OK

I had a typo; the source file should have 1386275 rows, (not 1386725). It is corrected below.

Hi,

You shouldn't be hitting https://issues.apache.org/jira/browse/HIVE-17623 unless the CBO is failing for the SELECT COUNT(*), since such a query will be translated as a Druid timeseries.
Can you please share an explain plan for the query SELECT COUNT(*) FROM druid_sourced;

Thanks Slim. Here is the explain plan for the SELECT COUNT statement.

EXPLAIN SELECT COUNT(*) FROM druid_sourced;

INFO : Compiling command(queryId=hive_20171213100904_f25689dc-bcfe-4da7-a104-cecaf1eb4705): EXPLAIN SELECT COUNT(*) FROM druid_sourced

INFO : We are setting the hadoop caller context from HIVE_SSN_ID:a81e416d-89f5-4403-8946-9bb0ff95dba4 to hive_20171213100904_f25689dc-bcfe-4da7-a104-cecaf1eb4705

INFO : Semantic Analysis Completed

INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain, type:string, comment:null)], properties:null)

INFO : Completed compiling command(queryId=hive_20171213100904_f25689dc-bcfe-4da7-a104-cecaf1eb4705); Time taken: 0.144 seconds

INFO : We are resetting the hadoop caller context to HIVE_SSN_ID:a81e416d-89f5-4403-8946-9bb0ff95dba4

INFO : Setting caller context to query id hive_20171213100904_f25689dc-bcfe-4da7-a104-cecaf1eb4705

INFO : Executing command(queryId=hive_20171213100904_f25689dc-bcfe-4da7-a104-cecaf1eb4705): EXPLAIN SELECT COUNT(*) FROM druid_sourced

INFO : Starting task [Stage-1:EXPLAIN] in serial mode

INFO : Resetting the caller context to HIVE_SSN_ID:a81e416d-89f5-4403-8946-9bb0ff95dba4

INFO : Completed executing command(queryId=hive_20171213100904_f25689dc-bcfe-4da7-a104-cecaf1eb4705); Time taken: 0.007 seconds

INFO : OK

Okay, I see what is happening here.

So your Druid indexing job is rolling up data (even with granularity NONE). This rollup happens because you have some rows with the same dimension values and timestamp, so the actual number of rows in Druid is less than in the raw data.

This means that SELECT COUNT(*) from the Druid table is equivalent to counting the distinct groups in the raw table, i.e. SELECT COUNT(*) FROM (SELECT timeColumn, dim1, dim2, ..., dimN FROM hive_table GROUP BY timeColumn, dim1, dim2, ..., dimN) t;

You can confirm this by running that count over a GROUP BY on all dimensions and the time column.

The reason you get the correct count when you use the longSum aggregator is that Druid keeps the rollup information in a metric column called count, so you get the correct result. In fact, your query is not really SELECT COUNT(*), since you are using the longSum aggregator.

Not sure what you are trying to achieve, but if you want to get the count of raw rows from the Druid table using the Hive CLI, you need to send the following query:

select sum(count) from druid_table;

This should give you the exact count, and the plan should match your query_druid.json with the longSum aggregator.
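To make the rollup arithmetic concrete, here is a toy model in Python (not Druid itself; the timestamps and dimension values are invented) showing why COUNT(*) over rolled-up data under-counts while SUM(count) recovers the raw row count:

```python
from collections import Counter

# Raw rows: (timestamp, dim1) pairs. Two rows collide on the same
# timestamp + dimension values, so rollup merges them into one stored row.
raw_rows = [
    ("2017-11-01T00:00", "a"),
    ("2017-11-01T00:00", "a"),   # duplicate key -> rolled up
    ("2017-11-01T00:00", "b"),
    ("2017-11-01T01:00", "a"),
]

# Rollup: one stored row per unique (timestamp, dims) key, with a
# "count" metric recording how many raw rows were merged into it.
rolled_up = Counter(raw_rows)

count_star = len(rolled_up)           # what SELECT COUNT(*) sees: 3 stored rows
sum_count = sum(rolled_up.values())   # what SELECT SUM(count) sees: 4 raw rows

print(count_star, sum_count)  # -> 3 4
```

The gap between the two numbers is exactly the number of raw rows that shared a (timestamp, dimensions) key with another row.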

I hope you get my point; let me know if you have more questions.

Thanks Slim. I am glad you pointed out what actually happened. To paint a better picture of what we want to do with this setup: we have Tableau dashboards connecting to Hive, and we would like to improve query response time. But before we got to that, we found the Tableau dashboards reporting lower counts, and hence we ran the “SELECT COUNT(*)” exercise.

The test data didn’t have all the dimensions of the original data. At the same time, we didn’t want to add dimensions we don’t need just so Tableau reports matching counts. We are still very new to Druid, so I don’t know if my question makes sense, but do you have any suggestions/pointers for having Tableau report the “non-rolled-up” counts?

Thanks again. We really appreciate the info.

Regards,

BH


Thanks Slim. We were able to modify Tableau dashboards without turning off rollup in Druid. The preliminary response times have been very well-received.

I really appreciate your help.

No worries. We are adding a lot of features to the Hive adapter; it should get even faster in 2.6.3 and onward.