Hive and Druid integration problems

Hi Druid Users!

Recently, I’ve started testing Druid and Hive integration on top of Amazon EMR. The goal is to use Druid with Tableau as a self-service BI tool.

I’ve found two major issues that are blockers for us so far and hope that someone here could help me with solving them.

1. We’re receiving an error message while trying to filter dimensions using IN clause. The simplest way to reproduce that error is to create an external table in Hive using wikiticker data.

**Query: **

_SELECT * FROM default.druid_table_1 _WHERE channel IN (’#en.wikipedia’, ‘#ca.wikipedia’)

Error message:

“Error running query: java.lang.AssertionError: cannot translate filter: IN($2, _UTF-16LE’#en.wikipedia’, _UTF-16LE’#ca.wikipedia’)**”

We can’t change it to OR****clause because of the Tableau that creates the query with the IN during filtering the data.

2. When we are querying large dataset with a query that needs to use Hive as a second step, we are receiving lower results than actual.

Example q****uery:

*SELECT year(__time), sum() *FROM <druid_table> GROUP BY year(__time);

(I know that we can add year as a dimension, but this is just for an example)



I suspect it could be because of the default pagingSpec threshold - as I understand this is the limit but I have no clue what can I do with that.

Here is the component versions that we are using.

  • Hive version: 2.3.2,

****- Calcite version **1.10 **

  • Druid version: 0.12.0

Best regards,

Mateusz Jerzyk

Hi Mateusz,

Those queries you mentioned look like they should work fine in the native Druid SQL (, so this may be a better question for the Hive community.