Nested GroupBy Queries in Druid?

Hi,

  I am new to Druid and currently trying to figure out if the

following use case can be supported by Druid, explicitly, i.e.
(i.e. without mediation of an external; app - either front-end or
back-end). More specifically, bellow is a successful query, which

  • as far as I understand - will serve as the inner query later on.
    The output of this query is:

    time              city  users
    2017-01-17+00:00  Aiea  1
    2017-01-18+00:00  Aiea  12.04
    2017-01-19+00:00  Aiea  5.01
    
    The query itself is:
    
    {
      "intervals": [
        "2017-01-01T00:00:00.000Z/2017-02-01T00:00:00.000Z"
      ],
      "granularity": "all",
      "context": {
        "timeout": 60000
      },
      "queryType": "groupBy",
      "dataSource": "flattaxonomyv10",
      "aggregations": [
        {
          "type": "hyperUnique",
          "fieldName": "device_id_count",
          "name": "users"
        }
      ],
      "dimensions": [
        {
          "type": "default",
          "dimension": "city_en_name",
          "outputName": "city"
        },
        {
          "type": "extraction",
          "dimension": "__time",
          "outputName": "time",
          "extractionFn": {
            "type": "timeFormat",
            "format": "yyyy-MM-ddZZ",
            "timeZone": "UTC",
            "locale": "en-US"
          }
        }
      ],
      "filter": {
        "type": "and",
        "fields": [
          {
            "type": "selector",
            "dimension": "country_iso_code",
            "value": "US"
          },
          {
            "type": "not",
            "field": {
              "type": "selector",
              "dimension": "city_en_name",
              "value": null
            }
          }
        ]
      },
      "limitSpec": {
        "columns": [
          {
            "dimension": "city_en_name",
            "direction": "ascending"
          },
          {
            "dimension": "timestamp",
            "direction": "ascending"
          }
        ]
      }
    }
    
    What I would like to do next is:
    
    First, in the above query, filter out rows where users is less
    

than 100. The following clause does not seem to work for me when I
place is just after the opening bracket “{”.

  "having": {
    "type": "greaterThan",
    "aggregation": "users",
    "value": 100
  }

  Next, I would like to use the above query as inner query, group by

time and count number of rows. In other words, for each day, I
want to find the number of cities with more than 100 users. Is
that possible in Druid, explicitly? Btw, note that I have already
used the flag “groupByStrategy”: “v2” within the inner query.

Thank you very much for your help.

When you say “not seem to work” what do you mean exactly? Does the query fail, return wrong results, seem to ignore the having spec?

Hi Gian,

  Regarding the "having clause", I have noticed that, when I use the

following aggregator:

  {"type":"doubleSum","name":"users","fieldName":"c"}

  I get no errors at all. However, in my use case, I have to use

HLL. But, when I use the following aggregator:

{“type”:“hyperUnique”,“fieldName”:“device_id_count”,“name”:“users”}

  I get the following error. Any ideas? What is more, can I cast the

HLL results to long?

  Feb 07 09:56:00 ERROR metabase.driver.druid :: Error running

query:
Unknown exception
Feb 07 09:56:00 ERROR metabase.query-processor :: {:status
:failed,
:class java.lang.Exception,
:error “Unknown exception”,
:stacktrace
[“driver.druid$do_query.invokeStatic(druid.clj:60)”
“driver.druid$do_query.invoke(druid.clj:51)”

“driver.druid.query_processor$execute_query.invokeStatic(query_processor.clj:760)”

“driver.druid.query_processor$execute_query.invoke(query_processor.clj:751)”
“driver.druid$fn__34287.invokeStatic(druid.clj:160)”
“driver.druid$fn__34287.invoke(druid.clj:160)”
“driver$fn__18239$G__18232__18246.invoke(driver.clj:43)”

“query_processor$run_query.invokeStatic(query_processor.clj:431)”
“query_processor$run_query.invoke(query_processor.clj:415)”

“query_processor$pre_check_query_permissions$fn__21896.invoke(query_processor.clj:370)”

“query_processor$limit$fn__21881.invoke(query_processor.clj:342)”

“query_processor$cumulative_aggregation$fn__21875.invoke(query_processor.clj:322)”

“query_processor$cumulative_aggregation$fn__21875.invoke(query_processor.clj:322)”

“query_processor$post_format_rows$fn__21752.invoke(query_processor.clj:216)”

“query_processor$post_add_row_count_and_status$fn__21697.invoke(query_processor.clj:192)”

“query_processor$wrap_catch_exceptions$fn__21675.invoke(query_processor.clj:122)”

“query_processor$process_query.invokeStatic(query_processor.clj:479)”
“query_processor$process_query.invoke(query_processor.clj:470)”

“query_processor$dataset_query.invokeStatic(query_processor.clj:562)”
“query_processor$dataset_query.invoke(query_processor.clj:520)”
“api.dataset$fn__22672$fn__22675.invoke(dataset.clj:35)”

“api.common.internal$do_with_caught_api_exceptions.invokeStatic(internal.clj:222)”

“api.common.internal$do_with_caught_api_exceptions.invoke(internal.clj:219)”
“api.dataset$fn__22672.invokeStatic(dataset.clj:29)”
“api.dataset$fn__22672.invoke(dataset.clj:29)”

“middleware$enforce_authentication$fn__31033.invoke(middleware.clj:84)”
“api.routes$fn__31139.invokeStatic(routes.clj:43)”
“api.routes$fn__31139.invoke(routes.clj:43)”
“routes$fn__32129.invokeStatic(routes.clj:24)”
“routes$fn__32129.invoke(routes.clj:24)”

“middleware$log_api_call$fn__31121$fn__31123.invoke(middleware.clj:275)”
“db$_do_with_call_counting.invokeStatic(db.clj:461)”
“db$_do_with_call_counting.invoke(db.clj:455)”
“middleware$log_api_call$fn__31121.invoke(middleware.clj:274)”

“middleware$add_security_headers$fn__31080.invoke(middleware.clj:209)”

“middleware$bind_current_user$fn__31037.invoke(middleware.clj:104)”],
:query
{:type “native”,
:native
{:query
“{\n “intervals”: [\n
“2017-01-01T00:00:00.000Z/2017-02-01T00:00:00.000Z”\n ],\n
“granularity”: “all”,\n “context”: {\n “timeout”:
60000\n },\n “queryType”: “groupBy”,\n “dataSource”:
“flattaxonomyv10”,\n “aggregations”: [\n
{“type”:“hyperUnique”,“fieldName”:“device_id_count”,“name”:“users”}\n
],\n “dimensions”: [\n {\n “type”:
“default”,\n “dimension”: “city_en_name”,\n
“outputName”: “city”\n },\n {\n “type”:
“extraction”,\n “dimension”: “__time”,\n
“outputName”: “time”,\n “extractionFn”: {\n
“type”: “timeFormat”,\n “format”:
“yyyy-MM-ddZZ”,\n “timeZone”: “UTC”,\n
“locale”: “en-US”\n }\n }\n ],\n “filter”: {\n
“type”: “and”,\n “fields”: [\n {\n “type”:
“selector”,\n “dimension”:
“country_iso_code”,\n “value”: “US”\n },\n
{\n “type”: “not”,\n “field”: {\n
“type”: “selector”,\n “dimension”:
“city_en_name”,\n “value”: null\n }\n
}\n ]\n },\n “limitSpec”: {\n “columns”: [\n
{\n “dimension”: “city_en_name”,\n
“direction”: “ascending”\n },\n {\n
“dimension”: “timestamp”,\n “direction”:
“ascending”\n }\n ]\n },\n “having”: {\n
“type”: “greaterThan”,\n “aggregation”: “users”,\n
“value”: 100\n }\n}”},
:parameters ,
:constraints {:max-results 10000, :max-results-bare-rows 2000},
:info {:executed-by 1, :card-id nil, :uuid
“c83ed88f-598b-410f-9395-2e796556a01f”, :query-hash -247212218,
:query-type “native”}},
:expanded-query nil}

  Feb 07 09:56:00 ERROR metabase.query-processor :: Query failure:

Unknown exception
[“query_processor$assert_valid_query_result.invokeStatic(query_processor.clj:518)”
“query_processor$assert_valid_query_result.invoke(query_processor.clj:513)”
“query_processor$dataset_query.invokeStatic(query_processor.clj:563)”
“query_processor$dataset_query.invoke(query_processor.clj:520)”
“api.dataset$fn__22672$fn__22675.invoke(dataset.clj:35)”
“api.common.internal$do_with_caught_api_exceptions.invokeStatic(internal.clj:222)”
“api.common.internal$do_with_caught_api_exceptions.invoke(internal.clj:219)”
“api.dataset$fn__22672.invokeStatic(dataset.clj:29)”
“api.dataset$fn__22672.invoke(dataset.clj:29)”
“middleware$enforce_authentication$fn__31033.invoke(middleware.clj:84)”
“api.routes$fn__31139.invokeStatic(routes.clj:43)”
“api.routes$fn__31139.invoke(routes.clj:43)”
“routes$fn__32129.invokeStatic(routes.clj:24)”
“routes$fn__32129.invoke(routes.clj:24)”
“middleware$log_api_call$fn__31121$fn__31123.invoke(middleware.clj:275)”
“db$_do_with_call_counting.invokeStatic(db.clj:461)”
“db$_do_with_call_counting.invoke(db.clj:455)”
“middleware$log_api_call$fn__31121.invoke(middleware.clj:274)”
“middleware$add_security_headers$fn__31080.invoke(middleware.clj:209)”
“middleware$bind_current_user$fn__31037.invoke(middleware.clj:104)”]

Ah this is probably a finalization issue. Try using a hyperUniqueCardinality post-aggregator to finalize the HLL into an estimate, and then do the “having” on the finalized estimate.