Tuple Sketch rollup

Hey everyone,
I’m having a problem using the tuple sketch aggregation with rollup. I reduced it to this minimal case:

One input stream with a field “session” containing a high-cardinality ID and a metric field “value”. With this query we can aggregate those into a tuple sketch and use it:

{
  "queryType": "timeseries",
  "dataSource": {
    "type": "table",
    "name": "test"
  },
  "intervals": [
    "2000-01-01/2022-01-01"
  ],
  "aggregations": [
    {
      "type": "arrayOfDoublesSketch",
      "name": "session_values",
      "fieldName": "session",
      "numberOfValues": 1,
      "metricColumns": ["value"]
    }
  ],
  "postAggregations": [
    {
      "type": "arrayOfDoublesSketchToString",
      "name": "details",
      "field": {
        "type": "fieldAccess",
        "fieldName": "session_values"
      }
    }
  ],
  "granularity": {
    "type": "all"
  }
}

But since the input is exceedingly large, this is quite a slow query. Moving the aggregation to a metric field and enabling rollup performs the aggregation at ingestion time, and I can see the raw data of the rolled-up tuple sketch using this scan query:

{
  "queryType": "scan",
  "dataSource": {
    "type": "table",
    "name": "test"
  },
  "intervals": [
    "2000-01-01/2022-01-01"
  ],
  "granularity": {
    "type": "all"
  }
}

But then I can’t use the pre-aggregated tuple sketch. I tried this query, referencing the tuple sketch metric (“tuple_sessions”) by name in the post-aggregation:

{
  "queryType": "timeseries",
  "dataSource": {
    "type": "table",
    "name": "test"
  },
  "intervals": [
    "2000-01-01/2022-01-01"
  ],
  "aggregations": [],
  "postAggregations": [
    {
      "type": "arrayOfDoublesSketchToString",
      "name": "details",
      "field": {
        "type": "fieldAccess",
        "fieldName": "tuple_sessions" // Name of the tuple sketch metric
      }
    }
  ],
  "granularity": {
    "type": "all"
  }
}

It gives this error:

Missing fields [[tuple_sessions]] for postAggregator [details] at [Source: (org.eclipse.jetty.server.HttpInputOverHTTP); line: 1, column: 332]

I also tried the exact same query as the first one, in case Druid could optimize it on its own, but after rollup the “session” field no longer exists.
How can we use tuple sketches with rollup?
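
In other words, what I’m after is something roughly like the query below, merging the stored sketches at query time instead of rebuilding them from the raw “session” dimension. This is only my guess at the intended pattern (I’m assuming the query-time arrayOfDoublesSketch aggregator accepts the pre-aggregated metric column as its fieldName, and that metricColumns is omitted because the column already holds sketches), so I may well be misreading the docs:

{
  "queryType": "timeseries",
  "dataSource": {
    "type": "table",
    "name": "test"
  },
  "intervals": [
    "2000-01-01/2022-01-01"
  ],
  "aggregations": [
    {
      "type": "arrayOfDoublesSketch",
      "name": "session_values",
      "fieldName": "tuple_sessions",
      "numberOfValues": 1
    }
  ],
  "postAggregations": [
    {
      "type": "arrayOfDoublesSketchToString",
      "name": "details",
      "field": {
        "type": "fieldAccess",
        "fieldName": "session_values"
      }
    }
  ],
  "granularity": {
    "type": "all"
  }
}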

References

Relates to Apache Druid 0.21.1

Hello - how are you ingesting it? Would thetaSketchToString work/be appropriate for query time? Or thetaSketchEstimate? DataSketches Theta Sketch module · Apache Druid

The goal here is to get the average “value” for a subset of the sessions, so theta sketches and HLL are not enough, since they only give the number of sessions in the intersection.
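
To make that concrete, the kind of query-time computation I’m aiming for looks roughly like this: build session sketches for two event subsets, intersect them, and take the mean of “value” over the intersection with arrayOfDoublesSketchToMeans. The “purchase” and “signup” event values are just placeholders, and this is only a sketch of the idea, not something I have working:

{
  "queryType": "timeseries",
  "dataSource": {
    "type": "table",
    "name": "test"
  },
  "intervals": [
    "2000-01-01/2022-01-01"
  ],
  "aggregations": [
    {
      "type": "filtered",
      "filter": { "type": "selector", "dimension": "event", "value": "purchase" },
      "aggregator": {
        "type": "arrayOfDoublesSketch",
        "name": "purchase_sessions",
        "fieldName": "session",
        "numberOfValues": 1,
        "metricColumns": ["value"]
      }
    },
    {
      "type": "filtered",
      "filter": { "type": "selector", "dimension": "event", "value": "signup" },
      "aggregator": {
        "type": "arrayOfDoublesSketch",
        "name": "signup_sessions",
        "fieldName": "session",
        "numberOfValues": 1,
        "metricColumns": ["value"]
      }
    }
  ],
  "postAggregations": [
    {
      "type": "arrayOfDoublesSketchToMeans",
      "name": "mean_value_in_both",
      "field": {
        "type": "arrayOfDoublesSketchSetOp",
        "name": "both",
        "operation": "INTERSECT",
        "numberOfValues": 1,
        "fields": [
          { "type": "fieldAccess", "fieldName": "purchase_sessions" },
          { "type": "fieldAccess", "fieldName": "signup_sessions" }
        ]
      }
    }
  ],
  "granularity": {
    "type": "all"
  }
}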

This is my ingestion spec:

{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "http",
        "uris": [
          "http://data-host/test.jsonl"
        ]
      },
      "inputFormat": {
        "type": "json"
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": {
        "type": "hashed"
      },
      "forceGuaranteedRollup": true
    },
    "dataSchema": {
      "dataSource": "test",
      "timestampSpec": {
        "column": "!!!_no_such_column_!!!",
        "missingValue": "2010-01-01T00:00:00Z"
      },
      "transformSpec": {},
      "dimensionsSpec": {
        "dimensions": [
          "event"
        ]
      },
      "granularitySpec": {
        "queryGranularity": "hour",
        "rollup": true,
        "segmentGranularity": "day"
      },
      "metricsSpec": [
        {
          "name": "count",
          "type": "count"
        },
        {
          "name": "sum_value",
          "type": "longSum",
          "fieldName": "value"
        },
        {
          "type": "arrayOfDoublesSketch",
          "name": "tuple_sessions",
          "fieldName": "session",
          "numberOfValues": 1,
          "metricColumns": [
            "value"
          ]
        }
      ]
    }
  }
}

Hey @petermarshallio, any guess here? Don’t tuple sketches support rollup, or is it a bug? Do you know of any workaround?

Hey @marcospassos I’ve not used the tuple sketch myself, so I’m kinda stabbing in the dark…

Since the query seems unable to find the column, and columns are defined inside each segment, and each segment relates to a time period, perhaps try a simple SQL query to see whether the column is being generated in those intervals at all? It may point to an issue with the timestampSpec, for example.
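
If SQL isn’t handy, I believe a native segmentMetadata query does much the same job and lists the columns (and, with the analysis types below, the stored aggregators and rollup flag) per segment, so it should show straight away whether tuple_sessions made it into the segments. A rough sketch, untested on my side:

{
  "queryType": "segmentMetadata",
  "dataSource": "test",
  "intervals": [
    "2000-01-01/2022-01-01"
  ],
  "merge": true,
  "analysisTypes": [
    "aggregators",
    "rollup"
  ]
}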

If memory serves, you can also check the actual ingestion task for errors, to see whether it was unable to generate the metric.
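
I think the sys.tasks table surfaces that too; something along these lines POSTed to /druid/v2/sql should list recent tasks for the datasource along with any failure message (again, just a sketch from memory):

{
  "query": "SELECT task_id, status, error_msg FROM sys.tasks WHERE datasource = 'test'"
}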