Druid Bay Area May Meetup

Please join us for our next Druid meetup at Target: https://www.meetup.com/druidio/events/239116526/

We’ll share use cases with Druid and discuss the upcoming Druid roadmap.

Hope to see everyone there!

Hi FJ,
Can you add me to the WeChat group? I have run into some problems recently… My WeChat is: 15221352014

Thanks~!

On Wednesday, April 19, 2017 at 2:30:25 AM UTC+8, Fangjin Yang wrote:

Hi, I need your help in getting a reply to a post from one of our engineering staff.

Hi,

I am new to Druid. While reading up on it, I saw in a few places that numeric values are imported as floats, but I could not find a clear statement of whether that means single or double precision. Some of our values are currency amounts (two digits after the decimal point), so I could multiply every value by 100, treat it as a long in Druid, and divide by 100 again at retrieval time. However, I don't think that is a good solution, because everyone working with this data would need to be aware of the scaling. So I tried importing the data into Druid directly, but the results I see are not entirely correct, or at least not in line with my expectations. I will explain with an example to make it clearer.

The ingestion spec JSON is:
{
  "type": "index",
  "spec": {
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/tmp",
        "filter": "test.csv"
      }
    },
    "dataSchema": {
      "dataSource": "test2_data",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "intervals": [
          "2017-03-15/2017-03-16"
        ]
      },
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "csv",
          "timestampSpec" : {
            "column" : "TestDate",
            "format" : "yyyy-MM-dd'"
          },
          "columns": [
            "TestDate",
            "Region",
            "id",
            "Amt"
          ],
          "dimensionsSpec": {
            "dimensions": [
              "Region",
              "id"
            ]
          }
        }
      },
      "metricsSpec": [
        { "type": "count", "name": "count"},
        { "type" : "doubleSum", "name" : "TotAmt", "fieldName" : "Amt" }
      ]
    },
    "tuningConfig": {
      "type": "index",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000
      },
      "jobProperties": {}
    }
  }
}

The data that I am loading is:
2017-03-15,R1,1,72911.87
2017-03-15,R1,2,729118.7
2017-03-15,R1,3,7291187.0
2017-03-15,R1,4,72911870
2017-03-15,R2,0,729118
2017-03-15,R2,1,729118
2017-03-15,R2,2,729118
2017-03-15,R2,3,729118
2017-03-15,R2,4,729118
2017-03-15,R2,5,729118
2017-03-15,R2,6,729118
2017-03-15,R2,7,729118
2017-03-15,R2,8,729118
2017-03-15,R2,9,729118

The first four lines (R1) contain values chosen to expose inaccuracies in a single-precision float. The last ten lines (R2) are each repeated 100 times, for a total of 1,004 lines in the file.
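
The query groups by Region and id and re-aggregates the TotAmt metric with a doubleSum. As a native groupBy query it looks roughly like this (a sketch only, reusing the datasource, dimension, and metric names from the spec above; the exact query text may differ):
{
  "queryType": "groupBy",
  "dataSource": "test2_data",
  "intervals": ["2017-03-15/2017-03-16"],
  "granularity": "all",
  "dimensions": ["Region", "id"],
  "aggregations": [
    { "type": "doubleSum", "name": "TotAmt", "fieldName": "TotAmt" }
  ]
}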

When I run the query to get the values grouped by Region and id, the results are as follows:
┌────────┬────┬───────────────┐
│ Region │ id │ TotAmt        │
├────────┼────┼───────────────┤
│ R1     │ 1  │ 72911.8671875 │
│ R1     │ 2  │ 729118.6875   │
│ R1     │ 3  │ 7291187       │
│ R1     │ 4  │ 72911872      │
│ R2     │ 0  │ 72911800      │
│ R2     │ 1  │ 72911800      │
│ R2     │ 2  │ 72911800      │
│ R2     │ 3  │ 72911800      │
│ R2     │ 4  │ 72911800      │
│ R2     │ 5  │ 72911800      │
│ R2     │ 6  │ 72911800      │
│ R2     │ 7  │ 72911800      │
│ R2     │ 8  │ 72911800      │
│ R2     │ 9  │ 72911800      │
└────────┴────┴───────────────┘

As can be seen in this data, for R1 the values are being approximated by single-precision floating point numbers in the cases of id = 1, 2, and 4 (possibly also 3, though the displayed value is an integer so it is hard to tell). For R2, each value is reported as exactly 100 times the base number, as expected.

However if I group it at just the Region level the results are:
┌────────┬───────────┐
│ Region │ TotAmt    │
├────────┼───────────┤
│ R1     │ 81005088  │
│ R2     │ 729118016 │
└────────┴───────────┘

So although TotAmt is defined as a doubleSum, the Region-level sum for R2 is not correct if we assume a double-precision calculation, which should be accurate to about 15 significant digits (729,118 × 1,000 = 729,118,000 exactly, while the reported 729,118,016 is the nearest single-precision float to that value).

Can someone explain why this behavior is being seen?

Is it possible to force Druid to use double precision numbers in all cases?

Regards,
Manish.

Hey Manish,

I’m working on a doc for Druid SQL about types that covers this, and I’ll share a draft with you below. The short answer is "no", you cannot force Druid to use doubles in all cases. But you could store your data as 64-bit ints (perhaps multiplied by a scaling factor) and use integer arithmetic if precision is important to you.
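
For example, if you pre-multiplied Amt by 100 in your source data, the metricsSpec could use a longSum instead of the doubleSum (just a sketch; the scaled column and metric names "AmtCents" and "TotAmtCents" below are placeholders):

"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "longSum", "name": "TotAmtCents", "fieldName": "AmtCents" }
]

At query time you would divide the result by 100 to get back to currency units, and the sums themselves stay exact integer arithmetic.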

Here’s the relevant part of the draft doc: