Hive+Druid

Hi there,
I have 2 TB in a Druid data source, e.g., data source name "my_ds" --> size 2 TB.

I am using Hive to create an external table for the data source and then query the data source "my_ds".
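
For reference, a Druid data source is typically exposed to Hive through the Druid storage handler, roughly like this (a minimal sketch; only the data source name "my_ds" comes from the description above):

CREATE EXTERNAL TABLE my_ds
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "my_ds");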

When I use Hive to count the rows of this data source, e.g., select count(*) from my_ds

I got 7710906278 rows.

However, when I use Hive to duplicate the table "my_ds" into another table stored in ORC format, I get only 22085632 rows.

E.g., create table my_ds_dump as select * from my_ds

Then select count(*) from my_ds_dump --> I get only 22085632 rows
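
As a side note, the create-table statement above does not name a file format; to actually store the copy as ORC it would presumably need a stored as clause, e.g.:

create table my_ds_dump stored as orc as select * from my_ds;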

Do you know what the problem is?

How can I select all the rows from the original source?

Thanks

Tas

Hi,

What might be happening is that Druid is doing roll-ups based on what you have specified as the query granularity.

Let's say you elect to use "druid.query.granularity" = "DAY". All the rows for a given day with the same dimension values will be collapsed into one row.
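
Conceptually, a DAY-level roll-up stores something close to the result of pre-aggregating the raw input, as in this sketch (the table name "raw_events" and the column names are only illustrative):

-- one stored row per day per distinct combination of dimension values
select cast(`__time` as date) as event_day,   -- timestamps truncated to the day
       a, b, c,                               -- dimension columns
       sum(a_price) as a_price,               -- metrics aggregated over the collapsed rows
       count(*) as collapsed_rows             -- how many raw rows were folded together
from raw_events
group by cast(`__time` as date), a, b, c;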

Is there any way that I can dump the Druid data source to a Hive table with the same number of rows?

Use "NONE":

"druid.query.granularity" = "NONE"
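
If the Druid data source is created from Hive itself, that property would go in the TBLPROPERTIES of the CTAS into Druid; a rough sketch with placeholder table and column names:

create table my_ds_druid
stored by 'org.apache.hadoop.hive.druid.DruidStorageHandler'
tblproperties (
  "druid.segment.granularity" = "MONTH",
  "druid.query.granularity" = "NONE"
)
as
select cast(event_time as timestamp) as `__time`,  -- Druid-backed tables need a `__time` timestamp column
       a, b, c, a_price
from some_source_table;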

I have tried "druid.query.granularity" = "NONE", but it did not help either :frowning:

Can you share the index DDL used to index the data?

And the logs of the task?

Hi Slim,
Thanks for your reply.

This is the JSON file used to ingest the CSV files into Druid.

ingestion.json

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "/user/hive/a.csv,/user/hive/b.csv"
      }
    },
    "dataSchema": {
      "dataSource": "flat_table_5TB",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "month",
        "queryGranularity": "NONE",
        "intervals": [
          "1970-01-01/2017-12-30"
        ]
      },
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "csv",
          "timestampSpec": {
            "format": "yyyy-MM-dd'T'HH:mm:ss'Z'",
            "column": "__time"
          },
          "columns": [
            "__time",
            "a",
            "a_price",
            "b_price",
            "c_price",
            "b",
            "d_price",
            "c",
            "d",
            "dd",
            "e",
            "f",
            "g",
            "h",
            "i",
            "day",
            "month",
            "year"
          ],
          "delimiter": ",",
          "dimensionsSpec": {
            "dimensions": [
              "a",
              "b",
              "c",
              "d",
              "e",
              "f",
              "g",
              "h",
              "i",
              "day",
              "month",
              "year"
            ]
          }
        }
      },
      "metricsSpec": [
        {
          "name": "a_price",
          "type": "count"
        },
        {
          "name": "bprice",
          "type": "count"
        },
        {
          "name": "c_price",
          "type": "count"
        },
        {
          "name": "d_price",
          "type": "count"
        }
      ]
    },
    "tuningConfig": {
      "type": "hadoop",
      "useCombiner": "true",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000
      },
      "jobProperties": {
      }
    }
  }
}

You are hitting this bug, where the number of rows returned by a query has the wrong cap:

https://issues.apache.org/jira/browse/HIVE-17623

Please apply the patch or use the latest HDP 2.6.3.

I will apply the update you've mentioned.

Thanks a lot, Slim.