Native Batch Index Parallel

I am starting a thread here to see if anyone has experience reading data from an S3 bucket and batch ingesting it into Druid.

Within the bucket there are directories, and the batch task will traverse the whole bucket by default. At the bottom of each directory structure there is a list of files, and each file contains numerous JSON data objects which can be ingested into Druid.

Here are some problems I am seeing:

  • The data don’t roll up, i.e. the count is always 1, even though there can be multiple entries with the same time chunk and dimensions. I tried forceGuaranteedRollup but it did not help.
  • The job only reads 1 entry per file in S3. Even though a file can contain multiple JSON objects, it only counts one.
    Is something wrong with my ingestion spec? Or are there other settings I need to configure outside of the spec in order to fix 1 and 2 above?

Here is how the job is being called:

curl -X 'POST' -H 'Content-Type:application/json' -d @firehose.json http://{druid}:8090/druid/indexer/v1/task
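The POST returns a task ID, which can then be polled for status against the same overlord ({taskId} below is just the ID from that response):

curl http://{druid}:8090/druid/indexer/v1/task/{taskId}/status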

The ingestion spec (firehose.json) looks like this:


  "type": "index_parallel",
  "spec": {
    "dataSchema": {

      "dataSource": "batch_event_stats",
      "metricsSpec": [
        {
          "type": "count",
              "name": "count"
            }
        ],
        "granularitySpec": {
          "segmentGranularity": "hour",
          "queryGranularity": "hour",

          "rollup": true
        },
        "parser": {
          "parseSpec": {
            "format" : "json",
            "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "jq",
                "name": "eventName",
                "expr": ".payload.events[0].eventName"
              },
              {
                "type": "jq",
                "name": "eventTime",
                "expr": ".payload.events[0].eventTime"
              }
            ]
          },
            "timestampSpec": {
              "column": "eventTime",
              "format": "posix"
            },
            "dimensionsSpec": {

              "dimensions": ["carrier", "eventName", "scope", "source"]

            }
          }
        }
    },
    "ioConfig": {
        "type": "index_parallel",
        "firehose": {
          "type": "static-s3",

          "prefixes": ["s3://ccapp-druid-raw-qa"]

        },"appendToExisting": false
    },
    "tuningconfig": {
        "type": "index_parallel",
        "maxNumSubTasks": 2
    }
  }
}

Hi,

The native parallel index task doesn’t support perfect rollup yet. It only supports best-effort rollup (http://druid.io/docs/latest/ingestion/index.html#roll-up-modes).
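If perfect rollup is needed right now, the plain (non-parallel) index task does support it via forceGuaranteedRollup in its tuningConfig; as far as I remember, numShards must then be set explicitly and appendToExisting must be false. A rough sketch of that tuningConfig:

    "tuningConfig": {
      "type": "index",
      "forceGuaranteedRollup": true,
      "numShards": 1
    }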

I’m not sure what you mean by “count is 1”, but I guess you’re seeing the result of best-effort rollup mode.
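One way to check how much rollup actually happened is to compare the number of stored Druid rows against the sum of your count metric, e.g. with a timeseries query sent to the broker (host, port, and interval below are placeholders):

curl -X 'POST' -H 'Content-Type:application/json' -d '{"queryType":"timeseries","dataSource":"batch_event_stats","granularity":"all","intervals":["1000-01-01/3000-01-01"],"aggregations":[{"type":"longSum","name":"ingested_rows","fieldName":"count"},{"type":"count","name":"stored_rows"}]}' http://{broker}:8082/druid/v2

If ingested_rows equals stored_rows, no rollup happened at all.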

For the second question, do you mean each subTask reads only one file? Or do you see missing files which are not ingested?

If you mean a single file per subTask, that’s the way the parallel index task works as of now. It could be improved in the future so that each subTask reads multiple small files or a portion of a big file.

Jihoon

Regarding the 2nd question: the subtask reading 1 file at a time is fine, but the problem is that it only takes 1 record from every single file, even though all the files have multiple records. They are all JSON and look like the example below.

e.g. {scope: 'web', eventTime: 1234345567890, type: 'notification'}{scope: 'app', eventTime: 1234345567890, type: 'notification'}{scope: 'server', eventTime: 1234345567890, type: 'notification'}…

thanks.

Does index_parallel work correctly if you download the files to local disk vs. pulling from S3?

Chirag,

Yes, it works after I put the files on local disk.

I found that the problem is not with the index mode.

It is because the JSON objects do not have a newline between them; the parser will only read one record per file if there is no newline after each JSON object.
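In case anyone else runs into this, rewriting the files as newline-delimited JSON before uploading fixes it. Since jq reads a stream of concatenated JSON objects and -c prints each one compactly on its own line, something like this works (file names are just placeholders):

jq -c '.' events.json > events.ndjson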

Cathy