cannot ignore invalid lines in index task

Hi All,

I’m running an index task to load a TSV file from the local filesystem. The file contains a few blank lines, and I couldn’t find a way to ignore them.

{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "dpi",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "tsv",
          "timestampSpec": {
            "column": "timestamp",
            "format": "yyyy-MM-dd HH:mm:ss"
          },
          "columns": ["timestamp","auction_id","user_id","ip","seller_member_id","user_agent","referer_url","latitude","longitude","app_id","device_make_id","device_model_id","carrierId","site_domain","geo_uk","de_geo_country","de_geo_city","de_geo_region","de_geo_postcode","timezone","device_id_type","device_id","response_values"],
          "dimensionsSpec": {
            "dimensions": [
              "site_domain"
            ],
            "delimiter": "\t",
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "count"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2015-03-01/2015-09-01"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/mnt/xvdb/data/",
        "filter": "sample.tsv"
      }
    },
    "tuningConfig": {
      "type": "index",
      "targetPartitionSize": 0,
      "rowFlushBoundary": 0,
      "ignoreInvalidRows": true
    }
  }
}

In the above config I set “ignoreInvalidRows”: true, but it doesn’t seem to have any effect. The config below, taken from the task logs, doesn’t even include that setting.

{
  "type" : "index",
  "id" : "index_dpi_2015-03-24T11:56:46.225Z",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "dpi",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "tsv",
          "timestampSpec" : {
            "column" : "timestamp",
            "format" : "yyyy-MM-dd HH:mm:ss"
          },
          "dimensionsSpec" : {
            "dimensions" : [ "site_domain" ],
            "dimensionExclusions" : [ "timestamp" ],
            "spatialDimensions" : [ ]
          },
          "delimiter" : null,
          "listDelimiter" : null,
          "columns" : [ "timestamp", "auction_id", "user_id", "ip", "seller_member_id", "user_agent", "referer_url", "latitude", "longitude", "app_id", "device_make_id", "device_model_id", "carrierId", "site_domain", "geo_uk", "de_geo_country", "de_geo_city", "de_geo_region", "de_geo_postcode", "timezone", "device_id_type", "device_id", "response_values" ]
        }
      },
      "metricsSpec" : [ {
        "type" : "count",
        "name" : "count"
      } ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : {
          "type" : "none"
        },
        "intervals" : [ "2015-03-01T00:00:00.000Z/2015-09-01T00:00:00.000Z" ]
      }
    },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "local",
        "baseDir" : "/mnt/xvdb/data",
        "filter" : "sample.tsv",
        "parser" : null
      }
    },
    "tuningConfig" : {
      "type" : "index",
      "targetPartitionSize" : 5000000,
      "rowFlushBoundary" : 500000,
      "numShards" : -1
    }
  },
  "groupId" : "index_dpi_2015-03-24T11:56:46.225Z",
  "dataSource" : "dpi",
  "interval" : "2015-03-01T00:00:00.000Z/2015-09-01T00:00:00.000Z",
  "resource" : {
    "availabilityGroup" : "index_dpi_2015-03-24T11:56:46.225Z",
    "requiredCapacity" : 1
  }
}

Kindly let me know how I can make the indexer ignore invalid lines.

Thanks,

https://github.com/druid-io/druid/pull/1226

There was a patch recently to fix this in the Hadoop indexer; I wonder whether the indexing service also suffers from the same issue.
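If you go the Hadoop route, I believe the flag belongs in the Hadoop task’s tuningConfig rather than the index task’s. A minimal sketch, assuming a build that includes the patch above (double-check the docs for your version; the rest of the task spec stays as it is):

  "tuningConfig" : {
    "type" : "hadoop",
    "ignoreInvalidRows" : true
  }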

This option is not a valid config for the indexing task:
http://druid.io/docs/latest/Tasks.html

It is not currently supported.

Thanks all for the replies.

The option to ignore invalid lines is vital, especially for batch uploads. Should a feature request be filed? And kindly let me know if there are any workarounds (other than removing the lines from the input files).

Thanks,

The easiest workaround for batch data is to use Hadoop indexing, so long as the invalid values in the file fail at the map step (failure to split) rather than at the reduce step (failure to interpret a column value).
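Roughly, an index_hadoop task pointing at the same file would look something like the sketch below. The dataSchema is elided since it would be the same as in your original post; the static inputSpec path and the ignoreInvalidRows flag are assumptions to verify against your Druid version:

{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : { ... },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "/mnt/xvdb/data/sample.tsv"
      }
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "ignoreInvalidRows" : true
    }
  }
}

Note that the local firehose’s baseDir/filter pair is replaced here by the inputSpec’s comma-separated paths field.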

Hi Sowdri, FWIW, we never use the index task in production; we do all batch ingestion using Hadoop-based batch indexing. The index task was created to ingest small batches of data for POCs and was never intended for production use. It is horribly inefficient and won’t really scale with data volumes greater than 1 GB.

Hi Yang,

Thank you for your reply. I’ll use the Hadoop batch indexer to index my data.

Thanks all for your responses,