Batch Data Load - success but no datasource appearing?

After going through the quickstart, I am now trying to get my own data loaded.

When I run it:

curl -X 'POST' -H 'Content-Type:application/json' -d @/home/dhopkins/druid-messages-index.json http://localhost:8090/druid/indexer/v1/task

I see success in the console, and I see success in the log.

The "Nothing to publish" line below looks suspect. Does this mean it is not finding any records in the data file?

Is my ISO timestamp format an issue? It doesn't have milliseconds.

Also, I noticed the quickstart example includes JSON fields explicitly even when their values are null. Is that needed?

i.e. if a column is not present in the data, what happens?

And if a column is present in the data that is not defined in the spec, what happens?

2019-02-04T19:11:20,010 INFO [task-runner-0-priority-0] io.druid.segment.realtime.appenderator.BaseAppenderatorDriver - Pushing segments in background:
2019-02-04T19:11:20,010 INFO [task-runner-0-priority-0] io.druid.segment.realtime.appenderator.AppenderatorImpl - Submitting persist runnable for dataSource[messages_index]
2019-02-04T19:11:20,018 INFO [publish-0] io.druid.segment.realtime.appenderator.BaseAppenderatorDriver - Dropping segments[]
2019-02-04T19:11:20,024 INFO [task-runner-0-priority-0] io.druid.indexing.common.task.IndexTask - Pushed segments[]
2019-02-04T19:11:20,026 INFO [publish-0] io.druid.segment.realtime.appenderator.BaseAppenderatorDriver - Nothing to publish, skipping publish step.
2019-02-04T19:11:20,027 INFO [task-runner-0-priority-0] io.druid.indexing.common.task.IndexTask - Published segments
2019-02-04T19:11:20,027 INFO [task-runner-0-priority-0] io.druid.segment.realtime.appenderator.AppenderatorImpl - Shutting down...
2019-02-04T19:11:20,029 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_messages_index_2019-02-04T19:10:15.763Z] status changed to [SUCCESS].
2019-02-04T19:11:20,032 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_messages_index_2019-02-04T19:10:15.763Z",
  "status" : "SUCCESS",
  "duration" : 308
}


The 'Druid cluster' (Coordinator) console:

http://wdc-tst-bdrd-001.openmarket.com:8081/#/

does not show my datasource (it does show the one from the quickstart)

The Overlord console (which shows the tasks, etc.):

http://wdc-tst-bdrd-001.openmarket.com:8090/console.html

does in fact show my datasource, and that the task succeeded

Here is my spec:

{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "messages_index",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "dimensionsSpec": {
            "dimensions": [
              "acceptedDate",
              "acceptedDateBucket",
              "deliveredDate",
              "updatedDate",
              "accountName",
              "accountId",
              "companyId",
              "countryName",
              "countryCode",
              "carrierName",
              "caId",
              "messageType",
              "messageOriginator",
              "messageOriginatorTon",
              "phoneNumber",
              "sourceAddress",
              "destinationAddress",
              "productName",
              "productId",
              "productIdDescription",
              "subaccount",
              "userDefined1",
              "userDefined2",
              "messageStatus",
              "responseCode",
              "responseCodeDescription",
              "messageId",
              "parentMessageId",
              "livup",
              "apiVersion",
              "contentEncoding",
              "userDataHeader",
              "remoteIpAddress",
              "remoteResponseCode",
              "userAgent",
              "productSubType",
              "pId",
              "internalMessageId",
              "Score"
            ]
          },
          "timestampSpec": {
            "column": "acceptedDateBucket",
            "format": "iso"
          }
        }
      },
      "metricsSpec": [],
      "granularitySpec": {
        "rollup": false,
        "segmentGranularity": "MINUTE",
        "queryGranularity": "MINUTE"
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/home/dhopkins/",
        "filter": "kafka-file-dump.json"
      },
      "appendToExisting": false
    },
    "tuningConfig": {
      "type": "index",
      "targetPartitionSize": 5000000,
      "maxRowsInMemory": 25000,
      "forceExtendableShardSpecs": true
    }
  }
}


Here is a sample data record:

{
  "destinationAddress": "111111111",
  "remoteResponseCode": "",
  "accountName": "11111112F27-11111-A447-7D5433A69CF5",
  "updatedDate": "2019-02-03T18:43:33Z",
  "messageOriginatorTon": "1",
  "responseCodeDescription": "Message delivered",
  "acceptedDate": "2019-02-03T18:43:21Z",
  "productSubType": "TRANSACTIONAL",
  "productName": "111 Way",
  "responseCode": "4",
  "deliveredDate": "2019-02-03T18:43:33Z",
  "Score": "[score20m=0, score10m=0, score24h=0]",
  "apiVersion": "VERSION_4",
  "carrierName": "V11o",
  "messageType": "MT",
  "countryCode": "BR",
  "messageOriginator": "11111111",
  "parentMessageId": "",
  "contentEncoding": "UTF-8",
  "productIdDescription": "CXsfd)",
  "remoteIpAddress": "11.11.11.444",
  "sourceAddress": "1111111",
  "productId": "133",
  "subaccount": "SSDFDFICE",
  "messageId": "1119Z-0203T-1843Q-2137S",
  "userAgent": "V4HTTP",
  "messageStatus": "Delivered",
  "accountId": "112-11111",
  "internalMessageId": "11111-11111-11111-2137S",
  "companyId": "000-000-00000-00000",
  "phoneNumber": "1111111111111",
  "userDataHeader": "",
  "userDefined2": "fcbdf3de-96c1-42d9-96b5-8c92c3c8b1a7",
  "countryName": "Brazil",
  "caId": "111",
  "livup": "false",
  "userDefined1": "InvokeRId=111-11111-11111-IF20L-4R824-PSI",
  "pId": ""
}


Hi

2019-02-04T19:11:20,026 INFO [publish-0] io.druid.segment.realtime.appenderator.BaseAppenderatorDriver - Nothing to publish, skipping publish step.

Your guess is correct. This means no data was ingested, and thus there are no segments to create and publish.

If some columns exist in input data but not in the spec, Druid ignores them. If some columns exist in the spec but not in input data, Druid fills those columns with nulls.
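
For example (a hypothetical record, not one taken from your file): with the dimensionsSpec above, a row like the one below would still be ingested. The undeclared field is simply dropped, and any declared dimensions that are absent from the record (e.g. Score) are stored as null.

    {
      "acceptedDateBucket": "2019-02-03T18:43:00Z",
      "messageStatus": "Delivered",
      "someFieldNotInTheSpec": "ignored, because this column is not declared in dimensionsSpec"
    }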

I think this is more likely a parse error. Can you set logParseExceptions (http://druid.io/docs/latest/ingestion/native_tasks.html) to true and see if there's any parse error?

Jihoon

I updated as follows:

"ioConfig": {
  "type": "index",
  "firehose": {
    "type": "local",
    "baseDir": "/tmp",
    "filter": "kafka-file-dump.json.gz"
  },
  "appendToExisting": false
},
"tuningConfig": {
  "type": "index",
  "targetPartitionSize": 4000000,
  "maxRowsInMemory": 25000,
  "forceExtendableShardSpecs": true,
  "logParseExceptions": true,
  "reportParseExceptions": true
}


Now when I run it... well, it's odd: I'm not seeing any exceptions or any new errors.

The odd part is that when I look at the payload/log, it doesn't show the logParseExceptions field.

I know it is picking up the changed file, as it shows reportParseExceptions as true and the targetPartitionSize, which I changed from 5xxxx to 4xxxx.

i.e. it shows:

  "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "local",
        "baseDir" : "/tmp",
        "filter" : "kafka-file-dump.json.gz",
        "parser" : null
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index",
      "targetPartitionSize" : 4000000,
      "maxRowsInMemory" : 25000,
      "maxTotalRows" : 20000000,
      "numShards" : null,
      "indexSpec" : {
        "bitmap" : {
          "type" : "concise"
        },
        "dimensionCompression" : "lz4",
        "metricCompression" : "lz4",
        "longEncoding" : "longs"
      },
      "maxPendingPersists" : 0,
      "buildV9Directly" : true,
      "forceExtendableShardSpecs" : true,
      "forceGuaranteedRollup" : false,
      "reportParseExceptions" : true,
      "pushTimeout" : 0,
      "segmentWriteOutMediumFactory" : null
    }
  },

Hi,

You can check the ingestion reports of complete tasks via an overlord API (http://druid.io/docs/latest/ingestion/reports.html).

Would you please check it, especially the 'rowStats' part?

Jihoon

that yields a 404 :frowning:

Sending a GET to:

http://wdc-tst-bdrd-001:8090/druid/indexer/v1/task/index_sms_messages_index_2019-02-05T17:30:36.845Z

returns the task json payload

http://wdc-tst-bdrd-001:8090/druid/indexer/v1/task/index_sms_messages_index_2019-02-05T17:30:36.845Z/reports

returns 404…

Oh, what Druid version are you running?

hrm…0.12.1 - installed via HDP

Ah, yeah. The ingestion reports are available since 0.13.0.

I'm not sure why no rows were ingested. Would you post the task log if possible?

Jihoon

See Attached

tmp.txt (76.8 KB)

Thanks. I see the below log.

2019-02-05T18:29:40,961 INFO [task-runner-0-priority-0] io.druid.segment.realtime.firehose.LocalFirehoseFactory - Initialized with files

This means no input files were found. Would you please check again?
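
A quick sanity check, assuming the local firehose settings above: on the host that actually ran the task peon, confirm that a file matching baseDir plus filter really exists, e.g.

    # run this on the middleManager/peon host that executed the task
    ls -l /tmp/kafka-file-dump.json.gz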

Jihoon

Progress :slight_smile: (getting a parse exception at the moment; looks like I may have a truncated row somewhere; will dig into that in a bit).

So the issue was/is: I have this running on a cluster, and I switched to doing a batch load (vs. Kafka) to validate that the payload was good, but it decided to run the task on a node where the file wasn't located. So my question is: how can one batch load a file on a clustered Druid without having to copy the file to every node the task may run on?

And thanks big time for your help!

Indeed, the last row in the file was truncated; no biggie. But my question is: will this fail the entire load? Still not seeing the datasource appear; will try again without the last file.

Ah, if you're running a cluster, input files should be somewhere that all middleManagers can access. HDFS, NFS, or AWS S3 are popular options.
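
For example, if the file lives on HDFS, one option is the Hadoop-based batch task (task type "index_hadoop") instead of the native "index" task. A minimal sketch of its ioConfig, with a placeholder namenode host and path, might look like:

    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "hdfs://namenode-host:8020/tmp/kafka-file-dump.json.gz"
      }
    }

Alternatively, an NFS mount exposed at the same path on every middleManager lets you keep the local firehose spec you already have, unchanged.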

After a task is successfully finished, you should wait for a while for a new dataSource to appear. This is because the Druid coordinator periodically refreshes its metadata in memory.
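
You can also ask the Coordinator directly which datasources it currently sees, with something like:

    curl http://wdc-tst-bdrd-001.openmarket.com:8081/druid/coordinator/v1/datasources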

Jihoon