Ingestion (HttpFirehose) Task completed but druid still requesting file?

Hi all, we are running ingestion tasks with the httpFirehose (Druid 0.15).

The task appears to run correctly, based on a ‘cron’ calling:

http://uswest2-prod-druid-master-001:8090/druid/indexer/v1/task/task_id/status/
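
That URL is just the overlord's task status API; roughly what the cron does, sketched in Python (illustrative, not the exact script):

```python
# Rough sketch of the cron check: poll the overlord's task status endpoint
# and look at the reported state. Not the exact script we run.
import requests

OVERLORD = "http://uswest2-prod-druid-master-001:8090"

def task_state(task_id):
    url = "{}/druid/indexer/v1/task/{}/status".format(OVERLORD, task_id)
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    # In 0.15 the state is nested under "status"; exact field names may vary
    # slightly between versions.
    return resp.json()["status"]["status"]

print(task_state("pipe-sds-to-file-1568741400000.json"))
```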

Also, checking the UI, it shows success.

What is odd to us is that we continue to see requests from Druid to the HTTP endpoint specified in the firehose!

We would think that once the task has completed, showing SUCCESS or FAILED, it would no longer send requests to that endpoint?

Any thoughts on what is going on here?

Example Task payload:

```
{
  "type": "index_parallel",
  "id": "pipe-sds-to-file-1568741400000.json",
  "resource": {
    "availabilityGroup": "pipe-sds-to-file-1568741400000.json",
    "requiredCapacity": 1
  },
  "spec": {
    "dataSchema": {
      "dataSource": "test-messages1",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "column": "acceptTs",
            "format": "iso"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "FIFTEEN_MINUTE",
        "queryGranularity": {
          "type": "none"
        },
        "rollup": false,
        "intervals": null
      },
      "transformSpec": {
        "filter": null,
        "transforms": []
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "firehose": {
        "type": "http",
        "uris": [
          "http://dps.openmarket.com/pipe-sds-to-file-1568741400000.json"
        ],
        "maxCacheCapacityBytes": 1073741824,
        "maxFetchCapacityBytes": 1073741824,
        "prefetchTriggerBytes": 536870912,
        "fetchTimeout": 900000,
        "maxFetchRetry": 3,
        "httpAuthenticationUsername": null,
        "httpAuthenticationPassword": null
      },
      "appendToExisting": false
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsPerSegment": null,
      "maxRowsInMemory": 1000000,
      "maxBytesInMemory": 0,
      "maxTotalRows": null,
      "numShards": null,
      "indexSpec": {
        "bitmap": {
          "type": "concise"
        },
        "dimensionCompression": "lz4",
        "metricCompression": "lz4",
        "longEncoding": "longs"
      },
      "maxPendingPersists": 0,
      "forceGuaranteedRollup": false,
      "reportParseExceptions": true,
      "pushTimeout": 0,
      "segmentWriteOutMediumFactory": null,
      "maxNumSubTasks": 1,
      "maxRetry": 3,
      "taskStatusCheckPeriodMs": 1000,
      "chatHandlerTimeout": "PT10S",
      "chatHandlerNumRetries": 5,
      "logParseExceptions": true,
      "maxParseExceptions": 0,
      "maxSavedParseExceptions": 1,
      "partitionDimensions": [],
      "buildV9Directly": true
    }
  },
  "context": {},
  "groupId": "pipe-sds-to-file-1568741400000.json",
  "dataSource": "test-messages1"
}
```
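
For completeness, the payload above is what the overlord echoes back for the task; the actual submission is just a POST of the spec to the overlord. A rough sketch of that step (illustrative; a submitted spec normally carries only the type/spec/context portion, with Druid filling in id, resource, and groupId, and task-spec.json is a hypothetical local file):

```python
# Sketch of the submission step (illustrative, not our exact pipeline code).
import json
import requests

OVERLORD = "http://uswest2-prod-druid-master-001:8090"

with open("task-spec.json") as f:
    spec = json.load(f)  # the index_parallel spec (type / spec / context)

resp = requests.post("{}/druid/indexer/v1/task".format(OVERLORD), json=spec)
resp.raise_for_status()
print(resp.json())  # the overlord responds with the assigned task id
```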

Hi Daniel:

I think this makes sense, because (1) Druid is not able to ingest an unlimited amount of data in one task, and (2) with the HTTP firehose it has no information on how much data will come out of the hose or how much data it has to download. So instead, it just opens a stream and keeps reading from that stream until it consumes all the data.
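
Conceptually it is something like this (just a sketch of the idea in Python, not Druid's actual firehose code):

```python
# Conceptual sketch only: an HTTP firehose has no size information up front,
# so it streams the response and hands rows to the parser until the server
# closes the stream.
import requests

def read_firehose(uri):
    with requests.get(uri, stream=True, timeout=900) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:
                yield line  # one JSON row per line

# Usage (parse_and_index is a stand-in for whatever consumes the rows):
# for row in read_firehose("http://dps.openmarket.com/pipe-sds-to-file-1568741400000.json"):
#     parse_and_index(row)
```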

Hope this helps

I do not think that is what is happening…

For example:

Druid state: no tasks are running; everything is either failed or success at the moment.

We shut down our http-server with the data files (overnight).

We start the http-server.

We see this immediately on startup:

```
2019-09-18 16:23:23,817 INFO [qtp531169818-63] MatcherFilter - The requested route [/pipe-sds-to-file-1568744100000.json] has not been mapped in Spark for Accept: [text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2]
2019-09-18 16:23:23,836 INFO [qtp531169818-60] MatcherFilter - The requested route [/pipe-sds-to-file-1568743200000.json] has not been mapped in Spark for Accept: [text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2]
2019-09-18 16:23:23,855 INFO [qtp531169818-59] MatcherFilter - The requested route [/pipe-sds-to-file-1568742300000.json] has not been mapped in Spark for Accept: [text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2]
2019-09-18 16:23:23,874 INFO [qtp531169818-63] MatcherFilter - The requested route [/pipe-sds-to-file-1568741400000.json] has not been mapped in Spark for Accept: [text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2]
2019-09-18 16:23:23,896 INFO [qtp531169818-59] MatcherFilter - The requested route [/pipe-sds-to-file-1568740500000.json] has not been mapped in Spark for Accept: [text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2]
2019-09-18 16:23:23,915 INFO [qtp531169818-63] MatcherFilter - The requested route [/pipe-sds-to-file-1568739600000.json] has not been mapped in Spark for Accept: [text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2]
2019-09-18 16:23:23,935 INFO [qtp531169818-59] MatcherFilter - The requested route [/pipe-sds-to-file-1568738700000.json] has not been mapped in Spark for Accept: [text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2]
2019-09-18 16:23:23,954 INFO [qtp531169818-60] MatcherFilter - The requested route [/pipe-sds-to-file-1568737800000.json] has not been mapped in Spark for Accept: [text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2]
2019-09-18 16:23:23,973 INFO [qtp531169818-63] MatcherFilter - The requested route [/pipe-sds-to-file-1568736900000.json] has not been mapped in Spark for Accept: [text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2]
```

etc…

We deleted the files long ago, since the tasks were flagged as success, and the UI shows no tasks are running.

So clearly, it is still ‘constantly’ attempting to poll for these files?

Checking the druid_task table in the metadata store, we see the last one from above shows active = false.
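
For reference, that spot check is along these lines (a sketch that assumes a MySQL metadata store and the default druid_tasks table name; adjust host, credentials, and table for your setup):

```python
# Sketch of the metadata-store spot check (assumes MySQL and the default
# druid_tasks table; the connection details here are placeholders).
import pymysql

conn = pymysql.connect(host="metadata-db-host", user="druid",
                       password="druid-password", database="druid")
with conn.cursor() as cur:
    cur.execute(
        "SELECT id, active, created_date FROM druid_tasks WHERE id = %s",
        ("pipe-sds-to-file-1568741400000.json",),
    )
    print(cur.fetchone())
conn.close()
```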

It appears this may be due to the UI.

If I reload the Druid UI, I see Druid issue requests for ALL files/tasks that were previously completed/failed!

I suspect/hope it won't actually do anything with those files if they are returned? (We delete the files when the task is completed.)
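
For context, the cleanup is roughly this (a sketch with a hypothetical data directory, not the exact job):

```python
# Sketch of the cleanup step (illustrative; DATA_DIR is a hypothetical path).
# Once a task reports SUCCESS or FAILED we delete its data file, which is why
# the later requests from Druid hit routes that no longer exist on our server.
import os
import requests

OVERLORD = "http://uswest2-prod-druid-master-001:8090"
DATA_DIR = "/var/www/pipe-files"  # hypothetical

def cleanup(task_id):
    url = "{}/druid/indexer/v1/task/{}/status".format(OVERLORD, task_id)
    state = requests.get(url, timeout=10).json()["status"]["status"]
    if state in ("SUCCESS", "FAILED"):
        path = os.path.join(DATA_DIR, task_id)
        if os.path.exists(path):
            os.remove(path)
```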