Batch Ingestion via POST

Hi,

This is perhaps a very basic question, but I’m testing out loading some web traffic logs and other derived data into Druid, and I’d like to be able to take a file that has records - one line of JSON per record - and POST it directly to an indexer (not realtime - I have a set of about 10 log files, one per hour, and each would be a batch for that hour).

Looking at the documentation:

http://druid.io/docs/latest/ingestion/batch-ingestion.html

http://druid.io/docs/latest/ingestion/firehose.html

I see that the “local” firehose type lets you read from local files on disk. But in my case it would be useful to actually post the content over HTTP (the indexer and the machine containing the logs are not the same machine) - not one message at a time (as would seem to be the case with EventReceiverFirehose), but just sending the whole batch over. Is there some option for that? I realize it may be quite a bit of data and perhaps doing it over HTTP is not the greatest, but I wanted to see if there is some feature I’m missing.

Best, Brad

Hi Brad,

With the event receiver firehose you can post a whole batch of events (Tranquility actually uses the same interface to push events)… something like:

[{event1}, {event2}, …]
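
For illustration, once a task with the receiver firehose is running, a single POST along these lines should push the whole batch (host, port, and the service name here are placeholders - the path assumes the event receiver firehose’s chat handler endpoint, which is the same one Tranquility talks to and is normally located via service discovery):

curl -X POST 'http://<task-host>:<task-port>/druid/worker/v1/chat/<serviceName>/push-events' \
  -H 'Content-Type: application/json' \
  -d '[{"timestamp": "2014-01-01T00:00:00Z", "dim1": "a", "metric1": 1}, {"timestamp": "2014-01-01T00:00:01Z", "dim1": "b", "metric1": 2}]'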

– Himanshu

Thanks - that would do it. Will try that out.

Picking this back up, I am running into this issue now with the event receiver firehose:

I am using an overlord node in local mode to attempt to load realtime data with an EventReceiverFirehose. As a result, I am getting:

Problem accessing /druid/indexer/v1/task. Reason:

    com.google.inject.ProvisionException: Guice provision errors:


  1. Error in custom provider, com.metamx.common.ISE: Cannot add a handler after the Lifecycle has started, it doesn’t work that way.
    at io.druid.guice.DruidProcessingModule.getProcessingExecutorService(DruidProcessingModule.java:90)
    at io.druid.guice.DruidProcessingModule.getProcessingExecutorService(DruidProcessingModule.java:90)
    while locating java.util.concurrent.ExecutorService annotated with @io.druid.guice.annotations.Processing()
    for parameter 0 at io.druid.query.IntervalChunkingQueryRunnerDecorator.<init>(IntervalChunkingQueryRunnerDecorator.java:38)
    while locating io.druid.query.IntervalChunkingQueryRunnerDecorator
    for parameter 0 at io.druid.query.timeseries.TimeseriesQueryQueryToolChest.<init>(TimeseriesQueryQueryToolChest.java:72)
    at io.druid.guice.QueryToolChestModule.configure(QueryToolChestModule.java:71)
    while locating io.druid.query.timeseries.TimeseriesQueryQueryToolChest
    for parameter 0 at io.druid.query.timeseries.TimeseriesQueryRunnerFactory.<init>(TimeseriesQueryRunnerFactory.java:51)
    at io.druid.guice.QueryRunnerFactoryModule.configure(QueryRunnerFactoryModule.java:80)
    while locating io.druid.query.timeseries.TimeseriesQueryRunnerFactory
    while locating io.druid.query.QueryRunnerFactory annotated with @com.google.inject.multibindings.Element(setName=,uniqueId=18, type=MAPBINDER)
    at io.druid.guice.DruidBinders.queryRunnerFactoryBinder(DruidBinders.java:36)
    while locating java.util.Map<java.lang.Class<? extends io.druid.query.Query>, io.druid.query.QueryRunnerFactory>
    for parameter 0 at io.druid.query.DefaultQueryRunnerFactoryConglomerate.<init>(DefaultQueryRunnerFactoryConglomerate.java:34)
    while locating io.druid.query.DefaultQueryRunnerFactoryConglomerate
    at io.druid.guice.StorageNodeModule.configure(StorageNodeModule.java:53)
    while locating io.druid.query.QueryRunnerFactoryConglomerate


I don’t follow the guts of this thing enough to understand how I’m creating a lifecycle problem with this request.
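
For context, by “local mode” I mean the overlord runs indexing tasks inside its own process rather than handing them off to middle managers. A minimal runtime.properties roughly like the following illustrates the setup (these values are illustrative, not my exact config):

druid.service=overlord
druid.port=8090
# run tasks inside the overlord JVM instead of forking them out to middle managers
druid.indexer.runner.type=local
# keep task bookkeeping in memory
druid.indexer.storage.type=local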

The steps I am doing to get this are:

This is my task:

rt-test.json:

{
  "type": "index_realtime",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "page",
              "language",
              "user",
              "unpatrolled",
              "newPage",
              "robot",
              "anonymous",
              "namespace",
              "continent",
              "country",
              "region",
              "city"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "count"
        },
        {
          "type": "doubleSum",
          "name": "added",
          "fieldName": "added"
        },
        {
          "type": "doubleSum",
          "name": "deleted",
          "fieldName": "deleted"
        },
        {
          "type": "doubleSum",
          "name": "delta",
          "fieldName": "delta"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE"
      }
    },
    "ioConfig": {
      "type": "realtime",
      "firehose": {
        "type": "receiver",
        "serviceName": "testService",
        "bufferSize": 10000
      },
      "plumber": {
        "type": "realtime"
      }
    },
    "tuningConfig": {
      "type": "realtime",
      "maxRowsInMemory": 500000,
      "intermediatePersistPeriod": "PT10m",
      "windowPeriod": "PT10m",
      "basePersistDirectory": "/tmp/realtime/basePersist",
      "rejectionPolicy": {
        "type": "serverTime"
      }
    }
  }
}


And I am submitting it with:

curl -X POST 'http://localhost:8090/druid/indexer/v1/task' -H 'Content-Type: application/json' -d @rt-test.json


Any input?

I realized after I posted this that I jumped subjects a bit - this is a realtime example and my original post was about batch data. I ended up doing batch processing via local files; at this point I’m trying to get realtime to work, hence this question. (Sorry about the thread misuse!)

Hi Brad, trying to stream files directly into Druid is going to be extremely difficult until https://groups.google.com/forum/#!searchin/druid-development/windowperiod/druid-development/kHgHTgqKFlQ/fXvtsNxWzlMJ is completed.

For the time being, if you have a static set of data, I recommend using batch ingestion, and if you have a stream of current-time data, realtime ingestion.
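
For reference, the “local” firehose path you mentioned using for the hourly files would look roughly like this inside an “index” task’s ioConfig (the directory and filter are placeholders, and the dataSchema would have the same shape as in your realtime spec above):

"ioConfig": {
  "type": "index",
  "firehose": {
    "type": "local",
    "baseDir": "/path/to/hourly/logs",
    "filter": "*.json"
  }
}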