File ingestion fails - is my ingestion spec correct?

Hi, I’m new to Druid and trying to get data ingestion to work.

Basically, my ingestion tasks fail and I can’t tell why.

Could someone confirm if my ingestion spec is correct?

https://paste.fedoraproject.org/462000/92534147/raw/

Thx

Here’s the log: https://paste.fedoraproject.org/462024/47759586/raw/

It looks like the overlord throws a NullPointerException, but that’s about it.

2016-10-27T16:58:56,926 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[IndexTask{id=index_lookups_2016-10-27T16:58:53.849Z, type=index, dataSource=lookups}]
java.lang.NullPointerException
	at io.druid.indexing.common.task.IndexTask.getDataIntervals(IndexTask.java:242) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexing.common.task.IndexTask.run(IndexTask.java:200) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_40]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_40]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_40]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_40]

Hey JC,

The ingestion spec for an IndexTask doesn’t take an inputSpec but takes a firehose (inputSpec is for Hadoop tasks). Also, I don’t know if that was just an example, but as far as I know the index task can’t read data from HTTP. Assuming that you have a local file, here’s a spec that should get you on the right track:

{
  "type": "index",
  "spec": {
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/path/to/file/",
        "filter": "ws8_2016-10-24_14-10-02_2716.json"
      }
    },
    "dataSchema": {
      "dataSource": "lookups",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "intervals": [
          "2015-09-12/2017-09-13"
        ]
      },
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "dimensionsSpec": {
            "dimensions": [
              "word",
              "is_mwu",
              "host",
              "ref"
            ]
          },
          "timestampSpec": {
            "format": "auto",
            "column": "timestamp"
          }
        }
      },
      "metricsSpec": [
        {
          "name": "count",
          "type": "count"
        }
      ]
    }
  }
}

Thanks David!

@David

Hmm, I ran into another issue: it says “Parameter ‘directory’ is not a directory”.

I think this is because the Druid cluster runs in Docker, so the file doesn’t exist inside the indexer host.

Is there any way to serve the file over HTTP, S3, FTP, etc.?

Hey,

Ah okay, that makes sense. You can mount directories from the host inside the container; see the Docker docs for more info: https://docs.docker.com/engine/tutorials/dockervolumes/

As for alternatives, AFAIK there’s no ingestion mechanism to read from HTTP or FTP. There is an extension for reading from files stored in S3 described here: http://druid.io/docs/0.9.1.1/development/extensions-core/s3.html

@David

Thanks for your help. I’m trying to figure out the proper naming convention for files in S3.

Here’s what I’m doing right now:

s3://word-lookup/57c2a3e4b0c4/2016/10/17-31-20_l1.json.gz

Where

bucket = word-lookup

S3 file = 57c2a3e4b0c4/2016/10/17-31-20_l1.json.gz

^Is that correct?

That looks correct to me; the first level in the URI denotes the S3 bucket name.
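
If it helps, here is a rough sketch of how the ioConfig part of the spec might look with the static-s3 firehose described in the S3 extension docs linked above. It assumes the druid-s3-extensions extension is listed in druid.extensions.loadList and that S3 credentials are configured on the indexing nodes. I haven’t run it against your cluster, and the rest of the spec (dataSchema and so on) would stay as before:

```json
{
  "type": "index",
  "firehose": {
    "type": "static-s3",
    "uris": [
      "s3://word-lookup/57c2a3e4b0c4/2016/10/17-31-20_l1.json.gz"
    ]
  }
}
```

This object is meant to replace the value of "ioConfig" in the earlier spec, so the "local" firehose with its baseDir and filter goes away entirely.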

@David

Hmm, so the naming is correct then. This is what I see in the logs when running the ingestion.

There’s a NullPointerException. Any idea what I’m doing wrong?

Oct 31, 2016 1:48:24 PM org.hibernate.validator.internal.util.Version <clinit>
INFO: HV000001: Hibernate Validator 5.1.3.Final
2016-10-31T13:48:25,540 WARN [main] org.apache.curator.retry.ExponentialBackoffRetry - maxRetries too large (30). Pinning to 29
2016-10-31T13:48:26,381 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[IndexTask{id=index_lookups_2016-10-31T13:48:23.462Z, type=index, dataSource=lookups}]
java.lang.NullPointerException
	at io.druid.indexing.common.task.IndexTask.getDataIntervals(IndexTask.java:242) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexing.common.task.IndexTask.run(IndexTask.java:200) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_40]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_40]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_40]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_40]
Oct 31, 2016 1:48:26 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering com.fasterxml.jackson.jaxrs.json.JacksonJsonProvider as a provider class
Oct 31, 2016 1:48:26 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering com.fasterxml.jackson.jaxrs.smile.JacksonSmileProvider as a provider class
Oct 31, 2016 1:48:26 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering io.druid.server.initialization.jetty.CustomExceptionMapper as a provider class
Oct 31, 2016 1:48:26 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering io.druid.server.StatusResource as a root resource class
Oct 31, 2016 1:48:26 PM com.sun.jersey.server.impl.application.WebApplicationImpl _initiate
INFO: Initiating Jersey application, version 'Jersey: 1.19 02/11/2015 03:25 AM'
Oct 31, 2016 1:48:26 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding io.druid.server.initialization.jetty.CustomExceptionMapper to GuiceManagedComponentProvider with the scope "Singleton"
Oct 31, 2016 1:48:26 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding com.fasterxml.jackson.jaxrs.json.JacksonJsonProvider to GuiceManagedComponentProvider with the scope "Singleton"
Oct 31, 2016 1:48:26 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding com.fasterxml.jackson.jaxrs.smile.JacksonSmileProvider to GuiceManagedComponentProvider with the scope "Singleton"
Oct 31, 2016 1:48:26 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding io.druid.server.QueryResource to GuiceInstantiatedComponentProvider
Oct 31, 2016 1:48:26 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding io.druid.segment.realtime.firehose.ChatHandlerResource to GuiceInstantiatedComponentProvider
Oct 31, 2016 1:48:26 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding io.druid.query.lookup.LookupListeningResource to GuiceInstantiatedComponentProvider
Oct 31, 2016 1:48:26 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding io.druid.query.lookup.LookupIntrospectionResource to GuiceInstantiatedComponentProvider
Oct 31, 2016 1:48:26 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding io.druid.server.http.security.StateResourceFilter to GuiceInstantiatedComponentProvider
Oct 31, 2016 1:48:26 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding io.druid.server.StatusResource to GuiceManagedComponentProvider with the scope "Undefined"

I’ll open a new thread - this one is too long to follow.