Problem accessing /druid/indexer/v1/supervisor

I am submitting a kafka ingestion spec using:

curl -XPOST -H 'Content-Type: application/json' -d @quickstart/kafka_earliest_offset.json http://localhost:8090/druid/indexer/v1/supervisor

Here is the extension list from my common.runtime.properties file:

druid.extensions.loadList=["druid-hdfs-storage", "mysql-metadata-storage", "druid-histogram", "druid-kafka-eight", "kafka-indexing-service"]

And I have downloaded the "kafka-indexing-service" from here: https://jar-download.com/?detail_search=a%253A%2522druid-kafka-indexing-service%2522&search_type=2&a=druid-kafka-indexing-service

And I am getting this error:

Error 500

HTTP ERROR: 500

Problem accessing /druid/indexer/v1/supervisor. Reason:

    java.lang.IllegalArgumentException: No enum constant com.metamx.common.Granularity.PT5M

Powered by Jetty://

Here is my ingestion spec. Does anyone know what the problem is?

{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "pageviews-kafka",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "time",
          "format": "auto"
        },
        "dimensionsSpec": {
          "dimensions": ["url", "user"]
        }
      }
    },
    "metricsSpec": [
      {"name": "views", "type": "count"},
      {"name": "latencyMs", "fieldName": "latencyMs", "type": "doubleSum"}
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "PT5M",
      "queryGranularity": "NONE"
    }
  },
  "ioConfig": {
    "topic": "pageviews",
    "consumerProperties": {
      "bootstrap.servers": "localhost:9092"
    },
    "useEarliestOffset" : "false",
    "taskCount": 1,

“replicas”: 1,

“taskDuration”: “PT5M”

}
}

Hey,

segmentGranularity doesn't actually take arbitrary ISO 8601 durations; it requires one of the constants defined in the Granularity enum (the com.metamx.common.Granularity class from your error message).

To have 5-minute segments, replace "PT5M" with "FIVE_MINUTE".
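For reference, with that one change the granularitySpec from the spec above would read:

    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "FIVE_MINUTE",
      "queryGranularity": "NONE"
    }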

Oh ok! It is working now, thank you.

Also, just a clarification: what is the difference between taskDuration and segmentGranularity? I read in the docs that taskDuration is the length of time before tasks stop reading from Kafka and begin to publish their segments, while segmentGranularity is also used to determine when segments are created. So should both of these values be the same?

For example, if my taskDuration is 5 minutes, segments will be created every five minutes. And if my segmentGranularity is HOUR, will it then group the 12 segments created by the task duration (since there is 1 segment per 5 minutes) and combine those 12 segments into one big segment for that hour?

As you mentioned, taskDuration controls the lifetime of the indexing task and when it should stop reading and publish segments. segmentGranularity determines the time interval covered by each segment; as an example, for HOUR segmentGranularity you would get segments that look like this (regardless of the taskDuration):

dataSource_2016-08-31T16:00:00.000Z_2016-08-31T17:00:00.000Z_version_partitionNum
dataSource_2016-08-31T17:00:00.000Z_2016-08-31T18:00:00.000Z_version_partitionNum
dataSource_2016-08-31T18:00:00.000Z_2016-08-31T19:00:00.000Z_version_partitionNum

A segment can have multiple partitions, and if you set taskDuration to 5 mins and segmentGranularity to HOUR, you would get at least 12 partitions per hour (and many more if you had multiple Kafka partitions since we create a separate Druid partition for each Kafka partition), for example:

dataSource_2016-08-31T16:00:00.000Z_2016-08-31T17:00:00.000Z_version
dataSource_2016-08-31T16:00:00.000Z_2016-08-31T17:00:00.000Z_version_1
dataSource_2016-08-31T16:00:00.000Z_2016-08-31T17:00:00.000Z_version_2
dataSource_2016-08-31T16:00:00.000Z_2016-08-31T17:00:00.000Z_version_3

dataSource_2016-08-31T16:00:00.000Z_2016-08-31T17:00:00.000Z_version_10
dataSource_2016-08-31T16:00:00.000Z_2016-08-31T17:00:00.000Z_version_11

One of the pain points with the Kafka indexing service right now (that you’ll notice if you read other forum posts) is that it is not able to combine the 12 partitions into a big segment for the hour and so it can generate a lot of partitions. It is possible to merge them using a Hadoop indexing task (http://druid.io/docs/0.9.1.1/ingestion/update-existing-data.html) but hopefully in the near future Hadoop won’t be necessary.
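As a very rough sketch only (the interval, the re-aggregation of the stored "views" count as a longSum, and the overall shape are assumptions based on the spec earlier in this thread and the linked docs, not a tested spec), a Hadoop reindexing task that reads one hour back out of the datasource and rewrites it as HOUR segments would look something like:

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "pageviews-kafka",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {"column": "time", "format": "auto"},
          "dimensionsSpec": {"dimensions": ["url", "user"]}
        }
      },
      "metricsSpec": [
        {"name": "views", "fieldName": "views", "type": "longSum"},
        {"name": "latencyMs", "fieldName": "latencyMs", "type": "doubleSum"}
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "NONE",
        "intervals": ["2016-08-31T16:00:00Z/2016-08-31T17:00:00Z"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "dataSource",
        "ingestionSpec": {
          "dataSource": "pageviews-kafka",
          "intervals": ["2016-08-31T16:00:00Z/2016-08-31T17:00:00Z"]
        }
      }
    }
  }
}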

taskDuration and segmentGranularity don’t need to be the same, but it makes sense for taskDuration >= segmentGranularity to minimize the number of partitions created which will help with query performance.
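For example (illustrative values, not taken from your spec), pairing hourly segments with hour-long tasks would mean something like this in the dataSchema:

    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "NONE"
    }

together with this in the ioConfig:

    "taskDuration": "PT1H"

so that each task covers at least one full segment interval.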

Thank you for the detailed response. It has clarified my doubts.

Hi David,

I am running a local ingestion task on an NFS cluster from the master nodes. I run it from each master node separately (checking each in turn: if one does not work, I go to the next and try again). The curl command:

curl -X POST -H 'Content-Type: application/json' -u uname:pwd -d @tp.json http://localhost:8090/druid/indexer/v1/supervisor

Here tp.json is the file with the ingestion spec. I will attach the same file (referred to as 1) from my local machine.

Warning: Couldn't read data from file "tp.json", this makes an empty POST.

Error 500 Server Error

HTTP ERROR 500

Problem accessing /druid/indexer/v1/supervisor. Reason:

    Server Error

Caused by:

java.lang.NullPointerException

curl -X POST -H 'Content-Type: application/json' -u uname:pwd -d @tp.json http://localhost:8010/druid/indexer/v1/supervisor

Warning: Couldn't read data from file "tp.json", this makes an empty POST.

curl: (7) Failed connect to localhost:8010; Connection refused

1. When I execute the same curl command from a different master node, no error is displayed in the logs, the console, or the terminal, but I do not see any jobs running.

3. When I run from another master node:

curl -X POST -H 'Content-Type: application/json' -u uname:pwd -d @tp.json http://localhost:8081/druid/indexer/v1/supervisor

{"error":"Could not resolve type id 'index' into a subtype of [simple type, class org.apache.druid.indexing.overlord.supervisor.SupervisorSpec]: known type ids = [NoopSupervisorSpec, SupervisorSpec]\n at [Source: HttpInputOverHTTP@265434b7[c=1475,q=0,[0]=null,s=STREAM]; line: 1, column: 3]"}

Note: the data file has data only for one day, 28-07-2019. Should I make changes to the date?

JSON file:

{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "NFSdata",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : [
              "CPURank",
              "UserName",
              "AcctString",
              "ExpandAcctString",
              "AppID",
              "ClientID",
              "DefaultDatabase",
              "SpoolUsage",
              "StartTime",
              "ParseBlockTime",
              "QryResptime",
              "NumOfActiveAMPs",
              "NumResultRows",
              "SumCPU",
              "SumIO",
              "CPUSKW",
              "IOSKW",
              "PJI",
              "UII",
              "ImpactCPU",
              "Query_Text"
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "iso"
          }
        }
      },
      "metricsSpec" : [],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2016-06-27/2016-06-28"],
        "rollup" : false
      }
    },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "local",
        "baseDir" : "path",
        "filter" : "filename"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index",
      "targetPartitionSize" : 5000000,
      "maxRowsInMemory" : 25000,
      "forceExtendableShardSpecs" : true
    }
  }
}