Quickstart question -- streaming data

Hi,

I’ve been working through the Druid quickstart (http://druid.io/docs/0.9.1.1/tutorials/quickstart.html) and most of it has worked just fine. However, I am having trouble with the streaming data ingestion step. I start Tranquility just fine and run the generate-example-metrics script. In the Tranquility log, I see that it creates an indexing task for the overlord that looks right [1], and I see the indexing tasks running in the overlord console.
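For reference, this is roughly what I’m running, following the quickstart (exact paths and the Tranquility HTTP port depend on how things are unpacked and configured on my machine):

  bin/tranquility server -configFile ../druid-0.9.1.1/conf-quickstart/tranquility/server.json

and then, in another terminal from the Druid directory:

  bin/generate-example-metrics | curl -XPOST -H 'Content-Type: application/json' --data-binary @- http://localhost:8200/v1/post/metrics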

But I don’t see that the metrics data source has been created. For example, "http://<coordinator>:8081/druid/coordinator/v1/metadata/datasources?includeDisabled" only lists the wikiticker data source that I created earlier in the quickstart.
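Concretely, the check I’m doing (with the coordinator on its default port 8081) is:

  curl 'http://<coordinator>:8081/druid/coordinator/v1/metadata/datasources?includeDisabled'

and the response only contains "wikiticker".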

Am I doing something wrong or perhaps not understanding how the streaming data ingestion is supposed to work?

Thanks! -Aaron

[1]

{
  "type" : "index_realtime",
  "id" : "index_realtime_metrics_2016-08-23T15:00:00.000Z_0_0",
  "resource" : {
    "availabilityGroup" : "metrics-2016-08-23T15:00:00.000Z-0000",
    "requiredCapacity" : 1
  },
  "spec" : {
    "dataSchema" : {
      "dataSource" : "metrics",
      "parser" : {
        "type" : "map",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : {
            "column" : "timestamp",
            "format" : "millis",
            "missingValue" : null
          },
          "dimensionsSpec" : {
            "dimensionExclusions" : [ "count", "timestamp", "value_min", "value_max", "value", "value_sum" ],
            "spatialDimensions" : [ ]
          }
        }
      },
      "metricsSpec" : [ {
        "type" : "count",
        "name" : "count"
      }, {
        "type" : "doubleSum",
        "name" : "value_sum",
        "fieldName" : "value"
      }, {
        "type" : "doubleMin",
        "name" : "value_min",
        "fieldName" : "value"
      }, {
        "type" : "doubleMax",
        "name" : "value_max",
        "fieldName" : "value"
      } ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "HOUR",
        "queryGranularity" : {
          "type" : "none"
        }
      }
    },
    "ioConfig" : {
      "type" : "realtime",
      "plumber" : null,
      "firehose" : {
        "type" : "clipped",
        "interval" : "2016-08-23T15:00:00.000Z/2016-08-23T16:00:00.000Z",
        "delegate" : {
          "type" : "timed",
          "shutoffTime" : "2016-08-23T16:15:00.000Z",
          "delegate" : {
            "type" : "receiver",
            "serviceName" : "firehose:druid:overlord:metrics-015-0000-0000",
            "bufferSize" : 100000
          }
        }
      }
    },
    "tuningConfig" : {
      "shardSpec" : {
        "type" : "linear",
        "partitionNum" : 0
      },
      "rejectionPolicy" : {
        "type" : "none"
      },
      "buildV9Directly" : false,
      "maxPendingPersists" : 0,
      "intermediatePersistPeriod" : "PT10M",
      "windowPeriod" : "PT10M",
      "type" : "realtime",
      "maxRowsInMemory" : "100000"
    }
  }
}

Hey Aaron,

The /druid/coordinator/v1/metadata/datasources path only lists datasources that are being served by historical nodes. When streaming data through Tranquility, queries are handled by the peons running on the middle manager nodes until the data is handed off to the historicals, which in your case would happen about every hour (since the segmentGranularity is HOUR).
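If you want to confirm that handoff has happened once the hour (plus the windowPeriod) has passed, you can check the coordinator’s list of served datasources, for example:

  curl 'http://<coordinator>:8081/druid/coordinator/v1/datasources'

After handoff, "metrics" should show up there as well.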

To see if the data was ingested properly, you can issue a Druid query to the broker; a straightforward one would be a timeseries query similar to the following:

{
  "queryType": "timeseries",
  "dataSource": "metrics",
  "intervals": ["2016-01-01/2020-01-01"],
  "granularity": "all",
  "aggregations": [
    {
      "type": "longSum",
      "fieldName": "count",
      "name": "count"
    }
  ]
}

You can put this in a file and post it using:

curl -X POST '<queryable_host>:<port>/druid/v2/?pretty' -H 'Content-Type:application/json' -d @<query_json_file>
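With the quickstart’s default ports, that would look something like this (assuming the query above is saved as metrics-query.json and the broker is on localhost:8082):

  curl -X POST 'http://localhost:8082/druid/v2/?pretty' -H 'Content-Type:application/json' -d @metrics-query.json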

Alternatively, the Imply Analytics Platform bundles Druid with Pivot, which automatically introspects your datasources and lets you explore your data visually. If you’re interested, you can check it out at: https://imply.io/

Thank you for the clear explanation, David! I did see the streaming data appear on the historicals after handoff, and your query is a great starting point.

Best, Aaron

I just wanted to add that if you are interested in learning more about exactly-once streaming ingestion from Kafka to Druid, there is a nice tutorial that David wrote here: https://imply.io/docs/latest/tutorial-kafka-indexing-service.html