Druid Kafka streaming

Hi,

If there is a better place to post this please let me know…

I’ve installed the following components of Hortonworks HDP-3.1.0.0 on a single node via Ambari:

HDFS 3.1.1

Zookeeper 3.4.6

Ambari Metrics 0.1.0

Kafka 2.0.0

SmartSense 1.5.1…2.7…3.0-139

Druid 0.12.1

The node has 8 vCPUs, 64 GB RAM and a 2 TB disk… so I don’t expect this thing to run nicely, I’m just playing. But if hardware is the culprit here, please point that out.

I have created the topic "testTopic" in Kafka, and the command reported that the topic is good.
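For reference, topic creation was along these lines (a sketch, not the exact command; the ZooKeeper port and the partition/replication flags are assumptions, and Kafka 2.0’s kafka-topics.sh still goes through ZooKeeper):

/usr/hdp/3.1.0.0-78/kafka/bin/kafka-topics.sh --create --zookeeper node00.bpeap.local:2181 --replication-factor 1 --partitions 1 --topic testTopic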

I’ve written a producer in Python that writes to testTopic via a kafka-python producer object.

The producer emits two string fields: a datetime-derived time_stamp and a modulo-int heart_beat, both cast to str. A rough sketch of it follows.
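This is a reconstruction rather than the exact script; the modulus, sleep interval, and serializer are assumptions, while the field names and broker address come from the output below:

import json
import time
from datetime import datetime

from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="node00.bpeap.local:6667",
    # serialize each dict as a JSON-encoded UTF-8 payload
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

i = 0
while True:
    msg = {
        "time_stamp": str(datetime.now()),  # e.g. "2019-01-17 16:14:26.526608"
        "heart_beat": str(i % 10),          # the modulus is a guess
    }
    for key, value in msg.items():
        print(key, value, type(value))      # the debug lines shown below
    producer.send("testTopic", msg)
    i += 1
    time.sleep(2)  # matches the ~2 s gaps between the timestamps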

I run the producer, and get the following (very helpful) output:

time_stamp 2019-01-17 16:14:26.526608 <class 'str'>
heart_beat 0 <class 'str'>
time_stamp 2019-01-17 16:14:28.529227 <class 'str'>
heart_beat 1 <class 'str'>
time_stamp 2019-01-17 16:14:30.531801 <class 'str'>
heart_beat 2 <class 'str'>
time_stamp 2019-01-17 16:14:32.534363 <class 'str'>
heart_beat 3 <class 'str'>

I run the following Kafka command to test that the data is getting into Kafka:

/usr/hdp/3.1.0.0-78/kafka/bin/kafka-console-consumer.sh --bootstrap-server node00.bpeap.local:6667 --topic testTopic --from-beginning

And I get a nice rolling output on screen of the data my producer has passed…

{"time_stamp": "2019-01-17 16:14:26.526608", "heart_beat": "0"}
{"time_stamp": "2019-01-17 16:14:28.529227", "heart_beat": "1"}
{"time_stamp": "2019-01-17 16:14:30.531801", "heart_beat": "2"}
{"time_stamp": "2019-01-17 16:14:32.534363", "heart_beat": "3"}

So the Kafka part looks good.

I want to ingest the data from Kafka into Druid.

I use the following command:

root@node00:~# curl -XPOST -H 'Content-Type: application/json' -d @data_spec.json http://localhost:8090/druid/indexer/v1/supervisor

And get the following reassuring output:

{"id":"testTopic"}

Where data_spec.json looks like:

{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "testTopic",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "time",
          "format": "auto"
        },
        "dimensionsSpec": {
          "dimensions": [
            "time_stamp",
            "heart_beat"
          ]
        }
      }
    },
    "metricsSpec": [],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "second",
      "queryGranularity": "second",
      "rollup": false
    }
  },
  "tuningConfig": {
    "type": "kafka",
    "reportParseExceptions": false
  },
  "ioConfig": {
    "topic": "testTopic",
    "replicas": 2,
    "taskDuration": "PT10M",
    "completionTimeout": "PT20M",
    "consumerProperties": {
      "bootstrap.servers": "node00.bpeap.local:6667"
    }
  }
}
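As a sanity check on the ingestion side, the overlord’s standard supervisor endpoints (same port as the POST above) will list supervisors and report their status, e.g.:

curl http://localhost:8090/druid/indexer/v1/supervisor
curl http://localhost:8090/druid/indexer/v1/supervisor/testTopic/status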

I want to test the Kafka-to-Druid hookup, so I run what I believe is the equivalent of a SQL select *.

Entering:

root@node00:~# curl -X POST -H 'Content-Type: application/json' -d @select.json http://node00:8083/druid/v2?pretty

Gives me the empty set:

[]

Which is reassuring in that I am at least getting something back. Curling against :8082 gives similar results.

The json for the select query is:

{
  "queryType": "select",
  "dataSource": "testTopic",
  "descending": false,
  "dimensions": [],
  "metrics": [],
  "granularity": "all",
  "intervals": [
    "2019-01-10/2019-01-20"
  ],
  "pagingSpec": {"pagingIdentifiers": {}, "threshold": 5}
}
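For reference, a segmentMetadata query (a standard Druid query type, not something specific to this setup) is a quick way to check whether any segments exist at all for the datasource:

{
  "queryType": "segmentMetadata",
  "dataSource": "testTopic",
  "intervals": ["2019-01-10/2019-01-20"]
}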

I’m pretty sure it’s my query that’s the problem here. Can someone help me find my missing data?

PS I’ve tried running through the simple Kafka data ingestion and query tutorials on druid.io… but for whatever reason, my installation fails on syntax errors and bad identifiers in the JSON.

I’ve also noticed Hortonworks doesn’t seem to include dsql, which could help me right now.

Thanks in advance,

Nick

Hey Nick,

Your query looks OK for testing. In your Kafka ingestion spec, the timestampSpec specifies the column "time", but in the data you’ve shared the timestamp field is named "time_stamp".
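That is, the timestampSpec should read:

"timestampSpec": {
  "column": "time_stamp",
  "format": "auto"
}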

Hopefully changing that should resolve your issue.

Best regards,

Dylan

That worked! Super Huge Thanks!!