Druid Wikipedia edits: getting the real-time data


I am trying to load Wikipedia edits into Druid in real time; that is, I want the edits that are being made on Wikipedia right now. I understand that I need to set up Kafka and load the streaming data from there. But in the documentation and tutorials, all I see is pasting sample data after changing its timestamps to the current time. The sample data to be edited is given below:

{"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
{"timestamp": "2013-08-31T03:32:45Z", "page": "Striker Eureka", "language" : "en", "user" : "speed", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Australia", "country":"Australia", "region":"Cantebury", "city":"Syndey", "added": 459, "deleted": 129, "delta": 330}
{"timestamp": "2013-08-31T07:11:21Z", "page": "Cherno Alpha", "language" : "ru", "user" : "masterYi", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"article", "continent":"Asia", "country":"Russia", "region":"Oblast", "city":"Moscow", "added": 123, "deleted": 12, "delta": 111}
{"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900}
{"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "language" : "ja", "user" : "stringer", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"Japan", "region":"Kanto", "city":"Tokyo", "added": 1, "deleted": 10, "delta": -9}

I would like to get the data that is currently being edited. I tried Imply, whose tutorial does the same thing: change the timestamps and paste the data. Neither explains how to set up Wikipedia edits as a real-time data source.

Can someone help me get the real-time data, or point me to a tutorial for setting it up?

If anyone has done this, could I get the tarball and the instructions so I can repeat your process?

Thank you.


These examples no longer work with Druid 0.9.0. I believe the Druid/Imply docs you’ve already read cover how to load streaming data into Druid; the problem you have is actually how to get the Wikipedia data, which really has nothing to do with Druid itself. For this, you can either try https://github.com/implydata/wikiticker or read from the Wikipedia API and build a simple HTTP emitter for this data.
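
For the second option, here is a minimal sketch of what reading the Wikipedia API could look like. It polls the MediaWiki recent-changes list and prints one JSON object per line in roughly the shape of the sample data in the question. Several things here are my own assumptions rather than anything from the Druid docs: the function names, the User-Agent string, the namespace mapping, and the way added/deleted are derived (the API only reports the page size before and after, so these are net values, not true per-edit counts). The geo fields (continent, country, region, city) and unpatrolled are left out because this API call does not provide them. How you ship the lines on to Kafka or Druid is up to your setup; this only prints them.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"

# Assumption: only two namespace ids are given names, to mirror the sample data.
NAMESPACE_NAMES = {0: "article", 4: "wikipedia"}


def _flag(change, key):
    """Return "true"/"false" as strings, like the sample data.

    Handles both API output styles: booleans (formatversion=2) and
    the legacy style where a flag is present as an empty string.
    """
    return "true" if key in change and change[key] is not False else "false"


def to_druid_event(change, language="en"):
    """Map one recent-changes entry to a flat event (hypothetical schema)."""
    delta = (change.get("newlen") or 0) - (change.get("oldlen") or 0)
    ns = change.get("ns", 0)
    return {
        "timestamp": change["timestamp"],
        "page": change["title"],
        "language": language,
        "user": change.get("user", ""),
        "newPage": _flag(change, "new"),
        "robot": _flag(change, "bot"),
        "anonymous": _flag(change, "anon"),
        "namespace": NAMESPACE_NAMES.get(ns, str(ns)),
        # Net size change only; the API does not report real added/deleted counts.
        "added": max(delta, 0),
        "deleted": max(-delta, 0),
        "delta": delta,
    }


def fetch_recent_changes(limit=50):
    """Fetch the latest edits and page creations from the MediaWiki API."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|user|timestamp|sizes|flags",
        "rctype": "edit|new",
        "rclimit": str(limit),
        "format": "json",
        "formatversion": "2",
    }
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Assumption: placeholder User-Agent; replace it with one identifying you.
    req = urllib.request.Request(
        url, headers={"User-Agent": "druid-wikipedia-example/0.1"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        data = json.load(resp)
    return data["query"]["recentchanges"]


if __name__ == "__main__":
    # One JSON object per line, ready to pipe into whatever producer you use.
    for change in fetch_recent_changes():
        print(json.dumps(to_druid_event(change)))
```

Run in a loop (with a sleep between polls and some de-duplication on your side), this gives a steady stream of current edits instead of the pasted sample rows.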