Can Druid handle arrays of JSON?

If we want to ingest multiple records inside of a single JSON array, is that possible? To give an example, I know that you can ingest data like this:

{"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "language": "ja", "user": "cancer"}

{"timestamp": "2013-08-31T12:41:28Z", "page": "Hotel", "language": "en", "user": "cancer"}

But I would like to ingest data like this:

[{"timestamp": "2013-08-31T12:41:28Z", "page": "Hotel", "language": "en", "user": "cancer"}, {"timestamp": "2013-08-31T12:41:28Z", "page": "Hotel", "language": "en", "user": "cancer"}]

Is this possible? All I want to do is have Druid unpack this array so that each element is a separate record.

Hi,
It's not possible to ingest a JSON array like this with the current JsonParser.

For now, I think your best bet would be to split the array into multiple messages in your ETL layer and ingest those into Druid.
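
To make the suggestion concrete, here is a minimal sketch of what that ETL step might look like in Python. It is not part of Druid; the function name `split_json_array` is made up for illustration, and it assumes the whole array fits in memory. Each string it yields would then be sent as its own message (or written on its own line for batch ingestion).

```python
import json
from typing import Iterator


def split_json_array(payload: str) -> Iterator[str]:
    """Yield one JSON object string per element of a top-level JSON array.

    Hypothetical helper for an ETL layer; not a Druid API.
    """
    records = json.loads(payload)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    for record in records:
        if not isinstance(record, dict):
            raise ValueError("expected each array element to be a JSON object")
        # Re-serialize each element as a standalone single-line record.
        yield json.dumps(record)


if __name__ == "__main__":
    batch = (
        '[{"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango"},'
        ' {"timestamp": "2013-08-31T12:41:28Z", "page": "Hotel"}]'
    )
    for message in split_json_array(batch):
        # In a real pipeline each message would go to the producer instead.
        print(message)
```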

Thanks for the reply. We’re using Kafka for realtime ingestion. Is it possible to have a Kafka message with multiple lines of JSON instead? We emit data periodically, with dozens of records at a time, and we want to avoid the network overhead of making so many discrete calls out to Kafka. For example, would it be possible to send this?

MESSAGE 1:

{"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "language": "ja", "user": "cancer"}
{"timestamp": "2013-08-31T12:41:28Z", "page": "Hotel", "language": "en", "user": "cancer"}

MESSAGE 2:

{"timestamp": "2013-08-31T12:41:29Z", "page": "Foxtrot", "language": "ja", "user": "cancer"}
{"timestamp": "2013-08-31T12:41:29Z", "page": "Golf", "language": "en", "user": "cancer"}

Here’s a readme for an upcoming capability that is at least tangentially related to this thread:

https://github.com/jon-wei/druid/blob/flat_json/docs/content/ingestion/flatten-json.md

But in general, if you have multiple events, putting each event on its own line is the only format that batch ingestion supports out of the box.

If you’re doing realtime, then the EventReceiverFirehoseFactory takes events in batches. Specifically, it tries to parse the events as a Collection<Map<String, Object>>.
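
As a rough illustration of what a Collection<Map<String, Object>> looks like on the wire, the sketch below serializes a list of event dictionaries into a single JSON array of objects. The helper name `build_batch_body` is hypothetical, and how the body is actually delivered to the firehose (endpoint, headers) is left out on purpose, since that depends on your setup.

```python
import json
from typing import Any, Dict, Iterable


def build_batch_body(events: Iterable[Dict[str, Any]]) -> bytes:
    """Serialize events as one JSON array of objects, UTF-8 encoded.

    Hypothetical helper; the array-of-objects shape is assumed to match
    what a Collection<Map<String, Object>> would be parsed from.
    """
    batch = []
    for event in events:
        if not isinstance(event, dict):
            raise TypeError("each event must be a dict of field name to value")
        batch.append(event)
    return json.dumps(batch).encode("utf-8")


if __name__ == "__main__":
    body = build_batch_body([
        {"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango",
         "language": "ja", "user": "cancer"},
        {"timestamp": "2013-08-31T12:41:28Z", "page": "Hotel",
         "language": "en", "user": "cancer"},
    ])
    print(body.decode("utf-8"))
```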

Good news, this will soon be possible!

The code is already merged into Druid-api. Using Druid to do the parsing will be less efficient than using a stream processor, but it will work.