Delayed batch ingestion

Hi,

Our system currently handles about 400k events per day, ingested in real time via Kafka.

We have another system with more or less the same data that we would also like to ingest; however, we can only get this data pre-aggregated and with an hour's delay.

Is this solvable, and if so, how should we proceed?

regards,

Robin

Hi Robin,

I'm not sure what the question is. Do you want to ingest in real time via Kafka and then append the hour-late data to it? Is that your question?

You received this message because you are subscribed to the Google Groups “Druid User” group.


Slim Bouguerra, Ph.D

Yahoo! Software Dev Eng

1908 South First Street Champaign, IL 61820

Hi,

Sorry for the confusion.

“Do you want to ingest in real time via Kafka and then append the hour-late data to it? Is that your question?”

Yes, that’s correct. :slight_smile:

Thanks,

Robin

Hi,

Found this in another thread.

“The actual solution you want will come in Druid 0.9.1, where for Kafka-based ingestion, there will be no more windowPeriod and you can stream in any timestamp exactly-once. The RC should be out in the coming weeks and you should look into using the KafkaIndexTask.”

regards,

Robin

Hey Robin,

In terms of ingesting events that are an hour old from Kafka, using the KafkaIndexTask that will be released in 0.9.1 seems like a good way to go. If you want to read up on it, there are some docs available here: https://github.com/druid-io/druid/blob/master/docs/content/development/extensions-core/kafka-ingestion.md
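For anyone following along, a Kafka indexing service supervisor spec looks roughly like the sketch below. The datasource, topic, broker host, and field names here are all placeholders, not taken from this thread; see the linked docs for the authoritative field list:

```json
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "events",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "timestamp", "format": "auto" },
        "dimensionsSpec": { "dimensions": ["campaign"] }
      }
    },
    "metricsSpec": [
      { "type": "count", "name": "count" },
      { "type": "doubleSum", "name": "win", "fieldName": "win" }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "NONE"
    }
  },
  "tuningConfig": { "type": "kafka" },
  "ioConfig": {
    "topic": "events",
    "consumerProperties": { "bootstrap.servers": "kafka01:9092" },
    "taskCount": 1,
    "replicas": 1,
    "taskDuration": "PT1H"
  }
}
```

The spec is submitted by POSTing it to the Overlord's supervisor endpoint; the Overlord then manages the indexing tasks for you.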

Having part of your data pre-aggregated may or may not work for you, depending on what kind of queries you’re planning on issuing and what information is available in the pre-aggregated events. If you post more details about your use case, you may be able to get additional assistance.

This feature is super helpful, especially since it can ignore the windowPeriod. My data in Kafka is not in JSON format; it is in a custom format, List<MapEvent<Map<String, Object>>>, that has been serialized using Avro. That being said, is Tranquility the way for me to go, or has this capability been added to the Tranquility API somehow?

The Kafka indexing service is separate from Tranquility and runs directly on the Overlord.

I don’t have any experience using this extension, but if you’re using Avro it may be worth looking at: http://druid.io/docs/latest/development/extensions-core/avro.html. If it seems like it might be promising, there are others on the group who can probably answer more specific questions about it.
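As a rough, hypothetical sketch (the batch below stands in for one Avro-decoded Kafka message, and the field names are invented): if you can decode the List<Map<String, Object>> payload yourself, re-emitting each event map as one JSON object per line gives you something Druid's plain “json” parseSpec can read, independent of the original wire format.

```python
import json

# Hypothetical stand-in for one Kafka message after Avro decoding:
# the List<Map<String, Object>> payload becomes a list of dicts.
batch = [
    {"timestamp": "2016-05-01T12:00:00Z", "campaign": "a", "win": 1.5},
    {"timestamp": "2016-05-01T12:00:01Z", "campaign": "b", "win": 0.25},
]

# Re-emit one JSON object per line -- the newline-delimited shape that a
# "json" parseSpec (or any line-oriented JSON consumer) expects.
lines = [json.dumps(event, sort_keys=True) for event in batch]
print("\n".join(lines))
```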

Hi David,

Thanks, will read. :slight_smile:

Regarding the data: we have a couple of fields, win and loss, that we aggregate using doubleSum, doubleMin, and doubleMax. I think the easiest solution on our side is to simply divide the pre-aggregated values by the total number of pre-aggregated entries. The doubleMin and doubleMax values will be useless, but the more important doubleSum will still be correct.
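A tiny sketch of that arithmetic (all values made up): replacing an hour's events with count copies of sum/count keeps doubleSum exact, while flattening the extremes that doubleMin and doubleMax would otherwise see.

```python
# Hypothetical hour of raw events (values chosen for illustration).
raw = [2.0, 7.0, 3.0, 8.0]

# The delayed feed only delivers the pre-aggregated sum and entry count.
pre_sum, pre_count = sum(raw), len(raw)

# Workaround: ingest `pre_count` synthetic events, each carrying the average.
avg = pre_sum / pre_count
synthetic = [avg] * pre_count

# doubleSum over the synthetic events still matches the true total...
assert sum(synthetic) == sum(raw)

# ...but doubleMin/doubleMax now see only the average (5.0), not the
# true per-event extremes (2.0 and 8.0).
assert min(synthetic) == 5.0 and min(raw) == 2.0
assert max(synthetic) == 5.0 and max(raw) == 8.0
```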

Thanks,

Robin