Lambda Architecture: Synchronize Batch / Realtime

Hello everyone,

I would like to know if you have any insights on how to synchronize batch and real-time ingestion.

Here is my use case:

  • using the Kafka indexing service to update Druid in real time during the day (several segments may be created)

  • running a batch at night that overwrites those segments (segment granularity DAY)

The batch is the source of truth, which is why it overwrites the real-time data. However, the batch scope can differ from the real-time scope, so some real-time data in the segment has to be kept for several days.

  • Batch
      • data => S3 => IngestionSpec => Druid

  • Real time
      • data => kafka
      • SupervisorSpec => Druid
      • S3 (streaming app)
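For reference, the nightly overwrite could be expressed roughly as follows. This is only a sketch: it assumes a recent Druid version with native batch ingestion (`index_parallel`) and JSON files on S3. The datasource name, the S3 prefix and the `timestamp` column are placeholders, not taken from my actual setup.

```python
from datetime import date, timedelta


def day_interval(day: date) -> str:
    """Return the ISO-8601 interval covering one calendar day."""
    return f"{day.isoformat()}/{(day + timedelta(days=1)).isoformat()}"


def build_overwrite_spec(datasource: str, s3_prefix: str, day: date) -> dict:
    """Build a native batch spec intended to replace the segments of one day."""
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Assumed timestamp column; adapt to the real schema.
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                # Placeholder; list the real dimensions here.
                "dimensionsSpec": {"dimensions": []},
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "DAY",
                    "queryGranularity": "NONE",
                    # The task is restricted to this interval.
                    "intervals": [day_interval(day)],
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {"type": "s3", "prefixes": [s3_prefix]},
                "inputFormat": {"type": "json"},
                # Overwrite rather than append to the existing segments.
                "appendToExisting": False,
            },
            "tuningConfig": {"type": "index_parallel"},
        },
    }
```

For example, `build_overwrite_spec("events", "s3://my-bucket/events/", date(2020, 1, 31))` targets the interval `2020-01-31/2020-02-01`, so the real-time segments of that day are replaced by the batch output once the task succeeds.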

Here are the scenarios I can think of:

  • Real time always up
      • Data is stored on S3 (batch)
      • Data from Kafka is stored on S3 and read by the supervisor spec (real time)
      • At batch execution, all S3 data (batch and real time) is sent to Druid (overwrite)
      => Possible data loss (if the real-time ingestion has consumed messages directly from Kafka during the batch execution)

  • Stop real time at batch execution
      • Data is stored on S3 (batch)
      • Data from Kafka is stored on S3 and read by the supervisor spec (real time)
      • At batch execution
          • the supervisor is suspended
          • all S3 data (batch and real time) is sent to Druid (overwrite)
      => Possible duplicate data: there is no synchronization between the streaming app that writes to S3 and the supervisor. While the supervisor is stopped, the streaming app can keep reading data from Kafka, so when the supervisor is restarted it will process the same data twice.

  • Secure on S3 before real-time Druid
      • Data is stored on S3 (batch)
      • Data from Kafka is stored on S3 THEN read by the supervisor spec (real time). A new topic is needed: kafka_in => streaming app => kafka_out => SupervisorSpec
      • At batch execution
          • the streaming app is stopped
          • the supervisor must have read all messages in kafka_out
          • all S3 data (batch and real time) is sent to Druid (overwrite)
      => OK, but increases latency
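The coordination steps in the second and third scenarios could be sketched like this. It is a rough outline, not a tested workflow. The `http` callable is a hypothetical wrapper around an HTTP client pointed at the Overlord. The endpoint paths are the Druid supervisor and task APIs as I understand them, and the `statusCode` field in the task status response is an assumption to check against your Druid version. How the consumed offsets are obtained is left open.

```python
import time
from typing import Callable, Mapping, Optional

# http(method, path, body) -> decoded JSON response. Injected (hypothetical)
# so the flow can be exercised without a running cluster.
Http = Callable[[str, str, Optional[dict]], dict]


def is_drained(end_offsets: Mapping[int, int],
               consumed_offsets: Mapping[int, int]) -> bool:
    """True when every partition of kafka_out is consumed up to its end offset.

    end_offsets: per-partition end offsets, captured after the streaming app stopped.
    consumed_offsets: per-partition next offset the supervisor will read.
    """
    return all(consumed_offsets.get(p, 0) >= end for p, end in end_offsets.items())


def run_nightly_batch(http: Http, supervisor_id: str, batch_spec: dict,
                      poll_seconds: float = 30.0) -> str:
    """Suspend the supervisor, run the batch task, and always resume afterwards."""
    base = "/druid/indexer/v1"
    http("POST", f"{base}/supervisor/{supervisor_id}/suspend", None)
    try:
        # Submitting a task returns its id under the "task" key.
        task_id = http("POST", f"{base}/task", batch_spec)["task"]
        while True:
            status = http("GET", f"{base}/task/{task_id}/status", None)
            state = status["status"]["statusCode"]  # assumed response shape
            if state in ("SUCCESS", "FAILED"):
                return state
            time.sleep(poll_seconds)
    finally:
        # Resume even if the batch failed, so real-time ingestion is not left off.
        http("POST", f"{base}/supervisor/{supervisor_id}/resume", None)
```

In the third scenario, `is_drained` would be polled after stopping the streaming app and before submitting the batch. Note that this sketch does not wait for suspended tasks to finish publishing, and it does not solve the duplicate problem of the second scenario by itself.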

Without going into the details of my workflow: if anybody knows a general way to handle synchronization between real time and batch, it would be really helpful.

Thanks.

Erwan

Hey Erwan,

Just wondering if you found the right approach to this?

Regards,

Ayushi