Streaming from Kafka: data missing

So, this has been happening all day today.

I have written a script to generate the Kafka ingest specs for a bunch of CSV files (all have the same dimensions; only the name and data differ).

After it generates the files, the script posts all the Kafka ingest specs to the Druid server.
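
The generation-and-post step looks roughly like this (a trimmed sketch, not my exact script; the Overlord address, topic names, and column names below are placeholders):

import json
import requests

def make_kafka_spec(datasource, topic, columns):
    # Minimal Kafka supervisor spec; every datasource shares these dimensions.
    return {
        "type": "kafka",
        "dataSchema": {
            "dataSource": datasource,
            "parser": {
                "type": "string",
                "parseSpec": {
                    "format": "csv",
                    "columns": columns,
                    "timestampSpec": {"column": "ts", "format": "auto"},
                    "dimensionsSpec": {"dimensions": [c for c in columns if c != "ts"]},
                },
            },
            "granularitySpec": {"type": "uniform", "segmentGranularity": "HOUR", "queryGranularity": "NONE"},
            "metricsSpec": [],
        },
        "ioConfig": {
            "topic": topic,
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
            "taskCount": 1,
            "replicas": 1,
            "taskDuration": "PT1H",
        },
        "tuningConfig": {"type": "kafka"},
    }

for name in ["file_a", "file_b"]:  # one spec per CSV file
    spec = make_kafka_spec(name, name, ["ts", "dim1", "dim2", "value"])
    resp = requests.post(
        "http://localhost:8090/druid/indexer/v1/supervisor",
        data=json.dumps(spec),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()  # a 200 here only means the supervisor spec was accepted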

I then send the data line by line through a Kafka producer.
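
The producer side is essentially this loop (again a sketch; kafka-python, the file name, and the topic name are assumptions, not my exact code):

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
with open("file_a.csv") as f:
    next(f)  # skip the header row; the CSV parseSpec already lists the columns
    for line in f:
        producer.send("file_a", line.strip().encode("utf-8"))
producer.flush()  # block until every row has actually reached the broker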

Expected result:

See all datasources in Overlord console

See all datasources in Coordinator console (ONLY after some data is pushed)

Retrieve all the data via queries

Actual result:

See all datasources in Overlord console

See ONLY a FEW datasources in Coordinator console (even after all data is sent)

Able to retrieve only the data for the datasources visible in the Coordinator console

Thanks in advance,

Paritosh J Chandran

Hi Paritosh,

Can you elaborate on what you are doing with the Kafka ingest specs?

How many streaming Kafka topics do you have?

Are the Kafka events represented as CSV rows or entire files?

It would be tremendously helpful if you could attach one of your generated ingest specs.

Best regards,

Vadim

It could be that you don’t have enough capacity (druid.worker.capacity on the MiddleManagers) to run tasks for all of these datasources at the same time.
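
A quick way to check (a sketch, assuming your Overlord API is reachable at localhost:8090) is to compare the total worker capacity with the running and pending task counts; Kafka indexing tasks stuck in the pending queue would explain datasources that never show up on the Coordinator:

import requests

base = "http://localhost:8090/druid/indexer/v1"
capacity = sum(w["worker"]["capacity"] for w in requests.get(base + "/workers").json())
running = len(requests.get(base + "/runningTasks").json())
pending = len(requests.get(base + "/pendingTasks").json())
print("capacity:", capacity, "running:", running, "pending:", pending)
# pending > 0 while running == capacity means the task slots are exhausted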

Firstly, I’m sorry for the late response.

Can you elaborate on what you are doing with the Kafka ingest specs?

I have attached a sample ingest spec.

I then post this to the Druid server.

I hope this answers your question.

If not, I’m not entirely sure what the question is.

How many streaming Kafka topics do you have?

At any given point in time, I’m only streaming to one topic.

I’ve not yet tried multithreading this process.

I thought I’d first get it working on a single thread and then move on to a multithreaded workload.

Are the Kafka events represented as CSV rows or entire files?

Right, so what I’m doing here is reading all the data into Python and then iterating over the rows, sending each row as its own Kafka event.

I know this is not ideal, but it is very similar to what the actual workload will look like.

In fact, that’s why I chose Kafka: for its ability to handle high volumes of data.

It would be tremendously helpful if you could attach one of your generated ingest specs.

Please find it attached.

I have masked some data (like IP addresses).

Thanks and regards,

Paritosh J Chandran

sample_ingest.json (1.19 KB)

Honestly, this sounds like the problem.

Do you have any suggestions on how to get this number right?

Is there a rule of thumb you follow for setting druid.worker.capacity, or could you at least point me in the right direction?
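
My rough understanding so far (an assumption on my part, not verified against the docs) is that each supervisor needs taskCount * replicas slots for its reading tasks, plus headroom for the publishing tasks that overlap with new reading tasks during handoff. Something like:

num_datasources = 50          # hypothetical; however many CSV files/specs I post
task_count, replicas = 1, 1   # the values in my generated specs
reading_tasks = num_datasources * task_count * replicas
needed_capacity = 2 * reading_tasks   # 2x headroom for the handoff overlap
print(needed_capacity)        # -> 100 slots across all MiddleManagers

Is that roughly the right way to think about it?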

Thanks and regards,

Paritosh Chandran