Tracking missing events after Hadoop indexing

Hi Team,

We are trying to ingest around a billion events through Hadoop indexing. Our index task was successful, but we see a meagre ~50K events loaded into Druid.

The MR job ran for around 2 hours, and we have ignoreInvalidRows = true.

How can we go about debugging the missing events? Also, is it possible for us to keep track of the ignored events?

Thanks,

Sathish

Also, the overlord console is showing the following for the specific datasources:

Datasource 1 -> 97% to load until available

Datasource 2 -> 99% to load until available

Thanks,

Sathish

It looks like the historical nodes are still loading the segments, which might be why you are seeing fewer rows.
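If it helps, loading progress can also be polled from the coordinator's load-status endpoint rather than eyeballed on the console. A small sketch of summarizing such a response follows; it assumes the endpoint returns a JSON object mapping each datasource name to its percent loaded (the host, port, and datasource names below are placeholders):

```python
# Sketch: summarize a Druid coordinator load-status response.
# Assumption: GET http://<coordinator>:8081/druid/coordinator/v1/loadstatus
# returns JSON like {"datasource1": 97.0, "datasource2": 100.0},
# i.e. datasource name -> percent of segments loaded.

def pending_datasources(load_status):
    """Return only the datasources that are not yet fully loaded."""
    return {ds: pct for ds, pct in load_status.items() if pct < 100.0}

# Illustrative values, not real output:
status = {"datasource1": 97.0, "datasource2": 100.0}
print(pending_datasources(status))  # {'datasource1': 97.0}
```

Polling this in a loop makes it easy to see whether the percentages are actually moving or stuck.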
For debugging invalid rows, you can either set ignoreInvalidRows=false, which will make the task throw an exception when a row is invalid/unparseable, or enable debug logging for Druid, which will include log lines for the invalid rows.
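For reference, ignoreInvalidRows sits in the Hadoop index task's tuningConfig; a minimal sketch (everything other than the two flags shown is elided, and your spec will have more fields):

```json
{
  "type": "index_hadoop",
  "spec": {
    "tuningConfig": {
      "type": "hadoop",
      "ignoreInvalidRows": false
    }
  }
}
```

With it set to false, the MR job fails fast on the first unparseable row instead of silently dropping rows, which makes it easier to find where the missing events went.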

Hi Nishant,

Thanks for the prompt reply.

Our index task completed 14 hours ago, and the datasources still show 87% to load until available. How can we go about speeding this up?

Also, on our coordinator console we don't see the remote workers listed, even though they are running on their individual boxes. Similarly, a few historical nodes disappear from the overlord console and become visible again later.

Thanks,

Sathish