Has anyone noticed that Druid silently drops rows during ingestion? I have a TSV file that shows parsing issues after indexing when the task runs concurrently with other tasks, but if I run it individually, it is ingested correctly.
Further investigation indicates that the parse issue occurs because Druid is dropping rows, probably due to memory constraints.
Do you mean that you’ve diagnosed the row drops down to a few failed tasks? There shouldn’t be any dropped rows in that situation; the task should fail. Druid ingestion can drop rows when parsing fails, as controlled by:
- reportParseExceptions, which is false by default, meaning the job won’t fail on parsing errors; it will skip the unparseable rows instead.
- You can also control whether parse exceptions are logged with logParseExceptions, and how many parse exceptions are tolerated before the task fails with maxParseExceptions.
- If an out-of-memory condition is resulting in a parse exception, this is probably a bug. Could you provide more details on that?
- The fact that the problem occurs when running concurrently with other tasks might also indicate a capacity problem (i.e., the running tasks are consuming more resources than are available on the server).
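These parse-handling settings live in the tuningConfig section of the ingestion spec. A minimal sketch (the values here are illustrative, not recommendations):

```json
{
  "tuningConfig": {
    "type": "index_parallel",
    "logParseExceptions": true,
    "maxParseExceptions": 0,
    "maxSavedParseExceptions": 10
  }
}
```

With maxParseExceptions set to 0, the task fails on the first unparseable row instead of silently skipping it, which is a good way to surface exactly where the bad rows are.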
You should also consider Apache Druid 24.0 and SQL-based batch ingestion: it uses a new engine that is more efficient at batch loading, and it is easier to use because it is all SQL.
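For reference, SQL-based batch ingestion is an INSERT statement reading from an EXTERN table function. A hedged sketch, assuming a local TSV file with a header and three columns (the file path, column names, and types are placeholders you would replace with your own):

```sql
INSERT INTO my_datasource
SELECT
  TIME_PARSE("timestamp") AS __time,
  "user_id",
  "event_type"
FROM TABLE(
  EXTERN(
    '{"type": "local", "files": ["/data/events.tsv"]}',
    '{"type": "tsv", "findColumnsFromHeader": true}',
    '[{"name": "timestamp", "type": "string"}, {"name": "user_id", "type": "string"}, {"name": "event_type", "type": "string"}]'
  )
)
PARTITIONED BY DAY
```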
The parse exception is really due to Druid dropping part of a row, so the remaining rows are misaligned. I checked the file and found the exact place where the partial row was dropped.
I reduced the number of workers from 8 to 4 to stabilize the process.
Yes, Druid didn’t report any failures; it just dropped the rows due to some memory issue.
Yes, I will look into SQL-based ingestion.
Can you share the ingestion spec and the Overlord, MiddleManager, and task logs for one of these situations? It would be really useful for determining when this happens. An out-of-memory condition should not cause a silent drop of data; it should either recover or fail completely.
Also, there should be some guardrails in the ingestion spec so that it doesn’t attempt to use too much memory (e.g.
maxBytesInMemory), and the cluster should be configured so that there are enough resources for the number of workers. In other words, make sure that the sum of the MiddleManager and its Peons’ JVM memory settings does not exceed the server’s capacity.
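As an example of that guardrail, maxBytesInMemory caps how much each task buffers before spilling to disk (when unset, Druid defaults it to one-sixth of the JVM max heap). The value below is illustrative only; size it to your Peon heap:

```json
{
  "tuningConfig": {
    "type": "index_parallel",
    "maxRowsInMemory": 150000,
    "maxBytesInMemory": 100000000
  }
}
```

Lowering this forces more frequent persists, trading some ingestion speed for a smaller, more predictable memory footprint per task.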
I will try to reproduce it. Right now I am focusing on stabilizing and pushing to production; then I will optimize further, reproduce the issue, and share the details here.