Native Batch Ingestion vs. Hadoop Batch Ingestion

Hello everyone, I’m new to Druid and have a general question about version 0.22.1 and later versions of Druid.

I usually have to ingest about 30-50 GB of data into Druid using batch ingestion. We are running version 0.20.2 and mainly use Hadoop ingestion. Recently, we upgraded to a new Dataproc cluster image on Google Cloud (image version 2.0-debian10), and ingestion is fast. However, when I try native batch ingestion, the performance is only on par with the old Dataproc image (version 1.5 and below). I tried setting maxNumConcurrentSubTasks equal to the druid.worker.capacity defined in the settings, but the speed is still only acceptable.
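For reference, the setting described above lives in the tuningConfig of an index_parallel task. The following is a minimal sketch of how that fragment might be generated, not the poster's actual spec. The capacity value of 4 and the helper name parallel_tuning_config are hypothetical, and leaving one slot free for the supervisor task is an assumption about how task slots are consumed, so check it against your own cluster.

```python
import json

# Hypothetical value: substitute the druid.worker.capacity from your own
# MiddleManager configuration.
WORKER_CAPACITY = 4


def parallel_tuning_config(worker_capacity: int) -> dict:
    """Build a tuningConfig fragment for a native parallel (index_parallel) task."""
    # Assumption: the supervisor task occupies one task slot itself, so one
    # slot is left free for it. Never go below 1 subtask.
    subtasks = max(1, worker_capacity - 1)
    return {
        "type": "index_parallel",
        "maxNumConcurrentSubTasks": subtasks,
    }


tuning_config = parallel_tuning_config(WORKER_CAPACITY)

# Print the fragment as it would appear inside the ingestion spec.
print(json.dumps({"tuningConfig": tuning_config}, indent=2))
```

The printed fragment would be merged into the spec section of the ingestion task; the rest of the spec (ioConfig, dataSchema) is omitted here.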

For detailed comparisons:

  1. With Dataproc cluster 2.0-debian10 and Hadoop ingestion, ingesting 30 GB of Parquet files took about 25-30 minutes.
  2. With the old Dataproc cluster 1.5-debian10 and Hadoop ingestion, the same ingestion took about 1 h 15 min.
  3. With native batch ingestion, the same ingestion took about 1 h 10 min.

Druid runs on a small machine with 10 cores and 48 GB of RAM, and the Dataproc cluster has 1 master and 10 worker nodes.

Would you suggest using Hadoop ingestion or native ingestion? And for native ingestion, are there properties I can change to improve performance?

I don’t yet know enough about Hadoop-based ingestion to give better advice. But one thing that caught my attention is that native ingestion would not have access to the 10 worker nodes (how many cores and how much memory does each have?). That could explain the faster ingestion on Hadoop: there are simply a lot more resources available. A larger Druid cluster might make for a better comparison.

I double-checked the resources for the cluster: each worker node has 4 cores and 16 GB of RAM. That gives the entire cluster 40 cores and 160 GB of RAM, which is four times the cores and a little over three times the RAM of my Druid machine. Ingestion on the new cluster also finishes in less than half the time it takes on the Druid machine.
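The arithmetic behind that comparison can be checked with a short sketch. All figures come from this thread; taking the midpoint of the 25-30 minute range for the Hadoop run is an assumption made only to get a single number.

```python
# Figures quoted in this thread.
druid_cores, druid_ram_gb = 10, 48          # single Druid machine
workers = 10                                # Dataproc worker nodes
worker_cores, worker_ram_gb = 4, 16         # per worker node

cluster_cores = workers * worker_cores      # total Dataproc cores
cluster_ram_gb = workers * worker_ram_gb    # total Dataproc RAM in GB

core_ratio = cluster_cores / druid_cores
ram_ratio = cluster_ram_gb / druid_ram_gb

# Assumption: use the midpoint of the reported 25-30 minute range.
hadoop_minutes = (25 + 30) / 2
native_minutes = 70                         # 1 h 10 min
time_ratio = hadoop_minutes / native_minutes

print(f"Dataproc cluster: {cluster_cores} cores, {cluster_ram_gb} GB RAM")
print(f"Cores: {core_ratio:.1f}x, RAM: {ram_ratio:.2f}x the Druid machine")
print(f"Hadoop run takes {time_ratio:.0%} of the native run time")
```

With roughly 4x the cores, the Hadoop run finishing in about 40% of the time is in the range one might expect from the resource gap alone, which is why a larger Druid cluster is needed before drawing conclusions about the two ingestion methods.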

I will try setting up a larger Druid cluster to see if I can make a fair comparison.