Questions when ingesting data

Hi Team,

Druid is a really powerful OLAP tool. We are trying to use Druid to handle big data in our daily work.

Currently, we want to test with a data set of over 30 billion rows, stored in about 100 gz files (about 650 GB in total).

We set up a Druid cluster with 10 nodes: 1 master node (Coordinator/Overlord), 8 data nodes (MiddleManager/Historical), and 1 query node (Broker).

The total disk space is about 7 TB. All VMs have 32 GB of memory and 8 cores.

I loaded all the files in one request; the “paths” field in the indexing JSON is “s3://xxxxx/xxxx/xxx/**”, and on the web console page I saw that only 1 data node picked up the task to handle the data.
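For reference, here is a minimal sketch of the task spec I submitted, assuming the Hadoop batch ingestion format described in the docs. The dataSource name is a placeholder, and the parser, timestampSpec, dimensions, and granularitySpec are omitted for brevity; only the “paths” value is exactly what I used:

    {
      "type": "index_hadoop",
      "spec": {
        "dataSchema": {
          "dataSource": "test_dataset"
        },
        "ioConfig": {
          "type": "hadoop",
          "inputSpec": {
            "type": "static",
            "paths": "s3://xxxxx/xxxx/xxx/**"
          }
        },
        "tuningConfig": {
          "type": "hadoop"
        }
      }
    }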

My questions are:

  1. I read in the documentation that the task is assigned by the Coordinator and one of the data nodes picks it up. If I want to ingest multiple files into a new datasource, is there any way to do this in parallel, i.e., each node handling one file at the same time?
  2. What is the current internal logic for ingesting data from a directory? Does one data node pick up the task and handle the files under the folder one by one?
  3. As mentioned in “Batch File Ingest”, Druid handles batch ingestion with a Hadoop cluster. What is the size of the Hadoop cluster by default? Is it based on the number of Historical nodes? I’m not very familiar with Hadoop.
  4. For indexing a large datasource, is an external Hadoop cluster better for performance?
  5. I have imported 150 million rows from S3. It took 3 hours with 1 data node (16 GB of memory for the Historical and 3 GB for the MiddleManager). The heap size for druid.indexer.runner is 6 GB (see the config snippet below this list). Is this too small for indexing a large data set?
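For context on question 5, the relevant line of my MiddleManager runtime.properties looks roughly like this (only the 6 GB heap is the actual setting I mentioned; the other JVM flags here are illustrative):

    # MiddleManager runtime.properties (sketch, only the line relevant to question 5)
    # -Xmx6g is the 6 GB peon heap mentioned above; other flags are illustrative
    druid.indexer.runner.javaOpts=-server -Xmx6g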

Could you please help explain these questions in detail?

I really appreciate your help.

Best Regards,

Daniel