What can I do to improve data ingestion performance?

Hi everyone,

I have a job that needs to ingest data into Druid.
Here are the details:

  • The input data to be indexed arrives at 20 GB/minute, with 30 columns per row.
  • "segmentGranularity": "HOUR",
  • "queryGranularity": "HOUR",

I have a few questions:

  • Is it possible to ingest 20 GB/min with a cluster of 10 nodes, each with 32 cores and 125 GB of RAM?

  • If not, how many nodes (each with 32 cores and 125 GB of RAM) would we need for this job?

  • If it is possible, what can I do to improve ingestion performance?
    I tried to ingest 1.3 TB of data on a cluster reporting memory: 1136640 and cores: 160, and the job took 2.5 hours. What can I do to speed this up?
    I increased the HDFS block size to 512 MB, which resulted in 3,314 map tasks.
    This is my tuningConfig:

"mapred.compress.map.output": "true",
"mapred.output.compression.type": "BLOCK",
"mapred.map.output.compression.codec": "org.apache.hadoop.io.compress.SnappyCodec",
"mapred.output.compression.codec": "org.apache.hadoop.io.compress.SnappyCodec",
"mapreduce.task.io.sort.mb": "2047",
"mapreduce.task.io.sort.factor": "100",
"mapreduce.map.memory.mb": "8146",
"mapreduce.reduce.memory.mb": "16292",
"mapreduce.map.java.opts": "-Xmx6516m",
"mapreduce.reduce.java.opts": "-Xmx13033m",
"mapreduce.map.sort.spill.percent": "0.90"
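(For readers following along: these are Hadoop job properties, so in an index_hadoop task they belong under tuningConfig.jobProperties. A trimmed sketch of that placement follows; the partitionsSpec values here are illustrative assumptions, not taken from the attached spec.)

"tuningConfig": {
  "type": "hadoop",
  "partitionsSpec": {
    "type": "hashed",
    "targetPartitionSize": 5000000
  },
  "jobProperties": {
    "mapred.compress.map.output": "true",
    "mapred.map.output.compression.codec": "org.apache.hadoop.io.compress.SnappyCodec",
    "mapreduce.map.memory.mb": "8146",
    "mapreduce.map.java.opts": "-Xmx6516m",
    "mapreduce.reduce.memory.mb": "16292",
    "mapreduce.reduce.java.opts": "-Xmx13033m"
  }
}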

Have you followed these instructions? Is your task real time or batch? Can you send over your ingestion spec?

https://druid.apache.org/docs/latest/operations/basic-cluster-tuning.html

As a rough rule of thumb, you should be able to ingest about 10k records per second per core.
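To put rough numbers on that: 10 nodes x 32 cores = 320 cores, or on the order of 3.2M records/second of indexing capacity. 20 GB/min is about 333 MB/s; if a 30-column row averages somewhere around 300 bytes (an assumption, since your row size isn't stated), that is roughly 1.1M records/second of input, comfortably under the estimate, so 10 such nodes look feasible in principle. For comparison, your 1.3 TB test finishing in 2.5 hours works out to about 8.7 GB/min, so that run was still well below the 20 GB/min target.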

We are in the process of choosing solutions and estimating investments.
It would be great if we could ingest data in real time; right now we are using batch processing.
I followed the instructions.
However, since we use Hadoop to ingest data into Druid, I find that the performance of the Hadoop indexing job is the key factor; the Druid-side configuration does not greatly affect the indexer's performance.
I followed this blog post: https://imply.io/post/hadoop-indexing-apache-druid-configuration-best-practices
I increased the block size to 512 MB and changed some settings in the tuningConfig.
I attached my ingestion spec.
What can I do to improve performance?
On Thursday, July 16, 2020 at 02:34:30 UTC+7, Rachel Pedreschi wrote:

nio_index.json (7.2 KB)

Is there any reason you have to use Hadoop indexing? I would recommend Druid native indexing. Hadoop indexing runs a MapReduce job. With recent releases of Druid, the Parquet and ORC formats are also supported by native indexing.
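For anyone on a recent Druid release, a native parallel batch spec skeleton looks roughly like this; the dataSchema is left empty here, and the HDFS path, subtask count, and inputFormat are placeholders, not taken from the poster's setup:

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": { },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "hdfs",
        "paths": "hdfs://namenode:8020/path/to/input"
      },
      "inputFormat": { "type": "parquet" }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxNumConcurrentSubTasks": 4
    }
  }
}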

vijay


One thing that could help you is mapred.min.split.size. This controls the number of mappers and can speed up the job, depending on where the bottleneck in your Hadoop cluster is.
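For example, to push Hadoop toward fewer, larger splits, you could set something like this in jobProperties (the 512 MB value is only an illustration; on newer Hadoop versions the equivalent key is mapreduce.input.fileinputformat.split.minsize):

"jobProperties": {
  "mapred.min.split.size": "536870912"
}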

vijay

When I tried to ingest 275 GB of data, I increased the block size to 512 MB and coalesced the input to 650 partitions. The job ran 653 mappers and 92 reducers. Are those mapper and reducer counts good enough?
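(As a sanity check on the mapper count: 275 GB / 512 MB is about 550 blocks, and coalescing to 650 partitions yields roughly 650 splits, which matches the 653 mappers observed. In Druid's Hadoop indexer the reducer count is driven by the number of segments being created, so 92 reducers suggests about 92 shards; whether that is enough depends on the target segment size, with the Druid docs commonly recommending around 5 million rows per segment.)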
On Thursday, July 16, 2020 at 11:43:04 UTC+7, vijay narayanan wrote:

My Druid version is 0.9.2 and we can't use the latest version of Druid. Can you tell me why Druid native indexing is recommended? Is there any performance difference between Hadoop indexing and Druid native indexing?

On Thursday, July 16, 2020 at 11:32:31 UTC+7, vijay narayanan wrote: