Is there a document about loading data using Hadoop?

Hi,

I can see there is a doc about data loading using Apache Hadoop, https://druid.apache.org/docs/latest/tutorials/tutorial-batch-hadoop.html, but it is mainly about Docker.

My environment is a native Hadoop cluster. How can I load 10 TB of data into Druid using Hadoop? Or is it good enough to load from local files?

Thanks

Josh

Maybe look at these two:

https://druid.apache.org/docs/latest/ingestion/hadoop.html

https://druid.apache.org/docs/latest/operations/other-hadoop.html

Thanks!

Manu

Hey,
To add to Manu’s answer - we’re ingesting dozens of TBs per day into our Druid clusters, and the Hadoop-based ingestion works well.

2 side notes:

  1. In some cases (depending on how you pre-process the data and whether or not you’re running on a public cloud), it also allows you to get some nice cost savings (for example, see slides 27-28 here).
  2. There’s currently an effort to create Spark-based ingestion to Druid, but it’s still in its early stages.

Itai

Hi Itai,

Thanks for your reply.

Is it OK to use the native Hadoop-based ingestion to load 10 TB of data? Will the load complete in hours or days? And are there ways to speed up the Hadoop-based ingestion, like running more parallel tasks? Only a few parameters are introduced in the doc.

The Hadoop-based ingestion will run in a native Hadoop environment, not on a cloud.

Josh

On Monday, May 18, 2020 at 9:13:56 PM UTC+8, itai yaffe wrote:

Hi Manu,

Thank you, I just missed the right doc.

Josh

On Monday, May 18, 2020 at 8:30:21 PM UTC+8, Manu T M wrote:

Hey Josh,
That depends on the size of your Hadoop cluster.

We run multiple Hadoop-based ingestion tasks (which are basically MapReduce jobs) in parallel (on separate Hadoop clusters, but you can run them on the same Hadoop cluster), usually one per Druid datasource, and the execution time is around a few hours (again - depending on the cluster size, the amount of ingested data, the ingestion spec, etc.).

You can also explore tuning the Hadoop MapReduce job (the one that ingests the data), e.g. the memory allocated to mappers/reducers.
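For example, here’s a minimal sketch of passing standard MapReduce memory settings through the jobProperties of the tuningConfig (the values are just placeholders - you’d tune them for your own cluster):

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "jobProperties": {
      "mapreduce.map.memory.mb": "4096",
      "mapreduce.map.java.opts": "-Xmx3277m",
      "mapreduce.reduce.memory.mb": "8192",
      "mapreduce.reduce.java.opts": "-Xmx6554m"
    }
  }
}
```

The Java heap (-Xmx) is usually set a bit lower than the container memory to leave headroom for off-heap overhead.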

Hope that helps :slight_smile:

Itai

Hi Itai,

Actually, I plan to run an SSB (Star Schema Benchmark) test on Druid. The data scale is about 10 TB, and the lineorder table alone will have about 6 TB of data. I think it is better to cut the lineorder table data into pieces, such as 10 GB per file, so there will be about 600 files to load.

According to what you mentioned above, the parallel tasks target different datasources. My situation is one datasource but many individual data files. Can I run 600 MR jobs in parallel to load the data? Can Druid load data into one datasource from many MR jobs? I don’t know whether that would speed up the load or just stall it.

My Hadoop cluster has 5 nodes, with 40 CPUs and 128 GB of memory.
Thanks a lot.

Josh

On Tuesday, May 19, 2020 at 9:15:19 PM UTC+8, itai yaffe wrote:

Hey Josh,
I hope I’m not missing anything here, but when you run a Hadoop-based ingestion task, you provide the various specs (as shown at https://druid.apache.org/docs/latest/ingestion/hadoop.html#task-syntax).

Part of it is the static inputSpec (see https://druid.apache.org/docs/latest/ingestion/hadoop.html#static).

This has 2 parameters:

  1. “paths” - which can be either a single file, a folder, or a list of files (e.g. “paths” : “hdfs://path/to/data/is/here/data.gz,hdfs://path/to/data/is/here/moredata.gz,hdfs://path/to/data/is/here/evenmoredata.gz”).
  2. “inputFormat” - the format of those input files (e.g. Parquet).

So essentially you can have a single ingestion task that runs a single MapReduce job and ingests multiple files (in parallel) into the same datasource (see the sketch below).

Considering that (and again - unless I’m missing anything), you don’t need to run 600 MR jobs to ingest those 600 files.
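For reference, a minimal sketch of what that ioConfig could look like (the HDFS paths are made up, and the text InputFormat is just an assumption about your files):

```json
{
  "ioConfig": {
    "type": "hadoop",
    "inputSpec": {
      "type": "static",
      "paths": "hdfs://namenode:8020/ssb/lineorder/part-000.tbl,hdfs://namenode:8020/ssb/lineorder/part-001.tbl",
      "inputFormat": "org.apache.hadoop.mapreduce.lib.input.TextInputFormat"
    }
  }
}
```

And since “paths” also accepts a folder, you could simply point it at the directory holding all 600 files instead of listing them one by one.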

I hope it’s clearer now, but let me know if you have follow-up questions.

Thanks,

Itai