Data Load time

Hello Druid User,

We recently loaded about 215 MB of data (the data resides on the same machine) through the batch ingestion process into Druid. Currently I am running this on a single-node machine with 252 GB of RAM, 2 CPUs, and 32 cores. It took about 30 minutes to load and index the data, which I believe is not ideal performance.

Can anyone please help me understand and estimate load times, what factors affect them, and how to extract the best performance from this machine?

Let me know if you need any other details.

Regards

Karteek

Hi Karteek, have you had a chance to try 0.8.0-rc2? It fixes some bugs that caused index tasks to take much longer than they should. FWIW though, for larger data sets (> 1 GB), we strongly recommend using Hadoop-based batch ingestion, as the plain index task is extremely inefficient.
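For reference, a Hadoop-based batch ingestion job is submitted as an "index_hadoop" task spec. The sketch below is a minimal example in the style of the 0.8.x-era docs; the datasource name, input path, interval, dimensions, and metric are all placeholders to replace with your own values:

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "my_datasource",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "timestamp", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["dim1", "dim2"] }
        }
      },
      "metricsSpec": [ { "type": "count", "name": "count" } ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2015-01-01/2015-02-01"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": { "type": "static", "paths": "/path/to/events.json" }
    },
    "tuningConfig": { "type": "hadoop" }
  }
}
```

Check the batch ingestion page of the Druid docs for the exact field names supported by your version before submitting this to the overlord.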

Thank you Fangjin.

I used index_hadoop and was able to load the data in ~2 minutes.

On the same note, I would now like to try HadoopDruidIndexer to load data. Can you please let me know how it differs from index_hadoop? Will the data be segmented the same way as with index_hadoop, or do we have more flexibility, since we can leverage "partitionsSpec"?
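For context on partitionsSpec: in the Hadoop-based spec it sits under tuningConfig and controls how rows are sharded into segments. A hedged sketch, where the target size is an illustrative value rather than a recommendation:

```json
"tuningConfig": {
  "type": "hadoop",
  "partitionsSpec": {
    "type": "hashed",
    "targetPartitionSize": 5000000
  }
}
```

The available partition types and their options vary by Druid version, so the batch ingestion docs for your release are the authoritative reference.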

Karteek

Fangjin,

On the same note, I have another question. Currently we are doing a POC with 200 MB of data, but our eventual goal is to load and index 1 TB to 1.5 TB of data. We plan to procure the best hardware available. Are you aware of any implementation that has indexed that much data or more? Do you think it is feasible to do that in Druid?

Karteek

Hi Karteek, the largest cluster I'm aware of has indexed over 20 PB of raw data.

You can take a look at the overview page on the Druid website for the scale at which Druid has been deployed.

Also note that Druid scales with more hardware, so 1 TB of data may take a while to churn through on a system with only 4 to 8 cores and limited memory.

Every dataset is different, and it’s hard to pick a core:memory:disk ratio that works for arbitrary datasets.

It would be worth exploring this as part of your POC to determine how many boxes you'll need and how they should be configured for the types of data and queries you expect.

Hi,

I have the same problem as described in your first post. Could you tell me how you sped up loading batch data?

Thanks in advance,

Tom

On Tuesday, July 14, 2015 at 2:38:22 AM UTC+2, karteek chada wrote:

Hi Tom, are you using the index task or the Hadoop index task? If it is the latter, are you running it on a distributed cluster?

I followed http://druid.io/docs/latest/tutorials/tutorial-loading-batch-data.html, so for me it is the index task.
I run it on a standalone machine (localhost).

On Tuesday, July 14, 2015 at 2:38:22 AM UTC+2, karteek chada wrote:

The index task is incredibly slow and is designed for small POCs. For faster loading of batch data, you can use the Hadoop-based index task, or wait for https://github.com/druid-io/druid/pull/1907 and the work we've been doing on enabling all data to be streamed into Druid, which should be completed in the next few weeks.