Loading data into historicals from S3 is very slow

I only have 40 GB of data, and even loading that takes a full day.
Is there any particular reason for this?

Any way to speed it up?

It seems the historical literally takes each segment, downloads it, unpacks it, indexes it, etc., and then moves on to the next.

Have you thought about doing all this in parallel, so it would download and unpack a whole bunch of segments at once?

Hi Prashant,

Do you have 40 GB of segments? How many historicals do you have?

https://github.com/druid-io/druid/pull/1258 should help with the behavior you are seeing.

2 historicals.

40 GB of segments each.

Replication factor 2.

From the logs, can you tell how long downloading each segment takes? Loads taking a full day for this volume of data should not occur.

So I added 2 more historicals (totaling 4 now), and looking at boundary stats, here is what I see:

stats:

3:34pm http://screencast.com/t/76d4kX4N4YF

3:36pm http://screencast.com/t/76d4kX4N4YF

Basically, it seems that at every minute boundary, 1 segment is loaded into 1 of the 2 historicals.

The above shows 1 node; the network-in graph shows 1 segment loaded at 3:34 and another at 3:36.

I don't have the capture for 3:35, but I definitely saw the 2nd node load a segment at that time.

Druid version 0.6.160

Both those images are broken for me.

Not sure why the images are not loading for you.

Here is another screen from splunk this time:

http://screencast.com/t/obqaqzKM

As you can see, over the course of an hour only 81 segments are loaded.

You can see from the timeline graph that in some minutes almost no segments are loaded, and at most only 2 are loaded per minute.

What I also find interesting is the behavior after one of the full nodes is taken down.

To explain further:

I have two full nodes 1 & 2.

I added 2 more nodes 3 & 4.

The segment loading was going really, really slowly for 3 and 4.

They only loaded 12 GB in about 6 hours.

Then I took down node 1.

The network traffic of 3 and 4 spiked, and they started loading segments more quickly!

Here is a screenshot showing the network traffic of 3:

http://screencast.com/t/lPUZaFmZhj

Hi Prashant,
have you made any changes to your coordinator dynamic configs?

What are maxSegmentsToMove & replicationThrottleLimit set to?

Increasing those will also speed up the movement of segments across nodes when new nodes are added.
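In case it's useful, the coordinator's dynamic config can be inspected over HTTP, so you can see what you're currently running with. (The host/port below are placeholders, and the endpoint path is the one from the coordinator docs I've seen; double-check that it exists in your Druid version.)

```shell
# Fetch the current coordinator dynamic config; if none has ever been set,
# the response may be empty and the built-in defaults apply.
curl http://<coordinator-host>:8081/druid/coordinator/v1/config
```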

Just to add on to what Nishant is saying.

Druid will download data without throttling if any datasource is incompletely loaded (i.e., some segment that should be in the cluster has no copies actually loaded). If a datasource is completely loaded and Druid is replicating segments, there is a throttle on how fast this can occur.

For your case, do you see the slowness after your data completed loading, or while it is in an incomplete state?

have you made any changes to your coordinator dynamic configs?

No. I didn't know about this until now, in fact.

what are maxSegmentsToMove & replicationThrottleLimit set to?

Looking at the config:

maxSegmentsToMove = 5

replicationThrottleLimit = 10

Here is the full config:

For your case, do you see the slowness after your data completed loading, or while it is in an incomplete state?

The slowdown was noticed when I added nodes to an existing full cluster, so I guess it was in the ‘completed’ state when the slowdown occurred.

That said, even when it was in the ‘incomplete’ state, the data was loaded far, far slower than the machines were capable of loading it.

It took hours to load that data, when the node was capable of doing it in minutes.

The documentation for coordinator dynamic config is confusing.

I see replicationThrottleLimit, but there is no replicationLimit, so I don't know what the rate is without throttling.

Also, what params should I tweak to allow faster loading?

Hi Prashant, see inline.

have you made any changes to your coordinator dynamic configs?

No. I didn't know about this until now, in fact.

what are maxSegmentsToMove & replicationThrottleLimit set to?

Looking at the config:

maxSegmentsToMove = 5

This config is mainly used to speed up rebalancing. You can set it higher to enable your cluster to balance faster.

replicationThrottleLimit = 10

This config concerns how fast segments can be downloaded. It throttles how many segments Druid will try to replicate at once. Druid will only try to replicate further once all 10 segments have been downloaded.
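To put rough numbers on that batching behavior (this is a sketch of my reading of the description above, not Druid's actual scheduler):

```shell
# If replicas are scheduled in batches of at most $limit segments, and a new
# batch only starts once the previous one has fully downloaded, then
# replicating $segments segments takes ceil(segments / limit) batches.
segments=81   # about what the Splunk graph earlier showed loading in an hour
limit=10      # the replicationThrottleLimit from this thread
batches=$(( (segments + limit - 1) / limit ))
echo "$batches batches of up to $limit segments"   # → 9 batches of up to 10 segments
```

So raising the limit directly reduces how many coordinator cycles a full reload needs.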

In production we set maxSegmentsToMove to 200 and replicationThrottleLimit to 300. Our cluster consists of nearly 1M segments, though, and any time you set these values higher there will be some churn and potential impact to query performance. For us in production, though, the query performance impact is near negligible.
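For reference, raising both values should be a single POST of JSON to the coordinator (host/port are placeholders, and the endpoint path is from the docs I've seen, so verify it for your Druid version; the numbers are the production values mentioned above):

```shell
# Example only: bump the two throttles to higher values.
curl -X POST http://<coordinator-host>:8081/druid/coordinator/v1/config \
  -H 'Content-Type: application/json' \
  -d '{"maxSegmentsToMove": 200, "replicationThrottleLimit": 300}'
```

Since these are dynamic configs, the change should take effect on the coordinator's next run without a restart.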