We are in the process of setting up a Druid cluster for a PoC. As part of that I set up 5 nodes: 3 historical nodes, 1 running ZooKeeper, the coordinator and the overlord, and 1 for the broker.
We were able to ingest some data from flat files and query it. For setting up the cluster we followed the steps given at http://druid.io/docs/latest/tutorials/cluster.html.
Now come the problem areas:
1. We are unable to distribute the data across multiple historical nodes. Each historical node has the same configuration, and these nodes show up in the console as worker nodes. After some debugging I figured out that we cannot use 'local' as deep storage if we want to distribute the data set. Is that the correct understanding? If yes, is there any way we can bypass the S3 access key and secret key? We have set up the EC2 instances with IAM roles, which should ideally enable access to S3 without having to specify credentials; however, that doesn't seem to be the case. Also, does the coordinator distribute the tasks randomly? I went through the documentation, but it was not clear to me how I can evenly distribute the data.
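For reference, this is roughly what we have been trying in common.runtime.properties. The bucket name is a placeholder, and whether Druid falls back to the EC2 instance profile when the access/secret keys are left unset is exactly the behaviour we are unsure about:

```properties
# load the S3 extension and use S3 as deep storage
druid.extensions.loadList=["druid-s3-extensions"]
druid.storage.type=s3
druid.storage.bucket=my-druid-poc-bucket    # placeholder bucket name
druid.storage.baseKey=druid/segments

# left unset in the hope that the AWS credential chain picks up the
# instance-profile (IAM role) credentials instead:
# druid.s3.accessKey=
# druid.s3.secretKey=
```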
2. We are trying to set up failover by having multiple historical nodes. The objective is that if one of the historical nodes goes down, we should still be able to query the data set without issues. Does this work only with the S3/HDFS deep storage options?
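As I understand the docs, replication is controlled by coordinator load rules rather than by the deep storage type, so something like the rule below (posted to the coordinator's rules endpoint for our datasource) should keep two copies of each segment across the historical tier. Please correct me if the field names are wrong for our version:

```json
[
  {
    "type": "loadForever",
    "tieredReplicants": { "_default_tier": 2 }
  }
]
```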
3. We are ingesting data via the simple ingestion task (http://druid.io/docs/latest/ingestion/batch-ingestion.html). Loading around 80M records takes around 27 minutes, even with some parallelization in place, i.e. by submitting tasks for different intervals and executing them in parallel. This seems very slow, so it looks like the configuration we use needs to change.
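For context, our task specs look roughly like the sketch below (dataSchema and ioConfig elided). The tuningConfig values are the ones we would like advice on; the field names are taken from the batch-ingestion docs and the values are illustrative, not what we have verified:

```json
{
  "type": "index",
  "spec": {
    "dataSchema": { "dataSource": "poc_data" },
    "ioConfig": { "type": "index" },
    "tuningConfig": {
      "type": "index",
      "targetPartitionSize": 5000000,
      "rowFlushBoundary": 75000
    }
  }
}
```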
We have set up an m3.xlarge instance for the coordinator and 3 r3.2xlarge instances for the historical nodes, and I have used the default configurations.
It would be great if you could share some insights.
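In case it matters, the historical runtime.properties are essentially untouched. One thing we are wondering is whether we should be sizing them for the hardware, roughly along these lines (values are guesses for an r3.2xlarge with 8 vCPUs and 61 GB RAM, not recommendations; property names are from the Druid configuration docs):

```properties
# hypothetical sizing for an r3.2xlarge historical node -- please verify
druid.processing.numThreads=7                 # vCPUs - 1
druid.processing.buffer.sizeBytes=536870912   # 512 MB per processing buffer
druid.segmentCache.locations=[{"path":"/mnt/druid/segment-cache","maxSize":130000000000}]
druid.server.maxSize=130000000000             # should match the segment cache capacity
```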