Druid cluster setup questions

Hi,

We are in the process of setting up a Druid cluster for a PoC. As part of that I set up 5 nodes (3 historical nodes, 1 for ZooKeeper + coordinator + overlord, and 1 for the broker).
We were able to ingest some data from flat files and query it. For setting up the cluster we followed the steps given in http://druid.io/docs/latest/tutorials/cluster.html.

Now comes the problem area.

1. We are unable to distribute the data across multiple historical nodes. Each historical node has the same configuration, and these nodes show up in the console as worker nodes. After some debugging I figured out that we cannot use 'local' as deep storage if we want to distribute the data set. Is that the correct understanding? If yes, is there any way we can bypass the S3 access key and secret key? We have set up the EC2 instances with roles, which should ideally enable access to S3 without having to specify them; however, that doesn't seem to be the case. Does the coordinator distribute the tasks randomly? I went through the document; however, it was not clear to me how I can evenly distribute the data.

2. We are trying to have a failover setup by having multiple historical nodes. The objective is that if one of the historical nodes goes down, we should still be able to query the data set without issues. Does it work only with the S3/HDFS deep storage option?

3. We are ingesting data via the simple ingestion task (http://druid.io/docs/latest/ingestion/batch-ingestion.html). Loading around 80m records takes around 27 minutes, with some parallelization in place, i.e. by submitting tasks for different intervals and executing them in parallel. This seems to be very slow. It looks like the configuration I use needs to change.

FYI, we have set up an m3.xlarge instance for the coordinator and r3.2xlarge instances (3) for the historicals, and I have used the default configurations.

It would be great if you could share some insights.

Sunil

Hi,
please see inline.

Hi,

We are in the process of setting up a Druid cluster for a PoC. As part of that I set up 5 nodes (3 historical nodes, 1 for ZooKeeper + coordinator + overlord, and 1 for the broker).
We were able to ingest some data from flat files and query it. For setting up the cluster we followed the steps given in http://druid.io/docs/latest/tutorials/cluster.html.

Now comes the problem area.

1. We are unable to distribute the data across multiple historical nodes. Each historical node has the same configuration, and these nodes show up in the console as worker nodes. After some debugging I figured out that we cannot use 'local' as deep storage if we want to distribute the data set. Is that the correct understanding?

Yes, there is no way to share the data without a distributed file system (e.g. HDFS or S3) as deep storage.
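
For reference, a minimal sketch of what the S3 deep storage section of common.runtime.properties typically looks like; the bucket name and base key are placeholders, and the extensions line assumes a 0.9.x-style loadList:

  # Load the S3 extension and use S3 as deep storage (bucket/baseKey are placeholders)
  druid.extensions.loadList=["druid-s3-extensions"]
  druid.storage.type=s3
  druid.storage.bucket=your-bucket
  druid.storage.baseKey=druid/segments
  # Explicit credentials; see the later replies in this thread for role-based access
  druid.s3.accessKey=<your access key>
  druid.s3.secretKey=<your secret key>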

If yes, is there any way we can bypass the S3 access key and secret key? We have set up the EC2 instances with roles, which should ideally enable access to S3 without having to specify them; however, that doesn't seem to be the case.

I am not 100% sure, but most likely you need to have that in place.

Does the coordinator distribute the tasks randomly?

I think you mean segments, not tasks, right? If you are asking about segments, then yes, the coordinator will load balance them, and there are a couple of optimizations in place.

I went through the document; however, it was not clear to me how I can evenly distribute the data.

The coordinator will spread the segments across the historicals within the same tier.

2. We are trying to have a failover setup by having multiple historical nodes. The objective is that if one of the historical nodes goes down, we should still be able to query the data set without issues. Does it work only with the S3/HDFS deep storage option?

Yes, it does work, by setting a replication factor of 2 or more (the replication factor is a Druid config, not at all related to HDFS or S3). Keep in mind that Druid serves the data from the nodes' local memory, so it is totally independent of the deep storage. The deep storage is only used to load the segments (data) once.
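
As a sketch (assuming the default tier name), a "load forever" rule that keeps 2 replicas of every segment would look like the following; it is set per datasource, or as a default rule, through the coordinator console or rules API:

  {
    "type": "loadForever",
    "tieredReplicants": {
      "_default_tier": 2
    }
  }

With 3 historicals and 2 replicas per segment, losing one node still leaves a queryable copy loaded while the coordinator re-replicates the missing segments from deep storage.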

3. We are ingesting data via the simple ingestion task (http://druid.io/docs/latest/ingestion/batch-ingestion.html). Loading around 80m records takes around 27 minutes, with some parallelization in place, i.e. by submitting tasks for different intervals and executing them in parallel. This seems to be very slow. It looks like the configuration I use needs to change.

Are you doing it locally, or are you submitting the task to a Hadoop cluster or EMR? The way to scale this is to use a proper Hadoop cluster. Looking at the numbers, it looks like you are doing the work on only one machine.
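
For comparison with the simple index task, a Hadoop batch ingestion task is submitted the same way but with type index_hadoop. The sketch below is a skeleton only: the datasource name, columns, metrics, interval, and input path are all placeholders to be replaced with your own spec (and s3n vs. s3a depends on your Hadoop version):

  {
    "type": "index_hadoop",
    "spec": {
      "dataSchema": {
        "dataSource": "my_datasource",
        "parser": {
          "type": "hadoopyString",
          "parseSpec": {
            "format": "csv",
            "timestampSpec": { "column": "timestamp", "format": "auto" },
            "columns": ["timestamp", "dim1", "dim2", "value"],
            "dimensionsSpec": { "dimensions": ["dim1", "dim2"] }
          }
        },
        "metricsSpec": [ { "type": "doubleSum", "name": "value_sum", "fieldName": "value" } ],
        "granularitySpec": {
          "type": "uniform",
          "segmentGranularity": "DAY",
          "queryGranularity": "NONE",
          "intervals": ["2016-01-01/2016-02-01"]
        }
      },
      "ioConfig": {
        "type": "hadoop",
        "inputSpec": { "type": "static", "paths": "s3n://your-bucket/path/to/flat-files" }
      },
      "tuningConfig": { "type": "hadoop" }
    }
  }

When the overlord is configured against a remote Hadoop cluster or EMR, the indexing work runs as MapReduce jobs instead of on a single machine, which is where the speedup comes from.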

FYI, we have set up an m3.xlarge instance for the coordinator and r3.2xlarge instances (3) for the historicals, and I have used the default configurations.

It would be great if you could share some insights.

Sunil

Slim!

This is duplicated by https://groups.google.com/forum/#!topic/druid-development/YALPDrTHO7M. Please keep discussions within one thread.

Thanks, Slim and Fangjin. Sorry for posting this in multiple forums. I thought I had initially posted to the wrong forum and hence ended up posting here as well.
I'll try these things out and let you know.

Hi Slim & Fangjin,

I came across https://github.com/druid-io/druid/pull/837 and reckon it is now possible to bypass the S3 access key and secret key.
Please correct me if I am wrong.
The reason we would like to do that is that we have been using EC2 instances to which policies (read from / write to S3 buckets) have been attached.

Regards
Sunil

Hi All,

After some days of struggle we were able to set up S3 deep storage without providing the access keys and secret keys. It looks like there was some problem with encryption. It started working after adding a jets3t.properties file to the Druid classpath.

https://groups.google.com/forum/#!topic/druid-user/Olr3MB6BnsQ
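
For anyone hitting the same wall, here is a hedged sketch of the kind of jets3t.properties entries that typically matter (endpoint, signature version, server-side encryption); the exact values depend on your region and bucket settings, so treat them as assumptions rather than a drop-in fix:

  # Placed on the Druid classpath (e.g. the conf directory of each node)
  s3service.s3-endpoint=s3.us-east-1.amazonaws.com
  s3service.https-only=true
  # Needed for regions/buckets that require AWS signature version 4
  storage-service.request-signature-version=AWS4-HMAC-SHA256
  # Needed if the bucket enforces server-side encryption
  s3service.server-side-encryption=AES256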

However, we encountered another hurdle after this. We were not able to execute hadoop-index tasks, as they were expecting a username and password to be supplied. We used normal index tasks for ingestion and the performance was not good. I don't have the error message handy, but I just wanted to check if someone has come across a similar issue.

Regards
Sunil

Adding a little more info below.

If yes, is there any way we can bypass the S3 access key and secret key? We have set up the EC2 instances with roles, which should ideally enable access to S3 without having to specify them; however, that doesn't seem to be the case.

I am not 100% sure, but most likely you need to have that in place.

The AWS credentials in Druid are resolved in the following order:

  return new AWSCredentialsProviderChain(
      new ConfigDrivenAwsCredentialsConfigProvider(config),
      new LazyFileSessionCredentialsProvider(config),
      new EnvironmentVariableCredentialsProvider(),
      new SystemPropertiesCredentialsProvider(),
      new ProfileCredentialsProvider(),
      new InstanceProfileCredentialsProvider());
  1. Configuration
  2. A specified file
  3. Environment Variables
  4. System Properties
  5. Profile
  6. Instance Role

Not storing AWS keys and only having role-based access does indeed work.
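
Concretely, under that assumption the S3 section of common.runtime.properties can simply leave out the key properties; a minimal sketch (bucket and base key are placeholders):

  druid.storage.type=s3
  druid.storage.bucket=your-bucket
  druid.storage.baseKey=druid/segments
  # druid.s3.accessKey and druid.s3.secretKey deliberately left unset, so the
  # provider chain above falls through to InstanceProfileCredentialsProvider,
  # which picks up the EC2 instance role.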

Does the coordinator distribute the tasks randomly?

I think you mean segments, not tasks, right? If you are asking about segments, then yes, the coordinator will load balance them, and there are a couple of optimizations in place.

I went through the document; however, it was not clear to me how I can evenly distribute the data.

The coordinator will spread the segments across the historicals within the same tier.

See https://github.com/druid-io/druid/pull/2972 for more info.