Experience with AWS Druid cluster

My question has two parts.

  1. Has anyone set up Druid on AWS, and are there instructions I can follow to do the same?

I am looking to set up a medium-sized cluster to play around with a small set of data.

  2. For ingesting data into an AWS Druid cluster, can the data be pushed from a local machine to the remote AWS cluster? Is the data transfer seamless?

I found this document:

http://druid.io/blog/2013/04/03/15-minutes-to-live-druid.html

Is there any other documentation or information available?

Thanks.

Also, this link is no longer working:

https://github.com/metamx/druid/wiki/Druid-Personal-Demo-Cluster

it now redirects to https://github.com/metamx/druid.

Hi, see inline.

My question has two parts.

  1. Has anyone set up Druid on AWS, and are there instructions I can follow to do the same?

I am looking to set up a medium-sized cluster to play around with a small set of data.

Many production clusters I know of run in AWS. Are you past the point of setting it up locally on your own computer? If not, I suggest trying that first. If you are, can you give some details about the cluster you are trying to set up?

  2. For ingesting data into an AWS Druid cluster, can the data be pushed from a local machine to the remote AWS cluster? Is the data transfer seamless?

No. The AWS cluster must be able to access the data somehow. One way is to put your raw data in S3 and ingest from S3.
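
As a rough illustration of the S3 route, here is a minimal sketch (not from the thread) of pushing local raw data files up to a bucket with boto3 before kicking off ingestion; the bucket name, key prefix, and local directory are placeholders you would replace with your own.

```python
# Sketch: upload local raw data files to S3 so the AWS Druid cluster can reach them.
# Bucket name, key prefix, and local directory below are hypothetical placeholders.
import os
import boto3

BUCKET = "my-druid-raw-data"   # hypothetical bucket
PREFIX = "raw/2015-09-12"      # hypothetical key prefix
LOCAL_DIR = "/tmp/raw-data"    # directory holding your raw JSON/CSV files

s3 = boto3.client("s3")  # uses whatever AWS credentials you have configured

for name in sorted(os.listdir(LOCAL_DIR)):
    local_path = os.path.join(LOCAL_DIR, name)
    key = f"{PREFIX}/{name}"
    s3.upload_file(local_path, BUCKET, key)
    print(f"uploaded {local_path} -> s3://{BUCKET}/{key}")
```

The AWS CLI (aws s3 cp --recursive, or aws s3 sync) does the same job from the command line.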

Hi Fangjin,

Yes, locally I am able to push data into Druid, query it, etc. without any issues. I want to increase the scale and also play with a cluster setup, which is why I am asking about an AWS setup. I was following this guide (https://github.com/druid-io/druid-io.github.io/blob/master/docs/0.6.73/Druid-Personal-Demo-Cluster.md), but it fails at the stack creation step. Let's say I want to push about <= 25GB of data and play with it. Are there any steps I can follow to set up an AWS Druid cluster with S3 as deep storage?

I am still getting familiar with Druid, so if there are instructions I can follow for this setup, please let me know!

Thanks!

Hi Raj, I would recommend that you load your raw data into S3 and use the Hadoop indexing task to create segments. For that volume of data, you only need a single historical node with >25GB of RAM (most AWS instance types meet this requirement). You can colocate ZK, the coordinator, and the metadata store on another node and experiment with a 2-node cluster.
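
For concreteness, here is a minimal sketch of what submitting a Hadoop batch indexing task against that S3 data might look like, assuming an overlord reachable at druid-overlord.example.com:8090 and a raw JSON file already uploaded to the bucket from the earlier sketch. The data source name, dimensions, interval, and paths are placeholders, and the exact spec fields vary by Druid version, so check the batch ingestion docs for the version you deploy.

```python
# Sketch: submit a Hadoop batch indexing task (index_hadoop) to the Druid overlord.
# Overlord address, data source, schema, interval, and S3 paths are placeholders.
import json
import requests

OVERLORD = "http://druid-overlord.example.com:8090"  # hypothetical overlord address

task = {
    "type": "index_hadoop",
    "spec": {
        "dataSchema": {
            "dataSource": "my_dataset",
            "parser": {
                "type": "hadoopyString",
                "parseSpec": {
                    "format": "json",
                    "timestampSpec": {"column": "timestamp", "format": "auto"},
                    "dimensionsSpec": {"dimensions": ["page", "user", "country"]},
                },
            },
            "metricsSpec": [{"type": "count", "name": "count"}],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "DAY",
                "queryGranularity": "NONE",
                "intervals": ["2015-09-12/2015-09-13"],
            },
        },
        "ioConfig": {
            "type": "hadoop",
            # Raw data previously uploaded to S3 (see the upload sketch above).
            "inputSpec": {
                "type": "static",
                "paths": "s3n://my-druid-raw-data/raw/2015-09-12/data.json",
            },
        },
        "tuningConfig": {"type": "hadoop"},
    },
}

resp = requests.post(
    f"{OVERLORD}/druid/indexer/v1/task",
    data=json.dumps(task),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
print("submitted:", resp.json())  # the overlord replies with the new task id
```

Independently of ingestion, the cluster itself needs S3 configured as deep storage in its common runtime properties (roughly: load the S3 extension and set druid.storage.type=s3 plus the bucket and credential properties); the exact property names are in the deep storage docs for the Druid version you run.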