Indexing Service Questions

I want to set up a cluster providing the indexing service. After reading the Druid docs and the questions/answers in the Druid user group, I still have the following questions:

  1. If the middle manager and overlord are on separate nodes, do we have to use HDFS or S3? My HDFS is not ready yet, so I want to see if a shared NFS directory can be used instead.
  2. The middle manager can spawn peons to ingest different datasources. To post events to a peon, should we always use a URL of the form http://peonHost:port/druid/worker/v1/chat/{serviceName}/push-events/, where the peon's host/port are the same as the middle manager's?
  3. Can we post a JSON array to that URL, in case the event producer buffers events somehow?

Thanks.

  1. Are you talking about deep storage? I think people have got NFS working as deep storage by telling Druid to treat it as “local”.
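For reference, a sketch of the properties people usually set for this, assuming the NFS share is mounted at /mnt/druid-deep (a made-up path; adjust to your mount point):

```properties
# Treat the shared NFS mount as "local" deep storage.
druid.storage.type=local
druid.storage.storageDirectory=/mnt/druid-deep/segments
```

The key point is that every node reading or writing segments must have the share mounted at the same path.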

  2. Your URL path is right but the peon port is actually different from the middle manager’s. Each peon has its own http server. It’ll announce its host and port under its serviceName in ZooKeeper service discovery.
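To make the announcement concrete: Curator stores each instance as a small JSON blob in ZooKeeper (under the discovery path configured for your cluster; the exact znode layout and the assumption that the chat service name matches the instance name are mine, not from the thread). A minimal Python sketch of turning that blob into a push-events URL:

```python
import json

def peon_endpoint(znode_data):
    """Build the push-events URL from a discovery znode's JSON payload.

    Curator's ServiceInstance JSON carries (among other fields) the
    instance's name, address, and port.
    """
    instance = json.loads(znode_data)
    host, port = instance["address"], instance["port"]
    name = instance["name"]
    return "http://%s:%d/druid/worker/v1/chat/%s/push-events/" % (host, port, name)

# Example payload in the shape Curator writes (values are made up):
sample = '{"name": "firehose:service:ds", "address": "10.0.0.5", "port": 8100}'
print(peon_endpoint(sample))
```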

  3. Yes, and this is in fact recommended. You’ll get better throughput if you batch a bit on the producer side.
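A rough producer-side sketch of what that batching looks like, using only the standard library (the URL, event fields, and batch size here are invented for illustration):

```python
import json
import urllib.request

def post_events(url, events, timeout=5):
    """POST a batch of events as one JSON array to a peon's push-events URL.

    Batching several events per request gives better throughput than
    posting one event at a time.
    """
    body = json.dumps(events).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read()

# The payload must be a JSON array, e.g. [{...}, {...}]:
batch = [
    {"timestamp": "2015-01-01T00:00:00Z", "page": "a", "count": 1},
    {"timestamp": "2015-01-01T00:00:01Z", "page": "b", "count": 2},
]
print(json.dumps(batch))
```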

If you’re interested, the code in tranquility that sets up Druid tasks is here: https://github.com/metamx/tranquility/blob/master/src/main/scala/com/metamx/tranquility/druid/DruidBeamMaker.scala; and the code that actually sends data to them is here: https://github.com/metamx/tranquility/blob/master/src/main/scala/com/metamx/tranquility/druid/DruidBeam.scala. Those might be useful if you’re looking at implementing your own thing. Or, you could use tranquility.

Thanks for the clarification. Yes, for 1 I meant deep storage. For 2, is there any web service to obtain the peon URL for a given serviceName? For 3, should the payload be a JSON array [{…},{…}], or just {…},{…}, or is the peon smart enough to handle either?

For 2, there is not one as part of Druid, but you could use this one: http://curator.apache.org/curator-x-discovery-server/
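If you go that route, lookups are plain HTTP GETs against the discovery server's REST API (the hostname below is a placeholder, and the instance-JSON field names follow Curator's ServiceInstance format):

```python
import json
import urllib.request

DISCOVERY_SERVER = "http://discovery.example.com:8080"  # hypothetical host

def lookup_instances(service_name):
    """Query curator-x-discovery-server for a service's registered instances.

    GET /v1/service/{name} returns a JSON list of instances, each with
    fields like name, address, and port.
    """
    url = "%s/v1/service/%s" % (DISCOVERY_SERVER, service_name)
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))

# URL construction only (no live server in this sketch):
print("%s/v1/service/%s" % (DISCOVERY_SERVER, "firehose:service:ds"))
```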

For 3, it should be a JSON array.

Great!

One more question: in remote mode, can we use an NFS directory for task logging instead of S3 or HDFS?

Thanks.

Hi Gyrokinetc, I'm inclined to say this should be fine, as nothing prevents this kind of setup. Although S3 and HDFS are the popular options, I know other folks out there who do use NFS.
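The same "treat it as local" idea applies to task logs; a sketch of the properties, again assuming a hypothetical mount point shared by all middle-manager nodes:

```properties
# Write task logs to a directory on the shared NFS mount.
druid.indexer.logs.type=file
druid.indexer.logs.directory=/mnt/druid-deep/task-logs
```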