Druid cluster - Local Deep Storage mounted as the same shared folder on realtime and historical nodes

Hi!

Is it possible to run a clustered Druid with Deep Storage (type: Local Mount) mounted as the same shared folder on realtime and historical nodes?

common.runtime.properties:

```
# Deep storage
druid.storage.type=local
druid.storage.storageDirectory=/var/log/druid/storage
```

The machine folders are laid out as follows:

MACHINE-STORAGE: /folders/druid-deep-storage

DRUID-REALTIME: /var/log/druid/storage --> remote mounted to point to MACHINE-STORAGE:/folders/druid-deep-storage

DRUID-HISTORICAL: /var/log/druid/storage --> remote mounted to point to MACHINE-STORAGE:/folders/druid-deep-storage
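For reference, one way to realize the layout above is an NFS export on MACHINE-STORAGE mounted on both Druid nodes. This is only a sketch: the hostnames and paths are taken from the example above, while the export syntax and mount options are assumptions to adapt to your environment.

```
# On MACHINE-STORAGE: export the deep-storage folder (/etc/exports),
# restricting access to the two Druid hosts (assumed hostnames)
# /folders/druid-deep-storage  druid-realtime(rw,sync)  druid-historical(rw,sync)

# On DRUID-REALTIME and DRUID-HISTORICAL: mount the export at the path
# that druid.storage.storageDirectory points to
sudo mkdir -p /var/log/druid/storage
sudo mount -t nfs MACHINE-STORAGE:/folders/druid-deep-storage /var/log/druid/storage

# Or persist it in /etc/fstab so it survives reboots:
# MACHINE-STORAGE:/folders/druid-deep-storage  /var/log/druid/storage  nfs  defaults,_netdev  0 0
```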

I don’t want to start with HDFS or S3 for our first production use; I want to keep it as simple as possible, knowing that if our MACHINE-STORAGE goes down I’ll have problems, but that’s a trade-off…

Hi! Is there anybody out there? :wink:

Davor, the committers volunteer their time to help out the community and try to find time where possible to answer questions. If you need dedicated help, please visit http://imply.io/

Help me so I can get this running, and then you’ll probably have another client for enterprise support :slight_smile:

Hi Davor, please see inline.

Hi!

Is it possible to run a clustered Druid with Deep Storage (type: Local Mount) mounted as the same shared folder on realtime and historical nodes?

I’m not 100% sure of the question being asked, but in general, Druid nodes are really just processes, and many different types of nodes can be colocated together. For example, overlord and coordinator nodes should be colocated. Realtime and historical nodes can be colocated for smaller workloads, and given dedicated hardware for larger workloads.

Deep storage is a permanent backup of data. It is not involved in the query path and for most production clusters, deep storage is HDFS or S3.
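As a point of comparison, switching from local to S3 deep storage later is mostly a change to common.runtime.properties. A sketch (the bucket name and credentials are placeholders, and the exact extension-loading property name varies by Druid version):

```
# Load the S3 extension (property name depends on Druid version)
druid.extensions.loadList=["druid-s3-extensions"]

# Deep storage on S3
druid.storage.type=s3
druid.storage.bucket=your-druid-bucket
druid.storage.baseKey=druid/segments
druid.s3.accessKey=...
druid.s3.secretKey=...
```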

common.runtime.properties:

```
# Deep storage
druid.storage.type=local
druid.storage.storageDirectory=/var/log/druid/storage
```

The machine folders are laid out as follows:

MACHINE-STORAGE: /folders/druid-deep-storage

DRUID-REALTIME: /var/log/druid/storage --> remote mounted to point to MACHINE-STORAGE:/folders/druid-deep-storage

DRUID-HISTORICAL: /var/log/druid/storage --> remote mounted to point to MACHINE-STORAGE:/folders/druid-deep-storage

I don’t want to start with HDFS or S3 for our first production use; I want to keep it as simple as possible, knowing that if our MACHINE-STORAGE goes down I’ll have problems, but that’s a trade-off…

You can use NFS for production, as others have done before, to ‘mimic’ using the local filesystem.

Thank you very much!

  1. What about segment storage? Can it be on NFS?

  2. Is segment storage involved in the query path? As far as I know, it should be.

  3. If I put the segment storage on the local filesystem, when one realtime or historical node goes down, is its segment replicated to another realtime / historical machine? AFAIK, it isn’t. I guess that’s the point of having a distributed filesystem like HDFS.

Hi Davor,

There are two concepts here: deep storage and the segment cache. As FJ mentioned, deep storage should be a distributed FS such as S3 or HDFS, or NFS if you’d like. Historical nodes load segments from deep storage into a segment cache, which should be local, and use memory-mapped I/O to load the segments for scanning. So I’m not sure what you mean by segment storage, but hopefully the above explanation answers your questions. NFS can be used for deep storage, and segment caches should be stored locally on the fastest available drives (SSDs are cool).
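To make the distinction concrete, here is roughly how the two sides look on a historical node. This is a sketch: the cache path and maxSize are placeholders to adapt, while the property names are the standard Druid ones.

```
# historical runtime.properties
# Segment cache on fast LOCAL disk (ideally SSD), not on the NFS mount
druid.segmentCache.locations=[{"path": "/mnt/ssd/druid/segment-cache", "maxSize": 100000000000}]

# Deep storage (from common.runtime.properties) can stay on the shared NFS mount
druid.storage.type=local
druid.storage.storageDirectory=/var/log/druid/storage
```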

Sorry, I meant segment cache! You picked up on what I meant and answered it, thank you very much David!