Question about (undocumented) features of Kafka Indexing Service

I feel the Kafka Indexing Service has a lot left to be documented. There are a lot of options there that are not clearly defined.

Regarding KafkaSupervisorTuningConfig:

  1. What is Intermediate Persist? The fields intermediatePersistPeriod and maxPendingPersists refer to it, but there is no documentation for what it is.

Does this require disk space? Which property configures the location?

  2. Is the workerThreads property used on the overlord, middle manager or peon? The description says “The number of threads that will be used by the supervisor”. Does ‘supervisor’ refer to the overlord here? Does this mean I can control the number of threads on the Overlord from this config?

  3. chatThreads: similar to the above. Which node (overlord, MM, or peon) does this affect?

  4. segmentWriteOutMediumFactory: Does specifying ‘tmpFile’ have an advantage in that the data can be reused by the peon in case it is shut down unexpectedly?

There needs to be some idea of the size or contents of it. The documentation just says “Druid temporarily stores some pre-processed data in some buffers”. What does ‘some’ mean here?

If my segment is 100MB in size, is this ‘pre-processed data’ going to be in kilobytes, single digit megabytes, 10 megabytes?

At least some info should be given as to what the data is or whether using disk has any extra advantage. Without any of this, it’s like choosing a random number right now.

MM nodes functionality:

  1. Let’s say all my MM nodes are shut down and recovered unexpectedly as part of maintenance by the cloud provider.

When they come back up, do they need to restart ingesting the segment from the very beginning?

For example:

Let’s say I make 24-hour segments. At 23:15 all my MM nodes reboot. When they come back up, do they have to start scanning from 00:00?

This is important to know, since it means there will be a period of time where the realtime data is not available. And this gap can grow if the segment is too large, for example if I had a monthly segment.

  2. Can MM nodes themselves have really low heap sizes? It seems their job is simply to spawn peons. Can they have a -Xmx of just 512MB?

bump

  1. What is Intermediate Persist? The fields intermediatePersistPeriod and maxPendingPersists refer to it, but there is no documentation for what it is.
    Does this require disk space? Which property configures the location?

Yes, it requires disk space. The “persists” refer to partial segment data that’s persisted to disk during ingestion (to free up memory for more ingestion). The location is “java.io.tmpdir” by default, and it’s configurable via “basePersistDirectory” in the tuningConfig.
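
To make the placement of these settings concrete, here is a minimal sketch that builds the tuningConfig portion of a Kafka supervisor spec as a Python dict and prints it as JSON. The helper name build_tuning_config, the directory path, and the values are placeholders chosen for this example, not recommendations; check the field names against the docs for your Druid version.

```python
import json


def build_tuning_config(persist_dir, persist_period="PT10M", max_pending_persists=0):
    """Return a tuningConfig fragment for a Kafka supervisor spec.

    Hypothetical helper for illustration only; the values are assumptions.
    """
    return {
        "type": "kafka",
        # Directory where partial segment data (intermediate persists) is written.
        "basePersistDirectory": persist_dir,
        # ISO 8601 period between intermediate persists.
        "intermediatePersistPeriod": persist_period,
        # Number of persists allowed to be pending but not yet started.
        "maxPendingPersists": max_pending_persists,
    }


if __name__ == "__main__":
    # The path below is a made-up example location with enough free disk space.
    print(json.dumps(build_tuning_config("/mnt/druid/persist"), indent=2))
```

Pointing basePersistDirectory at a dedicated volume keeps the persisted data off whatever disk backs java.io.tmpdir.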

  1. Is the workerThreads property used on the overlord, middle manager or peon? The description says “The number of threads that will be used by the supervisor”. Does ‘supervisor’ refer to the overlord here? Does this mean I can control the number of threads on the Overlord from this config?

The “supervisor” in the context of Kafka indexing is a thread on the overlord that manages Kafka indexing tasks. workerThreads controls the number of threads used by this supervisor thread for managing tasks.

  1. chatThreads: similar to the above. Which node (overlord, MM, or peon) does this affect?

chatThreads is similar to workerThreads above: it affects the supervisor thread on the overlord.

  1. segmentWriteOutMediumFactory: Does specifying ‘tmpFile’ have an advantage in that the data can be reused by the peon in case it is shut down unexpectedly?

AFAIK, there’s no such functionality. For resiliency, there is the replicas property, for spawning replicated tasks.
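
As a rough sketch of the replicas approach, the snippet below builds a partial ioConfig for a Kafka supervisor spec. It assumes replicas sits in the supervisor's ioConfig next to taskCount and that the supervisor runs taskCount * replicas reading tasks in total; the helper names and the topic name are made up for the example, and a real ioConfig needs more fields (such as consumerProperties).

```python
import json


def build_io_config(topic, replicas=2, task_count=1):
    """Return a partial ioConfig fragment (hypothetical helper, not a full spec)."""
    return {
        "topic": topic,
        # Number of replica sets; each replica task reads the same partitions.
        "replicas": replicas,
        # Number of reading tasks per replica set.
        "taskCount": task_count,
    }


def total_reading_tasks(io_config):
    """Total reading tasks under the stated assumption: taskCount * replicas."""
    return io_config["taskCount"] * io_config["replicas"]


if __name__ == "__main__":
    cfg = build_io_config("example-topic", replicas=2, task_count=3)
    print(json.dumps(cfg, indent=2))
    print("reading tasks:", total_reading_tasks(cfg))
```

Each extra replica multiplies the number of task slots needed on the middle managers, so the resiliency comes at a capacity cost.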

When they come back up, do they need to restart ingesting the segment from the very beginning?

As I understand, yes, the tasks started after the nodes come back up will start reading from the last persisted offset…