Questions on Peon/Kafka indexing task

I have a Kafka-based ingestion running and have observed a few things, so I am raising a few questions to understand them better:

  1. At any point in time, I see 24 open segments (realtime, not yet published). These 24 segments cover the past 24 hourly intervals (segment granularity is HOUR). Is this standard Druid behaviour, or can the number of open segments be changed?

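To make question 1 concrete, this is roughly the shape of the spec I am running (trimmed, with illustrative values; the exact spec is attached below). If I read the docs correctly, segmentGranularity in granularitySpec plus taskDuration and lateMessageRejectionPeriod in the supervisor ioConfig are the settings that would influence how many hourly intervals a task keeps open; I have not actually set lateMessageRejectionPeriod, I am only listing it as a guess:

```json
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "hour_level_druid_ingestion_test",
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "NONE"
    }
  },
  "ioConfig": {
    "topic": "...",
    "taskCount": 1,
    "taskDuration": "PT1H",
    "lateMessageRejectionPeriod": "P1D"
  }
}
```

From the log lines under question 2, the task is still receiving rows for intervals almost a day old (e.g. the 2020-04-09T06:00 hour is being persisted at 2020-04-10T04:59), so presumably that is why all 24 hourly segments stay open.
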
  2. In the indexing tasks’ logs, I see the following:

2020-04-10T04:59:00,734 INFO [[index_kafka_hour_level_druid_ingestion_test_b32bebb7930a3d1_oiipeeed]-appenderator-persist] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Flushed in-memory data for segment[hour_level_druid_ingestion_test_2020-04-09T10:00:00.000Z_2020-04-09T11:00:00.000Z_2020-04-09T13:25:58.259Z_6] spill[0] to disk in [173] ms (8 rows).

2020-04-10T04:59:00,782 INFO [[index_kafka_hour_level_druid_ingestion_test_b32bebb7930a3d1_oiipeeed]-appenderator-persist] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Flushed in-memory data for segment[hour_level_druid_ingestion_test_2020-04-09T17:00:00.000Z_2020-04-09T18:00:00.000Z_2020-04-10T04:59:00.480Z] spill[0] to disk in [18] ms (12 rows).

2020-04-10T04:59:00,799 INFO [[index_kafka_hour_level_druid_ingestion_test_b32bebb7930a3d1_oiipeeed]-appenderator-persist] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Flushed in-memory data for segment[hour_level_druid_ingestion_test_2020-04-09T06:00:00.000Z_2020-04-09T07:00:00.000Z_2020-04-09T06:00:05.795Z_17] spill[0] to disk in [15] ms (8 rows).

2020-04-10T04:59:00,821 INFO [[index_kafka_hour_level_druid_ingestion_test_b32bebb7930a3d1_oiipeeed]-appenderator-persist] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Flushed in-memory data for segment[hour_level_druid_ingestion_test_2020-04-10T02:00:00.000Z_2020-04-10T03:00:00.000Z_2020-04-10T04:59:00.225Z] spill[0] to disk in [19] ms (14 rows).

2020-04-10T04:59:00,841 INFO [[index_kafka_hour_level_druid_ingestion_test_b32bebb7930a3d1_oiipeeed]-appenderator-persist] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Flushed in-memory data for segment[hour_level_druid_ingestion_test_2020-04-09T07:00:00.000Z_2020-04-09T08:00:00.000Z_2020-04-09T07:00:10.751Z_16] spill[0] to disk in [17] ms (14 rows).


My question is why these flushes/spills to disk happen with so few records. My ingestion rate is approximately 500 records per second. For reference, I have provided the cluster config and ingestion spec below.

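For reference on question 2, these are the tuningConfig fields that, as far as I can tell from the docs, decide when in-memory rows are persisted. I have not overridden any of them, so the values below are what I believe the defaults to be, not values from my attached spec:

```json
"tuningConfig": {
  "type": "kafka",
  "maxRowsInMemory": 1000000,
  "maxBytesInMemory": 0,
  "intermediatePersistPeriod": "PT10M",
  "maxRowsPerSegment": 5000000,
  "maxPendingPersists": 0
}
```

My working theory (please correct me) is that either the 10-minute intermediatePersistPeriod or maxBytesInMemory (0 meaning one-sixth of the max heap, so roughly 170 MB for a 1 GB peon heap) triggers a persist, and a persist flushes every open segment, so with ~24 open hourly segments each individual segment only has a handful of rows to spill. Is that the right way to read these logs?
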
  3. When a Peon process is created, what exactly uses the heap memory and what uses the direct memory? (A rough sketch of the relevant settings is included after the cluster config below.)

Cluster Config:

  • 4 data servers (MiddleManager and Historical colocated): 16 GB RAM, 6 cores each

  • 1 Coordinator-Overlord node: 16 GB RAM, 8 cores

  • 1 Broker-Router node: 16 GB RAM, 8 cores

Process-level JVM config:

  • Historical: 3 GB max heap, 10 GB max direct memory

  • MiddleManager: 128 MB heap

  • Task configuration: 1 GB heap, 4 GB max direct memory, numMergeBuffers=2, numThreads=2, sizeBytes=256000000

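And for question 3, this is roughly how the task JVM and processing settings are passed down on the MiddleManagers. The property names are the standard MiddleManager runtime properties; the snippet is paraphrased from my config rather than copied verbatim:

```properties
# middleManager runtime.properties (paraphrased, values as described above)
# JVM options handed to each peon (task) process
druid.indexer.runner.javaOptsArray=["-server","-Xms1g","-Xmx1g","-XX:MaxDirectMemorySize=4g"]

# Processing settings forwarded to each peon
druid.indexer.fork.property.druid.processing.numThreads=2
druid.indexer.fork.property.druid.processing.numMergeBuffers=2
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=256000000
```

If I follow the sizing guidance in the docs, the direct memory a peon needs for its processing and merge buffers is roughly (numThreads + numMergeBuffers + 1) * buffer.sizeBytes = (2 + 2 + 1) * 256 MB ≈ 1.25 GB, which the 4 GB max direct memory comfortably covers. What I am unsure about is the heap side: I assume the rows accumulated before a persist (governed by maxRowsInMemory / maxBytesInMemory) live on the heap, which is why the 1 GB heap would matter for question 2, but I would like confirmation.
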
Attaching the ingestion spec.

ingestion_spec.json (968 Bytes)

Bump.

Any idea why this behaviour is showing up? Can I tune the configuration so that these frequent flushes/spills of only a few records are avoided?

Thanks.