Kafka ingestion leads to missing events without any thrown-away events

Hello everyone,

I have a sample Kafka topic with 8225 events, for which I have created an ingestion spec to create a Druid datasource. The topic, like other topics I have created before, uses an Avro schema with Schema Registry. I don't have any trouble using Avro schemas; this is a process I have repeated in the past.
My issue is that from the 8225 events I end up with a datasource of only 7826 events. As shown in the screenshot below, it seems that all events were parsed with no errors, nothing thrown away or unparseable, which is why I didn't find any logs related to this issue either.

Can you please suggest a possible cause of this problem and a way to debug this process?

Things I've tried
  • Deleting created segments and hard resetting ingestion spec using earliest offset
Architecture - Kafka 2.8.1 - Schema Registry 6.1.4

Relates to Apache Druid 0.22.1

Hi ZisisFL,

Welcome to the Druid Forum.

Right off the bat, can we rule out things such as rollup / granularity? Would it be possible for you to paste your ingestion spec here? (You can remove information you don't want to be public, such as IPs, column names, etc.)

Thanks!

Hello Vijeth, thank you for your response. Here is my ingestion spec; it includes only three string dimensions and a timestamp in Unix seconds.

{
    "type": "kafka",
    "spec": {
      "dataSchema": {
        "dataSource": "campaign-events",
        "timestampSpec": {
          "column": "event_timestamp",
          "format": "posix"
        },
        "dimensionsSpec": {
          "dimensions": [
            "event_name",
            "user_id",
            "url"
          ],
          "dimensionExclusions": ["event_timestamp"]
        },
        "parser": {
          "type": "avro_stream",
          "avroBytesDecoder": {
            "type": "schema_registry",
            "url": "SCHEMA_REGISTRY_URL"
          },
          "parseSpec": {
            "format": "avro",
            "flattenSpec": {
              "fields": []
            }
          }
        },
        "metricsSpec": [],
        "granularitySpec": {
          "type": "uniform",
          "segmentGranularity": "WEEK",
          "queryGranularity": "NONE"
        }
      },
      "tuningConfig": {
        "type": "kafka",
        "maxRowsPerSegment": 5000000
      },
      "ioConfig": {
        "topic": "campaign-events",
        "useEarliestOffset": true,
        "consumerProperties": {
          "bootstrap.servers": "",
          "security.protocol": "",
          "sasl.mechanism": "",
          "sasl.jaas.config": "",
          "ssl.truststore.location": "",
          "ssl.truststore.password": "",
          "ssl.truststore.type": "",
          "ssl.endpoint.identification.algorithm": ""
        },
        "taskCount": 1,
        "replicas": 1,
        "taskDuration": "PT1H"
      }
    }
  }

Thank you for sending this across. Druid enables rollup by default and will combine rows whose timestamp and dimensions are identical after truncation by `queryGranularity` (which is NONE in your case, so only exact duplicates would be combined).
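To illustrate, here is a hypothetical sketch (plain Python with made-up sample rows, not Druid code) of how rollup can shrink the row count even with `queryGranularity` set to NONE: rows that match on the timestamp and every dimension collapse into one.

```python
# Illustrative only: simulates Druid rollup with queryGranularity NONE,
# where rows identical across timestamp AND all dimensions are combined.
from collections import Counter

# Hypothetical events: (event_timestamp, event_name, user_id, url)
events = [
    (1650000000, "click", "u1", "/home"),
    (1650000000, "click", "u1", "/home"),  # exact duplicate -> rolled up
    (1650000001, "view",  "u2", "/home"),
]

# Counter groups identical tuples, the way rollup groups identical rows.
rolled_up = Counter(events)

print(len(events))     # 3 events ingested
print(len(rolled_up))  # 2 rows stored after rollup
```

This is the same effect as ingesting 8225 events and ending up with 7826 rows: no events are thrown away or unparseable, so no error logs appear; duplicates are simply merged.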

If this is possible in your data, then I’d explicitly disable rollup and see if that helps. Have you been able to find out which rows are being lost?
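For reference, disabling rollup is a one-line change to the `granularitySpec` in the spec you posted (`rollup` defaults to `true` when omitted):

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "WEEK",
  "queryGranularity": "NONE",
  "rollup": false
}
```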

Hello Vijeth, there were indeed duplicates in the source topic, so that is what caused the issue. I wasn't aware that rollup is enabled by default, so thank you for pointing it out; that is really crucial to know.


That is great news, I am glad we were able to get this working!