Kafka Indexing Service - does not index any dimensions - none shown in Overlord

I have the Kafka Indexing Service running against a Kafka topic that has Avro messages. These Avro messages are JSON-encoded.
The schemas are published to the Confluent Schema Registry.
There are some 160K messages with fully populated payloads in the topic. The segments were created in S3; however, none of the dimensions show any cardinality.
The Overlord console shows shards for the segment, but only one dimension is shown for each of them, while there are more than 20 dimensions.
Any help is highly appreciated! I suspect it has to do with the JSON path mapping?
@Indexing Task
```
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "",
    "parser": {
      "type": "avro_stream",
      "avroBytesDecoder": {
        "type": "schema_registry",
        "url": "http://"
      },
      "parseSpec": {
        "format": "avro",
        "flattenSpec": {
          "useFieldDiscovery": true,
          "fields": [
            {
              "type": "root",
              "name": "contentId",
              "expr": "$.contentId.string"
            },
            ...
          ]
        },
        "dimensionsSpec": {},
        "timestampSpec": {
          "column": "eventTime",
          "format": "auto"
        }
      }
    },
    "metricsSpec": [],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "DAY",
      "queryGranularity": "NONE",
      "rollup": false
    }
  },
  "ioConfig": {
    "topic": "",
    "consumerProperties": {
      "bootstrap.servers": "",
      "group.id": ""
    },
    "replicas": "2",
    "taskCount": "1",
    "taskDuration": "PT30M",
    "useEarliestOffset": true
  },
  "tuningConfig": {
    "type": "kafka",
    "resetOffsetAutomatically": true
  }
}
```


@AVRO Schema

```
{
  "schema": "{
    "type": "record",
    "name": "AtomicEvent",
    "namespace": "<Masked>",
    "fields": [
      { "name": "contentId", "type": ["null", "string"], "default": null },
      ...
    ]
  }"
}
```

```
2018-06-06T13:47:13,228 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[userId] inverted with cardinality[0] in 51 millis.
2018-06-06T13:47:13,279 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[eventType] inverted with cardinality[0] in 51 millis.
2018-06-06T13:47:13,334 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[eventAction] inverted with cardinality[0] in 55 millis.
2018-06-06T13:47:13,392 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[userbrowser.name] inverted with cardinality[0] in 58 millis.
2018-06-06T13:47:13,443 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[userbrowser.category] inverted with cardinality[0] in 51 millis.
2018-06-06T13:47:13,498 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[userbrowser.os] inverted with cardinality[0] in 55 millis.
2018-06-06T13:47:13,547 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[userbrowser.version] inverted with cardinality[0] in 49 millis.
2018-06-06T13:47:13,596 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[userbrowser.vendor] inverted with cardinality[0] in 49 millis.
2018-06-06T13:47:13,647 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[userbrowser.os_version] inverted with cardinality[0] in 51 millis.
2018-06-06T13:47:13,704 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[userAgent] inverted with cardinality[0] in 57 millis.
2018-06-06T13:47:13,756 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[eventSubType] inverted with cardinality[0] in 52 millis.
2018-06-06T13:47:13,817 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[searchKeywords] inverted with cardinality[0] in 61 millis.
2018-06-06T13:47:13,882 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[actor.districtRefId] inverted with cardinality[0] in 65 millis.
2018-06-06T13:47:13,931 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[actor.districtId] inverted with cardinality[0] in 49 millis.
2018-06-06T13:47:13,985 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[actor.schoolPid] inverted with cardinality[0] in 54 millis.
2018-06-06T13:47:14,036 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[actor.districtPid] inverted with cardinality[0] in 51 millis.
2018-06-06T13:47:14,086 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[actor.stateId] inverted with cardinality[0] in 50 millis.
2018-06-06T13:47:14,152 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[actor.roles] inverted with cardinality[0] in 66 millis.
2018-06-06T13:47:14,203 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[actor.browserId] inverted with cardinality[0] in 51 millis.
2018-06-06T13:47:14,256 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[target.viewType] inverted with cardinality[0] in 53 millis.
2018-06-06T13:47:14,314 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[target.extensions.programId] inverted with cardinality[0] in 58 millis.
2018-06-06T13:47:14,368 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[target.extensions.programName] inverted with cardinality[0] in 54 millis.
2018-06-06T13:47:14,417 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[usageContext.id] inverted with cardinality[0] in 49 millis.
2018-06-06T13:47:14,469 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[usageContext.name] inverted with cardinality[0] in 52 millis.
2018-06-06T13:47:14,519 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[usageContext.type] inverted with cardinality[0] in 50 millis.
2018-06-06T13:47:14,577 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[usageContext.description] inverted with cardinality[0] in 58 millis.
2018-06-06T13:47:14,626 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[usageContext.extensions.assignmentId] inverted with cardinality[0] in 49 millis.
2018-06-06T13:47:14,679 INFO [caliper-sample-atomic-avro-kafka-int-incremental-persist] io.druid.segment.StringDimensionMergerV9 - Completed dim[usageContext.resourceId] inverted with cardinality[0] in 53 millis.


2018-06-06T13:47:17,407 INFO [task-runner-0-priority-0] io.druid.segment.realtime.appenderator.BaseAppenderatorDriver - New segment[caliper-sample-atomic-avro-kafka-int_2018-06-05T00:00:00.000Z_2018-06-06T00:00:00.000Z_2018-06-06T13:47:17.390Z] for row[MapBasedInputRow{timestamp=2018-06-05T00:06:33.467Z, event={contentId=null, userId=null, eventType=null, eventAction=null, userbrowser.name=null, userbrowser.category=null, userbrowser.os=null, userbrowser.version=null, userbrowser.vendor=null, userbrowser.os_version=null, userAgent=null, eventSubType=null, searchKeywords=null, actor.districtRefId=null, actor.districtId=null, actor.schoolPid=null, actor.districtPid=null, actor.stateId=null, actor.roles=null, actor.browserId=null, target.viewType=null, target.extensions.programId=null, target.extensions.programName=null, usageContext.id=null, usageContext.name=null, usageContext.type=null, usageContext.description=null, usageContext.extensions.assignmentId=null, usageContext.resourceId=null}, dimensions=[contentId, userId, eventType, eventAction, userbrowser.name, userbrowser.category, userbrowser.os, userbrowser.version, userbrowser.vendor, userbrowser.os_version, userAgent, eventSubType, searchKeywords, actor.districtRefId, actor.districtId, actor.schoolPid, actor.districtPid, actor.stateId, actor.roles, actor.browserId, target.viewType, target.extensions.programId, target.extensions.programName, usageContext.id, usageContext.name, usageContext.type, usageContext.description, usageContext.extensions.assignmentId, usageContext.resourceId]}] sequenceName[index_kafka_caliper-sample-atomic-avro-kafka-int_3483c0c34065dd8_0].





2018-06-06T14:17:15,216 DEBUG [appenderator_merge_0] com.amazonaws.request - Received successful response: 200, AWS Request ID: 758B95442174AD4C
2018-06-06T14:17:15,217 DEBUG [appenderator_merge_0] com.amazonaws.requestId - x-amzn-RequestId: not available
2018-06-06T14:17:15,217 DEBUG [appenderator_merge_0] com.amazonaws.requestId - AWS Request ID: 758B95442174AD4C
2018-06-06T14:17:15,218 INFO [appenderator_merge_0] io.druid.storage.s3.S3DataSegmentPusher - Deleting temporary cached index.zip
2018-06-06T14:17:15,228 INFO [appenderator_merge_0] io.druid.storage.s3.S3DataSegmentPusher - Deleting temporary cached descriptor.json
2018-06-06T14:17:15,258 INFO [appenderator_merge_0] io.druid.segment.realtime.appenderator.AppenderatorImpl - Pushed merged index for segment[caliper-sample-atomic-avro-kafka-int_2018-05-29T00:00:00.000Z_2018-05-30T00:00:00.000Z_2018-06-06T11:03:08.775Z_2], descriptor is: DataSegment{size=114414, shardSpec=NumberedShardSpec{partitionNum=2, partitions=0}, metrics=, dimensions=, version='2018-06-06T11:03:08.775Z', loadSpec={type=>s3_zip, bucket=>hmheng-data-services/druid-segments/int, key=>segments/sample-data/caliper-sample-atomic-avro-kafka-int/2018-05-29T00:00:00.000Z_2018-05-30T00:00:00.000Z/2018-06-06T11:03:08.775Z/2/index.zip, S3Schema=>s3n}, interval=2018-05-29T00:00:00.000Z/2018-05-30T00:00:00.000Z, dataSource='caliper-sample-atomic-avro-kafka-int', binaryVersion='9'}
2018-06-06T14:17:15,264 INFO [appenderator_merge_0] io.druid.segment.realtime.appenderator.AppenderatorImpl - Pushing merged index for segment[caliper-sample-atomic-avro-kafka-int_2018-05-31T00:00:00.000Z_2018-06-01T00:00:00.000Z_2018-06-06T13:46:53.788Z].
2018-06-06T14:17:15,267 INFO [appenderator_merge_0] io.druid.segment.realtime.appenderator.AppenderatorImpl - Adding hydrant[FireHydrant{, queryable=caliper-sample-atomic-avro-kafka-int_2018-05-31T00:00:00.000Z_2018-06-01T00:00:00.000Z_2018-06-06T13:46:53.788Z, count=0}]
2018-06-06T14:17:15,267 INFO [appenderator_merge_0] io.druid.segment.realtime.appenderator.AppenderatorImpl - Adding hydrant[FireHydrant{, queryable=caliper-sample-atomic-avro-kafka-int_2018-05-31T00:00:00.000Z_2018-06-01T00:00:00.000Z_2018-06-06T13:46:53.788Z, count=1}]
2018-06-06T14:17:15,274 WARN [appenderator_merge_0] io.druid.segment.IndexMerger - Indexes have incompatible dimension orders, using lexicographic order.
2018-06-06T14:17:15,277 INFO [appenderator_merge_0] io.druid.segment.IndexMergerV9 - Using SegmentWriteOutMediumFactory[TmpFileSegmentWriteOutMediumFactory]
2018-06-06T14:17:15,312 INFO [appenderator_merge_0] io.druid.segment.IndexMergerV9 - Completed version.bin in 22 millis.
2018-06-06T14:17:15,336 INFO [appenderator_merge_0] io.druid.segment.IndexMergerV9 - Completed factory.json in 24 millis
2018-06-06T14:17:15,336 INFO [appenderator_merge_0] io.druid.segment.IndexMergerV9 - Completed dim conversions in 0 millis.
2018-06-06T14:17:15,532 INFO [appenderator_merge_0] io.druid.segment.IndexMergerV9 - completed walk through of 37,234 rows in 139 millis.
2018-06-06T14:17:15,562 INFO [appenderator_merge_0] io.druid.segment.IndexMergerV9 - Completed time column in 29 millis.
2018-06-06T14:17:15,562 INFO [appenderator_merge_0] io.druid.segment.IndexMergerV9 - Completed metric columns in 0 millis.
2018-06-06T14:17:15,567 INFO [appenderator_merge_0] io.druid.segment.IndexMergerV9 - Completed index.drd in 5 millis.

```

Any pointers here?

Hi Guys,

Here is my Overlord console for the Kafka ingestion. Essentially, each of the shards shows only one dimension, although I have the JSON path expressions defined to pull those fields out of the Avro messages. I'm wondering why the cardinality is '0' for every dimension?

![Auto Generated Inline Image 1.png|922x534](upload://v6rfWlny0kQqiRGjyOIEMtPjRsR.png)

The sample Avro data is JSON-encoded, and the 'eventTime' field has varying times spanning different years. Could this be a problem?

Essentially, we have a year's worth of data in Elasticsearch. We wish to move it over via an ETL job through Kafka and have it ingested using the Kafka Indexing Service.

Best Regards
Varaga

Hi Varaga,

For flattening you want to do something like this for paths:

```
{
  "type": "path",
  "name": "usageContext.resourceId",
  "expr": "$.usageContext.resourceId"
}
```

And you can leave the root fields like "contentId" out completely from the flattenSpec. (Because you have useFieldDiscovery: true, the root fields are discovered automatically.)

See here for some details and examples on flattening: http://druid.io/docs/latest/ingestion/flatten-json.html
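To make that concrete, here is a minimal parseSpec sketch along those lines (the field names are taken from this thread; treat it as an illustration rather than a verified working spec):

```
"parseSpec": {
  "format": "avro",
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      { "type": "path", "name": "usageContext.resourceId", "expr": "$.usageContext.resourceId" },
      { "type": "path", "name": "userbrowser.name", "expr": "$.userBrowser.name" }
    ]
  },
  "dimensionsSpec": {},
  "timestampSpec": {
    "column": "eventTime",
    "format": "auto"
  }
}
```

With an empty dimensionsSpec, the discovered root fields and the flattened "path" fields are all ingested as string dimensions.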

Hi Gian,

Here's my Avro record. I tried the flattenSpec with every possibility:
With a "path" field, defining the expression as
    $.eventType
    $.eventType.string
With a "root" field, defining the expression as
    $.eventType

With "useFieldDiscovery": true

With path defining the nested fields,

```
{
  "type": "path",
  "name": "actor.districtRefId",
  "expr": "$.actor.['com.<Masked>'].districtRefId.string"
}
```

 and also

```
{
  "type": "path",
  "name": "actor.districtRefId",
  "expr": "$.actor..districtRefId.string"
}
```

None of them works. Essentially, the task succeeds, but no dimensions show up. I can only get a count of the total events; I cannot groupBy, since no cardinality is fetched.

```
{
  "env": { "string": "int" },
  "id": { "string": "" },
  "userId": { "string": "ed73c58b-2f91-48b8-8657-61c5650406d9" },
  "contentId": { "string": "<Masked>" },
  "eventType": { "string": "<Masked>" },
  "eventAction": { "string": "Viewed" },
  "userBrowser": {
    "map": {
      "name": { "string": "Chrome" },
      "category": { "string": "pc" },
      "os": { "string": "Windows 10" },
      "version": { "string": "66.0.3359.181" },
      "vendor": { "string": "Google" },
      "os_version": { "string": "NT 10.0" }
    }
  },
  "userAgent": { "string": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36" },
  "eventSubType": { "string": "" },
  "eventTime": { "string": "2018-05-29T13:30:44.801Z" },
  "searchKeywords": { "string": "" },
  "actor": {
    "com.<Masked>": {
      "districtRefId": { "string": "ff064" },
      "districtId": { "string": "" },
      "schoolPid": { "string": "" },
      "districtPid": { "string": "" },
      "stateId": { "string": "OTH" },
      "roles": { "array": ["Instructor"] },
      "browserId": { "string": "" }
    }
  },
  "target": {
    "com.<Masked>": {
      "viewType": { "string": "" },
      "programId": { "string": "8523" },
      "programName": { "string": "1" },
      "extensions": {
        "map": {
          "programId": { "string": "" },
          "programName": { "string": "" }
        }
      }
    }
  },
  "usageContext": {
    "com.<Masked>": {
      "description": { "string": "" },
      "extensions": {
        "map": {
          "assignmentId": { "string": "" },
          "resourceId": { "string": "<Masked>" }
        }
      }
    }
  }
}
```

Hi Gian, All,

Any pointers would help us proceed forward.

TIA

Varaga

Hi Gian

The path expression does not work for a nested object containing a map of name:value pairs.

If you look at the payload attached earlier, the "userBrowser" object contains a map with name:value pairs. One of them is "name": { "string": "chrome" }.
The expression below seems to work in http://jsonpath.herokuapp.com/:

$.userBrowser..name


However, the same expression run through the indexing task does not find any cardinality for that field in the payload.

Do you suggest any other way to get to the bottom of the problem? I can confirm that the name property is not null!
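One thing worth noting (an assumption about why the behaviours might differ, not something confirmed in this thread): jsonpath.herokuapp.com evaluates the expression against the JSON-encoded text of the event, whereas at ingest time the flattener walks the decoded GenericRecord through GenericAvroJsonProvider, so the two can disagree. A quick sanity check against the JSON text with the jayway json-path library (the library Druid uses) could look like this; the tiny payload is illustrative only:

```java
import com.jayway.jsonpath.JsonPath;

public class PathCheck
{
  public static void main(String[] args)
  {
    // The JSON encoding of the Avro event keeps the union/map wrappers ("map", "string").
    String json = "{\"userBrowser\": {\"map\": {\"name\": {\"string\": \"Chrome\"}}}}";

    // A deep scan finds the wrapped value in the JSON text...
    Object fromJsonText = JsonPath.read(json, "$.userBrowser..name");
    System.out.println(fromJsonText); // e.g. [{"string":"Chrome"}]

    // ...but the decoded GenericRecord that the indexing task sees has no such
    // wrappers, so an expression that matches here can still miss at ingest time.
  }
}
```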

BTW, do you think that if one of the dimensions is missing in a subset of a batch (say, during a poll from Kafka), there is a possibility that the entire batch is not indexed?

Best
Varaga

Hi Gian, All,

I'm wondering if Avro complex data structures are supported?
Basically, the Avro data is a record with its schema registered in the Schema Registry.
The Avro record contains nested Avro record types, which in turn contain Avro maps.

Here is the schema below. You can check it against the corresponding Avro data (the decoded, JSON-encoded payload shared above).
This might shed some light on it?

```
{
  "schema": "{
    "namespace": "com.m1.m2.m3",
    "type": "record",
    "name": "AtomicEvent",
    "fields": [
      { "name": "env", "type": ["null", "string"], "default": null },
      { "name": "id", "type": ["null", "string"], "default": null },
      ...
      {
        "name": "userBrowser",
        "type": ["null", { "type": "map", "values": "string" }],
        "default": null
      },
      {
        "name": "actor",
        "type": [
          "null",
          {
            "namespace": "com.m1.m2.m3",
            "type": "record",
            "name": "Actor",
            "fields": [
              ...
              {
                "name": "roles",
                "type": ["null", { "type": "array", "items": "string" }],
                "default": null
              }
            ]
          }
        ],
        "default": null
      },
      {
        "name": "target",
        "type": [
          "null",
          {
            "namespace": "com.m1.m2.m3",
            "type": "record",
            "name": "Target",
            "fields": [
              { "name": "viewType", "type": ["null", "string"], "default": null },
              { "name": "programId", "type": ["null", "string"], "default": null },
              ...
              {
                "name": "extensions",
                "type": ["null", { "type": "map", "values": "string" }],
                "default": null
              }
            ]
          }
        ],
        "default": null
      },
      {
        "name": "usageContext",
        "type": [
          "null",
          {
            "namespace": "com.m1.m2.m3",
            "type": "record",
            "name": "ContextObject",
            "fields": [
              ...
              {
                "name": "extensions",
                "type": ["null", { "type": "map", "values": "string" }],
                "default": null
              }
            ]
          }
        ],
        "default": null
      }
    ]
  }"
}
```
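One observation that may help (this is an assumption on my part, not something confirmed above): the "string" / "map" / "com.m1.m2.m3.Actor" wrappers exist only in the JSON encoding of the union and map types; once the schema_registry decoder has produced a GenericRecord, the flattener should see the resolved values directly. Under that assumption, the path expressions would not include the union branch names, for example:

```
"flattenSpec": {
  "useFieldDiscovery": true,
  "fields": [
    { "type": "path", "name": "userbrowser.name", "expr": "$.userBrowser.name" },
    { "type": "path", "name": "actor.districtRefId", "expr": "$.actor.districtRefId" },
    { "type": "path", "name": "target.extensions.programId", "expr": "$.target.extensions.programId" }
  ]
}
```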

Hi Guys,

Can I request you to help me here?

Best Regards

Varaga

Hi guys,

How do I connect the Druid DB to Tableau?

Have you tried to connect Druid with Grafana (as a SQL data source, since Grafana doesn't support Druid well as a native data source)?

Thank you so much

For the spec, as suggested earlier, shouldn't it use AvroStreamInputRowParser for "avro_stream"? I found in one of the stack traces that a MapBasedInputRow is used instead. Is this valid?

io.druid.data.input.MapBasedInputRow

Hi Jihoon, Gian, All,
I dug deeper into the code around this issue, and it looks like the Avro map is decoded fine and the HashMap is loaded with the properties; however, HashMap#get(key) returns null while the actual value is in there. I enabled some debugging in the code, which I have shared below.
The map's toString() prints out the keys and the values; however, #get returns null. It looks as if the map is rehashed somewhere between deserialization and path evaluation!
The log below highlights this. Essentially, userBrowser is a Map and name is a key. GenericAvroJsonProvider#isMap recognizes it as a Map, yet GenericAvroJsonProvider#getMapValue returns null, although toString() prints out the value for it.

```
2018-06-13T16:26:33,140 DEBUG [task-runner-0-priority-0] com.jayway.jsonpath.internal.path.CompiledPath - Evaluating path: $['userBrowser']['name']
2018-06-13T16:26:33,140 DEBUG [task-runner-0-priority-0] io.druid.data.input.avro.GenericAvroJsonProvider - isMap, object instance of Map: false, object instance of Generic Record: true
2018-06-13T16:26:33,140 DEBUG [task-runner-0-priority-0] io.druid.data.input.avro.GenericAvroJsonProvider - getMapValue, object instance of Generic Record, Object: {"env": "int", "id": "[Masked]",... "userBrowser": {"name": "Chrome", "category": "pc", "os": "Windows 7", "version": "65.0.3325.181", "vendor": "Google", "os_version": "NT 6.1"}, "userAgent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"...}},
value returned for key: {name=Chrome, category=pc, os=Windows 7, version=65.0.3325.181, vendor=Google, os_version=NT 6.1}
key: userBrowser
2018-06-13T16:26:33,140 DEBUG [task-runner-0-priority-0] io.druid.data.input.avro.GenericAvroJsonProvider - isMap, object instance of Map: true, object instance of Generic Record: false

2018-06-13T16:26:33,140 DEBUG [task-runner-0-priority-0] io.druid.data.input.avro.GenericAvroJsonProvider - getMapValue, object instance of Map, Object: {name=Chrome, category=pc, os=Windows 7, version=65.0.3325.181, vendor=Google, os_version=NT 6.1}, obj returned for key: null key: name

2018-06-13T16:26:33,140 DEBUG [task-runner-0-priority-0] io.druid.data.input.avro.AvroFlattenerMaker - transformValue: field: null,
```

Can you guys spot where this is rehashed?
I can contribute if there are pointers!

Varaga

There's a bug in the GenericAvroJsonProvider class. #getMapValue() searches the map from the GenericRecord using a String key, while the keys in that map are Avro Utf8 objects.
I have fixed it and tested it; it works fine.

Will contribute it.
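For anyone hitting the same issue, here is a minimal sketch of the kind of lookup fix described above (not the exact Druid patch; the class and method names are illustrative):

```java
import java.util.Map;

import org.apache.avro.util.Utf8;

public class AvroMapLookup
{
  /**
   * Avro decodes map keys as org.apache.avro.util.Utf8, so Map#get with a
   * java.lang.String key returns null even though toString() shows the entry.
   * Try the String key first, then fall back to its Utf8 form.
   */
  public static Object getMapValue(final Map<?, ?> map, final String key)
  {
    if (map.containsKey(key)) {
      return map.get(key);
    }
    return map.get(new Utf8(key));
  }
}
```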

Issue Registered: https://github.com/druid-io/druid/issues/5884