Tranquility: Null values in multi-value dimensions

We are lining up our batch segments against our real-time segments (which are produced via Tranquility). When looking at multi-valued fields in Pivot, we noticed that the segments produced by real-time ingestion contain “null” values, whereas the segments produced by our batch processes do not.

If possible, I’d like to eliminate that “null” value from the real-time segments to exactly match what is produced by batch.

For real-time:

Via a Tranquilizer, we send a Map<String,Object> as our event object. Each entry in the map is a dimension. For the multi-valued dimensions, we supply a Set as the value. When an event has no values for such a dimension, I am presently supplying an empty set, which results in null values in Pivot.

I tried omitting the key if the event had no values for that dimension.

I also tried supplying null as the value for that key.

But alas, I still see null values in Pivot.
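
For reference, the three variants described above can be sketched in plain Java roughly as follows. This only illustrates the shape of the event maps. The class name, the helper names, and the dimension names `timestamp` and `DEVICE` are assumptions made for illustration (only `SEGMENT` appears later in this thread), and the call that hands each map to the Tranquilizer is left out.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class EventVariants {
    // Common dimensions; "timestamp" and "DEVICE" are hypothetical names.
    static Map<String, Object> baseEvent() {
        Map<String, Object> event = new HashMap<>();
        event.put("timestamp", "2015-08-01T01:00:00Z");
        event.put("DEVICE", "Desktop");
        return event;
    }

    // Normal case: the multi-valued dimension carries a Set of values.
    static Map<String, Object> withValues(String... values) {
        Map<String, Object> event = baseEvent();
        Set<String> segments = new LinkedHashSet<>(Arrays.asList(values));
        event.put("SEGMENT", segments);
        return event;
    }

    // Variant 1: key present, value is an empty Set.
    static Map<String, Object> withEmptySet() {
        Map<String, Object> event = baseEvent();
        event.put("SEGMENT", Collections.emptySet());
        return event;
    }

    // Variant 2: key omitted entirely.
    static Map<String, Object> withKeyOmitted() {
        return baseEvent();
    }

    // Variant 3: key present, value is null (HashMap permits null values).
    static Map<String, Object> withNullValue() {
        Map<String, Object> event = baseEvent();
        event.put("SEGMENT", null);
        return event;
    }

    public static void main(String[] args) {
        System.out.println(withValues("foo-1", "bar-1"));
        System.out.println(withEmptySet());
        System.out.println(withKeyOmitted());
        System.out.println(withNullValue());
    }
}
```

All three "no value" variants produced the same null entry in Pivot in my tests.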

I started tracking this down in the code and got as far as the sendBatch method on the DruidBeam class in Tranquility, but then it occurred to me that the cause might be the server-side indexing task and its interpretation of the values sent from Tranquility (at which point I thought I should just ask =).

Any help is appreciated,

-brian

Hey Brian,

My guess is that it’s something happening on the Druid side. Which version of Druid are you using?

One possibility is that the way Druid deals with multi-valued dimensions is that if all rows have only a single value or zero values, then rows with zero values are “lifted” to nulls and the column is converted to a single-value dimension. So could it be that some of your realtime segments are just never seeing more than one value per row?

Otherwise, could you try reducing this behavior to a test case (a row that shows up one way when batch indexed and another way when realtime indexed)?

Gian,

I was able to reproduce this with a simple test case.

Using the following data, a datasource created by real-time contains a null for the multi-valued field, whereas the batch process does not.

(values within the multi-valued field are delimited by “|”)

7,2015-08-01 01:00:00,Desktop,foo-1|bar-1|snaz-1,

7,2015-08-01 01:00:00,Desktop,foo-1|bar-1,

7,2015-08-01 01:00:00,Desktop,

7,2015-08-01 01:00:00,Tablet,foo-1|bar-1|snaz-1,

7,2015-08-01 01:00:00,Desktop,foo-1,
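
A minimal Java sketch of how rows in this layout could be turned into event maps for the real-time side. It assumes the fourth comma-separated field is the multi-valued dimension and that an empty field means the row has no values. The class name and the key names `ID`, `timestamp`, and `DEVICE` are hypothetical (the thread does not name those columns); `SEGMENT` is taken from the query output further down.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class RowParser {
    // Parse one comma-separated test row into an event map.
    static Map<String, Object> parse(String line) {
        // A limit of -1 keeps trailing empty fields instead of dropping them.
        String[] fields = line.split(",", -1);

        Map<String, Object> event = new LinkedHashMap<>();
        event.put("ID", fields[0]);        // hypothetical name for the first column
        event.put("timestamp", fields[1]);
        event.put("DEVICE", fields[2]);    // hypothetical name for the third column

        // The multi-valued field is "|"-delimited; empty means no values.
        String multi = fields.length > 3 ? fields[3] : "";
        Set<String> segments = new LinkedHashSet<>();
        if (!multi.isEmpty()) {
            segments.addAll(Arrays.asList(multi.split("\\|")));
        }
        event.put("SEGMENT", segments);
        return event;
    }

    public static void main(String[] args) {
        System.out.println(parse("7,2015-08-01 01:00:00,Desktop,foo-1|bar-1|snaz-1,"));
        System.out.println(parse("7,2015-08-01 01:00:00,Desktop,"));
    }
}
```

With this parsing, the third test row yields an empty set for the multi-valued dimension, which is the case that shows up as null on the real-time side.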

Should I create a ticket/issue so I can attach the test case?

(using Java + Tranquility for RT, and a simple spec file for the batch side)

(I also started diving into the Druid server code to locate the discrepancy, but haven’t pinpointed it yet)

-brian

Gian,

I believe I was able to narrow down the issue further. It appears the null only shows up in the results while the RT task is in flight. Once the task completes, I no longer see the null. Output is shown below. I issued the query twice: once while the RT task was running, and again after it completed.

brianoneill@blu (master):~/queries-> ./query_offers.sh

[ {
  "timestamp" : "2016-03-06T13:31:00.000Z",
  "result" : [ {
    "count" : 4.0,
    "SEGMENT" : "foo-1"
  }, {
    "count" : 3.0,
    "SEGMENT" : "bar-1"
  }, {
    "count" : 2.0,
    "SEGMENT" : "snaz-1"
  }, {
    "count" : 1.0,
    "SEGMENT" : null
  } ]
} ]

brianoneill@blu (master):~/queries-> ./query_offers.sh

[ {
  "timestamp" : "2016-03-06T13:31:00.000Z",
  "result" : [ {
    "count" : 4.0,
    "SEGMENT" : "foo-1"
  }, {
    "count" : 3.0,
    "SEGMENT" : "bar-1"
  }, {
    "count" : 2.0,
    "SEGMENT" : "snaz-1"
  } ]
} ]
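
The counts in both outputs are consistent with the five test rows from my earlier message. The plain-Java tally below is not Druid code, just arithmetic over those rows, and it assumes the query counts rows per SEGMENT value. The class and method names are made up for illustration. Counting the row with no values under a null key reproduces the first output; skipping that row reproduces the second.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SegmentTally {
    // The multi-valued field of each of the five test rows, in order.
    static final List<List<String>> ROWS = Arrays.asList(
            Arrays.asList("foo-1", "bar-1", "snaz-1"),
            Arrays.asList("foo-1", "bar-1"),
            Collections.<String>emptyList(),
            Arrays.asList("foo-1", "bar-1", "snaz-1"),
            Arrays.asList("foo-1"));

    // Count rows per value; optionally count an empty row under a null key.
    static Map<String, Integer> tally(boolean emptyRowAsNull) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> row : ROWS) {
            if (row.isEmpty()) {
                if (emptyRowAsNull) {
                    counts.merge(null, 1, Integer::sum);
                }
                continue;
            }
            for (String value : row) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(tally(true));   // {foo-1=4, bar-1=3, snaz-1=2, null=1}
        System.out.println(tally(false));  // {foo-1=4, bar-1=3, snaz-1=2}
    }
}
```

So the single null in the in-flight result lines up with the one row that has no values for the multi-valued dimension.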

-brian

Gian,

To track this, I created an issue here:

I created a repo that demonstrates the issue here:

-brian