dimensionsSpec regex

Hi - I am trying to modify the value of one of my dimension columns during ingestion, using a regex in the dimensionsSpec as shown below.

{
    "type": "regex",
    "name": "testDim",
    "expr": "(\\w+)",
    "replaceMissingValue" : true,
    "replaceMissingValueWith": "1"
}

The testDim dimension holds a string value that I want to replace with "1". Is this the right way to do it? Can someone please help me? Maybe the expr that I have specified is not correct.

Thanks

Are you on 0.12.0?

If so, this undocumented PR may be useful for you: https://github.com/druid-io/druid/pull/4890, along with the expression documentation: http://druid.io/docs/latest/misc/math-expr.html

The “transformSpec” goes in the “dataSchema” of your ingestion spec, on the same nesting level as “dataSource” and “parser”, e.g.:


"transformSpec": {
    "transforms": [
        {
            "type": "expression",
            "name": "eventTime",
            "expression": "timestamp_format(eventTime, yyyy-MM-dd'T'HH:mm:ss.SSSZ, UTC)"
        }
    ]
}

Where “expression” is an expression suitable for your use case.
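For the original testDim question, a minimal sketch of how this might look. It assumes the expression language in your version provides `if` and `strlen` (please check the math-expr page linked above). The dataSource name is a placeholder, the parser body is abbreviated, and the other dataSchema fields are omitted. A transform whose name matches an input field shadows that field, so here every non-empty testDim value would be ingested as "1":

```json
{
    "dataSchema": {
        "dataSource": "<your_datasource>",
        "parser": { "type": "string" },
        "transformSpec": {
            "transforms": [
                {
                    "type": "expression",
                    "name": "testDim",
                    "expression": "if(strlen(testDim) > 0, '1', testDim)"
                }
            ]
        }
    }
}
```

If every row should get the same value regardless of the input, the expression could probably be reduced to the string literal `'1'`.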

Prior to 0.12.0, there is no mechanism for transforming input values during ingestion.

This looks very interesting to me as well, thanks for posting it.
The PR mentions both “transform” and “filter” functions. Could you give a similar example for a “filter”?

Background: we have large streams that our staging environment cannot fully ingest the way prod can, so we would like to sample the stream and keep only about 1% of all events.

The “filter” in the “transformSpec” takes the same format as the Druid query filters, e.g.:


"transformSpec": {
    "filter" : {
        "type": "selector",
        "dimension" : "<dimension>",
        "value" : "<dimension_value>"
    },
    "transforms": [
        {
            "type": "expression",
            "name": "eventTime",
            "expression": "timestamp_format(eventTime, yyyy-MM-dd'T'HH:mm:ss.SSSZ, UTC)"
        }
    ]
}
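For the 1% sampling case, one possible approach is a regex filter that keeps only events whose ID ends in "00". This is an untested sketch, not a confirmed recipe. It assumes each event carries an ID field whose last two digits are roughly uniformly distributed (the field name `eventId` is hypothetical), and that the regex filter matches anywhere in the value, which is why the pattern is anchored with `$`:

```json
{
    "transformSpec": {
        "filter": {
            "type": "regex",
            "dimension": "eventId",
            "pattern": "00$"
        }
    }
}
```

If no such field exists, a selector filter on any dimension with a known, stable share of traffic would be a cruder alternative.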

Hey Jonathan!

Can the transform output field be used as a timestamp within the schema? I'm trying something like:

{
    "type" : "expression",
    "name" : "eventTime",
    "expression" : "visitStartTime + div(time,1000)"
}

"timestampSpec": {
    "column": "eventTime",
    "format": "posix"
}