Custom InputRowParser

Hi,

I have logs with “,”-separated values. One value is itself a set of “&”-separated key-value pairs (a URL query string: name=value&name=value…).

My understanding is that the TSV parser is a function String -> Map<K,V>, where K=String and V=String for simple values, or V=Iterable for list values.

However, I need String -> Map<String,V>, where V=Map<String,String> for the URL field (and String for the others).

I can write my own custom parser / InputRowParser, but my question is how to do it so that Druid transparently interprets my values correctly and I can query them.

In fact, I could also flatten the maps above into a single level (assuming the URL param names do not collide with the field names), dynamically injecting the param names into the field names.
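
Roughly what I mean by flattening (a quick sketch; the names are made up):

import java.util.LinkedHashMap;
import java.util.Map;

public final class QueryStringFlattener {
  // "page=home&ref=twitter" with fieldName "url" -> {url_page=home, url_ref=twitter};
  // prefixing with the field name is what keeps param names from colliding with real fields
  public static Map<String, String> flatten(String fieldName, String query) {
    Map<String, String> flat = new LinkedHashMap<>();
    for (String pair : query.split("&")) {
      String[] kv = pair.split("=", 2);
      flat.put(fieldName + "_" + kv[0], kv.length > 1 ? kv[1] : "");
    }
    return flat;
  }
}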

Please advise,

Nicu

Hi again,

So I wrote a Parser<String, Object> implementation, a decorator on top of DelimitedParser.

I am not clear on how to deploy this to Druid. I read in other threads that the way to do it is with Guice modules. Is there an example of custom parsing (or at least custom metrics, aggregators, etc.)?

My first attempt, surely incorrect, tries to map the JSON like this:


import com.fasterxml.jackson.annotation.JsonSubTypes;
import com.fasterxml.jackson.annotation.JsonTypeInfo;

@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "format", defaultImpl = DelimitedParseSpec.class)
@JsonSubTypes(value = {
        @JsonSubTypes.Type(name = "tsv_with_url", value = UrlDelimitedParseSpec.class)
})
public class UrlDelimitedParseSpec extends DelimitedParseSpec {
    // makeParser() will return a custom parser that delegates to DelimitedParser,
    // except for getFieldNames() and parse(), which replace my url field with a set of fields
}
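
For context, the parser behind makeParser is roughly this (a simplified sketch of what I wrote, assuming the com.metamx.common.parsers.Parser interface):

import com.metamx.common.parsers.Parser;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UrlAwareParser implements Parser<String, Object> {
  private final Parser<String, Object> delegate; // e.g. a DelimitedParser
  private final String urlField;

  public UrlAwareParser(Parser<String, Object> delegate, String urlField) {
    this.delegate = delegate;
    this.urlField = urlField;
  }

  @Override
  public Map<String, Object> parse(String input) {
    Map<String, Object> row = new LinkedHashMap<>(delegate.parse(input));
    Object url = row.remove(urlField);
    if (url != null) {
      // same flattening as in my first message: "a=1&b=2" -> url_a=1, url_b=2
      for (String pair : url.toString().split("&")) {
        String[] kv = pair.split("=", 2);
        row.put(urlField + "_" + kv[0], kv.length > 1 ? kv[1] : "");
      }
    }
    return row;
  }

  @Override
  public void setFieldNames(Iterable<String> fieldNames) {
    delegate.setFieldNames(fieldNames);
  }

  @Override
  public List<String> getFieldNames() {
    // the real version would also advertise the injected URL fields when known up front
    return new ArrayList<>(delegate.getFieldNames());
  }
}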

Usually people post in druid-dev when they should be asking in druid-user, but this one probably does belong in druid-dev.

In general, ETL is not a strong suit of Druid (specifically the Transform part), and many use cases delegate the formatting and cleaning of data to Hadoop or Spark for batch, and Samza or Storm for streaming (by no means an exhaustive list).

I’d be glad to help clarify anything missing from http://druid.io/docs/latest/development/modules.html if extending Druid is your preferred option, but I have to ask: is there a reason you cannot use another framework to do the bulk of the ETL work and get the data into a nice, clean format?
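
If you do go the extension route, the core of that doc boils down to a DruidModule that registers your Jackson subtype, roughly like this (a sketch; the class and module names are illustrative):

import com.fasterxml.jackson.databind.Module;
import com.fasterxml.jackson.databind.jsontype.NamedType;
import com.fasterxml.jackson.databind.module.SimpleModule;
import com.google.common.collect.ImmutableList;
import com.google.inject.Binder;
import io.druid.initialization.DruidModule;

import java.util.List;

public class UrlParseSpecDruidModule implements DruidModule {
  @Override
  public List<? extends Module> getJacksonModules() {
    // teaches Jackson that "format": "tsv_with_url" means UrlDelimitedParseSpec
    return ImmutableList.of(
        new SimpleModule("UrlParseSpecModule")
            .registerSubtypes(new NamedType(UrlDelimitedParseSpec.class, "tsv_with_url"))
    );
  }

  @Override
  public void configure(Binder binder) {
    // nothing to bind here; registering the Jackson subtype is enough
  }
}

You then list the module’s fully qualified class name in a META-INF/services/io.druid.initialization.DruidModule file in your jar and add the jar via the druid.extensions configuration; after that, "format": "tsv_with_url" should resolve to your ParseSpec.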

BTW, this PR might be interesting: https://github.com/metamx/java-util/pull/36

It adds the ability to write your own custom parser with a regex or JavaScript.

The performance won’t be great, but it is a fast way of getting started and loading your own data.
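
For a taste of what that enables once it is wired into an ingestion spec, a JavaScript-based parseSpec would look something like this (illustrative only; the function body and field names are made up):

"parser": {
  "type": "string",
  "parseSpec": {
    "format": "javascript",
    "function": "function(line) { var parts = line.split(','); return {'timestamp': parts[0], 'user': parts[1], 'url': parts[2]}; }",
    "timestampSpec": { "column": "timestamp", "format": "iso" },
    "dimensionsSpec": { "dimensions": ["user", "url"] }
  }
}

You could even split your URL field into prefixed keys inside that function, which would sidestep the custom Java extension entirely.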