[druid-user] How to use druid-influx-extensions

Hi,

I have enabled druid-influx-extensions in Druid, and I can see it in the status panel of the Druid UI. But when I configure an ingestion spec using the Kafka plugin, the available format list doesn’t have influx/line protocol as an option. The configuration given in the Druid documentation is:

"parser": {
  "type": "string",
  "parseSpec": {
    "format": "influx",
    "timestampSpec": {
      "column": "__ts",
      "format": "millis"
    },
    "dimensionsSpec": {
      "dimensionExclusions": ["__ts"]
    },
    "whitelistMeasurements": ["cpu"]
  }
}

My doubt is where to add this in the ingestion spec; there is no key called parser in the ingestion spec. Can someone guide me?

Hey! So, “parser” is a different way of bringing data into Druid. It was replaced in 0.17 (if I remember right) with the new way, “inputFormat”.

That’s why the console, for example, uses inputFormat for everything nowadays.

There’s more about Parser here, from which you should be able to work out where it goes.
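
In case it helps, here is a rough sketch of where "parser" sits in a legacy (pre-inputFormat) Kafka supervisor spec. The datasource name, topic, and broker address below are just placeholders, so check the exact layout against the docs for your Druid version:

{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "influx-metrics",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "influx",
        "timestampSpec": { "column": "__ts", "format": "millis" },
        "dimensionsSpec": { "dimensionExclusions": ["__ts"] },
        "whitelistMeasurements": ["cpu"]
      }
    },
    "granularitySpec": { "segmentGranularity": "HOUR", "queryGranularity": "NONE" }
  },
  "ioConfig": {
    "topic": "influx-lines",
    "consumerProperties": { "bootstrap.servers": "localhost:9092" }
  },
  "tuningConfig": { "type": "kafka" }
}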

Hi Peter,

Thank you for your response. In my use case, I was trying to ingest line protocol through Kafka. In Druid I enabled “druid-kafka-indexing-service” and “druid-influx-extensions” for real-time ingestion and parsing of the data. I tried setting the parser to influx, as well as setting the “inputFormat” type to influx, but neither of them worked for me. I guess “druid-influx-extensions” is not supported as an “inputFormat”, and it is better to use inputFormat with Kafka indexing, since parser will be deprecated with the kafka-indexing-service.
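
For reference, a minimal new-style Kafka supervisor spec using the JSON inputFormat looks roughly like this (the datasource, topic, broker address, and column names are only placeholders, not my real config):

{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "metrics",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensionExclusions": ["timestamp"] }
    },
    "ioConfig": {
      "topic": "metrics",
      "inputFormat": { "type": "json" },
      "consumerProperties": { "bootstrap.servers": "localhost:9092" }
    },
    "tuningConfig": { "type": "kafka" }
  }
}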

I tried to ingest the data in JSON format as well, but I have a couple of queries regarding it.

  1. So there is no logical partitioning of the data in Druid by measurements (in InfluxDB terms). Is it OK to have hundreds or thousands of columns in one datasource? Will it impact Druid’s query performance?

  2. The supervisor status turned into “UNHEALTHY_TASKS”, with tasks starting to fail with an error in the logs saying “too many files open …”. I tried to increase the maximum open-file limit to 65535 using the “ulimit -n” command, but it was still failing. What is the optimum number to set for running Druid?

I am using a single-server deployment in Docker.

Hi Shubham,

  1. A colleague of mine has said to expect performance impacts once segments or datasources get into the thousands, so it’s best to practice Compaction (there’s a rough example after this list). Perhaps that’s something to explore?
  2. I came across this discussion regarding UNHEALTHY_TASKS, where grepping the Overlord log for failed task IDs was proposed as a way of pinpointing the causes of task failures. Regarding the “too many files open” error, I came across this Imply article, which might be the same one you already read about the “ulimit -n” command. At the bottom of the article there’s a further reference to a Red Hat article, which might be of some use to you.
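
On the Compaction point, in recent Druid versions a manual compaction task looks roughly like the sketch below; the datasource name and interval are only examples, and the compaction docs for your version will have the authoritative form:

{
  "type": "compact",
  "dataSource": "metrics",
  "ioConfig": {
    "type": "compact",
    "inputSpec": {
      "type": "interval",
      "interval": "2021-01-01/2021-02-01"
    }
  }
}

Auto-compaction can also be switched on per datasource from the console, which may be easier than submitting tasks by hand.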
Best,

Mark

Hi Mark,

Thank you for your response. I am new to Druid, so I might be asking some silly questions as well.

  1. Yeah, Compaction is something that I have to explore.

  2. I will see how to increase the limit and try once again.

  3. A couple of requests again: is it possible to send an invitation to the Druid Slack channel?

  4. Any example of schema-less data ingestion? How do I declare metrics in a schema-less data model? I read about dimensionsSpec: if we keep it empty, all keys will be treated as dimensions except the ones mentioned in dimensionExclusions and metricsSpec. But is it possible to declare metrics as well in case we don’t know the columns already?

Thanks,

Shubham

Hi Shubham,

You’re welcome! Ask away, you’re part of the community.

  1. You’ve been invited to the Slack channel.
  2. This might take care of your schema-less questions and your dimensionsSpec/metricsSpec question; there’s also a small sketch below.
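
As far as I know, schema-less discovery only applies to dimensions; metrics can’t be auto-discovered, because each aggregator in metricsSpec needs an explicit name and fieldName. A rough dataSchema fragment (column names are just examples) might look like:

"dataSchema": {
  "dataSource": "metrics",
  "timestampSpec": { "column": "timestamp", "format": "iso" },
  "dimensionsSpec": {
    "dimensions": [],
    "dimensionExclusions": ["timestamp", "value"]
  },
  "metricsSpec": [
    { "type": "count", "name": "count" },
    { "type": "doubleSum", "name": "value_sum", "fieldName": "value" }
  ]
}

With an empty dimensions list, every other input field is picked up as a dimension; listing the metric input column ("value" here) in dimensionExclusions makes sure it isn’t also ingested as a dimension (I believe Druid excludes aggregator input columns automatically, but being explicit doesn’t hurt).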

Best,

Mark