Parse access logs wrapped in docker logs (json)

This is my first attempt at data ingestion with Druid.

I get access logs (Apache httpd-like) wrapped in a JSON document by the Docker log driver:

{
  "date": "2021-11-13T05:55:03.000000Z",
  "source": "stdout",
  "log": " - - [13/Nov/2021:05:55:03 +0000] \"GET /img/pics/event_creation-480w-1024w.webp HTTP/2.0\" 200 91570 \"\" \"Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75\" 143954 \"prod@docker\" \"\" 15ms",
  "container_id": "",
  "container_name": "/traefik_rp_1"
}

The JSON wrapper does not carry any interesting information; the payload is in the “log” key.
It is an access log line.

How can I combine transformations to extract every piece of information (time, IP, URL, return code, etc.)?

Thank you!

I’m surprised that there is no straightforward solution to this use case.

Access logs in the Docker JSON log format do not seem very exotic to me.

My current solution uses jq + awk in a shell script and is only a few lines long. I also managed to do it with goaccess, but I’d really like to do it with Druid …

Could you do it with a RegEx transform?
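One way to prototype such a pattern before wiring it into a transform is to test it against the sample line outside Druid. The sketch below is Python, not Druid configuration. The group names and the `parse_access_line` helper are made up for illustration, and the pattern only covers the leading fields of the sample line (up to the user agent). Druid runs on Java, so the syntax would likely need adjusting there; Python's `(?P<name>...)` named groups in particular are written differently in Java regex.

```python
import re

# "log" value from the post after JSON decoding (the client IP is blank there).
LOG = (
    ' - - [13/Nov/2021:05:55:03 +0000] '
    '"GET /img/pics/event_creation-480w-1024w.webp HTTP/2.0" 200 91570 '
    '"" "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) HeadlessChrome/87.0.4280.141 Safari/537.36 '
    'Edg/87.0.664.75" 143954 "prod@docker" "" 15ms'
)

# Hypothetical pattern: host, ident, user, [time], "request", status, bytes,
# "referer", "user agent". Trailing fields are left unmatched on purpose.
ACCESS_RE = re.compile(
    r'^(?P<host>\S*) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_access_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = ACCESS_RE.match(line)
    return match.groupdict() if match else None


fields = parse_access_line(LOG)
```

Each named group here would correspond to one extracted column; `fields["status"]` and `fields["path"]` hold the return code and URL the question asks about.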

I don’t know.

I tried the split function on the “log” column, but it produced unexpected results.
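The post does not say what went wrong, so the following is only a guess at one common pitfall: splitting this kind of line on spaces also cuts through the bracketed timestamp and the quoted request, so the pieces no longer line up with the logical fields. The sketch uses plain Python string splitting, not Druid's split function, on a shortened line shaped like the one above.

```python
# Shortened, illustrative line; the real one also carries a long user agent.
line = '- - [13/Nov/2021:05:55:03 +0000] "GET /img/a.webp HTTP/2.0" 200 91570'

# Splitting on a single space treats every space as a field boundary,
# including the ones inside [...] and "...".
tokens = line.split(" ")

for index, token in enumerate(tokens):
    print(index, token)
```

Six logical fields (ident, user, time, request, status, bytes) come out as nine tokens, with the timestamp split in two and the request in three.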

In any case, that would be the second step. The first is to unwrap the content from the JSON. I don’t need the envelope.
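To make that first step concrete, here is a minimal sketch of the unwrapping done outside Druid, assuming one JSON document per line. The `unwrap_log_lines` helper and the shortened sample record are hypothetical; only the key names come from the post.

```python
import json


def unwrap_log_lines(lines):
    """Yield only the "log" payload of each JSON-wrapped line."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        record = json.loads(raw)
        payload = record.get("log")
        if payload is not None:
            yield payload  # the envelope (date, source, ...) is dropped


# Shortened record shaped like the one in the post.
sample = [
    '{"date": "2021-11-13T05:55:03.000000Z", "source": "stdout", '
    '"log": " - - [13/Nov/2021:05:55:03 +0000] \\"GET / HTTP/2.0\\" 200 91570", '
    '"container_name": "/traefik_rp_1"}',
]

payloads = list(unwrap_log_lines(sample))
```

The result is a plain access log line with its quotes unescaped, ready for the second, parsing step.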

Understood – yeah it’s like you want to get the log out of the JSON and then throw it at the standard Druid parser… I think you were thinking you could have a path to get log, and then a series of transform expressions that will extract each column?

Exactly: the payload is at the ‘.log’ JSON path, and it is an access log line in the common log format (CLF, %d\t%t\t%h\t%m\t%U\t%H\t%R\t%u\t%v\t%s\t%b\t%L) with a timestamp and so on.
This is what I want to ingest.
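Since the timestamp inside the payload is not ISO 8601, it usually needs converting or an explicit format on ingestion. As a rough illustration outside Druid, the bracketed time from the sample line can be parsed in Python like this; the `clf_time_to_iso` helper is a made-up name, and the format string is simply what matches the sample in this thread.

```python
from datetime import datetime, timezone


def clf_time_to_iso(value):
    """Convert e.g. '13/Nov/2021:05:55:03 +0000' to an ISO 8601 UTC string."""
    # %b expects English month abbreviations under the default C locale.
    parsed = datetime.strptime(value, "%d/%b/%Y:%H:%M:%S %z")
    return parsed.astimezone(timezone.utc).isoformat()


iso = clf_time_to_iso("13/Nov/2021:05:55:03 +0000")
print(iso)  # 2021-11-13T05:55:03+00:00
```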