[druid-user] Re: What it means for a datasource to be 99% available?

Hi!

I’ve not heard of this before… to dig in a bit, are these metrics coming from the Druid metrics emitters? Or are you gathering them from an API?
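If it helps, this is roughly how I’d pull the availability number straight off the Coordinator, so you can compare it with whatever the emitters / console are showing. Total sketch – the Coordinator host:port is a placeholder – but the loadstatus endpoint does report, per datasource, the percentage of segments that Historicals are actually serving:

    # Sketch: ask the Coordinator what percentage of each datasource's
    # segments are currently being served by Historicals.
    import requests

    COORDINATOR = "http://coordinator.example.com:8081"  # placeholder host:port

    # GET /druid/coordinator/v1/loadstatus -> {datasource: percent loaded}
    resp = requests.get(f"{COORDINATOR}/druid/coordinator/v1/loadstatus", timeout=10)
    resp.raise_for_status()

    for datasource, percent_loaded in sorted(resp.json().items()):
        flag = "" if percent_loaded >= 99.0 else "   <-- below 99%!"
        print(f"{datasource}: {percent_loaded:.2f}% available{flag}")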

Assuming your historicals aren’t actually going down, I wonder whether they’re getting overloaded and becoming unresponsive? (Do they only have 12 CPUs? I notice druid.processing.numThreads=11, which is usually set to cores minus one. Also, as the docs advise, is the sum of druid.broker.http.numConnections across all the Brokers less than 40, your value of druid.server.http.numThreads on the Historicals? Just some wild stabs in the dark…)
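To make that second check concrete: every Druid process exposes its effective runtime properties at /status/properties, so you could add up the Brokers’ connection pools and compare the total with a Historical’s HTTP threads. Very rough sketch – the host lists are placeholders, and the fallback values are just the documented default (20) and the 40 you mentioned:

    # Sketch: compare sum(druid.broker.http.numConnections) across Brokers
    # against druid.server.http.numThreads on a Historical.
    import requests

    BROKERS = ["http://broker1.example.com:8082", "http://broker2.example.com:8082"]
    HISTORICAL = "http://historical1.example.com:8083"

    def get_prop(base_url, key, fallback):
        # /status/properties returns the configuration of the queried process;
        # a property left unset may not appear at all, hence the fallback.
        props = requests.get(f"{base_url}/status/properties", timeout=10).json()
        return int(props.get(key, fallback))

    total_connections = sum(
        get_prop(b, "druid.broker.http.numConnections", 20) for b in BROKERS
    )
    http_threads = get_prop(HISTORICAL, "druid.server.http.numThreads", 40)

    print(f"Sum of Broker connections: {total_connections}")
    print(f"Historical HTTP threads:   {http_threads}")
    if total_connections >= http_threads:
        print("The Brokers alone can saturate this Historical's HTTP threads!")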

OK, so I ended up down a rabbit hole… lol!! And found NOTHING!! Is it Friday yet?!

I think, as 0.12 is three-and-a-half years old now, we may have difficulty finding someone who remembers this issue – if it was an issue – and can tell you whether there was “a concern about the 99% of availability threshold” when that console was written. That legacy console has effectively been superseded since 0.15 by the Druid Console, which is way better.

Is there any hint in the coordinator log at those times when you see a “valley”? Maybe it is getting very busy, or maybe it goes on holiday? And maybe you can correlate with the historical logs to see what they are doing, too?
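Even something as crude as this would do for the correlation – pull out everything a process logged inside one valley and eyeball it next to the Historical’s log for the same window. Sketch only; the log path, timestamp format and window are made up:

    # Sketch: print every log line whose timestamp falls inside one "valley".
    # Adjust the path and the timestamps to match your own log files.
    LOG_FILE = "/var/log/druid/coordinator.log"          # placeholder path
    WINDOW = ("2021-10-01T10:15", "2021-10-01T10:25")    # start/end of a valley

    with open(LOG_FILE, errors="replace") as f:
        for line in f:
            # Druid log lines start with an ISO-style timestamp, so a plain
            # prefix comparison works (continuation lines are skipped).
            stamp = line[:16]
            if WINDOW[0] <= stamp <= WINDOW[1]:
                print(line, end="")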

Are those “valleys” regular, or are they pretty random? For example, does one happen when a batch ingestion kicks off? Or maybe a reindex? Or when you run a compaction job? Something that might create a tonne of new segments / new segment versions?
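One way to answer that without trawling logs is to ask the Overlord what tasks ran around each valley. Another sketch – the host is a placeholder, and the exact fields in the response vary between Druid versions:

    # Sketch: list recently completed tasks so their start times can be
    # lined up against the availability "valleys".
    import requests

    OVERLORD = "http://overlord.example.com:8090"  # placeholder host:port

    tasks = requests.get(f"{OVERLORD}/druid/indexer/v1/completeTasks", timeout=10).json()

    for t in tasks:
        # Enough to spot batch ingestion / reindex / compaction tasks by eye
        print(t.get("createdTime"), t.get("type", "?"), t.get("id"))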

  • Pete

My first thought about that “Exception with one of the sequences!” was that maybe you are running out of merge buffers / disk spill space… a total guess. But I did find this really old post https://groups.google.com/g/druid-user/c/fYpfb4arrOE that might support that theory…

Maybe check this out: GroupBy queries · Apache Druid
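If you want to rule the merge-buffer theory in or out quickly, you could peek at the groupBy-related settings on one of the Historicals – I’d look at druid.processing.numMergeBuffers and druid.query.groupBy.maxOnDiskStorage first (the latter defaults to 0, i.e. no spilling to disk). Same /status/properties trick as above; the host is a placeholder:

    # Sketch: read the groupBy-related settings off a Historical.
    import requests

    HISTORICAL = "http://historical1.example.com:8083"  # placeholder

    props = requests.get(f"{HISTORICAL}/status/properties", timeout=10).json()

    for key in ("druid.processing.numMergeBuffers",
                "druid.query.groupBy.maxOnDiskStorage",
                "druid.processing.buffer.sizeBytes"):
        # A property that isn't set may simply not appear, i.e. default applies
        print(key, "=", props.get(key, "<not set, default applies>"))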

MAAAAAYBE your historical processes are restarting because a query crashes them, which would make your metric “valley” until that historical comes back up again and re-advertises all its segments…?
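If you want to catch that in the act, one cheap trick is to poll the Historical’s loadstatus endpoint – cacheInitialized drops back to false while a restarted Historical reloads its segments, and the request fails outright while the process is down. Sketch; the host and the 30-second interval are arbitrary, and the endpoint may behave differently on very old versions:

    # Sketch: watch a Historical's segment-cache status; a restart shows up
    # as an unreachable process, then cacheInitialized=false until segments
    # are re-advertised. Stop it with Ctrl-C.
    import time
    import requests

    HISTORICAL = "http://historical1.example.com:8083"  # placeholder

    while True:
        try:
            status = requests.get(
                f"{HISTORICAL}/druid/historical/v1/loadstatus", timeout=5
            ).json()
            print(time.strftime("%H:%M:%S"), status)
        except requests.RequestException as err:
            print(time.strftime("%H:%M:%S"), "historical unreachable:", err)
        time.sleep(30)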