Druid 'Tasks' UI

Hi all, the more we play with Druid, the more concerned I am about its stability, and the more questions I have. (Druid 0.15.0)

Many times when using the UI to check task status, the Tasks pane appears unresponsive or takes a really long time to come back…

(Note: calling the API to get task status can also be slow…)
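For reference, this is roughly how we're timing the task-status call. It's just a sketch: the host/port assume the default router setup (port 8888 with the management proxy enabled), so adjust for your cluster or point it at the Overlord directly.

```python
# Rough timing sketch for the task-status call (host/port are assumptions
# for a default router setup; point it at the Overlord directly if needed).
import time
import requests

ROUTER = "http://localhost:8888"  # assumed default router host/port

start = time.time()
resp = requests.get(ROUTER + "/druid/indexer/v1/runningTasks", timeout=600)
elapsed = time.time() - start
print("HTTP %s in %.1fs, %d running tasks"
      % (resp.status_code, elapsed, len(resp.json())))
```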

What is happening here, and what does this mean?

As a test, I killed all the data nodes.

The UI immediately came back…showing 1 task running (odd, since I had just killed all the data nodes).

I hit reload in the UI, and it spins again for a while, eventually coming back with the same result. (Note: when we sent a get-status API call, it was also 'hung' until the exact moment the UI responded.)

Now if I query the meta-db, I do see 2 'active' tasks…
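(For completeness, this is roughly the query we run against the metadata store. A sketch only: it assumes a MySQL metadata store, the default `druid_` table prefix, and the pymysql driver; credentials are placeholders.)

```python
# Sketch of the metadata query (assumes MySQL metadata store, default
# "druid_" table prefix, and pymysql installed; credentials are placeholders).
import pymysql

conn = pymysql.connect(host="metadata-db-host", user="druid",
                       password="druid", database="druid")
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, datasource, created_date FROM druid_tasks WHERE active = 1"
        )
        for task_id, datasource, created_date in cur.fetchall():
            print(task_id, datasource, created_date)
finally:
    conn.close()
```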

If I go to the 'Load data' pane while in this state, it takes me to the 'Connect' step.

It shows in the 'Connect' pane:

Error: Failed to sample data: java.net.SocketException: Too many open files

(Admittedly, we have the default of 1024.) I guess we can increase this; are there any recommendations on open file limits?
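For anyone curious, here's how we check what limit a running Druid process actually has. A small sketch assuming Linux /proc; pass the PID of a historical or middleManager.

```python
# Check the open-files limit of a running process via /proc (Linux only).
# Usage: python check_nofile.py <druid-process-pid>
import sys

pid = sys.argv[1]
with open("/proc/%s/limits" % pid) as f:
    for line in f:
        if line.startswith("Max open files"):
            print(line.strip())
```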

This is essentially a test cluster at the moment, running very few ingest tasks (usually just 1 at a time!) and essentially no queries of interest yet.

Anyway, a few questions about the UI/system. Our setup:

3 Master Nodes

3 Query Nodes

5 Data Nodes.

Any recommendations on key configs we should look at that could be causing this behavior? Is there a known open-files leak?

Hey Dan,

It’s hard to tell from your description what is wrong, but some things I would check include:

  1. Is the metadata store replying promptly to queries? (A slow query log, if it has one, might help figure this one out)

  2. Can the servers all reach each other on all necessary ports?

  3. The open files error is definitely not good and may very well be the cause of your problems. I usually set this limit to 500000, and I set vm.max_map_count to 500000 as well. By the way, the main thing that typically causes large numbers of open files is having a lot of segments, so try double-checking that too (a quick way to count segments per datasource is sketched after this list). If you have too many segments, it could be due to using a segmentGranularity that is too fine for your dataset (i.e. HOUR when you have a smallish dataset spanning decades).

  4. Other than the open files thing, are there any other errors showing up in the logs?
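For point 3, a quick way to double-check segment counts per datasource is through Druid SQL against sys.segments. A minimal sketch: it assumes the query goes through the router on its default port 8888 (which routes SQL to a broker) and that the sys schema is available, which it is by default on 0.15.

```python
# Minimal sketch: count segments per datasource via Druid SQL (sys.segments).
# Assumes the router on its default port 8888 routes SQL; use a broker URL otherwise.
import requests

SQL = """
SELECT "datasource", COUNT(*) AS num_segments
FROM sys.segments
GROUP BY "datasource"
ORDER BY num_segments DESC
"""

resp = requests.post("http://localhost:8888/druid/v2/sql",
                     json={"query": SQL}, timeout=60)
resp.raise_for_status()
for row in resp.json():
    print(row["datasource"], row["num_segments"])
```

If a single datasource shows tens of thousands of small segments, that is a strong hint the segmentGranularity is too fine, as mentioned above.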

Gian

Hi Dan,

I would also like to invite you to join the #druid channel on the ASF Slack: https://druid.apache.org/community/join-slack

It is a good place to figure out these kinds of cluster configuration issues.

Please feel free to @ me in that channel.

Best regards,

Vadim

Sweet, I was unaware of the Slack channel; I will likely connect up tomorrow.