"GC Overhead limit exceeded" while realtime ingesting

Hi,
I was playing around to see how fast a cluster would respond to queries while ingesting realtime events from Kafka. I pushed 10 million events onto a Kafka topic, mykafkatopic, and submitted a realtime task [realtimetask.json] to the indexing service.

I kept querying for the number of rows ingested using the attached [topn.json] query.
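
For reference, the query was roughly of the following shape (the datasource and dimension names here are just placeholders, not the exact contents of the attached topn.json, and it assumes a "count" metric was defined at ingestion time):

  {
    "queryType": "topN",
    "dataSource": "mydatasource",
    "dimension": "dim1",
    "metric": "rows",
    "threshold": 10,
    "granularity": "all",
    "aggregations": [ { "type": "longSum", "name": "rows", "fieldName": "count" } ],
    "intervals": [ "2016-02-11T00:00:00Z/2016-02-12T00:00:00Z" ]
  }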

As time passed, the query response time kept increasing until I got:

"error" : "Failure getting results from[http://myserver:8100/druid/v2/] because of [org.jboss.netty.channel.ChannelException: Channel disconnected]"

After this, all subsequent queries returned only data from previous ingestions, meaning all of this new data was rejected.

In the task logs I found a ton of java.lang.OutOfMemoryError: GC overhead limit exceeded errors.

I am attaching the full (relevant) log [crashlogsOverlord.log] for better insight.

The events on kafka are of the following type:

{"timestamp":"2016-02-11T16:53:24Z","pk":"1", "dim1":"d1val","dim2":"d2val","dim3":"d3val","dim4":"d4val","dim5":"d5val","dim6":"d6val","dim7":"NA","dim8":"NA","dim9":"NA"}

I am currently running all of this (ZooKeeper, PostgreSQL, broker, coordinator, historical, indexing service, Kafka) on a single SSD machine with 4 cores and 8 GB RAM.

Can someone please let me know what the correct configuration for such a setup should be?

Any help would be appreciated.

Thanks,

Saurabh

crashlogsOverlord.log (12.3 KB)

realtimetask.json (1.8 KB)

topn.json (289 Bytes)

Hi Saurabh,

The slowness seems to have been caused by the GC issues, and eventually the task throws an OutOfMemoryError.

FWIW, the realtime index task aggregates the events in JVM heap memory.

When the maxRowsInMemory limit is reached, or when the intermediatePersistPeriod elapses, a small segment is created and persisted to disk.

In your case, with 10M events and maxRowsInMemory set to 500K, the index task will do ~20 intermediate persists.

At the end of the hour, all the intermediate segments are merged into a single one. This merge step also needs some JVM memory and can cause an OOME.
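
Both of these knobs live in the tuningConfig of the realtime index task; the snippet below is only a sketch, and the values are illustrative rather than a recommendation for your setup:

  "tuningConfig": {
    "type": "realtime",
    "maxRowsInMemory": 500000,
    "intermediatePersistPeriod": "PT10M",
    "windowPeriod": "PT10M"
  }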

You can try these tunings to start with:

  1. Increase the JVM heap memory.

  2. Create multiple shards - it's recommended to have ~5M events per shard, so you can start by creating 2 shards for your data (see sharding at http://druid.io/docs/latest/ingestion/realtime-ingestion.html and the sketch below).
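
(For point 1, when the task is launched by the indexing service, the peon heap is typically controlled via druid.indexer.runner.javaOpts, e.g. "-server -Xmx2g", on the node that forks the tasks.)

As a rough sketch of point 2, sharding means submitting one realtime task per shard, each with its own linear shardSpec in its tuningConfig (the partition numbers below are only illustrative):

  Task 1: "shardSpec": { "type": "linear", "partitionNum": 0 }
  Task 2: "shardSpec": { "type": "linear", "partitionNum": 1 }

If both tasks consume the same Kafka topic with the same consumer group.id, the high-level consumer should split the topic's partitions between them, so each task ingests a disjoint subset of the events.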

Friday, February 12, 2016 at 5:50:09 PM UTC+5:30, Saurabh Gupta wrote:

I’ve seen the same thing when I poured a day's worth of events into a realtime task with hour granularity. Currently maxRowsInMemory is applied on a per-sink basis, and if the row count in an hour does not exceed maxRowsInMemory, it'll throw OOM, since rows pile up across sinks without being persisted. You can set a smaller maxRowsInMemory or intermediatePersistPeriod, or use https://github.com/druid-io/druid/pull/2459.

On Friday, February 12, 2016 at 11:00:04 PM UTC+9, Nishant Bangarwa wrote:

Thank you guys. Really appreciate the prompt and helpful responses in the group.
My Java heap size for the overlord node is already 2g. (I don't think any increase on my puny machine would be justified.)
I got it to work by reducing the maxRowsInMemory count to 100K. @nav: I hope your PR gets pulled into 0.9 (rc2?).

For production, though, I like the idea of creating shards for a faster ingestion rate.