Hadoop indexer log size

The log size of a Hadoop index task is always enormous in my environment. Even with only a few megabytes of data to index, the index log exceeds 1 GB. The vast majority of the space in the log appears to be taken up by a giant JSON structure that details every possible segment in the interval specified in the task. If the task operates on a large number of segments (for example, one per minute over 3 months), the log output is huge.

While this may not be a very common use case, I would like to know if there is a way to quiet this logging a bit without resorting to per-class log levels. I think there is some potentially useful info in some of these classes; I just don’t think outputting the massive JSON structure every time is useful.

Hi Taylor, how many segments are you creating, and how large is your interval for ingestion? Perhaps a small PR that would help in this situation would be to detect the size of the intervals field for indexing and print different info if the list is too long, as in the sketch below.
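
Something along these lines, as a rough sketch; the class, field, and threshold names are placeholders, not the actual Druid internals:

    import java.util.List;

    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.joda.time.Interval;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Hypothetical guard around the existing spec logging; everything here
    // is illustrative rather than copied from the Druid codebase.
    class SpecLogger
    {
      private static final Logger log = LoggerFactory.getLogger(SpecLogger.class);
      private static final int MAX_LOGGED_INTERVALS = 100; // arbitrary cutoff

      private final ObjectMapper jsonMapper = new ObjectMapper();

      void logSpec(Object spec, List<Interval> intervals) throws Exception
      {
        if (intervals.size() > MAX_LOGGED_INTERVALS) {
          // Too many intervals: log a one-line summary instead of the full JSON blob.
          log.info(
              "Task spec covers {} intervals ({} through {}); full spec output suppressed",
              intervals.size(),
              intervals.get(0),
              intervals.get(intervals.size() - 1)
          );
        } else {
          log.info("Task spec: {}", jsonMapper.writeValueAsString(spec));
        }
      }
    }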

Hey Taylor,

The expected case is that the JSONification of the task is short and useful to operators, so it’s included by default. You could suppress it with an appropriate log4j2 config.
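
For example, something along these lines in log4j2.xml; the logger name below is a guess, so check your task log to see which class actually emits the giant JSON structure and use that name instead:

    <?xml version="1.0" encoding="UTF-8"?>
    <Configuration status="WARN">
      <Appenders>
        <Console name="Console" target="SYSTEM_OUT">
          <PatternLayout pattern="%d{ISO8601} %p [%t] %c - %m%n"/>
        </Console>
      </Appenders>
      <Loggers>
        <!-- Raise the level for the chatty class only; the logger name is a
             placeholder, copy the real one from your logs. -->
        <Logger name="io.druid.indexer.HadoopDruidIndexerConfig" level="warn" additivity="false">
          <AppenderRef ref="Console"/>
        </Logger>
        <Root level="info">
          <AppenderRef ref="Console"/>
        </Root>
      </Loggers>
    </Configuration>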

But I also wonder if your segmentGranularity is too fine. Using “minute” is not typical; usually people go with “hour” or “day”. You can still store data at finer granularity within a segment by setting queryGranularity.
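
For instance, a granularitySpec like the one below keeps per-minute data while cutting the number of segments (and entries in that logged JSON structure) by a factor of 60; the interval is just an illustration matching your 3-month example:

    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "MINUTE",
      "intervals": ["2015-01-01/2015-04-01"]
    }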