Help with "Failed to persist merged index" error with S3 deep storage

Hello,
Any help with this issue would be appreciated, and I'm hoping there is a chance to retry or correct it. We encountered an error in the indexing logs that, as of now, seems to have lost an entire day's worth of data for us. Hopefully it can be recovered or retried somehow.

We have a Druid cluster with S3-backed deep storage, and realtime data is sent to it via Tranquility. Just in case, we checked whether any files were left on disk on the Tranquility server in the Java temporary directory; there was nothing other than empty folders with names like 1487276182596-0. Not that we expected files to be persisted there in this setup.

In the indexing log for the interval (an entire day) we see the error message below. It says there is no space left on the device, but that does not seem to be true, as the boxes have plenty of memory and disk space. We also cannot find the location of any temporary or interim commits.

Is there any possible way to retry?

This is the full (and only) data written to the indexing log for the interval in S3, and we also see that no segment for the interval was written to S3.

2017-02-24T00:16:34,003 ERROR [abc-2017-02-23T00:00:00.000Z-persist-n-merge] io.druid.segment.realtime.plumber.RealtimePlumber - Failed to persist merged index[abc]: {class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class java.io.IOException, exceptionMessage=No space left on device, interval=2017-02-23T00:00:00.000Z/2017-02-24T00:00:00.000Z}
java.io.IOException: No space left on device
	at java.io.FileOutputStream.writeBytes(Native Method) ~[?:1.8.0_74]
	at java.io.FileOutputStream.write(FileOutputStream.java:326) ~[?:1.8.0_74]
	at com.google.common.io.ByteStreams.copy(ByteStreams.java:179) ~[guava-16.0.1.jar:?]
	at com.google.common.io.ByteSource.copyTo(ByteSource.java:255) ~[guava-16.0.1.jar:?]
	at com.google.common.io.ByteStreams.copy(ByteStreams.java:119) ~[guava-16.0.1.jar:?]
	at io.druid.segment.IndexMerger.makeIndexFiles(IndexMerger.java:873) ~[druid-processing-0.9.1.1.jar:0.9.1.1]
	at io.druid.segment.IndexMerger.merge(IndexMerger.java:423) ~[druid-processing-0.9.1.1.jar:0.9.1.1]
	at io.druid.segment.IndexMerger.mergeQueryableIndex(IndexMerger.java:244) ~[druid-processing-0.9.1.1.jar:0.9.1.1]
	at io.druid.segment.IndexMerger.mergeQueryableIndex(IndexMerger.java:217) ~[druid-processing-0.9.1.1.jar:0.9.1.1]
	at io.druid.segment.realtime.plumber.RealtimePlumber$4.doRun(RealtimePlumber.java:548) [druid-server-0.9.1.1.jar:0.9.1.1]
	at io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:42) [druid-common-0.9.1.1.jar:0.9.1.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_74]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_74]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_74]

2017-02-24T00:16:34,048 ERROR [task-runner-0-priority-0] io.druid.indexing.common.task.RealtimeIndexTask - Failed to finish realtime task: {class=io.druid.indexing.common.task.RealtimeIndexTask, exceptionType=class com.metamx.common.ISE, exceptionMessage=Exception occurred during persist and merge.}
com.metamx.common.ISE: Exception occurred during persist and merge.
	at io.druid.segment.realtime.plumber.RealtimePlumber.finishJob(RealtimePlumber.java:671) ~[druid-server-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexing.common.task.RealtimeIndexTask.run(RealtimeIndexTask.java:405) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_74]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_74]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_74]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_74]

2017-02-24T00:16:34,049 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[RealtimeIndexTask{id=index_realtime_abc_2017-02-23T00:00:00.000Z_0_0, type=index_realtime, dataSource=abc}]
com.metamx.common.ISE: Exception occurred during persist and merge.
	at io.druid.segment.realtime.plumber.RealtimePlumber.finishJob(RealtimePlumber.java:671) ~[druid-server-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexing.common.task.RealtimeIndexTask.run(RealtimeIndexTask.java:405) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_74]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_74]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_74]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_74]

I was not expecting any disk usage, and the Druid instance has a large amount of disk space and 16 GB of memory, so I'm puzzled as to when this disk space is needed, given that our deep storage is S3.
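To see where the space is actually going, it helps to check the free space and, importantly, the filesystem type backing the directories the task writes to: on many Linux distributions /tmp is a RAM-backed tmpfs capped at roughly half of physical memory, so a 16 GB box may only have about 8 GB of /tmp no matter how large its disks are. A minimal Java sketch along those lines (the second path is an assumption; the real one is wherever druid.indexer.task.baseTaskDir points):

import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DiskCheck {
    public static void main(String[] args) throws IOException {
        // Candidate directories an index task may write intermediate segments to.
        // The second path is a guess -- substitute your druid.indexer.task.baseTaskDir.
        String[] candidates = {
                System.getProperty("java.io.tmpdir"),
                "/tmp/persistent/task"
        };
        for (String dir : candidates) {
            Path path = Paths.get(dir);
            if (!Files.exists(path)) {
                System.out.println(dir + ": does not exist");
                continue;
            }
            FileStore store = Files.getFileStore(path);
            // A type of "tmpfs" means the directory is RAM-backed, not on disk,
            // which would explain ENOSPC on a box with plenty of disk space.
            System.out.printf("%s -> store=%s type=%s usableMB=%d totalMB=%d%n",
                    dir, store.name(), store.type(),
                    store.getUsableSpace() >> 20, store.getTotalSpace() >> 20);
        }
    }
}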

Hi Jammy,

Were you able to resolve the issue? We are facing the same problem.

No, we were not able to resolve the issue. The only thing that worked was increasing the memory of the Druid instance (a larger box). That day's data was lost.

Hi Jammy,
Druid realtime index tasks/nodes need local disk for persisting intermediate segments.

Did you check the disk usage on the instance running the indexing task, or on the Tranquility host?

FWIW, this error is complaining about disk usage on the node running the index task.
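If the directories those intermediate persists land in sit on a filesystem that is too small (e.g. a tmpfs-backed /tmp, which would also explain why a bigger-memory instance made the error go away), they can be pointed at a larger volume. A sketch for a middleManager's runtime.properties, assuming the indexing-service setup described in this thread; the paths and heap size below are placeholders, not recommended values:

# Put task working directories and the peon JVM temp dir on a large disk volume
druid.indexer.task.baseTaskDir=/mnt/druid/task
druid.indexer.runner.javaOpts=-server -Xmx2g -Djava.io.tmpdir=/mnt/druid/tmp

For standalone realtime nodes, the equivalent knob is basePersistDirectory in the spec's tuningConfig.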