Realtime Node restart error: com.metamx.common.ISE, not the right count

Hello, guys!

We’re using Druid 0.7.0.

We had to stop our Druid cluster for hardware maintenance (following the shutdown order described here), and we used SIGTERM to stop the Realtime Node (and the other node types).

After we restarted one of the Realtime Nodes, we got the following error (stderr):

2015-03-18T16:30:38,862 ERROR [chief-d018] io.druid.segment.realtime.RealtimeManager - RuntimeException aborted realtime processing[d018]: {class=io.druid.segment.realtime.RealtimeManager, exceptionType=class com.metamx.common.ISE, exceptionMessage=hydrant[FireHydrant{index=null, queryable=io.druid.segment.ReferenceCountingSegment@d383357, count=2}] not the right count[1]}
com.metamx.common.ISE: hydrant[FireHydrant{index=null, queryable=io.druid.segment.ReferenceCountingSegment@d383357, count=2}] not the right count[1]
at io.druid.segment.realtime.plumber.Sink.<init>(Sink.java:91) ~[druid-server-0.7.0.jar:0.7.0]
at io.druid.segment.realtime.plumber.RealtimePlumber.bootstrapSinksFromDisk(RealtimePlumber.java:637) ~[druid-server-0.7.0.jar:0.7.0]
at io.druid.segment.realtime.plumber.RealtimePlumber.startJob(RealtimePlumber.java:178) ~[druid-server-0.7.0.jar:0.7.0]
at io.druid.segment.realtime.RealtimeManager$FireChief.run(RealtimeManager.java:231) [druid-server-0.7.0.jar:0.7.0]

Tail of logs right before this error (stdout):

2015-03-18T16:30:38,823 INFO [chief-d018] io.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing segment[d018_2015-03-15T15:00:00.000Z_2015-03-15T16:00:00.000Z_2015-03-15T15:00:00.000Z] at path[/druid/d0/segments/10.18.3.40:7030/2015-03-18T16:30:38.604Z1]
2015-03-18T16:30:38,824 INFO [chief-d018] io.druid.segment.realtime.plumber.RealtimePlumber - Loading previously persisted segment at [/tmp/realtime/basePersist/d018/2015-03-16T00:00:00.000Z_2015-03-16T01:00:00.000Z/0]
2015-03-18T16:30:38,825 INFO [chief-d018] io.druid.segment.realtime.plumber.RealtimePlumber - Loading previously persisted segment at [/tmp/realtime/basePersist/d018/2015-03-16T00:00:00.000Z_2015-03-16T01:00:00.000Z/1]
2015-03-18T16:30:38,827 INFO [chief-d018] io.druid.segment.realtime.plumber.RealtimePlumber - Loading previously persisted segment at [/tmp/realtime/basePersist/d018/2015-03-16T00:00:00.000Z_2015-03-16T01:00:00.000Z/2]
2015-03-18T16:30:38,828 INFO [chief-d018] io.druid.segment.realtime.plumber.RealtimePlumber - Loading previously persisted segment at [/tmp/realtime/basePersist/d018/2015-03-16T00:00:00.000Z_2015-03-16T01:00:00.000Z/3]
2015-03-18T16:30:38,829 INFO [chief-d018] io.druid.segment.realtime.plumber.RealtimePlumber - Loading previously persisted segment at [/tmp/realtime/basePersist/d018/2015-03-16T00:00:00.000Z_2015-03-16T01:00:00.000Z/4]
2015-03-18T16:30:38,831 INFO [chief-d018] io.druid.segment.realtime.plumber.RealtimePlumber - Loading previously persisted segment at [/tmp/realtime/basePersist/d018/2015-03-16T00:00:00.000Z_2015-03-16T01:00:00.000Z/5]
2015-03-18T16:30:38,832 INFO [chief-d018] io.druid.segment.realtime.plumber.RealtimePlumber - Loading previously persisted segment at [/tmp/realtime/basePersist/d018/2015-03-16T00:00:00.000Z_2015-03-16T01:00:00.000Z/6]
2015-03-18T16:30:38,837 INFO [chief-d018] io.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing segment[d018_2015-03-16T00:00:00.000Z_2015-03-16T01:00:00.000Z_2015-03-16T00:00:00.000Z] at path[/druid/d0/segments/10.18.3.40:7030/2015-03-18T16:30:38.604Z1]
2015-03-18T16:30:38,838 INFO [chief-d018] io.druid.segment.realtime.plumber.RealtimePlumber - Loading previously persisted segment at [/tmp/realtime/basePersist/d018/2015-03-16T09:00:00.000Z_2015-03-16T10:00:00.000Z/0]
2015-03-18T16:30:38,840 INFO [chief-d018] io.druid.segment.realtime.plumber.RealtimePlumber - Loading previously persisted segment at [/tmp/realtime/basePersist/d018/2015-03-16T09:00:00.000Z_2015-03-16T10:00:00.000Z/1]
2015-03-18T16:30:38,841 INFO [chief-d018] io.druid.segment.realtime.plumber.RealtimePlumber - Loading previously persisted segment at [/tmp/realtime/basePersist/d018/2015-03-16T09:00:00.000Z_2015-03-16T10:00:00.000Z/2]
2015-03-18T16:30:38,843 INFO [chief-d018] io.druid.segment.realtime.plumber.RealtimePlumber - Loading previously persisted segment at [/tmp/realtime/basePersist/d018/2015-03-16T09:00:00.000Z_2015-03-16T10:00:00.000Z/3]
2015-03-18T16:30:38,844 INFO [chief-d018] io.druid.segment.realtime.plumber.RealtimePlumber - Loading previously persisted segment at [/tmp/realtime/basePersist/d018/2015-03-16T09:00:00.000Z_2015-03-16T10:00:00.000Z/4]
2015-03-18T16:30:38,845 INFO [chief-d018] io.druid.segment.realtime.plumber.RealtimePlumber - Loading previously persisted segment at [/tmp/realtime/basePersist/d018/2015-03-16T09:00:00.000Z_2015-03-16T10:00:00.000Z/5]
2015-03-18T16:30:38,847 INFO [chief-d018] io.druid.segment.realtime.plumber.RealtimePlumber - Loading previously persisted segment at [/tmp/realtime/basePersist/d018/2015-03-16T09:00:00.000Z_2015-03-16T10:00:00.000Z/6]
2015-03-18T16:30:38,852 INFO [chief-d018] io.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing segment[d018_2015-03-16T09:00:00.000Z_2015-03-16T10:00:00.000Z_2015-03-16T09:00:00.000Z] at path[/druid/d0/segments/10.18.3.40:7030/2015-03-18T16:30:38.604Z1]
2015-03-18T16:30:38,853 INFO [chief-d018] io.druid.segment.realtime.plumber.RealtimePlumber - Loading previously persisted segment at [/tmp/realtime/basePersist/d018/2015-03-04T19:00:00.000Z_2015-03-04T20:00:00.000Z/0]
2015-03-18T16:30:38,854 INFO [chief-d018] io.druid.segment.realtime.plumber.RealtimePlumber - Loading previously persisted segment at [/tmp/realtime/basePersist/d018/2015-03-04T19:00:00.000Z_2015-03-04T20:00:00.000Z/2]
2015-03-18T16:30:38,856 INFO [chief-d018] io.druid.segment.realtime.plumber.RealtimePlumber - Loading previously persisted segment at [/tmp/realtime/basePersist/d018/2015-03-04T19:00:00.000Z_2015-03-04T20:00:00.000Z/3]
2015-03-18T16:30:38,857 INFO [chief-d018] io.druid.segment.realtime.plumber.RealtimePlumber - Loading previously persisted segment at [/tmp/realtime/basePersist/d018/2015-03-04T19:00:00.000Z_2015-03-04T20:00:00.000Z/4]

I noticed that the following persist directories are missing:

/tmp/realtime/basePersist/d018/2015-03-04T19:00:00.000Z_2015-03-04T20:00:00.000Z/1

/tmp/realtime/basePersist/d018/2015-03-04T19:00:00.000Z_2015-03-04T20:00:00.000Z/5

/tmp/realtime/basePersist/d018/2015-03-04T19:00:00.000Z_2015-03-04T20:00:00.000Z/6

In every other interval the sequence is complete (0 to 6).
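
Here is roughly how I checked which intervals have gaps (a quick throwaway Python script, not anything that ships with Druid; the base path is the one from our setup):

#!/usr/bin/env python
# Throwaway check: list intervals under a realtime node's basePersistDirectory
# whose persisted hydrant directories are not a gap-free sequence 0..N.
# A gap (e.g. 0, 2, 3, 4) is what trips the "not the right count" ISE on restart.
import os
import re

BASE_PERSIST_DIR = "/tmp/realtime/basePersist/d018"

for interval in sorted(os.listdir(BASE_PERSIST_DIR)):
    interval_dir = os.path.join(BASE_PERSIST_DIR, interval)
    if not os.path.isdir(interval_dir):
        continue
    # Hydrant directories are plain integers; skip 'merged', 'isPushedMarker', etc.
    counts = sorted(
        int(name) for name in os.listdir(interval_dir) if re.match(r"^\d+$", name)
    )
    if counts != list(range(len(counts))):
        print("gap in %s: found %s" % (interval, counts))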

Contents of 2015-03-04T19:00:00.000Z_2015-03-04T20:00:00.000Z directory:

[druid@druid5 2015-03-04T19:00:00.000Z_2015-03-04T20:00:00.000Z]# tree -a
.
├── 0
│   ├── 00000.smoosh
│   ├── meta.smoosh
│   └── version.bin
├── 2
│   ├── 00000.smoosh
│   ├── meta.smoosh
│   └── version.bin
├── 3
│   ├── 00000.smoosh
│   ├── meta.smoosh
│   └── version.bin
├── 4
│   ├── 00000.smoosh
│   ├── meta.smoosh
│   └── version.bin
├── isPushedMarker
└── merged
    ├── 00000.smoosh
    ├── meta.smoosh
    └── version.bin

After I moved the directories causing this error out of /tmp/realtime/basePersist/d018/, I was able to start the Realtime Node without problems.
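
For reference, the "move aside" step was just something along these lines (a one-off cleanup, not a Druid feature; any un-pushed data in that interval is lost unless it can be re-ingested):

import os
import shutil

# One-off cleanup: move the interval with the gap out of the basePersistDirectory
# so the node can start. The target directory is arbitrary, just somewhere outside
# the basePersistDirectory.
broken = "/tmp/realtime/basePersist/d018/2015-03-04T19:00:00.000Z_2015-03-04T20:00:00.000Z"
quarantine = "/tmp/realtime/quarantine"

if not os.path.isdir(quarantine):
    os.makedirs(quarantine)
shutil.move(broken, quarantine)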

So the questions are:

  1. What could be the reason for this behavior?

  2. How should I debug this case to get more data?

  3. Is it possible to fix those “broken” segments?

  4. Could it be the same error described in these other threads: Not the right count error during realtime node restarting and Exception during rebalance?

Hi,

I wonder if it’s some race condition that could lead to such behaviour;

the persist directories should ideally form a complete sequence.

Do you see any exceptions while persisting data in the Realtime Node logs before the restart?

Can you also share the logs from when the segments were persisted? That might give us a better understanding of the issue.

Well, I’m not sure about the logs at this point, but I found the cause: I had two realtime plumbers configured on one machine, both pointing to the same tuningConfig.basePersistDirectory. After I gave each plumber its own basePersistDirectory, the problem went away.
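
In case it helps anyone else, each spec in the realtime specFile now points at its own basePersistDirectory, roughly like this (the paths are illustrative, and the dataSchema/ioConfig sections are elided):

[
  {
    "dataSchema": { ... },
    "ioConfig": { ... },
    "tuningConfig": {
      "type": "realtime",
      "basePersistDirectory": "/tmp/realtime/basePersist/plumber-0"
    }
  },
  {
    "dataSchema": { ... },
    "ioConfig": { ... },
    "tuningConfig": {
      "type": "realtime",
      "basePersistDirectory": "/tmp/realtime/basePersist/plumber-1"
    }
  }
]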