[issue] Druid 0.9.1 constantly loading and dropping segments.

We have a test cluster that was set up with Druid 0.9.0, with a hot and a cold tier and load rules in place. We had plenty of data loaded in both tiers.

Today I upgraded to Druid 0.9.1.1. All cluster nodes started up nicely, but the coordinator console showed no data in the cold tier and only very little data in the hot tier. I waited for data to load, but noticed that the data volume for the hot tier just kept going up and down.

Looking at the log file of one of the historical nodes, I see that the same segment gets loaded and dropped over and over again.
I attached a log fragment to this post; it is the historical log file grepped for a single segment.

A couple of weeks ago I had already deployed Druid 0.9.1-rc1 onto the same test cluster and did not have any problems. That, however, was at a time when we didn’t have tiering in place yet.

BTW: when upgrading to Druid 0.9.1.1, is it necessary to delete and recreate the metastore database? (I didn’t do this)

thanks
Sascha

druid_091_loading-droping-issue.txt (13.6 KB)

Hey Sascha,

What is the coordinator logging during this time?

And no, it is not necessary to delete and recreate the metadata store (upgrading would be very painful if so!)

Hey Gian,

I attached a new batch of log files: one file contains the full coordinator log, and the other two contain the coordinator log and one historical log, each filtered on the occurrence of an example segment ID.
The cluster has two historicals that are part of a hot tier and three nodes that form the cold tier. I labelled the host names coordinator-host, metastore-host, historical-1-hot-host, historical-2-hot-host, historical-3-cold-host, historical-4-cold-host and historical-5-cold-host.

It looks to me as if the coordinator keeps moving a segment back and forth between historical-1-hot-host and historical-2-hot-host, i.e. between two nodes that belong to the same tier.

As already mentioned above, things worked fine with Druid 0.9.1-rc1, but at that time we didn’t have tiering set up yet. Also, when we tried out tiering (rule #1: load period P20D, 1 replica on hot; rule #2: load period P25D, 1 replica on cold; rule #3: drop forever), we noticed that modifying those rules sometimes didn’t have the desired effect. We found that this was because once a segment had been matched by the drop rule, it was marked as used=false and never considered again. So we executed an SQL statement to mark those segments as used=true and bring them back to the coordinator’s attention (a rough sketch of both steps is below). I don’t know whether this relates to the behaviour we are observing now; I just wanted to mention it. Other than that, we did not do anything unusual with our setup.
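
For reference, a minimal sketch of both steps. It assumes a MySQL metadata store with the default druid_segments table, placeholder credentials, "my_datasource" as a stand-in for the real datasource name, and "hot"/"cold" as the configured tier names; the rules are posted to the coordinator's /druid/coordinator/v1/rules/{dataSource} endpoint:

```python
import pymysql    # assumption: MySQL metadata store; adjust for PostgreSQL etc.
import requests

COORDINATOR = "http://coordinator-host:8081"   # placeholder coordinator address
DATASOURCE = "my_datasource"                   # placeholder datasource name

# Rules #1-#3 as described above: 20 days on the hot tier, 25 days on the
# cold tier, drop everything older.
rules = [
    {"type": "loadByPeriod", "period": "P20D", "tieredReplicants": {"hot": 1}},
    {"type": "loadByPeriod", "period": "P25D", "tieredReplicants": {"cold": 1}},
    {"type": "dropForever"},
]
resp = requests.post(
    "{}/druid/coordinator/v1/rules/{}".format(COORDINATOR, DATASOURCE),
    json=rules,
)
resp.raise_for_status()

# Re-enable segments that the drop rule marked as unused, so the coordinator
# considers them again (the "used=true" SQL fix mentioned above).
conn = pymysql.connect(host="metastore-host", user="druid",
                       password="druid", database="druid")
try:
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE druid_segments SET used = true WHERE dataSource = %s",
            (DATASOURCE,),
        )
    conn.commit()
finally:
    conn.close()
```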

thanks
Sascha

coordinator-log-full.txt (1.34 MB)

historical-logs-extract.txt (10.4 KB)

coordinator-logs.extract.txt (13.9 KB)

one more thing:

I switched back to Druid 0.9.0, but the problem did not go away.
So I created a fresh metastore, removed everything from ZooKeeper, and started the cluster again. Funnily enough, this does not lead to a situation with zero segments. Instead, the coordinator console shows this:

In the new coordinator console, no load rules can be configured because no datasources exist yet. The old coordinator console does allow rules to be configured.
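
For what it's worth, which datasources the coordinator currently knows about (and the rules it has stored) can also be checked via its HTTP API, independent of either console. A small sketch, assuming the coordinator listens on coordinator-host:8081:

```python
import requests

COORDINATOR = "http://coordinator-host:8081"

# Datasources with at least one used segment, as seen by the coordinator.
print(requests.get(COORDINATOR + "/druid/coordinator/v1/datasources").json())

# All load/drop rules currently stored in the metadata store, keyed by datasource.
print(requests.get(COORDINATOR + "/druid/coordinator/v1/rules").json())
```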