Druid missing records after compaction

I have set up a manual compaction task to convert segments from day to month segment granularity, without touching the query granularity and without any sort of filter.

The job runs successfully, but afterwards I found that several days of data are missing.

I have tried reloading all segments.
I have also inspected the segments and they seem just fine, with many GB worth of data.

But still NO DATA is found for the given dates in queries.
I have also purged the Redis cache, but that did not help.

I have also restarted the coordinator nodes, and nothing changed.

Relates to Apache Druid 0.22.0

Could you possibly share your manual compaction task?

This is the JSON:

{
  "type": "compact",
  "dataSource": "DATASOURCE",
  "interval": "2022-05-01/2022-05-31",
  "tuningConfig" : {
    "type" : "index_parallel",
    "maxRowsPerSegment" : 5000000,
    "maxRowsInMemory" : 1000000,
    "maxNumConcurrentSubTasks" : 20,
    "maxRetry": 10
  },
  "granularitySpec" : {
    "segmentGranularity" : "MONTH"
  }
}

Thank you @mostafatalebi. I think losing some days might be expected here. Compaction doesn’t modify the underlying data of the segments by default, but the data might be modified when the granularity is changed. In this case, it looks like you’re going to a coarser granularity. Here’s the reference in the docs.

Maybe a different granularitySpec would achieve your desired outcome? Something like:

"granularitySpec": {
    "segmentGranularity": "day",
    "queryGranularity": "month"
  }

There’s also this note in the last link, which might be of use in your case:

granularitySpec is an optional field. If you don’t specify granularitySpec , Druid retains the original segment and query granularities when compaction is complete.

Let us know how it goes.

Since I have not touched the HOUR queryGranularity, I expect it to be retained. Why should data be lost? The amount being lost is serious: many days. In my opinion this cannot be expected behavior.

And I want to keep my queryGranularity to be hour.

I apologize for my mistake. queryGranularity needs to be equal to, or finer than, segmentGranularity, so my suggestion should have been almost identical to your JSON:

"granularitySpec": {
    "segmentGranularity": "month",
    "queryGranularity": "hour"
  }

Since granularitySpec is optional, how about something like this:

{
  "type": "compact",
  "dataSource": "DATASOURCE",
  "ioConfig": {
    "type": "compact",
    "inputSpec": {
      "type": "interval",
      "interval": "2022-05-01/2022-05-31"
    }
  }
}

The above task will compact all segments within the given interval without changing the original segment granularity. Here is the link to the relevant 0.22.0 doc.

Or:

{
  "type": "compact",
  "dataSource": "DATASOURCE",
  "ioConfig": {
    "type": "compact",
    "inputSpec": {
      "type": "interval",
      "interval": "2022-05-01/2022-05-31"
    }
  },
  "granularitySpec": {
    "segmentGranularity": "month",
    "queryGranularity": "hour"
  }
}
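One way to avoid hand-editing slips such as a stray trailing comma is to build the spec programmatically and serialize it. The following is only a minimal Python sketch under that assumption; `build_compaction_spec` is a hypothetical helper (not part of Druid), and the field names are the ones used in the specs in this thread.

```python
import json


def build_compaction_spec(datasource, interval,
                          segment_granularity=None, query_granularity=None):
    """Build a compaction task spec as a plain dict (hypothetical helper)."""
    spec = {
        "type": "compact",
        "dataSource": datasource,
        "ioConfig": {
            "type": "compact",
            "inputSpec": {"type": "interval", "interval": interval},
        },
    }
    # Only add granularitySpec when something is actually set, so that
    # omitting both arguments leaves the granularities untouched.
    granularity = {}
    if segment_granularity:
        granularity["segmentGranularity"] = segment_granularity
    if query_granularity:
        granularity["queryGranularity"] = query_granularity
    if granularity:
        spec["granularitySpec"] = granularity
    return spec


spec = build_compaction_spec("DATASOURCE", "2022-05-01/2022-05-31", "month", "hour")
payload = json.dumps(spec, indent=2)  # json.dumps always emits strict JSON
json.loads(payload)                   # round-trip check: raises if the text is not valid JSON
print(payload)
```

The printed text can then be submitted the same way as a hand-written task spec.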

Unfortunately, you might still modify the segment data. Here is the relevant language from the same doc:

Segment granularity handling

Unless you modify the segment granularity in the granularity spec, Druid attempts to retain the granularity for the compacted segments. When segments have different segment granularities with no overlap in interval Druid creates a separate compaction task for each to retain the segment granularity in the compacted segment.

If segments have different segment granularities before compaction but there is some overlap in interval, Druid attempts to find the start and end of the overlapping interval and uses the closest segment granularity level for the compacted segment. For example, consider two overlapping segments: segment “A” for the interval 01/01/2021-01/02/2021 with day granularity and segment “B” for the interval 01/01/2021-02/01/2021. Druid attempts to combine and compact the overlapped segments. In this example, the earliest start time for the two segments above is 01/01/2021 and the latest end time of the two segments above is 02/01/2021. Druid compacts the segments together even though they have different segment granularity. Druid uses month segment granularity for the newly compacted segment even though segment A’s original segment granularity was DAY.

None of these passages says that it will affect query granularity. And how does it justify the loss of 90% of a month’s days? I guess there must be something else in play.

Can you try with "interval": "2022-05-01/2022-06-01"? In fact, the data considered is in the format fromdate/todate, i.e. the todate itself is not included.

However, I don’t see a possibility of compaction deleting data.
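To illustrate the exclusive end date noted above, here is a minimal Python sketch. It assumes the interval is treated as half-open, [start, end), as described in this post; `days_covered` is a hypothetical helper written for this illustration, not a Druid API.

```python
from datetime import date, timedelta
from typing import List


def days_covered(interval: str) -> List[date]:
    """Return the calendar days inside a half-open 'start/end' interval."""
    start_text, end_text = interval.split("/")
    start = date.fromisoformat(start_text)
    end = date.fromisoformat(end_text)
    # The end date is exclusive, so the range stops one day before it.
    return [start + timedelta(days=i) for i in range((end - start).days)]


original = days_covered("2022-05-01/2022-05-31")
suggested = days_covered("2022-05-01/2022-06-01")
print(len(original), original[-1])    # 30 2022-05-30
print(len(suggested), suggested[-1])  # 31 2022-05-31
```

Under this reading, the original interval leaves out May 31 only, so it would not by itself account for many missing days.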

Tell us a bit more about where it’s showing “loss”?

Compaction will create new segment versions in the intervals you covered – AFAIK it will not delete old ones. You should see that in the SYS tables there are now new versions of the segments in your compacted intervals. These new versions will trigger the coordinator to (safely) tell historicals to unload the old versions and then load the new ones. What is the segment load queue like in the console? Is it showing 100% loaded?

… I guess what I’m really asking is where you are noticing that the data is gone, because that will help direct things :)

Yes, it shows all segments as LOADED. All green.
When I check the segments, their size is big, as expected if they hold all days.
When I query, 90% of days are gone. All sorts of queries (select *, select sum() etc.)
I guess in some ways new segments are either buggy or there is something else in play.
BTW, can I bring back the older versions of segments?

Would you mind checking your load rules for the datasource in question? Given that it’s at 100%, I am wondering if the load rules are restricting the time periods that are being selected for loading.

You can see them in the retention settings in the console, or using this API:

If the segment files themselves had an issue, I believe that the historical logs would report errors on load … or when you try to query, at least.

Also, you can query sys.segments to see the state of segments – the coordinator will only load segments that are marked with used as true. You will also be able to see your old segments in there.
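A possible starting point for that check is sketched below in Python. It only builds the request body for Druid's SQL endpoint (`/druid/v2/sql`) and does not send it. The column names are taken from the `sys.segments` documentation as best I recall it and should be verified against your Druid version; `segments_query_payload` is a hypothetical helper.

```python
import json


def segments_query_payload(datasource: str) -> dict:
    """Build a Druid SQL request body listing segments for one datasource."""
    # "start" and "end" are quoted because they can clash with SQL keywords.
    # Column names are an assumption; check them against your Druid version.
    sql = (
        'SELECT "segment_id", "start", "end", "version", "num_rows", '
        '"is_published", "is_available", "is_overshadowed" '
        "FROM sys.segments "
        "WHERE datasource = '{}' "
        'ORDER BY "start", "version"'
    ).format(datasource)
    return {"query": sql}


payload = segments_query_payload("DATASOURCE")
# This body would be POSTed as JSON to the SQL endpoint of a router or broker.
print(json.dumps(payload, indent=2))
```

Comparing the versions and row counts per interval before and after compaction should show whether the new month segment actually contains the missing days.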

I’m not sure about how you would revert to a previous version of segments, I’m afraid…

Retention is fine.
The month in question still has several days of data, and so does the month before it. So it is not a matter of retention rules.
As for the logs, I’ve checked both the coordinator and the historicals, and they show no entries about faulty segments or segments that could not be downloaded.

Does the Load rule cover the entire period covered by a given segment? Just wondering whether you have segments for like, May, but as you only have say 14 days, it’s not picking up the segment.

It is covering, yes. I have it for three months, and May has some real difficulties (unavailability of many days)

DropBeforePeriod (P3M)
LoadByPeriod (P3M) [Include Future as well]

The above is my retention rule.

Can you try compaction with only one retention rule, i.e. LoadForever? This is just to test whether the issue is related to compaction or to retention.

Hey @mostafatalebi sorry to be late replying!

Hm like @TijoThomas I wonder if this is rule order or rule type?
Druid will run from top to bottom of the list for each segment that is marked as used.
So typically the rules are like, load, load, load, then DropForever – because anything that isn’t captured by the first set of rules, then just gets dropped.

Like @TijoThomas says maybe you could do one LoadForever just to start with – see if the data comes back.
Then you can adjust to LoadByPeriod followed by DropForever, since anything that does not match the first rule will fall through to the second and get dropped.

Do note that Drop rules mark segments as “unused” – so if you run a kill task, or have autokill turned on, you will then lose that data from Deep Storage.
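The top-to-bottom, first-match evaluation described above can be sketched in Python as follows. This is a toy model only: the predicates work on a segment's age in months, which is an assumption made for illustration, whereas real Druid rules compare segment intervals against periods with their own matching semantics. `first_matching_rule` is a hypothetical helper.

```python
from typing import Callable, List, Optional, Tuple

# A rule is a display name plus a predicate on the segment's age in months.
Rule = Tuple[str, Callable[[int], bool]]


def first_matching_rule(rules: List[Rule], segment_age_months: int) -> Optional[str]:
    """Walk the rule list from top to bottom and return the first match."""
    for name, matches in rules:
        if matches(segment_age_months):
            return name
    return None  # no rule matched


rules: List[Rule] = [
    ("loadByPeriod(P3M)", lambda age: age <= 3),  # toy stand-in for the period check
    ("dropForever", lambda age: True),            # catches everything not loaded above
]

print(first_matching_rule(rules, 1))  # loadByPeriod(P3M)
print(first_matching_rule(rules, 8))  # dropForever
```

Because the first match wins, placing a drop rule above a load rule can change which segments are served even when both rules use the same period.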