Doubt about intervals in delta ingestion task

Hello,

We’ve been trying out Druid in a testing environment and we’re seeing some strange behavior related to delta ingestion tasks; the documentation doesn’t provide much information about this issue.

We made a very large ingestion of data ranging from the beginning of October until today, using the static inputSpec and a large number of files. After this, we have a small system that grabs new files as they arrive and indexes them into the existing datasource using the multi inputSpec (one child being a dataSource inputSpec with the name of the existing datasource, and the other being a static inputSpec pointing at the new files).

So, we launched this system yesterday and up until now have had no trouble indexing the data into Druid. However, at some point we noticed that data was disappearing from the datasource, hour by hour. For example, I executed a query at 15:00 for yesterday’s data and it returned a value of 5432, but when I executed the same query a couple of hours later it returned 0.

Little by little, we saw more data disappearing. The intervals we define in the delta task are assigned dynamically to cover from yesterday’s date up to the current time, for example:

"granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "WEEK",
        "queryGranularity" : {
          "type" : "duration",
          "duration" : 3600000,
          "origin" : "1970-01-01T00:00:00.000Z"
        },
        "intervals" : [ "2016-11-07T17:05:05.000Z/2016-11-08T18:21:19.000Z" ]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "multi",
        "children" : [ {
          "type" : "dataSource",
          "ingestionSpec" : {
            "dataSource" : "reporting",
            "ignoreWhenNoSegments" : "true",
            "interval" : "2016-11-07T17:05:05.000Z/2016-11-08T18:21:19.000Z"
          }

So, we believe that the data is being deleted because of these intervals, but we would like to know why, and exactly **which intervals we should set (in both places) in order to keep all the data in Druid forever**, while minimizing the task's ingestion time.

Thank you very much,

Joan

Hi Joan,

The interval property of the dataSource ingestionSpec determines which existing segments are loaded for re-indexing, and the intervals property of the granularitySpec determines which timestamps (from both the existing segments and the new data) are included in the new indexes. Hence, if you want to append a set of new events to an existing dataSource in its entirety, the tightest intervals that would accomplish this are: the ingestionSpec interval set to the min/max timestamps of the existing segments, and the granularitySpec interval set from the min timestamp of the existing segments to the max timestamp of the new data.

Practically though, it probably makes more sense to just keep both intervals the same and go from the min timestamp in your datasource (beginning of October) to either current time or the timestamp of the latest event. I don’t believe this will have much impact on ingestion time and should simplify your dynamic config.
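
For example, reusing the fragment from your spec above, that could look something like the following. This is just a sketch: the interval end here is simply the latest event timestamp from your example, and the paths in the static child are placeholders for your new files.

  "granularitySpec" : {
    "type" : "uniform",
    "segmentGranularity" : "WEEK",
    "queryGranularity" : {
      "type" : "duration",
      "duration" : 3600000,
      "origin" : "1970-01-01T00:00:00.000Z"
    },
    "intervals" : [ "2016-10-01T00:00:00.000Z/2016-11-08T18:21:19.000Z" ]
  },
  "ioConfig" : {
    "type" : "hadoop",
    "inputSpec" : {
      "type" : "multi",
      "children" : [ {
        "type" : "dataSource",
        "ingestionSpec" : {
          "dataSource" : "reporting",
          "ignoreWhenNoSegments" : true,
          "interval" : "2016-10-01T00:00:00.000Z/2016-11-08T18:21:19.000Z"
        }
      }, {
        "type" : "static",
        "paths" : "s3n://your-bucket/new-file-1.json,s3n://your-bucket/new-file-2.json"
      } ]
    }
  }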

Hi David,

Thanks for the advice on how to deal with this problem. We’ve applied it and the segments are no longer being deleted from Druid, but the tasks are in fact taking a little longer to ingest (comparing data of the same size).

We’ve been looking at this issue from different angles and we still wonder whether this is the ideal way to address it. We’ve also been trying to fix it by specifying the existing segments in the delta ingestion task so that the data won’t be erased automatically, as explained in http://druid.io/docs/0.9.1.1/ingestion/update-existing-data.html, where following this approach is strongly recommended. The segment we’re providing in this task corresponds to a week’s data (this week’s data), since that is the ideal segment granularity for us given its size. However, this has not worked for us.

Is setting the minimum of the intervals to the minimum of our overall data (up until now that was October, but our plan is to ingest old data going back to January 2015…) the optimal way of launching these tasks?

Thank you very much!

Joan

What is the time interval covered by the events in the new files that arrive periodically? If the events are spread across your entire range (i.e. they go from January 2015 to the present), then you’ll need to do a re-indexing that includes all the existing segments in that range. If the events only overlap with, say, the previous week’s segments, then you only need to re-index those segments and can restrict your interval to that range (making sure your interval covers all of the events in the existing segments, otherwise your re-indexed segment will be missing data). If your new events don’t overlap at all, then no re-indexing is required and you can just generate new segments, but you should make sure your interval start is >= the interval end of your most recent segment; otherwise a single rogue late event will cause the creation of a new segment containing only that one event, which will overshadow your existing segments.
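
As a rough sketch of that last case (new events that don’t overlap any existing segment), you could drop the dataSource child entirely and just index the new files, with the interval start at or after the end of your most recent segment. The dates and paths below are only illustrative:

  "granularitySpec" : {
    "type" : "uniform",
    "segmentGranularity" : "WEEK",
    "intervals" : [ "2016-11-14T00:00:00.000Z/2016-11-21T00:00:00.000Z" ]
  },
  "ioConfig" : {
    "type" : "hadoop",
    "inputSpec" : {
      "type" : "static",
      "paths" : "s3n://your-bucket/new-logs.json"
    }
  }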

Our detailed plan is this:

We’ll make a massive data ingestion going from January 2015 to start of November 2016. This task is a simple static ingestion spec of large csv files we have in S3.

Afterwards, the system we built is turned on and starts to gather the new, smaller JSON files we receive automatically in S3. The timestamps of these logs range from roughly 24 hours ago to now. As these files arrive, they are inserted into a “blank” delta ingestion task (a multi inputSpec with two children: 1. type: dataSource, with the name of the datasource we created with the first static ingestion, and 2. type: static, with the paths of the new files that arrive).

The interval of the events that arrive is also set dynamically to grab the min and the max of the timestamps of the logs. So, we want our datasource to range from January 2015 to now, and append new data as it arrives.

The intervals we set were these min and max timestamps, in both places, and this is what causes data to be erased. As I said earlier, setting the minimum of both intervals to January 2015 works fine and data is no longer being deleted, but it does make the tasks take quite a bit longer to ingest the new data into the existing datasource.

I thought that perhaps setting just one of the interval minimums to 2015 (I’m not quite sure which one) would stop the ingestion from deleting data AND keep it as fast as it was when we had the min and max timestamps of the logs in both intervals.

We also think that perhaps setting the segments properties correctly would help us accomplish this, but this is the part we still fail to understand. Is any of this correct?

Thanks,

Joan

Hey Joan,

There is definitely no need to re-index all of your data going back to 2015 just to append some new events to the dataSource. You only need to re-index the segments that overlap with the segments that will be created by the new events, which means you should set your min interval to the beginning of the time bucket that will hold your events based on segmentGranularity (and not to the min timestamp of the events). Here’s an example that might help clarify:

You have your segmentGranularity set to WEEK, so you should have segments that look something like this:

Segment A: 10-17-2016/10-24-2016
Segment B: 10-24-2016/10-31-2016
Segment C: 10-31-2016/11-07-2016
Segment D: 11-07-2016/11-14-2016

Suppose you received events for Nov 7 and ran a batch indexing job - this would result in the generation of Segment D, which would have a time interval covering one week spanning to Nov 14, but would only contain data for Nov 7. This is fine, because that is all the data you have. Now it’s midnight on Nov 8, you receive the events from Nov 8, and you run another batch job. If you:

a) Didn’t run a merge task but just a regular indexing job, you would create another Segment D that would only contain the events from Nov 8. This new segment would be a newer version, would overshadow the old one, and you would effectively lose your events from Nov 7.

b) Ran a merge task, but set the interval minimum to the min timestamp of the new set of events (Nov 8), you would again create a Segment D containing only the events from Nov 8, because the earlier events would have been excluded by your interval.

So what you really want to do is run a merge with the interval start set to the beginning of the segment that the events will go into - in this example Nov 7 - so that the events from Nov 7 in the old segment are re-indexed into the new segment along with the events from Nov 8.

For the interval end, in the general case you ideally want the max of (the existing segment end, i.e. Nov 14) and (the max timestamp of the new data). Depending on how your data comes in, if you’re sure the new files will always contain newer events than what is in the existing segment, you can just set it to the max timestamp of the data.
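
To make that concrete, here’s a rough sketch of how the intervals could look for this example. I’m using the existing segment end (Nov 14) as the interval end, and the static paths are just placeholders for your new files:

  "granularitySpec" : {
    "type" : "uniform",
    "segmentGranularity" : "WEEK",
    "intervals" : [ "2016-11-07T00:00:00.000Z/2016-11-14T00:00:00.000Z" ]
  },
  "ioConfig" : {
    "type" : "hadoop",
    "inputSpec" : {
      "type" : "multi",
      "children" : [ {
        "type" : "dataSource",
        "ingestionSpec" : {
          "dataSource" : "reporting",
          "ignoreWhenNoSegments" : true,
          "interval" : "2016-11-07T00:00:00.000Z/2016-11-14T00:00:00.000Z"
        }
      }, {
        "type" : "static",
        "paths" : "s3n://your-bucket/new-events.json"
      } ]
    }
  }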

Hi David,

Wow, thank you very much for the detailed response. We were having a pretty rough time without this concept. This makes much more sense!

We’ve set the interval to cover the whole segment (a week), because we’re not always sure how the timestamps are going to arrive. There is, however, one small issue that I’m worried about, and I was hoping you could put our minds at rest:

When a new week starts on Monday (or is it Sunday? we’re not sure whether it’s an ISO week) at 00:00, the interval minimum will be set to that same day at 00:00. The problem is that the logs we receive at this time will very possibly contain data from Sunday at 23:40, 23:50, and so on. This data will not be indexed, and we don’t think it’s trivial to obtain these timestamps (the timestamp we’ve been using to set the interval comes from the arrival time of an SQS message).

In order to fix this, would it be correct to extend the interval minimum back to the previous week just for these critical moments when the new week starts, so that this data is also indexed?

Thanks!

Joan

Hey Joan,

Yes, that sounds like a good approach to me.
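
For example (a rough sketch, with illustrative dates): around the Monday 00:00 boundary, you could start both intervals one week earlier so that events from late Sunday night still land in the previous week’s segment:

  "granularitySpec" : {
    "type" : "uniform",
    "segmentGranularity" : "WEEK",
    "intervals" : [ "2016-11-07T00:00:00.000Z/2016-11-21T00:00:00.000Z" ]
  }

and, in the dataSource child of the multi inputSpec:

  "ingestionSpec" : {
    "dataSource" : "reporting",
    "interval" : "2016-11-07T00:00:00.000Z/2016-11-21T00:00:00.000Z"
  }

This way the previous week’s segment gets re-indexed together with the new week’s data, so nothing is overshadowed.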

(P.S. you’re right, Druid uses ISO weeks, which begin on Monday at 00:00)

Thanks for solving all these doubts, David!

Regards,

Joan