Problems in batch-indexing

Hi all,
I ran into a behavior that is causing me problems, and I would like to share it with you.
I have to load historical data. Each load operation is related to a different day.
Timestamps in the historical logs passed to Druid indexing are relative to our local timezone (CET);
the following are examples of log records (the timestamp being the second field, pipe-separated):
283|2015-12-14T13:20:00+01:00|1|premium|1000|11|111|1143|1|10|2943|11|0
283|2015-12-14T13:20:00+01:00|18|webmail|1000|2|727|266249|0|2|2653|0|0
283|2015-12-14T13:20:00+01:00|11|search|1010|1|2|4|0|1|2287|0|0
283|2015-12-14T13:20:00+01:00|7|blog|0|2|6|18|0|0|0|0|0
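For context, the parser section for records like these looks roughly as follows (the column names here are illustrative placeholders, not the real ones from the attached index.json; the relevant point is that the timestamp is taken from the second pipe-separated field):

"parser": {
  "type": "string",
  "parseSpec": {
    "format": "tsv",
    "delimiter": "|",
    "timestampSpec": { "column": "ts", "format": "iso" },
    "columns": ["customer", "ts", "sectionId", "sectionName", "m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8", "m9"],
    "dimensionsSpec": { "dimensions": ["customer", "sectionName"] }
  }
}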
In the index specification JSON file (attached), I have the following declaration:
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "granularity": { "type": "period", "period": "PT5M" },
  "intervals": ["2015-12-14T00:00:00+01:00/2015-12-14T23:59:59+01:00"]
}
The unwanted behavior is that, when loading data (via another indexing task) for the subsequent day (2015-12-15, in our example), the events previously loaded for 2015-12-14 are no longer available (only the records from the first hour of that day are left).
I have noticed that in the MySQL metadata druid_segments table, two records refer to the same start/end range, but one of them (the oldest, as per the created_date column) is marked used=0.

index.json (2.41 KB)

index_sso_arll2_2016-01-18T10_57_32.608Z.log (171 KB)

Hi Marco, Druid uses MVCC (https://en.wikipedia.org/wiki/Multiversion_concurrency_control) for segments. This means that when you index data for an interval of time, you create segments with a certain version. If you reindex data for that same time, you create segments with a new version that obsoletes the old segments. Druid always queries data from the segments with the latest version for an interval of time. Any segment that is completely overshadowed is dropped from the cluster.
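As an illustration (values made up, descriptors trimmed to the relevant fields), two loads covering the same interval leave two rows in druid_segments whose payloads differ only in version; the newer version overshadows the older one, which is why you see the older row flipped to used=0:

{ "dataSource": "mySource", "interval": "2015-12-14T00:00:00.000Z/2015-12-15T00:00:00.000Z", "version": "2016-01-17T09:00:00.000Z" }
{ "dataSource": "mySource", "interval": "2015-12-14T00:00:00.000Z/2015-12-15T00:00:00.000Z", "version": "2016-01-18T10:57:32.608Z" }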

Just to clarify what you are doing, are you trying to append new events to an existing segment?

Hi Fangjin,
no, I am not trying to append new events to an existing segment.
I am trying to populate a freshly created datasource by executing a series of batch-indexing operations, each one related to a different day.
I am absolutely sure that the data loaded for a given day do not contain events belonging to a different day.
Could you explain to me the meaning of the "interval" field (as opposed to the "intervals" one) that I find in the index task logs?

Thank you very much for the pointer to the MVCC info. It is very interesting to learn about Druid's internal architecture.
Best,
Marco

Druid has a notion of segment granularity, which determines how data is partitioned. For example, if you only had data for a few hours in a day, but your segment granularity was "DAY", you would create a segment for an entire day, where only a few hours might have actual data.

The "interval" pertains to the intervals covered by the segments you are creating.
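To make this concrete for your two loads (a sketch: your CET intervals converted to UTC, which is what the indexer uses by default, then rounded to UTC DAY buckets):

"intervals": ["2015-12-14T00:00:00+01:00/2015-12-14T23:59:59+01:00"]  =>  2015-12-13T23:00:00Z/2015-12-14T22:59:59Z  =>  DAY segments 2015-12-13 and 2015-12-14
"intervals": ["2015-12-15T00:00:00+01:00/2015-12-15T23:59:59+01:00"]  =>  2015-12-14T23:00:00Z/2015-12-15T22:59:59Z  =>  DAY segments 2015-12-14 and 2015-12-15

Both loads write a segment for the UTC day 2015-12-14, so the second load's newer version overshadows the first load's data there; only the first CET hour of 2015-12-14 survives, because it lives in the 2015-12-13 UTC segment that the second load never touches.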

FWIW, by starting the Overlord and the MiddleManager with -Duser.timezone=CET, all seems OK (previous segments are left untouched):
passing

"intervals" : [ "2015-12-14T00:00:00.000+01:00/2015-12-14T23:59:59.000+01:00" ]
into the task JSON file produces this in the index task log:
"interval" : "2015-12-14T00:00:00.000+01:00/2015-12-15T00:00:00.000+01:00"

The only side effect is that, both in MySQL and in the local storage, the segment is doubled.

Internally we handle time zones by keeping all data in UTC and applying time-zone handling at query time.
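For example (a sketch; the datasource and aggregator names are made up), a query can ask for CET-aligned buckets over UTC data by using a period granularity that carries a timeZone:

{
  "queryType": "timeseries",
  "dataSource": "mySource",
  "granularity": { "type": "period", "period": "PT5M", "timeZone": "Europe/Rome" },
  "intervals": ["2015-12-14T00:00:00+01:00/2015-12-15T00:00:00+01:00"],
  "aggregations": [{ "type": "longSum", "name": "events", "fieldName": "count" }]
}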

I filed https://github.com/druid-io/druid/issues/2356 because the docs seem to be a little short in that department.

Thanks Charles,
in any case I will follow your suggestion, namely to load data with UTC timestamps and request aggregated values for CET-marked intervals at query time.
I realize your way of doing things is the right one.
Best,
Marco

Hi Charles,
all is going well now. I am working with user.timezone=UTC and loading data in UTC.
Thanks again,
Marco