Error in BUILD_SEGMENTS when trying to append to compacted segments

Hi guys,

I get an error when I post an indexing task that appends to segments that have already been compacted.
First I want to give you some context to help you understand the issue.

I did the following steps:

  1. I created 3 CSV files with the exact same data but different timestamps, one for each day: 2019-01-27, 2019-01-28 and 2019-01-29. Each file has ~2.7M rows and 4 columns.
  2. I posted the following ingestion spec, which successfully created the datasource with 3 segments, one per day:
    {
      "type": "index",
      "spec": {
        "dataSchema": {
          "metricsSpec": [
            {
              "type": "count",
              "name": "count"
            },
            {
              "fieldName": "metric_value",
              "type": "doubleMin",
              "name": "min"
            },
            {
              "fieldName": "metric_value",
              "type": "doubleMax",
              "name": "max"
            },
            {
              "fieldName": "metric_value",
              "type": "doubleSum",
              "name": "sum"
            }
          ],
          "granularitySpec": {
            "queryGranularity": "minute",
            "rollup": true,
            "segmentGranularity": "day",
            "type": "uniform",
            "intervals": [
              "2019-01-27T00:00:00Z/2019-01-30T00:00:00Z"
            ]
          },
          "parser": {
            "parseSpec": {
              "timestampSpec": {
                "column": "ts",
                "format": "auto"
              },
              "dimensionsSpec": {
                "dimensions": [
                  "endpoint",
                  "metric_key"
                ]
              },
              "columns": [
                "ts",
                "endpoint",
                "metric_key",
                "metric_value"
              ],
              "format": "csv"
            },
            "type": "string"
          },
          "dataSource": "TestDataSource"
        },
        "tuningConfig": {
          "forceExtendableShardSpecs": true,
          "type": "index",
          "targetPartitionSize": 25000000
        },
        "ioConfig": {
          "appendToExisting": true,
          "firehose": {
            "filter": "test_data*.csv",
            "baseDir": "/my/path/to/my/data/",
            "type": "local"
          },
          "type": "index"
        }
      },
      "context": {
        "priority": 50
      }
    }

  3. Then I issued a compaction task with the following settings:
    {
      "type": "compact",
      "dataSource": "TestDataSource",
      "interval": "2019-01-27/2019-01-30",
      "keepSegmentGranularity": false,
      "tuningConfig": {
        "type": "index",
        "targetPartitionSize": 20000000,
        "maxRowsInMemory": 25000,
        "forceExtendableShardSpecs": true
      }
    }

So I successfully got the 3 segments (one per day) and then, after compaction, one single segment, as shown below:
(screenshots: the segment view showing the 3 daily segments, and the single segment after compaction)

Then I tried to ingest the 2019-01-29 data with the above ingestion spec, modified to select only the data of 2019-01-29 (the relevant changes are sketched below).
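Roughly, the only parts I changed are the granularitySpec interval and the firehose filter (the file name matches the one that shows up in the log), something like this:

    "granularitySpec": {
      "queryGranularity": "minute",
      "rollup": true,
      "segmentGranularity": "day",
      "type": "uniform",
      "intervals": [
        "2019-01-29T00:00:00Z/2019-01-30T00:00:00Z"
      ]
    },
    ...
    "firehose": {
      "filter": "test_data_2019-01-29.csv",
      "baseDir": "/my/path/to/my/data/",
      "type": "local"
    }

With that modified spec the task fails and I get the following error (full log attached):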

    2019-02-07T04:28:35,493 INFO [main] com.sun.jersey.server.impl.application.WebApplicationImpl - Initiating Jersey application, version 'Jersey: 1.19.3 10/24/2016 03:43 PM'
    2019-02-07T04:28:35,510 INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.firehose.LocalFirehoseFactory - Initialized with [/my/path/to/my/data/test_data_2019-01-29.csv] files
    2019-02-07T04:28:35,512 INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Loading sinks from[var/druid/task/index_TestDataSource_2019-02-07T04:28:31.521Z/work/persist]:
    2019-02-07T04:28:35,528 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.actions.RemoteTaskActionClient - Performing action for task[index_TestDataSource_2019-02-07T04:28:31.521Z]: SegmentAllocateAction{dataSource='TestDataSource', timestamp=2019-01-29T03:00:05.000Z, queryGranularity={type=period, period=PT1M, timeZone=UTC, origin=null}, preferredSegmentGranularity={type=period, period=P1D, timeZone=UTC, origin=null}, sequenceName='index_TestDataSource_2019-02-07T04:28:31.521Z', previousSegmentId='null', skipSegmentLineageCheck='false'}
    2019-02-07T04:28:35,531 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.actions.RemoteTaskActionClient - Submitting action for task[index_TestDataSource_2019-02-07T04:28:31.521Z] to overlord: [SegmentAllocateAction{dataSource='TestDataSource', timestamp=2019-01-29T03:00:05.000Z, queryGranularity={type=period, period=PT1M, timeZone=UTC, origin=null}, preferredSegmentGranularity={type=period, period=P1D, timeZone=UTC, origin=null}, sequenceName='index_TestDataSource_2019-02-07T04:28:31.521Z', previousSegmentId='null', skipSegmentLineageCheck='false'}].
    2019-02-07T04:28:35,544 WARN [task-runner-0-priority-0] org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver - Cannot allocate segment for timestamp[2019-01-29T03:00:05.000Z], sequenceName[index_TestDataSource_2019-02-07T04:28:31.521Z].
    2019-02-07T04:28:35,545 INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Shutting down...
    2019-02-07T04:28:35,548 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.common.task.IndexTask - Encountered exception in BUILD_SEGMENTS.
    org.apache.druid.java.util.common.ISE: Failed to add a row with timestamp[2019-01-29T03:00:05.000Z]
        at org.apache.druid.indexing.common.task.IndexTask.generateAndPublishSegments(IndexTask.java:1046) ~[druid-indexing-service-0.13.0-incubating.jar:0.13.0-incubating]
        at org.apache.druid.indexing.common.task.IndexTask.run(IndexTask.java:466) [druid-indexing-service-0.13.0-incubating.jar:0.13.0-incubating]
        at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:421) [druid-indexing-service-0.13.0-incubating.jar:0.13.0-incubating]
        at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:393) [druid-indexing-service-0.13.0-incubating.jar:0.13.0-incubating]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_201]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_201]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_201]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
    2019-02-07T04:28:35,558 INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.firehose.ServiceAnnouncingChatHandlerProvider - Unregistering chat handler[index_TestDataSource_2019-02-07T04:28:31.521Z]
    2019-02-07T04:28:35,558 INFO [task-runner-0-priority-0] org.apache.druid.indexing.overlord.TaskRunnerUtils - Task [index_TestDataSource_2019-02-07T04:28:31.521Z] status changed to [FAILED].
    2019-02-07T04:28:35,560 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
      "id" : "index_TestDataSource_2019-02-07T04:28:31.521Z",
      "status" : "FAILED",
      "duration" : 204,
      "errorMsg" : "org.apache.druid.java.util.common.ISE: Failed to add a row with timestamp[2019-01-29T03:00:05.000Z]\n..."
    }

Can anybody help me to find out why I get this error?

Thank you

Sergio

Sorry, I forgot to attach the full log. Please find the log in the attachments.

indexing_error.log (81.5 KB)

Hi Sergio,

I think this might be because keepSegmentGranularity = false. When it's false, Druid ignores the time chunks of the existing segments and merges them across time chunks. As a result, the compacted segments end up with an arbitrary segmentGranularity, and you would need to adjust the segmentGranularity of your appending tasks after compaction.

I would recommend setting keepSegmentGranularity = true instead. Note that this option is already deprecated and will be replaced by a 'segmentGranularity' option for compactionTask, which will compact segments into a new segmentGranularity.
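For example, the same compaction task you posted with only that flag changed would look something like this:

    {
      "type": "compact",
      "dataSource": "TestDataSource",
      "interval": "2019-01-27/2019-01-30",
      "keepSegmentGranularity": true,
      "tuningConfig": {
        "type": "index",
        "targetPartitionSize": 20000000,
        "maxRowsInMemory": 25000,
        "forceExtendableShardSpecs": true
      }
    }

With keepSegmentGranularity = true, the compacted segments keep their original day time chunks, so your existing appending spec with segmentGranularity = day should continue to work without changes.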

Jihoon