Unable to submit batch ingestion task

We are testing Druid for a potential production implementation where users would query a 44-billion-row dataset, so for the POC we have one master node, one query node, and two data nodes. I am trying to load the 44b-row dataset from S3. On the first attempt, the 4 ingestion sub-tasks plus the index_parallel task just sat there for more than 5-6 hours, so I had to kill them (killing is another painful process). After that, when I create a spec in the UI, as soon as I hit submit, the submit button is disabled and the cursor turns into a stop sign (red circle with a line through it). I can't understand why it's not letting me submit another batch ingestion task.

Any help would be appreciated.

The nodes are configured exactly as defined in the deployment document.
I also started a compaction process for another datasource, and I can see its tasks keep coming up and completing successfully.

Hi ranjan,

In the interest of time, I would suggest explicitly specifying a smaller interval and running ingestion on a subset of the data. Once this is successful, and the cluster is smoke tested, we can troubleshoot the larger issue of why the full ingestion is not running as expected. If you have the logs for the original failed run, could you paste the log for the index_parallel task?
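For example, something like this in the granularitySpec would limit the run to a single week (a sketch only; the interval value is a placeholder to adjust to your data):

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "hour",
  "queryGranularity": "none",
  "rollup": false,
  "intervals": ["2020-03-01/2020-03-08"]
}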

If the UI is not letting you submit the task, there is probably an issue with the ingestion spec (JSON). You could paste the spec here for us to review.

Thanks!

Hi Vijeth,
Thank you for your response. After waiting overnight, it let me submit the ingestion, but it failed immediately, whereas the UI shows it as pending.
Please find the details below.

Status

{
  "id": "index_parallel_train_data2_glibjihh_2022-08-09T13:02:19.456Z",
  "groupId": "index_parallel_train_data2_glibjihh_2022-08-09T13:02:19.456Z",
  "type": "index_parallel",
  "createdTime": "2022-08-09T13:02:19.462Z",
  "queueInsertionTime": "1970-01-01T00:00:00.000Z",
  "statusCode": "RUNNING",
  "status": "RUNNING",
  "runnerStatusCode": "PENDING",
  "duration": -1,
  "location": {
    "host": null,
    "port": -1,
    "tlsPort": -1
  },
  "dataSource": "train_data2",
  "errorMsg": null
}

Request failed with status code 404

Design spec and payload from the task
Design Spec
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "prefixes": [
          "s3://plt-dswb-bipoc-devfl-use1-s3/gpe_prod/train_data2/train_data/"
        ]
      },
      "inputFormat": {
        "type": "parquet"
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": {
        "type": "dynamic"
      },
      "maxNumConcurrentSubTasks": 2
    },
    "dataSchema": {
      "dataSource": "train_data2",
      "timestampSpec": {
        "column": "DATE_HOUR",
        "format": "millis"
      },
      "granularitySpec": {
        "queryGranularity": "all",
        "rollup": false,
        "segmentGranularity": "hour"
      },
      "dimensionsSpec": {
        "dimensionExclusions": []
      }
    }
  }
}

Payload
{
  "type": "index_parallel",
  "id": "index_parallel_train_data2_glibjihh_2022-08-09T13:02:19.456Z",
  "groupId": "index_parallel_train_data2_glibjihh_2022-08-09T13:02:19.456Z",
  "resource": {
    "availabilityGroup": "index_parallel_train_data2_glibjihh_2022-08-09T13:02:19.456Z",
    "requiredCapacity": 1
  },
  "spec": {
    "dataSchema": {
      "dataSource": "train_data2",
      "timestampSpec": {
        "column": "DATE_HOUR",
        "format": "millis",
        "missingValue": null
      },
      "dimensionsSpec": {
        "dimensions": [],
        "dimensionExclusions": [
          "__time",
          "DATE_HOUR"
        ],
        "includeAllDimensions": false
      },
      "metricsSpec": [],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": {
          "type": "all"
        },
        "rollup": false,
        "intervals": []
      },
      "transformSpec": {
        "filter": null,
        "transforms": []
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "uris": null,
        "prefixes": [
          "s3://plt-dswb-bipoc-devfl-use1-s3/gpe_prod/train_data2/train_data/"
        ],
        "objects": null,
        "properties": null
      },
      "inputFormat": {
        "type": "parquet",
        "flattenSpec": null,
        "binaryAsString": false
      },
      "appendToExisting": false,
      "dropExisting": false
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsPerSegment": 5000000,
      "appendableIndexSpec": {
        "type": "onheap",
        "preserveExistingMetrics": false
      },
      "maxRowsInMemory": 1000000,
      "maxBytesInMemory": 0,
      "skipBytesInMemoryOverheadCheck": false,
      "maxTotalRows": null,
      "numShards": null,
      "splitHintSpec": null,
      "partitionsSpec": {
        "type": "dynamic",
        "maxRowsPerSegment": 5000000,
        "maxTotalRows": null
      },
      "indexSpec": {
        "bitmap": {
          "type": "roaring",
          "compressRunOnSerialization": true
        },
        "dimensionCompression": "lz4",
        "metricCompression": "lz4",
        "longEncoding": "longs",
        "segmentLoader": null
      },
      "indexSpecForIntermediatePersists": {
        "bitmap": {
          "type": "roaring",
          "compressRunOnSerialization": true
        },
        "dimensionCompression": "lz4",
        "metricCompression": "lz4",
        "longEncoding": "longs",
        "segmentLoader": null
      },
      "maxPendingPersists": 0,
      "forceGuaranteedRollup": false,
      "reportParseExceptions": false,
      "pushTimeout": 0,
      "segmentWriteOutMediumFactory": null,
      "maxNumConcurrentSubTasks": 2,
      "maxRetry": 3,
      "taskStatusCheckPeriodMs": 1000,
      "chatHandlerTimeout": "PT10S",
      "chatHandlerNumRetries": 5,
      "maxNumSegmentsToMerge": 100,
      "totalNumMergeTasks": 10,
      "logParseExceptions": false,
      "maxParseExceptions": 2147483647,
      "maxSavedParseExceptions": 0,
      "maxColumnsToMerge": -1,
      "awaitSegmentAvailabilityTimeoutMillis": 0,
      "maxAllowedLockCount": -1,
      "partitionDimensions": []
    }
  },
  "context": {
    "forceTimeChunkLock": true,
    "useLineageBasedSegmentAllocation": true
  },
  "dataSource": "train_data2"
}

I wanted to attach the log but couldn't find a way to attach it.

Now the earlier ingestion task shows as pending and the new one as waiting, even though the log shows it failed.
How do we clean up these tasks? It looks like until they are gone, I can't submit another ingestion task.

You are using 2 sub-tasks for this ingestion, but how many task slots do you have between the 2 data nodes in your cluster?
Is it possible for you to share the Services tab of your Druid console?

As for attaching the logs, you'll need to paste the error message here; you won't be able to upload the complete log file (it can run to a few MBs).

2 MiddleManagers, each with 4 slots. Right now it's showing 1 of 4 slots used on one of the MiddleManagers.

Last few lines of the log:

2022-08-09T00:40:25,950 INFO [main] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Running with task: {
“type” : “index_parallel”,
“id” : “index_parallel_train_data_pmodeago_2022-08-08T19:10:56.810Z”,
“groupId” : “index_parallel_train_data_pmodeago_2022-08-08T19:10:56.810Z”,
“resource” : {
“availabilityGroup” : “index_parallel_train_data_pmodeago_2022-08-08T19:10:56.810Z”,
“requiredCapacity” : 1
},
“spec” : {
“dataSchema” : {
“dataSource” : “train_data”,
“timestampSpec” : {
“column” : “DATE_HOUR”,
“format” : “millis”,
“missingValue” : null
},
“dimensionsSpec” : {
“dimensions” : [ {
“type” : “string”,
“name” : “BUSNAME”,
“multiValueHandling” : “SORTED_ARRAY”,
“createBitmapIndex” : true
}, {
“type” : “long”,
“name” : “WIND”,
“multiValueHandling” : “SORTED_ARRAY”,
“createBitmapIndex” : false
}, {
“type” : “long”,
“name” : “MONTH”,
“multiValueHandling” : “SORTED_ARRAY”,
“createBitmapIndex” : false
}, {
“type” : “string”,
“name” : “ISO”,
“multiValueHandling” : “SORTED_ARRAY”,
“createBitmapIndex” : true
}, {
“type” : “long”,
“name” : “HOUR”,
“multiValueHandling” : “SORTED_ARRAY”,
“createBitmapIndex” : false
}, {
“type” : “string”,
“name” : “ZONE”,
“multiValueHandling” : “SORTED_ARRAY”,
“createBitmapIndex” : true
}, {
“type” : “double”,
“name” : “ZONAL_PRICE”,
“multiValueHandling” : “SORTED_ARRAY”,
“createBitmapIndex” : false
}, {
“type” : “long”,
“name” : “load”,
“multiValueHandling” : “SORTED_ARRAY”,
“createBitmapIndex” : false
}, {
“type” : “long”,
“name” : “YEAR”,
“multiValueHandling” : “SORTED_ARRAY”,
“createBitmapIndex” : false
}, {
“type” : “double”,
“name” : “LMP_BASIS”,
“multiValueHandling” : “SORTED_ARRAY”,
“createBitmapIndex” : false
}, {
“type” : “double”,
“name” : “GAS”,
“multiValueHandling” : “SORTED_ARRAY”,
“createBitmapIndex” : false
} ],
“dimensionExclusions” : [ “__time”, “DATE_HOUR” ],
“includeAllDimensions” : false
},
"metricsSpec" : [ ],
“granularitySpec” : {
“type” : “uniform”,
“segmentGranularity” : “HOUR”,
“queryGranularity” : “HOUR”,
“rollup” : false,
"intervals" : [ ]
},
“transformSpec” : {
“filter” : null,
"transforms" : [ ]
}
},
“ioConfig” : {
“type” : “index_parallel”,
“inputSource” : {
“type” : “s3”,
“uris” : null,
“prefixes” : [ “s3://plt-dswb-bipoc-devfl-use1-s3/gpe_prod/train_data/” ],
“objects” : null,
“properties” : null
},
“inputFormat” : {
“type” : “parquet”,
“flattenSpec” : null,
“binaryAsString” : false
},
“appendToExisting” : false,
“dropExisting” : false
},
“tuningConfig” : {
“type” : “index_parallel”,
“maxRowsPerSegment” : 5000000,
“appendableIndexSpec” : {
“type” : “onheap”,
“preserveExistingMetrics” : false
},
“maxRowsInMemory” : 1000000,
“maxBytesInMemory” : 0,
“skipBytesInMemoryOverheadCheck” : false,
“maxTotalRows” : null,
“numShards” : null,
“splitHintSpec” : null,
“partitionsSpec” : {
“type” : “dynamic”,
“maxRowsPerSegment” : 5000000,
“maxTotalRows” : null
},
“indexSpec” : {
“bitmap” : {
“type” : “roaring”,
“compressRunOnSerialization” : true
},
“dimensionCompression” : “lz4”,
“metricCompression” : “lz4”,
“longEncoding” : “longs”,
“segmentLoader” : null
},
“indexSpecForIntermediatePersists” : {
“bitmap” : {
“type” : “roaring”,
“compressRunOnSerialization” : true
},
“dimensionCompression” : “lz4”,
“metricCompression” : “lz4”,
“longEncoding” : “longs”,
“segmentLoader” : null
},
“maxPendingPersists” : 0,
“forceGuaranteedRollup” : false,
“reportParseExceptions” : false,
“pushTimeout” : 1000,
“segmentWriteOutMediumFactory” : null,
“maxNumConcurrentSubTasks” : 4,
“maxRetry” : 3,
“taskStatusCheckPeriodMs” : 1000,
“chatHandlerTimeout” : “PT10S”,
“chatHandlerNumRetries” : 5,
“maxNumSegmentsToMerge” : 100,
“totalNumMergeTasks” : 10,
“logParseExceptions” : false,
“maxParseExceptions” : 2147483647,
“maxSavedParseExceptions” : 0,
“maxColumnsToMerge” : -1,
“awaitSegmentAvailabilityTimeoutMillis” : 0,
“maxAllowedLockCount” : -1,
"partitionDimensions" : [ ]
}
},
“context” : {
“forceTimeChunkLock” : true,
“useLineageBasedSegmentAllocation” : true
},
“dataSource” : “train_data”
}
2022-08-09T00:40:25,951 INFO [main] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Attempting to lock file[var/druid/task/index_parallel_train_data_pmodeago_2022-08-08T19:10:56.810Z/lock].
2022-08-09T00:40:25,955 INFO [main] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Acquired lock file[var/druid/task/index_parallel_train_data_pmodeago_2022-08-08T19:10:56.810Z/lock] in 4ms.
2022-08-09T00:40:25,957 INFO [main] org.apache.druid.indexing.common.task.AbstractBatchIndexTask - forceTimeChunkLock[true] or isDropExisting[false] is set to true. Use timeChunk lock
2022-08-09T00:40:25,958 INFO [main] org.apache.druid.segment.loading.SegmentLocalCacheManager - Using storage location strategy: [LeastBytesUsedStorageLocationSelectorStrategy]
2022-08-09T00:40:25,963 INFO [task-runner-0-priority-0] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Running task: index_parallel_train_data_pmodeago_2022-08-08T19:10:56.810Z
2022-08-09T00:40:25,966 WARN [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask - Intervals are missing in granularitySpec while this task is potentially overwriting existing segments. Forced to use timeChunk lock.
2022-08-09T00:40:25,968 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Starting lifecycle [module] stage [SERVER]
2022-08-09T00:40:25,971 INFO [main] org.eclipse.jetty.server.Server - jetty-9.4.40.v20210413; built: 2021-04-13T20:42:42.668Z; git: b881a572662e1943a14ae12e7e1207989f218b74; jvm 11.0.13+8-LTS
2022-08-09T00:40:26,023 INFO [main] org.eclipse.jetty.server.session - DefaultSessionIdManager workerName=node0
2022-08-09T00:40:26,023 INFO [main] org.eclipse.jetty.server.session - No SessionScavenger set, using defaults
2022-08-09T00:40:26,025 INFO [main] org.eclipse.jetty.server.session - node0 Scavenging every 660000ms
2022-08-09T00:40:26,168 INFO [main] com.sun.jersey.server.impl.application.WebApplicationImpl - Initiating Jersey application, version ‘Jersey: 1.19.4 05/24/2017 03:20 PM’
2022-08-09T00:40:26,686 WARN [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner - There’s no input split to process
2022-08-09T00:40:26,688 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask - Published [0] segments
2022-08-09T00:40:26,715 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
“id” : “index_parallel_train_data_pmodeago_2022-08-08T19:10:56.810Z”,
“status” : “SUCCESS”,
“duration” : 748,
“errorMsg” : null,
“location” : {
“host” : null,
“port” : -1,
“tlsPort” : -1
}
}
2022-08-09T00:40:26,906 INFO [main] org.eclipse.jetty.server.handler.ContextHandler - Started o.e.j.s.ServletContextHandler@68e62b3b{/,null,AVAILABLE}
2022-08-09T00:40:26,920 INFO [main] org.eclipse.jetty.server.AbstractConnector - Started ServerConnector@629dfb5a{HTTP/1.1, (http/1.1)}{0.0.0.0:8100}
2022-08-09T00:40:26,921 INFO [main] org.eclipse.jetty.server.Server - Started @6651ms
2022-08-09T00:40:26,921 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Starting lifecycle [module] stage [ANNOUNCEMENTS]
2022-08-09T00:40:26,922 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Successfully started lifecycle [module]
2022-08-09T00:40:26,928 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [ANNOUNCEMENTS]
2022-08-09T00:40:26,932 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [SERVER]
2022-08-09T00:40:26,942 INFO [main] org.eclipse.jetty.server.AbstractConnector - Stopped ServerConnector@629dfb5a{HTTP/1.1, (http/1.1)}{0.0.0.0:8100}
2022-08-09T00:40:26,942 INFO [main] org.eclipse.jetty.server.session - node0 Stopped scavenging
2022-08-09T00:40:26,944 INFO [main] org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@68e62b3b{/,null,STOPPED}
2022-08-09T00:40:26,948 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [NORMAL]
2022-08-09T00:40:26,949 INFO [main] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Starting graceful shutdown of task[index_parallel_train_data_pmodeago_2022-08-08T19:10:56.810Z].
2022-08-09T00:40:26,954 INFO [main] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner - Cleaning up resources
2022-08-09T00:40:26,973 INFO [LookupExtractorFactoryContainerProvider-MainThread] org.apache.druid.query.lookup.LookupReferencesManager - Lookup Management loop exited. Lookup notices are not handled anymore.
2022-08-09T00:40:26,980 INFO [Curator-Framework-0] org.apache.curator.framework.imps.CuratorFrameworkImpl - backgroundOperationsLoop exiting
2022-08-09T00:40:27,086 INFO [main] org.apache.zookeeper.ZooKeeper - Session: 0x100000b754d0008 closed
2022-08-09T00:40:27,086 INFO [main-EventThread] org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x100000b754d0008
2022-08-09T00:40:27,094 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [INIT]
Finished peon task

Hi ranjan

The logs say that the task succeeded even though it did not publish any segments. Was the ingestion spec you posted written manually, or was it created through the UI? I can see that some key parameters are not defined and are therefore reverting to their default values.

It looks like you are trying to ingest Parquet files. Have you added the Parquet extension to the loadList?
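For reference, extensions are enabled via druid.extensions.loadList in the _common common.runtime.properties file; a sketch (your actual list will likely contain more entries):

druid.extensions.loadList=["druid-s3-extensions", "druid-parquet-extensions"]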

Yes, druid-parquet-extensions is in the _common runtime properties file.
The spec was created in the UI with default values.

One thing I have observed is that there is an index_parallel task stuck in the waiting state, and I am trying to kill it by following the tutorial. I believe once this is done we can start another load.

Here is what I did to kill it, but the curl API command just hangs:

deletion-kill.json:

{
  "type": "kill",
  "dataSource": "train_data",
  "interval": "2020-03-01T05:00:00.000Z/2022-09-02T06:00:00.000Z"
}

curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/deletion-kill.json http://localhost:8081/druid/indexer/v1/task2
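(Side note: a kill task deletes segments for an interval; to stop the stuck task itself, the Overlord also exposes a task shutdown endpoint, roughly like this sketch with the actual task ID substituted for the placeholder:

curl -X POST http://localhost:8081/druid/indexer/v1/task/<taskId>/shutdown
)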

And this keeps appearing in coordinator-overlord.log, and in the UI the task is in the waiting state even though the log shows the peon task completed successfully:

2022-08-09T15:19:17,066 INFO [rtr-pending-tasks-runner-0] org.apache.druid.indexing.overlord.MetadataTaskStorage - Deleting TaskLock with id[17072]: TimeChunkLock{type=EXCLUSIVE, groupId=‘index_parallel_train_data_pmodeago_2022-08-08T19:10:56.810Z’, dataSource=‘train_data’, interval=2020-03-30T07:00:00.000Z/2020-03-30T08:00:00.000Z, version=‘2022-08-08T19:11:44.435Z’, priority=50, revoked=false}

2022-08-09T15:19:17,067 INFO [rtr-pending-tasks-runner-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_parallel_train_data_pmodeago_2022-08-08T19:10:56.810Z] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId=‘index_parallel_train_data_pmodeago_2022-08-08T19:10:56.810Z’, dataSource=‘train_data’, interval=2020-03-30T08:00:00.000Z/2020-03-30T09:00:00.000Z, version=‘2022-08-08T19:11:49.918Z’, priority=50, revoked=false}]

2022-08-09T15:19:17,067 INFO [rtr-pending-tasks-runner-0] org.apache.druid.indexing.overlord.TaskLockbox - TaskLock is now empty: TimeChunkLock{type=EXCLUSIVE, groupId=‘index_parallel_train_data_pmodeago_2022-08-08T19:10:56.810Z’, dataSource=‘train_data’, interval=2020-03-30T08:00:00.000Z/2020-03-30T09:00:00.000Z, version=‘2022-08-08T19:11:49.918Z’, priority=50, revoked=false}

2022-08-09T15:19:17,957 INFO [rtr-pending-tasks-runner-0] org.apache.druid.indexing.overlord.MetadataTaskStorage - Deleting TaskLock with id[17434]: TimeChunkLock{type=EXCLUSIVE, groupId=‘index_parallel_train_data_pmodeago_2022-08-08T19:10:56.810Z’, dataSource=‘train_data’, interval=2020-03-30T08:00:00.000Z/2020-03-30T09:00:00.000Z, version=‘2022-08-08T19:11:49.918Z’, priority=50, revoked=false}

2022-08-09T15:19:17,958 INFO [rtr-pending-tasks-runner-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_parallel_train_data_pmodeago_2022-08-08T19:10:56.810Z] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId=‘index_parallel_train_data_pmodeago_2022-08-08T19:10:56.810Z’, dataSource=‘train_data’, interval=2020-03-30T09:00:00.000Z/2020-03-30T10:00:00.000Z, version=‘2022-08-08T19:11:49.747Z’, priority=50, revoked=false}]

2022-08-09T15:19:17,958 INFO [rtr-pending-tasks-runner-0] org.apache.druid.indexing.overlord.TaskLockbox - TaskLock is now empty: TimeChunkLock{type=EXCLUSIVE, groupId=‘index_parallel_train_data_pmodeago_2022-08-08T19:10:56.810Z’, dataSource=‘train_data’, interval=2020-03-30T09:00:00.000Z/2020-03-30T10:00:00.000Z, version=‘2022-08-08T19:11:49.747Z’, priority=50, revoked=false}

I don't see any segments for the datasource. My deep storage is S3, and there are no segments there either, so I really can't run the disable-segments-by-interval task.
It looks like it runs every second; in the log the lock interval keeps advancing second by second.

Hi ranjan,

I think we need to take a step back and start by confirming that your cluster is set up correctly. What deployment type are you using?

Also, can we confirm that you are able to ingest sample data, such as the Wikipedia dataset from the tutorial?

Once we have this, we can move on to troubleshooting the S3 issue.

Hi Vijeth,
We have successfully ingested two datasources from S3; one in fact has 240 million rows.
The one I have been trying has 1.8 billion rows. The curl command was able to kill the task, and now it's letting me submit new ingestion tasks.
The 1.8b-row dataset is split across multiple files, so I am trying to ingest a few Parquet files first (the smoke testing you suggested) before moving on to the whole dataset. Each Parquet file is about 130 MB. While doing so I keep getting an out-of-memory error.
JVM heap is 8 GB, maxRowsInMemory is 1 million, maxBytesInMemory is the default (1/6 of max JVM memory).
I have tried many options from posts found online, but nothing seems to work for even one file.

2022-08-09T18:39:33,953 INFO [main] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Running with task: {
“type” : “index_parallel”,
“id” : “index_parallel_train_data-part-00000-tid-8499798162450421515-25502b5c-ee11-4339-8eb9-017fd98101c0-2336-1-c000_nkddnpfn_2022-08-09T18:39:29.550Z”,
“groupId” : “index_parallel_train_data-part-00000-tid-8499798162450421515-25502b5c-ee11-4339-8eb9-017fd98101c0-2336-1-c000_nkddnpfn_2022-08-09T18:39:29.550Z”,
“resource” : {
“availabilityGroup” : “index_parallel_train_data-part-00000-tid-8499798162450421515-25502b5c-ee11-4339-8eb9-017fd98101c0-2336-1-c000_nkddnpfn_2022-08-09T18:39:29.550Z”,
“requiredCapacity” : 1
},
“spec” : {
“dataSchema” : {
“dataSource” : “train_data-part-00000-tid-8499798162450421515-25502b5c-ee11-4339-8eb9-017fd98101c0-2336-1-c000”,
“timestampSpec” : {
“column” : “DATE_HOUR”,
“format” : “millis”,
“missingValue” : null
},
“dimensionsSpec” : {
“dimensions” : [ {
“type” : “long”,
“name” : “WIND”,
“multiValueHandling” : “SORTED_ARRAY”,
“createBitmapIndex” : false
}, {
“type” : “long”,
“name” : “MONTH”,
“multiValueHandling” : “SORTED_ARRAY”,
“createBitmapIndex” : false
}, {
“type” : “string”,
“name” : “ISO”,
“multiValueHandling” : “SORTED_ARRAY”,
“createBitmapIndex” : true
}, {
“type” : “long”,
“name” : “HOUR”,
“multiValueHandling” : “SORTED_ARRAY”,
“createBitmapIndex” : false
}, {
“type” : “string”,
“name” : “ZONE”,
“multiValueHandling” : “SORTED_ARRAY”,
“createBitmapIndex” : true
}, {
“type” : “double”,
“name” : “ZONAL_PRICE”,
“multiValueHandling” : “SORTED_ARRAY”,
“createBitmapIndex” : false
}, {
“type” : “long”,
“name” : “load”,
“multiValueHandling” : “SORTED_ARRAY”,
“createBitmapIndex” : false
}, {
“type” : “long”,
“name” : “YEAR”,
“multiValueHandling” : “SORTED_ARRAY”,
“createBitmapIndex” : false
}, {
“type” : “double”,
“name” : “LMP_BASIS”,
“multiValueHandling” : “SORTED_ARRAY”,
“createBitmapIndex” : false
}, {
“type” : “string”,
“name” : “BUSNAME”,
“multiValueHandling” : “SORTED_ARRAY”,
“createBitmapIndex” : true
}, {
“type” : “double”,
“name” : “GAS”,
“multiValueHandling” : “SORTED_ARRAY”,
“createBitmapIndex” : false
} ],
“dimensionExclusions” : [ “__time”, “DATE_HOUR” ],
“includeAllDimensions” : false
},
"metricsSpec" : [ ],
“granularitySpec” : {
“type” : “uniform”,
“segmentGranularity” : “HOUR”,
“queryGranularity” : {
“type” : “none”
},
“rollup” : false,
"intervals" : [ ]
},
“transformSpec” : {
“filter” : null,
"transforms" : [ ]
}
},
“ioConfig” : {
“type” : “index_parallel”,
“inputSource” : {
“type” : “s3”,
“uris” : [ “s3://plt-dswb-bipoc-devfl-use1-s3/gpe_prod/train_data2/train_data/part-00000-tid-8499798162450421515-25502b5c-ee11-4339-8eb9-017fd98101c0-2336-1-c000.snappy.parquet” ],
“prefixes” : null,
“objects” : null,
“properties” : null
},
“inputFormat” : {
“type” : “parquet”,
“flattenSpec” : null,
“binaryAsString” : false
},
“appendToExisting” : false,
“dropExisting” : false
},
“tuningConfig” : {
“type” : “index_parallel”,
“maxRowsPerSegment” : 5000000,
“appendableIndexSpec” : {
“type” : “onheap”,
“preserveExistingMetrics” : false
},
“maxRowsInMemory” : 1000000,
“maxBytesInMemory” : 0,
“skipBytesInMemoryOverheadCheck” : false,
“maxTotalRows” : null,
“numShards” : null,
“splitHintSpec” : null,
“partitionsSpec” : {
“type” : “dynamic”,
“maxRowsPerSegment” : 5000000,
“maxTotalRows” : null
},
“indexSpec” : {
“bitmap” : {
“type” : “roaring”,
“compressRunOnSerialization” : true
},
“dimensionCompression” : “lz4”,
“metricCompression” : “lz4”,
“longEncoding” : “longs”,
“segmentLoader” : null
},
“indexSpecForIntermediatePersists” : {
“bitmap” : {
“type” : “roaring”,
“compressRunOnSerialization” : true
},
“dimensionCompression” : “lz4”,
“metricCompression” : “lz4”,
“longEncoding” : “longs”,
“segmentLoader” : null
},
“maxPendingPersists” : 0,
“forceGuaranteedRollup” : false,
“reportParseExceptions” : false,
“pushTimeout” : 0,
“segmentWriteOutMediumFactory” : null,
“maxNumConcurrentSubTasks” : 1,
“maxRetry” : 1,
“taskStatusCheckPeriodMs” : 1000,
“chatHandlerTimeout” : “PT10S”,
“chatHandlerNumRetries” : 5,
“maxNumSegmentsToMerge” : 100,
“totalNumMergeTasks” : 10,
“logParseExceptions” : false,
“maxParseExceptions” : 2147483647,
“maxSavedParseExceptions” : 0,
“maxColumnsToMerge” : -1,
“awaitSegmentAvailabilityTimeoutMillis” : 0,
“maxAllowedLockCount” : -1,
"partitionDimensions" : [ ]
}
},
“context” : {
“forceTimeChunkLock” : true,
“useLineageBasedSegmentAllocation” : true
},
“dataSource” : “train_data-part-00000-tid-8499798162450421515-25502b5c-ee11-4339-8eb9-017fd98101c0-2336-1-c000”
}
2022-08-09T18:39:33,954 INFO [main] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Attempting to lock file[var/druid/task/index_parallel_train_data-part-00000-tid-8499798162450421515-25502b5c-ee11-4339-8eb9-017fd98101c0-2336-1-c000_nkddnpfn_2022-08-09T18:39:29.550Z/lock].
2022-08-09T18:39:33,959 INFO [main] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Acquired lock file[var/druid/task/index_parallel_train_data-part-00000-tid-8499798162450421515-25502b5c-ee11-4339-8eb9-017fd98101c0-2336-1-c000_nkddnpfn_2022-08-09T18:39:29.550Z/lock] in 4ms.
2022-08-09T18:39:33,961 INFO [main] org.apache.druid.indexing.common.task.AbstractBatchIndexTask - forceTimeChunkLock[true] or isDropExisting[false] is set to true. Use timeChunk lock
2022-08-09T18:39:33,962 INFO [main] org.apache.druid.segment.loading.SegmentLocalCacheManager - Using storage location strategy: [LeastBytesUsedStorageLocationSelectorStrategy]
2022-08-09T18:39:33,968 INFO [task-runner-0-priority-0] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Running task: index_parallel_train_data-part-00000-tid-8499798162450421515-25502b5c-ee11-4339-8eb9-017fd98101c0-2336-1-c000_nkddnpfn_2022-08-09T18:39:29.550Z
2022-08-09T18:39:33,971 WARN [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask - Intervals are missing in granularitySpec while this task is potentially overwriting existing segments. Forced to use timeChunk lock.
2022-08-09T18:39:33,971 WARN [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask - maxNumConcurrentSubTasks[1] is less than or equal to 1. Running sequentially. Please set maxNumConcurrentSubTasks to something higher than 1 if you want to run in parallel ingestion mode.
2022-08-09T18:39:33,973 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Starting lifecycle [module] stage [SERVER]
2022-08-09T18:39:33,974 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.AbstractBatchIndexTask - forceTimeChunkLock[true] or isDropExisting[false] is set to true. Use timeChunk lock
2022-08-09T18:39:33,974 WARN [task-runner-0-priority-0] org.apache.druid.indexing.common.task.IndexTask - Chat handler is already registered. Skipping chat handler registration.
2022-08-09T18:39:33,976 INFO [main] org.eclipse.jetty.server.Server - jetty-9.4.40.v20210413; built: 2021-04-13T20:42:42.668Z; git: b881a572662e1943a14ae12e7e1207989f218b74; jvm 11.0.13+8-LTS
2022-08-09T18:39:33,982 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.IndexTask - Determining intervals and shardSpecs
2022-08-09T18:39:34,012 INFO [main] org.eclipse.jetty.server.session - DefaultSessionIdManager workerName=node0
2022-08-09T18:39:34,012 INFO [main] org.eclipse.jetty.server.session - No SessionScavenger set, using defaults
2022-08-09T18:39:34,013 INFO [main] org.eclipse.jetty.server.session - node0 Scavenging every 600000ms
2022-08-09T18:39:34,137 INFO [main] com.sun.jersey.server.impl.application.WebApplicationImpl - Initiating Jersey application, version ‘Jersey: 1.19.4 05/24/2017 03:20 PM’
2022-08-09T18:39:34,609 INFO [main] org.eclipse.jetty.server.handler.ContextHandler - Started o.e.j.s.ServletContextHandler@57063e08{/,null,AVAILABLE}
2022-08-09T18:39:34,622 INFO [main] org.eclipse.jetty.server.AbstractConnector - Started ServerConnector@53c21c05{HTTP/1.1, (http/1.1)}{0.0.0.0:8100}
2022-08-09T18:39:34,622 INFO [main] org.eclipse.jetty.server.Server - Started @4521ms
2022-08-09T18:39:34,623 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Starting lifecycle [module] stage [ANNOUNCEMENTS]
2022-08-09T18:39:34,623 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Successfully started lifecycle [module]
2022-08-09T18:39:36,779 INFO [task-runner-0-priority-0] org.apache.parquet.hadoop.InternalParquetRecordReader - RecordReader initialized will read a total of 4122840 records.
2022-08-09T18:39:36,779 INFO [task-runner-0-priority-0] org.apache.parquet.hadoop.InternalParquetRecordReader - at row 0. reading next block
2022-08-09T18:39:36,907 INFO [task-runner-0-priority-0] org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor [.snappy]
2022-08-09T18:39:36,920 INFO [task-runner-0-priority-0] org.apache.parquet.hadoop.InternalParquetRecordReader - block read in memory in 141 ms. row count = 4122840
2022-08-09T18:39:43,795 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.IndexTask - Found intervals and shardSpecs in 9,812ms
2022-08-09T18:39:43,805 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.AbstractBatchIndexTask - forceTimeChunkLock[true] or isDropExisting[false] is set to true. Use timeChunk lock
2022-08-09T18:39:45,488 INFO [task-runner-0-priority-0] org.apache.parquet.hadoop.InternalParquetRecordReader - RecordReader initialized will read a total of 4122840 records.
2022-08-09T18:39:45,488 INFO [task-runner-0-priority-0] org.apache.parquet.hadoop.InternalParquetRecordReader - at row 0. reading next block
2022-08-09T18:39:45,567 INFO [task-runner-0-priority-0] org.apache.parquet.hadoop.InternalParquetRecordReader - block read in memory in 79 ms. row count = 4122840
2022-08-09T18:41:32,609 INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Flushing in-memory data to disk because (estimated) bytesCurrentlyInMemory[357915012] is greater than maxBytesInMemory[357913941].
2022-08-09T18:41:32,713 ERROR [task-runner-0-priority-0] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Task has exceeded safe estimated heap usage limits, failing (numSinks: [64527] numHydrantsAcrossAllSinks: [64527] totalRows: [97506])(bytesCurrentlyInMemory: [423216336] - bytesToBePersisted: [35280012] > maxBytesTuningConfig: [357913941]): {class=org.apache.druid.segment.realtime.appenderator.AppenderatorImpl, dataSource=train_data-part-00000-tid-8499798162450421515-25502b5c-ee11-4339-8eb9-017fd98101c0-2336-1-c000}
2022-08-09T18:41:32,790 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.common.task.IndexTask - Encountered exception in BUILD_SEGMENTS.
java.lang.RuntimeException: Task has exceeded safe estimated heap usage limits, failing (numSinks: [64527] numHydrantsAcrossAllSinks: [64527] totalRows: [97506])(bytesCurrentlyInMemory: [423216336] - bytesToBePersisted: [35280012] > maxBytesTuningConfig: [357913941]).
This can occur when the overhead from too many intermediary segment persists becomes to great to have enough space to process additional input rows. This check, along with metering the overhead of these objects to factor into the ‘maxBytesInMemory’ computation, can be disabled by setting ‘skipBytesInMemoryOverheadCheck’ to ‘true’ (note that doing so might allow the task to naturally encounter a ‘java.lang.OutOfMemoryError’). Alternatively, ‘maxBytesInMemory’ can be increased which will cause an increase in heap footprint, but will allow for more intermediary segment persists to occur before reaching this condition.
at org.apache.druid.segment.realtime.appenderator.AppenderatorImpl.add(AppenderatorImpl.java:415) ~[druid-server-0.23.0.jar:0.23.0]
at org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver.append(BaseAppenderatorDriver.java:411) ~[druid-server-0.23.0.jar:0.23.0]
at org.apache.druid.segment.realtime.appenderator.BatchAppenderatorDriver.add(BatchAppenderatorDriver.java:115) ~[druid-server-0.23.0.jar:0.23.0]
at org.apache.druid.indexing.common.task.InputSourceProcessor.process(InputSourceProcessor.java:107) ~[druid-indexing-service-0.23.0.jar:0.23.0]
at org.apache.druid.indexing.common.task.IndexTask.generateAndPublishSegments(IndexTask.java:913) ~[druid-indexing-service-0.23.0.jar:0.23.0]
at org.apache.druid.indexing.common.task.IndexTask.runTask(IndexTask.java:515) [druid-indexing-service-0.23.0.jar:0.23.0]
at org.apache.druid.indexing.common.task.AbstractBatchIndexTask.run(AbstractBatchIndexTask.java:186) [druid-indexing-service-0.23.0.jar:0.23.0]
at org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask.runSequential(ParallelIndexSupervisorTask.java:1131) [druid-indexing-service-0.23.0.jar:0.23.0]
at org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask.runTask(ParallelIndexSupervisorTask.java:504) [druid-indexing-service-0.23.0.jar:0.23.0]
at org.apache.druid.indexing.common.task.AbstractBatchIndexTask.run(AbstractBatchIndexTask.java:186) [druid-indexing-service-0.23.0.jar:0.23.0]
at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:477) [druid-indexing-service-0.23.0.jar:0.23.0]
at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:449) [druid-indexing-service-0.23.0.jar:0.23.0]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
2022-08-09T18:41:32,795 WARN [task-runner-0-priority-0] org.apache.druid.segment.realtime.firehose.ServiceAnnouncingChatHandlerProvider - handler[index_parallel_train_data-part-00000-tid-8499798162450421515-25502b5c-ee11-4339-8eb9-017fd98101c0-2336-1-c000_nkddnpfn_2022-08-09T18:39:29.550Z] not currently registered, ignoring.
2022-08-09T18:41:32,797 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
“id” : “index_parallel_train_data-part-00000-tid-8499798162450421515-25502b5c-ee11-4339-8eb9-017fd98101c0-2336-1-c000_nkddnpfn_2022-08-09T18:39:29.550Z”,
“status” : “FAILED”,
“duration” : 118827,
“errorMsg” : “java.lang.RuntimeException: Task has exceeded safe estimated heap usage limits, failing (numSinks: […”,
“location” : {
“host” : null,
“port” : -1,
“tlsPort” : -1
}
}
2022-08-09T18:41:32,805 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [ANNOUNCEMENTS]
2022-08-09T18:41:32,806 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [SERVER]
2022-08-09T18:41:32,811 INFO [main] org.eclipse.jetty.server.AbstractConnector - Stopped ServerConnector@53c21c05{HTTP/1.1, (http/1.1)}{0.0.0.0:8100}
2022-08-09T18:41:32,811 INFO [main] org.eclipse.jetty.server.session - node0 Stopped scavenging
2022-08-09T18:41:32,812 INFO [main] org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@57063e08{/,null,STOPPED}
2022-08-09T18:41:32,814 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [NORMAL]
2022-08-09T18:41:32,815 INFO [main] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Starting graceful shutdown of task[index_parallel_train_data-part-00000-tid-8499798162450421515-25502b5c-ee11-4339-8eb9-017fd98101c0-2336-1-c000_nkddnpfn_2022-08-09T18:39:29.550Z].
2022-08-09T18:41:32,825 INFO [LookupExtractorFactoryContainerProvider-MainThread] org.apache.druid.query.lookup.LookupReferencesManager - Lookup Management loop exited. Lookup notices are not handled anymore.
2022-08-09T18:41:32,832 INFO [Curator-Framework-0] org.apache.curator.framework.imps.CuratorFrameworkImpl - backgroundOperationsLoop exiting
2022-08-09T18:41:32,936 INFO [main] org.apache.zookeeper.ZooKeeper - Session: 0x1000396403d0007 closed
2022-08-09T18:41:32,936 INFO [main-EventThread] org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x1000396403d0007
2022-08-09T18:41:32,942 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [INIT]
Finished peon task

Thanks, this clears things up quite a bit.

Could you retry with these values in the ingestion spec (maybe with a smaller dataset to help it fail sooner)? There is a combined sketch below the list.

maxNumConcurrentSubTasks: 7
maxBytesInMemory : -1
skipBytesInMemoryOverheadCheck: true

This last one should matter less if you are on the latest version of Druid.
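Put together, the relevant part of the tuningConfig would look roughly like this (a sketch; everything else stays at your current values). The reasoning behind 7: with 2 MiddleManagers x 4 slots you have 8 task slots in total, and the index_parallel supervisor task itself occupies one, leaving 7 for sub-tasks:

"tuningConfig": {
  "type": "index_parallel",
  "maxNumConcurrentSubTasks": 7,
  "maxBytesInMemory": -1,
  "skipBytesInMemoryOverheadCheck": true
}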

Also, you mentioned the heap is 8 GB. Are you referring to the Historical or MM process? What is your Peon heap? (From the log above, maxBytesTuningConfig is 357913941 bytes, i.e. one sixth of 2 GiB, which suggests the Peon heap is about 2 GB rather than 8 GB.)

If the above doesn't work: I see that appendableIndexSpec is set to onheap; you could try removing appendableIndexSpec and submitting the ingestion again.

Depending on your version of Druid, I would also suggest using 'range' instead of 'dynamic' partitioning for improved query performance.
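A sketch of what that could look like (the partition dimensions here are just illustrative picks from your schema; note that range partitioning requires forceGuaranteedRollup: true and explicit intervals in the granularitySpec):

"tuningConfig": {
  "type": "index_parallel",
  "forceGuaranteedRollup": true,
  "partitionsSpec": {
    "type": "range",
    "partitionDimensions": ["ISO", "ZONE"],
    "targetRowsPerSegment": 5000000
  }
}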

Both Historical and MM are 8 GB.
Which one is the Peon heap? Is it the javaOpts parameter in the MM runtime properties?
I will try those options. The ingestion I am running now is creating a lot of segments (around 3k) in S3 (deep storage); once it's done I will try your suggestions.

Thanks for your patience so far 🙂

Thanks for powering through this, ranjan. Yes, javaOpts are the Peon parameters. You could try raising -Xmx/-Xms if you run into memory issues again during ingestion.
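As a sketch, the Peon JVM settings live in the MiddleManager's runtime.properties (the numbers here are illustrative, not a recommendation for your hardware):

# conf/druid/cluster/data/middleManager/runtime.properties
druid.indexer.runner.javaOptsArray=["-server","-Xms3g","-Xmx3g","-XX:MaxDirectMemorySize=4g","-Duser.timezone=UTC","-Dfile.encoding=UTF-8"]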

Hi Vijeth,
What configurations besides memory can speed up ingestion?
I am loading 239 million rows (multiple Parquet files) from S3 into Druid, and in the coordinator-overlord log I am seeing a lot of "Adding lock on interval" messages. I think ingestion won't start until this is done.
So which config settings can I tune to speed this up?

ranjan,

Please consider adding explicit intervals when ingesting data to help with the lock issue.

If you are using dynamic partitioning, then your other option is to scale out and add more cores to the cluster. Using more files will also help, but may end up creating too many segments, which would then need to be cleaned up using compaction. You can use splitHintSpec to control how many files each worker reads.
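For instance, a sketch with illustrative numbers (maxSplitSize is in bytes):

"tuningConfig": {
  "type": "index_parallel",
  "splitHintSpec": {
    "type": "maxSize",
    "maxSplitSize": 1073741824,
    "maxNumFiles": 8
  }
}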

If you are using a partitioning scheme other than 'dynamic', then the biggest bang for the buck for speeding up ingestion is to pre-group the data when it is loaded into the source bucket.

Thanks Vijeth, it makes sense to create the source files at the partition grain.
Also, where is it configured, or how does the master node (coordinator/overlord) know which machine is the query node and which is the data node in a cluster setup? I tried searching for how and where they are all integrated, but couldn't find much.