Ingesting a segment already stored in HDFS once again

Hi,
After some operational problems, I lost some segments in a datasource (the segments are still stored on HDFS and visible in the metadata store). I want them to be served again. I use Druid 0.8.
I tried reindexing with the following JSON:
{
  "type" : "index_hadoop",
  "hadoopDependencyCoordinates" : ["org.apache.hadoop:hadoop-client:2.4.0"],
  "spec" : {
    "dataSchema": {
      "dataSource": "prod_deep_sme_events",
      "parser": {
        "type": "deep",
        "dimensionsSpec": {
          "dimensions": [],
          "dimensionExclusions": [],
          "spatialDimensions": []
        }
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "rows"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "SIX_HOUR",
        "queryGranularity": "NONE",
        "intervals": [ "2016-10-11/2016-10-12" ]
      }
    },
    "ioConfig": {
      "type" : "hadoop",
      "inputSpec" : {
        "type": "dataSource",
        "ingestionSpec" : {
          "dataSource" : "prod_deep_sme_events",
          "intervals": [ "2016-10-11/2016-10-12" ],
          "segments" : ["2016-10-11T06:00:00.000/2016-10-11T12:00:00.000"]
        }
      }
    },
    "tuningConfig" : {
      "type" : "hadoop"
    }
  }
}

And I POST it to the overlord via: curl -H "Content-Type: application/json" -X POST -d @reindex.json http://localhost:19083/druid/indexer/v1/task
The task failed as seen below in the overlord log:
2016-10-21T10:14:14,843 INFO [qtp681564936-40] io.druid.indexing.overlord.HeapMemoryTaskStorage - Inserting task index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z with status: TaskStatus{id=index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z, status=RUNNING, duration=-1}
2016-10-21T10:14:14,843 INFO [TaskQueue-Manager] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z]: LockTryAcquireAction{interval=2016-10-11T00:00:00.000Z/2016-10-12T00:00:00.000Z}
2016-10-21T10:14:14,843 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskLockbox - Created new TaskLockPosse: TaskLockPosse{taskLock=TaskLock{groupId=index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z, dataSource=prod_deep_sme_events, interval=2016-10-11T00:00:00.000Z/2016-10-12T00:00:00.000Z, version=2016-10-21T10:14:14.843Z}, taskIds=[]}
2016-10-21T10:14:14,844 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskLockbox - Added task[index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z] to TaskLock[index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z]
2016-10-21T10:14:14,844 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z
2016-10-21T10:14:14,846 INFO [pool-7-thread-4] io.druid.indexing.overlord.ForkingTaskRunner - Running command: java -cp /etc/druid/defaults::/usr/lib/druid/0.8.0/lib/druid-services-0.8.0-selfcontained.jar:/etc/druid/overlord:/etc/hadoop/conf -Ddruid.indexer.runner.javaOpts="-server -Xmx4g -Xms4g -XX:NewSize=256m -XX:MaxNewSize=256m -XX:+UseConcMarkSweepGC" -Ddruid.metadata.storage.connector.password= -Ddruid.indexer.fork.property.druid.processing.numThreads=4 -Duser.timezone=UTC -Dfile.encoding.pkg=sun.io -Ddruid.storage.storageDirectory=hdfs:///user/druid -Ddruid.selectors.indexing.serviceName=overlord -Ddruid.indexer.queue.startDelay=PT0M -Ddruid.metadata.storage.connector.createTables=true -Ddruid.port=19083 -Ddruid.indexer.fork.property.hadoop.mapred.job.queue.name=druid-indexing -Ddruid.worker.capacity=4 -Ddruid.extensions.searchCurrentClassloader=false -Ddruid.service=overlord -Ddruid.metadata.storage.connector.user=root -Ddruid.metadata.storage.type=mysql -Ddruid.indexer.fork.property.hadoop.mapreduce.job.queuename=druid-indexing -Ddruid.metadata.storage.connector.connectURI=jdbc:mysql://web.advertine.com:3306/druid -Djava.io.tmpdir=/tmp -Ddruid.zk.service.host=kafka01:2181/kafka081 -Ddruid.extensions.coordinates=[“io.druid.extensions:mysql-metadata-storage”] -Dfile.encoding=UTF-8 -Ddruid.storage.type=hdfs -Ddruid.indexer.fork.property.druid.computation.buffer.size=100000000 -Ddruid.processing.numThreads=4 -Dhadoop.mapred.job.queue.name=druid-indexing -Dhadoop.mapreduce.job.queuename=druid-indexing -Ddruid.computation.buffer.size=100000000 -Ddruid.host=druid01.advertine.com -Ddruid.port=8100 io.druid.cli.Main internal peon /tmp/persistent/task/index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z/6b830196-69d0-4b27-9960-a261fcaa7b79/task.json /tmp/persistent/task/index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z/6b830196-69d0-4b27-9960-a261fcaa7b79/status.json
2016-10-21T10:14:14,846 INFO [pool-7-thread-4] io.druid.indexing.overlord.ForkingTaskRunner - Logging task index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z output to: /tmp/persistent/task/index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z/6b830196-69d0-4b27-9960-a261fcaa7b79/log
2016-10-21T10:14:16,027 INFO [TaskQueue-StorageSync] io.druid.indexing.overlord.TaskQueue - Synced 2 tasks from storage (0 tasks added, 0 tasks removed).
2016-10-21T10:14:18,957 INFO [qtp681564936-58] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z]: LockTryAcquireAction{interval=2016-10-11T00:00:00.000Z/2016-10-12T00:00:00.000Z}
2016-10-21T10:14:18,958 INFO [qtp681564936-58] io.druid.indexing.overlord.TaskLockbox - Task[index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z] already present in TaskLock[index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z]
2016-10-21T10:14:25,751 INFO [pool-7-thread-4] io.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[0] for task: index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z
2016-10-21T10:14:25,752 INFO [pool-7-thread-4] io.druid.indexing.common.tasklogs.FileTaskLogs - Wrote task log to: log/index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z.log
2016-10-21T10:14:25,753 INFO [pool-7-thread-4] io.druid.indexing.overlord.ForkingTaskRunner - Removing temporary directory: /tmp/persistent/task/index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z/6b830196-69d0-4b27-9960-a261fcaa7b79
2016-10-21T10:14:25,753 INFO [pool-7-thread-4] io.druid.indexing.overlord.TaskQueue - Received FAILED status for task: index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z
2016-10-21T10:14:25,753 INFO [pool-7-thread-4] io.druid.indexing.overlord.ForkingTaskRunner - Ignoring request to cancel unknown task: index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z
2016-10-21T10:14:25,753 INFO [pool-7-thread-4] io.druid.indexing.overlord.HeapMemoryTaskStorage - Updating task index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z to status: TaskStatus{id=index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z, status=FAILED, duration=1849}
2016-10-21T10:14:25,753 INFO [pool-7-thread-4] io.druid.indexing.overlord.TaskLockbox - Removing task[index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z] from TaskLock[index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z]
2016-10-21T10:14:25,754 INFO [pool-7-thread-4] io.druid.indexing.overlord.TaskLockbox - TaskLock is now empty: TaskLock{groupId=index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z, dataSource=prod_deep_sme_events, interval=2016-10-11T00:00:00.000Z/2016-10-12T00:00:00.000Z, version=2016-10-21T10:14:14.843Z}
2016-10-21T10:14:25,754 INFO [pool-7-thread-4] io.druid.indexing.overlord.TaskQueue - Task done: HadoopIndexTask{id=index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z, type=index_hadoop, dataSource=prod_deep_sme_events}
2016-10-21T10:14:25,754 INFO [pool-7-thread-4] io.druid.indexing.overlord.TaskQueue - Task FAILED: HadoopIndexTask{id=index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z, type=index_hadoop, dataSource=prod_deep_sme_events} (1849 run duration)
2016-10-21T10:15:16,027 INFO [TaskQueue-StorageSync] io.druid.indexing.overlord.TaskQueue - Synced 1 tasks from storage (0 tasks added, 0 tasks removed).

I cannot get to /tmp/persistent/task/index_hadoop_prod_deep_sme_events_2016-10-21T10:14:14.843Z/6b830196-69d0-4b27-9960-a261fcaa7b79/log as it was deleted.
1. Does anyone know what went wrong?
2. What parser object in the dataSchema should be used for segments already stored in Druid?
3. Alternatively, can I submit an ingestSegment spec in Druid 0.8? If I understand correctly, something like the following could be sent:

{
  "type"       : "ingestSegment",
  "dataSource" : "prod_deep_sme_events",
  "interval"   : "2016-10-11/2016-10-12"
}

4. So if this is saved as ingestionPeon.json, how do I send it (in Druid 0.8)?
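My guess (I may be wrong about this, and I am not sure the ingestSegment firehose is available in 0.8) is that the snippet above is a firehose, so it would have to sit inside the ioConfig of an index task rather than be submitted on its own, and the wrapping task would then go to the same overlord endpoint as the hadoop task above:

curl -H "Content-Type: application/json" -X POST -d @ingestionPeon.json http://localhost:19083/druid/indexer/v1/task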

Best,
Pawel

Hi, if you have the segments in HDFS and the corresponding metadata entries in the metadata store, Druid should be able to load these segments without any reindexing.
Just make sure used = 1 is set in the metadata store for these segments and that the load rules are configured correctly in the coordinator console.
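For example, against the MySQL metadata store (assuming the default table name druid_segments; adjust if you use a custom base name), something like:

-- check the segments for that day
SELECT id, used FROM druid_segments
WHERE dataSource = 'prod_deep_sme_events'
  AND `start` >= '2016-10-11T00:00:00.000Z'
  AND `end` <= '2016-10-12T00:00:00.000Z';

-- mark them as used if any come back with used = 0
UPDATE druid_segments SET used = 1
WHERE dataSource = 'prod_deep_sme_events'
  AND `start` >= '2016-10-11T00:00:00.000Z'
  AND `end` <= '2016-10-12T00:00:00.000Z';

After that the coordinator should pick them up on its next run, provided the load rules cover that interval.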

Hi Nishant,
Thank you for your response.
I have checked in MySQL, and every segment I want to reintroduce has used=1.
So I have to add a rule to this datasource (currently I have only the default: {"_default":[{"tieredReplicants":{"_default_tier":1}}]}).
So if I add:

{
  "type" : "loadByInterval",
  "interval": "2016-10-11/2016-10-12",
  "tieredReplicants": {
    "hot": 1,
    "_default_tier" : 1
  }
}
to this datasource, then the coordinator will try to load the segments (in my case 4 of them, since the segments are SIX_HOUR granularity).
What about segments in this interval that are already in the cluster? (I have 2 segments missing and 2 present.)
What about other segments? Will this rule somehow affect segments in, let's say, 2016-10-13?

What I have in reality is a datasource from August to today (and counting) with two affected periods (segments pushed to HDFS and present in metadata, but absent from the cluster).
I wonder what kind of rule will make sure this case is handled.
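If I understand the docs correctly, datasource rules are evaluated top to bottom before the cluster _default rules, and the whole rule list for a datasource is replaced in one POST, so I would push something like this (assuming the coordinator rules endpoint and port are the same in 0.8; the host is a placeholder):

curl -H "Content-Type: application/json" -X POST -d @rules.json \
  http://<coordinator-host>:8081/druid/coordinator/v1/rules/prod_deep_sme_events

where rules.json is just the array, e.g. [ { "type": "loadByInterval", "interval": "2016-10-11/2016-10-12", "tieredReplicants": { "_default_tier": 1 } } ].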

Best,
Pawel

OK. Adding the rules and changing druid.server.maxSize did the trick :slight_smile:
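For anyone hitting the same thing later: presumably the historicals did not advertise enough capacity for the coordinator to assign the old segments anywhere. The relevant historical settings were along these lines (values illustrative, not my exact ones):

# runtime.properties on the historical nodes
# maximum total size of segments this node may serve, in bytes
druid.server.maxSize=300000000000
# local segment cache; keep maxSize consistent with druid.server.maxSize
druid.segmentCache.locations=[{"path":"/var/druid/segment-cache","maxSize":300000000000}]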