Druid Datasource Not Added Upon S3 Upload

Hi,

I’m pretty new to Druid (we are using Druid 0.17), and I’m trying to ingest data from Amazon S3 through the druid console. I have submitted a task and can see that it finds the data in S3, I can input the details and publish, but when it takes me to the task view and the task succeeds, the corresponding datasource never gets added, so we cannot then query the data.

This resource I used to create the config: https://druid.apache.org/docs/latest/ingestion/native-batch.html#s3-input-source

This druid resource is the one I was using where I expected the datasource to show up after the task succeeds: https://druid.apache.org/docs/0.18.0/tutorials/tutorial-batch.html

The task schema is as follows:

{

“type”: “index_parallel”,

“ioConfig”: {

“type”: “index_parallel”,

“inputSource”: {

“type”: “s3”,

“uris”: [

“s3://druid-s3-tmp/testFolder/athenaData.csv”

]

},

“inputFormat”: {

“type”: “csv”,

“findColumnsFromHeader”: false,

“skipHeaderRows”: 1,

“columns”: [

“an”,

“lm”,

“lv”

]

}

},

“tuningConfig”: {

“type”: “index_parallel”,

“partitionsSpec”: {

“type”: “dynamic”

}

},

“dataSchema”: {

“dataSource”: “druid-s3-test”,

“granularitySpec”: {

“type”: “uniform”,

“queryGranularity”: “HOUR”,

“rollup”: true,

“segmentGranularity”: “DAY”

},

“timestampSpec”: {

“column”: “!!!no_such_column!!!”,

“missingValue”: “2010-01-01T00:00:00Z”

},

“dimensionsSpec”: {

“dimensions”: [

“an”,

“lm”,

“lv”

]

},

“metricsSpec”: [

{

“name”: “count”,

“type”: “count”

}

]

}

}

Here is the log from the task, I can see the data is going into deep storage:

2020-06-16T00:05:11,403 INFO [main] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Attempting to lock file[/mnt/persistent/task/index_parallel_druid-s3-test_hdkcklcd_2020-06-16T00:05:04.687Z/lock].

2020-06-16T00:05:11,403 INFO [main] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Acquired lock file[/mnt/persistent/task/index_parallel_druid-s3-test_hdkcklcd_2020-06-16T00:05:04.687Z/lock] in 0ms.

2020-06-16T00:05:11,405 INFO [main] org.apache.druid.indexing.common.task.AbstractBatchIndexTask - [forceTimeChunkLock] is set to true in task context. Use timeChunk lock

2020-06-16T00:05:11,410 INFO [NodeRoleWatcher[COORDINATOR]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Node[http://10.151.3.128:8081] of role[coordinator] detected.

2020-06-16T00:05:11,410 INFO [NodeRoleWatcher[COORDINATOR]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Node watcher of role[coordinator] is now initialized.

2020-06-16T00:05:11,411 INFO [task-runner-0-priority-0] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Running task: index_parallel_druid-s3-test_hdkcklcd_2020-06-16T00:05:04.687Z

2020-06-16T00:05:11,413 WARN [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask - Intervals are missing in granularitySpec while this task is potentially overwriting existing segments. Forced to use timeChunk lock.

2020-06-16T00:05:11,413 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask - Found chat handler of class[org.apache.druid.segment.realtime.firehose.ServiceAnnouncingChatHandlerProvider]

2020-06-16T00:05:11,414 WARN [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask - maxNumConcurrentSubTasks[1] is less than or equal to 1. Running sequentially. Please set maxNumConcurrentSubTasks to something higher than 1 if you want to run in parallel ingestion mode.

2020-06-16T00:05:11,422 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.AbstractBatchIndexTask - [forceTimeChunkLock] is set to true in task context. Use timeChunk lock

2020-06-16T00:05:11,422 WARN [task-runner-0-priority-0] org.apache.druid.indexing.common.task.IndexTask - Chat handler is already registered. Skipping chat handler registration.

2020-06-16T00:05:11,426 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.IndexTask - Determining intervals and shardSpecs

2020-06-16T00:05:11,487 WARN [main] org.apache.druid.query.lookup.LookupReferencesManager - No lookups found for tier [__default], response [org.apache.druid.java.util.http.client.response.StringFullResponseHolder@45e11627]

2020-06-16T00:05:11,487 INFO [main] org.apache.druid.query.lookup.LookupReferencesManager - Coordinator is unavailable. Loading saved snapshot instead

2020-06-16T00:05:11,490 INFO [main] org.apache.druid.server.listener.announcer.ListenerResourceAnnouncer - Announcing start time on [/druid/ecomqc/master/listeners/lookups/__default/http:10.151.4.232:8100]

2020-06-16T00:05:11,490 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Starting lifecycle [module] stage [SERVER]

2020-06-16T00:05:11,493 INFO [main] org.eclipse.jetty.server.Server - jetty-9.4.12.v20180830; built: 2018-08-30T13:59:14.071Z; git: 27208684755d94a92186989f695db2d7b21ebc51; jvm 1.8.0_131-b11

2020-06-16T00:05:11,511 INFO [main] org.eclipse.jetty.server.session - DefaultSessionIdManager workerName=node0

2020-06-16T00:05:11,512 INFO [main] org.eclipse.jetty.server.session - No SessionScavenger set, using defaults

2020-06-16T00:05:11,512 INFO [main] org.eclipse.jetty.server.session - node0 Scavenging every 660000ms

Jun 16, 2020 12:05:11 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register

INFO: Registering org.apache.druid.server.http.SegmentListerResource as a root resource class

Jun 16, 2020 12:05:11 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register

INFO: Registering com.fasterxml.jackson.jaxrs.json.JacksonJsonProvider as a provider class

Jun 16, 2020 12:05:11 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register

INFO: Registering com.fasterxml.jackson.jaxrs.smile.JacksonSmileProvider as a provider class

Jun 16, 2020 12:05:11 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register

INFO: Registering org.apache.druid.server.initialization.jetty.CustomExceptionMapper as a provider class

Jun 16, 2020 12:05:11 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register

INFO: Registering org.apache.druid.server.initialization.jetty.ForbiddenExceptionMapper as a provider class

Jun 16, 2020 12:05:11 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register

INFO: Registering org.apache.druid.server.initialization.jetty.BadRequestExceptionMapper as a provider class

Jun 16, 2020 12:05:11 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register

INFO: Registering org.apache.druid.server.StatusResource as a root resource class

Jun 16, 2020 12:05:11 AM com.sun.jersey.server.impl.application.WebApplicationImpl _initiate

INFO: Initiating Jersey application, version ‘Jersey: 1.19.3 10/24/2016 03:43 PM’

Jun 16, 2020 12:05:11 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider

INFO: Binding org.apache.druid.server.initialization.jetty.CustomExceptionMapper to GuiceManagedComponentProvider with the scope “Singleton”

Jun 16, 2020 12:05:11 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider

INFO: Binding org.apache.druid.server.initialization.jetty.ForbiddenExceptionMapper to GuiceManagedComponentProvider with the scope “Singleton”

Jun 16, 2020 12:05:11 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider

INFO: Binding org.apache.druid.server.initialization.jetty.BadRequestExceptionMapper to GuiceManagedComponentProvider with the scope “Singleton”

Jun 16, 2020 12:05:11 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider

INFO: Binding com.fasterxml.jackson.jaxrs.json.JacksonJsonProvider to GuiceManagedComponentProvider with the scope “Singleton”

Jun 16, 2020 12:05:11 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider

INFO: Binding com.fasterxml.jackson.jaxrs.smile.JacksonSmileProvider to GuiceManagedComponentProvider with the scope “Singleton”

2020-06-16T00:05:11,700 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.IndexTask - Found intervals and shardSpecs in 274ms

2020-06-16T00:05:11,701 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.AbstractBatchIndexTask - [forceTimeChunkLock] is set to true in task context. Use timeChunk lock

2020-06-16T00:05:11,893 INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver - New segment[druid-s3-test_2010-01-01T00:00:00.000Z_2010-01-02T00:00:00.000Z_2020-06-16T00:05:11.757Z] for sequenceName[index_parallel_druid-s3-test_hdkcklcd_2020-06-16T00:05:04.687Z].

2020-06-16T00:05:11,959 INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver - Pushing segments in background: [druid-s3-test_2010-01-01T00:00:00.000Z_2010-01-02T00:00:00.000Z_2020-06-16T00:05:11.757Z]

Jun 16, 2020 12:05:11 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider

INFO: Binding org.apache.druid.server.http.security.StateResourceFilter to GuiceInstantiatedComponentProvider

Jun 16, 2020 12:05:12 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider

INFO: Binding org.apache.druid.server.http.SegmentListerResource to GuiceManagedComponentProvider with the scope “PerRequest”

Jun 16, 2020 12:05:12 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider

INFO: Binding org.apache.druid.server.QueryResource to GuiceInstantiatedComponentProvider

Jun 16, 2020 12:05:12 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider

INFO: Binding org.apache.druid.segment.realtime.firehose.ChatHandlerResource to GuiceInstantiatedComponentProvider

Jun 16, 2020 12:05:12 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider

INFO: Binding org.apache.druid.server.http.security.ConfigResourceFilter to GuiceInstantiatedComponentProvider

Jun 16, 2020 12:05:12 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider

INFO: Binding org.apache.druid.query.lookup.LookupListeningResource to GuiceInstantiatedComponentProvider

Jun 16, 2020 12:05:12 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider

INFO: Binding org.apache.druid.query.lookup.LookupIntrospectionResource to GuiceInstantiatedComponentProvider

Jun 16, 2020 12:05:12 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider

INFO: Binding org.apache.druid.server.StatusResource to GuiceManagedComponentProvider with the scope “Undefined”

2020-06-16T00:05:12,053 INFO [main] org.eclipse.jetty.server.handler.ContextHandler - Started o.e.j.s.ServletContextHandler@54d60fa1{/,null,AVAILABLE}

2020-06-16T00:05:12,063 INFO [main] org.eclipse.jetty.server.AbstractConnector - Started ServerConnector@120d6cbf{HTTP/1.1,[http/1.1]}{0.0.0.0:8100}

2020-06-16T00:05:12,063 INFO [main] org.eclipse.jetty.server.Server - Started @7313ms

2020-06-16T00:05:12,063 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Starting lifecycle [module] stage [ANNOUNCEMENTS]

2020-06-16T00:05:12,069 INFO [[index_parallel_druid-s3-test_hdkcklcd_2020-06-16T00:05:04.687Z]-appenderator-persist] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Flushed in-memory data for segment[druid-s3-test_2010-01-01T00:00:00.000Z_2010-01-02T00:00:00.000Z_2020-06-16T00:05:11.757Z] spill[0] to disk in [108] ms (2 rows).

2020-06-16T00:05:12,080 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Successfully started lifecycle [module]

2020-06-16T00:05:12,091 INFO [[index_parallel_druid-s3-test_hdkcklcd_2020-06-16T00:05:04.687Z]-appenderator-persist] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Flushed in-memory data with commit metadata [null] for segments: druid-s3-test_2010-01-01T00:00:00.000Z_2010-01-02T00:00:00.000Z_2020-06-16T00:05:11.757Z

2020-06-16T00:05:12,237 INFO [[index_parallel_druid-s3-test_hdkcklcd_2020-06-16T00:05:04.687Z]-appenderator-merge] org.apache.druid.storage.s3.S3Utils - Pushing [/mnt/tmp/druid4947883665956080183index.zip] to bucket[argos-druid] and key[druid/ecomqc/master/storage/druid-s3-test/2010-01-01T00:00:00.000Z_2010-01-02T00:00:00.000Z/2020-06-16T00:05:11.757Z/0/index.zip].

2020-06-16T00:05:12,292 INFO [[index_parallel_druid-s3-test_hdkcklcd_2020-06-16T00:05:04.687Z]-appenderator-merge] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Segment[druid-s3-test_2010-01-01T00:00:00.000Z_2010-01-02T00:00:00.000Z_2020-06-16T00:05:11.757Z] of 1,706 bytes built from 1 incremental persist(s) in 33ms; pushed to deep storage in 158ms. Load spec is: {“type”:“s3_zip”,“bucket”:“argos-druid”,“key”:“druid/ecomqc/master/storage/druid-s3-test/2010-01-01T00:00:00.000Z_2010-01-02T00:00:00.000Z/2020-06-16T00:05:11.757Z/0/index.zip”,“S3Schema”:“s3n”}

2020-06-16T00:05:12,298 INFO [[index_parallel_druid-s3-test_hdkcklcd_2020-06-16T00:05:04.687Z]-appenderator-persist] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Dropped segment[druid-s3-test_2010-01-01T00:00:00.000Z_2010-01-02T00:00:00.000Z_2020-06-16T00:05:11.757Z].

2020-06-16T00:05:12,335 INFO [[index_parallel_druid-s3-test_hdkcklcd_2020-06-16T00:05:04.687Z]-publish] org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver - Published segments with commit metadata [null]: druid-s3-test_2010-01-01T00:00:00.000Z_2010-01-02T00:00:00.000Z_2020-06-16T00:05:11.757Z

2020-06-16T00:05:12,336 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.IndexTask - Processed[3] events, unparseable[0], thrownAway[0].

2020-06-16T00:05:12,336 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.IndexTask - Published segments: [druid-s3-test_2010-01-01T00:00:00.000Z_2010-01-02T00:00:00.000Z_2020-06-16T00:05:11.757Z]

2020-06-16T00:05:12,341 WARN [task-runner-0-priority-0] org.apache.druid.segment.realtime.firehose.ServiceAnnouncingChatHandlerProvider - handler[index_parallel_druid-s3-test_hdkcklcd_2020-06-16T00:05:04.687Z] not currently registered, ignoring.

2020-06-16T00:05:12,343 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {

“id” : “index_parallel_druid-s3-test_hdkcklcd_2020-06-16T00:05:04.687Z”,

“status” : “SUCCESS”,

“duration” : 930,

“errorMsg” : null,

“location” : {

“host” : null,

“port” : -1,

“tlsPort” : -1

}

}

2020-06-16T00:05:12,346 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [ANNOUNCEMENTS]

2020-06-16T00:05:12,347 INFO [main] org.apache.druid.curator.announcement.Announcer - Unannouncing [/druid/ecomqc/master/listeners/lookups/__default/http:10.151.4.232:8100]

2020-06-16T00:05:12,369 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [SERVER]

2020-06-16T00:05:12,398 INFO [main] org.eclipse.jetty.server.AbstractConnector - Stopped ServerConnector@120d6cbf{HTTP/1.1,[http/1.1]}{0.0.0.0:8100}

2020-06-16T00:05:12,398 INFO [main] org.eclipse.jetty.server.session - node0 Stopped scavenging

2020-06-16T00:05:12,400 INFO [main] org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@54d60fa1{/,null,UNAVAILABLE}

2020-06-16T00:05:12,401 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [NORMAL]

2020-06-16T00:05:12,401 INFO [main] org.apache.druid.server.listener.announcer.ListenerResourceAnnouncer - Unannouncing start time on [/druid/ecomqc/master/listeners/lookups/__default/http:10.151.4.232:8100]

2020-06-16T00:05:12,401 INFO [LookupExtractorFactoryContainerProvider-MainThread] org.apache.druid.query.lookup.LookupReferencesManager - Lookup Management loop exited. Lookup notices are not handled anymore.

2020-06-16T00:05:12,402 INFO [main] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Starting graceful shutdown of task[index_parallel_druid-s3-test_hdkcklcd_2020-06-16T00:05:04.687Z].

2020-06-16T00:05:13,225 INFO [Curator-Framework-0] org.apache.curator.framework.imps.CuratorFrameworkImpl - backgroundOperationsLoop exiting

2020-06-16T00:05:13,244 INFO [main] org.apache.zookeeper.ZooKeeper - Session: 0x2472041cb8ba0c31 closed

2020-06-16T00:05:13,244 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [INIT]

Finished peon task

Do you have any idea why the datasource never shows up ?

Thank you

Hi,
We can see in the task spec that timestampSpec is set on a column named : !!!no_such_column!!!

Is that normal ?

I don’t think Druid should be able to complete tasks without timestamp column anyway

Oh, nevermind.
I didn’t know about missingValue option !

Could you check the file

bucket[argos-druid] and key[druid/ecomqc/master/storage/druid-s3-test/2010-01-01T00:00:00.000Z_2010-01-02T00:00:00.000Z/2020-06-16T00:05:11.757Z/0/index.zip].

is really present in s3 bucket ?

Yes, I do see the zip file there!

And ya in terms of this, which I think you have seen already that the time column isn’t mandatory from what I could tell, however, if you think it would be good to try adding it, I could always try that.

Screen Shot 2020-06-16 at 12.21.25 PM.png