No buckets?? seems there is no data to index.

It's been an entire day and I haven't been able to resolve the error below. I'd appreciate any help.

Setup: I have a Druid cluster (each node on its own VM) and am trying to ingest data using Google Dataproc (Hadoop batch ingestion).

I have ingested multiple files on the same Druid cluster without using Dataproc, so I know that Druid + ZooKeeper + MySQL are working fine.

2018-10-17T00:56:01,291 INFO [task-runner-0-priority-0] io.druid.indexing.common.task.HadoopIndexTask - Starting a hadoop index generator job...
2018-10-17T00:56:01,332 INFO [task-runner-0-priority-0] io.druid.indexer.path.StaticPathSpec - Adding paths[gs://druid-deep/quickstart/wikiticker-2015-09-12-sampled.json.gz]
2018-10-17T00:56:01,337 INFO [task-runner-0-priority-0] io.druid.indexer.HadoopDruidIndexerJob - No metadataStorageUpdaterJob set in the config. This is cool if you are running a hadoop index task, otherwise nothing will be uploaded to database.
2018-10-17T00:56:01,388 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[AbstractTask{id='index_hadoop_wikiticker3_2018-10-17T00:55:10.331Z', groupId='index_hadoop_wikiticker3_2018-10-17T00:55:10.331Z', taskResource=TaskResource{availabilityGroup='index_hadoop_wikiticker3_2018-10-17T00:55:10.331Z', requiredCapacity=1}, dataSource='wikiticker3', context={}}]
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
        at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]
        at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:222) ~[druid-indexing-service-0.12.2.jar:0.12.2]
        at io.druid.indexing.common.task.HadoopIndexTask.run(HadoopIndexTask.java:238) ~[druid-indexing-service-0.12.2.jar:0.12.2]
        at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:444) [druid-indexing-service-0.12.2.jar:0.12.2]
        at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:416) [druid-indexing-service-0.12.2.jar:0.12.2]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_181]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_181]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_181]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_181]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_181]
        at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:219) ~[druid-indexing-service-0.12.2.jar:0.12.2]
        ... 7 more
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: No buckets?? seems there is no data to index.
        at io.druid.indexer.IndexGeneratorJob.run(IndexGeneratorJob.java:229) ~[druid-indexing-hadoop-0.12.2.jar:0.12.2]
        at io.druid.indexer.JobHelper.runJobs(JobHelper.java:369) ~[druid-indexing-hadoop-0.12.2.jar:0.12.2]
        at io.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:95) ~[druid-indexing-hadoop-0.12.2.jar:0.12.2]
        at io.druid.indexing.common.task.HadoopIndexTask$HadoopIndexGeneratorInnerProcessing.runTask(HadoopIndexTask.java:293) ~[druid-indexing-service-0.12.2.jar:0.12.2]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_181]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_181]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_181]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_181]
        at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:219) ~[druid-indexing-service-0.12.2.jar:0.12.2]
        ... 7 more
Caused by: java.lang.RuntimeException: No buckets?? seems there is no data to index.
        at io.druid.indexer.IndexGeneratorJob.run(IndexGeneratorJob.java:182) ~[druid-indexing-hadoop-0.12.2.jar:0.12.2]
        at io.druid.indexer.JobHelper.runJobs(JobHelper.java:369) ~[druid-indexing-hadoop-0.12.2.jar:0.12.2]
        at io.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:95) ~[druid-indexing-hadoop-0.12.2.jar:0.12.2]
        at io.druid.indexing.common.task.HadoopIndexTask$HadoopIndexGeneratorInnerProcessing.runTask(HadoopIndexTask.java:293) ~[druid-indexing-service-0.12.2.jar:0.12.2]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_181]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_181]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_181]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_181]
        at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:219) ~[druid-indexing-service-0.12.2.jar:0.12.2]
        ... 7 more
2018-10-17T00:56:01,407 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_hadoop_wikiticker3_2018-10-17T00:55:10.331Z] status changed to [FAILED].
2018-10-17T00:56:01,410 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_hadoop_wikiticker3_2018-10-17T00:55:10.331Z",
  "status" : "FAILED",
  "duration" : 40585
}
2018-10-17T00:56:01,442 INFO [main] io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.server.listener.announcer.ListenerResourceAnnouncer.stop()] on object[io.druid.query.lookup.LookupResourceListenerAnnouncer@578c3fd9].
2018-10-17T00:56:01,443 INFO [main] io.druid.curator.announcement.Announcer - unannouncing [/druid/listeners/lookups/__default/http:middlemanager.c.agupta292-terraform.internal:8100]
2018-10-17T00:56:01,458 INFO [main] io.druid.server.listener.announcer.ListenerResourceAnnouncer - Unannouncing start time on [/druid/listeners/lookups/__default/http:middlemanager.c.agupta292-terraform.internal:8100]

All machines are in the UTC time zone.

I added the following to the tuningConfig of my index task spec:

"tuningConfig" : {
      "type" : "hadoop",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000
      },
      "jobProperties" : {
        "mapreduce.job.user.classpath.first": "true",
        "mapreduce.map.java.opts":"-Duser.timezone=UTC -Dfile.encoding=UTF-8",
        "mapreduce.reduce.java.opts":"-Duser.timezone=UTC -Dfile.encoding=UTF-8"
        }
    }

For the Hadoop config:

  1. Replaced the Jackson jars on the Hadoop nodes (/usr/lib/hadoop-mapreduce) with those from the lib/ directory of the Overlord node.

  2. Copied /etc/hadoop/conf from the Hadoop nodes to all Druid nodes, at druid/conf/hadoop.

  3. Added the below line to the index task spec:

"hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.8.4"]

  4. My Dataproc cluster runs Hadoop 2.8.4, so I copied the hadoop-client jars into druid/hadoop-dependencies/hadoop-client/2.8.4 along with the gcs-connector jar.

  5. Below is my ioConfig. The same config works if I don't use the Dataproc cluster:

"ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "gs://MY_BUCKET/quickstart/wikiticker-2015-09-12-sampled.json.gz"
      }
    },
  6. Below is my timestampSpec. I have confirmed multiple times that my input is also in the same time zone:
"timestampSpec" : {
            "format" : "auto",
            "column" : "time"
          }
  7. GranularitySpec:

"granularitySpec" : {
      "type" : "uniform",
      "segmentGranularity" : "day",
      "queryGranularity" : "none",
      "intervals" : ["2015-09-12/2015-09-13"]
    },

As you can see, I am just using the wikiticker quickstart dataset, so there is no way that file has no data in the specified interval. The same timestampSpec and ioConfig also work fine when I don't use Dataproc.

When launching the Overlord and MiddleManager nodes, I am including the Hadoop conf directory on the classpath.

java `cat conf-quickstart/druid/overlord/jvm.config | xargs` -cp "conf-quickstart/hadoop/conf/:conf-quickstart/druid/_common:conf-quickstart/druid/overlord:lib/*" io.druid.cli.Main server overlord

Appreciate any help!

I would really appreciate help from the community. I don't have many ideas left!

Hm, what happens if you widen the interval here: "intervals" : ["2015-09-12/2015-09-13"] to something larger, like two weeks? I wonder if some timezone-related issue is still occurring with Dataproc.

  • Jon

I tried that, getting the same error.

I also tried adding an interval that doesn't exist in the file, ["2017-09-12/2017-09-13"], and got the same error.

I then thought that maybe Druid was not even looking at the right file for some reason, so I deleted the file at "paths"; the error changed to "Input file doesn't exist". That means Druid is looking at the right file, with an interval the file actually covers, and yet it still throws the "No buckets??" error.

I am going to come back to this issue later, but if you find something, please let me know.

Hm, at this point I can't really think of why no rows would be ingested if you've ruled out timezone and file availability as potential causes. I would actually suggest modifying the Druid code, adding debug logs to Druid's HadoopIndexTask and related classes, and running that on your cluster to see what's going on there.

Thanks,

Jon

Take a look at the discussion here https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/druid-user/MuuiZALXkv0/zDPBmgzbAQAJ

It seems like your timestamp granularity is out of sync with your data. Do you have a timestamp column in your data?

The following is the code that determines the number of buckets; I believe Druid is not able to figure out how to bucketize/split your data based on your timestamps. (As far as I can tell, the shard specs that getAllBuckets() reads are filled in by the earlier determine-partitions pass, so an interval that matched no rows ends up with no buckets.)

int numReducers = Iterables.size(config.getAllBuckets().get());
if (numReducers == 0) {
  throw new RuntimeException("No buckets?? seems there is no data to index.");
}


public Optional<Iterable<Bucket>> getAllBuckets()
{
  Optional<Set<Interval>> intervals = getSegmentGranularIntervals();
  if (intervals.isPresent()) {
    return Optional.of(
        (Iterable<Bucket>) FunctionalIterable
            .create(intervals.get())
            .transformCat(
                new Function<Interval, Iterable<Bucket>>()
                {
                  @Override
                  public Iterable<Bucket> apply(Interval input)
                  {
                    final DateTime bucketTime = input.getStart();
                    final List<HadoopyShardSpec> specs = schema.getTuningConfig().getShardSpecs().get(bucketTime.getMillis());
                    if (specs == null) {
                      return ImmutableList.of();
                    }

                    return FunctionalIterable
                        .create(specs)
                        .transform(
                            new Function<HadoopyShardSpec, Bucket>()
                            {
                              int i = 0;

                              @Override
                              public Bucket apply(HadoopyShardSpec input)
                              {
                                return new Bucket(input.getShardNum(), bucketTime, i++);
                              }
                            }
                        );
                  }
                }
            )
    );
  } else {
    return Optional.absent();
  }
}


I am using the wikiticker quickstart file as input and it has a time field in it.

@Pratik By timestamp granularity, do you mean the granularities in the granularitySpec? I have pasted that above and it's the same as what comes with the quickstart guide.

I did go through that code, but it doesn't make sense to me that the same granularitySpec works with the same input file if I don't use Dataproc.

I also supplied the time format as "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'" and got the same error. The time field has data like: "time":"2015-09-12T00:46:58.771Z".
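
For what it's worth, that sample value is plain ISO-8601 (which is what "auto" should detect on its own); here is a quick, purely illustrative sanity check in plain Java:

import java.time.Instant;

public class TimestampCheck
{
    public static void main(String[] args)
    {
        // The value from the wikiticker file is standard ISO-8601 with a UTC
        // 'Z' suffix, so an ISO parser accepts it without an explicit pattern.
        Instant t = Instant.parse("2015-09-12T00:46:58.771Z");
        System.out.println(t); // 2015-09-12T00:46:58.771Z
    }
}

So the value itself parses fine.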

Below is the index task file I am submitting; I hope it helps explain my situation better:

{
  "type" : "index_hadoop",
  "spec" : {
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "gs://MY_BUCKET/quickstart/wikiticker-2015-09-12-sampled.json.gz"
      }
    },
    "dataSchema" : {
      "dataSource" : "wikiticker10",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2015-09-12/2015-09-13"]
      },
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : [
              "channel",
              "cityName",
              "comment",
              "countryIsoCode",
              "countryName",
              "isAnonymous",
              "isMinor",
              "isNew",
              "isRobot",
              "isUnpatrolled",
              "metroCode",
              "namespace",
              "page",
              "regionIsoCode",
              "regionName",
              "user"
            ]
          },
          "timestampSpec" : {
            "format" : "auto",
            "column" : "time"
          }
        }
      },
      "metricsSpec" : [
        {
          "name" : "count",
          "type" : "count"
        },
        {
          "name" : "added",
          "type" : "longSum",
          "fieldName" : "added"
        },
        {
          "name" : "deleted",
          "type" : "longSum",
          "fieldName" : "deleted"
        },
        {
          "name" : "delta",
          "type" : "longSum",
          "fieldName" : "delta"
        },
        {
          "name" : "user_unique",
          "type" : "hyperUnique",
          "fieldName" : "user"
        }
      ]
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000
      },
      "jobProperties" : {
        "fs.default.name" : "hdfs://10.138.0.15",
        "fs.defaultFS" : "hdfs://10.138.0.15",
        "dfs.datanode.address" : "10.138.0.15",
        "dfs.client.use.datanode.hostname" : "true",
        "dfs.datanode.use.datanode.hostname" : "true",
        "yarn.resourcemanager.hostname" : "10.138.0.15",
        "yarn.nodemanager.vmem-check-enabled" : "false",
        "mapreduce.map.java.opts" : "-Duser.timezone=UTC -Dfile.encoding=UTF-8",
        "mapreduce.job.user.classpath.first" : "true",
        "mapreduce.reduce.java.opts" : "-Duser.timezone=UTC -Dfile.encoding=UTF-8",
        "mapreduce.map.memory.mb" : 1024,
        "mapreduce.reduce.memory.mb" : 1024
        }
    }
  },
"hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.8.4"]
}


https://groups.google.com/forum/#!topic/druid-user/Zm-VWhl3X6Y should help you. I think it's a timezone spec problem.
I remember facing the same issue; we have the following properties in our jobProperties:

"mapreduce.map.java.opts": "-Duser.timezone=UTC -Dfile.encoding=UTF-8",
"mapreduce.reduce.java.opts": "-Duser.timezone=UTC -Dfile.encoding=UTF-8"

I have those properties in my jobProperties as well. Please see the index task spec pasted above.

Do you have only those two properties in your jobProperties?

What version of Hadoop are you running?

@Pratik Did you build Druid yourself? I just downloaded the Druid 0.12.2 package and have been using it. I'm wondering if building it from source might help!

We did not build Druid. We are using an AWS-based Hadoop distribution.