Hadoop indexing issues since upgrade to 0.9.0

Hi,

We use an external Hadoop cluster (EMR) for indexing. After upgrading to version 0.9.0, indexing time for a specific topic increased from 40 minutes to 4 hours.

Basically, all the tasks finish within 40 minutes, except for the last 2 reduce tasks, which are stuck for the rest of the time.

Other topics now fail with this error:

Exception from container-launch.
Container id: container_1462158846522_0003_01_000043
Exit code: 1
Exception message: /bin/bash: /var/log/hadoop-yarn/containers/application_1462158846522_0003/container_1462158846522_0003_01_000043/stdout: No such file or directory
Stack trace: ExitCodeException exitCode=1: /bin/bash: /var/log/hadoop-yarn/containers/application_1462158846522_0003/container_1462158846522_0003_01_000043/stdout: No such file or directory
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
    at org.apache.hadoop.util.Shell.run(Shell.java:456)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

We’ve re-ordered the dimensions under dimensionsSpec from low to high cardinality, as recommended in the Druid 0.9.0 changelog (a rough example is below).
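For illustration, the change looks roughly like the fragment below inside our parseSpec; the dimension names are placeholders rather than our real schema, listed from lowest to highest cardinality:

    "dimensionsSpec": {
      "dimensions": ["country", "device_type", "city", "user_id"]
    }

As far as we understand, in 0.9.0 the order of dimensions in the spec determines the column order in the segments, so this ordering matters more than it did in 0.8.3.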

Is there anything we can do?

The long-running reduce task log is attached.

Thanks!

Michael

stack_reduce.log.gz (778 KB)

Is there a way to downgrade Druid from 0.9.0?

Hi Michael, there were no API changes so you should feel free to roll back.

Can you provide more details about the performance degradation you are seeing? I don’t think anyone else has hit this problem yet. Can you share your spec and the details of your Hadoop-based ingestion?

I should clarify that some configurations did change. If you still have your old configurations from 0.8.3, though, you should be able to roll back and use those.

Hello Fangjin,

Please see the attachment to my first message; the spec is bundled into the log.

There are consistently 2 reduce tasks that run for most of the time. All the others finish in about 40 minutes, as they did before the upgrade.

Do you mean the configuration in the .properties files or the metadata DB configuration? We could revert the .properties configuration, but we have no backup of the metadata DB…

Thanks,

Michael

Thanks, Michael. I missed the log initially. I’m not entirely sure what is going on yet. Do you also happen to have the log of a task that completed much faster under 0.8.3?

Hi Fangjin,

No, unfortunately I don’t have it.

Hi Michael, do you have your 0.8.3 ingestion spec? Is it the same as the 0.9.0 ingestion spec?