Trouble loading batch data

Hi,

I tried to set up a local Druid cluster and load batch data into historical nodes.

I followed the instructions as in http://druid.io/docs/latest/tutorials/tutorial-loading-batch-data.html.

According to http://druid.io/docs/latest/misc/tasks.html, the index task is a variation that does not require an external Hadoop setup.

When I post the task for indexing, it fails with the following error in the logs.

I tried 4 different ways of starting the indexer [the instructions say to include the “<hadoop_config_path>” at the end of the classpath]:

  1. Omitting the hadoop-config path altogether when starting the indexer
  2. Pointing to the top-level directory of a local Hadoop install (not running)
  3. Pointing to the local install’s config directory (with nothing running)
  4. Specifically adding hadoop-common-2.3.0.jar to the classpath

All of these result in the same error. What am I doing wrong?
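For reference, here is roughly how I am constructing the classpath for option 3 (DRUID_HOME and HADOOP_CONF are placeholders for my local setup, not canonical paths):

```shell
# Sketch of the indexer classpath; paths are placeholders for my local setup.
DRUID_HOME=/opt/druid
HADOOP_CONF=/opt/hadoop/etc/hadoop   # option 3: the local install's config dir
CLASSPATH="$DRUID_HOME/config/overlord:$DRUID_HOME/lib/*:$HADOOP_CONF"
echo "$CLASSPATH"
# java -Xmx2g -classpath "$CLASSPATH" io.druid.cli.Main server overlord
```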

Thanks

-Subbu

OK, I got it to work now.

The hadoop-common jar that I had was the problem.

I don’t remember any of the tutorial pages mentioning that hadoop jars need to be included in the classpath (and the page I mentioned asks to include hadoop-config in the classpath), so maybe it is a good idea to clarify that?

thanks

-Subbu

Hi Subbu,

It would be great if you could contribute your findings to our documentation. All Druid docs are located in the GitHub repo. This is the one we have so far for working with other versions of Hadoop:

https://github.com/druid-io/druid/blob/master/docs/content/operations/other-hadoop.md

In the future, we hope to rework the dependency system to make it easier to work with different versions of Hadoop. You may be interested in reading this proposal:

https://groups.google.com/forum/#!topic/druid-development/7KJCsQ9GvGo

Fangjin,

I have another question regarding loading batch data.

The wiki says to expect a “Received SUCCESS status for task:” message in the indexer logs. I do not get one. Also, the indexer console shows the task as still running, but clicking on the logs shows that the task is completed.

I do see the “Announcing segment” log in the historical node logs, and the timeBoundary query works as well.
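For context, the timeBoundary query I ran looks roughly like this (the “wikipedia” datasource name and broker port 8082 are assumptions from the tutorial setup and may differ):

```shell
# Minimal timeBoundary query body; datasource name and broker port
# are taken from the tutorial and are assumptions for this sketch.
cat > time_boundary.json <<'EOF'
{
  "queryType": "timeBoundary",
  "dataSource": "wikipedia"
}
EOF
# curl -X POST 'http://localhost:8082/druid/v2/?pretty' \
#   -H 'Content-Type: application/json' -d @time_boundary.json
```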

Can you help me here?

thanks

Also, to note:

I started the co-ordinator, updated the number of replicants for _default to 1, stopped and restarted it, and then did the batch operation.

Clicking on ‘update’ in the co-ordinator console did not update the mysql row. Instead, it created a new rule with a different version, and the co-ordinator continued to pick up the first version it found, because that one shows up first in the mysql query results. So, I had to delete the older mysql rule row.

-Subbu

Hi Subbu, inline.

Fangjin,

I understand that the rule update in co-ordinator does not happen immediately. What I wanted to point out was that it seems that the update never happens.

When I click on ‘update’ in the UI, the old row in mysql is not removed. Instead, a new row is added with the new rule that allows one replicant for the segment.

The co-ordinator always matches the first rule that it comes across when reading the mysql database, so the new rule is never applied, even if the co-ordinator is restarted.

I had to manually remove the old row in mysql using the mysql ‘delete’ command.
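For the record, the cleanup was along these lines (the row id is the older rule’s id from my druid_rules table, and the connection details are placeholders; treat it as a sketch and back up first):

```shell
# Sketch of the manual cleanup; the row id is the older rule's id from
# my druid_rules table, and the mysql connection details are placeholders.
SQL="DELETE FROM druid_rules WHERE id = '_default_2015-07-10T22:05:44.110Z';"
echo "$SQL"
# mysql -u druid -p druid -e "$SQL"
```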

Here is the mysql result when I clicked on ‘update’ in the UI

Hi Fangjin,

Attached are the task logs and the logs from the servers (after I issued the curl command). Looks like it is not able to execute the “stop” command? Let me know if you would like more logs.

-Subbu

taskLogs.txt (60.7 KB)

historicalLogs.txt (3.02 KB)

overlord.txt (10.1 KB)

coordinator.txt (23.5 KB)

Hi Subbu, please see inline.

Fangjin,

I understand that the rule update in co-ordinator does not happen immediately. What I wanted to point out was that it seems that the update never happens.

When I click on ‘update’ in the UI, the old row in mysql is not removed. Instead, a new row is added with the new rule that allows one replicant for the segment.

The old row is never removed. We keep it around for audit information. Druid always uses the latest rule for a datasource it finds.

The co-ordinator always matches the first rule that it comes across when reading the mysql database, so the new rule is never applied, even if the co-ordinator is restarted.

I had to manually remove the old row in mysql using the mysql ‘delete’ command.

Here is the mysql result when I clicked on ‘update’ in the UI

select * from druid_rules;

+-----------------------------------+------------+--------------------------+-----------------------------------------------------------------+
| id                                | dataSource | version                  | payload                                                         |
+-----------------------------------+------------+--------------------------+-----------------------------------------------------------------+
| _default_2015-07-10T22:05:44.110Z | _default   | 2015-07-10T22:05:44.110Z | [{"tieredReplicants":{"_default_tier":2},"type":"loadForever"}] |
| _default_2015-07-10T22:06:58.879Z | _default   | 2015-07-10T22:06:58.879Z | [{"tieredReplicants":{"_default_tier":1},"type":"loadForever"}] |
+-----------------------------------+------------+--------------------------+-----------------------------------------------------------------+

2 rows in set (0.00 sec)

And then I had to do this in order for the new rule to take effect

How did you verify the new rule was not taking effect? The reason I ask is because the rule logic has been around for some time and your use case is one we’ve verified quite a few times. There should also be several tests for this use case.

Hi Subbu,

Looking at the task logs, I see:

2015-07-13T15:50:04,447 INFO [task-runner-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_wikipedia_2015-07-13T15:49:53.261Z",
  "status" : "SUCCESS",
  "duration" : 648
}

If you refresh the indexer console after the task completes, does the status update?

Hi Fangjin,

I verified by stopping the co-ordinator and restarting it. The old rule seemed to take effect, and it was complaining that it could not find replicants. I forget the details, now that I am past the problem.

-Subbu

Hi Fangjin,

No, refreshing the console does not update the status of the task either.

If I stop the indexer and restart it, then all tasks disappear and the console is empty.

thanks

-Subbu

Hmmm, there are some ZK connection problems that occur at the end of the task, possibly causing the finished status to be missed. We will investigate and update this thread.

2015-07-13T15:50:52,697 WARN [main-SendThread(ssubrama-ld1.linkedin.biz:2181)] org.apache.zookeeper.ClientCnxn - Session 0x14e79f3d5e1000f for server ssubrama-ld1.linkedin.biz/127.0.0.1:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Xid out of order. Got Xid 56 with err 0 expected Xid 55 for a packet with details: clientPath:null serverPath:null finished:false header:: 55,14  replyHeader:: 0,0,-4  request:: org.apache.zookeeper.MultiTransactionRecord@731bdbc5 response:: org.apache.zookeeper.MultiResponse@0
	at org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:798) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
	at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:94) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) [zookeeper-3.4.6.jar:3.4.6-1569965]
2015-07-13T15:50:52,798 INFO [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
2015-07-13T15:50:53,643 INFO [main-SendThread(ssubrama-ld1.linkedin.biz:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server ssubrama-ld1.linkedin.biz/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error)
2015-07-13T15:50:53,643 INFO [main-SendThread(ssubrama-ld1.linkedin.biz:2181)] org.apache.zookeeper.ClientCnxn - Socket connection established to ssubrama-ld1.linkedin.biz/0:0:0:0:0:0:0:1:2181, initiating session
2015-07-13T15:50:53,645 WARN [main-SendThread(ssubrama-ld1.linkedin.biz:2181)] org.apache.zookeeper.ClientCnxnSocket - Connected to an old server; r-o mode will be unavailable

Hi Subbu, do you by any chance have the coordinator logs after you changed the rules? I’d like to make sure there’s nothing fishy going on there.

Hi Subbu, what version is your ZK server? Googling around a bit, it looks like “java.io.IOException: Xid out of order.” can happen with some older ZK servers.
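For reference, one quick way to see what version a running ZK server reports is its four-letter “stat” command (host and port below are assumptions for a local setup); the first line of the output can be parsed down to just the version:

```shell
# Real command (against a local server):
#   echo stat | nc localhost 2181 | head -1
# The first line looks like the sample below; extract just the version.
SAMPLE='Zookeeper version: 3.4.6-1569965, built on 02/20/2014 09:09 GMT'
VERSION="${SAMPLE#Zookeeper version: }"   # strip the label
VERSION="${VERSION%%-*}"                  # drop the build suffix
echo "$VERSION"                           # → 3.4.6
```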

This just started happening to me, after things worked during a couple of initial POC tests of a data load. The cluster this is happening on is currently using Zookeeper 3.3.4, and coordinator version 0.7.0.

  • Paul

ZooKeeper 3.3.4 is quite old and has a number of known bugs- do things work for you with a newer one, like 3.4.6?

All versions of zk are pretty old… :wink: In all seriousness though, I know there are a number of bug fixes in the zk 3.4.x series, including fixes for previous issues with ephemeral nodes and other node issues. I’ve been hesitant to bump due to a nasty little problem with reverse DNS introduced in 3.4.x, which impacts one of my clusters and remains unresolved even in 3.5-alpha (based on this problem consistently remaining).

I will bump zk tonight and verify that it resolves this issue for me. I do want you to be aware, though, in case others have been holding back their Zookeeper version: right or wrong, the Kafka docs still recommend 3.3.4.

Thanks,
Paul

Hi Gian,

I can confirm that upgrading the Zookeeper ensemble to 3.4.6 on the cluster in question resolved this issue. Thanks for your help!

Cheers,

Paul