Status Blog‎ > ‎Alpha Trial Run‎ > ‎

Name: node016-1224145206022 Status: ProcessingJob(3):DeletingThisMessageFromWorkQueue

Resolution

http://code.google.com/p/bt-recordbook/source/detail?r=209

Augmented the remote signature provider to be robust to failures connecting to
the authorization servlet. It will try 8 times, waiting .1s, .4s, 1.6s, 4.9s,
..., 27 minutes between connections. This means we can literally take the
servlet down for almost half an hour without losing anybody.


Main Process (title of page)

This is odd because it appears to be on job 3 even though it's been running all night.

SSH into Node016 shows that the process is still running.

None of the log files have been touched in hours. The most recent was at 3:40 AM (it's 10 AM now).

Errors in log file:
Oct 16, 2008 2:25:55 AM org.rlcommunity.bt.recordbook.node.RemoteSignatureProvider createSignatureForMessage
SEVERE: Problem signing message
java.net.NoRouteToHostException: No route to host
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
    at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:364)
    at java.net.Socket.connect(Socket.java:507)
    at java.net.Socket.connect(Socket.java:457)
    at java.net.Socket.<init>(Socket.java:365)
    at java.net.Socket.<init>(Socket.java:238)
    at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
    at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
    at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1361)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
    at org.rlcommunity.bt.recordbook.node.RemoteSignatureProvider.executeRemoteSignatureRequest(RemoteSignatureProvider.java:282)
    at org.rlcommunity.bt.recordbook.node.RemoteSignatureProvider.signCanonicalString(RemoteSignatureProvider.java:386)
    at org.rlcommunity.bt.recordbook.node.RemoteSignatureProvider.createSignatureForMessage(RemoteSignatureProvider.java:515)
    at com.xerox.amazonws.common.AWSConnection.createSignatureForMessage(AWSConnection.java:197)
    at com.xerox.amazonws.common.AWSQueryConnection.makeRequest(AWSQueryConnection.java:315)
    at com.xerox.amazonws.sqs2.MessageQueue.makeRequestInt(MessageQueue.java:361)
    at com.xerox.amazonws.sqs2.MessageQueue.deleteMessage(MessageQueue.java:260)
    at com.xerox.amazonws.sqs2.MessageQueue.deleteMessage(MessageQueue.java:246)
    at org.rlcommunity.bt.recordbook.node.btQueue.deleteMessage(btQueue.java:141)
    at org.rlcommunity.bt.recordbook.node.compute.ComputeNodeDriver.runOneStep(ComputeNodeDriver.java:126)
    at org.rlcommunity.bt.recordbook.node.ProcessManager.run(ProcessManager.java:117)


Errors in .err file:
Exception in thread "Thread-1" java.lang.NullPointerException
    at java.net.URLEncoder.encode(URLEncoder.java:186)
    at com.xerox.amazonws.common.AWSConnection.urlencode(AWSConnection.java:150)
    at com.xerox.amazonws.common.AWSQueryConnection.makeRequest(AWSQueryConnection.java:316)
    at com.xerox.amazonws.sqs2.MessageQueue.makeRequestInt(MessageQueue.java:361)
    at com.xerox.amazonws.sqs2.MessageQueue.deleteMessage(MessageQueue.java:260)
    at com.xerox.amazonws.sqs2.MessageQueue.deleteMessage(MessageQueue.java:246)
    at org.rlcommunity.bt.recordbook.node.btQueue.deleteMessage(btQueue.java:141)
    at org.rlcommunity.bt.recordbook.node.compute.ComputeNodeDriver.runOneStep(ComputeNodeDriver.java:126)
    at org.rlcommunity.bt.recordbook.node.ProcessManager.run(ProcessManager.java:117)

So this seems to have taken a core thread out of commission, but the heartbeat keeps truckin.

This message seems consistent between both nodes (this one and below). So here is an explanation:

We have a common point of failure.  RemoteSignatureProvider line 282, sends a request to the auth server to get a signature.  If that fails because of no route to host, we catch an IOException (NoRouteToHost), and pass it up to the caller.   The immediate call chain is:
    at org.rlcommunity.bt.recordbook.node.RemoteSignatureProvider.signCanonicalString(RemoteSignatureProvider.java:386)
    at org.rlcommunity.bt.recordbook.node.RemoteSignatureProvider.createSignatureForMessage(RemoteSignatureProvider.java:515)

From there it goes up through Typica, once to an SDB heartbeat update and once to a delete message.  I think it could as easily bubble up through Jets3t also.

So: signCanonicalString throws all exceptions up.

createSignatureForMessage does not throw exceptions.  This exception is getting caught in it's try/catch.

In the long term, createSignatureForMessage should probably throw some exceptions up into Typica/Jets3t. For example, there may be transient, and non-transient reasons why a signature could not be created.  NoRouteToHost might be transient, invalid username/password might be permanent.  They should decide how to handle these problems.

For now though, I think we can put some longer sleeps and retries into createSignatureForMessage, and just quietly try again (with logging) if we can't get a signature.  After a while of trying, we should give up and exit the program, not pass a null signature back up into libraries that depend on us causing who knows how much damage.

Other process on node016


<UPDATE: This is analysis of a different node that is not still on the scoreboard.  It still crashed, so that autopsy is still valuable.>
A couple of errors in the log:
Oct 16, 2008 2:39:17 AM org.rlcommunity.bt.recordbook.node.RemoteSignatureProvider createSignatureForMessage
SEVERE: Problem signing message
java.net.NoRouteToHostException: No route to host
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
    at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:364)
    at java.net.Socket.connect(Socket.java:507)
    at java.net.Socket.connect(Socket.java:457)
    at java.net.Socket.<init>(Socket.java:365)
    at java.net.Socket.<init>(Socket.java:238)
    at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
    at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
    at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1361)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
    at org.rlcommunity.bt.recordbook.node.RemoteSignatureProvider.executeRemoteSignatureRequest(RemoteSignatureProvider.java:282)
    at org.rlcommunity.bt.recordbook.node.RemoteSignatureProvider.signCanonicalString(RemoteSignatureProvider.java:386)
    at org.rlcommunity.bt.recordbook.node.RemoteSignatureProvider.createSignatureForMessage(RemoteSignatureProvider.java:515)
    at com.xerox.amazonws.common.AWSConnection.createSignatureForMessage(AWSConnection.java:197)
    at com.xerox.amazonws.common.AWSQueryConnection.makeRequest(AWSQueryConnection.java:315)
    at com.xerox.amazonws.sdb.Item.makeRequestInt(Item.java:255)
    at com.xerox.amazonws.sdb.Item.putAttributes(Item.java:216)
    at org.rlcommunity.bt.recordbook.node.SimpleDBHeartBeatManager.updateHeartBeat(SimpleDBHeartBeatManager.java:75)
    at org.rlcommunity.bt.recordbook.node.SimpleDBHeartBeatManager.run(SimpleDBHeartBeatManager.java:85)



Oct 16, 2008 2:42:45 AM org.rlcommunity.bt.recordbook.node.RemoteSignatureProvider createSignatureForMessage
SEVERE: Problem signing message
java.net.NoRouteToHostException: No route to host
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
    at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:364)
    at java.net.Socket.connect(Socket.java:507)
    at java.net.Socket.connect(Socket.java:457)
    at java.net.Socket.<init>(Socket.java:365)
    at java.net.Socket.<init>(Socket.java:238)
    at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
    at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
    at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1361)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
    at org.rlcommunity.bt.recordbook.node.RemoteSignatureProvider.executeRemoteSignatureRequest(RemoteSignatureProvider.java:282)
    at org.rlcommunity.bt.recordbook.node.RemoteSignatureProvider.signCanonicalString(RemoteSignatureProvider.java:386)
    at org.rlcommunity.bt.recordbook.node.RemoteSignatureProvider.createSignatureForMessage(RemoteSignatureProvider.java:515)
    at com.xerox.amazonws.common.AWSConnection.createSignatureForMessage(AWSConnection.java:197)
    at com.xerox.amazonws.common.AWSQueryConnection.makeRequest(AWSQueryConnection.java:315)
    at com.xerox.amazonws.sdb.Item.makeRequestInt(Item.java:255)
    at com.xerox.amazonws.sdb.Item.putAttributes(Item.java:216)
    at org.rlcommunity.bt.recordbook.node.SimpleDBHeartBeatManager.updateHeartBeat(SimpleDBHeartBeatManager.java:75)
    at org.rlcommunity.bt.recordbook.node.SimpleDBHeartBeatManager.run(SimpleDBHeartBeatManager.java:85)

and then

Oct 16, 2008 3:40:58 AM org.rlcommunity.bt.recordbook.node.btQueue receiveMessage
SEVERE: Problem receiving message
com.xerox.amazonws.sqs2.SQSException: Client error : (AWS.SimpleQueueService.NonExistentQueue) The specified queue does not exist for this wsdl version.
    at com.xerox.amazonws.sqs2.MessageQueue.makeRequestInt(MessageQueue.java:363)
    at com.xerox.amazonws.sqs2.MessageQueue.receiveMessages(MessageQueue.java:220)
    at com.xerox.amazonws.sqs2.MessageQueue.receiveMessage(MessageQueue.java:154)
    at org.rlcommunity.bt.recordbook.node.btQueue.receiveMessage(btQueue.java:131)
    at org.rlcommunity.bt.recordbook.node.IncomingRemoteCommandManager.processOneMessage(IncomingRemoteCommandManager.java:51)
    at org.rlcommunity.bt.recordbook.node.IncomingRemoteCommandManager.run(IncomingRemoteCommandManager.java:72)

It processed experiments between the two Routing errors, and got up to Experiment 52. 

It looks like either the 2:39 or 2:42 error killed the heartbeat and the processing capability of the node (because it didn't log any work after 2:42), and the error an hour later that took the node down was because the commander finally cleaned it's queues and took it off the scoreboard.

The associated .err and .out files are:
node0161224145234.out
node0161224145234.err

And the log output was ../logs/node016-1224145234796_log.txt

Need to fix that bug that has them named differently.

The outfile just has a bunch of runtimes, the error file has:
Exception in thread "Thread-6" java.lang.NullPointerException
    at java.net.URLEncoder.encode(URLEncoder.java:186)
    at com.xerox.amazonws.common.AWSConnection.urlencode(AWSConnection.java:150)
    at com.xerox.amazonws.common.AWSQueryConnection.makeRequest(AWSQueryConnection.java:316)
    at com.xerox.amazonws.sdb.Item.makeRequestInt(Item.java:255)
    at com.xerox.amazonws.sdb.Item.putAttributes(Item.java:216)
    at org.rlcommunity.bt.recordbook.node.SimpleDBHeartBeatManager.updateHeartBeat(SimpleDBHeartBeatManager.java:75)
    at org.rlcommunity.bt.recordbook.node.SimpleDBHeartBeatManager.run(SimpleDBHeartBeatManager.java:85)
Exception in thread "Thread-1" java.lang.NullPointerException
    at java.net.URLEncoder.encode(URLEncoder.java:186)
    at com.xerox.amazonws.common.AWSConnection.urlencode(AWSConnection.java:150)
    at com.xerox.amazonws.common.AWSQueryConnection.makeRequest(AWSQueryConnection.java:316)
    at com.xerox.amazonws.sqs2.MessageQueue.makeRequestInt(MessageQueue.java:361)
    at com.xerox.amazonws.sqs2.MessageQueue.deleteMessage(MessageQueue.java:260)
    at com.xerox.amazonws.sqs2.MessageQueue.deleteMessage(MessageQueue.java:246)
    at org.rlcommunity.bt.recordbook.node.btQueue.deleteMessage(btQueue.java:141)
    at org.rlcommunity.bt.recordbook.node.compute.ComputeNodeDriver.runOneStep(ComputeNodeDriver.java:126)
    at org.rlcommunity.bt.recordbook.node.ProcessManager.run(ProcessManager.java:117)


These are errors that we didn't catch and log appropriatley, so we don't have times on them.  They probably caused the crash.

Going to the code to investigate.



Comments