I've been working hard cleaning up the code, making it more robust to network failures, and making it more user friendly. I've also sped up the aggregator drastically by sending status messages through SQS instead of uploading them as S3 files, whenever a message can be compressed to fit into a single SQS message.
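The pick-SQS-when-it-fits idea can be sketched as a small helper. This is a hypothetical sketch, not the project's actual code: the function name and the fallback shape are my own, the actual boto send/upload calls are left out, and I'm assuming the current 256 KB SQS body limit (it was smaller in earlier years). The compressed bytes are base64-encoded because an SQS body must be text.

```python
import base64
import zlib

# Assumed SQS per-message body limit (256 KB today; older limits were smaller).
SQS_BODY_LIMIT = 256 * 1024

def choose_transport(status_text, limit=SQS_BODY_LIMIT):
    """Return ("sqs", body) when the compressed, base64-encoded status
    fits in a single SQS message, else ("s3", status_text) so the caller
    falls back to uploading the status as an S3 file.

    Hypothetical helper: the real sender would follow this with the
    matching SQS send_message or S3 put call."""
    body = base64.b64encode(
        zlib.compress(status_text.encode("utf-8"))
    ).decode("ascii")
    if len(body) <= limit:
        return ("sqs", body)
    return ("s3", status_text)
```

The receiver reverses the small-message path with `zlib.decompress(base64.b64decode(body))`; only oversized or incompressible statuses pay the S3 round trip.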
Making the nodes robust to failures connecting to the signature provider or to SQS has greatly improved their uptime. We're also tracking node status by having each node report a heartbeat to Amazon's SimpleDB, which has made the commander much more reliable and has reduced the complexity and computational load on all the nodes.
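A heartbeat write of this kind might look like the sketch below. The attribute names and the "heartbeats" domain are hypothetical, not the project's actual schema; the testable part just builds the attribute list, and the SimpleDB call itself is shown only as a hedged comment, assuming boto3's low-level `sdb` client.

```python
from datetime import datetime, timezone

def heartbeat_attributes(node_id, now=None):
    """Build a SimpleDB attribute list for one heartbeat.

    Replace=True overwrites the node's previous value, so the domain
    keeps exactly one last_seen timestamp per node instead of growing
    without bound. Attribute names here are illustrative assumptions."""
    now = now or datetime.now(timezone.utc)
    return [
        {"Name": "last_seen", "Value": now.isoformat(), "Replace": True},
        {"Name": "node_id", "Value": node_id, "Replace": True},
    ]

# A node would then write this roughly like so (sketch, untested):
#   sdb = boto3.client("sdb")
#   sdb.put_attributes(DomainName="heartbeats", ItemName=node_id,
#                      Attributes=heartbeat_attributes(node_id))
```

Because each node only issues one small periodic write, the commander can read liveness from one place instead of polling every node, which is where the reduced complexity comes from.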
However, we're still running into problems where some nodes are marked as dead when they are not, or marked as alive when they have clearly stopped doing work. I've been taking a shotgun approach, running 60 nodes at a time. That was good for catching common failures and quickly weeding out issues, but I think we now need a more careful approach: running only a few nodes at a time and evaluating each one closely to see what goes wrong.
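Both misclassifications can come from the same place: the staleness threshold the commander applies to the last heartbeat. A minimal sketch of that check, with made-up constants rather than the project's real settings:

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_INTERVAL = timedelta(seconds=60)  # assumed reporting cadence
MISSED_BEATS_ALLOWED = 3                    # grace window before declaring death

def is_alive(last_seen, now=None):
    """Commander-side liveness check against the last heartbeat time.

    Too small a grace window marks busy-but-slow nodes dead; relying on
    the heartbeat alone lets a wedged node that still heartbeats but has
    stopped doing real work look alive indefinitely."""
    now = now or datetime.now(timezone.utc)
    return now - last_seen <= HEARTBEAT_INTERVAL * MISSED_BEATS_ALLOWED
```

One design note: a heartbeat proves the process is up, not that it is making progress, so a check like this probably needs to be paired with a work-progress counter to catch the "alive but idle" case.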
Passwords And Such
There is now a configuration system where you can pass parameters on the command line as paramName=paramValue. One parameter, settingsfile, can point to a file with one name=value pair per line. This is a good way to store passwords and other sensitive settings without typing them on the command line. I'm going to improve it to use a default file in .recordbook in your home directory.
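The merge between command-line pairs and the settings file could work like this sketch. The function names are hypothetical; only `settingsfile` comes from the actual system. I'm assuming command-line values should win over file values, which keeps a password in the file while still letting you override any setting per run.

```python
def parse_pairs(lines):
    """Parse name=value pairs, one per line/argument.

    Blank lines and lines without '=' are skipped; splitting on the
    first '=' only means values may themselves contain '='."""
    settings = {}
    for line in lines:
        line = line.strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name] = value
    return settings

def load_settings(argv):
    """Combine the settings file (if given) with command-line pairs.

    Command-line pairs are applied last, so they override file values."""
    cli = parse_pairs(argv)
    settings = {}
    if "settingsfile" in cli:
        with open(cli["settingsfile"]) as f:
            settings.update(parse_pairs(f))
    settings.update(cli)
    return settings
```

A default file under .recordbook would slot in as one more `settings.update(...)` layer, applied before the explicit settingsfile and the command line.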