
Moving the Aggregator to the Cloud

posted Aug 8, 2008, 2:10 PM by Brian Tanner   [ updated Aug 8, 2008, 6:18 PM ]
Things seem to be working fairly well with the external signing.  There are some bugs, of course, but these are getting ironed out one at a time.

Most of the problems stem from Internet hiccups not being handled very gracefully.  I'm adding code bit by bit to make things more stable, attacking the problems that frustrate me most first.

I've added some simple caching to the RemoteMessageSigner so that the nodes and tools don't need to talk to the remote authorizer before every interaction with AWS. This has provided a very substantial speedup in a few parts of the system.  I haven't been able to crack the nut of not requiring every SQS message to be signed, but I'll come back to it later, probably after the next Typica release.  This is mostly not a big deal, because most of the node types don't have serious performance requirements... we want them to go as fast as is convenient, but a little overhead here and there isn't going to hurt us much.
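For the curious, the caching idea looks roughly like this.  This is just a sketch: the RemoteAuthorizerClient interface, the TTL, and the class name are all illustrative, not the actual RemoteMessageSigner code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the client that does one round trip
// to the external authorizer.
interface RemoteAuthorizerClient {
    String sign(String stringToSign);
}

public class CachingMessageSigner {
    private static final long TTL_MILLIS = 5 * 60 * 1000; // illustrative 5-minute expiry

    private static class Entry {
        final String signature;
        final long created;
        Entry(String signature, long created) {
            this.signature = signature;
            this.created = created;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<String, Entry>();
    private final RemoteAuthorizerClient authorizer;

    public CachingMessageSigner(RemoteAuthorizerClient authorizer) {
        this.authorizer = authorizer;
    }

    public String sign(String stringToSign) {
        long now = System.currentTimeMillis();
        Entry e = cache.get(stringToSign);
        if (e != null && now - e.created < TTL_MILLIS) {
            return e.signature; // cache hit: no round trip to the authorizer
        }
        String signature = authorizer.sign(stringToSign); // the slow remote call
        cache.put(stringToSign, new Entry(signature, now));
        return signature;
    }
}
```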

One exception is the Aggregator.  We can only run a single aggregator at a time, and it has to wade through and aggregate all of the experimental results created by the entire grid.  It needs to be fast, or it will fall too far behind and there will be huge latency before the recordbook results get tallied.

The aggregator is slow for a few reasons.  First, it has to download, and then periodically upload, the result summary files from S3, which are several megabytes each.  Second, it has to pull each individual result down from S3 and then delete it from S3.  So almost all of its interactions need to be authorized by the external authorizer, and it has lots of interaction with SQS and S3.  I think it makes sense to run the Aggregator from the Amazon Elastic Compute Cloud (EC2), and to use local (instead of remote) signing for the Aggregator's messages.  Hopefully, with the network and signing latencies removed, it can truly fly and scale to tens or hundreds of nodes generating results at a time.
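For reference, local signing just means computing the signature on the machine itself instead of asking the external authorizer to do it.  AWS requests in this era are authenticated with an HMAC-SHA1 over a canonical "string to sign", Base64-encoded, so a minimal local signer looks something like the sketch below.  The class name is mine, and in practice Typica handles this internally once you hand it the keys:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import javax.xml.bind.DatatypeConverter;

public class LocalMessageSigner {
    private final SecretKeySpec key;

    public LocalMessageSigner(String awsSecretKey) throws Exception {
        // The AWS secret key doubles as the HMAC key.
        this.key = new SecretKeySpec(awsSecretKey.getBytes("UTF-8"), "HmacSHA1");
    }

    public String sign(String stringToSign) throws Exception {
        // HMAC-SHA1 over the canonical request string, Base64-encoded.
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(key);
        byte[] raw = mac.doFinal(stringToSign.getBytes("UTF-8"));
        return DatatypeConverter.printBase64Binary(raw);
    }
}
```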

Update
Aggregator is now in the cloud.  I'm letting it bypass external authentication by giving it a credentials file (not in SVN) that contains the AWS access and secret keys.  This is OK because the cloud machine is under my control (not an option when distributing code to users). It looks like a 3-6x speedup, which is really good.  Uploads to and downloads from S3 are super fast, and we're processing about 3 records per second (instead of 1 record every 1.5-2 seconds).   Perhaps this can be parallelized if we add some locking around the file updates and farm the deletions off to another thread, as sketched below.
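A rough sketch of that parallelization idea: keep the summary-file update under a lock, and push the S3 deletions onto a single background thread so the main loop can move straight on to the next record.  The ResultStore interface here is hypothetical, standing in for our actual S3 wrapper:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for our S3 access layer.
interface ResultStore {
    String download(String key);
    void updateSummary(String result);
    void delete(String key);
}

public class ParallelAggregatorLoop {
    private final ExecutorService deleter = Executors.newSingleThreadExecutor();
    private final Object summaryLock = new Object();

    public void processResult(final ResultStore store, final String resultKey) {
        String result = store.download(resultKey);   // pull the individual result from S3

        synchronized (summaryLock) {                 // only one writer touches the summary files
            store.updateSummary(result);
        }

        deleter.submit(new Runnable() {              // deletion happens off the main loop
            public void run() {
                store.delete(resultKey);
            }
        });
    }

    public void shutdown() {
        deleter.shutdown();                          // drain any pending deletions on exit
    }
}
```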

Now that I have a permanent machine running in the cloud, we might as well move the database and servlet stuff there too. That's the next step. Then we're actually quite close to "prime time".

