
What's New June 2009 - Updating the cloud and more SQLite!

posted Jun 30, 2009, 7:48 PM by Brian Tanner   [ updated Jun 30, 2009, 10:14 PM ]
It has been a while since I wrote anything here.  I just learned a lot from reading some old posts, so that reassures me that posting my thoughts here is useful to at least one person ;)



Updating the Cloud

Amazon changed some of the rules for how signatures and communication need to happen with the queue (SQS) and SimpleDB services.  I invested several days of work looking at the updates to the Typica project, and then mirroring those updates in the aws-remote-signing project that I maintain.  It turned out to be a bit of work to get the code all working nicely, but now it does.
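
For context, the new scheme is the Signature Version 2 style of request signing: build a canonical string from the request and its sorted, URL-encoded query parameters, then HMAC-SHA256 it with your secret key.  Here is a minimal sketch of the idea in Java; this is not the actual aws-remote-signing code, and the parameter handling is simplified:

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.net.URLEncoder;
import java.util.Base64;
import java.util.Map;
import java.util.TreeMap;

// Rough sketch of Signature Version 2 style signing for SQS/SimpleDB requests.
// Not the actual aws-remote-signing code; details are illustrative.
public class SignatureSketch {
    static String sign(String method, String host, String path,
                       TreeMap<String, String> params, String secretKey) throws Exception {
        StringBuilder query = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {   // TreeMap keeps keys sorted
            if (query.length() > 0) query.append('&');
            query.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        // Canonical string to sign: method, host, path, and the sorted query string.
        String toSign = method + "\n" + host + "\n" + path + "\n" + query;
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(secretKey.getBytes("UTF-8"), "HmacSHA256"));
        return Base64.getEncoder().encodeToString(mac.doFinal(toSign.getBytes("UTF-8")));
    }

    private static String encode(String s) throws Exception {
        // V2 wants RFC 3986 style encoding, which URLEncoder only approximates.
        return URLEncoder.encode(s, "UTF-8").replace("+", "%20").replace("*", "%2A").replace("%7E", "~");
    }
}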

SQLite Implementation

Implementation Choices

In a recent post I talked about some ideas for using SQLite to store experiment results.  I've gone ahead with that, and it is working quite well.  The SQLite database has three tables: one for agents, one for experiments, and one for results.

The agent and experiment tables list all of the different parameter combinations that have been used, while the results table just links to an agent and an experiment and includes a blob that is a serialized ResultRecord.
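
For concreteness, here is roughly the shape I have in mind, sketched with the sqlite-jdbc driver.  The table and column names are illustrative guesses, not the project's actual schema:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hypothetical shape of the three tables; names and columns are illustrative.
public class SchemaSketch {
    public static void main(String[] args) throws Exception {
        Connection db = DriverManager.getConnection("jdbc:sqlite:results.db");
        Statement s = db.createStatement();
        // One row per distinct agent parameter combination.
        s.execute("CREATE TABLE IF NOT EXISTS agents (id INTEGER PRIMARY KEY, params TEXT)");
        // One row per distinct experiment/environment parameter combination.
        s.execute("CREATE TABLE IF NOT EXISTS experiments (id INTEGER PRIMARY KEY, params TEXT)");
        // One row per run: links to an agent and an experiment, plus a serialized ResultRecord.
        s.execute("CREATE TABLE IF NOT EXISTS results ("
                + "agent_id INTEGER, experiment_id INTEGER, record BLOB)");
        db.close();
    }
}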

This plan works well because it keeps the aggregator very fast (no complex data processing required), and the data is unaltered between when it is generated and when it sits in the database.  To make the data easy to work with, there is some new code that the Viewer can use to create a new "QueryTown" database from the results database.  It unpacks all of the parameters into columns in a way that is easy to query, and summarizes the ResultRecords into summary statistics (instead of full run statistics).  We can then do our queries and maxes in SQL if we like, and look up the full run details of the winners (for drawing learning curves).  This code is all new, but it seems to be working.
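
A sketch of that transformation, assuming (purely for illustration) that the agent parameters are stored as "name=value" text and that ResultRecord knows how to deserialize itself; the real classes are richer than these stand-ins:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

// Sketch of deriving a flat "QueryTown" summary table from the raw results table.
public class QueryTownSketch {
    // Minimal stand-in for the real serialized record.
    static class ResultRecord {
        double meanReturn;
        static ResultRecord deserialize(byte[] bytes) { return new ResultRecord(); }
    }

    static void build(Connection raw, Connection queryTown) throws SQLException {
        queryTown.createStatement().execute(
            "CREATE TABLE summary (agent_id INTEGER, experiment_id INTEGER, "
          + "alpha REAL, lambda REAL, mean_return REAL)");
        PreparedStatement insert = queryTown.prepareStatement(
            "INSERT INTO summary VALUES (?, ?, ?, ?, ?)");
        // Join each run back to its agent so its parameters can become plain columns.
        ResultSet rows = raw.createStatement().executeQuery(
            "SELECT r.agent_id, r.experiment_id, a.params, r.record "
          + "FROM results r JOIN agents a ON a.id = r.agent_id");
        while (rows.next()) {
            Map<String, Double> p = parseParams(rows.getString("params"));
            ResultRecord rec = ResultRecord.deserialize(rows.getBytes("record"));
            insert.setInt(1, rows.getInt("agent_id"));
            insert.setInt(2, rows.getInt("experiment_id"));
            insert.setDouble(3, p.get("alpha"));      // parameters unpacked into columns
            insert.setDouble(4, p.get("lambda"));
            insert.setDouble(5, rec.meanReturn);      // summary statistic instead of full run data
            insert.executeUpdate();
        }
    }

    static Map<String, Double> parseParams(String text) {
        Map<String, Double> params = new HashMap<String, Double>();
        for (String pair : text.split(",")) {
            String[] kv = pair.split("=");
            params.put(kv[0], Double.parseDouble(kv[1]));
        }
        return params;
    }
}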

Size Issues

Current/Previous Situation

I started running some real experiments: a simple agent with 1000 different parameterizations (I want to move that to about 20 000 soon), running on 5 different environments, for 50 000 steps each.  I was storing every episode completion point in an EpisodeEndPointRunRecord.  It turns out that for some problems (cart-pole), bad agents complete LOTS of episodes, which made the run records very large.  Also, although I had trimmed the metadata stored with each run record (who ran it, how long it took, what date/time it was run, etc.), there was still about 40 bytes of header information per record.

Take one sample database for example.

Size: 245 MB
ResultRecord Table
------------------------------
[From Combined Part of Stats, Indices Included]
Number of entries..................... 89540    
Bytes of storage consumed............. 254031872 [2.77 KB per entry spread across indices and stuff]

[From W/O Indices Part of Stats]

Number of entries..................... 44770    
Bytes of storage consumed............. 253603840 [241 MB]
Bytes of payload...................... 243822326  
Average payload per entry............. 5446.11   [5.32 KB]

So, on average, each result record is 5.32 KB.  We have one result record for each run, which means if we're doing 30 runs to get statistics, then 89540 entries is about 3000 experiments.  Remember I was talking about 20 000 experiments x 5 environments x 30 runs, which works out to more like 15.2 GB of database.  That's a lot of data to shuffle around over the network, for the aggregators to flush, and for SQLite to handle.  And we potentially want to be able to scale even larger... and 7 GB files are really starting to make me want to split into multiple databases (lots of work), etc.

The previous results were on a particular environment.  There are other ones that are even worse.

Improvements

I decided to look carefully at whether I could store much less data and reduce the overhead per entry.  Instead of logging each episode completion, I now log how many episodes have finished and the return at a fixed, roughly exponential set of checkpoints: only 14 checkpoints in 50 000 steps, at steps like 64, 256, 1024, 2048, 4096, etc. (it's not strictly exponential later in the series).  Now cart-pole, which runs thousands of episodes, takes the same space to store as mountain car, which finishes no episodes.  I could save a bit more space by not even storing empty checkpoints, but that's a bad idea: it would force us to interpolate and possibly miss interesting things.
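
The recording logic is tiny.  A sketch (the checkpoint list below is illustrative and shorter than the real 14-entry one):

// Sketch of recording data only at a fixed set of checkpoints instead of at every
// episode end.  The schedule here is illustrative; the real one has 14 entries
// and is not strictly exponential.
public class CheckpointSketch {
    static final int[] CHECKPOINTS = {64, 256, 1024, 2048, 4096, 8192, 16384, 50000};

    int nextCheckpoint = 0;
    final int[] episodesAt = new int[CHECKPOINTS.length];
    final double[] returnAt = new double[CHECKPOINTS.length];

    int episodesSoFar = 0;
    double returnSoFar = 0.0;

    // Called once per environment step.
    void step(int stepNumber, double reward, boolean episodeEnded) {
        returnSoFar += reward;
        if (episodeEnded) episodesSoFar++;
        // Snapshot the running totals whenever we cross a checkpoint.
        while (nextCheckpoint < CHECKPOINTS.length && stepNumber >= CHECKPOINTS[nextCheckpoint]) {
            episodesAt[nextCheckpoint] = episodesSoFar;
            returnAt[nextCheckpoint] = returnSoFar;
            nextCheckpoint++;
        }
    }
}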

I used a few tricks to shrink the metadata: an unsigned short to represent the userid of the person who ran the experiment (2 bytes instead of the 128 bits of a full UUID), storing the runtime to the nearest minute instead of the nearest millisecond, etc.  The meta information went from 40 bytes to 9 bytes.  I also reused some of these tricks inside the data records, like storing a joint record for return and episodes instead of 2 separate records (so I don't have to store the checkpoints and metadata twice), and using a coding scheme for the checkpoints (checkpoint indices from a global list).  This should make each checkpoint only 7 bytes.
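
To make the byte counts concrete, here is a sketch of how such a packed record could be written.  The exact field breakdown (2 + 2 + 4 + 1 bytes of header, 1 + 2 + 4 per checkpoint) is my guess at how the quoted sizes add up, not the project's actual format:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of the compact encoding: a 9-byte header plus 7 bytes per checkpoint.
public class CompactRecordSketch {
    static byte[] encode(int userId, int runtimeMinutes, int dayStamp,
                         byte[] checkpointIndices, short[] episodes, float[] returns)
            throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);

        // 9-byte header: 2 (user id) + 2 (runtime, minutes) + 4 (date stamp) + 1 (flags).
        out.writeShort(userId);           // unsigned short instead of a 128-bit UUID
        out.writeShort(runtimeMinutes);   // nearest minute instead of milliseconds
        out.writeInt(dayStamp);
        out.writeByte(0);                 // spare/flags byte

        // 7 bytes per checkpoint: 1 (index into a global checkpoint list) + 2 + 4.
        for (int i = 0; i < checkpointIndices.length; i++) {
            out.writeByte(checkpointIndices[i]);
            out.writeShort(episodes[i]);
            out.writeFloat(returns[i]);
        }
        return bytes.toByteArray();       // 9 + 14 * 7 = 107 bytes for 14 checkpoints
    }
}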

So, storing 14 checkpoints plus metadata now takes 9 bytes + 14 * 7 bytes = 107 bytes, instead of the potentially many KB from before.  So, I reran some experiments to see what the database looks like now.
Size: 794 KB
ResultRecord Table (Improved)
------------------------------
[From Combined Part of Stats, Indices Included]
Number of entries..................... 7570   
Bytes of storage consumed............. 594944   [78.6 bytes per entry spread across indices and stuff]

[From W/O Indices Part of Stats]

Number of entries..................... 3785     
Bytes of storage consumed............. 560128   (547 KB)
Average payload per entry............. 125.00    (125 Bytes!)
Maximum payload per entry............. 125      

So, good news here.  First, the average and the maximum payload are the same, so every record is exactly 125 bytes.  Second, we've gone from an average of more than 5 KILOBYTES to 125 BYTES!  That's roughly a 43x reduction in size.  Very good stuff.

And, wonders never cease, it looks like the DB can be gzipped down to about 1/9 of its size.  That's not as great as it sounds, because gzipping takes time and effort, but for the cloud stuff it might make sense.

Let's take a look at our hypothetical experiment: 20 000 experiments x 5 environments x 30 runs (3 million records).  With these new numbers, that would be more like a 357 MB database.  Much more manageable!

It would be nice to get that even smaller.

Looking Ahead: Saving Time and Effort

The biggest bottleneck with large SQLite databases is the aggregator.  Every few minutes (how often it flushes is configurable), the aggregator has to copy its latest SQLite database back to the remote file system, whether that's actually just a file on the local machine or an S3 bucket.  Copying back to the local filesystem isn't too bad (my machine can do a file copy at about 25 MB/second, so a 2 GB file takes about 80 seconds); this isn't going to kill us outright.  While we're at it: gzipping that 2 GB file takes 150 seconds, but the resulting file is only 385 MB.  So, in cloud mode, we should do this (maybe we do already?).
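
Gzipping the flush output is easy to bolt on; a sketch (file names are placeholders):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.zip.GZIPOutputStream;

// Sketch of compressing the aggregator's SQLite file before shipping it to S3
// (or copying it to the shared filesystem).  Paths are placeholders.
public class GzipFlushSketch {
    public static void main(String[] args) throws Exception {
        try (FileInputStream in = new FileInputStream("results.db");
             GZIPOutputStream out = new GZIPOutputStream(new FileOutputStream("results.db.gz"))) {
            byte[] buffer = new byte[64 * 1024];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);   // stream the DB through gzip in 64 KB chunks
            }
        }
        // In cloud mode the .gz file would then be uploaded to the S3 bucket.
    }
}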

However, if we are in cloud mode, the aggregator has to transfer the files back to Amazon S3.  Even if our aggregator node is running on EC2, this seems like it might take much longer than a copy operation.  If our node isn't in the cloud, then it can take a long time to upload 2 GB.

There are some choices.  We could move to a mixed-mode implementation, where the aggregated database is stored in a local file system.  If we are using EC2, we can even store it on an elastic (EBS) volume, which is as quick as disk access.  If we cared, we could even use RAID striping for speed or a multi-level RAID for safety.  This solution is attractive because it's sort of easy, but it is unattractive because we sacrifice some of our advantages, like having a SQLite database on S3 that you could in theory just download with a viewer at any time.

Another option is to go with a multi-file solution.  We could keep the agent/experiment details in a single database, and then put the results in separate databases.  This would allow the aggregator to start a new results database whenever the existing one reaches a size limit.  We'd want to consolidate these at some point: there is always the "main" DB and the "new" DB(s), and at some interval you dump the new ones into the main one and repeat.  This is because you can't just transparently use all of the DBs with SQLite at once.
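
The periodic dump itself is simple in SQLite via ATTACH; a sketch, assuming the new and main files share the schema sketched earlier (file names are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch of folding a "new" results database into the main one with ATTACH.
public class MergeSketch {
    public static void main(String[] args) throws Exception {
        Connection main = DriverManager.getConnection("jdbc:sqlite:main_results.db");
        Statement s = main.createStatement();
        s.execute("ATTACH DATABASE 'new_results.db' AS fresh");
        s.execute("INSERT INTO results SELECT * FROM fresh.results");
        s.execute("DETACH DATABASE fresh");
        main.close();
    }
}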

Not sure what is best to do.  The good news, I guess, is that with all of the space savings I've created, I won't hit this for a little while.

Open Problem : Multiple Aggregators

I'd still like to make progress on the multiple-aggregator problem.  I'd like to be able to have multiple aggregators, on multiple computers, and have the aggregation be much faster.  This is less important now that all of the data I'm personally using will fit in an SQS message (which can be processed super fast), but in the future it might matter.  Also, I think the inserts get slower as the database gets really big.

One solution could be that, instead of downloading the existing database and flushing it occasionally, each aggregator starts its own NEW database and flushes that regularly.  Every node we ever run has a unique(ish) ID, so they can all just run their own aggregations.  Then, a master aggregator can occasionally download the bunch of them, figure out which Agent and Experiment tuples are actually the same between databases, and merge all of the ResultRecords together.  In theory you could even do this in the viewer... although that doesn't seem as attractive.  This is actually pretty easy, solves all of the problems I can currently think of, and is only a tiny bit of work.  Next time I run into the aggregation-too-slow problem I will code this up.
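
A sketch of what that master merge could look like, matching agents by their parameter text so that identical combinations from different nodes collapse to one row.  The schema and the text-keyed matching are carried over from the earlier illustrative sketches, and experiments would be remapped the same way:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

// Sketch of a master aggregator merging per-node databases into one.
public class MasterMergeSketch {
    static void mergeNodeDb(Connection master, Connection node) throws SQLException {
        // Map each node-local agent id to the matching (or newly inserted) master id.
        Map<Integer, Integer> agentIdMap = new HashMap<Integer, Integer>();
        ResultSet agents = node.createStatement().executeQuery("SELECT id, params FROM agents");
        while (agents.next()) {
            agentIdMap.put(agents.getInt("id"), masterAgentId(master, agents.getString("params")));
        }
        // Copy the result records across, rewriting the agent ids as we go.
        PreparedStatement copy = master.prepareStatement(
            "INSERT INTO results (agent_id, experiment_id, record) VALUES (?, ?, ?)");
        ResultSet results = node.createStatement()
            .executeQuery("SELECT agent_id, experiment_id, record FROM results");
        while (results.next()) {
            copy.setInt(1, agentIdMap.get(results.getInt("agent_id")));
            copy.setInt(2, results.getInt("experiment_id")); // experiments remapped the same way in practice
            copy.setBytes(3, results.getBytes("record"));
            copy.executeUpdate();
        }
    }

    static int masterAgentId(Connection master, String params) throws SQLException {
        PreparedStatement find = master.prepareStatement("SELECT id FROM agents WHERE params = ?");
        find.setString(1, params);
        ResultSet existing = find.executeQuery();
        if (existing.next()) return existing.getInt("id");
        PreparedStatement insert = master.prepareStatement("INSERT INTO agents (params) VALUES (?)");
        insert.setString(1, params);
        insert.executeUpdate();
        ResultSet key = master.createStatement().executeQuery("SELECT last_insert_rowid()");
        key.next();
        return key.getInt(1);
    }
}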
