Queueing Up Experiments

We're going to use the Amazon SQS (Simple Queuing Service) to queue up parameter configurations.  I picture it working somewhat like this:

First Thoughts

Somewhere out in the world, a person chooses an experiment,  an agent (lets say they are both in the recordbook already), and some parameters to run.  This person runs a java program that is very similar to the permutation creating part of the old AbstractExperiment.  This program creates a series of agent configuration descriptions.  These get sent into a Queue that we'll call ExperimentsToRun, or E2R.  I'm not sure yet whether these will go into the Queue like "Configuration X :: 30 trials", or 30 instances of "Configuration X :: 1 trial", or maybe a hybrid approach.  I like round robin better than getting 30 trials done first, because round robin allows you to more quickly get a picture for what the results are looking like.

Now, we also have some machines configured to listen on that Queue.  These  may be local University machines, or laptops, or machines running on the Amazon Elastic Compute Cloud.  I hope we can turn any machine with Java 5 into an RL compute node, but EC2 does make it easier.

Anyways, these machines are just sitting in a while loop polling the Queue.  When there is data available, they pull a job off the Queue and run the experiment.  We would prefer if we weren't creating *too* many data files (because we create one every time we switch experiments or agents), so we might want to use a Queue for each experiment/agent combination and have different compute nodes listening on different Queues.  Or, maybe we just accept that we're going to have many small files and make it somebody's job to aggregate the files.

That makes sense, we'll use a simple Amazon database such that whenever a file is saved and uploaded to S3, it is described in the database with attributes for the experiment and agent.  The program can then do a quick query of the DB every once in a while and put a message in the aggregation Queue to save that some things need to be aggregated.  Something like that.

Further Thoughts

The "many small files" problem has been bugging me.  The current system of aggregating results into .index and .data files makes the most sense when a single node is doing substantial work on a particular experiment/agent pair.  With a single work queue, any particular worker may be frequently switching between agents and experiments, meaning that we either have to aggregate files locally and upload them occasionally, or we have to upload many many tiny files and then have someone else aggregate them into .index and .data files.  An alternative is to this strategy would be to have multiple Queues, one for each experiment-agent pair.


I think this might actually be better, just have to toss the idea of writing and sending files in the old sense and find a better way to communicate data.  A central aggregator could always have all the index files in memory and could very very quickly aggregate from multiple sources at once. 

Questions are raised about whether the index file is even important?  Maybe we can just dump all of the results to the results file?

Update: The index file captures meta-data about an experiment, data that you'd prefer not to store in every single run record.  This data includes:
  • resultConfigurationKey
  • paramSummary
  • agentName
  • agentParameterHolder
Upon further reflection, it seems that perhaps this information is in fact redundant.  In one "set" of experimental results, we will only have a single Experiment-Agent combination.  As long as resultConfigurationKey uniquel identifies Experiment-Agent-Parameters, then we can use that as the key to all of the results within that set of results.  The paramSummary is required.  The agentName and agentParameterHolder are not *required*, because they are covered by the paramSummary or the fact that all of the results in this set are for the same Experiment-Agent combination.  The only danger in here is that if we "lose" which experiment this is from, it will be hard to retrieve it.  While this might be statistically unlikely, it worries me.

I propose that we remove the index records, and instead store results record data of the form:
<resultConfigurationKey, paramSummary, runRecords.Size(), RunRecord[1], RunRecord[2], ... , RunRecord[runRecords.Size()]>

Then, we will upload data files either in groups or as singles, but they will be a mix of all sorts of Result Records.  The aggregator can iterate through the file and sort + sift the result records on at a time and append them to the appropriate data file.  Yay.

Update: I must be dumb.  How are we going to store data like that?  We're not going to be writing more than a single run record at a time.  It will be more like:
<resultConfigurationKey, paramSummary, RunRecord[i]>
<resultConfigurationKey, paramSummary, RunRecord[j]>

The big question I have here is how big are the RunRecords vs. the resultConfigurationKey and the paramSummary.  We already store the resultConfigurationKey in the runRecord, so adding the paramSummaries adds maybe 128 chars.  Not too bad. If an actual runrecord contains 10 000 episodes, then the data payload is 10 000 x 4 bytes ~= 40 KBytes per run record.   Clearly 40 KB > 128 B. As long as the run record is more than 32, we're spending more on the run record than on the paramSummary overhead.  Let's not forget Gzipping.

Maybe a better way to pose the problem is this: If we were running 2000 parameter combinations, 30 trials of each, what is the difference in file size between both storage techniques?  The data size per record will be constant, so we have:

2000 * 128 bytes vs. 2000 * 30 * 128 bytes ==> 250 KB vs. 7.3 MB.  That's actually pretty substantial.

This is especially important because the aggregator doesn't have the ability to append to something in the cloud, it has to download the data file from the cloud, append do it, then upload it back.  You can optimize by uploading it infrequently if you are the only aggregator, but still.  We want to make these uploads and downloads as cheap as possible.  We should chunk those files in any case to save on data transfer costs.

This would be true if we had a trustworthy way of generating the resultConfigurationKey.  Yeah, I don't immediately see what it would be a problem.  Basically, we need a few keys.  We need a way to uniquely identify Experiment-Agent pairs.  That will tell us where to store the data.  Then, in there, we just need resultConfigurationKeys.  These will allow us to differentiate each configuration within an Experiment-Agent pair.  We have the option of also making the resultConfigurationKeys globally unique, although maybe who cares?

Multiple Queues

With one queue for each experiment-agent pair, each compute node is basically "subscribing" to a particular set of experimental configurations.  This means that all of the data that they aggregate will be from the same EA pair, which means they can gather a medium sized data file before sending it off to the server.  Note, on the server side, we'll still have to do some later aggregation. 

This method would allow a compute node that is appropriately configured (memory, multiple cores, special libraries) to only handle events that it is specialized for, and similarly allow compute nodes from not taking jobs they can't handle.

In the cloud environment, you can easily imagine creating new nodes when they are required, and then terminating them when the corresponding queues are empty.

For private experiments, you might actually want the queues to be triple indexed, by Experient-Agent-User.  This way, when I'm doing experiments for my upcoming paper, I can have my local resources listen for the Experiment-Agent-btanner queues, and do my experiments privatey.