We're going to use Amazon SQS (Simple Queue Service) to queue up parameter configurations. I picture it working somewhat like this.
Now, we also have some machines configured to listen on that Queue. These may be local University machines, or laptops, or machines running on the Amazon Elastic Compute Cloud. I hope we can turn any machine with Java 5 into an RL compute node, but EC2 does make it easier.
Anyways, these machines are just sitting in a while loop polling the Queue. When there is data available, they pull a job off the Queue and run the experiment. We would prefer if we weren't creating *too* many data files (because we create one every time we switch experiments or agents), so we might want to use a Queue for each experiment/agent combination and have different compute nodes listening on different Queues. Or, maybe we just accept that we're going to have many small files and make it somebody's job to aggregate the files.
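A minimal sketch of that compute-node loop, assuming the job is just a serialized parameter-configuration string. An in-memory BlockingQueue stands in for SQS here; a real node would poll SQS instead, but the shape of the loop is the same:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of the compute-node polling loop. The BlockingQueue stands in
// for the SQS queue; "runExperiment" is a placeholder for the real work.
public class ComputeNode {

    private final BlockingQueue<String> jobQueue;

    public ComputeNode(BlockingQueue<String> jobQueue) {
        this.jobQueue = jobQueue;
    }

    /** Poll until the queue stays empty for idleSeconds, running each job.
        Returns how many jobs were run. */
    public int pollAndRun(int idleSeconds) throws InterruptedException {
        int jobsRun = 0;
        while (true) {
            String job = jobQueue.poll(idleSeconds, TimeUnit.SECONDS);
            if (job == null) break; // queue drained: node can go idle or shut down
            runExperiment(job);
            jobsRun++;
        }
        return jobsRun;
    }

    private void runExperiment(String paramConfiguration) {
        // Placeholder: decode the parameters and run the actual RL experiment.
        System.out.println("running: " + paramConfiguration);
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
        queue.add("alpha=0.1,epsilon=0.05");
        queue.add("alpha=0.2,epsilon=0.05");
        int ran = new ComputeNode(queue).pollAndRun(1);
        System.out.println("jobs run: " + ran);
    }
}
```

With SQS you'd also want to delete each message only after the experiment finishes, so a crashed node's job becomes visible again for some other node.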
That makes sense. We'll use a simple Amazon database such that whenever a file is saved and uploaded to S3, it is described in the database with attributes for the experiment and agent. The program can then do a quick query of the DB every once in a while and put a message in the aggregation Queue to say that some things need to be aggregated. Something like that.
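Roughly, the bookkeeping could look like this. A HashMap stands in for the Amazon database and a List for the aggregation queue (both are stand-ins for illustration), and the threshold-based sweep is an assumption about when aggregation is worth triggering:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: record each uploaded file with its experiment/agent attributes,
// then periodically sweep and emit one aggregation message per pair that
// has enough un-aggregated files.
public class UploadIndex {

    // file name -> "experiment/agent" attribute pair
    private final Map<String, String> files = new HashMap<String, String>();

    public void recordUpload(String fileName, String experiment, String agent) {
        files.put(fileName, experiment + "/" + agent);
    }

    /** One aggregation message per experiment/agent pair with at least
        `threshold` recorded files. */
    public List<String> sweep(int threshold) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String pair : files.values()) {
            Integer c = counts.get(pair);
            counts.put(pair, c == null ? 1 : c + 1);
        }
        List<String> messages = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= threshold) {
                messages.add("aggregate:" + e.getKey());
            }
        }
        return messages;
    }
}
```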
Questions are raised about whether the index file is even important. Maybe we can just dump all of the results to the results file?
Update: The index file captures meta-data about an experiment, data that you'd prefer not to store in every single run record. This data includes:
I propose that we remove the index records, and instead store results record data of the form:
<resultConfigurationKey, paramSummary, runRecords.Size(), RunRecord, RunRecord, ... , RunRecord[runRecords.Size()]>
Then, we will upload data files either in groups or as singles, but they will be a mix of all sorts of Result Records. The aggregator can iterate through the file and sort + sift the result records one at a time and append them to the appropriate data file. Yay.
Update: I must be dumb. How are we going to store data like that? We're not going to be writing more than a single run record at a time. It will be more like:
<resultConfigurationKey, paramSummary, RunRecord[i]>
<resultConfigurationKey, paramSummary, RunRecord[j]>
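A minimal sketch of that one-record-per-line layout. The tab-separated encoding is an assumption (nothing above pins down a serialization), and the run-record payload is just an opaque string here:

```java
// One result record per line: <resultConfigurationKey, paramSummary, RunRecord>.
// Tab-separated fields are an assumed encoding for illustration.
public class ResultRecord {
    public final String resultConfigurationKey;
    public final String paramSummary;
    public final String runRecord; // serialized run data, opaque here

    public ResultRecord(String key, String summary, String run) {
        this.resultConfigurationKey = key;
        this.paramSummary = summary;
        this.runRecord = run;
    }

    public String toLine() {
        return resultConfigurationKey + "\t" + paramSummary + "\t" + runRecord;
    }

    public static ResultRecord fromLine(String line) {
        // Limit 3 so tabs inside the run data (if any) survive the split.
        String[] parts = line.split("\t", 3);
        return new ResultRecord(parts[0], parts[1], parts[2]);
    }
}
```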
The big question I have here is how big the RunRecords are vs. the resultConfigurationKey and the paramSummary. We already store the resultConfigurationKey in the RunRecord, so adding the paramSummaries adds maybe 128 chars. Not too bad. If an actual RunRecord contains 10 000 episodes, then the data payload is 10 000 x 4 bytes ~= 40 KBytes per run record. Clearly 40 KB > 128 B. As long as the run record is more than 32 episodes long (32 x 4 bytes = 128 bytes), we're spending more on the run record than on the paramSummary overhead. Let's not forget Gzipping.
Maybe a better way to pose the problem is this: If we were running 2000 parameter combinations, 30 trials of each, what is the difference in file size between both storage techniques? The data size per record will be constant, so we have:
2000 * 128 bytes vs. 2000 * 30 * 128 bytes ==> 250 KB vs. 7.3 MB. That's actually pretty substantial.
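That arithmetic can be sanity-checked in a couple of lines (the 128-byte paramSummary overhead is the estimate from above):

```java
// Back-of-the-envelope comparison of the two storage schemes:
// store the paramSummary once per configuration vs. once per run record.
public class SizeEstimate {

    /** Overhead when the summary is stored once per configuration. */
    public static long perConfigOverhead(int configs, int overheadBytes) {
        return (long) configs * overheadBytes;
    }

    /** Overhead when the summary is repeated in every run record. */
    public static long perRunOverhead(int configs, int trials, int overheadBytes) {
        return (long) configs * (long) trials * overheadBytes;
    }

    public static void main(String[] args) {
        long once = perConfigOverhead(2000, 128);  // 256,000 B ~= 250 KB
        long each = perRunOverhead(2000, 30, 128); // 7,680,000 B ~= 7.3 MB
        System.out.println(once + " bytes vs. " + each + " bytes");
    }
}
```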
This is especially important because the aggregator doesn't have the ability to append to something in the cloud; it has to download the data file from the cloud, append to it, then upload it back. You can optimize by uploading it infrequently if you are the only aggregator, but still. We want to make these uploads and downloads as cheap as possible. We should chunk those files in any case to save on data transfer costs.
This would be true if we had a trustworthy way of generating the resultConfigurationKey. Yeah, I don't immediately see why it would be a problem. Basically, we need a few keys. We need a way to uniquely identify Experiment-Agent pairs. That will tell us where to store the data. Then, in there, we just need resultConfigurationKeys. These will allow us to differentiate each configuration within an Experiment-Agent pair. We have the option of also making the resultConfigurationKeys globally unique, although maybe who cares?
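One trustworthy way to generate a resultConfigurationKey: hash a canonical (sorted) rendering of the parameter settings, so the same configuration always produces the same key no matter what order the parameters arrive in. SHA-1 is just an assumed choice of digest here; any stable hash would do:

```java
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

// Sketch: derive a resultConfigurationKey deterministically from the
// parameter settings by hashing a canonical, order-independent rendering.
public class ConfigKey {

    public static String keyFor(Map<String, String> params) throws Exception {
        // TreeMap sorts by key, so insertion order can't change the result.
        StringBuilder canonical = new StringBuilder();
        for (Map.Entry<String, String> e : new TreeMap<String, String>(params).entrySet()) {
            canonical.append(e.getKey()).append('=').append(e.getValue()).append(';');
        }
        byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest(canonical.toString().getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

This also gives us global uniqueness essentially for free, since two different configurations collide only if the hash does.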
This method would allow a compute node that is appropriately configured (memory, multiple cores, special libraries) to handle only the jobs it is specialized for, and similarly keep compute nodes from taking jobs they can't handle.
In the cloud environment, you can easily imagine creating new nodes when they are required, and then terminating them when the corresponding queues are empty.
For private experiments, you might actually want the queues to be triple indexed, by Experiment-Agent-User. This way, when I'm doing experiments for my upcoming paper, I can have my local resources listen for the Experiment-Agent-btanner queues, and do my experiments privately.
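The triple-indexed naming is just string assembly; "-" as the separator and falling back to the shared Experiment-Agent queue when no user is given are both assumptions:

```java
// Sketch of the queue-naming scheme: Experiment-Agent for shared queues,
// Experiment-Agent-User for private ones.
public class QueueNames {

    public static String queueFor(String experiment, String agent, String user) {
        String base = experiment + "-" + agent;
        return (user == null || user.length() == 0) ? base : base + "-" + user;
    }
}
```

For example, `queueFor("MountainCar", "Sarsa", "btanner")` yields the private queue name `MountainCar-Sarsa-btanner`, while passing null for the user yields the shared `MountainCar-Sarsa` queue.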