We don't want a single point of failure where, if some bug gets introduced into the code one day, we lose years of records. We're storing the data in Amazon's Simple Storage Service (S3), which costs 15 cents per GB-month stored. The data grows quickly, but appears to be quite compressible. I would expect my ICML 2008 data (1800 experiments x 30+ trials) to require about 100 MB compressed. This is cheap. Saving multiple copies is not a bad idea in the short term. But over time this will start to get expensive, and we'll want to think of something smarter.
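To make "this is cheap" concrete, here is a back-of-the-envelope calculation using the quoted price of 15 cents per GB-month and the ~100 MB estimate (the class and method names are just illustrative):

```java
public class StorageCost {
    // S3 price quoted above: 15 cents per GB-month stored.
    static final double DOLLARS_PER_GB_MONTH = 0.15;

    // Monthly cost in dollars for a given number of megabytes stored.
    static double monthlyCost(double megabytes) {
        return (megabytes / 1024.0) * DOLLARS_PER_GB_MONTH;
    }

    public static void main(String[] args) {
        // ~100 MB for the ICML 2008 data comes to roughly a cent and a half
        // per month, so even a few redundant copies stay cheap for a while.
        System.out.printf("%.4f dollars/month%n", monthlyCost(100.0));
    }
}
```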
I'm not sure SVN is the right thing, because the current plan is to store the data as serialized Java objects. Those files are binary, so we would probably end up storing a full new copy every time they change. One possible solution, which I've used before with some success, is as follows.
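The point about binary files is easier to see with a sketch of what "serialized Java objects" looks like in practice. The `TrialResult` class below is a hypothetical stand-in for the real result classes; any change to the object graph rewrites the whole binary file, which is why a delta-based store like SVN would likely keep full new copies anyway.

```java
import java.io.*;

// Hypothetical result record; the project's real classes are assumptions here.
class TrialResult implements Serializable {
    private static final long serialVersionUID = 1L;
    double[] returns;
    TrialResult(double[] returns) { this.returns = returns; }
}

public class SaveRun {
    // Serialize one run's results to a single opaque binary file.
    static void save(TrialResult r, File f) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new FileOutputStream(f))) {
            out.writeObject(r);
        }
    }

    // Read the run back; a byte-level diff of two such files is meaningless.
    static TrialResult load(File f) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new FileInputStream(f))) {
            return (TrialResult) in.readObject();
        }
    }
}
```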
First, let me define a computing run: a computing run consists of the results generated in a computing session. This could be a few trials, or one trial, or thousands of trials, on one parameter set or many. The main thing is that a computing run covers only a single agent in a single event. Now, if we save the results from each computing run as its own file, we get easy(ish) reproducibility. This approach will require us to process these smaller files to create larger summary files. That is ok, because it only has to happen when new results are added, and the summary step never writes to the original files, so there is no risk of ruining them. This seems like a good plan.
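A minimal sketch of the summarizing step might look like the following. The file layout is an assumption (one text file of per-trial returns per computing run, named `run-*.txt`); the key property is that the summary goes to a separate file and the run files are only ever read:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class Summarize {
    // Read every run file in runDir, compute the mean return across all
    // trials, and write a summary to a *separate* file. The originals are
    // never modified, so re-running this is always safe.
    static double summarize(Path runDir, Path summaryFile) throws IOException {
        List<Double> all = new ArrayList<>();
        try (DirectoryStream<Path> runs =
                Files.newDirectoryStream(runDir, "run-*.txt")) {
            for (Path run : runs)
                for (String line : Files.readAllLines(run))
                    all.add(Double.parseDouble(line.trim()));
        }
        double mean = all.stream()
                .mapToDouble(Double::doubleValue).average().orElse(Double.NaN);
        Files.writeString(summaryFile,
                "trials: " + all.size() + "\nmean: " + mean + "\n");
        return mean;
    }
}
```

Because the summary is derived purely from the run files, it can be deleted and regenerated at any time when new results arrive.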
Of course, we don't want to dump zillions of files into a single directory, so we'll create a directory for each agent under each event. This way, all of the results for an agent/event pair can be found in a single place.
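The directory scheme can be captured in one helper. The exact layout and file extension here are assumptions, not the project's settled convention:

```java
import java.io.File;

public class RunPaths {
    // Assumed layout: <root>/<event>/<agent>/run-<id>.ser
    // All results for an agent/event pair land in one directory.
    static File runFile(File root, String event, String agent, int runId) {
        File dir = new File(new File(root, event), agent);
        return new File(dir, "run-" + runId + ".ser");
    }
}
```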
Asking complicated questions across agents is a more advanced and, hopefully, less important feature. It would be nice to say "when the step sizes are the same, show me which epsilon does best for Sarsa and Q-learning". That question makes sense in this case, but in general there may be no easy mapping of parameters between agents. When there isn't, it may make more sense to do some manual aggregation on top of simpler queries.
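For the case where parameters do line up, the Sarsa/Q-learning question above could be answered by grouping summary rows by step size and keeping each agent's best epsilon per group. This is only a toy sketch; the `Row` fields and parameter names are assumptions, and as noted, two agents' parameters won't always map onto each other this cleanly:

```java
import java.util.*;

public class CrossAgentQuery {
    // Toy summary row: one (agent, step size, epsilon) cell with its score.
    record Row(String agent, double stepSize, double epsilon, double meanReturn) {}

    // Group rows by step size; within each group, keep each agent's
    // best-scoring row. Comparing agents is then just a matter of reading
    // off the step sizes that appear for both.
    static Map<Double, Map<String, Row>> bestEpsilonByStepSize(List<Row> rows) {
        Map<Double, Map<String, Row>> best = new TreeMap<>();
        for (Row r : rows) {
            best.computeIfAbsent(r.stepSize(), k -> new HashMap<>())
                .merge(r.agent(), r,
                       (a, b) -> a.meanReturn() >= b.meanReturn() ? a : b);
        }
        return best;
    }
}
```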