Part of the reason the code seems complicated is that persistence is tangled up with all of the data storage classes. Things like runKeys, agentKeys, and writing to files all live in the same code that stores the data. That makes things complicated and drags in all sorts of little concerns: running distributed, uniqueness, etc, etc.
I have also made some judgment calls previously about the point at which we store unprocessed data (arrays of RunRecords), and the point at which we process that into summary statistics we can use for queries.
Aside: Remember that every agent gets their own database of results. The good news is that having these as SQLite databases should allow us to do cross-database comparisons pretty easily.
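To make the cross-database point concrete, here is a minimal sketch of how SQLite's ATTACH DATABASE lets one connection query several per-agent databases at once. The table name, column names, and sample rewards are all hypothetical, just for illustration:

```python
import sqlite3
import os
import tempfile

# Hypothetical setup: two agents, each with their own results database.
tmpdir = tempfile.mkdtemp()
paths = [os.path.join(tmpdir, name) for name in ("agent1.db", "agent2.db")]
for path, reward in zip(paths, (10.0, 12.5)):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE results (run INTEGER, total_reward REAL)")
    db.execute("INSERT INTO results VALUES (1, ?)", (reward,))
    db.commit()
    db.close()

# Cross-database comparison: ATTACH the second database to the first
# connection, then query both in a single statement.
conn = sqlite3.connect(paths[0])
conn.execute("ATTACH DATABASE ? AS other", (paths[1],))
rows = conn.execute(
    "SELECT 'agent1', total_reward FROM main.results "
    "UNION ALL "
    "SELECT 'agent2', total_reward FROM other.results "
    "ORDER BY total_reward"
).fetchall()
print(rows)  # [('agent1', 10.0), ('agent2', 12.5)]
conn.close()
```

So the per-agent databases can stay separate on disk, and comparisons only need an ATTACH at query time.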
I'm going to try to forget what I've done previously and re-imagine the data storage system from scratch, given the following constraints and considerations.
These nodes will not bind keys. They will just describe agents and experiments by their class, jar, and bucket names; we will worry about the translation behind the scenes. No "param summaries" are created at this point either.
When the viewer node downloads the SQLite databases, perhaps it makes sense to "unpack" them into a larger, processed version that allows fast queries and the like. If we concede that this unpacking step will be necessary, we should ask how packed things should be in the aggregator phase. Maybe it makes sense to keep everything bundled together and relatively un-normalized there, and then normalize the database for quick querying at the viewer phase. I'm going to hash out a possible database schema at the aggregator level, see how that looks, and then think more on this.
It looks like SQLite does support multi-column primary keys though, so that will be useful sometimes.
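As a quick sanity check that multi-column primary keys behave as expected, here is a tiny sketch. The table and column names are just illustrative, not a committed schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Multi-column primary key: at most one row per (RunId, AgentConfigId) pair.
conn.execute("""
    CREATE TABLE run_results (
        RunId         INTEGER,
        AgentConfigId INTEGER,
        TotalReward   REAL,
        PRIMARY KEY (RunId, AgentConfigId)
    )
""")
conn.execute("INSERT INTO run_results VALUES (1, 1, 5.0)")
conn.execute("INSERT INTO run_results VALUES (1, 2, 6.0)")  # fine: new pair

# A duplicate (RunId, AgentConfigId) pair is rejected by the constraint.
duplicate_rejected = False
try:
    conn.execute("INSERT INTO run_results VALUES (1, 1, 7.0)")
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```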
<RunId, AgentName, AgentJar, AgentParams, ExperimentName, ExperimentParams, RunDetails,..., RunRecord>
We would bind the ROWID as the RunId. This would be very fast to aggregate, so why didn't we do it before?
First, it would be terribly redundant. Many of the attributes here would be duplicated for every row in the database, including most of the agent information, the experiment name, etc. Even among the attributes that vary between rows, the AgentParams would be duplicated for every trial of a given configuration, and likewise the ExperimentParams. Basically, this just feels pretty inefficient. Also, in the old days, we may have been bothered that we'd have to iterate through the whole multi-GB file just to get a list of all the parameters we have tried; it made it harder to process the file intelligently because we weren't sure what was in it.
Table 1 : Agent
<AgentConfigId, AgentName, AgentJar, AgentParams, etc>
Table 2 : Experiment
<ExperimentConfigId, ExperimentName, ExperimentParams, etc>
Table 3: Runs
<RunId, AgentConfigId, ExperimentConfigId, RunDetails, RunRecord>
If we wanted to be anal about the normalization we could add an Agent-Experiment relation table, but I'm not super keen on that. This will give us what we want at the aggregator phase, I think, with minimal redundancy.
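The three tables above could be sketched in SQLite roughly like this. The column types and the sample rows are hypothetical; the point is that each config is stored once and a join reassembles the flat view:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Agent (
        AgentConfigId INTEGER PRIMARY KEY,
        AgentName     TEXT,
        AgentJar      TEXT,
        AgentParams   TEXT
    );
    CREATE TABLE Experiment (
        ExperimentConfigId INTEGER PRIMARY KEY,
        ExperimentName     TEXT,
        ExperimentParams   TEXT
    );
    CREATE TABLE Runs (
        RunId              INTEGER PRIMARY KEY,
        AgentConfigId      INTEGER REFERENCES Agent(AgentConfigId),
        ExperimentConfigId INTEGER REFERENCES Experiment(ExperimentConfigId),
        RunDetails         TEXT,
        RunRecord          BLOB
    );
""")

# Hypothetical sample data: one agent config, one experiment config, two runs.
conn.execute("INSERT INTO Agent VALUES (1, 'Sarsa', 'sarsa.jar', 'stepSize=0.1')")
conn.execute("INSERT INTO Experiment VALUES (1, 'MountainCar', 'episodes=100')")
conn.execute("INSERT INTO Runs VALUES (1, 1, 1, 'seed=7', NULL)")
conn.execute("INSERT INTO Runs VALUES (2, 1, 1, 'seed=8', NULL)")

# The shared config rows are stored once; a join recovers the redundant view.
rows = conn.execute("""
    SELECT r.RunId, a.AgentName, e.ExperimentName
    FROM Runs r
    JOIN Agent a ON a.AgentConfigId = r.AgentConfigId
    JOIN Experiment e ON e.ExperimentConfigId = r.ExperimentConfigId
    ORDER BY r.RunId
""").fetchall()
print(rows)  # [(1, 'Sarsa', 'MountainCar'), (2, 'Sarsa', 'MountainCar')]
```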
Then, at the viewer phase, we might flatten everything into one big query table (QueryTown, say):

<AgentConfigId, ExperimentConfigId, AgentParam1, AgentParam2, ..., AgentParamN, ExperimentParam1, ExperimentParam2, ..., ExperimentParamN, ResultStatistic1, ResultStatistic2, ..., ResultStatisticN>
The result statistics might be very specific (we might create these tables for each viewer, dunno). One obvious statistic, consistent with previous work in the log book, would be total reward at the end of the experiment (averaged over all RunRecords that match this set of parameters).
So then you could write a query like (ugh, my SQL is rusty):
SELECT AgentConfigId, AgentParam1, MAX(ResultStatistic1) FROM QueryTown GROUP BY AgentParam1;

If AgentParam1 was StepSize and ResultStatistic1 was Total Reward, then this should give us the maximum reward for each step size. Someone more SQLish could improve on this I'm sure and find ways for us to be clever.
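Here is a runnable version of that query against a toy QueryTown table; the data is made up, with AgentParam1 playing the role of StepSize and ResultStatistic1 the role of total reward:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# QueryTown stands in for the flattened viewer-phase table.
conn.execute("""
    CREATE TABLE QueryTown (
        AgentConfigId    INTEGER,
        AgentParam1      REAL,   -- e.g. StepSize
        ResultStatistic1 REAL    -- e.g. total reward
    )
""")
conn.executemany("INSERT INTO QueryTown VALUES (?, ?, ?)", [
    (1, 0.1, 50.0),
    (2, 0.1, 65.0),
    (3, 0.5, 40.0),
    (4, 0.5, 55.0),
])

# Maximum reward for each step size. In SQLite, when a query uses a single
# MAX() aggregate, bare columns (AgentConfigId here) are taken from the row
# that achieved the maximum, which conveniently tells us which config won.
rows = conn.execute("""
    SELECT AgentConfigId, AgentParam1, MAX(ResultStatistic1)
    FROM QueryTown
    GROUP BY AgentParam1
    ORDER BY AgentParam1
""").fetchall()
print(rows)  # [(2, 0.1, 65.0), (4, 0.5, 55.0)]
```

Note that pulling the winning AgentConfigId out alongside a grouped MAX relies on an SQLite-specific convenience; in other engines this is the classic "groupwise maximum" problem and needs a subquery.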