SQLite/Database Plan for the Log Book

posted Feb 18, 2009, 10:57 AM by Brian Tanner   [ updated Feb 18, 2009, 1:08 PM ]
Part of the reason the code seems complicated is that persistence is wrapped up in all of the data storage classes.

I'm worried about things like runKeys, agentKeys, writing to files, and so on all living in the same code where we're storing the data.  This makes things complicated, and it raises all sorts of little concerns about running distributed, uniqueness, etc., etc.

I have also made some judgment calls previously about the point at which we store unprocessed data (arrays of RunRecords), and the point at which we process that into summary statistics that we can use for doing queries.

Aside: Remember that every agent gets their own database of results.  The good news is that having these as SQLite databases should allow us to do cross-database comparisons pretty easily.
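For the cross-database part, SQLite's ATTACH should do the trick.  A minimal sketch, with made-up file names (and assuming each database has a Runs table like the ones below):

ATTACH DATABASE 'agentA_results.db' AS a;
ATTACH DATABASE 'agentB_results.db' AS b;
-- tables from both files are now visible to a single query:
SELECT * FROM a.Runs UNION ALL SELECT * FROM b.Runs;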

I'm going to try to forget what I've done previously, and re-imagine the data storage system from scratch, given the following constraints and considerations.

Key-Binding

We should delay key-binding as much as possible, maybe?  What I mean is: don't assign a "key" to an agentname-agentjar-agentparams tuple until you need to.  Why bother?  The key might only need to be assigned when the tuple finally gets put into a results database on an aggregator node.  That only happens on ONE host; it's a bottleneck in the process.  That means we can just use auto-incrementing keys, which will save us a lot of effort and nervous coding making sure all these distributed nodes are creating UUIDs that match or are unique when they are supposed to be.
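In SQLite terms, I'm picturing something like this (table, column, and value names are just placeholders):

CREATE TABLE Agent (
  AgentConfigId INTEGER PRIMARY KEY,  -- aliases the ROWID, auto-assigned
  AgentName     TEXT,
  AgentJar      TEXT,
  AgentParams   TEXT
);

-- the aggregator never invents a key: omit the id column and
-- SQLite hands out the next ROWID on insert
INSERT INTO Agent (AgentName, AgentJar, AgentParams)
VALUES ('SomeAgent', 'someagent.jar', 'stepsize=0.1');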

Submitter Nodes

Submitter nodes should be able to submit "experiments" to be run, and each submission should specify an agent, agent parameters, an experiment, and experiment parameters.  The experiment parameters may actually be environment parameters, but I think a level of abstraction is better.  If we submit them as experiment parameters, then we can interpret them and do more powerful things if we want to.  Like, run experiment configuration "A", which may be a whole host of environment and data collection settings.

These nodes will not bind keys.  They will just describe agents and experiments by their class, jar, and bucket names.  We will worry about the translation behind the scenes.  No "param summaries" are created at this point either.

Compute Nodes

Compute nodes will pull experiment descriptions off of a Queue, as they do now.  I don't see a need for key binding here either.  These guys will just do the work they are supposed to, and then submit the results to the results queue, one at a time, as usual.

Aggregator Nodes

This is the place where I see us binding keys and moving into a SQLite database.  The issue that is problematic for me is whether we do any sort of data processing at this point, or just keep everything in binary blobs until it gets downloaded to a viewer node.  My first inclination is not to do any processing here.  This will keep aggregator nodes fast, and will keep the data pristine and untouched.

When the viewer node downloads the SQLite databases, perhaps it makes sense to "unpack" them into a larger, processed version that will allow us to do fast queries and such.  If we concede that this unpacking process will be necessary, we should wonder how packed things should be in the aggregator phase.  Maybe it makes sense to keep things all bundled together and relatively un-normalized there, and then to normalize the database for quick querying at the viewer phase.  I'm going to hash out a possible database schema at the aggregator level, see how that looks, and then think more on this.


Aggregator Node DB Schema

SQLite keeps a 64-bit ROWID for each row of each table.  This is what we will use as the key to the table whenever possible.  We can use auto-increment.  They say searches on the ROWID are usually about twice as fast as searches on a regular primary key or any other index.

It looks like SQLite does support multi-column primary keys though, so that will be useful sometimes.
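Concretely (these tables are just to illustrate the mechanics, not a proposal):

-- an INTEGER PRIMARY KEY column is an alias for the ROWID;
-- adding AUTOINCREMENT also prevents old ids from being reused
CREATE TABLE Example (
  ExampleId INTEGER PRIMARY KEY AUTOINCREMENT,
  Payload   BLOB
);

-- and a multi-column primary key:
CREATE TABLE Pairing (
  AgentConfigId      INTEGER,
  ExperimentConfigId INTEGER,
  PRIMARY KEY (AgentConfigId, ExperimentConfigId)
);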

Proposal 1

First shot would just be a single big pile of data, I guess.

<RunId, AgentName, AgentJar, AgentParams, ExperimentName, ExperimentParams, RunDetails,..., RunRecord>
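As a SQLite table, that might look like this (column types are guesses, and the RunRecord stays a binary blob):

CREATE TABLE Runs (
  RunId            INTEGER PRIMARY KEY,
  AgentName        TEXT,
  AgentJar         TEXT,
  AgentParams      TEXT,
  ExperimentName   TEXT,
  ExperimentParams TEXT,
  RunDetails       TEXT,
  RunRecord        BLOB
);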

We would bind the ROWID as the RunId.  This would be very fast to aggregate, so why didn't we do it before?

First, it would be terribly redundant.  Many of the attributes would be duplicated for every row in the database, including most of the agent information, the experiment name, etc.  Even for things that vary among rows, the AgentParams would be duplicated for every trial of a given configuration, and the same goes for the ExperimentParams.  Basically, this just feels pretty inefficient.  Also, in the old days, we may have been bothered that we'd have to iterate through the whole multi-GB file in order to get a list of all the parameters we have tried.  It made it harder to process the file intelligently because we weren't sure what was in it.

Proposal 2 : Normalize a Bit

This proposal tries to split the data up enough to cut down on redundancy, but the data isn't unpacked to the point where it is really easily queryable.  This one will mirror the existing setup quite a bit.

Table 1: Agent
<AgentConfigId, AgentName, AgentJar, AgentParams, etc>

Table 2: Experiment
<ExperimentConfigId, ExperimentName, ExperimentParams, etc>

Table 3: Runs
<RunId, AgentConfigId, ExperimentConfigId, RunDetails, RunRecord>

If we wanted to be anal about the normalization, we could add an Agent-Experiment relation table, but I'm not super keen on that.  This will give us what we want at the aggregator phase, I think, with minimal redundancy.
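In SQLite, something like this (types are guesses again):

CREATE TABLE Agent (
  AgentConfigId INTEGER PRIMARY KEY,
  AgentName     TEXT,
  AgentJar      TEXT,
  AgentParams   TEXT
);

CREATE TABLE Experiment (
  ExperimentConfigId INTEGER PRIMARY KEY,
  ExperimentName     TEXT,
  ExperimentParams   TEXT
);

CREATE TABLE Runs (
  RunId              INTEGER PRIMARY KEY,
  AgentConfigId      INTEGER REFERENCES Agent(AgentConfigId),
  ExperimentConfigId INTEGER REFERENCES Experiment(ExperimentConfigId),
  RunDetails         TEXT,
  RunRecord          BLOB
);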

Proposal 3: Unpacked Data

The "viewer" would download the three tables, Agent, Experiment, Runs.  Then, he might iterate over all of the runs making aggregate records. It's not clear if this would best be done iteratively using a single pass over the data, or by doing separate selects for each aggregate row.  Anyway, the aggregate rows look like:

Table 1: QueryTown
<AgentConfigId, ExperimentConfigId, AgentParam1, AgentParam2, ..., AgentParamN, ExperimentParam1, ExperimentParam2, ..., ExperimentParamN, ResultStatistic1, ResultStatistic2, ..., ResultStatisticN>
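With made-up parameter and statistic names filled in, that could look like:

CREATE TABLE QueryTown (
  AgentConfigId      INTEGER,
  ExperimentConfigId INTEGER,
  StepSize           REAL,     -- AgentParam1
  Epsilon            REAL,     -- AgentParam2
  NumEpisodes        INTEGER,  -- ExperimentParam1
  TotalReward        REAL,     -- ResultStatistic1, averaged over matching RunRecords
  FinalReward        REAL      -- ResultStatistic2
);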

The result statistics might be very specific (we might create these tables for each viewer, dunno).  One obvious statistic, consistent with previous work in the log book, would be total reward at the end of the experiment (averaged over all RunRecords that match this set of parameters).

So then you could write a query like (ugh, my SQL is rusty):

SELECT AgentParam1, MAX(ResultStatistic1) FROM QueryTown GROUP BY AgentParam1;

If AgentParam1 was StepSize, and ResultStatistic1 was Total Reward, then I think this should give us the maximum reward for each step size.  Someone more SQLish could improve on this I'm sure and find ways for us to be clever.
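For example, a more standard way to also pull out which configuration achieved the best statistic for each parameter value is the grouped-subquery pattern (nothing here is specific to our schema):

SELECT q.AgentParam1, q.AgentConfigId, q.ResultStatistic1
FROM QueryTown q
JOIN (
  SELECT AgentParam1, MAX(ResultStatistic1) AS Best
  FROM QueryTown
  GROUP BY AgentParam1
) m ON q.AgentParam1 = m.AgentParam1
   AND q.ResultStatistic1 = m.Best;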
