
Environmental Parameterization makes me want to re-plan some things...

posted Feb 18, 2009, 9:14 AM by Brian Tanner
Although I am not working full time on the log book at the moment, I want to use it to do some experiments, and I'm realizing that I may have been drastically idealizing things when I imagined that environments and experiments were entwined, but agents were their own thing.

Here is what I mean.  When you define an "experiment" for the record book, you are forced to put any environmental configuration and variation into the experiment program.  What I mean is that an experiment has a fixed class with a function "runExperiment", and that "runExperiment" is passed an agent and its parameters.

There is currently no way to pass runExperiment a flag so that sometimes you run with random start states on and sometimes off, or so that sometimes you run from one particular start state and sometimes from another.
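To make this concrete, here is a rough sketch in Java of the current shape of things (the type and method names are my own stand-ins, not the actual record book code):

    import java.util.Map;

    // Stand-in types for the real record book classes.
    interface Agent { }
    interface Result { }

    interface Experiment {
        // The environment configuration is baked into each implementation;
        // callers can only vary the agent and its parameters.
        Result runExperiment(Agent agent, Map<String, String> agentParams);
    }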

The reason is to make data consolidation as easy as possible: all results within one experiment are uniform, with only the agent params changing.  This should make it easy to pick out the results we want by filtering on agent parameters.  The downside is that I currently want to run some experiments from 9 different start states.  I'd like to run each one the same number of times, and to see composite results that average over the different start states, but I'd also like to be able to look at each one independently.

If I could specify both environment AND agent parameters to the runExperiment function, this would be easier.  Then I could submit 9 different experimental configs for each agent configuration, and when I view my results, I could just filter (or not) on those environmental params.  This waters down the idea of an experiment being a monolithic, unvarying thing... but it makes the results more of a database of things you care about, which I think is a win.
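The change I'm imagining just widens that signature, something like (again, illustrative names only):

    interface Experiment {
        // Environment settings (start state, random starts on/off, ...)
        // become parameters too, so results can later be filtered
        // (or not) on either set of params.
        Result runExperiment(Agent agent,
                             Map<String, String> agentParams,
                             Map<String, String> envParams);
    }

Trying 9 start states then just means submitting 9 different envParams maps per agent configuration.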

Implementing this reveals some possibly brittle redundancy in the record book data storage scheme, though.  Currently we have two data files per experiment: the INDEX and the RESULTS.

The INDEX file has 1 entry per experimental configuration, something like:

<agentId, agentName, agentSourceJar, agentParameterSettings, experimentId, uniqueConfigurationKey, submitterId>
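For concreteness, a completely made-up entry might look like:

    <7, "SarsaLambda", "sarsa-agent.jar", "alpha=0.1,lambda=0.9", 3, "9e107d9d372bb6826bd81d3542a419d6", 2>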

The uniqueConfigurationKey is an MD5 hash of 3 things:
uniqueConfigurationKey = hash(agentParameterSettings, experimentId, agentName:agentSourceJar)

The idea is that this uniqueConfigurationKey should uniquely identify records that can be grouped together.  If two computers on opposite ends of the Internet have results with the same uniqueConfigurationKey, that means they were generated from the same agent, with the same parameters, from the same JAR file (provided that we have external controls in place so that these names and parameter values are meaningful).
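Computing the key is easy with the standard Java library.  A sketch (the exact string that gets hashed is an assumption on my part, and it matters, since every machine has to build it identically):

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    final class ConfigKeys {
        // Join the components unambiguously, MD5 the result, and render
        // the 128-bit digest as 32 hex characters.
        static String uniqueConfigurationKey(String agentParameterSettings,
                                             String experimentId,
                                             String agentName,
                                             String agentSourceJar)
                throws NoSuchAlgorithmException {
            String toHash = agentParameterSettings + "|" + experimentId
                    + "|" + agentName + ":" + agentSourceJar;
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // Platform default charset for brevity; real code should pin one down.
            byte[] digest = md5.digest(toHash.getBytes());
            return String.format("%032x", new BigInteger(1, digest));
        }
    }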

One reason this is important is for the RESULTS file. The RESULTS file has entries like:

<uniqueConfigurationKey, resultType, runKey, runTimeInMS, runDate, runnerId, result>

resultType is an ID that will map to something like EpisodeEndPointRunRecord or EpisodeEndReturnRunRecord.

runKey is a UUID generated when this experiment was run.  It's random, made from the time and a random number.  Think of it as an absolutely unique ID for this run.

runTimeInMS is how long the experiment took, in milliseconds.

runDate is the date-time at which the experiment took place.

result might be something like all the endPoints of the episodes, or the Returns received in each episode.
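One way the runKey described above (time plus a random number) could be built, sketched in Java; java.util.UUID conveniently has a constructor that takes two longs:

    import java.util.Random;
    import java.util.UUID;

    final class RunKeys {
        private static final Random RANDOM = new Random();

        // Current time in the high 64 bits, a random number in the low
        // 64 bits: effectively unique across machines and runs.
        static UUID newRunKey() {
            return new UUID(System.currentTimeMillis(), RANDOM.nextLong());
        }
    }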

The overhead is about 128 + 32 + 128 + 64 + ? (date) bits, probably around 400 bits, or roughly 50 bytes per record.  When we start running 10 000 experiments, each with 30 trials, that's 300 000 records, so about 15 MBytes of overhead.  If we normalized this database of ours a bit, we could probably cut the overhead to 128 bits per record by just storing <runKey, result> in here.  Maybe that's really worthwhile, maybe not; I'm not totally sure.

What I am starting to think about is whether we should be using SQLite or some other lightweight database to store all of these things.  Of course this sounds good, but is it really?  It would take time; that's bad.  But maybe the data stored on the distributed nodes would be simpler, the merging process would be easier to see, and even though the data would be spread over more files (tables), it would be much faster to search.  And we'd use select statements... hmm.
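As a rough sketch of what the SQLite version could look like (this assumes a SQLite JDBC driver on the classpath, and normalizes the RESULTS data down to <runKey, result> as discussed above):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    final class RecordBookDb {
        public static void main(String[] args) throws SQLException {
            Connection conn = DriverManager.getConnection("jdbc:sqlite:recordbook.db");
            Statement st = conn.createStatement();
            // One row per experimental configuration (the old INDEX file)...
            st.executeUpdate("CREATE TABLE IF NOT EXISTS config ("
                    + " uniqueConfigurationKey TEXT PRIMARY KEY,"
                    + " agentId INTEGER, agentName TEXT, agentSourceJar TEXT,"
                    + " agentParameterSettings TEXT, experimentId INTEGER,"
                    + " submitterId INTEGER)");
            // ...one row of metadata per run (from the old RESULTS file)...
            st.executeUpdate("CREATE TABLE IF NOT EXISTS run ("
                    + " runKey TEXT PRIMARY KEY,"
                    + " uniqueConfigurationKey TEXT REFERENCES config,"
                    + " resultType INTEGER, runTimeInMS INTEGER,"
                    + " runDate TEXT, runnerId INTEGER)");
            // ...and the bulky result payloads on their own.
            st.executeUpdate("CREATE TABLE IF NOT EXISTS result ("
                    + " runKey TEXT REFERENCES run, result BLOB)");
            conn.close();
        }
    }

Merging results from distributed nodes would then just be inserts into one file, and "filtering on agent parameters" becomes a where clause.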

Going to think on this a while.


