Status Blog

Handling Multiple Runs

posted Sep 5, 2009, 10:26 AM by Brian Tanner   [ updated Sep 6, 2009, 7:54 AM ]

Generating the Results

The code that I've been using so far for the environments is very careful with random seeds.  It makes sure that, given an action sequence, the environment will generate the same observation and reward sequences for the same MDP every time.  This sounded good at first.  However, now that I want to do multiple runs per MDP, it's not so good, because we've lost part of the environment's stochasticity.

I'm going to add a new param to the CODA environments called RunNumber, and I'm going to use that (combined with the MDP number) to generate the random seed for transition and reward stochasticity.  This way, for every MDP, you *can* control these features independently.  The only problem is that I have just run 600 000 experiments without this parameter, so I need to re-run all of those experiments. Shux.
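For concreteness, here is a minimal sketch of the kind of seeding I have in mind.  The names and the mixing constant are just placeholders; the point is that each (MDP, run) pair gets its own reproducible seed:

    // Hypothetical sketch: derive a reproducible seed per (MDP, run) pair, so
    // different runs of the same MDP see different transition/reward noise,
    // but re-running the same (MDP, run) combination reproduces it exactly.
    public class SeedSketch {
        public static long seedFor(int mdpNumber, int runNumber) {
            // Mix the two numbers so that (1, 2) and (2, 1) don't collide.
            return 1000003L * (long) mdpNumber + (long) runNumber;
        }

        public static void main(String[] args) {
            java.util.Random transitionNoise = new java.util.Random(seedFor(2, 0));
            java.util.Random rewardNoise = new java.util.Random(~seedFor(2, 0));
            System.out.println(transitionNoise.nextDouble() + " " + rewardNoise.nextDouble());
        }
    }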

Handling The Results

There are two distinct strategies for handling multiple runs that we could follow.

One would be to have each run as its own experiment.  For example, we could add a runNumber param to the experiment's paramSummary.  This would later give us fine control over the runs in SQL, because we could do queries over specific run numbers.  The downside of this approach is that there would be a different experimentId for each run, which means queries directly over the resultsRecords might be more complicated.  If all the runs shared the experimentId, we could do:
select max(score) from resultRecords where experimentId=(select id from Experiment where MDPNumber=2)

Now we'd have to do
select max(score) from resultRecords where experimentId in (select id from Experiment where MDPNumber=2)

Maybe that's not so bad.  Maybe it gets worse when we have more complicated queries.  Not quite sure.

The opposite strategy is to take the distinct index off of ResultRecords for (AgentId,ExperimentId).  This would mean that there would literally just be multiple ResultRecords for each Agent,Experiment combination.  We'd have to manually filter or average them when necessary.  The advantage is that we just submit things multiple times, they get done multiple times, and they end up in our results multiple times.

Now that I'm thinking more clearly about this, I don't think this strategy fits with the new accountability strategy that I'm trying to follow where we have a record of each submission.  I think we should add the runNumber to the param summary.

First Large Scale Experiments : Compromised

posted Jul 14, 2009, 9:18 PM by Brian Tanner

I had a bug in the env_step method of RandomizedCodaV1.

Can you spot it?

    public Reward_observation_terminal env_step(Action theAction) {
        Reward_observation_terminal envStepResult = theWrappedEnv.env_step(theAction);
        Observation warpedObservation = theWarper.warpObservation(envStepResult.getObservation());

        Reward_observation_terminal warpedResult = new Reward_observation_terminal(envStepResult.getReward(), warpedObservation, envStepResult.isTerminal());
        return envStepResult;
    }

That's right, I'm not returning the warped observation.  I am using it in env_start though.
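For the record, here is the corrected method, with the only change being the return:

    public Reward_observation_terminal env_step(Action theAction) {
        Reward_observation_terminal envStepResult = theWrappedEnv.env_step(theAction);
        Observation warpedObservation = theWarper.warpObservation(envStepResult.getObservation());

        Reward_observation_terminal warpedResult = new Reward_observation_terminal(envStepResult.getReward(), warpedObservation, envStepResult.isTerminal());
        // The bug was here: the method was returning envStepResult instead of warpedResult.
        return warpedResult;
    }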

So this means these 3 million results are pretty much not what I expected.  The good news is that this basically gives me 25 MDPs' worth of data for a random selection of domains and randomized noise in the transitions, but no observation warping.

It's a bit to work with, but not worth following through to all 100 or more MDPs. Hmph.



Merging SQLite Databases

posted Jul 11, 2009, 10:55 AM by Brian Tanner   [ updated Sep 5, 2009, 10:24 AM ]

I have improved the speed of aggregation by allowing it to be done in parallel.

Note: The tricks and techniques here are currently in LocalViewNodeCache.java.


In the old system, there could only be one aggregator node, and it would download the existing result summary, update it, then flush it back to the server occasionally.  As the file got larger, the flushes took longer.  Also, if this was too slow to keep up with the incoming results, too bad.

The new system allows us to run multiple aggregators.  Each one starts its own results summary (uniquely identified), and then flushes that to the server occasionally.  We have a new problem: when we want to look at results, we have to aggregate multiple summaries.  This is actually not a bad situation, because it means we can download and process results in parallel instead of needing to download a single huge file.

To oversimplify the situation, we have N databases that look like:
Agent: <ROWID (Long), ParamSummary (String)>
Experiment: <ROWID (Long), ParamSummary (String)>
Result: <AGENTID (Long), EXPERIMENTID (Long), Result (Blob)>

We will think of combining databases by calling one of them the "main" database and one of them the "new" database.  We want to get all of the information from new into main.

The steps are something like:

  • Agent Records
    • Make sure all of the agent records from new exist in main
  • Do the same for experiments
  • Copy Result records from new to main.  In main, they need to have the *new* ROWIDs for the Agent and Experiment tables
I think the easiest way to do this might be to alter the Agent and Experiment tables to have a new column, "NEWROWID".  Then maybe we can do something as simple as "insert or replace" from new->main, then copy the matching ROWIDs from main into new.Agent.NEWROWID, then insert the Result records from new into main.

Now that I'm looking at this, I think we need to have an index on paramSummary!

Here is the plan as it is unfolding.

attach 'newdb.sqlite' as new;
alter table new.Agent add column newIndex INTEGER;

sqlite> select count(*) from main.Agent;
61236
sqlite> select count(*) from new.Agent;
56373

Here is the total schema for the database.
CREATE TABLE 'Agent'('id' INTEGER PRIMARY KEY  AUTOINCREMENT,'agentShortName' VARCHAR(32),'agentName' VARCHAR(32),'agentJar' VARCHAR(256),'paramSummary' VARCHAR(1024));
CREATE TABLE 'Experiment'('id' INTEGER PRIMARY KEY AUTOINCREMENT,'experimentShortName' VARCHAR(32),'experimentClassName' VARCHAR(32),'paramSummary' VARCHAR(1024));
CREATE TABLE 'ResultsRecord'('id' INTEGER PRIMARY KEY AUTOINCREMENT,'agentId' INTEGER,'experimentId' INTEGER,'runDate' DATE,'runTimeInMS' INTEGER,'payload' BLOB, FOREIGN KEY ('agentId') REFERENCES 'Agent'('id'), FOREIGN KEY ('experimentId') REFERENCES 'Experiment'('id'));
CREATE UNIQUE INDEX all_agent_index ON 'Agent' ('agentShortName','agentName','agentJar','paramSummary');
CREATE UNIQUE INDEX all_experiment_index ON 'Experiment' ('experimentShortName','experimentClassName','paramSummary');



insert or ignore into main.Agent (agentShortName,agentName,agentJar,paramSummary) select agentShortName,agentName,agentJar,paramSummary from new.Agent;

Timing: This took a little while (10-20 seconds).  I wonder if it would be faster if we only had the unique key on the paramSummary.

sqlite> select count(*) from main.Agent;
117609

This looks like there were no duplicates. Let's be sure.  I executed the command again and the count stayed the same.  We're in business.

Now we have to get those ROWIDs into new.Agent.newIndex!  This is really at the edge of my SQL knowledge, but maybe this will work?

sqlite> update new.Agent set newIndex=(select ROWID from main.Agent where new.Agent.paramSummary=main.Agent.paramSummary);

Note: It's taking a long time, more than a few minutes.  Added an index on paramSummary:
sqlite> create index if not exists  main.paramIndexAgent on Agent (paramSummary);

That took about 10-20 seconds.  Running the update again took 3 seconds.  So, we should really change all of the database definitions to have that index.

Let's do the same for Experiment.
sqlite> select count(*) from new.Experiment;
5
sqlite> select count(*) from main.Experiment;
4

create index if not exists  main.paramIndexExperiment on Experiment (paramSummary);
insert or ignore into main.Experiment (experimentShortName,experimentClassName,paramSummary) select experimentShortName,experimentClassName,paramSummary from new.Experiment;

sqlite> select count(*) from main.Experiment;
9

alter table new.Experiment add column newIndex INTEGER;
update new.Experiment set newIndex=(select ROWID from main.Experiment where new.Experiment.paramSummary=main.Experiment.paramSummary);

Ok, now time to do the big insert!

sqlite> select count(*) from new.ResultsRecord;
116282
sqlite> select count(*) from main.ResultsRecord;
145104

insert into main.ResultsRecord(agentId,experimentId,runDate,runTimeInMS,payload) select Agent.newIndex,Experiment.newIndex,runDate,runTimeInMS,payload from new.Agent,new.Experiment, new.ResultsRecord where new.ResultsRecord.agentId=Agent.id and new.ResultsRecord.experimentId=Experiment.id;

sqlite> select count(*) from main.ResultsRecord;
261386

Yay!  And it was fast (0.26 seconds).
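Since I will probably want to repeat this merge for every new summary database, here is a rough sketch of the same sequence driven from Java over JDBC.  This is not the actual LocalViewNodeCache code; it assumes a SQLite JDBC driver on the classpath and uses placeholder file names, but the statements are exactly the ones above:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // Sketch: replay the merge steps above against main.sqlite, pulling records
    // in from newdb.sqlite.  File names are placeholders.
    public class MergeSketch {
        public static void main(String[] args) throws Exception {
            Class.forName("org.sqlite.JDBC");
            Connection conn = DriverManager.getConnection("jdbc:sqlite:main.sqlite");
            Statement s = conn.createStatement();
            s.executeUpdate("attach 'newdb.sqlite' as new");
            // The paramSummary indexes make the newIndex updates fast.
            s.executeUpdate("create index if not exists main.paramIndexAgent on Agent (paramSummary)");
            s.executeUpdate("create index if not exists main.paramIndexExperiment on Experiment (paramSummary)");
            // Agents: copy new ones into main, then record main's ROWIDs in new.Agent.newIndex.
            s.executeUpdate("alter table new.Agent add column newIndex INTEGER");
            s.executeUpdate("insert or ignore into main.Agent (agentShortName,agentName,agentJar,paramSummary) "
                    + "select agentShortName,agentName,agentJar,paramSummary from new.Agent");
            s.executeUpdate("update new.Agent set newIndex=(select ROWID from main.Agent "
                    + "where new.Agent.paramSummary=main.Agent.paramSummary)");
            // Experiments: same trick.
            s.executeUpdate("alter table new.Experiment add column newIndex INTEGER");
            s.executeUpdate("insert or ignore into main.Experiment (experimentShortName,experimentClassName,paramSummary) "
                    + "select experimentShortName,experimentClassName,paramSummary from new.Experiment");
            s.executeUpdate("update new.Experiment set newIndex=(select ROWID from main.Experiment "
                    + "where new.Experiment.paramSummary=main.Experiment.paramSummary)");
            // Results: insert with the remapped agent and experiment ids.
            s.executeUpdate("insert into main.ResultsRecord(agentId,experimentId,runDate,runTimeInMS,payload) "
                    + "select Agent.newIndex,Experiment.newIndex,runDate,runTimeInMS,payload "
                    + "from new.Agent,new.Experiment,new.ResultsRecord "
                    + "where new.ResultsRecord.agentId=Agent.id and new.ResultsRecord.experimentId=Experiment.id");
            s.close();
            conn.close();
        }
    }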


Losing Results and Jobs... tighter tolerance makes life difficult.

posted Jul 4, 2009, 10:45 AM by Brian Tanner

The code as it stands is not 100% robust to various types of failures.

For example, a compute node works like this (roughly):
  • Take job description message from work queue
  • If more than 1 iteration left, put new message in with iteration:=iteration-1
  • Delete existing message
  • Do experiment
  • Send result when convenient (in a separate thread)
This has a problem.  If the node crashes or a queue disappears or the Internet dies between deleting the existing message and the result getting sent, that job is lost.   This isn't usually terrible, because we run each experiment 40+ times, so if there are 40 of some, 39 of others, no big deal.  But, currently I'm trying to do exactly 1 run of 6 million experiments. Can't afford to lose any.

There is a similar problem in the aggregator node.  After a new result is aggregated, the result message is deleted.  The result is only really locked in when the aggregator flushes.  It is supposed to flush at regular intervals, and when it shuts down.  But, today, I ran into a problem where the node's command queue disappeared, so I couldn't send a shutdown message to the aggregator.  So, aside from trying to CTRL-C it right after a flush finishes, this is a problem.

Whatever we do, we might still lose some jobs, lose some results, or get an occasional duplicate job/result in the database.  Realistically, we should be addressing ALL of these issues with the logbook.  So, here is a multi-point plan.

  • Don't delete work or results messages until we are DONE with them.  In the compute node, this means not resubmitting the decremented job until the current result is successfully delivered.  In the aggregator, it means not deleting the results messages until we flush the database back to the remote storage service.  This will cut down on lost jobs.
  • Use a larger visibility timeout in the work and results queues. This controls how long a message is invisible after it has been received but not deleted.  This needs to be set high enough to make sure that a message won't be picked up a second time before handling it is completed.  This will cut down on duplicated results and jobs.
  • Update the queueing infrastructure to include a "recreate if missing" option on queues, and detect the "queue missing" exception inside the queue code.  If we/amazon/someone accidentally deletes a queue for a running node, it would be great if we can just recreate the queue on the fly.  We'll have to watch for the "can't recreate queue within 60 seconds" exception.
  • Create a receipt of all submitted jobs.  Keep it in a SQLite database with similar structure to the results database. Later, we can check the receipts against the results to make sure all is accounted for, and possibly resubmit results that are missing.
With these 4 strategies together, I think I can solve this problem.  Thankfully, they are all fairly easy to implement as well.
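For the first point, the reordered compute-node loop would look roughly like this.  The queue and job types here are hypothetical stand-ins for the real SQS-backed infrastructure; the important part is the ordering, where nothing is resubmitted or deleted until the result is safely delivered:

    // Sketch of the reordered compute-node loop (interfaces are illustrative only).
    public class ComputeNodeSketch {
        interface JobMessage {
            int iterationsLeft();
            JobMessage decremented();
        }
        interface WorkQueue {
            JobMessage receive();               // message becomes invisible, but is not deleted
            void send(JobMessage job);
            void delete(JobMessage job);
        }
        interface ResultQueue {
            void send(byte[] result);
        }

        static byte[] runExperiment(JobMessage job) {
            return new byte[0];                 // placeholder for the real experiment
        }

        static void processOneJob(WorkQueue work, ResultQueue results) {
            JobMessage job = work.receive();
            byte[] result = runExperiment(job); // do the work first
            results.send(result);               // 1. the result is safely away
            if (job.iterationsLeft() > 1) {     // 2. only now resubmit the decremented job
                work.send(job.decremented());
            }
            work.delete(job);                   // 3. finally, delete the original message
        }
    }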

What's New June 2009 - Updating the cloud and more SQLite!

posted Jun 30, 2009, 7:48 PM by Brian Tanner   [ updated Jun 30, 2009, 10:14 PM ]

It has been a while since I wrote anything here.  I just learned a lot from reading some old posts, so that reassures me that posting my thoughts here is useful to at least one person ;)



Updating the Cloud

Amazon changed some of the rules for how signatures and communication need to happen with the queue and SimpleDB services.  I invested several days of work to look at the updates to the Typica project, and then mirror those updates in the aws-remote-signing project that I maintain.  It turned out to be a bit of work to get the code all working nicely, but now it does.

SQLite Implementation

Implementation Choices

In a recent post I talked about some ideas about how to use SQLite to store results.  I've gone ahead with that, and it is working quite well.  The SQLite database has 3 tables: one for agents, one for experiments, and one for results.

The agent and experiment tables list all of the different parameter combinations that have been used, while the results table just links to an agent and an experiment, and includes a blob that is a serialized ResultRecord.

This plan works well because it keeps the aggregator very fast (no complex data processing required), and the data is unaltered between when the experiment is run and when it sits in the database.  To make the data easy to work with, there is some new code that the Viewer can use to create a new "QueryTown" database from the results database, which unpacks all of the parameters into columns in a way that is easy to query, and also summarizes these ResultRecords into summary statistics (instead of full run statistics).  We can then do our queries and maxes in SQL if we like, and look up the full run details of the winners (for drawing learning curves).  This code is all new, but it seems to be working.
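The unpacking itself is nothing fancy.  Assuming a paramSummary is a flat string of key=value pairs (the real format may well differ; this is only to illustrate the idea), it's basically:

    // Sketch only: assumes paramSummary looks roughly like "alpha=0.1,lambda=0.9".
    // The real recordbook format may differ; this just illustrates the unpacking idea.
    public class ParamUnpackSketch {
        public static java.util.Map<String, String> unpack(String paramSummary) {
            java.util.Map<String, String> params = new java.util.LinkedHashMap<String, String>();
            for (String pair : paramSummary.split(",")) {
                String[] keyAndValue = pair.split("=", 2);
                if (keyAndValue.length == 2) {
                    params.put(keyAndValue[0].trim(), keyAndValue[1].trim());
                }
            }
            return params;   // each key becomes a column in the QueryTown table
        }
    }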

Size Issues

Current/Previous Situation

I started running some real experiments.  A simple agent with 1000 different parameterizations (I want to move that to about 20 000 soon), running on 5 different environments, for 50 000 steps each.  I was storing every episode completion point in an EpisodeEndPointRunRecord.  Turns out that for some problems (cart pole), bad agents do LOTS of episodes, which made the run records very large.  Also, although I had trimmed up the meta data that is being stored for each run record (who ran it, how long it took, what date/time it was run, etc), there was still about 40 bytes of header information per record.

Take one sample database for example.

Size: 245 MB
ResultRecord Table
------------------------------
[From Combined Part of Stats, Indices Included]
Number of entries..................... 89540    
Bytes of storage consumed............. 254031872 [2.77 KB per entry spread across indices and stuff]

[From W/O Indices Part of Stats]

Number of entries..................... 44770    
Bytes of storage consumed............. 253603840 [241 MB]
Bytes of payload...................... 243822326  
Average payload per entry............. 5446.11   [5.32 KB]

So, on average, each result record is 5.32 KB.  We have one result record for each run, which means if we're doing 30 runs to get statistics, then 89540 entries is about 3000 experiments.  Remember I was talking about 20 000 experiments x 5 environments x 30 runs, which is more like 15.2 GB of database.  That's a lot of data to shuffle around over the network, for the aggregators to flush, and for SQLite to handle.  And we want to be able to scale larger potentially... and 7 GB files are really starting to make me want to split into multiple databases (lots of work), etc.

The previous results were on a particular environment.  There are other ones that are even worse.

Improvements

I decided to look carefully at whether I could store much less data, and whether I could reduce the overhead per entry.  Instead of logging each episode completion, I log how many episodes have finished and the return at a fixed, exponentially spaced set of checkpoints.  Only 14 checkpoints in 50 000 steps, at steps like 64, 256, 1024, 2048, 4096, etc (it's not strictly exponential later in the series).  Now cart-pole, which uses thousands of episodes, takes the same space to store as mountain car, which finishes no episodes.  I guess I could save a bit more space by not even storing empty checkpoints.  Actually that's a bad idea: it could cause us to interpolate and possibly miss things that are interesting.

I used a few tricks to approximate the current metadata (using an unsigned short to represent the userid of the person who ran the experiment in 2 bytes instead of the 128 bits of a UUID), storing the runtime to the nearest minute instead of millisecond, etc.  So, the meta information went from 40 bytes to 9 bytes.  I also re-used some of these tricks inside the data records, like storing a joint record for return and episodes instead of 2 records (so I don't have to store the checkpoints and meta-data twice), and also using a coding scheme for the checkpoints (checkpoint indices from a global list).  This should make each checkpoint only 7 bytes.
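As a rough sketch of the kind of packing I mean (the exact field layout here is made up, but the byte counts line up with the numbers below):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    // Illustrative packing only; the real record layout may differ.
    public class CompactRecordSketch {
        public static byte[] pack(int userId, int runTimeMinutes, int[] checkpointIndices,
                                  short[] episodeCounts, float[] returns) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            // 9 bytes of metadata instead of ~40.
            out.writeShort(userId);                    // 2 bytes: who ran it (vs a 128-bit UUID)
            out.writeInt(runTimeMinutes);              // 4 bytes: runtime to the nearest minute
            out.writeShort(checkpointIndices.length);  // 2 bytes: how many checkpoints follow
            out.writeByte(0);                          // 1 byte: record type/version tag
            // 7 bytes per checkpoint: index into the global checkpoint list,
            // episodes finished so far, and return so far (joint record).
            for (int i = 0; i < checkpointIndices.length; i++) {
                out.writeByte(checkpointIndices[i]);   // 1 byte
                out.writeShort(episodeCounts[i]);      // 2 bytes
                out.writeFloat(returns[i]);            // 4 bytes
            }
            out.flush();
            return bytes.toByteArray();
        }
    }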

So, storing 14 checkpoints + meta data now takes 9 bytes + 14 * 7 bytes = 107 bytes instead of the potentially many KB from before.  So, I reran some experiments and want to look at what the database looks like.
Size 794 KB
ResultRecord Table (Improved)
------------------------------
[From Combined Part of Stats, Indices Included]
Number of entries..................... 7570   
Bytes of storage consumed............. 594944  [78.6 bytes per entry (spread across indices]]

[From W/O Indices Part of Stats]

Number of entries..................... 3785     
Bytes of storage consumed............. 560128   (547 KB)
Average payload per entry............. 125.00    (125 Bytes!)
Maximum payload per entry............. 125      

So, good news here.  First, the maximum and the average payload are the same, so every record is exactly 125 bytes.  Second, we've gone from an average of more than 5 KILOBYTES to 125 BYTES!  That's more than a 40x reduction in size.  Very good stuff.

And, wonders never cease, it looks like the DB can be gzipped down to about 1/9 of its size.  That's not as great as it sounds, because gzipping takes time and effort, but for the cloud stuff it might make sense.

Let's take a look at our hypothetical experiment, 20 000 experiments x 5 environments x 30 runs (3 million records).  With these new numbers, that would be more like a DB of size 357 MB.  Much more manageable!

It would be nice to get that even smaller.

Looking Ahead: Saving Time and Effort

The biggest bottleneck with the large SQLite databases is the aggregator.  Every few minutes (how often it flushes is configurable), the aggregator has to copy its latest SQLite database back to the remote file system, whether that's actually just a file on the local machine or an S3 bucket.  Copying back to the local filesystem isn't too bad (my machine can do a file copy at about 25 MB/second apparently, so a 2 GB file takes about 80 seconds).  This isn't going to kill us outright.  While we're at it, gzipping that 2 GB file takes 150 seconds, but the resulting file is only 385 MB.  So, in cloud mode, we should do this (maybe we do already?).
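If we do go the gzip route in cloud mode, it's only a few lines of Java (a sketch; the paths are placeholders):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.zip.GZIPOutputStream;

    // Sketch: gzip the aggregate database before shipping it off to S3.
    public class GzipFlushSketch {
        public static void gzip(String dbPath, String gzPath) throws IOException {
            FileInputStream in = new FileInputStream(dbPath);
            GZIPOutputStream out = new GZIPOutputStream(new FileOutputStream(gzPath));
            byte[] buffer = new byte[64 * 1024];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            in.close();
            out.close();   // close() finishes writing the gzip trailer
        }
    }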

However, if we are in cloud mode, the aggregator has to transfer the files back to Amazon S3.  Even if our aggregator node is running on EC2, this seems like it might take much longer than a copy operation.  If our node isn't in the cloud, then it can take a long time to upload 2 GB.

There are some choices.  We could move to a mixed-mode implementation, where the aggregate is stored in a local file system.  If we are using EC2, we can even store it on an elastic volume, which is as quick as disk access.  If we cared, we could even use RAID striping for speed or a multi-level RAID for safety.  This solution is attractive because it's sort of easy, but it is unattractive because we are sacrificing some of our advantages, like having a SQLite database on S3 that you could in theory just download with a viewer at any time.

Another option is to go with the multi-file solution.  We could keep the agents/experiments details in a single database, and then the results in separate databases.  This would allow the aggregator to start a new results database after the existing one reaches a size limit.  We'd want to aggregate these at some point, basically so that there is always the "main" DB and the "new" DB(s), and at some interval you dump the new ones into the main one and repeat.  This is because you can't transparently just use all of the DBs with SQLite at once.

Not sure what is best to do. Good news I guess is that with all of the space savings I've created I won't hit this for a little while.

Open Problem : Multiple Aggregators

I'd still like to make progress on the multiple-aggregator problem.  I'd like to be able to have multiple aggregators, on multiple computers, and have the aggregation be much faster.  This will be less important now that all of the data I'm personally using will fit in an SQS message (which can be processed super fast), but in the future it might matter.  Also, I think the inserts get slower when the database gets really big.

One solution could be that instead of the aggregator downloading the existing database and flushing it occasionally, it starts its own NEW database and flushes it regularly.  Every node we ever run has a unique(ish) ID, so they can all just run their own aggregations.  Then, a master aggregator can occasionally download the bunch of them, figure out which Agent and Experiment tuples are actually the same between tables, and merge all of the ResultRecords together.  In theory you could even do this just in the viewer... although that doesn't seem as attractive.  This is actually pretty easy, solves all of the problems I can currently think of, and is only a tiny bit of work.  Next time I run into the aggregation-too-slow problem I will code this up.

SQLite/Database Plan for the Log Book

posted Feb 18, 2009, 10:57 AM by Brian Tanner   [ updated Feb 18, 2009, 1:08 PM ]

Part of the reason that the code seems complicated is that persistence is wrapped up into all of the data storage classes.

I'm worried about things like runKeys, agentKeys, writing to files, and so on, all in the same code where we're storing the data.  This makes things complicated, and there are all sorts of little concerns about running distributed, uniqueness, etc.

I have also made some judgment calls previously about at what point we store unprocessed data (arrays of runrecords), and at what point we process that into summary statistics that we can use for doing queries.

Aside: Remember that every agent gets their own database of results.  The good news is that having these as SQLite databases should allow us to do cross-database comparisons pretty easily.

I'm going to try to forget what I've done previously, and re-imagine the data storage system from scratch, given the following constraints and considerations.

Key-Binding

We should delay key-binding as much as possible, maybe?  What I mean is: don't assign a "key" to an agentname-agentjar-agentparams tuple until you need to.  Why bother?  The key might only need to be assigned when it finally gets put into a results database in an aggregator node.  That only happens on ONE host; it's a bottleneck in the process.  That means that we can just use auto-incrementing keys, which will save us a lot of effort and nervous coding making sure all these distributed nodes are creating UUIDs that match or are unique when they are supposed to be.

Submitter Nodes

Submitter nodes should be able to submit "experiments" to be run, and each submission should specify an agent, agent parameters, an experiment, and experiment parameters.  The experiment parameters may actually be environment parameters, but I think a level of abstraction is better.  If we submit them as experiment parameters, then we can interpret them and do more powerful things if we want to.  Like, run experiment configuration "A", which may be a whole host of environment and data collection settings.

These nodes will not bind keys.  They will just describe agent and experiments by their class, jar, and bucket names.  We will worry about the translation behind the scenes.  No "param summaries" are created at this point either.

Compute Nodes

Compute nodes will pull experiment descriptions off of a Queue, as they do now.  I don't see a need for key binding here either.  These guys will just do the work they are supposed to, and then submit the results to the results queue, one at a time, as usual.

Aggregator Nodes

This is the place where I see us binding keys and moving into a SQLite database.  The issue that is problematic for me is whether we do any sort of data processing at this point, or just keep everything in binary blobs until it gets downloaded to a viewer node.  My first inclination is not to do any processing here.  This will keep aggregator nodes fast, and will keep the data pristine and untouched.

When the viewer node downloads the SQLite databases, perhaps it makes sense to "unpack" them into a larger, processed version that will allow us to do fast queries and such.  If we concede that this unpacking process will be necessary, we should wonder how packed things should be in the aggregator phase.  Maybe it makes sense to keep things all bundled together and relatively un-normalized there, and then to normalize the database for quick querying at the viewer phase.  I'm going to hash out a possible database schema at the aggregator level, see how that looks, and then think more on this.


Aggregator Node DB Schema

SQLite keeps a 64-bit ROWID for each row of each table.  This is what we will use, whenever possible, as the key to the table.  We can use auto-increment.  They say searches on the ROWID are usually about twice as fast as searches on a regular primary key or any other index.

It looks like SQLite does support multi-column primary keys though, so that will be useful sometimes.

Proposal 1

First shot would just be a single big pile of data, I guess.

<RunId, AgentName, AgentJar, AgentParams, ExperimentName, ExperimentParams, RunDetails,..., RunRecord>

We would bind the ROWID as the RUNID.  This would be very fast to aggregate, so why didn't we do it before?

First, it would be terribly redundant.  Many of the attributes here would be duplicated for every row in the database, including most of the agent information, the experiment name, etc.  For things that vary among rows, the agentParams would be duplicated for every trial of some configuration, and the same goes for the experimentParams.  Basically, this just feels pretty inefficient.  Also, in the old days, we may have been bothered that we'd have to iterate through the whole multi-GB file in order to get a list of all the parameters we had tried.  It made it harder to process the file intelligently because we weren't sure what was in it.

Proposal 2 : Normalize a Bit

This proposal tries to split the data up enough to cut down on redundancy, but the data isn't unpacked such that it is really easily queryable.  This one will mirror the existing setup quite a bit.

Table 1 : Agent
<AgentConfigId, AgentName, AgentJar, AgentParams, etc>

Table 2 : Experiment
<ExperimentConfigId, ExperimentName, ExperimentParams, etc>

Table 3: Runs
<RunId, AgentConfigId, ExperimentConfigId, RunDetails, RunRecord>

If we wanted to be anal about the normalization we could add an Agent-Experiment relation table, but I'm not super keen on that.  This will give us what we want at the aggregator phase, I think, with minimal redundancy.

Proposal 3: Unpacked Data

The "viewer" would download the three tables, Agent, Experiment, Runs.  Then, he might iterate over all of the runs making aggregate records. It's not clear if this would best be done iteratively using a single pass over the data, or by doing separate selects for each aggregate row.  Anyway, the aggregate rows look like:

Table1: QueryTown
<AgentConfigId, ExperimentConfigId, AgentParam1, AgentParam2, ..., AgentParamN, ExperimentParam1, ExperimentParam2, ..., ExperimentParamN, ResultStatistic1, ResultStatistic2, ..., ResultStatisticN>

The result statistics might be very specific (we might create these tables for each viewer, dunno).  One obvious statistic consistent with previous work in the log book would be total reward at the end of the experiment (averaged over all runrecords that match this set of parameters).

So then you could write a query like (ugh, my SQL is rusty):

Select AgentParam1, max(ResultStatistic1) from QueryTown group by AgentParam1;

If AgentParam1 was StepSize, and ResultStatistic1 was Total Reward, then I think this should give us the maximum reward for each step size.  Someone more SQLish could improve on this I'm sure and find ways for us to be clever.






 

Environmental Parameterization makes me want to re-plan some things...

posted Feb 18, 2009, 9:14 AM by Brian Tanner

Although I am not working full time on the log book at the moment, I want to use it to do some experiments, and I'm realizing that I may have drastically idealized things when I imagined how environments and experiments were entwined, but agents were their own thing.

Here is what I mean.  When you define an "experiment" for the record book, you are forced to put any environmental configuration and variation into the experiment program.  That is, an experiment has a fixed class and function "runExperiment", and "runExperiment" is passed an agent and parameters.

There is no way to (currently) pass runExperiment a flag so that sometimes you run with random start states on, and sometimes off.  Or, so that sometimes you run it in a particular start state, sometimes another one.

The reason is to make data consolidation as easy as possible... so that all results within 1 experiment are uniform, and only the agent params are changing.  This should make it easy to pick out the results we want by filtering on agent parameters.  The downside is, I currently want to run some experiments from 9 different start states.  I'd like to do each one the same number of times, be able to see composite results that average out the different start states, and also be able to look at each independently.

If I could specify both environment AND agent parameters to the runExperiment function, this would be easier.  Then I could submit 9 different experimental configs for each agent configuration, and when I view my results, I could just filter (or not) based on those environmental params.  This waters down the idea of an experiment being this monolithic unvarying thing.... but it makes the results more of a database of things that you care about, which I think is a win.

Implementing this reveals some maybe brittle redundancy in the recordbook data storage scheme though.  Currently we have two data files per experiment, the Index, and the Results.

The INDEX file has 1 entry per experimental configuration, something like:

<agentId, agentName, agentSourceJar, agentParameterSettings, experimentId, uniqueConfigurationKey, submitterId>

The uniqueConfigurationKey is an MD5 hash of 3 things:
uniqueConfigurationKey= hash ( agentParameterSettings, experimentId, agentName:agentSourceJar )

The idea is that this uniqueConfigurationKey should uniquely identify records that can be grouped together.  If two computers on opposite ends of the Internet have results with the same uniqueConfigurationKey, that means they were generated from the same agent, with the same parameters, from the same JAR file (provided that we have external controls in place so that these names and parameter values are meaningful).
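Concretely, computing that key looks something like the following sketch.  The exact concatenation format is whatever the recordbook code actually does; this just shows the MD5 part:

    import java.security.MessageDigest;

    // Sketch of computing the uniqueConfigurationKey with MD5.
    public class ConfigKeySketch {
        public static String uniqueConfigurationKey(String agentParameterSettings,
                                                    String experimentId,
                                                    String agentName,
                                                    String agentSourceJar) throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(agentParameterSettings.getBytes("UTF-8"));
            md5.update(experimentId.getBytes("UTF-8"));
            md5.update((agentName + ":" + agentSourceJar).getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
    }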

One reason this is important is for the RESULTS file. The RESULTS file has entries like:

<uniqueConfigurationKey, resultType, runKey, runTimeInMS, runDate, runnerId, result>

resultType is an ID that will map to something like EpisodeEndPointRunRecord or EpisodeEndReturnRunRecord.

runKey is a UUID generated when this experiment was run.  It's random, made from the time and a random number.  Think of it as an absolutely unique ID for this run.

runTimeInMS is how long the experiment took in milliseconds

runDate is the date-time that the experiment took place

result might be something like all the endPoints of the episodes, or the Returns received in each episode.

The overhead is about 128 + 32 + 128 + 64 + ? (date) = probably around 400 bytes per record.  When we start running 10 000 experiments, each with 30 trials, this will be about 100 MBytes of overhead.  If we normalized this database of ours a bit, we could probably cut the overhead to 128 bits, by just storing <runKey, result> in here. Maybe that's really worthwhile, maybe not, I'm not totally sure.

What I am starting to think about is whether we should be using SQLite or some other lightweight database to store all of these things.  Of course this sounds good, but is it really?  It would take time, that's bad.  But maybe the data being stored on the distributed nodes would be simpler, the merging process would be easier to see, and maybe the data would be spread over more files (tables) but be much faster to search?  And we'd get to use select statements... hmm.

Going to think on this a while.



Final Version of Poster

posted Dec 12, 2008, 12:09 PM by Brian Tanner   [ updated Feb 8, 2009, 3:35 PM ]

I've incorporated as much feedback as possible and have produced (and printed) the poster.  I'll be presenting it tomorrow (Dec 12, 2008) at the workshop.

I've attached the PDF to this page.

Something weird is happening with the attachments: they are not "sticking".  They are apparently there, but not automatically being listed on this entry.

I think you can download the file by clicking here.

Workshop Poster First Draft

posted Nov 30, 2008, 8:08 PM by Brian Tanner

I've completed the first draft of the workshop poster. I'm posting here because I'd really appreciate feedback from those who have a vested interest in this project.  I have a very internalized view of this project, and I think the poster could be strengthened by outside perspectives.

If you have any comments, questions, or suggestions about the poster, I'd really like to hear them. Or if you think it's awesome, you could tell me that too. :)

If there is anything on the poster that seems unintuitive, I'd like to know.  As usual, this poster is much better with me there to point things out and answer questions: so it doesn't have to explain everything completely.  But, of course, it shouldn't offend your sensibilities either.

Finally, if there are some aspects of the logbook that you think are really exciting, cool, interesting, and are not mentioned or given enough attention in this poster, that would be *really* great to know.

I'll attach it and post a low-res PNG.

Workshop Submission Accepted!

posted Nov 15, 2008, 3:15 PM by Brian Tanner

The NIPS 2008 Workshop: Parallel Implementations of Learning Algorithms has accepted our proposal.  We will be giving a 4-minute spotlight talk and presenting a poster during the poster session of that workshop on Dec 13, 2008 at the NIPS workshops in Whistler.

This is good news because it's forcing us to formalize some of the implicit aspects of this project, like what we hope to achieve with it.  I'll put a copy of the poster up here when it's ready.
