Implementation Details

Introduction

This page, and those below it, will explain the current implementation details of the bt-recordbook.  Sprinkled around the rest of the site are design decisions and places where I'm considering and debating different approaches.  I'll try to use this space to explain what I actually *did* do, as much for my own benefit as yours (whoever you are).

Compute Units

Each compute unit will be running a java process in an infinite loop.  Each compute unit will have a ${compute_name}, which will roughly be its network name plus a random number, assigned when the compute unit starts up.  When the compute unit starts, it will create several SQS queues, described below.
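
Just to make the idea concrete, here is a minimal sketch of how a ${compute_name} could be built from the network name plus a random number.  The class and method names are made up, and the real naming scheme isn't settled:

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Random;

// Hypothetical sketch: ${compute_name} = host name + "-" + random number.
// Only illustrates the "network name plus a random number" idea from above.
public class ComputeNameFactory {
    public static String makeComputeName() {
        String host;
        try {
            host = InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            host = "unknown-host";
        }
        int suffix = new Random().nextInt(1000000);
        return host + "-" + suffix;
    }
}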

Work Queue

Created by: Nobody (this queue always exists)
Deleted by: Nobody (this queue always exists)
Queue Name: bt-recordbook-beta-work-queue (for now)

The work queue is where experiments are submitted to be processed later.  Messages in the work queue are string-serialized JobDescription objects.  I'm not sure yet exactly which package JobDescription will live in.

JobDescription is:
  • An agentDescription Object
  • A String representing the submitter
  • A Date representing when the job was submitted
  • An experimentClassName string representing which class should be invoked via reflection to run the experiment
An agentDescription is:
  • An agentId String (tells us which folder to store the results in)
  • An agentName String (tells us the name of the agent in the dynamic agent loader)
  • An agentSourceJar String (tells us what Jar to load the agent from which will be important when we have many versions on the go)
  • A parameterHolder object (tells us what parameters to run the agent with)
  • A paramSummary String (summary of the parameterHolder that is useful for sorting and sifting later)
  • An experimentId String (tells us which experiment/folder is at the root of this result)
  • A uniqueConfigurationId (a hopefully globally unique key for this experiment/environment/agent/parameter configuration)
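
As a rough sketch of these two objects (field names follow the lists above; types, packages, and accessors are not settled, and the ParameterHolder is just held as a plain Object here):

import java.util.Date;

// Sketch only: the two data objects described above, with getters/setters omitted.
class JobDescription implements java.io.Serializable {
    AgentDescription agentDescription; // what to run
    String submitter;                  // who submitted the job
    Date submissionDate;               // when the job was submitted
    String experimentClassName;        // class invoked via reflection to run the experiment
}

class AgentDescription implements java.io.Serializable {
    String agentId;                // which folder to store results in
    String agentName;              // name of the agent in the dynamic agent loader
    String agentSourceJar;         // which jar to load the agent from
    Object parameterHolder;        // in practice a ParameterHolder; kept as Object in this sketch
    String paramSummary;           // summary of the parameters for sorting and sifting
    String experimentId;           // experiment/folder at the root of this result
    String uniqueConfigurationId;  // hopefully globally unique configuration key
}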

Results Queue

Created by: Nobody (this queue always exists)
Deleted by: Nobody (this queue always exists)
Queue Name: bt-recordbook-beta-results

The results queue is where results are listed for an Aggregator to process later.  Each time that the compute process finishes an experiment, it saves the ResultRecord into a temporary file.  A temporary, unique(ish) id is created for that file, and it is uploaded to Amazon S3 in:
/bt-recordbook/beta/unprocessedData/${fileId}

After uploading the file, a message is put in the Results Queue, in the format:
${fileId}@${serializedAgentDescription}

This message is sent out, and somewhere, sometime an Aggregator will read the message off the Results Queue, and will integrate the saved ResultRecord with the rest of the data for this experiment/agent pair.
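
A minimal sketch of building and splitting that message body, assuming the AgentDescription serializes to a single string (the class and method names here are made up):

// Sketch: the Results Queue message body is "${fileId}@${serializedAgentDescription}".
public class ResultsMessage {
    public static String encode(String fileId, String serializedAgentDescription) {
        return fileId + "@" + serializedAgentDescription;
    }

    // Assumes a well-formed body; split on the first '@' only, in case the
    // serialized AgentDescription itself happens to contain '@' characters.
    public static String[] decode(String messageBody) {
        int at = messageBody.indexOf('@');
        String fileId = messageBody.substring(0, at);
        String serializedAgentDescription = messageBody.substring(at + 1);
        return new String[] { fileId, serializedAgentDescription };
    }
}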

Command Queue

Created by: Compute Node
Deleted by: Compute Node (unless killed with Dienow)
Queue Name: ${compute_name}_command

The command queue will be used by the compute process to listen for commands from the outside world.  Currently, only three commands are implemented: Shutdown, Die, and Dienow.

Shutdown tells the compute unit to stop processing new experiments, but to finish up whatever has been started already.  This means that before shutting down, the compute unit will finish the current experiment, finish any pending result uploads, and finish sending out any pending status messages.

Die tells the compute unit to stop as quickly as possible.  The experiment currently in progress will be completed, and any result upload or status message that has already been initiated will complete, but afterwards the compute node will terminate, possibly without uploading all of its results.

Dienow calls System.exit().  We should only use this in emergencies.

If the compute unit ends by any means nicer than Dienow, it should delete this queue itself before terminating.
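
Roughly, the command handling might look like the sketch below.  The flag names are made up, the polling of the command queue is left out, and the main loop is assumed to check these flags between steps:

// Sketch of how the compute node might react to the three commands.
class CommandHandler {
    volatile boolean acceptNewWork = true;  // cleared by Shutdown and Die
    volatile boolean running = true;        // cleared by Die

    void handle(String command) {
        if ("shutdown".equals(command)) {
            acceptNewWork = false;  // finish the current experiment, pending uploads, and pending status messages
        } else if ("die".equals(command)) {
            acceptNewWork = false;
            running = false;        // finish only what is already in flight, then terminate
        } else if ("dienow".equals(command)) {
            System.exit(0);         // emergency stop: no cleanup, so the command queue is left behind
        }
    }
}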

Status Queue

Created by: Compute Node
Deleted by:  StatusWatcher (see below)
Queue Name: ${compute_name}_status

The status queue will be used by the compute process to communicate its status to any programs that are listening.  Generally we call the programs that listen to status queues StatusWatchers.  I'm not sure exactly how many status messages will get sent out by default, but for now most events (starting up, starting a new experiment, completing an experiment) will result in a message through the Status Queue.

When the compute node shuts down (unless via Dienow), it should send a message with body "EOQ" (standing for End of Queue) to the status queue.  When a StatusWatcher receives the EOQ message, it should delete this message queue.
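
From the StatusWatcher side, the EOQ convention might be handled something like this sketch.  The queue client here is a hypothetical wrapper, not any particular SQS library's API:

// Sketch of the EOQ convention from the StatusWatcher side.
class StatusWatcherLoop {
    void watch(StatusQueueClient statusQueue) {
        while (true) {
            String body = statusQueue.receiveNextMessage();  // hypothetical: returns null if no message yet
            if (body == null) {
                continue;
            }
            if ("EOQ".equals(body)) {
                statusQueue.deleteQueue();  // compute node is done, so tear down its status queue
                return;
            }
            System.out.println("status: " + body);  // a real watcher would do something smarter here
        }
    }
}

// Hypothetical stand-in for whatever SQS client wrapper is actually used.
interface StatusQueueClient {
    String receiveNextMessage();
    void deleteQueue();
}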

Compute Node Execution Flow

The compute node will work out of a sandbox directory that is created when the compute node is started and will be deleted when the node terminates.  We'll just call that directory sandbox.

The Agent, Environment, and Experiment jar files will be put in: ${sandbox}/working.  We will cache most of these jars in ${sandbox}/jarCache.

  1. Set RLVIZ_LIB_PATH to ${working}
  2. Create new AgentLoader and EnvironmentLoader
  3. Pull a JobDescription off of the Work Queue
  4. Check if required jars are in ${jarCache}. If not, download them from Amazon S3 to ${jarCache}.  These jars should never change, so if they are in the cache, they are expected to be fine (no updates required; see the sketch after this list)
  5. Copy the required jars from ${jarCache} to ${working}
  6. Refresh the AgentLoader and EnvironmentLoader
  7. Load the Experiment Class using the RLVizLib ClassExtractor and create an instance of it
  8. Call theExperiment.runAbstractTrial(theJobDescription.getAgentDescription())
  9. This will return a RunRecord
  10. Send the RunRecord and JobDescription to the ResultsManager to be queued for Upload to S3
  11. Clear Agent, Environment, and Experiment jar files from ${working}
  12. Repeat from Step 3.
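
Steps 4 and 5 might look roughly like the sketch below.  The class and method names are made up, and the S3 download is left as a placeholder:

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Sketch of the jar caching in Steps 4 and 5: jars are assumed to never change once uploaded,
// so a cache hit is trusted as-is.
class JarCache {
    private final File jarCacheDir;  // ${sandbox}/jarCache
    private final File workingDir;   // ${sandbox}/working

    JarCache(File jarCacheDir, File workingDir) {
        this.jarCacheDir = jarCacheDir;
        this.workingDir = workingDir;
    }

    void stage(String jarName) throws IOException {
        File cached = new File(jarCacheDir, jarName);
        if (!cached.exists()) {
            downloadFromS3(jarName, cached);  // cache miss: fetch the jar into ${jarCache}
        }
        // Copy the cached jar into ${working} so the loaders can pick it up.
        Files.copy(cached.toPath(), new File(workingDir, jarName).toPath(),
                StandardCopyOption.REPLACE_EXISTING);
    }

    private void downloadFromS3(String jarName, File destination) throws IOException {
        // Placeholder: the real code would use an S3 client here.
        throw new UnsupportedOperationException("S3 download not sketched");
    }
}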

Aggregator Node Execution Flow

The aggregator node will work out of a sandbox directory that is created when the node is started and will be deleted when the node terminates.  We'll just call that directory sandbox.

The structure of the sandbox will mirror that of the bucket in S3, except it will be loaded on demand as needed.

  1. Pull a ResultFileName@AgentDescription off of the Results Queue
  2. Check if the required summary results are in ${sandbox}/${AgentDescription.ExperimentId}/${AgentDescription.AgentId}. If not, download them from Amazon S3.  There should only be one aggregator at a time, so if they are already in the sandbox, they are expected to be fine (no updates required)
  3. Download ${ResultFileName} from Amazon S3 to ${sandbox}/unProcessedResults
  4. Read in ${ResultFileName} to tmpResultRecord
  5. Append the tmpResultRecord to ${sandbox}/${AgentDescription.ExperimentId}/${AgentDescription.AgentId}/resultSummary (see the sketch after this list)
  6. Append the information in AgentDescription to ${sandbox}/${AgentDescription.ExperimentId}/${AgentDescription.AgentId}/resultIndex (if necessary)
  7. Upload the new resultSummary and resultIndex (might want to do this less frequently than every time)
  8. On S3, move unProcessed/${ResultFileName} to ${AgentDescription.ExperimentId}/${AgentDescription.AgentId}/processedResult/${ResultFileName} (the same layout the sandbox mirrors locally)
  9. Delete the local copy of ${ResultFileName}.
  10. Go back to 1
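
Steps 5 and 6 are just appends to flat files.  A sketch of that (class and method names made up, and the on-disk record format left open):

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

// Sketch of Steps 5 and 6: append one record line to the per-agent summary files
// under ${sandbox}/${ExperimentId}/${AgentId}.
class SummaryAppender {
    void appendResult(File agentDir, String resultRecordLine) throws IOException {
        appendLine(new File(agentDir, "resultSummary"), resultRecordLine);
    }

    void appendIndexEntryIfNew(File agentDir, String agentDescriptionLine) throws IOException {
        // The "if necessary" in Step 6: a real implementation would first check whether
        // this AgentDescription already appears in resultIndex before appending.
        appendLine(new File(agentDir, "resultIndex"), agentDescriptionLine);
    }

    private void appendLine(File file, String line) throws IOException {
        FileWriter out = new FileWriter(file, true);  // true = append
        try {
            out.write(line);
            out.write(System.getProperty("line.separator"));
        } finally {
            out.close();
        }
    }
}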
