Introduction
This page, and those below it, will explain the current implementation details of the bt-recordbook. Sprinkled around the rest of the site are some design decisions, and places where I'm considering and debating different approaches. I'll try to use this space to actually explain what I *did* do, as much for my own benefit as yours (whoever you are).
Compute Units
Each compute unit will be running a Java process in an infinite loop. Each compute unit will have a ${compute_name}, which will roughly be its network name plus a random number. This will be assigned when the compute unit starts up. When the compute unit starts, it will also create several SQS queues.
Work Queue
Created by: Always exists
Deleted by: Always exists
Queue Name: bt-recordbook-beta-work-queue (for now)
The work queue is where experiments are submitted to be processed later. Messages in the work queue are string-serialized JobDescription.class objects (a rough code sketch of these classes follows the field lists below). Not sure exactly in which package JobDescription will live yet.
JobDescription is:
- An agentDescription Object
- A String representing the submitter
- A Date representing when the job was submitted
- An experimentClassName string representing which class should be invoked via reflection to run the experiment
An agentDescription is:
- An agentId String (tells us which folder to store the results in)
- An agentName String (tells us the name of the agent in the dynamic agent loader)
- An agentSourceJar String (tells us what Jar to load the agent from which will be important when we have many versions on the go)
- A parameterHolder object (tells us what parameters to run the agent with)
- A paramSummary String (summary of the parameterHolder that is useful for sorting and sifting later)
- An experimentId String (tells us which experiment/folder is at the root of this result)
- A uniqueConfigurationId (a hopefully globally unique key for this experiment/environment/agent/parameter configuration).
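To make the shapes concrete, here is a rough Java sketch of what these two classes might look like. The field names mirror the lists above, but the package, the exact type of the parameterHolder, and the serialization mechanism are all still undecided, so treat this as illustration rather than the real code.

```java
import java.io.Serializable;
import java.util.Date;

// Illustrative sketch only; real package, accessors, and serialization are still TBD.
public class JobDescription implements Serializable {
    private final AgentDescription agentDescription;
    private final String submitter;
    private final Date submittedOn;
    private final String experimentClassName; // invoked via reflection to run the experiment

    public JobDescription(AgentDescription agentDescription, String submitter,
                          Date submittedOn, String experimentClassName) {
        this.agentDescription = agentDescription;
        this.submitter = submitter;
        this.submittedOn = submittedOn;
        this.experimentClassName = experimentClassName;
    }

    public AgentDescription getAgentDescription() { return agentDescription; }
    public String getExperimentClassName() { return experimentClassName; }
}

// Would normally live in its own file.
class AgentDescription implements Serializable {
    String agentId;                // which folder to store the results in
    String agentName;              // name of the agent in the dynamic agent loader
    String agentSourceJar;         // which jar to load the agent from
    Object parameterHolder;        // the ParameterHolder; exact class/import still TBD
    String paramSummary;           // summary of the parameters, useful for sorting and sifting
    String experimentId;           // which experiment/folder is at the root of this result
    String uniqueConfigurationId;  // (hopefully) globally unique configuration key
    // constructor and getters omitted
}
```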
Results Queue
Created by: Always exists
Deleted by: Always exists
Queue Name: bt-recordbook-beta-results
The results queue is where results are listed for an Aggregator to process later. Each time that the compute process finishes an experiment, it saves the ResultRecord into a temporary file. A temporary, unique(ish) id is created for that file, and it is uploaded to Amazon S3 in: /bt-recordbook/beta/unprocessedData/${fileId}
After uploading the file, a message is put in the Results Queue, in the format: ${fileId}@${serializedAgentDescription}
This message is sent out, and somewhere, sometime an Aggregator will read the message off the Results Queue, and will integrate the saved ResultRecord with the rest of the data for this experiment/agent pair.
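Here's a hedged sketch of that upload-and-notify step. It's written against the AWS SDK for Java (v1), which may not be what the actual implementation uses, and the bucket/key layout and class name are just my reading of the paths above.

```java
import java.io.File;
import java.util.UUID;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;

// Illustrative only: publishes one finished result to S3 and announces it on the Results Queue.
public class ResultPublisher {
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

    public void publish(File resultRecordFile, String serializedAgentDescription) {
        // Temporary, unique(ish) id for the unprocessed result file.
        String fileId = UUID.randomUUID().toString();

        // Upload to /bt-recordbook/beta/unprocessedData/${fileId}.
        s3.putObject("bt-recordbook", "beta/unprocessedData/" + fileId, resultRecordFile);

        // Tell the Aggregator where to find it: ${fileId}@${serializedAgentDescription}.
        String queueUrl = sqs.getQueueUrl("bt-recordbook-beta-results").getQueueUrl();
        sqs.sendMessage(queueUrl, fileId + "@" + serializedAgentDescription);
    }
}
```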
Command Queue
Created by: Compute Node
Deleted by: Compute Node (unless killed with Dienow)
Queue Name: ${compute_name}_command
The command queue will be used by the compute process to listen for commands from the outside world. Currently, the only three commands that are implemented are shutdown, die, and dienow.
Shutdown tells the compute unit to stop processing new experiments, but to finish up whatever has been started already. This means that before shutting down, the compute unit will finish the current experiment, finish any pending result uploads, and finish sending out any pending status messages.
Die tells the compute unit to stop processing new work immediately. The experiment currently in progress will be completed, and any result uploads or status messages that have already been initiated will finish, but afterwards the compute node will terminate, possibly without uploading all of its results.
Dienow calls System.exit(). We should only use this in emergencies.
If the compute unit ends by any means nicer than Dienow, it should delete this queue itself before terminating.
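As a rough sketch of how the compute unit might react to these commands (the flags and method names here are made up; only the three command strings come from the design above):

```java
// Hypothetical command handling, checked by the compute unit between experiments.
public class CommandHandler {
    private volatile boolean acceptNewWork = true;
    private volatile boolean finishPendingUploads = true;

    public void handle(String command) {
        switch (command) {
            case "shutdown":
                // Stop taking new experiments, but finish the current experiment,
                // pending result uploads, and pending status messages.
                acceptNewWork = false;
                break;
            case "die":
                // Finish only what has already been initiated, then terminate,
                // possibly without uploading all of the results.
                acceptNewWork = false;
                finishPendingUploads = false;
                break;
            case "dienow":
                // Emergencies only. Note the command queue will not get deleted in this case.
                System.exit(1);
                break;
            default:
                System.err.println("Unknown command: " + command);
        }
    }

    public boolean acceptNewWork() { return acceptNewWork; }
    public boolean finishPendingUploads() { return finishPendingUploads; }
}
```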
Status Queue
Created by: Compute Node
Deleted by: StatusWatcher (see below)
Queue Name: ${compute_name}_status
The status queue will be used by the compute process to communicate its status to any programs that are listening. Generally we call the programs that listen to status queues StatusWatchers. Not sure exactly how many status messages will get sent out by default, but for now most events (start up, starting new experiment, completing experiment) will result in a message through the Status Queue.
When the compute node shuts down (unless via Dienow), it should send a message with body "EOQ" (standing for End of Queue) to the status queue. When a StatusWatcher receives the EOQ message, it should delete the status queue.
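Here's a rough sketch of the StatusWatcher side of that contract, again against the AWS SDK for Java v1 and with invented class and method names:

```java
import java.util.List;

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

// Illustrative only: tails one compute unit's status queue and cleans it up on EOQ.
public class StatusWatcher {
    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

    public void watch(String computeName) {
        String queueUrl = sqs.getQueueUrl(computeName + "_status").getQueueUrl();
        while (true) {
            List<Message> messages = sqs.receiveMessage(queueUrl).getMessages();
            for (Message m : messages) {
                if ("EOQ".equals(m.getBody())) {
                    // The compute node is done; the watcher is responsible for deleting the queue.
                    sqs.deleteQueue(queueUrl);
                    return;
                }
                System.out.println("status: " + m.getBody());
                sqs.deleteMessage(queueUrl, m.getReceiptHandle());
            }
        }
    }
}
```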
Compute Node Execution Flow
The compute node will work out of a sandbox directory that is created when the compute node is started and will be deleted when the node terminates. We'll just call that directory sandbox.
The Agent, Environment, and Experiment jar files will be put in: ${sandbox}/working. We will cache most of these jars in ${sandbox}/jarCache.
1. Set RLVIZ_LIB_PATH to ${working}
2. Create new AgentLoader and EnvironmentLoader
3. Pull a JobDescription off of the Work Queue
4. Check if the required jars are in ${jarCache}. If not, download them from Amazon S3 to ${jarCache}. These jars should never change, so if they are in the cache, they are expected to be fine (no updates required)
5. Copy the required jars from ${jarCache} to ${working}
6. Refresh the AgentLoader and EnvironmentLoader
7. Load the Experiment class using the RLVizLib ClassExtractor and create an instance of it
8. Call theExperiment.runAbstractTrial(theJobDescription.getAgentDescription()), which will return a RunRecord
9. Send the RunRecord and JobDescription to the ResultsManager to be queued for upload to S3
10. Clear the Agent, Environment, and Experiment jar files from ${working}
11. Repeat from Step 3.
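The interesting bit in steps 7 and 8 is loading and invoking the experiment reflectively. A minimal, self-contained illustration of that piece follows; the real code goes through RLVizLib's ClassExtractor and the dynamic loaders rather than a raw URLClassLoader, and the class and method names here are placeholders.

```java
import java.io.File;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

// Illustrative only: load an experiment class by name from a jar in ${working}
// and call its runAbstractTrial method reflectively.
public class ExperimentRunner {
    public static Object runExperiment(File experimentJar, String experimentClassName,
                                       Class<?> agentDescriptionType, Object agentDescription)
            throws Exception {
        URLClassLoader loader = new URLClassLoader(
                new URL[] { experimentJar.toURI().toURL() },
                ExperimentRunner.class.getClassLoader());

        Class<?> experimentClass = Class.forName(experimentClassName, true, loader);
        Object experiment = experimentClass.getDeclaredConstructor().newInstance();

        // In the real flow this is theExperiment.runAbstractTrial(theJobDescription.getAgentDescription()),
        // and the returned RunRecord is handed off to the ResultsManager for upload.
        Method run = experimentClass.getMethod("runAbstractTrial", agentDescriptionType);
        return run.invoke(experiment, agentDescription);
    }
}
```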
Aggregator Node Execution Flow
The aggregator node will work out of a sandbox directory that is created when the node is started and will be deleted when the node terminates. We'll just call that directory sandbox.
The structure of the sandbox will mirror that of the bucket in S3, except it will be loaded on demand as needed.
1. Pull a ${ResultFileName}@${AgentDescription} message off of the Results Queue
2. Check if the required summary results are in ${sandbox}/${AgentDescription.ExperimentId}/${AgentDescription.AgentId}. If not, download them from Amazon S3. There should only be one aggregator running at a time, so cached copies are expected to be fine (no updates required)
3. Download ${ResultFileName} from Amazon S3 to ${sandbox}/unProcessedResults
4. Read in ${ResultFileName} to tmpResultRecord
5. Append the tmpResultRecord to ${sandbox}/${AgentDescription.ExperimentId}/${AgentDescription.AgentId}/resultSummary
6. Append the information in AgentDescription to ${sandbox}/${AgentDescription.ExperimentId}/${AgentDescription.AgentId}/resultIndex (if necessary)
7. Upload the new resultSummary and resultIndex (might want to do this less frequently than every time)
8. On S3, move unProcessed/${ResultFileName} to ${AgentDescription.ExperimentId}/${AgentDescription.AgentId}/processedResult/${ResultFileName} (the bucket path that mirrors the sandbox layout)
9. Delete the local copy of ${ResultFileName}
10. Go back to Step 1
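One detail worth calling out in step 8: S3 has no rename, so the "move" is a copy followed by a delete. Here's a hedged sketch of one pass through the loop (AWS SDK for Java v1, invented names, summary-file handling omitted):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

// Illustrative only: one pass of the aggregator over the Results Queue.
public class Aggregator {
    private static final String BUCKET = "bt-recordbook";
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

    public void processOnce() {
        String queueUrl = sqs.getQueueUrl("bt-recordbook-beta-results").getQueueUrl();
        for (Message m : sqs.receiveMessage(queueUrl).getMessages()) {
            // Message format: ${fileId}@${serializedAgentDescription}
            String[] parts = m.getBody().split("@", 2);
            String fileId = parts[0];
            String agentDescription = parts[1];

            // Steps 2-7: fetch summaries on demand, download the result file, append it to
            // resultSummary/resultIndex, and re-upload. Omitted here for brevity.

            // Step 8: "move" the raw result from the unprocessed area to the processed area.
            String unprocessedKey = "beta/unprocessedData/" + fileId;
            String processedKey = processedKeyFor(agentDescription, fileId);
            s3.copyObject(BUCKET, unprocessedKey, BUCKET, processedKey);
            s3.deleteObject(BUCKET, unprocessedKey);

            sqs.deleteMessage(queueUrl, m.getReceiptHandle());
        }
    }

    private String processedKeyFor(String serializedAgentDescription, String fileId) {
        // Would resolve ${ExperimentId}/${AgentId}/processedResult/${fileId} from the description.
        return "beta/processedResult/" + fileId; // placeholder path
    }
}
```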