
Alpha Trial Run

posted Aug 28, 2008, 10:19 AM by Brian Tanner   [ updated Oct 16, 2008, 9:01 AM ]

The Setup

As a test run, I'm running a variety of experiments in Mountain Car with Sarsa0 and with my experience replay agent.  There are about 6700 parameter combinations between these two agents (in this first round), and I'm running each one 30 times, which works out to roughly 200 000 runs.  This should give me a clear picture of where the weaknesses are in the software.

After 1 day of computing, a couple of issues have popped up already.

Issues

Job Queuing Enhancements

First, I want to have many agents queued up at the same time, so the old system of submitting each job 30 times to ensure 30 runs needed to be revisited.  Each job now has a "runs remaining" counter: when the job is popped off the queue, the counter is decremented and the job is re-added to the back of the queue.  This means we can dump many agents and parameter configurations into the queue at once, and they will be scheduled round robin.  Not ideal, but much better.  It also makes the submitter script much faster, because it needs to put far fewer messages into the queue.  The compute nodes are careful not to delete the original message until the decremented replacement is in the queue, so we shouldn't lose any jobs.  There is a small risk of job duplication if the delete fails or isn't called, but that's not terrible.
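The runs-remaining scheme can be sketched roughly like this (a minimal in-memory mock, not the real queue-backed implementation; the `Job` and `WorkQueue` names are placeholders). The key ordering is visible in `takeForRun`: the decremented replacement is enqueued before the original is dropped, so a crash can at worst duplicate a run, never lose one.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A job with a "runs remaining" counter (hypothetical sketch).
class Job {
    final String name;
    final int runsRemaining;
    Job(String name, int runsRemaining) {
        this.name = name;
        this.runsRemaining = runsRemaining;
    }
}

class WorkQueue {
    private final Deque<Job> queue = new ArrayDeque<Job>();

    void submit(Job job) {
        queue.addLast(job);
    }

    // Pop one job for execution.  The decremented copy goes to the BACK of
    // the queue first (round-robin scheduling across agents), and only then
    // is the original message deleted -- here, simply dropped.
    Job takeForRun() {
        Job original = queue.pollFirst();
        if (original == null) return null;
        if (original.runsRemaining > 1) {
            queue.addLast(new Job(original.name, original.runsRemaining - 1));
        }
        return original;
    }

    int size() {
        return queue.size();
    }
}
```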

Aggregator needs to be faster

I have 6 compute nodes running on our Eureka cluster, and one on my laptop.  For the Sarsa experiments, which take less than 1 second each (once downloaded and configured), the stream of results is overwhelming the aggregator.  This was expected, but still disappointing.  I really have to find a clever way (I have some ideas) to distribute the aggregator's load so the work can be shared among several aggregation nodes.  It is frustrating that the computation is done but we can't see the results, because there is a long queue before they are integrated into our results files.  When the experiments take longer (many of the experience replay experiments take 1-5 minutes), the flow of results slows and the aggregator can catch up.  That's good.
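One possible sharding scheme (this is just a sketch of one of the ideas, not what is implemented): hash each result's parameter-set id to pick which aggregator owns it, so every data file has exactly one writer and the nodes never contend. The `ResultRouter` name and the id format are assumptions.

```java
class ResultRouter {
    // Route a result to one of several aggregator nodes by hashing its
    // parameter-set id.  Each id always maps to the same aggregator, so
    // each results file is owned by exactly one node.
    static int aggregatorFor(String parameterSetId, int numAggregators) {
        // floorMod keeps the index non-negative even when hashCode() < 0.
        return Math.floorMod(parameterSetId.hashCode(), numAggregators);
    }
}
```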

Aggregator needs more memory

After the data files for the two agents got to about 10-20 MB, the aggregator's heap ran out.  For now, I've bumped the Java heap from 64 MB to 256 MB.  This should give us considerable headroom, but in the longer term we might have to be more clever (depending on how huge these files get).
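For reference, the heap bump is just the standard `-Xmx` flag on the aggregator's launch command (the jar name below is a placeholder):

```shell
# Raise the max heap from the old 64 MB setting to 256 MB.
java -Xmx256m -jar Aggregator.jar
```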

Westgrid Fails

I tried (briefly) to get the compute nodes running on Westgrid.  First there was a problem (again) with the JAXB library and Java 6.  I finally figured out how to fix that by using -Djava.endorsed.dirs=/path/to/the/2.1/jaxb/api/jar.
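The workaround uses Java's endorsed-standards override mechanism to make the JAXB 2.1 API jar take precedence over the version bundled with the JDK. A full invocation would look something like this (the jar name is a placeholder, and the directory path is kept as written above):

```shell
# Point Java 6 at the JAXB 2.1 API jar via the endorsed dirs override.
java -Djava.endorsed.dirs=/path/to/the/2.1/jaxb/api/jar -jar ComputeNode.jar
```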

After that, I could successfully start compute nodes from the command line, but my queue-submitted version would not connect to the authorizer servlet.  Errors like:
Aug 27, 2008 11:10:16 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
INFO: I/O exception (java.net.ConnectException) caught when processing request: Connection timed out
Aug 27, 2008 11:10:17 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
INFO: Retrying request

This happens occasionally on all nodes, but seems to be a show-stopper on Glacier.  Will come back to that later.

Round 2

After some time away from this project, I'm back running big tests on Salient.  I've cleaned up the agent library to more explicitly release memory in the agents, and that seems to be helping some of the resource problems pretty substantially.  The ComputeNode processes are growing, but at a reasonable rate now.

However, after having 32 nodes running at 3 am, there are now only 17 still running in the morning.

    Name: bt-buntu-1224133182907    Type: ComputeNode    Last Beat: 5.57  minutes (gotta be dead)    Status: ProcessingJob(539):RunningExperiment
    Name: bt-buntu-1224133189915    Type: ComputeNode    Last Beat: 5.62  minutes (gotta be dead)    Status: ProcessingJob(491):RunningExperiment
    Name: bt-buntu-1224133193142    Type: ComputeNode    Last Beat: 5.57  minutes (gotta be dead)    Status: ProcessingJob(592):RunningExperiment
    Name: node001-1224143501806    Type: ComputeNode    Last Beat: 5.65  minutes (gotta be dead)    Status: ProcessingJob(390):RunningExperiment
    Name: node001-1224143628529    Type: ComputeNode    Last Beat: 5.68  minutes (gotta be dead)    Status: ProcessingJob(390):RunningExperiment
    Name: node001-1224144441373    Type: ComputeNode    Last Beat: 5.68  minutes (gotta be dead)    Status: ProcessingJob(342):RunningExperiment
    Name: node003-1224143628537    Type: ComputeNode    Last Beat: 5.71  minutes (gotta be dead)    Status: ProcessingJob(518):RunningExperiment
    Name: node003-1224144437922    Type: ComputeNode    Last Beat: 5.65  minutes (gotta be dead)    Status: ProcessingJob(461):RunningExperiment
    Name: node004-1224143751539    Type: ComputeNode    Last Beat: 3.57  minutes (probably dead)    Status: ProcessingJob(63):PuttingJobBackInQueueFor(28)Runs
    Name: node005-1224143628545    Type: ComputeNode    Last Beat: 5.65  minutes (gotta be dead)    Status: ProcessingJob(533):RunningExperiment
    Name: node005-1224144437926    Type: ComputeNode    Last Beat: 5.65  minutes (gotta be dead)    Status: ProcessingJob(452):RunningExperiment
    Name: node010-1224145205977    Type: ComputeNode    Last Beat: 5.63  minutes (gotta be dead)    Status: ProcessingJob(509):RunningExperiment
    Name: node010-1224145235025    Type: ComputeNode    Last Beat: 5.62  minutes (gotta be dead)    Status: ProcessingJob(493):UpdatingThisUserComputeStats
    Name: node016-1224145206022    Type: ComputeNode    Last Beat: 5.63  minutes (gotta be dead)    Status: ProcessingJob(3):DeletingThisMessageFromWorkQueue
    Name: node019-1224145234924    Type: ComputeNode    Last Beat: 5.64  minutes (gotta be dead)    Status: ProcessingJob(25):DeletingThisMessageFromWorkQueue
    Name: node027-1224145206247    Type: ComputeNode    Last Beat: 5.71  minutes (gotta be dead)    Status: ProcessingJob(54):PuttingJobBackInQueueFor(28)Runs
    Name: node027-1224145234986    Type: ComputeNode    Last Beat: 5.69  minutes (gotta be dead)    Status: ProcessingJob(3):DeletingThisMessageFromWorkQueue
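The "probably dead" / "gotta be dead" labels in the listing above come from how stale each node's last heartbeat is. A sketch of that classification (the exact thresholds are my assumption, read off the listing: beats older than about 3 minutes look suspicious, older than about 5 minutes are hopeless):

```java
class HeartbeatMonitor {
    // Classify a compute node by minutes since its last heartbeat.
    // Thresholds are assumptions inferred from the status listing.
    static String classify(double minutesSinceLastBeat) {
        if (minutesSinceLastBeat < 3.0) return "alive";
        if (minutesSinceLastBeat < 5.0) return "probably dead";
        return "gotta be dead";
    }
}
```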

I'm going to create a subpage to try to address each piece of weirdness one by one.