posted Jul 6, 2008, 1:43 AM by Brian Tanner
The record book has a drawback for really evaluating algorithms: each record is self-contained. It is perfectly reasonable for a person to submit the optimal Mountain Car agent to a Mountain Car record, and clearly that agent will beat any learning agent that we can devise.
So, how can we really evaluate learning algorithms? By what metric can we compare them to each other, and are there ways to get more information by looking at an algorithm outside of a particular record?
Metric

I propose that in addition to the main score criterion for each record, we report a relative percentage score. This score will be in [0, 1], and will be calculated as:

normalized_score = (score - min_score) / (max_score - min_score)
This is useful because it will allow us to have some way of talking about algorithms over a variety of records, and hopefully helps us to overcome some of the major differences between problems.
Take for example a Mountain Car record where the goal is to maximize cumulative reward over 10000 steps, with a reward of -1 per step.
An agent that completes no episodes will have cumulative reward = -10000.
Assuming the optimal policy takes 100 steps per episode, an optimal agent will finish about 100 episodes, for cumulative reward = -9900.
A good learning algorithm might complete 50 episodes, for cumulative reward = -9950.
Talking about these numbers in their raw format is awkward, because the scales are arbitrary. We could have another event called "super pain Mountain Car", where instead of -1 per step you get -10000 per step, and comparing algorithms (in terms of reward) between these two records would be meaningless without informed scaling.
However, with this normalized score metric, we can see that the good learning algorithm has score: (-9950 - (-10000)) / (-9900 - (-10000)) = 50 / 100 = 0.5
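The arithmetic above can be sketched directly; this is a minimal illustration of the proposed metric, with min_score and max_score taken from the Mountain Car example in the text:

```python
def normalized_score(score, min_score, max_score):
    """Map a raw record score into [0, 1], relative to the worst
    and best known scores for that record."""
    return (score - min_score) / (max_score - min_score)

# Mountain Car record from the example: reward is -1 per step over 10000 steps.
min_score = -10000   # agent that completes no episodes
max_score = -9900    # optimal agent: ~100 episodes of 100 steps each
learner = -9950      # good learner completing ~50 episodes

print(normalized_score(learner, min_score, max_score))  # 0.5
```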
Maximizing Score Across Problems

Now that we have this score, we can take a set of algorithms that are applicable to several records and ask: "what algorithm has maximal score across these records for a fixed parameter setting?" This gives us a real way to talk about algorithms and their parameter sensitivity at a high level.
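One way this cross-record comparison could work is to average each fixed parameter setting's normalized score over all records it was run on, and pick the setting with the best average. A hypothetical sketch, where the record names, parameter settings, and raw scores are all invented for illustration:

```python
def normalized_score(score, min_score, max_score):
    return (score - min_score) / (max_score - min_score)

# Hypothetical data. For each record: (min_score, max_score,
# {parameter_setting: raw score achieved with that setting}).
records = {
    "mountain_car": (-10000, -9900, {"alpha=0.1": -9950, "alpha=0.5": -9990}),
    "cart_pole": (0, 10000, {"alpha=0.1": 6000, "alpha=0.5": 9000}),
}

def average_normalized(records):
    """Average each parameter setting's normalized score across records."""
    totals = {}
    for min_s, max_s, scores in records.values():
        for setting, raw in scores.items():
            totals.setdefault(setting, []).append(
                normalized_score(raw, min_s, max_s))
    return {setting: sum(v) / len(v) for setting, v in totals.items()}

# The setting with the best average normalized score across all records.
best = max(average_normalized(records).items(), key=lambda kv: kv[1])
print(best)
```

The same loop, run by the record book itself, is what would let it "suggest" untried parameter combinations: any (algorithm, parameters, record) cell with no raw score yet is a candidate run.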
You can very easily imagine having the record book "suggest" running different parameter combinations of existing algorithms on different problems!

