The record book has a drawback for really evaluating algorithms: each record is self-contained. It is perfectly reasonable for a person to submit the optimal mountain car agent to a mountain car record. Clearly, that agent will beat any learning agent that we can devise.
So, how can we really evaluate learning algorithms? By what metric can we compare them to each other, and are there ways to get more information by looking at an algorithm outside of a particular record?
normalized_score = (score - min_score) / (max_score - min_score)
This is useful because it will allow us to have some way of talking about algorithms over a variety of records, and hopefully helps us to overcome some of the major differences between problems.
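A minimal sketch of this metric as a Python function (the function name and the guard against a degenerate range are my own additions, not part of the record book):

```python
def normalized_score(score, min_score, max_score):
    """Map a raw score onto [0, 1] relative to a record's known range.

    0.0 corresponds to the worst known score, 1.0 to the best known score.
    """
    if max_score == min_score:
        raise ValueError("max_score and min_score must differ")
    return (score - min_score) / (max_score - min_score)
```

Scores between the known worst and best land in (0, 1), which makes them comparable across records with very different reward scales.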
Take for example a Mountain Car record where the goal is to maximize cumulative reward over 10000 steps.
An agent that completes no episodes will have cumulative reward = -10000
Assuming the optimal policy takes 100 steps, an optimal agent will finish about 100 episodes for cumulative reward = -9900
A good learning algorithm might complete 50 episodes, cumulative reward = -9950
Talking about these numbers in their raw format is awkward, because the scales are arbitrary. We could have another event called "super pain Mountain Car", where instead of -1 per step you get -10000 per step, and comparing algorithms (in terms of reward) between these two records would be meaningless without informed scaling.
However, with this normalized score metric, we can see that the good learning algorithm has score:
(-9950 + 10000) / (-9900 + 10000) = 50 / 100 = 1/2 = .5
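The scale invariance is easy to check numerically: multiplying every reward by 10000, as in the hypothetical "super pain" variant, leaves the normalized score untouched. A quick sketch using the numbers from the example above:

```python
# Raw scores from the Mountain Car example: good learner, worst case, optimal
good, worst, best = -9950, -10000, -9900
print((good - worst) / (best - worst))  # 0.5

# "Super pain" Mountain Car: every reward scaled by 10000.
# The common factor cancels, so the normalized score is unchanged.
scale = 10000
print((good * scale - worst * scale) / (best * scale - worst * scale))  # 0.5
```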
You can easily imagine having the record book "suggest" running different parameter combinations of existing algorithms on different problems!