
Comparing Learning Algorithms


How to compare learning algorithms

Here we learn to play the Blackjack MDP using Q-learning and two variants of Posterior Sampling, and compare how quickly each algorithm learns.

First we need some imports:

from perl.mdp.blackjack import Blackjack
from perl.rl.environment import mdp_to_env, env_value
from perl.rl.simulator import reward_path
from perl.rl.algorithms import FixedPolicy, Qlearning, PosteriorSampling, TwoStepPosteriorSampling

# for solving the MDP directly
from perl.mdp import value_iteration, policy_iteration

from perl.priors import NormalPrior

# for plotting
import toyplot as tp

Next, create the environment and initialize the learners:

mdp = Blackjack()
env = mdp_to_env(mdp)

# prior over rewards used by the posterior sampling learners
prior = lambda: NormalPrior(0, 5, 1)
QL = Qlearning(env)
PS = PosteriorSampling(mdp, p_reward=prior)
TSPS = TwoStepPosteriorSampling(mdp, p_reward=prior)

algos = [("QL", QL), ("PS", PS), ("TSPS", TSPS)]

num_episodes = 25        # learning episodes per algorithm
num_test_episodes = 500  # episodes used to evaluate the current policy

We can also solve the MDP using value iteration:

opt_val, opt_pol = value_iteration(mdp)
max_val = env_value(env, opt_val)
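
As a quick sanity check we can print the optimal expected reward, which the learning curves should approach (a minimal sketch, assuming env_value returns a scalar here):

# best achievable expected reward per episode
print("Optimal expected reward: {:.3f}".format(max_val))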

Now run the learning experiments using the reward_path function:

paths = [(name, reward_path(env, algo, num_episodes, num_test_episodes=num_test_episodes)) 
         for name, algo in algos]
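
Each path is a sequence of tuples; judging from how they are unpacked below, every entry holds the episode index, the average learning reward with its standard deviation, and the average test reward with its standard deviation. A quick way to peek at the raw numbers (a sketch under that assumption):

# print the first few (episode, learning reward, test reward) triples for Q-learning
for episode, lr, sd_lr, tr, sd_tr in paths[0][1][:3]:
    print(episode, lr, tr)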

Finally, we plot the results:

canvas = tp.Canvas(900, 300)
for i, (name, path) in enumerate(paths):
    nepi, lr, sd_lr, tr, sd_tr = zip(*path)
    
    algo = algos[i][1]
    policy_val = env_value(env, policy_iteration(mdp, algo.optimal_policy))

    axes = canvas.cartesian(label="Learning using {}".format(name), 
                            xlabel="episode",
                            ylabel="reward",
                            xmin=0,
                            ymin=-1,
                            ymax=1.5*max_val,
                            grid=(1,len(paths),i))
    
    # plot optimal value and value of final policy
    axes.hlines([max_val], opacity=0.5)
    axes.hlines([policy_val], opacity=0.8)

    # learning reward (red) and test reward (blue)
    axes.plot(nepi, lr, color="red")
    axes.plot(nepi, tr, color="blue")

    # dashed bands at +/- 2 standard deviations around each curve
    for sgn in [1, -1]:
        axes.plot(nepi, [x + 2*sgn*s for x, s in zip(lr, sd_lr)],
                  style={"stroke": "red", "stroke-dasharray": "2, 2"},
                  opacity=0.5)
        axes.plot(nepi, [x + 2*sgn*s for x, s in zip(tr, sd_tr)],
                  style={"stroke": "blue", "stroke-dasharray": "2, 2"},
                  opacity=0.5)

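In a Jupyter notebook the canvas renders inline; outside a notebook you can save it to a file, for instance with toyplot's HTML backend:

import toyplot.html

# write the figure to an HTML file that can be opened in a browser
toyplot.html.render(canvas, "learning_comparison.html")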