diff --git a/.gitignore b/.gitignore
index 372c13e..8ae8ce5 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,2 +1,2 @@
 __pycache__/
-
+.idea/
diff --git a/Cattelle/ExperienceReplay.py b/Cattelle/ExperienceReplay.py
new file mode 100644
index 0000000..a33f674
--- /dev/null
+++ b/Cattelle/ExperienceReplay.py
@@ -0,0 +1,92 @@
+from collections import deque
+
+import numpy as np
+
+from utils import process_screen
+
+
+class ExperienceReplay:
+    """
+    This class defines a handy structure to store and handle the experience replay memory
+
+    It provides the following actions:
+    * Process the screen (convert to grayscale, downscale, crop)
+    * Update the underlying experience replay array
+    * Randomly sample the memory to yield a minibatch
+    * Append new samples to the array
+    """
+
+    def __init__(self, size, history_length=4, minibatch_size=32):
+        """
+        Initialise the underlying array holding the experience replay memory
+
+        One sample is defined as a (s,a,r,s',d) tuple where one state (s) corresponds to the history_length last
+        frames stacked together. Each state is implemented as a deque which prevents having to handle the maximum
+        length and speeds up access time to both ends of the queue.
+
+        The memory is implemented as a deque as well, and is filled from left to right. The right-most sample is
+        thus always the newest one.
+
+        Args:
+            size (int): Total number of samples to keep in the memory
+            history_length (int): Number of frames to keep in one state (stacked together). Default is 4
+            minibatch_size (int): Number of samples in a minibatch
+        """
+        self.memory = deque(maxlen=size)
+        self.history_length = history_length
+        self.size = size
+        self.minibatch_size = minibatch_size
+
+    def append_sample(self, s, a, r, s_new, d):
+        """
+        Append a new sample to the ER memory.
+
+        The screen states will be processed and appended to the correct stacks
+        Args:
+            s (np.ndarray): Raw (unprocessed) screen state
+            a (int): Action taken at state s leading to state s_new
+            r (float): Reward for taking action a at state s
+            s_new (np.ndarray): Raw (unprocessed) resulting screen state
+            d (bool): True if the game is in a terminal state (game over), False otherwise
+        """
+
+        if len(self.memory) == 0 or self.memory[-1][4] is True:
+            # We handle the initial insertion or the first one after a terminal differently
+            # The initial state in this case is 4 times the same frame
+            s = process_screen(s)
+            state = deque([s] * self.history_length, maxlen=self.history_length)  # state = [s, s, ..., s]
+
+            state_new = state.copy()
+            state_new.append(process_screen(s_new))  # state_new = [s, s, ..., s_new]
+
+            self.memory.append((state, a, r, state_new, d))
+
+        else:
+            # Grab the last sample recorded
+            last_sample = self.memory[-1]
+
+            # Build the new state (stack)
+            new_state = last_sample[3].copy()
+            new_state.append(process_screen(s_new))
+
+            # And append to the memory
+            self.memory.append((last_sample[3], a, r, new_state, d))
+
+    def minibatch(self):
+        """
+        Randomly samples a minibatch of size minibatch_size and returns it
+        Returns:
+            minibatch (np.ndarray): Randomly sampled minibatch
+        """
+
+        # Get the current size of the memory
+        size = len(self.memory)
+
+        if size < self.minibatch_size:
+            raise IndexError(f'minibatch_size ({self.minibatch_size}) is larger than the current size of the ER '
+                             f'memory ({size})')
+
+        # self.memory is not 1D thus we cannot sample it directly, instead we sample indices and build back an array
+        indices = np.random.choice(size, self.minibatch_size)
+
+        return np.array([self.memory[i] for i in indices]).T
diff --git a/Cattelle/FlappyAgent.py b/Cattelle/FlappyAgent.py
new file mode 100644
index 0000000..fb234b7
--- /dev/null
+++ b/Cattelle/FlappyAgent.py
@@ -0,0 +1,28 @@
+import numpy as np
+from keras import models
+
+from config import Config as config
+from utils import StateHolder
+
+# Initialise the DQN
+dqn = models.load_model(config.MODEL_FILENAME)
+
+stateholder = StateHolder()
+
+
+def FlappyPolicy(_, screen):
+    """
+    Main game policy, defines the behaviour of the agent
+    Args:
+        _ (dict): The state vector of the simulator, ignored here
+        screen (numpy.ndarray): Current state of the screen (RGB matrix)
+
+    Returns:
+        action (int): The action to take
+    """
+    stateholder.append(screen)
+    state = stateholder.get_dqn_input()
+
+    Q = dqn.predict(state)  # Expects a (no_samples, history_length, 84, 84) input
+
+    return np.argmax(Q) * 119  # argmax is either 0 or 1 with convention 0: no-op; 1: flap
diff --git a/Cattelle/config.py b/Cattelle/config.py
new file mode 100644
index 0000000..b95509f
--- /dev/null
+++ b/Cattelle/config.py
@@ -0,0 +1,37 @@
+class Config:
+    # Experience replay settings
+    ER_SIZE = 20000  # Total number of samples to keep in the Experience Replay memory
+    HISTORY_LENGTH = 4  # Number of frames to keep in each state
+    MINIBATCH_SIZE = 32  # Number of samples in a single minibatch, must be < MIN_ER_SIZE
+
+    # Network settings
+    OPTIMISER = 'rmsprop'
+    LEARNING_RATE = 1e-6
+    DECAY = 0.9
+    MOMENTUM = 0.95
+
+    # Learning settings
+    INITIALISATION_STD = 0.1  # Standard deviation used for initialising weights of the conv2d layers
+    TIMESTEPS = 100000  # Number of timesteps used for learning; one action is taken during each step
+    INITIAL_EPS = 1.0  # Initial value for the exploratory parameter epsilon
+    DISCOUNT_RATE = 0.95  # Parameter for the gamma discount rate
+    MIN_ER_SIZE = 3000  # Minimum number of samples in the ER to begin learning
+    TEST_DELTA = 10000  # Number of timesteps between two successive tests of the network
+    NUM_TEST_TRIALS = 10  # Number of trials to conduct during each test session
+    PROB_FLAP = 1 / 4  # Probability of action "flap" (119) when taking a random action during exploration
+
+    # Simulator settings
+    REWARD_ALIVE = 0.1  # Reward granted for each timestep where the player remains alive (except if it passes a pipe)
+
+    # Misc. settings
+    MODEL_FILENAME = 'dqn.h5'
+    SAVE_DELTA = 5000  # Number of timesteps between two successive saves of the network, must be > MIN_ER_SIZE
+
+
+class DebugConfig(Config):
+    TIMESTEPS = 100
+    MIN_ER_SIZE = 10
+    MINIBATCH_SIZE = 5
+    ER_SIZE = 50
+    SAVE_DELTA = 50
+    TEST_DELTA = 25
diff --git a/Cattelle/dqn.h5 b/Cattelle/dqn.h5
new file mode 100644
index 0000000..6e998d6
Binary files /dev/null and b/Cattelle/dqn.h5 differ
diff --git a/Cattelle/retrainer.py b/Cattelle/retrainer.py
new file mode 100644
index 0000000..8b9cbab
--- /dev/null
+++ b/Cattelle/retrainer.py
@@ -0,0 +1,222 @@
+import keras
+import numpy as np
+from ple import PLE
+from ple.games import FlappyBird
+from tqdm import trange
+
+from ExperienceReplay import ExperienceReplay
+from config import Config as config
+from utils import StateHolder
+
+
+class Retrainer:
+    """
+    Retraining class, used to further train an existing Keras model
+    """
+
+    def __init__(self, model_file):
+        """
+        Load the existing model file
+        Args:
+            model_file (str): Path to the model file (h5 file)
+        """
+
+        self.model = keras.models.load_model(model_file)
+        self.er = ExperienceReplay(config.ER_SIZE, config.HISTORY_LENGTH, config.MINIBATCH_SIZE)
+        self.stateholder = StateHolder()
+
+        game = FlappyBird(graphics="fixed")
+        p = PLE(game, fps=30, frame_skip=1, num_steps=1, force_fps=True, display_screen=True)
+        # Note: if you want to see your agent act in real time, set force_fps to False. But don't use this setting for
+        # learning, just for display purposes.
+        p.init()
+
+        self.game = game
+        self.p = p
+
+        self._gen_er_samples()
+
+    def train_network(self):
+        """
+        Train the DQN according to the settings defined in config.py
+        """
+        print(len(self.er.memory))
+        print('Starting training')
+
+        p = self.p
+        screen = p.getScreenRGB()
+
+        # Main training loop, runs for NB_TIMESTEPS iterations
+        for i in trange(config.TIMESTEPS):
+            action = self.eps_greedy(i)
+
+            reward = p.act(action)
+            # Shape reward to include rewardAlive
+            # This is awarded when the agent survives for one timestep (without passing a pipe)
+            if reward == 0.0:
+                reward = config.REWARD_ALIVE
+            # Clip negative rewards so that the reward space remains in [-1,1]
+            if reward < 0:
+                reward = -1.0
+
+            new_screen = p.getScreenRGB()
+            done = p.game_over()
+
+            self.er.append_sample(screen, action, reward, new_screen, done)
+
+            state, a, r, new_state, D = self.er.minibatch()
+
+            state = self._unpack_state(state)
+            new_state = self._unpack_state(new_state)
+
+            Q = self.model.predict(state)  # shape (minibatch_size, 2)
+            new_Q = self.model.predict(new_state)  # shape (minibatch_size, 2)
+
+            # row-wise maximum, shape (minibatch_size, )
+            max_new_Q = new_Q.max(1).reshape((config.MINIBATCH_SIZE,))
+
+            update = r + (1 - D) * (config.DISCOUNT_RATE * max_new_Q)
+
+            # Only overwrite the Q-value of the action actually taken, row by row
+            Q[np.arange(config.MINIBATCH_SIZE), (a // 119).astype(int)] = update
+
+            # Incremental training
+            self.model.train_on_batch(x=state, y=Q)
+
+            if i % config.TEST_DELTA == 0 and i > 0:
+                print('Testing the network...')
+                mean_score, max_score = self.eval_network(config.NUM_TEST_TRIALS)
+                print('Current scores for the network:\n',
+                      f'\tmean -> {mean_score}\n'
+                      f'\tmax -> {max_score}')
+
+            if i % config.SAVE_DELTA == 0 and i > config.MIN_ER_SIZE:
+                print('Saving network...')
+                self._write_network(config.MODEL_FILENAME)
+
+            if done:
+                p.reset_game()
+
+            screen = p.getScreenRGB()
+
+        print('Training done, saving final weights')
+        self._write_network(config.MODEL_FILENAME)
+
+    def eps_greedy(self, step):
+        """
+        Epsilon-greedy explorator. Takes a random action with probability epsilon (kept fixed at 0.1 during
+        retraining), otherwise the greedy action from the current Q-network
+        Args:
+            step (int): Current step number (unused here, epsilon is constant during retraining)
+
+        Returns:
+            action (int): The next action to take
+        """
+        # The epsilon parameter is kept fixed during retraining
+        epsilon = 0.1
+
+        if np.random.rand() <= epsilon:
+            # Take random action, either None (0) or flap (119)
+            action = np.random.choice([0, 119], p=[1 - config.PROB_FLAP, config.PROB_FLAP])
+        else:
+            state = self.er.memory[-1][3]
+            state = self._unpack_state(state).reshape((1, config.HISTORY_LENGTH, 84, 84))
+            # reshape necessary since dqn.predict expects a list of samples (in this case only a single sample)
+            action_array = self.model.predict(state)
+            action = action_array.argmax()
+            action *= 119  # the argmax is either 0 or 1, whereas the correct actions are either 0 or 119
+
+        return action
+
+    def eval_network(self, trials=20):
+        """
+        Evaluate the current performance of the network.
+        Args:
+            trials: Number of trials to perform. One trial is one full game, from initialisation to game over
+
+        Returns:
+            results (tuple): Tuple of (mean score, max score). The mean score is averaged over all trials
+        """
+
+        scores = np.zeros(trials)
+
+        # Create a local copy of the simulator to prevent messing up the training simulator
+        game = FlappyBird(graphics="fixed")
+        p = PLE(game, fps=30, frame_skip=1, num_steps=1, force_fps=True, display_screen=True)
+        # Note: if you want to see your agent act in real time, set force_fps to False. But don't use this setting for
+        # learning, just for display purposes.
+        p.init()
+
+        for i in range(trials):
+            p.reset_game()
+            screen = p.getScreenRGB()
+            holder = StateHolder()
+            holder.append(screen)
+
+            while not p.game_over():
+                action_array = self.model.predict(holder.get_dqn_input())
+                action = action_array.argmax() * 119
+
+                scores[i] += p.act(action)
+                holder.append(p.getScreenRGB())
+
+        return scores.mean(), scores.max()
+
+    def _write_network(self, filename='weights.dqn'):
+        """
+        Save the full model (architecture + weights + state of the optimiser) to the HDF5 archive located at
+        "filename"
+        Args:
+            filename (str): Location of saved model (path)
+        """
+        self.model.save(filename)
+
+    def _unpack_state(self, state):
+        """
+        Unroll the state array (array of deques) along its 1st axis (i.e. the deque axis)
+
+        Args:
+            state (np.ndarray): Array of deques to unpack, shape (n,)
+
+        Returns:
+            unpacked (np.ndarray): Unpacked state, ready to be fed to the DQN, shape (n, history_length, 84, 84)
+        """
+        return np.array([np.array(elt) for elt in state])
+
+    def _gen_er_samples(self):
+        """
+        Use the existing model to generate enough samples to start training (i.e. MIN_ER_SIZE samples according to
+        the config file)
+        """
+        print(f"Generating {config.MIN_ER_SIZE} samples using the existing model")
+
+        self.stateholder.append(self.p.getScreenRGB())
+
+        for i in trange(config.MIN_ER_SIZE):
+            screen = self.p.getScreenRGB()
+
+            action = self.model.predict(self.stateholder.get_dqn_input())
+            action = action.argmax() * 119
+
+            reward = self.p.act(action)
+
+            # Shape reward exactly as we do during training
+            if reward == 0.0:
+                reward = config.REWARD_ALIVE
+            if reward < 0.0:
+                reward = -1.0
+
+            new_screen = self.p.getScreenRGB()
+            done = self.p.game_over()
+
+            # Append to the stateholder
+            self.stateholder.append(new_screen)
+
+            # Append to the ER
+            self.er.append_sample(screen, action, reward, new_screen, done)
+
+            if done:
+                self.p.reset_game()
+
+        print(f'Successfully generated {config.MIN_ER_SIZE} samples')
+
+
+if __name__ == '__main__':
+    retrainer = Retrainer(config.MODEL_FILENAME)
+    retrainer.train_network()
diff --git a/RandomBird/run.py b/Cattelle/run.py
similarity index 97%
rename from RandomBird/run.py
rename to Cattelle/run.py
index 39b5801..9697db9 100644
--- a/RandomBird/run.py
+++ b/Cattelle/run.py
@@ -26,4 +26,4 @@
         cumulated[i] = cumulated[i] + reward

 average_score = np.mean(cumulated)
-max_score = np.max(cumulated)
+max_score = np.max(cumulated)
\ No newline at end of file
diff --git a/Cattelle/run_modified.py b/Cattelle/run_modified.py
new file mode 100644
index 0000000..7d5b61f
--- /dev/null
+++ b/Cattelle/run_modified.py
@@ -0,0 +1,42 @@
+import numpy as np
+from ple import PLE
+# You're not allowed to change this file
+from ple.games.flappybird import FlappyBird
+
+from Cattelle.FlappyAgent import FlappyPolicy
+
+game = FlappyBird(
+    graphics="fixed")  # use "fancy" for full background, random bird color and random pipe color, use "fixed" (
+# default) for black background and constant bird and pipe colors.
+p = PLE(game, fps=30, frame_skip=1, num_steps=1, force_fps=True, display_screen=True)
+# Note: if you want to see your agent act in real time, set force_fps to False. But don't use this setting for
+# learning, just for display purposes.
+
+p.init()
+reward = 0.0
+
+nb_games = 100
+cumulated = np.zeros(nb_games)
+
+try:
+    for i in range(nb_games):
+        p.reset_game()
+
+        while not p.game_over():
+            state = game.getGameState()
+            screen = p.getScreenRGB()
+            action = FlappyPolicy(state, screen)  # Your job is to define this function.
+
+            reward = p.act(action)
+            cumulated[i] = cumulated[i] + reward
+
+except Exception as e:
+    # No matter why the program stopped, we still want the top and average scores for debugging purposes
+    # print(f'Received exception "{e}"')
+    raise e
+
+average_score = np.mean(cumulated)
+max_score = np.max(cumulated)
+
+print("Average score:", average_score)
+print("Max score:", max_score)
diff --git a/Cattelle/train.py b/Cattelle/train.py
new file mode 100644
index 0000000..19a63d7
--- /dev/null
+++ b/Cattelle/train.py
@@ -0,0 +1,220 @@
+# Training file
+# Initialise and train the DQN
+import numpy as np
+from keras import optimizers, initializers
+from keras.layers import Dense, Conv2D, Flatten
+from keras.models import Sequential
+from ple import PLE
+from ple.games.flappybird import FlappyBird
+from tqdm import trange
+
+from ExperienceReplay import ExperienceReplay
+from config import Config as config
+from utils import StateHolder
+
+
+class Trainer:
+
+    def __init__(self):
+        """
+        Initialise DQN architecture and weights, experience replay memory and the simulation environment
+        """
+
+        dqn = Sequential()
+        initialiser = initializers.TruncatedNormal(mean=0, stddev=config.INITIALISATION_STD)
+        # 1st layer
+        dqn.add(Conv2D(filters=32, kernel_size=(8, 8), strides=4, activation="relu",
+                       input_shape=(config.HISTORY_LENGTH, 84, 84), data_format="channels_first",
+                       kernel_initializer=initialiser))
+        # Input shape is (no_samples, history_length, 84, 84), thus channels first
+        # 2nd layer
+        dqn.add(Conv2D(filters=64, kernel_size=(4, 4), strides=2, activation="relu", kernel_initializer=initialiser))
+        # 3rd layer
+        dqn.add(Conv2D(filters=64, kernel_size=(3, 3), strides=1, activation="relu", kernel_initializer=initialiser))
+        dqn.add(Flatten())
+        # 4th layer
+        dqn.add(Dense(units=512, activation="relu"))
+        # output layer
+        dqn.add(Dense(units=2, activation="linear"))
+
+        optimizer = optimizers.RMSprop(lr=config.LEARNING_RATE, decay=config.DECAY)
+        dqn.compile(optimizer=optimizer, loss="mean_squared_error")
+
+        self.dqn = dqn
+
+        self.er = ExperienceReplay(config.ER_SIZE, config.HISTORY_LENGTH, config.MINIBATCH_SIZE)
+
+        game = FlappyBird(graphics="fixed")
+        p = PLE(game, fps=30, frame_skip=1, num_steps=1, force_fps=True, display_screen=True)
+        # Note: if you want to see your agent act in real time, set force_fps to False. But don't use this setting for
+        # learning, just for display purposes.
+        p.init()
+
+        self.game = game
+        self.p = p
+
+    def train_network(self):
+        """
+        Train the DQN according to the settings defined in config.py
+        """
+        p = self.p
+        screen = p.getScreenRGB()
+        # IDEA: Implement target network to stabilise learning
+        # IDEA: Implement frame skipping if needed (equivalent to training every n loops and repeating the same
+        # action n times?)
+
+        # Main training loop, runs for NB_TIMESTEPS iterations
+        for i in trange(config.TIMESTEPS):
+            action = self.eps_greedy(i)
+
+            reward = p.act(action)
+            # Shape reward to include rewardAlive
+            # This is awarded when the agent survives for one timestep (without passing a pipe)
+            if reward == 0.0:
+                reward = config.REWARD_ALIVE
+            # Clip negative rewards so that the reward space remains in [-1,1]
+            if reward < 0:
+                reward = -1.0
+
+            new_screen = p.getScreenRGB()
+            done = p.game_over()
+
+            self.er.append_sample(screen, action, reward, new_screen, done)
+
+            if i >= config.MIN_ER_SIZE:
+                if i == config.MIN_ER_SIZE:
+                    print('Minimum size for ER memory reached, learning process started')
+
+                state, a, r, new_state, D = self.er.minibatch()
+
+                state = self._unpack_state(state)
+                new_state = self._unpack_state(new_state)
+
+                Q = self.dqn.predict(state)  # shape (minibatch_size, 2)
+                new_Q = self.dqn.predict(new_state)  # shape (minibatch_size, 2)
+
+                # row-wise maximum, shape (minibatch_size, )
+                max_new_Q = new_Q.max(1).reshape((config.MINIBATCH_SIZE,))
+
+                update = r + (1 - D) * (config.DISCOUNT_RATE * max_new_Q)
+
+                # Only overwrite the Q-value of the action actually taken, row by row
+                Q[np.arange(config.MINIBATCH_SIZE), (a // 119).astype(int)] = update
+
+                # Incremental training
+                self.dqn.train_on_batch(x=state, y=Q)
+
+                if i % config.TEST_DELTA == 0 and i > 0:
+                    print('Testing the network...')
+                    mean_score, max_score = self.eval_network(config.NUM_TEST_TRIALS)
+                    print('Current scores for the network:\n',
+                          f'\tmean -> {mean_score}\n'
+                          f'\tmax -> {max_score}')
+
+                if i % config.SAVE_DELTA == 0 and i > config.MIN_ER_SIZE:
+                    print('Saving network...')
+                    self._write_network(config.MODEL_FILENAME)
+
+            if done:
+                p.reset_game()
+
+            screen = p.getScreenRGB()
+
+        print('Training done, saving final weights')
+        self._write_network(config.MODEL_FILENAME)
+
+    def eps_greedy(self, step):
+        """
+        Epsilon-greedy explorator (GLIE). Takes a random action with probability epsilon (linearly decreasing
+        from 1.0 to 0.1 over all timesteps), otherwise the greedy action from the current Q-network
+        Args:
+            step (int): Current step number
+
+        Returns:
+            action (int): The next action to take
+        """
+        # The epsilon parameter decreases linearly over all timesteps
+        epsilon = 1.0 - (0.90 / config.TIMESTEPS) * step
+
+        if np.random.rand() <= epsilon:
+            # Take random action, either None (0) or flap (119)
+            action = np.random.choice([0, 119], p=[1 - config.PROB_FLAP, config.PROB_FLAP])
+        else:
+            state = self.er.memory[-1][3]
+            state = self._unpack_state(state).reshape((1, config.HISTORY_LENGTH, 84, 84))
+            # reshape necessary since dqn.predict expects a list of samples (in this case only a single sample)
+            action_array = self.dqn.predict(state)
+            action = action_array.argmax()
+            action *= 119  # the argmax is either 0 or 1, whereas the correct actions are either 0 or 119
+
+        return action
+
+    # def test_er(self):
+    #     er = self.er
+    #
+    #     for i in trange(10):
+    #         screen = self.p.getScreenRGB()
+    #         action = 0
+    #         reward = self.p.act(action)
+    #         new_screen = self.p.getScreenRGB()
+    #         er.append_sample(screen, action, reward, new_screen, self.p.game_over())
+
+    def eval_network(self, trials=20):
+        """
+        Evaluate the current performance of the network.
+        Args:
+            trials: Number of trials to perform. One trial is one full game, from initialisation to game over
+
+        Returns:
+            results (tuple): Tuple of (mean score, max score). The mean score is averaged over all trials
+        """
+
+        scores = np.zeros(trials)
+
+        # Create a local copy of the simulator to prevent messing up the training simulator
+        game = FlappyBird(graphics="fixed")
+        p = PLE(game, fps=30, frame_skip=1, num_steps=1, force_fps=True, display_screen=True)
+        # Note: if you want to see your agent act in real time, set force_fps to False. But don't use this setting for
+        # learning, just for display purposes.
+        p.init()
+
+        for i in range(trials):
+            p.reset_game()
+            screen = p.getScreenRGB()
+            holder = StateHolder()
+            holder.append(screen)
+
+            while not p.game_over():
+                action_array = self.dqn.predict(holder.get_dqn_input())
+                action = action_array.argmax() * 119
+
+                scores[i] += p.act(action)
+                holder.append(p.getScreenRGB())
+
+        return scores.mean(), scores.max()
+
+    def _write_network(self, filename='weights.dqn'):
+        """
+        Save the full model (architecture + weights + state of the optimiser) to the HDF5 archive located at
+        "filename"
+        Args:
+            filename (str): Location of saved model (path)
+        """
+        self.dqn.save(filename)
+
+    def _unpack_state(self, state):
+        """
+        Unroll the state array (array of deques) along its 1st axis (i.e. the deque axis)
+
+        Args:
+            state (np.ndarray): Array of deques to unpack, shape (n,)
+
+        Returns:
+            unpacked (np.ndarray): Unpacked state, ready to be fed to the DQN, shape (n, history_length, 84, 84)
+        """
+        return np.array([np.array(elt) for elt in state])
+
+
+if __name__ == '__main__':
+    trainer = Trainer()
+    # trainer.test_er()
+    trainer.train_network()
diff --git a/Cattelle/utils.py b/Cattelle/utils.py
new file mode 100644
index 0000000..da6b1a6
--- /dev/null
+++ b/Cattelle/utils.py
@@ -0,0 +1,70 @@
+from collections import deque
+
+import numpy as np
+from PIL import Image
+
+from config import Config as config
+
+
+def process_screen(screen):
+    """
+    Process the screen state to a simpler version that can be fed to the DQN predictor
+
+    The processing is as follows:
+    1. Convert to grayscale
+    2. Crop to 405x288
+    3. Downscale to 84x84
+    4. Normalise pixel values from [0,255] to [0,1]
+    Args:
+        screen (np.ndarray): An RGB matrix
+
+    Returns:
+        im (np.ndarray): Processed screen
+    """
+
+    # Indexing convention varies between PIL and numpy
+    screen = np.swapaxes(screen, 0, 1)
+    # Load the array in PIL
+    im = Image.fromarray(screen, 'RGB')
+    # Convert to grayscale
+    im = im.convert(mode='L')
+    # Crop
+    im = im.crop((0, 0, 288, 405))
+    # Downscale and resize
+    im = im.resize((84, 84))
+    # Normalise
+    im = np.array(im) / 255
+
+    return im
+
+
+class StateHolder:
+    """
+    A simple class designed to keep track of the previous frames in order to build a valid input for the DQN
+    (input shape of (no_samples, history_length, 84, 84))
+    """
+    state = deque(maxlen=config.HISTORY_LENGTH)
+
+    def append(self, screen):
+        """
+        Append a new frame to the holder. Handles the initial insertion case gracefully.
+        Args:
+            screen (np.ndarray): The current frame
+        """
+        if len(self.state) == 0:
+            # Initial insertion
+            # No need to handle terminal cases as we don't restart from a game over, we just start a whole new game
+            self.state = deque([process_screen(screen)] * config.HISTORY_LENGTH, maxlen=config.HISTORY_LENGTH)
+
+        else:
+            self.state.append(process_screen(screen))
+
+    def get_dqn_input(self):
+        """
+        Return a numpy array ready to be fed to the DQN
+        Returns:
+            input_arr (np.ndarray): Input array for the DQN
+        """
+        state_arr = np.array([elt for elt in self.state])
+
+        return state_arr.reshape((1, config.HISTORY_LENGTH, 84, 84))
diff --git a/README.md b/README.md
index ce4894f..6359eda 100644
--- a/README.md
+++ b/README.md
@@ -1,49 +1,61 @@
 # RL challenge
-Your challenge is to learn to play [Flappy Bird](https://en.wikipedia.org/wiki/Flappy_Bird)!
+Author: Thomas Cattelle
-Flappybird is a side-scrolling game where the agent must successfully nagivate through gaps between pipes. Only two actions in this game: at each time step, either you click and the bird flaps, or you don't click and gravity plays its role.
+# Introduction
-There are three levels of difficulty in this challenge:
-- Learn an optimal policy with hand-crafted features
-- Learn an optimal policy with raw variables
-- Learn an optimal policy from pixels.
+This document describes the learning techniques used to solve the problem, as well as possible future improvements.
-# Your job
+# Model configuration
+Most of the model's hyperparameters can be freely changed in `config.py`.
-Your job is to:
-    
-  1. fork the project at [https://github.com/SupaeroDataScience/RLchallenge](https://github.com/SupaeroDataScience/RLchallenge) on your own github (yes, you'll need one).
-  2. rename the "RandomBird" folder into "YourLastName".
-  3. modify 'FlappyPolicy.py' in order to implement the function `FlappyPolicy(state,screen)` used below. You're free to add as many extra files as you need. However, you're not allowed to change 'run.py'.
-  4. you are encouraged, however, to copy-paste the contents of 'run.py' as a basis for your learning algorithm.
-  5. add any useful material (comments, text files, analysis, etc.)
-  6. make a pull request on the original repository when you're done (please don't make a pull request before you think your work is ready to be merged on the original repository).
-
+# Usage
+* To train without a pre-existing model: `python train.py`
+* To continue training from an existing model: `python retrainer.py` (the existing model being the file pointed to by `config.MODEL_FILENAME`)
+* To test the model: `python run.py`
-**All the files you create must be placed inside the directory "YourLastName".**
+# Learning
-`FlappyPolicy(state,screen)` takes both the game state and the screen as input. It gives you the choice of what you base your policy on:
-
+The game is learned **from the screen pixels only** (the `screen` variable). The state vector is never used, neither during training nor during testing.
-Recall that the evaluation will start by running 'run.py' on our side, so 'FlappyPolicy' should call an already trained policy, otherwise we will be evaluating your agent during learning, which is not the goal. Of course, we will check your learning code and we will greatly appreciate insightful comments and additional material like (documentation, discussion, comparisons, perspectives, state-of-the-art...).
+## DQN
+The model is trained with a DQN with the following structure:
-# Installation
+### Layers
+1. 2D convolution, 32 filters, kernel (8,8), strides 4, followed by a ReLU activation
+2. 2D convolution, 64 filters, kernel (4,4), strides 2, followed by a ReLU activation
+3. 2D convolution, 64 filters, kernel (3,3), strides 1, followed by a ReLU activation
+4. Fully connected layer of 512 neurons, followed by a ReLU activation
+5. Fully connected output layer of 2 neurons (2 actions), with a linear activation
-You will need to install a few things to get started.
-First, you will need PyGame.
+### Initialisation
+The three convolutional layers are initialised by drawing from a normal distribution with mean 0 and standard deviation 0.1. Draws that fall more than two standard deviations away from the mean are redrawn (`keras.initializers.TruncatedNormal`)
-```
-pip install pygame
-```
+### Optimiser
+RMSProp with an initial learning rate of 1e-6 and a decay of 0.9.
-And you will need [PLE (PyGame Learning Environment)](https://github.com/ntasfi/PyGame-Learning-Environment) which is already present in this repository (the above link is only given for your information). To install it:
-```
-cd PyGame-Learning-Environment/
-pip install -e .
-```
-Note that this version of FlappyBird in PLE has been slightly changed to make the challenge a bit easier: the background is turned to plain black, the bird and pipe colors are constant (red and green respectively).
+### Preprocessing
+The screen (288*512 by default) is converted to grayscale, cropped from (0,0) to (405,288) (in order to remove the static ground textures), then downscaled and resized to (84,84). This preprocessing is done with the Pillow library.
+
+### Input
+The DQN is fed an input of shape (`batch_size`, `history_length`, 84, 84), where `batch_size` is the size of a training minibatch (32 by default) and `history_length` is the number of frames kept in one state (4 by default).
+
+### Training parameters
+By default, the model generates samples for the Experience Replay memory for 3000 iterations, then learns for 597000 iterations (600000 iterations in total).
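For illustration, here is a minimal numpy sketch of the minibatch Q-target construction performed during training (the shapes and values below are toy examples, not taken from the repository; actions are indexed 0 = no-op, 1 = flap):

```python
import numpy as np

gamma = 0.95                               # discount rate, as in config.DISCOUNT_RATE
r = np.array([0.1, 1.0, -1.0])             # shaped rewards for a toy minibatch of 3 samples
done = np.array([0, 0, 1])                 # 1 if the sample ends the game
a_idx = np.array([0, 1, 0])                # index of the action taken (action // 119)

Q = np.array([[0.2, 0.1],                  # dqn.predict(state)
              [0.0, 0.4],
              [0.3, 0.3]])
new_Q = np.array([[0.5, 0.2],              # dqn.predict(new_state)
                  [0.1, 0.6],
                  [0.0, 0.0]])

# Bellman target: r + gamma * max_a' Q(s', a'), with no bootstrap on terminal samples
target = r + (1 - done) * gamma * new_Q.max(axis=1)

# Only the entry of the action actually taken is overwritten, row by row;
# the other action keeps its current prediction as its regression target
Q[np.arange(len(a_idx)), a_idx] = target

# Q is then used as the target for dqn.train_on_batch(x=state, y=Q)
```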
+### Reward shaping
+Rewards are reshaped in the following way:
+* -1.0 **exactly** if the bird dies (the default death reward is not a fixed value in the base simulator)
+* +0.1 if the bird survives for one frame
+* +1.0 if the bird passes a pipe
+
+# Results
+The results are not very encouraging at the moment and the model still **struggles to generalise** towards the optimal policy (which generally consists of staying near the bottom of the screen and flapping just before a pipe). The model nevertheless shows genuine learning, in that it often tries to "aim" at the gap between the pipes.
+
+In particular, the bird repeatedly crashes just after passing a pipe. One possible solution would be to implement **frame skipping**, which consists of learning and taking a decision with the DQN every `n` frames rather than at every frame. This lead could not be explored in time because of the model's long training time.
+
+Finally, some aberrant trajectories followed by the bird (e.g. climbing straight to the top of the screen at the very beginning of a game) suggest the existence of harmful feedback loops in the model during training. A possible solution would then be to **split the learning model into two DQNs** (a sketch of this idea is given at the very end of this document):
+* A primary DQN that would learn at every iteration, as is currently the case
+* A secondary DQN, updated with the weights of the primary DQN every `n` iterations, which is in charge of the action predictions. This secondary DQN is then the one used at test time.
+
+This two-headed architecture could possibly stabilise learning.
\ No newline at end of file
diff --git a/RandomBird/FlappyAgent.py b/RandomBird/FlappyAgent.py
deleted file mode 100644
index 9f3ec84..0000000
--- a/RandomBird/FlappyAgent.py
+++ /dev/null
@@ -1,9 +0,0 @@
-import numpy as np
-
-def FlappyPolicy(state, screen):
-    action=None
-    if(np.random.randint(0,2)<1):
-        action=119
-    return action
-
-
diff --git a/environment.yml b/environment.yml
new file mode 100644
index 0000000..373320a
--- /dev/null
+++ b/environment.yml
@@ -0,0 +1,47 @@
+name: challengerl
+channels:
+  - http://conda.anaconda.org/gurobi
+  - defaults
+dependencies:
+  - backports=1.0=py36h81696a8_1
+  - backports.weakref=1.0rc1=py36_0
+  - bleach=1.5.0=py36_0
+  - certifi=2018.1.18=py36_0
+  - html5lib=0.9999999=py36_0
+  - icc_rt=2017.0.4=h97af966_0
+  - intel-openmp=2018.0.0=hd92c6cd_8
+  - libprotobuf=3.5.1=he0781b1_0
+  - markdown=2.6.11=py36_0
+  - mkl=2018.0.1=h2108138_4
+  - numpy=1.14.0=py36h4a99626_1
+  - pip=9.0.1=py36h226ae91_4
+  - protobuf=3.5.1=py36h6538335_0
+  - python=3.6.4=h6538335_1
+  - setuptools=38.4.0=py36_0
+  - six=1.11.0=py36h4db2310_1
+  - vc=14=h0510ff6_3
+  - vs2015_runtime=14.0.25420=0
+  - werkzeug=0.14.1=py36_0
+  - wheel=0.30.0=py36h6c3ec14_1
+  - wincertstore=0.2=py36h7fe50ca_0
+  - zlib=1.2.11=h8395fce_2
+  - pip:
+    - absl-py==0.1.10
+    - astor==0.6.2
+    - enum34==1.1.6
+    - gast==0.2.0
+    - grpcio==1.10.0
+    - keras==2.1.4
+    - olefile==0.45.1
+    - pillow==5.0.0
+    - ple==0.0.1
+    - pygame==1.9.3
+    - pyyaml==3.12
+    - scipy==1.0.0
+    - tensorboard==1.6.0
+    - tensorflow==1.5.0
+    - tensorflow-tensorboard==1.5.1
+    - termcolor==1.1.0
+    - tqdm==4.19.6
+prefix: C:\Program Files\Miniconda3\envs\challengerl
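As an illustration of the dual-DQN (target network) idea discussed at the end of the README above, here is a minimal Keras sketch. The synchronisation interval `TARGET_SYNC_DELTA`, the helper `dqn_update` and its argument shapes are hypothetical and not part of the repository; only the `dqn.h5` filename and the 0.95 discount rate come from the project configuration.

```python
import numpy as np
from keras import models

TARGET_SYNC_DELTA = 1000   # hypothetical: steps between two target-network synchronisations
GAMMA = 0.95               # discount rate, as in config.DISCOUNT_RATE

# Online network (trained at every step) and frozen target network (used for bootstrapping)
online_dqn = models.load_model('dqn.h5')
target_dqn = models.clone_model(online_dqn)
target_dqn.set_weights(online_dqn.get_weights())


def dqn_update(step, state, a_idx, r, done, new_state):
    """One Q-learning step where the bootstrap values come from the frozen target network."""
    Q = online_dqn.predict(state)
    new_Q = target_dqn.predict(new_state)      # stable targets: this estimator does not move every step

    target = r + (1 - done) * GAMMA * new_Q.max(axis=1)
    Q[np.arange(len(a_idx)), a_idx] = target

    online_dqn.train_on_batch(x=state, y=Q)

    # Periodically copy the online weights into the target network
    if step % TARGET_SYNC_DELTA == 0:
        target_dqn.set_weights(online_dqn.get_weights())
```

Keeping the bootstrap estimator frozen between synchronisations breaks the feedback loop the README suspects, at the price of slightly stale value estimates.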