12 changes: 12 additions & 0 deletions Meliani/FlappyAgent.py
@@ -0,0 +1,12 @@
import numpy as np
from keras.models import load_model

# Load the trained Q-network once, at import time.
model = load_model("best_model.dqf")

def FlappyPolicy(state, screen):
    # Predict the Q-values of the two actions from the state vector (the screen is unused).
    q = model.predict(np.array(list(state.values())).reshape(1, len(state)))
    # Greedy policy: np.argmax(q) is 0 (do nothing) or 1 (flap);
    # multiplying by 119 maps it to the key code PLE expects for flapping.
    return np.argmax(q) * 119


Binary file added Meliani/best_model.dqf
Binary file not shown.
95 changes: 95 additions & 0 deletions Meliani/q_learn_state.py
@@ -0,0 +1,95 @@
import random
import time

import numpy as np
from ple.games.flappybird import FlappyBird
from ple import PLE
from keras import optimizers
from keras.models import Sequential
from keras.layers.core import Dense, Activation

# Prefix for the intermediate model checkpoints saved during training.
file_path = "test_part_"

# Q-network: a single 512-unit hidden layer mapping the 8 state variables
# to one Q-value per action (do nothing / flap).
model = Sequential()
model.add(Dense(512, init='lecun_uniform', input_shape=(8,)))
model.add(Activation('relu'))
model.add(Dense(2, init='lecun_uniform'))
model.add(Activation('linear'))
model.compile(loss='mse', optimizer=optimizers.Adam(lr=1e-4))

gamma = 0.99      # discount factor
epsilon = 1       # initial exploration rate for the epsilon-greedy strategy
batchSize = 256   # mini-batch size

jeu = FlappyBird()
p = PLE(jeu, fps=30, frame_skip=1, num_steps=1, force_fps=True, display_screen=True)
p.init()

i = 0  # episode counter

while True:
    p.reset_game()
    state = jeu.getGameState()
    state = np.array(list(state.values()))

    while not jeu.game_over():
        # Q(s, .) predicted by the current network
        qval = model.predict(state.reshape(1, len(state)), batch_size=batchSize)

        # epsilon-greedy exploration / exploitation strategy
        if random.random() < epsilon:
            action = np.random.randint(0, 2)
        else:
            # choose the best action from the Q(s, a) values
            action = np.argmax(qval[0])

        # Take the action (119 = flap, 0 = do nothing), observe the reward and the new state s'
        reward = p.act(119 * action)
        # Reward shaping: keep +1 for passing a pipe, but amplify the death penalty
        if reward == -5:
            reward = -500
        new_state = jeu.getGameState()
        new_state = np.array(list(new_state.values()))

        # Q-learning target: r + gamma * max_a Q(s', a), or just r on a terminal state
        newQ = model.predict(new_state.reshape(1, len(state)), batch_size=batchSize)
        maxQ = np.max(newQ)
        y = np.zeros((1, 2))
        y[:] = qval[:]
        if reward != -500:  # non-terminal state
            update = reward + gamma * maxQ
        else:               # terminal state (the bird died)
            update = reward
        y[0][action] = update

        print("Game #: %s" % (i,))
        model.fit(state.reshape(1, len(state)), y, batch_size=batchSize, nb_epoch=2, verbose=0)
        state = new_state

    # Decay the exploration rate down to 0.1
    if epsilon > 0.1:
        epsilon -= 1.0 / 10000

    # Save intermediate models regularly
    if i == 100:
        model.save(file_path + "0.dqf")
    if i % 1000 == 0 and i != 0:
        model.save(file_path + str(i // 1000) + ".dqf")
        time.sleep(60)
    if i == 100000:
        break

    i = i + 1

model.save(file_path + "final.dqf")
2 changes: 1 addition & 1 deletion RandomBird/run.py → Meliani/run.py
@@ -4,7 +4,7 @@
import numpy as np
from FlappyAgent import FlappyPolicy

game = FlappyBird(graphics="fixed") # use "fancy" for full background, random bird color and random pipe color, use "fixed" (default) for black background and constant bird and pipe colors.
game = FlappyBird() # use "fancy" for full background, random bird color and random pipe color, use "fixed" (default) for black background and constant bird and pipe colors.
p = PLE(game, fps=30, frame_skip=1, num_steps=1, force_fps=False, display_screen=True)
# Note: if you want to see you agent act in real time, set force_fps to False. But don't use this setting for learning, just for display purposes.

61 changes: 17 additions & 44 deletions README.md
@@ -1,49 +1,22 @@
# RL challenge

Your challenge is to learn to play [Flappy Bird](https://en.wikipedia.org/wiki/Flappy_Bird)!
My challenge is to learn to play [Flappy Bird](https://en.wikipedia.org/wiki/Flappy_Bird)!

Flappy Bird is a side-scrolling game where the agent must successfully navigate through gaps between pipes. There are only two actions in this game: at each time step, either you click and the bird flaps, or you don't click and gravity plays its role.

There are three levels of difficulty in this challenge:
- Learn an optimal policy with hand-crafted features
- Learn an optimal policy with raw variables
- Learn an optimal policy from pixels.

# Your job

Your job is to:
<ol>
<li> fork the project at [https://github.com/SupaeroDataScience/RLchallenge](https://github.com/SupaeroDataScience/RLchallenge) on your own github (yes, you'll need one).
<li> rename the "RandomBird" folder into "YourLastName".
<li> modify 'FlappyAgent.py' in order to implement the function `FlappyPolicy(state,screen)` used below. You're free to add as many extra files as you need. However, you're not allowed to change 'run.py'.
<li> you are encouraged, however, to copy-paste the contents of 'run.py' as a basis for your learning algorithm.
<li> add any useful material (comments, text files, analysis, etc.)
<li> make a pull request on the original repository <i>when you're done</i> (please don't make a pull request before you think your work is ready to be merged on the original repository).
</ol>

**All the files you create must be placed inside the directory "YourLastName".**

`FlappyPolicy(state,screen)` takes both the game state and the screen as input. It gives you the choice of what you base your policy on:
<ul>
<li> If you use the state variables vector and perform some handcrafted feature engineering, you're playing in the "easy" league. If your agent reaches an average score of 15, you're sure to have a grade of at least 10/20 (possibly more if you implement smart stuff and/or provide a smart discussion).
<li> If you use the state variables vector without altering it (no feature engineering), you're playing in the "good job" league. If your agent reaches an average score of 15, you're sure to have at least 15/20 (possibly more if you implement smart stuff and/or provide a smart discussion).
<li> If your agent uses only the raw pixels from the image, you're playing in the "Deepmind" league. If your agent reaches an average score of 15, you're sure to have the maximum grade (plus possible additional benefits).
</ul>

Recall that the evaluation will start by running 'run.py' on our side, so 'FlappyPolicy' should call an already trained policy, otherwise we will be evaluating your agent during learning, which is not the goal. Of course, we will check your learning code and we will greatly appreciate insightful comments and additional material (documentation, discussion, comparisons, perspectives, state of the art...).
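
For reference, the evaluation loop in 'run.py' looks roughly like the sketch below; the exact variable names (`nb_games`, `cumulated`) and the number of games are illustrative, not taken from the actual file.

```python
from ple.games.flappybird import FlappyBird
from ple import PLE
import numpy as np
from FlappyAgent import FlappyPolicy

game = FlappyBird()
p = PLE(game, fps=30, frame_skip=1, num_steps=1, force_fps=False, display_screen=True)
p.init()

nb_games = 100                   # illustrative number of evaluation games
cumulated = np.zeros(nb_games)   # cumulated reward per game

for i in range(nb_games):
    p.reset_game()
    while not p.game_over():
        state = game.getGameState()
        screen = p.getScreenRGB()
        action = FlappyPolicy(state, screen)   # already trained policy, no learning here
        cumulated[i] += p.act(action)

print("average score:", np.mean(cumulated))
```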

# Installation

You will need to install a few things to get started.
First, you will need PyGame.

```
pip install pygame
```

And you will need [PLE (PyGame Learning Environment)](https://github.com/ntasfi/PyGame-Learning-Environment) which is already present in this repository (the above link is only given for your information). To install it:
```
cd PyGame-Learning-Environment/
pip install -e .
```
Note that this version of FlappyBird in PLE has been slightly changed to make the challenge a bit easier: the background is turned to plain black, the bird and pipe colors are constant (red and green respectively).
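
As a quick sanity check (this snippet is not part of the challenge files, just a suggestion), you can create the environment and print its action set; if everything is installed correctly, a black-background game window should appear:

```python
from ple.games.flappybird import FlappyBird
from ple import PLE

p = PLE(FlappyBird(), fps=30, display_screen=True)
p.init()
print(p.getActionSet())  # the two available actions: 119 (flap) and None (do nothing)
```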


# DEEP Q-LEARNING

For this project, I decided to use deep Q-learning, implemented with the Keras and TensorFlow libraries. I worked on the state vector and trained a neural network to approximate the Q-function.
The network has only one hidden layer, but a single hidden layer is already, in theory, enough to approximate any reasonable function (see the sketch below).

A lot of the project time was spent choosing the right optimizer, loss function and activation functions; these choices turned out to be very important for building a good network.
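
For reference, a minimal sketch of the architecture and training configuration, mirroring what `q_learn_state.py` does (one 512-unit hidden layer, MSE loss, Adam optimizer with a 1e-4 learning rate):

```python
from keras import optimizers
from keras.models import Sequential
from keras.layers.core import Dense, Activation

# 8 state variables in, one Q-value per action (do nothing / flap) out.
model = Sequential()
model.add(Dense(512, init='lecun_uniform', input_shape=(8,)))
model.add(Activation('relu'))
model.add(Dense(2, init='lecun_uniform'))
model.add(Activation('linear'))
model.compile(loss='mse', optimizer=optimizers.Adam(lr=1e-4))
```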

Moreover, the reward shaping, although it sounded counter-intuitive at first, played a very big role in the quality of the learning: a very large gap between the two rewards leads to much better results.
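
Concretely, the shaping used in `q_learn_state.py` keeps the +1 reward for passing a pipe and replaces the default -5 death penalty with a much larger one:

```python
reward = p.act(119 * action)   # +1 when a pipe is passed, -5 when the bird dies
if reward == -5:
    reward = -500              # amplified death penalty used for learning
```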

# RESULTS

The model I kept is a simple one-hidden-layer network, with no experience replay. It regularly reaches an average score of 30 and can peak at a score of about 190.


9 changes: 0 additions & 9 deletions RandomBird/FlappyAgent.py

This file was deleted.