docs/chapters/06/01.md: 16 additions & 7 deletions
@@ -7,12 +7,14 @@
 </div>
 <div class="meta-item">
 <i class="fas fa-file-alt"></i>
- 1444 words
+ 1475 words
 </div>
 </div>
 
 
- The section provides a succinct reminder of several concepts in reinforcement learning (RL). It also disambiguates various often conflated terms such as rewards, values and utilities. The section ends with a discussion around distinguishing the concept of objectives that a reinforcement learning system might pursue from what it is being rewarded for. Readers who are already familiar with the basics can skip directly to section 1.2.
+ !!! warning "This is meant as a recap. If you are already familiar with the basics you can skip directly to the next section."
+
+ The section provides a succinct reminder of several concepts in reinforcement learning (RL). It also disambiguates various often conflated terms such as rewards, values and utilities. The section ends with a discussion around distinguishing the concept of objectives that a reinforcement learning system might pursue from what it is being rewarded for.
 
 ## 6.1.1 Primer {: #01 }
 
@@ -24,13 +26,20 @@ The section provides a succinct reminder of several concepts in reinforcement le
 <figcaption markdown="1"><b>Video 6.1:</b> Optional video showcasing robotic hand trained using reinforcement learning.</figcaption>
+ </figure>
+
+
 Some examples of real-world applications of RL include:
 
- -**Robotic systems**: RL has been applied to tasks such as controlling physical robots in real-time, and enabling them to learn more complicated movements (OpenAI 2018 “[Learning Dexterity](https://www.youtube.com/watch?v=jwSbzNHGflM)”). RL can enable robotic systems to learn complex tasks and adapt to changing environments.
+ -**Robotic systems**: RL has been applied to tasks such as controlling physical robots in real-time, and enabling them to learn more complicated movements. RL can enable robotic systems to learn complex tasks and adapt to changing environments.
 
 -**Recommender Systems**: RL can be applied to recommender systems, which interact with billions of users and aim to provide personalized recommendations. RL algorithms can learn to optimize the recommendation policy based on user feedback and improve the overall user experience.
 
- -**Game playing systems: **In the early 2010s RL-based systems started to beat humans at a few very simple Atari games, like Pong and Breakout. Over the years, there have been many models that have utilized RL to defeat world masters in both board and video games. These include models like [AlphaGo](https://www.deepmind.com/research/highlighted-research/alphago) (2016), [AlphaZero](https://www.deepmind.com/blog/alphazero-shedding-new-light-on-chess-shogi-and-go) (2018), [OpenAI Five](https://openai.com/research/openai-five-defeats-dota-2-world-champions) (2019), [AlphaStar](https://www.deepmind.com/blog/alphastar-mastering-the-real-time-strategy-game-starcraft-ii) (2019), [MuZero](https://www.deepmind.com/blog/muzero-mastering-go-chess-shogi-and-atari-without-rules) (2020) and [EfficientZero](https://github.com/YeWR/EfficientZero) (2021).
+ -**Game playing systems: **In the early 2010s RL-based systems started to beat humans at a few very simple Atari games, like Pong and Breakout. Over the years, there have been many models that have utilized RL to defeat world masters in both board and video games. These include models like AlphaGo ([DeepMind, 2016](https://www.deepmind.com/research/highlighted-research/alphago)), AlphaZero ([DeepMind, 2018](https://www.deepmind.com/blog/alphazero-shedding-new-light-on-chess-shogi-and-go)), OpenAI Five ([OpenAI, 2019](https://openai.com/research/openai-five-defeats-dota-2-world-champions)), AlphaStar ([DeepMind, 2019](https://www.deepmind.com/blog/alphastar-mastering-the-real-time-strategy-game-starcraft-ii)), MuZero ([DeepMind, 2020](https://www.deepmind.com/blog/muzero-mastering-go-chess-shogi-and-atari-without-rules)) and EfficientZero ([Ye et al., 2021](https://arxiv.org/abs/2111.00210)).
 
 RL is different from supervised learning as it begins with a high-level description of "what" to do but allows the agent to experiment and learn from experience the best "how". In RL, the agent learns through interaction with an environment and receives feedback in the form of rewards or punishments based on its actions. RL is focused on learning a set of rules that recommend the best action to take in a given state to maximize long-term rewards. In contrast, supervised learning typically involves learning from explicitly provided labels or correct answers for each input.
 
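The contrast drawn in the paragraph above can be sketched in a few lines of code: a supervised learner is handed the correct answer for every input, while an RL agent only receives a scalar reward for the action it actually tried and must experiment to find the best one. The two-action toy problem below is a hypothetical illustration, not something taken from the chapter.

```python
import random

random.seed(0)

# Hypothetical toy problem: two actions, and action 1 is truly the better one.
TRUE_BEST_ACTION = 1

# Supervised learning: the correct answer is explicitly provided for each example.
labels = [TRUE_BEST_ACTION] * 100                      # explicit labels
supervised_pick = max(set(labels), key=labels.count)   # imitate the labels

# Reinforcement learning: only a scalar reward for the action actually taken.
value = [0.0, 0.0]   # running estimate of each action's value
counts = [0, 0]
for _ in range(100):
    # epsilon-greedy exploration: the agent must experiment to discover the best "how"
    if random.random() < 0.1:
        action = random.randrange(2)
    else:
        action = value.index(max(value))
    reward = 1.0 if action == TRUE_BEST_ACTION else 0.0  # feedback, not a label
    counts[action] += 1
    value[action] += (reward - value[action]) / counts[action]

print("supervised pick:", supervised_pick)
print("RL value estimates:", value, "-> picks action", value.index(max(value)))
```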
@@ -48,13 +57,13 @@ t
 
 A history is the sequence of past observations, actions and rewards that have been taken up until time $t: h_t = (a_1,o_1,r_1, \ldots, a_t,o_t,r_t)$. The state of the world is generally some function of the history: $s_t = f(h_t)$. The World State is the full true state of the world used to determine how the world generates the next observation and reward. The agent might either get the entire world state as an observation $o_t$, or some partial subset.
 
- The word goes from one state $s_t$ to the next $s_{t+1}$ either based on natural environmental dynamics, or the agent's actions. State transitions can be both deterministic or stochastic.
+ The world goes from one state $s_t$ to the next $s_{t+1}$ either based on natural environmental dynamics, or the agent's actions. State transitions can be both deterministic or stochastic.
 
 This loop continues until a terminal condition is reached or can run indefinitely. Following is a diagram that succinctly captures the RL process:
 
 <figure markdown="span">
- { loading=lazy }
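A minimal sketch of the agent-environment loop described above, under made-up assumptions: the history $h_t$ accumulates (action, observation, reward) triples, the state is taken to be a function of that history (here simply the latest observation), transitions are stochastic, and the loop runs until a terminal condition is hit. The `ToyEnv` class, its reward scheme, and the termination probability are illustrative inventions, not the chapter's.

```python
import random

random.seed(0)

class ToyEnv:
    """Hypothetical two-state world with stochastic transitions."""

    def __init__(self):
        self.world_state = 0                      # the full true world state

    def step(self, action):
        # The world moves to its next state based on the action, stochastically.
        if random.random() < 0.8:
            self.world_state = (self.world_state + action) % 2
        observation = self.world_state            # fully observed in this toy case
        reward = 1.0 if self.world_state == 1 else 0.0
        done = random.random() < 0.05             # terminal condition
        return observation, reward, done

env = ToyEnv()
history = []          # h_t = (a_1, o_1, r_1, ..., a_t, o_t, r_t)
done = False
while not done:
    action = random.choice([0, 1])                # placeholder random policy
    observation, reward, done = env.step(action)
    history.append((action, observation, reward))
    state = history[-1][1]                        # s_t = f(h_t): here, the latest observation

episode_return = sum(r for _, _, r in history)
print(f"episode length: {len(history)}, return: {episode_return:.1f}")
```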
docs/chapters/06/02.md: 3 additions & 3 deletions
@@ -7,7 +7,7 @@
 </div>
 <div class="meta-item">
 <i class="fas fa-file-alt"></i>
- 763 words
+ 775 words
 </div>
 </div>
 
@@ -35,8 +35,8 @@ This notion initially stems from the work of Charles Goodhart in economic theory
 To illustrate this concept, the following is a story of a Soviet nail factory. The factory received instructions to produce as many nails as possible, with rewards for high output and penalties for low output. Within a few years, the factory had significantly increased its nail production—tiny nails that were essentially thumbtacks and proved impractical for their intended purpose. Consequently, the planners shifted the incentives: they decided to reward the factory based on the total weight of the nails produced. Within a few years, the factory began producing large, heavy nails—essentially lumps of steel—that were equally ineffective for nailing things.
 
 <figure markdown="span">
- { loading=lazy }
+ { loading=lazy }
+ <figcaption markdown="1"><b>Figure 6.2:</b> Graphic image showcasing the difficulty of specification while avoiding goodhart's law. ([Epicural, 2021](https://epicural.com/2021/04/27/goodharts-law/))</figcaption>
 </figure>
 
 A measure is not something that is optimized, whereas a target is something that is optimized. When we specify a target for optimization, it is reasonable to expect it to be correlated with what we want. Initially the measure might lead to the kind of actions that are truly desired. However, once the measure itself becomes the target, optimizing that target then starts diverging away from our desired states.
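The divergence described above can be illustrated numerically: pick a proxy measure that is initially correlated with the desired outcome, optimize it hard, and watch the correlation break. The nail-sizing model below is a hypothetical toy loosely echoing the factory story, with made-up numbers.

```python
# Hypothetical nail-factory model: a plan chooses a nail size; smaller nails mean
# more nails from the same steel, but nails below a usable size are worthless.
def nails_produced(size_mm):
    return 1000 / size_mm                                  # the proxy measure: raw count

def useful_nails(size_mm):
    return nails_produced(size_mm) if size_mm >= 5 else 0  # what is actually wanted

sizes_mm = [1, 2, 5, 10, 20]
best_for_proxy = max(sizes_mm, key=nails_produced)
best_for_goal = max(sizes_mm, key=useful_nails)

print("optimizing the proxy picks", best_for_proxy, "mm ->",
      useful_nails(best_for_proxy), "useful nails")
print("optimizing what we want picks", best_for_goal, "mm ->",
      useful_nails(best_for_goal), "useful nails")
```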