
Commit 9668ff6
committed: fixed specification sources
1 parent e733d99 commit 9668ff6

30 files changed: +124 -102 lines changed

docs/chapters/06/01.md

Lines changed: 16 additions & 7 deletions
@@ -7,12 +7,14 @@
 </div>
 <div class="meta-item">
 <i class="fas fa-file-alt"></i>
-1444 words
+1475 words
 </div>
 </div>


-The section provides a succinct reminder of several concepts in reinforcement learning (RL). It also disambiguates various often conflated terms such as rewards, values and utilities. The section ends with a discussion around distinguishing the concept of objectives that a reinforcement learning system might pursue from what it is being rewarded for. Readers who are already familiar with the basics can skip directly to section 1.2.
+!!! warning "This is meant as a recap. If you are already familiar with the basics you can skip directly to the next section."
+
+The section provides a succinct reminder of several concepts in reinforcement learning (RL). It also disambiguates various often conflated terms such as rewards, values and utilities. The section ends with a discussion around distinguishing the concept of objectives that a reinforcement learning system might pursue from what it is being rewarded for.

 ## 6.1.1 Primer {: #01 }

@@ -24,13 +26,20 @@ The section provides a succinct reminder of several concepts in reinforcement le



+
+<figure class="video-figure" markdown="span">
+<iframe style="width: 100%; aspect-ratio: 16 / 9;" frameborder="0" allowfullscreen src="https://www.youtube.com/embed/jwSbzNHGflM"></iframe>
+<figcaption markdown="1"><b>Video 6.1:</b> Optional video showcasing a robotic hand trained using reinforcement learning.</figcaption>
+</figure>
+
+
 Some examples of real-world applications of RL include:

-- **Robotic systems**: RL has been applied to tasks such as controlling physical robots in real-time, and enabling them to learn more complicated movements (OpenAI 2018 “[Learning Dexterity](https://www.youtube.com/watch?v=jwSbzNHGflM)”). RL can enable robotic systems to learn complex tasks and adapt to changing environments.
+- **Robotic systems**: RL has been applied to tasks such as controlling physical robots in real time and enabling them to learn more complicated movements. RL can enable robotic systems to learn complex tasks and adapt to changing environments.

 - **Recommender Systems**: RL can be applied to recommender systems, which interact with billions of users and aim to provide personalized recommendations. RL algorithms can learn to optimize the recommendation policy based on user feedback and improve the overall user experience.

-- **Game playing systems: **In the early 2010s RL-based systems started to beat humans at a few very simple Atari games, like Pong and Breakout. Over the years, there have been many models that have utilized RL to defeat world masters in both board and video games. These include models like [AlphaGo](https://www.deepmind.com/research/highlighted-research/alphago) (2016), [AlphaZero](https://www.deepmind.com/blog/alphazero-shedding-new-light-on-chess-shogi-and-go) (2018), [OpenAI Five](https://openai.com/research/openai-five-defeats-dota-2-world-champions) (2019), [AlphaStar](https://www.deepmind.com/blog/alphastar-mastering-the-real-time-strategy-game-starcraft-ii) (2019), [MuZero](https://www.deepmind.com/blog/muzero-mastering-go-chess-shogi-and-atari-without-rules) (2020) and [EfficientZero](https://github.com/YeWR/EfficientZero) (2021).
+- **Game playing systems**: In the early 2010s, RL-based systems started to beat humans at a few very simple Atari games, like Pong and Breakout. Over the years, many models have used RL to defeat world masters in both board and video games. These include AlphaGo ([DeepMind, 2016](https://www.deepmind.com/research/highlighted-research/alphago)), AlphaZero ([DeepMind, 2018](https://www.deepmind.com/blog/alphazero-shedding-new-light-on-chess-shogi-and-go)), OpenAI Five ([OpenAI, 2019](https://openai.com/research/openai-five-defeats-dota-2-world-champions)), AlphaStar ([DeepMind, 2019](https://www.deepmind.com/blog/alphastar-mastering-the-real-time-strategy-game-starcraft-ii)), MuZero ([DeepMind, 2020](https://www.deepmind.com/blog/muzero-mastering-go-chess-shogi-and-atari-without-rules)) and EfficientZero ([Ye et al., 2021](https://arxiv.org/abs/2111.00210)).

 RL is different from supervised learning as it begins with a high-level description of "what" to do but allows the agent to experiment and learn from experience the best "how". In RL, the agent learns through interaction with an environment and receives feedback in the form of rewards or punishments based on its actions. RL is focused on learning a set of rules that recommend the best action to take in a given state to maximize long-term rewards. In contrast, supervised learning typically involves learning from explicitly provided labels or correct answers for each input.

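The contrast drawn above between reward feedback and supervised labels can be made concrete with a minimal sketch (a hypothetical two-armed bandit, not taken from the chapter): the agent is never told which action is correct, only how much reward the action it chose produced, and it improves its estimates from that experience alone.

```python
import random

# Minimal sketch, not from the chapter: learning from reward feedback alone.
# Unlike supervised learning, the agent is never shown the "correct" action;
# it only observes a scalar reward for the action it actually took.

action_values = {"left": 0.0, "right": 0.0}  # running estimates of expected reward
learning_rate = 0.1

for step in range(500):
    # Epsilon-greedy: mostly exploit the current best estimate, sometimes explore.
    if random.random() < 0.1:
        action = random.choice(list(action_values))
    else:
        action = max(action_values, key=action_values.get)

    # The environment's reward function is hidden from the agent.
    reward = 1.0 if action == "right" else 0.2

    # Incremental update toward the observed reward (learning from experience).
    action_values[action] += learning_rate * (reward - action_values[action])

print(action_values)  # the estimate for "right" should approach 1.0
```
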
@@ -48,13 +57,13 @@ t

 A history is the sequence of past observations, actions and rewards that have been taken up until time $t$: $h_t = (a_1, o_1, r_1, \ldots, a_t, o_t, r_t)$. The state of the world is generally some function of the history: $s_t = f(h_t)$. The World State is the full true state of the world used to determine how the world generates the next observation and reward. The agent might either get the entire world state as an observation $o_t$, or some partial subset.

-The word goes from one state $s_t$ to the next $s_{t+1}$ either based on natural environmental dynamics, or the agent's actions. State transitions can be both deterministic or stochastic.
+The world goes from one state $s_t$ to the next $s_{t+1}$ based either on natural environmental dynamics or on the agent's actions. State transitions can be either deterministic or stochastic.

 This loop continues until a terminal condition is reached, or it can run indefinitely. The following diagram succinctly captures the RL process:

 <figure markdown="span">
-![Enter image alt description](Images/ht2_Image_1.png){ loading=lazy }
-<figcaption markdown="1"><b>Figure 6.1:</b> Emma Brunskill (Winter 2022) “[Stanford CS234 : RL](https://web.stanford.edu/class/cs234/CS234Win2022/modules.html) - Lecture 1”</figcaption>
+![Enter image alt description](Images/yoB_Image_1.png){ loading=lazy }
+<figcaption markdown="1"><b>Figure 6.1:</b> The reinforcement learning loop ([Brunskill, 2022](https://web.stanford.edu/class/archive/cs/cs234/cs234.1224/))</figcaption>
 </figure>

 ## 6.1.3 Policies {: #03 }

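The interaction loop described above can be sketched as follows (a hypothetical toy environment and policy, not from the chapter): the history $h_t = (a_1, o_1, r_1, \ldots, a_t, o_t, r_t)$ grows at every step, the agent's state is some function $f$ of that history, and the transition includes noise to make it stochastic.

```python
import random

# Rough sketch of the loop in Figure 6.1, using a hypothetical toy environment
# (not from the chapter). The history h_t = (a_1, o_1, r_1, ..., a_t, o_t, r_t)
# accumulates as the loop runs.

def world_step(world_state, action):
    """Stochastic transition: the next world state depends on the current
    state, the agent's action, and environment noise."""
    next_state = world_state + action + random.choice([-1, 0, 1])
    observation = next_state        # here the agent observes the full world state
    reward = -abs(next_state)       # toy reward: stay close to zero
    return next_state, observation, reward

def policy(history):
    """The agent's state is some function of the history, s_t = f(h_t);
    here f simply extracts the most recent observation."""
    last_observation = history[-1][1] if history else 0
    return -1 if last_observation > 0 else 1  # push back toward zero

history = []      # h_t
world_state = 0   # full true state of the world
for t in range(10):
    action = policy(history)
    world_state, observation, reward = world_step(world_state, action)
    history.append((action, observation, reward))

print(history)
```
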
docs/chapters/06/02.md

Lines changed: 3 additions & 3 deletions
@@ -7,7 +7,7 @@
 </div>
 <div class="meta-item">
 <i class="fas fa-file-alt"></i>
-763 words
+775 words
 </div>
 </div>

@@ -35,8 +35,8 @@ This notion initially stems from the work of Charles Goodhart in economic theory
 To illustrate this concept, the following is a story of a Soviet nail factory. The factory received instructions to produce as many nails as possible, with rewards for high output and penalties for low output. Within a few years, the factory had significantly increased its nail production—tiny nails that were essentially thumbtacks and proved impractical for their intended purpose. Consequently, the planners shifted the incentives: they decided to reward the factory based on the total weight of the nails produced. Within a few years, the factory began producing large, heavy nails—essentially lumps of steel—that were equally ineffective for nailing things.

 <figure markdown="span">
-![Enter image alt description](Images/gGq_Image_2.png){ loading=lazy }
-<figcaption markdown="1"><b>Figure 6.2:</b> ([Source](https://lwfiles.mycourse.app/networkcapitalinsider-public/cc478b844a27de3f4f79f3dc0f9e0fde.jpeg))</figcaption>
+![Enter image alt description](Images/p5L_Image_2.png){ loading=lazy }
+<figcaption markdown="1"><b>Figure 6.2:</b> Graphic illustrating the difficulty of specification while avoiding Goodhart's law. ([Epicural, 2021](https://epicural.com/2021/04/27/goodharts-law/))</figcaption>
 </figure>

 A measure is not something that is optimized, whereas a target is something that is optimized. When we specify a target for optimization, it is reasonable to expect it to be correlated with what we want. Initially the measure might lead to the kind of actions that are truly desired. However, once the measure itself becomes the target, optimizing that target then starts diverging away from our desired states.

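The divergence between a measure and a target can be sketched with a toy model of the nail factory (hypothetical numbers and thresholds, not from the chapter): each proxy is correlated with usefulness over part of its range, but pushing either proxy to its maximum selects nails nobody wants.

```python
# Toy model of the nail-factory story, with hypothetical numbers (not from the
# chapter): whichever proxy measure is promoted to the optimization target,
# the proxy-optimal nail drifts away from what the planners actually wanted.

STEEL_BUDGET_KG = 100.0        # steel available
LABOR_BUDGET_NAILS = 1000      # roughly one unit of labor per nail, any size

def count_produced(weight_kg):
    """Proxy 1: sheer number of nails (limited only by steel)."""
    return STEEL_BUDGET_KG / weight_kg

def tonnage_produced(weight_kg):
    """Proxy 2: total weight of nails (limited by steel and labor)."""
    count = min(LABOR_BUDGET_NAILS, STEEL_BUDGET_KG / weight_kg)
    return count * weight_kg

def usable_nails(weight_kg):
    """True objective: nails only count if they are a sensible size."""
    return count_produced(weight_kg) if 0.01 <= weight_kg <= 0.05 else 0

candidate_weights = [0.001, 0.01, 0.03, 0.05, 0.5, 5.0]  # kg per nail

best_for_count = max(candidate_weights, key=count_produced)      # 0.001: thumbtacks
best_for_tonnage = max(candidate_weights, key=tonnage_produced)  # 0.5 (5.0 ties): oversized lumps
best_for_use = max(candidate_weights, key=usable_nails)          # 0.01: actual nails

print(best_for_count, best_for_tonnage, best_for_use)
```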