Commit 8f9bed3
Add basic elastic training
Add guards so that fast-resume is used only when the proxy backend is in use.
Add the jobset changes needed for elastic training.
Temporarily change the configuration to decrease the batch size.
Add a stop_trace to cancel any ongoing traces.
Change the batch size to match the chip count, and change the checkpoint step interval to avoid writing any checkpoints during testing.
1 parent 9e739e3 commit 8f9bed3
File tree
3 files changed, +42 -7 lines changed
- axlearn
  - cloud/gcp
  - common
  - experiments/text/gpt
| Hunk | Original file lines | Diff lines | Lines added | Lines removed |
|---|---|---|---|---|
| 1 | 320-325 | 320-327 | 2 | 0 |
| 2 | 581-594 | 583-601 | 9 | 4 |
| Hunk | Original file lines | Diff lines | Lines added | Lines removed |
|---|---|---|---|---|
| 1 | 147-154 | 147-180 | 29 | 3 |
| Hunk | Original file lines | Diff lines | Lines added | Lines removed |
|---|---|---|---|---|
| 1 | 249-254 | 249-255 | 1 | 0 |
| 2 | 380-385 | 381-387 | 1 | 0 |