
Conversation

@BenjaminBossan
Member

The MetaMathQA benchmark already supported enabling torch.compile, but the implementation was not very good. The new changes are:

- call compile after applying PEFT, not before
- compile with dynamic=True
- avoid model.eval() + model.train() calls

These changes prevent graph breaks and recompiles; a context manager now ensures that neither occurs (see the sketch below).

Some unrelated changes:

- improve some type annotations
- use dtype argument instead of deprecated torch_dtype
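
Roughly, the resulting setup looks like the following sketch. The model id and LoRA config are placeholders, not the benchmark's actual values; it only illustrates the order of operations and the arguments mentioned above.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# load with the `dtype` argument (the deprecated `torch_dtype` is no longer used)
model = AutoModelForCausalLM.from_pretrained(
    "some/causal-lm",  # placeholder model id
    dtype=torch.bfloat16,
)

# apply PEFT first ...
model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM"))

# ... then compile, with dynamic shapes so that varying sequence lengths
# don't trigger recompiles
model = torch.compile(model, dynamic=True)
```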

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan
Member Author

Note to self: How to deal with dropout? It's not deactivated by torch.inference_mode().
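
A minimal standalone illustration of the issue (not benchmark code): inference_mode only disables autograd bookkeeping, while dropout is controlled by the module's train/eval flag.

```python
import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

with torch.inference_mode():
    print(drop(x))  # dropout still active: roughly half the entries are zeroed

drop.eval()
with torch.inference_mode():
    print(drop(x))  # identical to x: only eval() switches dropout off
```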

@BenjaminBossan BenjaminBossan marked this pull request as draft November 20, 2025 14:32
@BenjaminBossan
Member Author

After some testing: When running evaluation, we would generally want to put the model into eval mode (to disable dropout). However, this triggers a re-compile when the model is put back into train mode (i.e. a total of two compiles happen). We could skip this train/eval toggle during training to avoid the re-compile. This would mean that the model is in train mode when evaluating, but arguably that is not a big deal. Obviously, when it comes to the test set, we do put the model into eval mode first.
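
For illustration, the two variants look roughly like this (names and the forward call are made up, not the benchmark's actual code):

```python
import torch

@torch.inference_mode()
def evaluate(model, batches, switch_modes: bool):
    if switch_modes:
        model.eval()   # correct semantics (disables dropout) ...
    outputs = [model(**batch) for batch in batches]
    if switch_modes:
        model.train()  # ... but toggling back triggers a re-compile
    return outputs
```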

Here are the numbers for no compile, compile with train/eval switch, and compile without train/eval switch:

| metric | no compile | compile w/ train/eval switch | compile w/o switch |
|---|---|---|---|
| reserved mem max | 22.3 GB | 16.4 GB | 16.4 GB |
| reserved mem avg | 14.4 GB | 11.2 GB | 11.2 GB |
| reserved mem 99th percentile | 20.1 GB | 14.6 GB | 14.6 GB |
| number of compiles | 0 | 1 | 2 |
| train time / step | ~29 sec | ~24 sec | ~24 sec |
| final train loss (sanity check) | 0.60717 | 0.60710 | 0.60695 |

Validation accuracy varies a bit, but that's to be expected given the rather small validation set and the fact that accuracy is measured on generations.

I tried a couple of mitigations, like running eval with torch.compiler.disable or compiling the eval function separately (which wouldn't really save time, as we'd just replace a recompile with another compile), but nothing I tried helped.
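
For reference, the first of those mitigations amounts to something like the sketch below (illustrative only, and it did not help in practice):

```python
import torch

@torch.compiler.disable
def evaluate_uncompiled(model, batches):
    # the idea: run the whole eval pass eagerly so the mode switch
    # doesn't interact with the compiled training graph
    model.eval()
    with torch.inference_mode():
        outputs = [model(**batch) for batch in batches]
    model.train()
    return outputs
```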

@githubnemo What would you prefer: Live with the recompilation or avoid the train/eval switch?

@BenjaminBossan BenjaminBossan marked this pull request as ready for review November 25, 2025 11:05
@githubnemo
Collaborator

Thanks for investigating!

If I understand correctly, there's almost no time penalty for using the more correct (recompilation) variant, so I'd opt for that, since dropout is only one potential candidate for train/eval mismatches.

@BenjaminBossan
Member Author

> If I understand correctly, there's almost no time penalty for using the more correct (recompilation) variant, so I'd opt for that, since dropout is only one potential candidate for train/eval mismatches.

Yes, we could do that. It means, however, that we have to remove the error_on_recompile context, which could prevent us from detecting other recompilation issues. LMK if that sounds acceptable.
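
For context, the guard in question is roughly the following sketch (how exactly the benchmark sets the flag may differ; torch._dynamo.config.patch is a private API and the loop body is a hypothetical stand-in):

```python
import torch._dynamo

def run_training(model, batches):
    # any recompile inside the patched scope raises instead of passing
    # silently; accepting the eval-induced recompile means dropping or
    # relaxing this guard
    with torch._dynamo.config.patch(error_on_recompile=True):
        for batch in batches:
            model(**batch)
```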
