Hello,
We noticed that models trained with the AMSGrad optimizer in Optax tend to yield slightly worse results than the same models trained with PyTorch. The two implementations differ as follows:
Current Optax implementation:

PyTorch implementation (the implementation in TensorFlow follows the same algorithm):

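For context, the PyTorch/TensorFlow convention can be sketched roughly as below, in plain NumPy. This is a simplified textbook version of the AMSGrad update (no weight decay, illustrative names), not the actual code from either library; the key point is that the running maximum `v_max` is taken over the raw second-moment estimate, with bias correction applied when forming the denominator:

```python
import numpy as np

def amsgrad_step(param, grad, m, v, v_max, t,
                 lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One AMSGrad step in the PyTorch/TensorFlow style (simplified sketch).

    m, v    -- first and second moment estimates
    v_max   -- elementwise running maximum of v
    t       -- step count, starting at 1 (used for bias correction)
    """
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    v_max = np.maximum(v_max, v)            # max over the *uncorrected* v
    m_hat = m / (1 - b1 ** t)               # bias-corrected first moment
    denom = np.sqrt(v_max / (1 - b2 ** t)) + eps
    param = param - lr * m_hat / denom
    return param, m, v, v_max
```

With this formulation, `v_max` is nondecreasing elementwise, so the effective per-parameter step size can only shrink over time, which is the convergence fix AMSGrad makes to Adam.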
Would it be possible to align the Optax implementation with the PyTorch/TensorFlow version? This would improve consistency across the different ML frameworks and possibly improve performance.