My training process is frozen #107

@whansk50

Hello,
The training process starts without any problem, but partway through it freezes, like this:

image

Nothing changes on the console after that.

I checked the GPUs at that point, and GPU-Util (not memory) is maxed out while the process is frozen, which I think is a clue to the problem:

image

I changed parameters such as batch_size, the number of workers, etc., but it didn't help.
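
In case it helps whoever looks at this, here is a minimal sketch of one way to see where the Python side is stuck. It uses the standard-library `faulthandler` module to dump the stack of every Python thread at a fixed interval. The helper names `enable_hang_dump` and `disable_hang_dump`, the 600-second default, and the `trainer.fit(...)` call in the comments are hypothetical placeholders, not anything from this repository. It only shows threads of the process it runs in, so it would not directly reveal a stuck data-loading worker process or a hanging GPU kernel.

```python
import faulthandler
import sys


def enable_hang_dump(timeout_s=600, file=None):
    """Dump the stack of every Python thread every `timeout_s` seconds.

    `file` must stay open and expose a real file descriptor (fileno()).
    Defaults to sys.stderr when not given.
    """
    if file is None:
        file = sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)


def disable_hang_dump():
    """Stop the periodic dumps, e.g. once training finishes normally."""
    faulthandler.cancel_dump_traceback_later()


# Hypothetical usage around the training call (placeholder names):
# enable_hang_dump(timeout_s=600)
# trainer.fit(model, datamodule)
# disable_hang_dump()
```

If the process freezes, the last dump written before the hang should show which Python call each thread was in at that moment.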

Can anyone help?

My environment is miniconda3 with CUDA 11.8, and the versions are:
PyTorch 2.0.0
PyTorch Lightning 2.0.2
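
For anyone trying to reproduce this, a small sketch of how the installed versions can be printed with only the standard library. The helper name `report_versions` is hypothetical, and the distribution names `torch`, `pytorch-lightning`, and `lightning` are assumptions about how the packages were installed.

```python
from importlib import metadata


def report_versions(names):
    """Return {distribution name: installed version or 'not installed'}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    # Distribution names are assumptions; adjust to the actual install.
    names = ["torch", "pytorch-lightning", "lightning"]
    for name, version in report_versions(names).items():
        print(f"{name}: {version}")
```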
