RAM Requirements for Checkpointing #209

@Grace-Chang2

Description

Does anyone know the minimum RAM required for each model?

I ran checkpointing with model=llama8b on a 128GB client, and during the read phase of the default behavior it hung due to RAM limitations. I was able to run the checkpointing workload with 2 clients at 128GB each (256GB of RAM total).

Is there any resource pointing to the RAM needed for the llama70b, llama405b, and 1T models?
I know others have been able to run llama70b with 2TB of RAM (4 clients with 512GB each), and llama405b with 4TB of RAM (16 clients with 256GB each).

Is there some equation for it? Would having more clients decrease the RAM needed on each server? For example, could I have used 2 or 3 clients with 64GB each to run the llama8b workload?
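For what it's worth, here is a hedged back-of-envelope sketch, not anything from official documentation: assume the checkpoint holds 2-byte (fp16/bf16) weights plus roughly 12 bytes of fp32 optimizer state per parameter (Adam-style master weights, momentum, variance), and that the read phase buffers an approximately even shard of the checkpoint in each client's RAM. All of these byte counts, the even-sharding assumption, and the parameter counts below are guesses for illustration only.

```python
# Back-of-envelope checkpoint RAM estimate. ASSUMPTIONS (not from the
# benchmark docs): 2 bytes/param for weights, 12 bytes/param for optimizer
# state, and the checkpoint sharded evenly across clients during the read.

def checkpoint_bytes(params, weight_bytes=2, optimizer_bytes=12):
    """Total checkpoint size in bytes: weights + optimizer state."""
    return params * (weight_bytes + optimizer_bytes)

def ram_per_client(params, num_clients):
    """Rough per-client RAM if the checkpoint is split evenly."""
    return checkpoint_bytes(params) / num_clients

GiB = 1024 ** 3
# llama8b (~8e9 params) on 2 clients
print(f"llama8b,  2 clients: ~{ram_per_client(8e9, 2) / GiB:.0f} GiB each")
# llama70b (~70e9 params) on 4 clients
print(f"llama70b, 4 clients: ~{ram_per_client(70e9, 4) / GiB:.0f} GiB each")
```

Under these assumptions an 8B model needs roughly 112GB of checkpoint buffer in total, which would be consistent with a single 128GB client hanging but two such clients succeeding; it would also suggest that three 64GB clients could be marginal rather than comfortable. Again, this is only a guess at the scaling, not the benchmark's actual formula.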
