RAM Requirements for Checkpointing #209

@Grace-Chang2

Description

Does anyone know the minimum RAM required for each model?

I ran checkpointing with model=llama8b on a 128GB client, and during the read phase of the default behavior it hung due to RAM limitations. I was able to run the checkpointing workload with 2 clients at 128GB each (256GB of RAM total).

Is there any resource pointing to the RAM needed for the llama70b, llama405b, and 1T models?
I know others have been able to run llama70b with 2TB of RAM (4 clients with 512GB each), and llama405b with 4TB of RAM (16 clients with 256GB each).

Is there some equation for it? Would having more clients decrease the RAM needed on each server? For example, could I have used 2 or 3 clients with 64GB each to run the llama8b workload?
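For what it's worth, here is a hedged back-of-envelope sketch, not anything from official documentation: assume the checkpoint holds 2-byte (fp16/bf16) weights plus roughly 12 bytes of fp32 optimizer state per parameter (Adam-style master weights, momentum, variance), and that the read phase buffers an approximately even shard of the checkpoint in each client's RAM. All of these byte counts, the even-sharding assumption, and the parameter counts below are guesses for illustration only.

```python
# Back-of-envelope checkpoint RAM estimate. ASSUMPTIONS (not from the
# benchmark docs): 2 bytes/param for weights, 12 bytes/param for optimizer
# state, and the checkpoint sharded evenly across clients during the read.

def checkpoint_bytes(params, weight_bytes=2, optimizer_bytes=12):
    """Total checkpoint size in bytes: weights + optimizer state."""
    return params * (weight_bytes + optimizer_bytes)

def ram_per_client(params, num_clients):
    """Rough per-client RAM if the checkpoint is split evenly."""
    return checkpoint_bytes(params) / num_clients

GiB = 1024 ** 3
# llama8b (~8e9 params) on 2 clients
print(f"llama8b,  2 clients: ~{ram_per_client(8e9, 2) / GiB:.0f} GiB each")
# llama70b (~70e9 params) on 4 clients
print(f"llama70b, 4 clients: ~{ram_per_client(70e9, 4) / GiB:.0f} GiB each")
```

Under these assumptions an 8B model needs roughly 112GB of checkpoint buffer in total, which would be consistent with a single 128GB client hanging but two such clients succeeding; it would also suggest that three 64GB clients could be marginal rather than comfortable. Again, this is only a guess at the scaling, not the benchmark's actual formula.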
