Description
Through comparative experiments, we found that what really reduces GPU memory is "torch.set_default_dtype(torch.float16)" and DeepSpeed. We ran experiments with LLaMA-7B, using the following configuration to disable DeepSpeed's optimizations (ZeRO stage 0):
{
  "zero_optimization": {
    "stage": 0
  },
  "gradient_accumulation_steps": 1,
  "steps_per_print": 2000,
  "train_micro_batch_size_per_gpu": 1,
  "wall_clock_breakdown": false
}
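For reference, here is a minimal sketch of how a configuration like this is typically passed to DeepSpeed. The tiny stand-in model and SGD optimizer are placeholders for illustration only, not the repo's actual training code:

```python
import torch
import deepspeed

# Hypothetical stand-in for LLaMA-7B, just to show the wiring.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

ds_config = {
    "zero_optimization": {"stage": 0},  # ZeRO disabled
    "gradient_accumulation_steps": 1,
    "steps_per_print": 2000,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False,
}

# With stage 0, DeepSpeed wraps training but applies no ZeRO
# partitioning, so its memory optimizations are effectively off.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, optimizer=optimizer, config=ds_config
)
```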
With this configuration, DeepSpeed's memory optimizations are effectively disabled. Even so, when we do not enable mixed precision, the model's outputs are fp16, which is clearly abnormal. After checking, we found that "torch.set_default_dtype(torch.float16)" plays the key role. When we remove both DeepSpeed and "torch.set_default_dtype(torch.float16)" and run the default configuration on the WiC dataset, training goes out of memory on an 80GB A100. After adding "torch.set_default_dtype(torch.float16)" back, memory usage drops to about 35GB. Even under normal mixed-precision training, the author's LOMO still runs out of memory on an 80GB A100.
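To illustrate the mechanism (a minimal sketch, not the repo's code): any parameters created after "torch.set_default_dtype(torch.float16)" are allocated in fp16, which explains both the fp16 outputs and the roughly halved weight memory compared to fp32.

```python
import torch

# A layer constructed before the call uses the fp32 default.
layer_fp32 = torch.nn.Linear(4096, 4096)
print(layer_fp32.weight.dtype)  # torch.float32

# After changing the default dtype, newly constructed modules
# allocate their parameters directly in fp16.
torch.set_default_dtype(torch.float16)
layer_fp16 = torch.nn.Linear(4096, 4096)
print(layer_fp16.weight.dtype)  # torch.float16
```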