Model Saving During Training: A Walkthrough of the save_pretrained Method

The checkpoint-saving flow lives mainly in the save_pretrained function (/home/dell/anaconda3/envs/llama_factory/lib/python3.11/site-packages/peft/peft_model.py).

Call stack:

```
(Pdb) bt
/home/dell/Downloads/LLaMA-Factory/src/llamafactory/launcher.py(23)<module>()
-> launch()
/home/dell/Downloads/LLaMA-Factory/src/llamafactory/launcher.py(19)launch()
-> run_exp()
/home/dell/Downloads/LLaMA-Factory/src/llamafactory/train/tuner.py(50)run_exp()
-> run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
/home/dell/Downloads/LLaMA-Factory/src/llamafactory/train/sft/workflow.py(96)run_sft()
-> train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
/home/dell/anaconda3/envs/llama_factory/lib/python3.11/site-packages/transformers/trainer.py(1938)train()
-> return inner_training_loop(
/home/dell/anaconda3/envs/llama_factory/lib/python3.11/site-packages/transformers/trainer.py(2356)_inner_training_loop()
-> self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
/home/dell/anaconda3/envs/llama_factory/lib/python3.11/site-packages/transformers/trainer.py(2808)_maybe_log_save_evaluate()
-> self._save_checkpoint(model, trial, metrics=metrics)
/home/dell/anaconda3/envs/llama_factory/lib/python3.11/site-packages/transformers/trainer.py(2887)_save_checkpoint()
-> self.save_model(output_dir, _internal_call=True)
/home/dell/anaconda3/envs/llama_factory/lib/python3.11/site-packages/transformers/trainer.py(3442)save_model()
-> self._save(output_dir, state_dict=state_dict)
/home/dell/anaconda3/envs/llama_factory/lib/python3.11/site-packages/transformers/trainer.py(3526)_save()
-> self.model.save_pretrained(
> /home/dell/anaconda3/envs/llama_factory/lib/python3.11/site-packages/peft/peft_model.py(305)save_pretrained()
```
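The stack shows that a routine checkpoint save inside Trainer._maybe_log_save_evaluate eventually lands in PEFT's PeftModel.save_pretrained. The same entry point can be reached directly, as in this minimal sketch (the model name and output path are placeholders, not from the original trace):

```python
# Minimal sketch: hitting the same save path outside the Trainer.
# "facebook/opt-125m" is a placeholder; any causal LM works.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model = get_peft_model(base, LoraConfig(r=16, target_modules=["q_proj", "v_proj"]))

# Lands in peft/peft_model.py:save_pretrained, writing only the adapter
# weights plus adapter_config.json, not the frozen base model.
model.save_pretrained("output/checkpoint-demo")
```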

The id_tensor_storage(tensor) method returns a triple of (storage device, storage pointer (a unique identifier), storage size):

```python
def id_tensor_storage(tensor: torch.Tensor) -> Tuple[torch.device, int, int]:
    ...
    try:
        storage_ptr = tensor.untyped_storage().data_ptr()
        storage_size = tensor.untyped_storage().nbytes()
    ...

    return tensor.device, storage_ptr, storage_size
```
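A quick sketch of why this triple works as an identity: a view shares its parent's storage and yields the same triple, while a clone gets fresh storage (my own illustration, not from the PEFT source):

```python
import torch

def triple(x: torch.Tensor):
    # The same fields id_tensor_storage inspects: device, base pointer, byte size.
    return x.device, x.untyped_storage().data_ptr(), x.untyped_storage().nbytes()

t = torch.zeros(4, 4)
print(triple(t) == triple(t[:2]))      # True: a view shares its parent's storage
print(triple(t) == triple(t.clone()))  # False: a clone owns new storage
```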

The bookkeeping for the model's tensors pairs each storage triple with the parameter names backed by that storage:

```
(Pdb) p shared_ptrs
{(device(type='cpu'), 1545360512, 65536): ['base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight', 'base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight'], (device(type='cpu'), 1545145344, 65536): ['base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight', 'base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight']}
```

The safe_save_file method (in peft_model.py this is safetensors.torch.save_file, imported under an alias):

```python
safe_save_file(
    output_state_dict,
    os.path.join(output_dir, SAFETENSORS_WEIGHTS_NAME),
    metadata={"format": "pt"},
)
```
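In PEFT, SAFETENSORS_WEIGHTS_NAME is "adapter_model.safetensors", so each checkpoint directory ends up with that file next to adapter_config.json; metadata={"format": "pt"} just records that the tensors came from PyTorch.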

Inside, it is wrapped one more level down, into the serialize_file method:

```python
def save_file(
    tensors: Dict[str, torch.Tensor],
    filename: Union[str, os.PathLike],
    metadata: Optional[Dict[str, str]] = None,
):
    serialize_file(_flatten(tensors), filename, metadata=metadata)
```
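A minimal round-trip sketch of this safetensors entry point (the file name is illustrative):

```python
import torch
from safetensors.torch import load_file, save_file

tensors = {"weight": torch.randn(2, 2), "bias": torch.zeros(2)}
save_file(tensors, "demo.safetensors", metadata={"format": "pt"})
restored = load_file("demo.safetensors")  # same keys, tensors loaded back on CPU
```

serialize_file is the Rust-backed binding that does the actual writing, while _flatten converts each tensor into raw bytes plus dtype and shape metadata; safetensors refuses to save tensors that share memory, which is exactly why the caller resolves shared storage (the shared_ptrs step above) before reaching this point.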
