The checkpoint-saving flow lives mainly in the `save_pretrained` function (/home/dell/anaconda3/envs/llama_factory/lib/python3.11/site-packages/peft/peft_model.py).

Call stack:
```
(Pdb) bt
  /home/dell/Downloads/LLaMA-Factory/src/llamafactory/launcher.py(23)<module>()
-> launch()
  /home/dell/Downloads/LLaMA-Factory/src/llamafactory/launcher.py(19)launch()
-> run_exp()
  /home/dell/Downloads/LLaMA-Factory/src/llamafactory/train/tuner.py(50)run_exp()
-> run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  /home/dell/Downloads/LLaMA-Factory/src/llamafactory/train/sft/workflow.py(96)run_sft()
-> train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  /home/dell/anaconda3/envs/llama_factory/lib/python3.11/site-packages/transformers/trainer.py(1938)train()
-> return inner_training_loop(
  /home/dell/anaconda3/envs/llama_factory/lib/python3.11/site-packages/transformers/trainer.py(2356)_inner_training_loop()
-> self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  /home/dell/anaconda3/envs/llama_factory/lib/python3.11/site-packages/transformers/trainer.py(2808)_maybe_log_save_evaluate()
-> self._save_checkpoint(model, trial, metrics=metrics)
  /home/dell/anaconda3/envs/llama_factory/lib/python3.11/site-packages/transformers/trainer.py(2887)_save_checkpoint()
-> self.save_model(output_dir, _internal_call=True)
  /home/dell/anaconda3/envs/llama_factory/lib/python3.11/site-packages/transformers/trainer.py(3442)save_model()
-> self._save(output_dir, state_dict=state_dict)
  /home/dell/anaconda3/envs/llama_factory/lib/python3.11/site-packages/transformers/trainer.py(3526)_save()
-> self.model.save_pretrained(
> /home/dell/anaconda3/envs/llama_factory/lib/python3.11/site-packages/peft/peft_model.py(305)save_pretrained()
```
The `id_tensor_storage(tensor)` function returns a triple of (storage device, storage pointer (which serves as a unique identifier), storage size):
```python
def id_tensor_storage(tensor: torch.Tensor) -> Tuple[torch.device, int, int]:
    ...
    try:
        storage_ptr = tensor.untyped_storage().data_ptr()
        storage_size = tensor.untyped_storage().nbytes()
    ...
    return tensor.device, storage_ptr, storage_size
```
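This triple acts as a dictionary key for spotting tensors that alias the same underlying memory. A minimal pure-Python sketch of the grouping idea, with made-up triples and parameter names standing in for real `id_tensor_storage` results:

```python
from collections import defaultdict

# Hypothetical (device, data_ptr, nbytes) triples standing in for
# id_tensor_storage() output; two names alias the same storage buffer.
state_dict_storages = {
    "layers.0.q_proj.lora_A.weight":       ("cpu", 1545360512, 65536),
    "layers.0.q_proj.lora_A.weight_alias": ("cpu", 1545360512, 65536),
    "layers.0.q_proj.lora_B.weight":       ("cpu", 1545145344, 65536),
}

# Group parameter names by storage triple; any group with more than
# one name is a set of tensors sharing the same buffer.
ptrs = defaultdict(list)
for name, triple in state_dict_storages.items():
    ptrs[triple].append(name)

shared_ptrs = {triple: names for triple, names in ptrs.items() if len(names) > 1}
print(shared_ptrs)
# {('cpu', 1545360512, 65536): ['layers.0.q_proj.lora_A.weight', 'layers.0.q_proj.lora_A.weight_alias']}
```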
In the bookkeeping for the saved model, each storage triple maps to the parameter names that share it:
```
(Pdb) p shared_ptrs
{(device(type='cpu'), 1545360512, 65536): ['base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight', 'base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight'],
 (device(type='cpu'), 1545145344, 65536): ['base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight', 'base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight']}
```
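safetensors refuses to serialize state dicts containing aliased tensors, so before saving, each shared-storage group is collapsed to a single surviving name. A hedged sketch of that dedup step (the group list and state-dict contents here are invented; the real code derives the groups from `shared_ptrs` above):

```python
# Hypothetical state dict where two names alias the same bytes.
output_state_dict = {
    "lora_A.weight":       b"A-bytes",
    "lora_A.weight_alias": b"A-bytes",
    "lora_B.weight":       b"B-bytes",
}
shared_groups = [["lora_A.weight", "lora_A.weight_alias"]]

# Keep the first name in each shared-storage group, drop the aliases,
# so every remaining entry owns distinct storage.
for names in shared_groups:
    for name in names[1:]:
        output_state_dict.pop(name, None)

print(sorted(output_state_dict))
# ['lora_A.weight', 'lora_B.weight']
```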
The deduplicated weights are then written out via `safe_save_file`:
```python
safe_save_file(
    output_state_dict,
    os.path.join(output_dir, SAFETENSORS_WEIGHTS_NAME),
    metadata={"format": "pt"},
)
```
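The resulting file follows the safetensors layout: an 8-byte little-endian header length, a JSON header mapping tensor names to dtype/shape/byte offsets (plus an optional `__metadata__` entry carrying the `{"format": "pt"}` above), then the raw tensor bytes. A hand-rolled sketch of that layout, for illustration only:

```python
import json
import struct

# Pretend tensor payload: 8 zero bytes standing in for two float32 values.
data = b"\x00" * 8
header = {
    "__metadata__": {"format": "pt"},
    "t": {"dtype": "F32", "shape": [2], "data_offsets": [0, 8]},
}
header_bytes = json.dumps(header).encode("utf-8")

# File layout: u64 little-endian header size, JSON header, raw data.
blob = struct.pack("<Q", len(header_bytes)) + header_bytes + data

# Parse it back the same way a safetensors reader would.
(n,) = struct.unpack("<Q", blob[:8])
parsed = json.loads(blob[8 : 8 + n])
print(parsed["t"]["shape"])
# [2]
```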
`safe_save_file` is itself a thin wrapper that delegates to the `serialize_file` method:
```python
def save_file(
    tensors: Dict[str, torch.Tensor],
    filename: Union[str, os.PathLike],
    metadata: Optional[Dict[str, str]] = None,
):
    serialize_file(_flatten(tensors), filename, metadata=metadata)
```
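`_flatten` converts each framework tensor into a plain dict of dtype string, shape, and raw bytes, so the serializer below it needs no torch objects. A hedged pure-Python stand-in (using `array.array` instead of `torch.Tensor`; the exact `"dtype"`/`"shape"`/`"data"` keys are an assumption about the safetensors bindings, not copied from their source):

```python
import array

def flatten(tensors):
    # Turn each "tensor" into a framework-free dict of dtype, shape,
    # and raw little-endian bytes, ready for a byte-level serializer.
    out = {}
    for name, arr in tensors.items():
        out[name] = {
            "dtype": "F32",           # array.array("f") holds float32
            "shape": [len(arr)],
            "data": arr.tobytes(),
        }
    return out

flat = flatten({"w": array.array("f", [1.0, 2.0])})
print(flat["w"]["shape"], len(flat["w"]["data"]))
# [2] 8
```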