How transformers loads a model with from_pretrained under deepspeed

1. Preface: the structure of from_pretrained

Because GPU memory is not enough, deepspeed's zero3-offload can be used to keep the model parameters in CPU memory.

Without deepspeed, the run already hits out-of-memory at the model-loading step; with deepspeed the model loads successfully, which shows that deepspeed is already at work from the moment the model is loaded.

In the llama_factory project, this is where the model gets loaded (LLaMA-Factory/src/llamafactory/model/loader.py, around line 160):

load_class = AutoModelForCausalLM
model = load_class.from_pretrained(**init_kwargs)

Stepping into the from_pretrained definition with pdb's s command takes us into the transformers package.

We then arrive at python3.11/site-packages/transformers/models/auto/auto_factory.py (line 566):

model_class = _get_model_class(config, cls._model_mapping)
pdb.set_trace()
return model_class.from_pretrained(
    pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
)

Printing model_class shows:

(Pdb) p model_class
<class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'>

Continue stepping into the from_pretrained definition (site-packages/transformers/modeling_utils.py, around line 2894). It looks like a lot of code, but much of it is just definitions and initialization; the key work is done by three functions, which implement these three things:

  1. Load the checkpoint files into cache
  2. Instantiate the model class and allocate memory for the model parameters
  3. Assign values to the parameters

2. get_checkpoint_shard_files

# resolved_archive_file becomes a list of files that point to the different checkpoint shards in this case.
resolved_archive_file, sharded_metadata = get_checkpoint_shard_files(
    pretrained_model_name_or_path,
    resolved_archive_file,
    cache_dir=cache_dir,
    force_download=force_download,
    proxies=proxies,
    resume_download=resume_download,
    local_files_only=local_files_only,
    token=token,
    user_agent=user_agent,
    revision=revision,
    subfolder=subfolder,
    _commit_hash=commit_hash,
)

After this executes, both variables can be printed and inspected:

(Pdb) p resolved_archive_file
['/home/dell/sdb/.cache/Meta-Llama-3-8B-Instruct/model-00001-of-00004.safetensors', '/home/dell/sdb/.cache/Meta-Llama-3-8B-Instruct/model-00002-of-00004.safetensors', '/home/dell/sdb/.cache/Meta-Llama-3-8B-Instruct/model-00003-of-00004.safetensors', '/home/dell/sdb/.cache/Meta-Llama-3-8B-Instruct/model-00004-of-00004.safetensors']


(Pdb) p sharded_metadata
{'total_size': 16060522496, 'all_checkpoint_keys': ['lm_head.weight', 'model.embed_tokens.weight', 'model.layers.0.input_layernorm.weight', 'model.layers.0.mlp.down_proj.weight', 'model.layers.0.mlp.gate_proj.weight', 'model.layers.0.mlp.up_proj.weight', 'model.layers.0.post_attention_layernorm.weight',
... ...
'model.layers.9.post_attention_layernorm.weight': 'model-00002-of-00004.safetensors', 'model.layers.9.self_attn.k_proj.weight': 'model-00002-of-00004.safetensors', 'model.layers.9.self_attn.o_proj.weight': 'model-00002-of-00004.safetensors', 'model.layers.9.self_attn.q_proj.weight': 'model-00002-of-00004.safetensors', 'model.layers.9.self_attn.v_proj.weight': 'model-00002-of-00004.safetensors', 'model.norm.weight': 'model-00004-of-00004.safetensors'

]}

In effect, this function loads the checkpoint files into the cache and obtains the information about which model parameter lives in which checkpoint file.

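For reference, sharded_metadata is built from the checkpoint's index file (model.safetensors.index.json for safetensors checkpoints). A minimal sketch of inspecting that index directly, using only the standard library and reusing the local path from the pdb output above:

import json
import os

# Sharded HF checkpoints ship an index file recording the total size and
# the parameter-name -> shard-file mapping.
ckpt_dir = "/home/dell/sdb/.cache/Meta-Llama-3-8B-Instruct"
with open(os.path.join(ckpt_dir, "model.safetensors.index.json")) as fp:
    index = json.load(fp)

print(index["metadata"]["total_size"])           # 16060522496
print(index["weight_map"]["model.norm.weight"])  # model-00004-of-00004.safetensors
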
3. Instantiating the model class

deepspeed sets up a context, and then the class is instantiated:

init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
with ContextManagers(init_contexts):
    # Let's make sure we don't run the init function of buffer modules
    model = cls(config, *model_args, **model_kwargs)

What is worth paying attention to in deepspeed.zero.Init are the two concepts local_device and remote_device. Printing them with pdb shows that local_device corresponds to the cuda device and remote_device corresponds to the cpu. Since this context wraps the instantiation of the model class, the code that decides how model loading uses the cpu and gpu must be inside cls.

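As a point of reference, remote_device ends up as cpu because the deepspeed config enables parameter offload. Below is a minimal sketch (illustrative values, not the llama_factory config, and it assumes a distributed launch such as the deepspeed launcher) of such a config and of using zero.Init directly:

import deepspeed
import torch.nn as nn

# Minimal ZeRO-3 config with parameter offload to CPU.
# "offload_param.device" is what makes remote_device resolve to cpu;
# "pin_memory" corresponds to the pinned CPU allocation seen later.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

# Parameters created inside this context are partitioned right after each
# module's __init__, and their payload lives on the offload device instead
# of the GPU.
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    model = nn.Linear(4096, 4096)
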
Print cls to take a look; cls is the class whose initialization method will run when it is called.

(Pdb) p cls
<class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'>

Stepping into cls with pdb's s command, this may be hard to read if you don't use Python much: it is a wrapper decorator, i.e. an extra layer of wrapping, and what actually runs is the function f. This pattern appears a lot in this code base; it becomes familiar quickly:

@functools.wraps(f)
def wrapper(module, *args, **kwargs):
    ...
    f(module, *args, **kwargs)
    ...

return wrapper

The function f is the class's __init__:

(Pdb) p f
<function LlamaForCausalLM.__init__ at 0x7233bc15e8e0>

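To make the wrapper pattern concrete, here is a minimal, self-contained sketch (not deepspeed's actual code; add_post_init_hook is a made-up name) of how a class's __init__ can be patched so that a post-init hook runs right after construction, which is essentially what lets deepspeed partition parameters as soon as each module is built:

import functools

def add_post_init_hook(cls, post_init_hook):
    # Illustrative only: replace cls.__init__ with a wrapper that first runs
    # the original __init__ (which creates the parameters) and then the hook
    # (which could partition/offload them, as _post_init_method does below).
    f = cls.__init__

    @functools.wraps(f)
    def wrapper(module, *args, **kwargs):
        f(module, *args, **kwargs)
        post_init_hook(module)

    cls.__init__ = wrapper

class Toy:
    def __init__(self, n):
        self.weight = [0.0] * n

add_post_init_hook(Toy, lambda m: print("post-init hook, numel =", len(m.weight)))
Toy(4)  # prints: post-init hook, numel = 4
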
After a few layers of wrapping we arrive at this function (deepspeed/runtime/zero/partition_parameters.py, line 1076). dist.broadcast is presumably for communication between processes, which probably isn't really involved here with a single GPU:

def _post_init_method(self, module):
    ...
    for name, param in module.named_parameters(recurse=False):
        ...
        self._zero_init_param(param)
    ...

def _zero_init_param(self, param):
    self._convert_to_deepspeed_param(param)
    if dist.get_world_group() == self.get_dp_process_group():
        dist.broadcast(param.data, 0, self.get_dp_process_group())
    else:
        dist.broadcast(param.data, dist.get_global_rank(self.get_dp_process_group(), 0),
                       self.get_dp_process_group())
    param.partition()

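For intuition, dist.broadcast(tensor, src, group) copies the tensor on rank src to every other rank in the group, so all data-parallel processes start from identical weights. A tiny single-process sketch of the semantics (gloo backend on an arbitrary local port, purely illustrative; with world_size=1 the broadcast is a no-op, matching the single-GPU run here):

import torch
import torch.distributed as dist

# Single-process "group" just to make the call runnable.
dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                        rank=0, world_size=1)

param = torch.randn(4)
dist.broadcast(param, src=0)  # rank 0 is the source; other ranks would receive a copy
print(param)
dist.destroy_process_group()
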
After a few more layers of function wrapping, we reach the function that allocates the memory. Printing param.device here shows cuda, while partitioned_tensor.device shows cpu. What this function does is write the contents of param into param.ds_tensor (ds is short for deepspeed), then release param's (GPU) memory, with tensor.pin_memory allocating the (CPU) memory. What I'm not sure about is whether the parameters are written from the cache directly into CPU memory, or first from the cache into GPU memory and then into CPU memory.

@instrument_w_nvtx
def _partition_param(self, param, buffer=None, has_been_updated=False):

    tensor_size = self._aligned_size(param)
    partition_size = tensor_size // self.num_partitions
    if param.ds_tensor is None:
        final_location = None
        if self.remote_device == OffloadDeviceEnum.nvme and self.param_swapper.swappable_tensor(
                numel=partition_size):
            ...
        else:
            if param.ds_persist:
                device = self.local_device
            elif self.remote_device == OffloadDeviceEnum.nvme:
                device = OffloadDeviceEnum.cpu
            else:
                device = self.remote_device

            partitioned_tensor = torch.empty(partition_size, dtype=param.dtype, device=device)

            if device == OffloadDeviceEnum.cpu and self.pin_memory:
                partitioned_tensor = get_accelerator().pin_memory(partitioned_tensor)

        partitioned_tensor.requires_grad = False
        param.ds_tensor = partitioned_tensor
        param.ds_tensor.ds_numel = partition_size
        param.ds_tensor.status = PartitionedParamStatus.AVAILABLE
        param.ds_tensor.final_location = final_location
        param.ds_numel_aligned = tensor_size


partition_size = tensor_size // self.num_partitions is the parameter partitioning step. Here self.num_partitions is 1, so effectively all parameters are stored on the cpu. If instead of the cpu we used several GPUs, the parameters would presumably be partitioned across the different GPUs.

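To see the net effect, here is a sketch (not deepspeed's actual code) under the assumptions of a single partition, remote_device == cpu and pin_memory enabled: the payload is copied into a pinned CPU tensor and the GPU storage is released, which is why param.device reads cuda while partitioned_tensor.device reads cpu:

import torch

# Assumes a CUDA device is available.
param = torch.nn.Parameter(torch.randn(1024, device="cuda"))

num_partitions = 1                                 # offload to CPU, no slicing across GPUs
partition_size = param.numel() // num_partitions

ds_tensor = torch.empty(partition_size, dtype=param.dtype, device="cpu").pin_memory()
ds_tensor.copy_(param.data.view(-1))               # write the payload into pinned CPU memory
param.data = torch.empty(0, device="cuda")         # release the GPU storage

print(ds_tensor.device, param.data.numel())        # cpu 0
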
4. Assigning the parameters

Now that memory has been allocated on the cpu, printing param shows all zeros, so the following step must be the actual assignment of values.

Inside the _load_pretrained_model function (transformers/modeling_utils.py, line 4381) there is a piece of code that iterates over each checkpoint file and assigns the parameters:

for shard_file in resolved_archive_file:
    # Skip the load for shards that only contain disk-offloaded weights when using safetensors for the offload.
    if shard_file in disk_only_shard_files:
        continue
    state_dict = load_state_dict(shard_file, is_quantized=is_quantized)

    # Mistmatched keys contains tuples key/shape1/shape2 of weights in the checkpoint that have a shape not
    # matching the weights in the model.
    mismatched_keys += _find_mismatched_keys(
        state_dict,
        model_state_dict,
        original_loaded_keys,
        add_prefix_to_model,
        remove_prefix_from_model,
        ignore_mismatched_sizes,
    )
    if low_cpu_mem_usage:
        if is_fsdp_enabled() and not is_local_dist_rank_0() and not is_quantized:
            for key, param in model_to_load.state_dict().items():
                if param.device == torch.device("meta"):
                    set_module_tensor_to_device(
                        model_to_load, key, "cpu", torch.empty(*param.size(), dtype=dtype)
                    )
        else:
            new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
                model_to_load,
                state_dict,
                loaded_keys,
                start_prefix,
                expected_keys,
                device_map=device_map,
                offload_folder=offload_folder,
                offload_index=offload_index,
                state_dict_folder=state_dict_folder,
                state_dict_index=state_dict_index,
                dtype=dtype,
                hf_quantizer=hf_quantizer,
                is_safetensors=is_safetensors,
                keep_in_fp32_modules=keep_in_fp32_modules,
                unexpected_keys=unexpected_keys,
            )
            error_msgs += new_error_msgs
    else:
        # Sharded checkpoint or whole but low_cpu_mem_usage==True
        if assign_to_params_buffers is None:
            assign_to_params_buffers = check_support_param_buffer_assignment(
                model_to_load, state_dict, start_prefix
            )
        error_msgs += _load_state_dict_into_model(
            model_to_load, state_dict, start_prefix, assign_to_params_buffers
        )

    # force memory release
    del state_dict
    gc.collect()

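For a safetensors shard, load_state_dict essentially reads the file into an ordinary {parameter name: tensor} dict, which the loop above then copies into the model. A minimal sketch of inspecting one shard directly with the safetensors library (path reused from the pdb output above):

from safetensors.torch import load_file

shard_file = "/home/dell/sdb/.cache/Meta-Llama-3-8B-Instruct/model-00001-of-00004.safetensors"
state_dict = load_file(shard_file, device="cpu")   # {name: tensor} for this shard only
print(len(state_dict), list(state_dict)[0])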

