How transformers loads a model with from_pretrained under deepspeed

1. Preface: the structure of from_pretrained

Because GPU memory is not enough, deepspeed's zero3-offload can be used to keep the model parameters in CPU memory.

Without deepspeed, the run already hits out-of-memory at the model-loading step; with deepspeed the model loads successfully, which shows that deepspeed is already at work from the moment the model is loaded.

In the llama_factory project, this is where the model gets loaded (LLaMA-Factory/src/llamafactory/model/loader.py, around line 160):

load_class = AutoModelForCausalLM
model = load_class.from_pretrained(**init_kwargs)

Stepping into the from_pretrained definition with pdb's s command takes us into the transformers package.

We then arrive at python3.11/site-packages/transformers/models/auto/auto_factory.py (line 566):

model_class = _get_model_class(config, cls._model_mapping)
pdb.set_trace()
return model_class.from_pretrained(
    pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
)

Printing model_class shows:

(Pdb) p model_class
<class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'>

Continue stepping into the from_pretrained definition (site-packages/transformers/modeling_utils.py, around line 2894). It looks like a lot of code, but much of it is just definitions and initialization; the key work is done by three functions, which implement these three things:

  1. Load the checkpoint files into cache
  2. Instantiate the model class and allocate memory for the model parameters
  3. Assign values to the parameters

2. get_checkpoint_shard_files

# resolved_archive_file becomes a list of files that point to the different checkpoint shards in this case.
resolved_archive_file, sharded_metadata = get_checkpoint_shard_files(
    pretrained_model_name_or_path,
    resolved_archive_file,
    cache_dir=cache_dir,
    force_download=force_download,
    proxies=proxies,
    resume_download=resume_download,
    local_files_only=local_files_only,
    token=token,
    user_agent=user_agent,
    revision=revision,
    subfolder=subfolder,
    _commit_hash=commit_hash,
)

After this executes, both variables can be printed and inspected:

(Pdb) p resolved_archive_file
['/home/dell/sdb/.cache/Meta-Llama-3-8B-Instruct/model-00001-of-00004.safetensors', '/home/dell/sdb/.cache/Meta-Llama-3-8B-Instruct/model-00002-of-00004.safetensors', '/home/dell/sdb/.cache/Meta-Llama-3-8B-Instruct/model-00003-of-00004.safetensors', '/home/dell/sdb/.cache/Meta-Llama-3-8B-Instruct/model-00004-of-00004.safetensors']


(Pdb) p sharded_metadata
{'total_size': 16060522496, 'all_checkpoint_keys': ['lm_head.weight', 'model.embed_tokens.weight', 'model.layers.0.input_layernorm.weight', 'model.layers.0.mlp.down_proj.weight', 'model.layers.0.mlp.gate_proj.weight', 'model.layers.0.mlp.up_proj.weight', 'model.layers.0.post_attention_layernorm.weight',
... ...
'model.layers.9.post_attention_layernorm.weight': 'model-00002-of-00004.safetensors', 'model.layers.9.self_attn.k_proj.weight': 'model-00002-of-00004.safetensors', 'model.layers.9.self_attn.o_proj.weight': 'model-00002-of-00004.safetensors', 'model.layers.9.self_attn.q_proj.weight': 'model-00002-of-00004.safetensors', 'model.layers.9.self_attn.v_proj.weight': 'model-00002-of-00004.safetensors', 'model.norm.weight': 'model-00004-of-00004.safetensors'

]}

In effect, this function loads the checkpoint files into the cache and obtains the information about which model parameter lives in which checkpoint file.

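For reference, sharded_metadata is built from the checkpoint's index file (model.safetensors.index.json for safetensors checkpoints). A minimal sketch of inspecting that index directly, using only the standard library and reusing the local path from the pdb output above:

import json
import os

# Sharded HF checkpoints ship an index file recording the total size and
# the parameter-name -> shard-file mapping.
ckpt_dir = "/home/dell/sdb/.cache/Meta-Llama-3-8B-Instruct"
with open(os.path.join(ckpt_dir, "model.safetensors.index.json")) as fp:
    index = json.load(fp)

print(index["metadata"]["total_size"])           # 16060522496
print(index["weight_map"]["model.norm.weight"])  # model-00004-of-00004.safetensors
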
3. Instantiating the model class

deepspeed sets up a context, and then the class is instantiated:

init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
with ContextManagers(init_contexts):
    # Let's make sure we don't run the init function of buffer modules
    model = cls(config, *model_args, **model_kwargs)

What is worth paying attention to in deepspeed.zero.Init are the two concepts local_device and remote_device. Printing them with pdb shows that local_device corresponds to the cuda device and remote_device corresponds to the cpu. Since this context wraps the instantiation of the model class, the code that decides how model loading uses the cpu and gpu must be inside cls.

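As a point of reference, remote_device ends up as cpu because the deepspeed config enables parameter offload. Below is a minimal sketch (illustrative values, not the llama_factory config, and it assumes a distributed launch such as the deepspeed launcher) of such a config and of using zero.Init directly:

import deepspeed
import torch.nn as nn

# Minimal ZeRO-3 config with parameter offload to CPU.
# "offload_param.device" is what makes remote_device resolve to cpu;
# "pin_memory" corresponds to the pinned CPU allocation seen later.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

# Parameters created inside this context are partitioned right after each
# module's __init__, and their payload lives on the offload device instead
# of the GPU.
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    model = nn.Linear(4096, 4096)
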
Print cls to take a look; cls is the class whose initialization method will run when it is called.

(Pdb) p cls
<class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'>

Stepping into cls with pdb's s command, this may be hard to read if you don't use Python much: it is a wrapper decorator, i.e. an extra layer of wrapping, and what actually runs is the function f. This pattern appears a lot in this code base; it becomes familiar quickly:

@functools.wraps(f)
def wrapper(module, *args, **kwargs):
    ...
    f(module, *args, **kwargs)
    ...

return wrapper

The function f is the class's __init__:

(Pdb) p f
<function LlamaForCausalLM.__init__ at 0x7233bc15e8e0>

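To make the wrapper pattern concrete, here is a minimal, self-contained sketch (not deepspeed's actual code; add_post_init_hook is a made-up name) of how a class's __init__ can be patched so that a post-init hook runs right after construction, which is essentially what lets deepspeed partition parameters as soon as each module is built:

import functools

def add_post_init_hook(cls, post_init_hook):
    # Illustrative only: replace cls.__init__ with a wrapper that first runs
    # the original __init__ (which creates the parameters) and then the hook
    # (which could partition/offload them, as _post_init_method does below).
    f = cls.__init__

    @functools.wraps(f)
    def wrapper(module, *args, **kwargs):
        f(module, *args, **kwargs)
        post_init_hook(module)

    cls.__init__ = wrapper

class Toy:
    def __init__(self, n):
        self.weight = [0.0] * n

add_post_init_hook(Toy, lambda m: print("post-init hook, numel =", len(m.weight)))
Toy(4)  # prints: post-init hook, numel = 4
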
After a few layers of wrapping we arrive at this function (deepspeed/runtime/zero/partition_parameters.py, line 1076). dist.broadcast is presumably for communication between processes, which probably isn't really involved here with a single GPU:

def _post_init_method(self, module):
    ...
    for name, param in module.named_parameters(recurse=False):
        ...
        self._zero_init_param(param)
    ...

def _zero_init_param(self, param):
    self._convert_to_deepspeed_param(param)
    if dist.get_world_group() == self.get_dp_process_group():
        dist.broadcast(param.data, 0, self.get_dp_process_group())
    else:
        dist.broadcast(param.data, dist.get_global_rank(self.get_dp_process_group(), 0),
                       self.get_dp_process_group())
    param.partition()

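For intuition, dist.broadcast(tensor, src, group) copies the tensor on rank src to every other rank in the group, so all data-parallel processes start from identical weights. A tiny single-process sketch of the semantics (gloo backend on an arbitrary local port, purely illustrative; with world_size=1 the broadcast is a no-op, matching the single-GPU run here):

import torch
import torch.distributed as dist

# Single-process "group" just to make the call runnable.
dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                        rank=0, world_size=1)

param = torch.randn(4)
dist.broadcast(param, src=0)  # rank 0 is the source; other ranks would receive a copy
print(param)
dist.destroy_process_group()
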
After a few more layers of function wrapping, we reach the function that allocates the memory. Printing param.device here shows cuda, while partitioned_tensor.device shows cpu. What this function does is write the contents of param into param.ds_tensor (ds is short for deepspeed), then release param's (GPU) memory, with tensor.pin_memory allocating the (CPU) memory. What I'm not sure about is whether the parameters are written from the cache directly into CPU memory, or first from the cache into GPU memory and then into CPU memory.

@instrument_w_nvtx
def _partition_param(self, param, buffer=None, has_been_updated=False):

    tensor_size = self._aligned_size(param)
    partition_size = tensor_size // self.num_partitions
    if param.ds_tensor is None:
        final_location = None
        if self.remote_device == OffloadDeviceEnum.nvme and self.param_swapper.swappable_tensor(
                numel=partition_size):
            ...
        else:
            if param.ds_persist:
                device = self.local_device
            elif self.remote_device == OffloadDeviceEnum.nvme:
                device = OffloadDeviceEnum.cpu
            else:
                device = self.remote_device

            partitioned_tensor = torch.empty(partition_size, dtype=param.dtype, device=device)

            if device == OffloadDeviceEnum.cpu and self.pin_memory:
                partitioned_tensor = get_accelerator().pin_memory(partitioned_tensor)

        partitioned_tensor.requires_grad = False
        param.ds_tensor = partitioned_tensor
        param.ds_tensor.ds_numel = partition_size
        param.ds_tensor.status = PartitionedParamStatus.AVAILABLE
        param.ds_tensor.final_location = final_location
        param.ds_numel_aligned = tensor_size


partition_size = tensor_size // self.num_partitions is the parameter partitioning step. Here self.num_partitions is 1, so effectively all parameters are stored on the cpu. If instead of the cpu we used several GPUs, the parameters would presumably be partitioned across the different GPUs.

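To see the net effect, here is a sketch (not deepspeed's actual code) under the assumptions of a single partition, remote_device == cpu and pin_memory enabled: the payload is copied into a pinned CPU tensor and the GPU storage is released, which is why param.device reads cuda while partitioned_tensor.device reads cpu:

import torch

# Assumes a CUDA device is available.
param = torch.nn.Parameter(torch.randn(1024, device="cuda"))

num_partitions = 1                                 # offload to CPU, no slicing across GPUs
partition_size = param.numel() // num_partitions

ds_tensor = torch.empty(partition_size, dtype=param.dtype, device="cpu").pin_memory()
ds_tensor.copy_(param.data.view(-1))               # write the payload into pinned CPU memory
param.data = torch.empty(0, device="cuda")         # release the GPU storage

print(ds_tensor.device, param.data.numel())        # cpu 0
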
4. Assigning the parameters

Now that memory has been allocated on the cpu, printing param shows all zeros, so the following step must be the actual assignment of values.

Inside the _load_pretrained_model function (transformers/modeling_utils.py, line 4381) there is a piece of code that iterates over each checkpoint file and assigns the parameters:

for shard_file in resolved_archive_file:
    # Skip the load for shards that only contain disk-offloaded weights when using safetensors for the offload.
    if shard_file in disk_only_shard_files:
        continue
    state_dict = load_state_dict(shard_file, is_quantized=is_quantized)

    # Mistmatched keys contains tuples key/shape1/shape2 of weights in the checkpoint that have a shape not
    # matching the weights in the model.
    mismatched_keys += _find_mismatched_keys(
        state_dict,
        model_state_dict,
        original_loaded_keys,
        add_prefix_to_model,
        remove_prefix_from_model,
        ignore_mismatched_sizes,
    )
    if low_cpu_mem_usage:
        if is_fsdp_enabled() and not is_local_dist_rank_0() and not is_quantized:
            for key, param in model_to_load.state_dict().items():
                if param.device == torch.device("meta"):
                    set_module_tensor_to_device(
                        model_to_load, key, "cpu", torch.empty(*param.size(), dtype=dtype)
                    )
        else:
            new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
                model_to_load,
                state_dict,
                loaded_keys,
                start_prefix,
                expected_keys,
                device_map=device_map,
                offload_folder=offload_folder,
                offload_index=offload_index,
                state_dict_folder=state_dict_folder,
                state_dict_index=state_dict_index,
                dtype=dtype,
                hf_quantizer=hf_quantizer,
                is_safetensors=is_safetensors,
                keep_in_fp32_modules=keep_in_fp32_modules,
                unexpected_keys=unexpected_keys,
            )
            error_msgs += new_error_msgs
    else:
        # Sharded checkpoint or whole but low_cpu_mem_usage==True
        if assign_to_params_buffers is None:
            assign_to_params_buffers = check_support_param_buffer_assignment(
                model_to_load, state_dict, start_prefix
            )
        error_msgs += _load_state_dict_into_model(
            model_to_load, state_dict, start_prefix, assign_to_params_buffers
        )

    # force memory release
    del state_dict
    gc.collect()

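For a safetensors shard, load_state_dict essentially reads the file into an ordinary {parameter name: tensor} dict, which the loop above then copies into the model. A minimal sketch of inspecting one shard directly with the safetensors library (path reused from the pdb output above):

from safetensors.torch import load_file

shard_file = "/home/dell/sdb/.cache/Meta-Llama-3-8B-Instruct/model-00001-of-00004.safetensors"
state_dict = load_file(shard_file, device="cpu")   # {name: tensor} for this shard only
print(len(state_dict), list(state_dict)[0])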

