HuggingFace training code

https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one

https://huggingface.co/transformers/v4.1.1/_modules/transformers/training_args.html

https://finance.sina.com.cn/tech/2021-01-17/doc-ikftpnnx8354067.shtml

https://huggingface.co/docs/transformers/model_doc/bert

https://www.cnblogs.com/wwj99/p/12283799.html

https://huggingface.co/docs/transformers/training

https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb#scrollTo=JEA1ju653l-p

https://zhuanlan.zhihu.com/p/363014957

https://zhuanlan.zhihu.com/p/360988428

https://www.yanxishe.com/columnDetail/26409

Trainer: https://huggingface.co/docs/transformers/main_classes/trainer

Use a pretrained LM as the base model and fine-tune the LM on our own data.

Path: 10.22.24.2:/home/yelong/data/transformers/run_mlm.py

Training, entering the training loop:

The call jumps from train_result = trainer.train(resume_from_checkpoint=checkpoint) into /home/yelong/data/miniconda3/envs/wenet/lib/python3.8/site-packages/transformers/trainer.py

(or /data_local/yelong/transformers/src/transformers/trainer.py; I ran pip install -e . inside the transformers git checkout, so the transformers path is no longer the one from the original pip install transformers)

inner_training_loop

Then it enters for step, inputs in enumerate(epoch_iterator) inside the _inner_training_loop function

Loss computation: tr_loss_step = self.training_step(model, inputs), which jumps to def training_step(self, model: nn.Module, inputs: Dict[str, Union[torch.Tensor, Any]]) -> torch.Tensor:
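Roughly, training_step puts the model in train mode, moves the inputs onto the device, runs compute_loss, scales for gradient accumulation, and backpropagates. A simplified sketch of that flow (fp32, single GPU; the real method in trainer.py also handles AMP, DeepSpeed, n_gpu > 1, etc.):

# Simplified sketch of Trainer.training_step, not the full implementation.
def training_step(self, model, inputs):
    model.train()
    inputs = self._prepare_inputs(inputs)      # move tensors to the training device
    loss = self.compute_loss(model, inputs)    # forward pass + loss
    if self.args.gradient_accumulation_steps > 1:
        loss = loss / self.args.gradient_accumulation_steps
    loss.backward()                            # accumulate gradients
    return loss.detach()                       # becomes tr_loss_step in the loop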

Then loss = self.compute_loss(model, inputs) computes cross-entropy only over the masked tokens; unmasked tokens contribute nothing to the loss (implementation: labels at non-masked positions are set to -100).
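The -100 trick works because PyTorch's CrossEntropyLoss has ignore_index=-100 by default, so those positions are simply dropped. A minimal self-contained example of the same computation BertForMaskedLM does internally:

import torch
import torch.nn as nn

vocab_size = 30522                                   # BERT vocab size
logits = torch.randn(2, 6, vocab_size)               # (batch, seq_len, vocab) from the model

# Labels: the original token id at masked positions, -100 everywhere else.
labels = torch.tensor([[-100, -100, 2307, -100, -100, -100],
                       [-100, 4658, -100, -100, -100, -100]])

# CrossEntropyLoss ignores targets equal to -100 (ignore_index=-100 by default),
# so only the two masked positions contribute to the loss.
loss_fct = nn.CrossEntropyLoss()
loss = loss_fct(logits.view(-1, vocab_size), labels.view(-1))
print(loss)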

The model: /data_local/yelong/transformers/src/transformers/models/bert/modeling_bert.py: class BertModel(BertPreTrainedModel):

class BertForMaskedLM(BertPreTrainedModel):

In run_mlm.py the model is a BertForMaskedLM
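BertForMaskedLM can also be loaded and queried directly outside run_mlm.py. A small sketch; bert-base-chinese is just an example checkpoint, and a local directory such as the TAL-EduBERT path above should work the same way if it is a standard BERT checkpoint:

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # example checkpoint
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

text = "今天天气很[MASK]。"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                              # (1, seq_len, vocab_size)

# Position of the [MASK] token and the model's top prediction for it.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
pred_id = logits[0, mask_pos].argmax(-1).item()
print(tokenizer.decode([pred_id]))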

Modifying the configuration

See /home/yelong/data/transformers/src/transformers/training_args.py

The arguments can be passed on the run_mlm.py command line
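Every field in training_args.py corresponds to a command-line flag of run_mlm.py, because run_mlm.py parses the flags into the TrainingArguments dataclass; the same options can also be set in code. A sketch with example values (not the ones actually used here):

from transformers import TrainingArguments

# Each keyword below maps to a --flag accepted by run_mlm.py.
training_args = TrainingArguments(
    output_dir="test-mlm2",
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=5,
    save_steps=500,
    logging_steps=100,
)
print(training_args.device)   # the device the Trainer will use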

GPU / CPU

To run on CPU:

export CUDA_VISIBLE_DEVICES=8, i.e. point it at a GPU index that does not exist, and training falls back to the CPU.

To pin a specific GPU:

At the top of the script, right after import os, add os.environ['CUDA_VISIBLE_DEVICES'] = '5'
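Either way, the environment variable has to be set before torch initializes CUDA, so put it before the first torch import (or at least before any CUDA call). A minimal sketch:

import os

# Must run before torch touches CUDA; "5" pins physical GPU 5,
# while a nonexistent index such as "8" on an 8-GPU box (0-7) forces CPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "5"

import torch
print(torch.cuda.is_available(), torch.cuda.device_count())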

Currently training on about 7 GB of data:

python run_mlm.py \
--model_name_or_path roberta-base \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--do_train \
--do_eval \
--output_dir ./test-mlm

# With our own data:



python run_mlm.py \
--model_name_or_path /home/yelong/data/edu-bert/models/TAL-EduBERT \
--train_file data/train/data.txt \
--validation_file data/dev/data.txt \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--do_train \
--do_eval \
--output_dir test-mlm2 \
--num_train_epochs 5

Note: even though the log shows device: cuda:0, it actually refers to physical GPU 5.

When only one GPU is used, whichever physical GPU you pick on the server shows up as GPU 0 at runtime.

When multiple GPUs are used, whichever physical GPUs you pick are renumbered GPU 0, GPU 1, ... at runtime.

With CUDA_VISIBLE_DEVICES=2,0,3, only physical GPUs 0, 2 and 3 are visible to the program, and in code gpu[0] refers to physical GPU 2, gpu[1] to physical GPU 0, and gpu[2] to physical GPU 3.
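This renumbering is easy to verify; a small sketch (assumes a machine with at least 4 GPUs):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2,0,3"   # physical GPUs 2, 0, 3

import torch
print(torch.cuda.device_count())               # 3
for i in range(torch.cuda.device_count()):
    # cuda:0 -> physical 2, cuda:1 -> physical 0, cuda:2 -> physical 3
    print(i, torch.cuda.get_device_name(i))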

Testing WER:

python compute_wer_lm.py --char=1 --v=1 data/xueyuan/wav/text exp/seewo/conformer/test_xueyuan1/ppl_checkpoint-1849500 > exp/seewo/conformer/test_xueyuan1/wer_am_lm_alpha_0
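compute_wer_lm.py itself is not reproduced here; for reference, WER is just the edit distance between hypothesis and reference divided by the reference length, computed per character here because of --char=1. A standalone sketch of that metric, independent of the script:

def edit_distance(ref, hyp):
    """Levenshtein distance via dynamic programming (single rolling row)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution / match
    return dp[-1]

ref = list("今天天气很好")      # character-level tokens (--char=1)
hyp = list("今天天汽很好")
print(edit_distance(ref, hyp) / len(ref))   # WER (here effectively CER) = 1/6 ≈ 0.167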
Traceback (most recent call last):
  File "run_mlm_pppl.py", line 28, in <module>
    logit = scorer.score_sentences([words[1]])[0]
  File "/data_local/yelong/mlm-scoring/src/mlm/scorers.py", line 167, in score_sentences
    return self.score(corpus, **kwargs)[0]
  File "/data_local/yelong/mlm-scoring/src/mlm/scorers.py", line 757, in score
    out = out[list(range(split_size)), token_masked_ids]
IndexError: too many indices for tensor of dimension 1
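The IndexError says out came back 1-dimensional where score() expected 2 dimensions, which can happen when the sentence passed to score_sentences has effectively nothing to mask (e.g. an empty or single-character transcript). A hedged workaround sketch; the guard and the text_file name are assumptions, not a confirmed fix in mlm-scoring:

# Hedged workaround sketch: skip transcripts that are empty or a single character,
# one plausible way `out` could collapse to 1-D inside mlm-scoring's score().
# (Assumption only; the real root cause should be checked at scorers.py line 757.)
with open(text_file, encoding="utf-8") as f:        # text_file: hypothetical path
    for line in f:
        words = line.strip().split(maxsplit=1)      # [utt_id, transcript]
        if len(words) < 2 or len(words[1]) < 2:
            continue
        logit = scorer.score_sentences([words[1]])[0]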