HuggingFace training code

https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one

https://huggingface.co/transformers/v4.1.1/_modules/transformers/training_args.html

https://finance.sina.com.cn/tech/2021-01-17/doc-ikftpnnx8354067.shtml

https://huggingface.co/docs/transformers/model_doc/bert

https://www.cnblogs.com/wwj99/p/12283799.html

https://huggingface.co/docs/transformers/training

https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb#scrollTo=JEA1ju653l-p

https://zhuanlan.zhihu.com/p/363014957

https://zhuanlan.zhihu.com/p/360988428

https://www.yanxishe.com/columnDetail/26409

Trainer: https://huggingface.co/docs/transformers/main_classes/trainer

Use a pretrained LM as the base model and fine-tune the LM on our own data.

Path: 10.22.24.2:/home/yelong/data/transformers/run_mlm.py

Training, entering the training loop:

The call jumps from train_result = trainer.train(resume_from_checkpoint=checkpoint) into /home/yelong/data/miniconda3/envs/wenet/lib/python3.8/site-packages/transformers/trainer.py

(or /data_local/yelong/transformers/src/transformers/trainer.py; I ran pip install -e . inside the transformers git checkout, so the transformers path is no longer the one from the original pip install transformers)

inner_training_loop

Then it enters for step, inputs in enumerate(epoch_iterator) inside the _inner_training_loop function

Loss computation: tr_loss_step = self.training_step(model, inputs), which jumps to def training_step(self, model: nn.Module, inputs: Dict[str, Union[torch.Tensor, Any]]) -> torch.Tensor:
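Roughly, training_step puts the model in train mode, moves the inputs onto the device, runs compute_loss, scales for gradient accumulation, and backpropagates. A simplified sketch of that flow (fp32, single GPU; the real method in trainer.py also handles AMP, DeepSpeed, n_gpu > 1, etc.):

# Simplified sketch of Trainer.training_step, not the full implementation.
def training_step(self, model, inputs):
    model.train()
    inputs = self._prepare_inputs(inputs)      # move tensors to the training device
    loss = self.compute_loss(model, inputs)    # forward pass + loss
    if self.args.gradient_accumulation_steps > 1:
        loss = loss / self.args.gradient_accumulation_steps
    loss.backward()                            # accumulate gradients
    return loss.detach()                       # becomes tr_loss_step in the loop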

Then loss = self.compute_loss(model, inputs) computes cross-entropy only over the masked tokens; unmasked tokens contribute nothing to the loss (implementation: labels at non-masked positions are set to -100).
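The -100 trick works because PyTorch's CrossEntropyLoss has ignore_index=-100 by default, so those positions are simply dropped. A minimal self-contained example of the same computation BertForMaskedLM does internally:

import torch
import torch.nn as nn

vocab_size = 30522                                   # BERT vocab size
logits = torch.randn(2, 6, vocab_size)               # (batch, seq_len, vocab) from the model

# Labels: the original token id at masked positions, -100 everywhere else.
labels = torch.tensor([[-100, -100, 2307, -100, -100, -100],
                       [-100, 4658, -100, -100, -100, -100]])

# CrossEntropyLoss ignores targets equal to -100 (ignore_index=-100 by default),
# so only the two masked positions contribute to the loss.
loss_fct = nn.CrossEntropyLoss()
loss = loss_fct(logits.view(-1, vocab_size), labels.view(-1))
print(loss)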

The model: /data_local/yelong/transformers/src/transformers/models/bert/modeling_bert.py: class BertModel(BertPreTrainedModel):

class BertForMaskedLM(BertPreTrainedModel):

In run_mlm.py the model is a BertForMaskedLM
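BertForMaskedLM can also be loaded and queried directly outside run_mlm.py. A small sketch; bert-base-chinese is just an example checkpoint, and a local directory such as the TAL-EduBERT path above should work the same way if it is a standard BERT checkpoint:

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # example checkpoint
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

text = "今天天气很[MASK]。"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                              # (1, seq_len, vocab_size)

# Position of the [MASK] token and the model's top prediction for it.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
pred_id = logits[0, mask_pos].argmax(-1).item()
print(tokenizer.decode([pred_id]))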

Modifying the configuration

See /home/yelong/data/transformers/src/transformers/training_args.py

The arguments can be passed on the run_mlm.py command line
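Every field in training_args.py corresponds to a command-line flag of run_mlm.py, because run_mlm.py parses the flags into the TrainingArguments dataclass; the same options can also be set in code. A sketch with example values (not the ones actually used here):

from transformers import TrainingArguments

# Each keyword below maps to a --flag accepted by run_mlm.py.
training_args = TrainingArguments(
    output_dir="test-mlm2",
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=5,
    save_steps=500,
    logging_steps=100,
)
print(training_args.device)   # the device the Trainer will use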

GPU / CPU

To run on CPU:

export CUDA_VISIBLE_DEVICES=8, i.e. point it at a GPU index that does not exist, and training falls back to the CPU.

To pin a specific GPU:

At the top of the script, right after import os, add os.environ['CUDA_VISIBLE_DEVICES'] = '5'
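Either way, the environment variable has to be set before torch initializes CUDA, so put it before the first torch import (or at least before any CUDA call). A minimal sketch:

import os

# Must run before torch touches CUDA; "5" pins physical GPU 5,
# while a nonexistent index such as "8" on an 8-GPU box (0-7) forces CPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "5"

import torch
print(torch.cuda.is_available(), torch.cuda.device_count())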

Currently training on about 7 GB of data:

python run_mlm.py \
--model_name_or_path roberta-base \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--do_train \
--do_eval \
--output_dir ./test-mlm

# With our own data:



python run_mlm.py \
--model_name_or_path /home/yelong/data/edu-bert/models/TAL-EduBERT \
--train_file data/train/data.txt \
--validation_file data/dev/data.txt \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--do_train \
--do_eval \
--output_dir test-mlm2 \
--num_train_epochs 5

Note: even though the log shows device: cuda:0, it actually refers to physical GPU 5.

When only one GPU is used, whichever physical GPU you pick on the server shows up as GPU 0 at runtime.

When multiple GPUs are used, whichever physical GPUs you pick are renumbered GPU 0, GPU 1, ... at runtime.

With CUDA_VISIBLE_DEVICES=2,0,3, only physical GPUs 0, 2 and 3 are visible to the program, and in code gpu[0] refers to physical GPU 2, gpu[1] to physical GPU 0, and gpu[2] to physical GPU 3.
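This renumbering is easy to verify; a small sketch (assumes a machine with at least 4 GPUs):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2,0,3"   # physical GPUs 2, 0, 3

import torch
print(torch.cuda.device_count())               # 3
for i in range(torch.cuda.device_count()):
    # cuda:0 -> physical 2, cuda:1 -> physical 0, cuda:2 -> physical 3
    print(i, torch.cuda.get_device_name(i))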

Testing WER:

python compute_wer_lm.py --char=1 --v=1 data/xueyuan/wav/text exp/seewo/conformer/test_xueyuan1/ppl_checkpoint-1849500 > exp/seewo/conformer/test_xueyuan1/wer_am_lm_alpha_0
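compute_wer_lm.py itself is not reproduced here; for reference, WER is just the edit distance between hypothesis and reference divided by the reference length, computed per character here because of --char=1. A standalone sketch of that metric, independent of the script:

def edit_distance(ref, hyp):
    """Levenshtein distance via dynamic programming (single rolling row)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution / match
    return dp[-1]

ref = list("今天天气很好")      # character-level tokens (--char=1)
hyp = list("今天天汽很好")
print(edit_distance(ref, hyp) / len(ref))   # WER (here effectively CER) = 1/6 ≈ 0.167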
Traceback (most recent call last):
  File "run_mlm_pppl.py", line 28, in <module>
    logit = scorer.score_sentences([words[1]])[0]
  File "/data_local/yelong/mlm-scoring/src/mlm/scorers.py", line 167, in score_sentences
    return self.score(corpus, **kwargs)[0]
  File "/data_local/yelong/mlm-scoring/src/mlm/scorers.py", line 757, in score
    out = out[list(range(split_size)), token_masked_ids]
IndexError: too many indices for tensor of dimension 1
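The IndexError says out came back 1-dimensional where score() expected 2 dimensions, which can happen when the sentence passed to score_sentences has effectively nothing to mask (e.g. an empty or single-character transcript). A hedged workaround sketch; the guard and the text_file name are assumptions, not a confirmed fix in mlm-scoring:

# Hedged workaround sketch: skip transcripts that are empty or a single character,
# one plausible way `out` could collapse to 1-D inside mlm-scoring's score().
# (Assumption only; the real root cause should be checked at scorers.py line 757.)
with open(text_file, encoding="utf-8") as f:        # text_file: hypothetical path
    for line in f:
        words = line.strip().split(maxsplit=1)      # [utt_id, transcript]
        if len(words) < 2 or len(words[1]) < 2:
            continue
        logit = scorer.score_sentences([words[1]])[0]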