Using Hugging Face

https://github.com/huggingface/transformers

https://github.com/tal-tech/edu-bert

李理: Huggingface Transformer Tutorial (Part 1)

Tokenizer:https://huggingface.co/docs/transformers/tokenizer_summary

Loading the TAL (好未来) pre-trained model with Hugging Face

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("/home/yelong/data/edu-bert/models/TAL-EduBERT")

  • The model folder must contain:

    • the configuration file, config.json
    • the vocabulary file, vocab.txt
    • the pre-trained model weights, pytorch_model.bin
    • extra files, tokenizer_config.json and special_tokens_map.json; these are used by the tokenizer, so save them too if they are present. If they are not there, don't worry about it.

The Hugging Face transformers framework is built around three kinds of classes: model classes, configuration classes, and tokenizer classes. All related classes derive from these three, and they all provide a from_pretrained() method and a save_pretrained() method.

The first argument of from_pretrained is always pretrained_model_name_or_path; just set it to the directory containing the downloaded files.
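As a minimal sketch (reusing the local TAL-EduBERT directory above; the save_dir path is only an illustration), all three kinds of classes can be loaded from and written back to a directory:

from transformers import AutoConfig, AutoTokenizer, AutoModel

local_dir = "/home/yelong/data/edu-bert/models/TAL-EduBERT"  # contains config.json, vocab.txt, pytorch_model.bin

# All three base classes load from the same directory with from_pretrained()
config = AutoConfig.from_pretrained(local_dir)
tokenizer = AutoTokenizer.from_pretrained(local_dir)
model = AutoModel.from_pretrained(local_dir)

# ...and can be written back out with save_pretrained()
save_dir = "./TAL-EduBERT-copy"  # hypothetical output directory
tokenizer.save_pretrained(save_dir)  # writes vocab.txt, tokenizer_config.json, special_tokens_map.json
model.save_pretrained(save_dir)      # writes config.json and the model weights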

Loading a pre-trained model from the model hub with Hugging Face

from transformers import pipeline

classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")

Besides specifying the model argument by name, we can also pass a path to a directory containing the model, or pass a model object directly. If we pass a model object, we also need to pass the tokenizer.

We need two classes. One is AutoTokenizer, which we use to download and load the tokenizer that matches the model. The other is AutoModelForSequenceClassification (or TFAutoModelForSequenceClassification for TensorFlow). Note that the model class is task-specific; since this is a sentiment classification task, we use AutoModelForSequenceClassification.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
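Once built, the pipeline can be called directly on text. The label format below ('1 star' to '5 stars') is what this particular checkpoint uses, and the exact score is only illustrative:

result = classifier("We are very happy to show you the 🤗 Transformers library.")
print(result)  # e.g. [{'label': '5 stars', 'score': 0.77}] — the exact score will vary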

Usage

Tokenizer and model

from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Tokenizer

The tokenizer, roughly speaking, splits the text into tokens and maps each token to an integer ID; some models use subword tokenization. Either way, the goal is to turn a piece of text into a sequence of IDs, and it must also be able to do the reverse, turning a sequence of IDs back into text.

See the details at: https://huggingface.co/docs/transformers/tokenizer_summary

A tokenizer object is callable, so you can pass it a string directly and get back a dict. The most important field is the list of IDs, but it also returns an attention mask.
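For example, reusing the tokenizer loaded above (the exact IDs depend on the model's vocabulary):

encoded = tokenizer("We are very happy to show you the 🤗 Transformers library.")
print(encoded["input_ids"])                    # list of token IDs, including special tokens
print(encoded["attention_mask"])               # 1 for real tokens, 0 for padding
print(tokenizer.decode(encoded["input_ids"]))  # map the IDs back to text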

We can also pass in a whole batch of strings at once, which is convenient for batch processing. In that case we need to set padding to True, enable truncation, and set the maximum length:

pt_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

Model

The tokenizer's output can be fed into the model. For TensorFlow it can be passed in directly, while for PyTorch the arguments need to be unpacked with **:

# PyTorch
pt_outputs = pt_model(**pt_batch)
# TensorFlow
tf_outputs = tf_model(tf_batch)

Transformers model outputs are tuple-like: older versions return plain tuples (even a single result is a tuple of length 1), while newer versions return ModelOutput objects that can still be indexed like a tuple:

Classification task:

>>> print(pt_outputs)
(tensor([[-4.0833,  4.3364],
         [ 0.0818, -0.0418]], grad_fn=<AddmmBackward>),)

Transformers models return logits by default; if you want probabilities, apply a softmax yourself:

>>> import torch.nn.functional as F
>>> pt_predictions = F.softmax(pt_outputs[0], dim=-1)

This gives the same result as before:

>>> print(pt_predictions)
tensor([[2.2043e-04, 9.9978e-01],
        [5.3086e-01, 4.6914e-01]], grad_fn=<SoftmaxBackward>)

If we have the labels corresponding to the classification outputs, we can pass them in as well; the model will then compute the loss in addition to the logits:

>>> import torch
>>> pt_outputs = pt_model(**pt_batch, labels=torch.tensor([1, 0]))

The output is:

SequenceClassifierOutput(loss=tensor(0.3167, grad_fn=<NllLossBackward>), logits=tensor([[-4.0833,  4.3364],
        [ 0.0818, -0.0418]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

Language model: MLM [not used here; feel free to skip]

Compared with the tasks above, language modeling itself is rarely used as a standalone task. It is usually used to pre-train a base model, and the language model can also be fine-tuned on unlabeled in-domain data. For example, if our task is text classification, we can fine-tune a base BERT model on our classification data. However, the base BERT model was pre-trained on corpora such as Wikipedia, which may not match our task well. Labeling is usually expensive, so our classification dataset is often small, but unlabeled in-domain data may be plentiful. In that case we can first continue pre-training the base BERT with the language-modeling objective on the unlabeled in-domain data, and then fine-tune on the labeled data for the classification task.

If we need to fine-tune the MLM, see run_mlm.py. Below is an example using the pipeline:

>>> from transformers import pipeline
>>> nlp = pipeline("fill-mask")
>>> from pprint import pprint
>>> pprint(nlp(f"HuggingFace is creating a {nlp.tokenizer.mask_token} that the community uses to solve NLP tasks."))
[{'score': 0.1792745739221573,
  'sequence': '<s>HuggingFace is creating a tool that the community uses to '
              'solve NLP tasks.</s>',
  'token': 3944,
  'token_str': 'Ġtool'},
 {'score': 0.11349421739578247,
  'sequence': '<s>HuggingFace is creating a framework that the community uses '
              'to solve NLP tasks.</s>',
  'token': 7208,
  'token_str': 'Ġframework'},
 {'score': 0.05243554711341858,
  'sequence': '<s>HuggingFace is creating a library that the community uses to '
              'solve NLP tasks.</s>',
  'token': 5560,
  'token_str': 'Ġlibrary'},
 {'score': 0.03493533283472061,
  'sequence': '<s>HuggingFace is creating a database that the community uses '
              'to solve NLP tasks.</s>',
  'token': 8503,
  'token_str': 'Ġdatabase'},
 {'score': 0.02860250137746334,
  'sequence': '<s>HuggingFace is creating a prototype that the community uses '
              'to solve NLP tasks.</s>',
  'token': 17715,
  'token_str': 'Ġprototype'}]

The code above uses nlp.tokenizer.mask_token, which is exactly this special mask token. We can also build the tokenizer and model ourselves; the steps are:

  • Build the tokenizer and model, e.g. load a pre-trained DistilBERT from a checkpoint
  • Build the input sequence, replacing the word to be masked with tokenizer.mask_token
  • Use the tokenizer to turn the input into a list of IDs
  • Get the prediction; at the mask position its size is the vocabulary size, giving the probability of each candidate word
  • Take the top-k words with the highest probability

The code is as follows:

>>> from transformers import AutoModelWithLMHead, AutoTokenizer
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
>>> model = AutoModelWithLMHead.from_pretrained("distilbert-base-cased")

>>> sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."

>>> input = tokenizer.encode(sequence, return_tensors="pt")
>>> mask_token_index = torch.where(input == tokenizer.mask_token_id)[1]

>>> token_logits = model(input).logits
>>> mask_token_logits = token_logits[0, mask_token_index, :]

>>> top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

Note that we need to use AutoModelWithLMHead to build the model here (in newer versions of transformers this class is deprecated in favor of AutoModelForMaskedLM).

Output:

>>> for token in top_5_tokens:
...     print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.

Language model

Getting output probabilities from a generative model with Hugging Face

import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model1 = AutoModelForCausalLM.from_pretrained("/home/yelong/data/edu-bert/models/TAL-EduBERT", return_dict_in_generate=True, is_decoder=True)
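The snippet above only loads the model. As a minimal sketch (continuing from the imports above, with an illustrative prompt and generation settings, and assuming the checkpoint can be used as a decoder via is_decoder=True), the per-step probabilities of the generated tokens can be read out like this:

tokenizer1 = AutoTokenizer.from_pretrained("/home/yelong/data/edu-bert/models/TAL-EduBERT")
input_ids = tokenizer1("今天", return_tensors="pt").input_ids

# output_scores=True returns one logits tensor per generation step in gen.scores
gen = model1.generate(input_ids, max_new_tokens=5, do_sample=False,
                      output_scores=True, return_dict_in_generate=True)
for step, scores in enumerate(gen.scores):
    step_probs = torch.softmax(scores, dim=-1)                # logits -> probabilities over the vocabulary
    chosen_id = gen.sequences[0, input_ids.shape[1] + step]   # token actually generated at this step
    print(step, step_probs[0, chosen_id].item())              # its probability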

Input

By default the input includes the leading [CLS] and trailing [SEP] tokens; if you don't want them, pass add_special_tokens=False to drop them:

from transformers import BertTokenizer

path_to_TAL_EduBERT = "/home/yelong/data/edu-bert/models/TAL-EduBERT"  # local model directory
tokenizer = BertTokenizer.from_pretrained(path_to_TAL_EduBERT)

inputs = tokenizer("today is a nice day", return_tensors="pt")                            # [1, 7]
inputs = tokenizer("today is a nice day", return_tensors="pt", add_special_tokens=False)  # [1, 5]

Outputting language-model probabilities

We need the probability of each token as well as a probability score for the whole sentence. What the model outputs is not a probability: there is no softmax on the last layer, so a softmax has to be applied to get probabilities (plain probability values, not log-probabilities).

model = BertForMaskedLM.from_pretrained(path_to_TAL_EduBERT)
inputs = tokenizer("today is a nice day", return_tensors="pt", add_special_tokens=False)
outputs = model(**inputs).logits  # [1, 3, 21128]

# Full version:
from transformers import BertTokenizer, BertForMaskedLM
path_to_TAL_EduBERT = "/home/yelong/data/edu-bert/models/TAL-EduBERT"
tokenizer = BertTokenizer.from_pretrained(path_to_TAL_EduBERT)
model = BertForMaskedLM.from_pretrained(path_to_TAL_EduBERT)
inputs = tokenizer("today is a good day", return_tensors="pt", add_special_tokens=False)
outputs = model(**inputs).logits
# Sum the raw logit of each input token at its own position (no softmax applied here)
logit = 0
for i in range(len(outputs[0])):
    # print(outputs[0][i][inputs['input_ids'][0][i]])
    logit = logit + outputs[0][i][inputs['input_ids'][0][i]].item()
print(logit)
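As noted above, the logits only become probabilities after a softmax. Continuing from the variables in the snippet above, here is a small illustrative sketch that computes per-token probabilities and a sentence-level log-probability score:

import torch

probs = torch.softmax(outputs, dim=-1)      # [1, seq_len, vocab_size], plain probabilities
token_ids = inputs["input_ids"][0]
token_probs = probs[0, torch.arange(len(token_ids)), token_ids]  # probability of each observed token
sentence_log_prob = token_probs.log().sum().item()               # sentence score as a sum of log-probabilities
print(token_probs.tolist(), sentence_log_prob)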

Integration

Do joint evaluation with the sentences output by the CTC decoder and check the WER improvement:

  1. The CTC decoder outputs n-best texts; use the LM to pick the path with the highest LM probability, see 10.22.24.2:/home/yelong/data/wenet/examples/aishell/s0/compute-wer-lm.py
  2. The CTC decoder outputs n-best texts together with their scores; pick the path that maximizes a weighted sum of the LM score and the CTC score (a sketch follows below)
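Since compute-wer-lm.py itself is not reproduced here, the following is only a hypothetical sketch of option 2: ctc_nbest is assumed to be a list of (hypothesis_text, ctc_score) pairs from the CTC decoder, and lm_score is any sentence-level LM scoring function, e.g. the summed log-probabilities computed in the previous section:

def rescore(ctc_nbest, lm_score, lm_weight=0.5):
    # Pick the hypothesis maximizing a weighted sum of the CTC score and the LM score
    best_text, best_score = None, float("-inf")
    for text, ctc_score in ctc_nbest:
        score = ctc_score + lm_weight * lm_score(text)
        if score > best_score:
            best_text, best_score = text, score
    return best_text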

Pre-training the LM with your own data

/home/yelong/data/transformers/examples/pytorch/language-modeling/run_mlm.py

https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling

Official example:

python run_mlm.py \
    --model_name_or_path roberta-base \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm

To run on your own training and validation files, use the following command:

python run_mlm.py \
    --model_name_or_path roberta-base \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm

A concrete example:

python run_mlm.py \
    --model_name_or_path /home/yelong/data/edu-bert/models/TAL-EduBERT \
    --train_file data/train/data.txt \
    --validation_file data/dev/data.txt \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --do_train \
    --do_eval \
    --output_dir test-mlm2 \
    --num_train_epochs 5

Here, data/train/data.txt is just plain text.