Train language model G1 on the 40M-sentence corpus and G2 on the 7M Chinese-English sentences, then interpolate the two models; the interpolation weight is chosen by the overall perplexity (ppl) on the test set.
Compute the test-set ppl with G1 and G2 separately.
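As a sketch of how the interpolation weight can be chosen: given per-token probabilities of the test set under G1 and G2 (e.g. extracted from `ngram -debug 2 -ppl` output), a simple grid search over the mixture weight minimizes the interpolated perplexity. SRILM's compute-best-mix script does essentially this with an EM procedure; the functions below are only an illustration, and the two per-token probability lists are assumed to be prepared separately.

```python
import math

def interpolated_ppl(p1, p2, lam):
    """Perplexity of the linear mixture lam*G1 + (1-lam)*G2.

    p1, p2: per-token probabilities of the same test text under G1 and G2.
    """
    log_sum = sum(math.log(lam * a + (1.0 - lam) * b) for a, b in zip(p1, p2))
    return math.exp(-log_sum / len(p1))

def best_mix_weight(p1, p2, steps=100):
    """Grid-search the weight on G1 that minimizes the interpolated test-set ppl."""
    grid = [i / steps for i in range(1, steps)]
    return min(grid, key=lambda lam: interpolated_ppl(p1, p2, lam))

# Usage (per-token probability lists prepared elsewhere):
# lam = best_mix_weight(probs_g1, probs_g2)
```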
- ==Word-based modeling==
Train an LM on the 7M Chinese-English sentences:
Results:
python compute_wer_lm_logprob.py --char=1 --v=1 data/xueyuan/text exp/seewo/conformer/test_xueyuan2/ppl_ngram_7g_train_eng_word > exp/seewo/conformer/test_xueyuan2/wer_am_lm_alpha_1
==Character-based modeling==
Use Hua-ge's acoustic-model dictionary as the LM dictionary. Normally the LM dictionary is larger than the acoustic-model one: it contains every character the acoustic model knows (to avoid OOV) and can also cover characters that may appear later. For now I simply use the same dictionary (or a slightly larger one covering all characters in the training set). In any case the English BPE has to be swapped in, so start with a round of BPE (a quick coverage check is sketched below).
Paths: (port 60000) 10.22.24.4:/home/data/yelong/kaldi/egs/librispeech1/s5, (port 51720) 10.22.22.2:/home/yelong/data/wenet/examples/multi_cn/s0
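As a sanity check for the dictionary choice above, the training text can be scanned for characters that the lexicon does not cover. This is a hypothetical helper, not part of the original pipeline: it assumes the modeling unit is in the first column of the lexicon file, and the paths in the commented usage are simply ones used elsewhere in these notes.

```python
def load_symbols(lexicon_path):
    # Assumes one entry per line, with the modeling unit (char / BPE piece) in the first column.
    with open(lexicon_path, encoding="utf8") as f:
        return {line.split()[0] for line in f if line.strip()}

def find_uncovered_chars(text_path, symbols):
    # Collect characters in the text that are missing from the lexicon
    # (if the text has a leading utterance id column, strip it first).
    missing = set()
    with open(text_path, encoding="utf8") as f:
        for line in f:
            missing.update(ch for ch in line.strip() if ch != " " and ch not in symbols)
    return missing

# symbols = load_symbols("ngram_7g_train_hua_lexicon/lexicon")
# print(find_uncovered_chars("data_4000_add_we/test_7g_train/text_space", symbols))
```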
10.22.24.2:~/data/wenet/examples/multi_cn/s0/data_4000_add_we/test_7g_train/text_space has the sentences separated by spaces, but the English is not yet BPE-segmented.
[Unreliable!] tools/text2token.py -s 0 -n 1 data_4000_add_we/test_7g_train/text.... --trans_type phn > data_4000_add_we/test_7g_train/text_space
Strip leading spaces: sed -i 's/^ *//' ....
Then turn the spaces between English words into ▁. [New] When using split_sentence.py this is not needed, i.e. the step below can be skipped:
sed 's/\([A-Z]\) \([A-Z]\)/\1▁\2/g' data_4000_add_we/test_7g_train/text_space_eng | sed 's/\([A-Z]\) \([A-Z]\)/\1▁\2/g' | sed 's/\xEF\xBB\xBF//' > data_4000_add_we/test_7g_train/text_space_eng_add_line
(the substitution is run twice because a single pass cannot rewrite overlapping matches such as "A B C"; the last sed strips the UTF-8 BOM)
Apply BPE:
tools/text2token.py -s 0 -n 1 -m /home/yelong/data/wenet/examples/aishell/s0/exp/aban-c007/100-101_unigram5000.model data_4000_add_we/test_7g_train/text_space_eng_add_line --trans_type cn_char_en_bpe > data_4000_add_we/test_7g_train/text_space_eng_bpe_100-101_unigram5000
[Old] tools/text2token.py -s 0 -n 1 -m /home/yelong/data/wenet/examples/aishell/s0/exp/aban-c009/bpe.model data_4000_add_we/test_7g_train/text_space_eng_add_line --trans_type cn_char_en_bpe > data_4000_add_we/test_7g_train/text_space_eng_bpe_aban_c009
Strip leading spaces: sed -i 's/^ *//' ....
[Update] tools/text2token.py is unreliable; use my own split_sentence.py instead!
[New] python split_sentence.py > data_4000_add_we/test_7g_train/text_space_eng_bpe_aban_c009
A few lines contain the literal string "space" (garbled content) and have to be deleted:
sed -i '/space/d' data_4000_add_we/test_7g_train/text_space_eng_bpe_aban_c009
Then concatenate with the Chinese text:
cat data_4000_add_we/test_7g_train/text_space_cn data_4000_add_we/test_7g_train/text_space_eng_bpe_aban_c009 > data_4000_add_we/test_7g_train/text_space_eng_bpe_aban_c009_and_cn
The test set has to be processed the same way.
Train an LM on the 7M Chinese-English sentences (English as BPE subwords, Chinese as characters):
local/train_lms_1gram.sh ngram_7g_train_hua_lexicon_eng/lexicon ngram_7g_train_hua_lexicon_eng/text_space_eng_bpe_aban_c009 ngram_7g_train_hua_lexicon_eng/lm
Test ppl:
Here ngram_testset_char/text is a text file of 14k sentences.
ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_hua_lexicon_eng/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/text_space_eng_bpe_aban_c009 > ngram_testset_char/ppl/lm1.ppl
reading 12358 1-grams
reading 5639846 2-grams
reading 31839516 3-grams
0 zeroprobs, logprob= -1052204 ppl= 54.52272 ppl1= 60.16961
Train an LM on the 40M Chinese and Chinese-English sentences (English as BPE subwords, Chinese as characters):
local/train_lms_1gram.sh ngram_7g_train_hua_lexicon/lexicon ngram_7g_train_hua_lexicon/text_space_eng_bpe_aban_c009_and_cn ngram_7g_train_hua_lexicon/lm
Test ppl:
ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_hua_lexicon/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/text_space_eng_bpe_aban_c009 > ngram_testset_char/ppl/lm2.ppl
Compute ppl on the xueyuan test_20best output (which already contains ▁), for rescoring.
First add spaces to the decoded results (BPE units, but no spaces) and apply BPE segmentation:
tools/text2token.py -s 0 -n 1 -m /home/data/yelong/wenet/examples/aishell/s0/aban-c009/bpe.model /ngram_testset_char/aban_c009_xueyuan/text --trans_type cn_char_en_bpe > /ngram_testset_char/aban_c009_xueyuan/text_space_eng_bpe_aban_c009
Strip leading spaces: sed -i 's/^ *//' ....
Compute the test-set ppl:
ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_hua_lexicon_eng/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/aban_c009_xueyuan/text_space_eng_bpe_aban_c009 > ngram_7g_train_hua_lexicon_eng/lm/ppl_aban_c009_xueyuan
Compute WER:
python compute_wer_lm.py --char=1 --v=1 data/xueyuan/text exp/aban-c009/test_xueyuan/ppl_value_aban_c009_xueyuan_test > exp/aban-c009/test_xueyuan/wer_am_lm_alpha_0.1
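compute_wer_lm.py itself is not reproduced here. Conceptually, rescoring picks, for each utterance, the 20-best hypothesis with the highest weighted combination of the decoder score and the n-gram LM score, where alpha is the weight appearing in names like wer_am_lm_alpha_0.1. A minimal sketch of that selection step, with a made-up field layout:

```python
def rescore_nbest(hyps, alpha=0.1):
    """Pick the best hypothesis from one utterance's n-best list.

    hyps: list of (text, decoder_score, lm_logprob) tuples, one per hypothesis.
    alpha: weight on the n-gram LM score.
    """
    return max(hyps, key=lambda h: h[1] + alpha * h[2])[0]

# Usage with two toy hypotheses:
# best = rescore_nbest([("你 好", -12.3, -8.1), ("拟 好", -12.1, -9.7)], alpha=0.1)
```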
split_sentence.py:

import re
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
#sp.load("/home/yelong/data/wenet/examples/aishell/s0/exp/aban-c009/bpe.model")
sp.load("/home/data/yelong/wenet/examples/aishell/s0/aban-c009/bpe.model")

def __tokenize_by_bpe_model(sp, txt):
    tokens = []
    # CJK (China Japan Korea) unicode range is [U+4E00, U+9FFF], ref:
    # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
    pattern = re.compile(r'([\u4e00-\u9fff])')
    # Example:
    #   txt = "你好 ITS'S OKAY 的"
    #   chars = ["你", "好", " ITS'S OKAY ", "的"]
    chars = pattern.split(txt.upper())
    mix_chars = [w for w in chars if len(w.strip()) > 0]
    for ch_or_w in mix_chars:
        # ch_or_w is a single CJK character (e.g. "你"): keep it as-is.
        if pattern.fullmatch(ch_or_w) is not None:
            tokens.append(ch_or_w)
        # ch_or_w contains non-CJK characters (e.g. " IT'S OKAY "):
        # encode it with the BPE model.
        else:
            for p in sp.encode_as_pieces(ch_or_w):
                tokens.append(p)
    return tokens

# txt = "你好▁LET'S▁GO你 好"
# print(" ".join(__tokenize_by_bpe_model(sp, txt)))

#src_file = 'data_4000_add_we/test_7g_train/text_space_eng_add_line'
#src_file = 'data_4000_add_we/test_7g_train/text_space_eng'
#src_file = 'ngram_testset_char/text'
#src_file = 'ngram_testset_char/text_add_line'
#src_file = 'ngram_testset_char/aban_c009_xueyuan/text'
src_file = 'ngram_testset_char/aban_c009_xueyuan/text_bpe'

with open(src_file, "r", encoding="utf8") as fs:
    for line in fs:
        line = line.strip()
        print(" ".join(__tokenize_by_bpe_model(sp, line)))

Result analysis:
Comparison with and without ngram rescoring:
Baseline: aban-c009 CTC prefix beam search
Overall -> 4.47 % N=591030 C=569205 S=17387 D=4438 I=4594
Mandarin -> 4.28 % N=587425 C=566713 S=16692 D=4020 I=4405
English -> 36.01 % N=3599 C=2492 S=694 D=413 I=189
Other -> 100.00 % N=6 C=0 S=1 D=5 I=0
Best case achievable (the room for improvement that rescoring has):
python compute_wer_best.py --char=1 --v=1 data/xueyuan/text exp/aban-c009/test_xueyuan/text > exp/aban-c009/test_xueyuan/wer_20best
Overall -> 2.56 % N=591032 C=578565 S=9711 D=2756 I=2678
Mandarin -> 2.43 % N=587425 C=575731 S=9235 D=2459 I=2571
English -> 24.27 % N=3601 C=2834 S=475 D=292 I=107
Other -> 100.00 % N=6 C=0 S=1 D=5 I=0
Rescoring results:
ngram_7g_train_hua_lexicon_eng/text2token/lm:
ngram_7g_train_hua_lexicon_eng_lm_text2token/wer_am_lm_alpha_0.2
Overall -> 4.12 % N=591032 C=571202 S=14610 D=5220 I=4515
Mandarin -> 3.93 % N=587425 C=568694 S=14035 D=4696 I=4364
English -> 34.55 % N=3601 C=2508 S=574 D=519 I=151
Other -> 100.00 % N=6 C=0 S=1 D=5 I=0
In the results above LET'S, for example, was not split apart, which is a bit of a problem. The experiment was redone with split_sentence.py; the new results follow.
Train an LM on the 7M Chinese-English sentences (English as BPE subwords, Chinese as characters):
local/train_lms_1gram.sh ngram_7g_train_hua_lexicon_eng/lexicon ngram_7g_train_hua_lexicon_eng/text_space_eng_bpe_aban_c009 ngram_7g_train_hua_lexicon_eng/lm
Test ppl:
Here ngram_testset_char/text is a text file of 14k sentences.
ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_hua_lexicon_eng/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/text_space_eng_bpe_aban_c009 > ngram_testset_char/ppl/lm1.ppl
reading 12358 1-grams
reading 5641167 2-grams
reading 31842760 3-grams
0 zeroprobs, logprob= -1052185 ppl= 54.49828 ppl1= 60.1414
Train an LM on the 40M Chinese and Chinese-English sentences (English as BPE subwords, Chinese as characters):
local/train_lms_1gram.sh ngram_7g_train_hua_lexicon/lexicon ngram_7g_train_hua_lexicon/text_space_eng_bpe_aban_c009_and_cn ngram_7g_train_hua_lexicon/lm
Test ppl:
ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_hua_lexicon/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/text_space_eng_bpe_aban_c009 > ngram_testset_char/ppl/lm2.ppl
reading 12358 1-grams
reading 6916323 2-grams
reading 50028207 3-grams
0 zeroprobs, logprob= -1036973 ppl= 51.43724 ppl1= 56.6826
Compute the test-set ppl:
python word_sentence.py > ngram_testset_char/aban_c009_xueyuan/text_space_eng_bpe_aban_c009
ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_hua_lexicon_eng/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/aban_c009_xueyuan/text_space_eng_bpe_aban_c009 > ngram_7g_train_hua_lexicon_eng/lm/ppl_aban_c009_xueyuan
ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_hua_lexicon/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/aban_c009_xueyuan/text_space_eng_bpe_aban_c009 > ngram_7g_train_hua_lexicon/lm/ppl_aban_c009_xueyuan
Results:
ngram_7g_train_hua_lexicon_eng_lm:
wer_am_lm_alpha_0.2
Overall -> 4.12 % N=591032 C=571208 S=14605 D=5219 I=4515
Mandarin -> 3.93 % N=587425 C=568695 S=14034 D=4696 I=4364
English -> 34.41 % N=3601 C=2513 S=570 D=518 I=151
Other -> 100.00 % N=6 C=0 S=1 D=5 I=0
ngram_7g_train_hua_lexicon_lm:
wer_am_lm_alpha_0.18
Overall -> 4.10 % N=591032 C=571249 S=14659 D=5124 I=4453
Mandarin -> 3.91 % N=587425 C=568747 S=14077 D=4601 I=4303
English -> 34.68 % N=3601 C=2502 S=581 D=518 I=150
Other -> 100.00 % N=6 C=0 S=1 D=5 I=0
==Chinese as characters, English as words, so the LM does not have to be retrained every time bpe.model changes==
Split the test-set sentences into space-separated Chinese characters and space-separated English words:
python split_sentence_nobpe.py > ngram_testset_char/text_space
Train an LM on the 7M Chinese-English sentences (English as words, Chinese as characters):
local/train_lms_1gram.sh ngram_7g_train_char/lexicon ngram_7g_train_char/text_space_eng ngram_7g_train_char/lm_eng
Test ppl:
Here ngram_testset_char/text_space is a text file of 14k sentences.
ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_char/lm_eng/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/text_space > ngram_testset_char/ppl_char/lm1.ppl
reading 34111 1-grams
reading 6114244 2-grams
reading 31739092 3-grams
0 zeroprobs, logprob= -1051788 ppl= 54.54516 ppl1= 60.19802
Train an LM on the 40M Chinese and Chinese-English sentences (English as words, Chinese as characters):
local/train_lms_1gram.sh ngram_7g_train_char/lexicon ngram_7g_train_char/text_space ngram_7g_train_char/lm_eng_cn
Test ppl:
Here ngram_testset_char/text_space is a text file of 14k sentences.
ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_char/lm_eng_cn/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/text_space > ngram_testset_char/ppl_char/lm2.ppl
reading 34111 1-grams
reading 7389346 2-grams
reading 49923410 3-grams
0 zeroprobs, logprob= -1036779 ppl= 51.5195 ppl1= 56.77883
Compute the test-set ppl (20-best, ~290k lines):
python split_sentence_nobpe.py > ngram_testset_char/aban_c009_xueyuan/text_space
ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_char/lm_eng/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/aban_c009_xueyuan/text_space > ngram_7g_train_char/lm_eng/ppl_aban_c009_xueyuan
ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_char/lm_eng_cn/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/aban_c009_xueyuan/text_space > ngram_7g_train_char/lm_eng_cn/ppl_aban_c009_xueyuan
Some lines have no ppl output (the ngram computation returns nothing for those sentences), so blank lines have to be inserted for them by hand.
qq is a file containing a single empty line; sed's r command appends its contents after the given line number:
sed -i '5600 r qq' ppl_value_aban_c009_xueyuan
sed -i '11812 r qq' ppl_value_aban_c009_xueyuan
sed -i '35862 r qq' ppl_value_aban_c009_xueyuan
sed -i '58228 r qq' ppl_value_aban_c009_xueyuan
sed -i '107511 r qq' ppl_value_aban_c009_xueyuan
sed -i '117260 r qq' ppl_value_aban_c009_xueyuan
sed -i '121642 r qq' ppl_value_aban_c009_xueyuan
sed -i '208564 r qq' ppl_value_aban_c009_xueyuan
sed -i '245827 r qq' ppl_value_aban_c009_xueyuan
sed -i '290460 r qq' ppl_value_aban_c009_xueyuan
Results:
Comparison with and without ngram rescoring:
Baseline: aban-c009 CTC prefix beam search
Overall -> 4.47 % N=591030 C=569205 S=17387 D=4438 I=4594
Mandarin -> 4.28 % N=587425 C=566713 S=16692 D=4020 I=4405
English -> 36.01 % N=3599 C=2492 S=694 D=413 I=189
Other -> 100.00 % N=6 C=0 S=1 D=5 I=0
Best case achievable (the room for improvement that rescoring has):
python compute_wer_best.py --char=1 --v=1 data/xueyuan/text exp/aban-c009/test_xueyuan/text > exp/aban-c009/test_xueyuan/wer_20best
Overall -> 2.56 % N=591032 C=578565 S=9711 D=2756 I=2678
Mandarin -> 2.43 % N=587425 C=575731 S=9235 D=2459 I=2571
English -> 24.27 % N=3601 C=2834 S=475 D=292 I=107
Other -> 100.00 % N=6 C=0 S=1 D=5 I=0
ngram_7g_train_char/lm_eng:
ngram_7g_train_char_lm_eng/wer_am_lm_alpha_0.2
Overall -> 4.12 % N=591032 C=571210 S=14609 D=5213 I=4518
Mandarin -> 3.93 % N=587425 C=568703 S=14030 D=4692 I=4363
English -> 34.68 % N=3601 C=2507 S=578 D=516 I=155
Other -> 100.00 % N=6 C=0 S=1 D=5 I=0
The English result is worse than with BPE segmentation, which suggests the number of English word units is still a bit too large; but it is only a tiny bit worse and the computation is much more convenient, so modeling directly on words is also OK.
ngram_7g_train_char/lm_eng_cn:
Overall -> 4.10 % N=591032 C=571281 S=14582 D=5169 I=4470
Mandarin -> 3.91 % N=587425 C=568779 S=13999 D=4647 I=4319
English -> 34.71 % N=3601 C=2502 S=582 D=517 I=151
Other -> 100.00 % N=6 C=0 S=1 D=5 I=0
==Summary==:
Using whole words for English in the LM is slightly worse than BPE subwords, which suggests the English word vocabulary is a bit too large; but the degradation is tiny and the computation is much more convenient, so word-level modeling is also acceptable.
However, with whole English words the test set can contain words missing from the lexicon, i.e. OOVs appear; BPE does not have this problem (a quick check is sketched below).
With more pure-Chinese training text for the LM (7M Chinese-English -> 40M pure Chinese + Chinese-English), the Chinese part of the result improves while English only gets a tiny bit worse.
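To see why BPE sidesteps the OOV issue: any English word, even one absent from the word lexicon, still decomposes into in-vocabulary subword pieces. A quick check using the same sentencepiece model as in split_sentence.py (the example word is arbitrary, not taken from the test set):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("/home/data/yelong/wenet/examples/aishell/s0/aban-c009/bpe.model")

# A word that is unlikely to appear in any word lexicon still maps to known
# BPE pieces, so the subword LM never sees a true OOV token.
print(sp.encode_as_pieces("UNCOPYRIGHTABLE"))
```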
Results
- A character-level ngram LM gives lower WER than a word-level ngram LM
- The text of the 3500-hour corpus alone is not enough; the 7G in-domain text helps, and helps more than extra Chinese-English data from other domains
- The WER improvement is actually small, and even smaller for English
- The character-level segmentation probably still has some issues
Analysis
Why does character-level modeling work better than word-level modeling?
- First rule out OOV interference. Having confirmed the errors are not word OOVs, look at the ngram model itself: the bigram probability is $\large P(w_i|w_{i-1})=\frac{count(w_{i-1}w_i)}{count(w_{i-1})}$, where the numerator is the co-occurrence count and the denominator is the count of the previous word. A word-level ngram therefore has the following problems (see the toy example after this list):
  - The vocabulary is huge (hundreds of thousands of words); after a word such as "我" far too many different words can follow, so every probability ends up very low?
  - The training data is not large enough, so the counts are spread thinly across word pairs; all probabilities are low and the n-best hypotheses are not clearly separated.
  - A wrong word split into characters can even score higher: e.g. "黝黑" has logprob -4, but "呦 嘿" gets a larger logprob, so the hypothesis with the wrong characters ends up with the smaller ppl.
  - OOV effects: with many OOVs the probabilities are inaccurate. ngram scores an OOV with logprob -inf, but that -inf is not added into the sentence logprob, which causes errors.
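A toy illustration of the count-based estimate above, showing how word-level statistics thin out compared with character-level statistics (the two-sentence corpus is made up, purely for illustration):

```python
from collections import Counter

def bigram_prob(tokenized_sentences, prev, cur):
    """MLE bigram estimate P(cur | prev) = count(prev, cur) / count(prev)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in tokenized_sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    return bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0

# Word units: each multi-character word is its own symbol, so counts fragment quickly.
word_corpus = [["我", "喜欢", "黝黑"], ["我", "喜欢", "皮肤"]]
# Character units: the same text shares far more statistics across sentences.
char_corpus = [list("我喜欢黝黑"), list("我喜欢皮肤")]

print(bigram_prob(word_corpus, "喜欢", "黝黑"))  # 0.5
print(bigram_prob(char_corpus, "喜", "欢"))      # 1.0
```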
(Hua-ge's explanation) Looking at it from the recognized n-best: an LM judges whether a sentence reads like natural language, and its ability to judge word order is better than its ability to spot wrong characters. The n-best hypotheses usually differ by only one or two characters and contain few word-order errors, so they are hard to tell apart. Also, a phrase such as "中国人民共和国" is one word; if a single character in an n-best hypothesis is wrong, it is no longer that word, the word's probability is gone, the segmentation falls back to smaller pieces, and the word-level probability advantage disappears.
From first principles: to decide which n-best sentence reads most like natural language, is ppl the right measure?
ppl is well suited to comparing how well a language model matches a data set, e.g. showing that one LM matches a given data set better than another LM does;
it is probably not suited to comparing the ppl of two sentences under the same LM to decide which one is more natural.
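For reference, perplexity is a length-normalized inverse probability (SRILM reports base-10 logprob), so comparing hypotheses of different lengths by ppl is not the same as comparing their raw log-probabilities:

$\large ppl(W)=P(w_1\dots w_N)^{-\frac{1}{N}}=10^{-\frac{\log_{10}P(w_1\dots w_N)}{N}}$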
ngram -unk -map-unk “
Using more data
Use the 24M Chinese-English sentences from Xianxiang plus the 7M Chinese-English sentences of 7g_train, 31M sentences in total (ngram_7g_train_30w_2017_seewo_eng_word/text), on 24.4 under /home/data/yelong/kaldi/egs/librispeech1/s5/
Build an LM with Lin-jie's lexicon.
cat ngram_7g_train_eng_word/text 30w_2017_seewo/30w_2017_seewo_nopunc_freq_cn_en > ngram_7g_train_30w_2017_seewo_eng_word/text
Results
Baseline: aban-c009 CTC prefix beam search
Compare the ppl of the three word-level models on the 14k-sentence test set:
ngram -unk -map-unk "嫫" -order 3 -debug 2 -lm ngram_7g_train_eng_word/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset/text_word > ngram_testset/ppl/lm1.ppl
Scripts
Split a sentence into space-separated tokens, with BPE segmentation
(text2token's BPE segmentation cannot split LET'S apart)
split_sentence.py
(full listing above, in the character-based section)
Split a sentence into space-separated tokens, with English also split into words and no BPE (assuming no punctuation)
split_sentence_nobpe.py:
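The body of split_sentence_nobpe.py is not included above. Based on the description (Chinese split into space-separated characters, English kept as space-separated whole words, no BPE), here is a minimal sketch of how such a script could look; the input path is reused from earlier commands and the original implementation may differ.

```python
import re

# Split each line into space-separated Chinese characters and English words (no BPE).
cjk = re.compile(r'([\u4e00-\u9fff])')

src_file = 'ngram_testset_char/text'  # path reused from earlier commands; adjust as needed

with open(src_file, "r", encoding="utf8") as fs:
    for line in fs:
        tokens = []
        for piece in cjk.split(line.strip().upper()):
            piece = piece.strip()
            if not piece:
                continue
            if cjk.fullmatch(piece):
                tokens.append(piece)          # single CJK character
            else:
                tokens.extend(piece.split())  # English: keep whole words
        print(" ".join(tokens))
```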