Ngram LM Experiments (2)

Train language model G1 on the 40M-sentence corpus and language model G2 on the 7M Chinese-English sentences, then interpolate the two models; the interpolation weight is determined by the overall ppl on the test set.
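The interpolation is a per-word linear mixture of the two models; the weight $\lambda$ is the one that maximizes the likelihood (i.e. minimizes the ppl) of the held-out test text, which is what compute-best-mix estimates below:

$\large P_{mix}(w_i|h)=\lambda P_{G2}(w_i|h)+(1-\lambda)P_{G1}(w_i|h)$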

Compute the test-set ppl separately with G1 and G2.

  • ==Word-based LM==
# Train an LM on the 7M Chinese-English sentences:
local/train_lms_1gram.sh ngram_7g_train_eng_word/lexicon ngram_7g_train_eng_word/text ngram_7g_train_eng_word/lm
# Compute the test-set ppl:
# ngram_testset/text_word contains 14k sentences, word-segmented with jieba
ngram -unk -map-unk "嫫" -order 3 -debug 2 -lm ngram_7g_train_eng_word/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset/text_word > ngram_testset/ppl/lm1.ppl
# reading 2316918 1-grams
# reading 61085443 2-grams
# reading 50367347 3-grams
# 0 zeroprobs, logprob= -979696.3 ppl= 536.1027 ppl1= 699.4325
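# (note) ppl vs ppl1 in the SRILM output: ppl normalizes the logprob over words plus end-of-sentence tokens,
# ppl1 over words only, i.e. ppl = 10^(-logprob/(words - OOVs + sentences)), ppl1 = 10^(-logprob/(words - OOVs)).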

# LM trained on the 40M Chinese and Chinese-English sentences (lexicon from ngram_train_word/lexicon_word):
ngram_7g_train_word/lm_train/srilm/srilm.o3g.kn.gz
# Compute the test-set ppl:
ngram -unk -map-unk "嫫" -order 3 -debug 2 -lm ngram_7g_train_word/lm_train/srilm/srilm.o3g.kn.gz -ppl ngram_testset/text_word > ngram_testset/ppl/lm2.ppl
# reading 767195 1-grams
# reading 81671836 2-grams
# reading 91884948 3-grams
# 0 zeroprobs, logprob= -971651.7 ppl= 509.14 ppl1= 662.8064


# Estimate the interpolation weights
compute-best-mix ngram_testset/ppl/* > ngram_testset/ppl/best_mix
# ngram_testset/ppl/best_mix:
# 3596414 non-oov words, best lambda (0.447615 0.552385)
# pairwise cumulative lambda (1 0.552385)

# Merge (interpolate) the two models
ngram -order 3 -lm ngram_7g_train_eng_word/lm/srilm/srilm.o3g.kn.gz -lambda 0.447615 -mix-lm ngram_7g_train_word/lm_train/srilm/srilm.o3g.kn.gz -write-lm ngram_merge_from_7g_train_eng_word_and_7g_train_word/merge_lm.arpa

# Compute the test-set ppl with the merged model
ngram -unk -map-unk "龢" -order 3 -debug 2 -lm ngram_merge_from_7g_train_eng_word_and_7g_train_word/merge_lm.arpa -ppl ngram_testset/text_word > ngram_testset/ppl/lm_merge.ppl
# 0 zeroprobs, logprob= -1.013367e+07 ppl= 657.2279 ppl1= 864.3976
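For intuition, a minimal sketch of the idea behind compute-best-mix (the per-word probabilities below are made-up numbers, not taken from the runs above): given the probability each LM assigns to every word of the same held-out text, the mixture weight is re-estimated by EM until it converges to the value that maximizes the likelihood of the mixture.

    # Sketch of the EM estimate behind compute-best-mix (toy numbers, not from the runs above).
    # p_g1[i], p_g2[i]: probability that G1 / G2 assigns to the i-th word of the held-out text.
    p_g1 = [0.012, 0.30, 0.004, 0.08, 0.021, 0.0007]
    p_g2 = [0.020, 0.10, 0.009, 0.05, 0.002, 0.0030]

    lam = 0.5                      # initial weight of G2
    for _ in range(100):           # EM iterations
        # E-step: posterior probability that each word was "generated" by G2
        post = [lam * b / (lam * b + (1 - lam) * a) for a, b in zip(p_g1, p_g2)]
        # M-step: new weight = average posterior
        lam = sum(post) / len(post)

    print("lambda(G2) = %.4f, lambda(G1) = %.4f" % (lam, 1 - lam))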

Results:

python compute_wer_lm_logprob.py --char=1 --v=1 data/xueyuan/text exp/seewo/conformer/test_xueyuan2/ppl_ngram_7g_train_eng_word > exp/seewo/conformer/test_xueyuan2/wer_am_lm_alpha_1


# wer_am_lm_alpha_0.7
Overall -> 4.23 % N=591056 C=569560 S=15493 D=6003 I=3529
Mandarin -> 3.77 % N=587425 C=568766 S=14212 D=4447 I=3497
English -> 78.88 % N=3608 C=794 S=1278 D=1536 I=32
Other -> 100.00 % N=23 C=0 S=3 D=20 I=0

# Original (wer_am, before rescoring)
# wer_am
Overall -> 4.59 % N=591056 C=568410 S=18137 D=4509 I=4499
Mandarin -> 4.15 % N=587425 C=567509 S=16695 D=3221 I=4449
English -> 76.41 % N=3608 C=901 S=1437 D=1270 I=50
Other -> 100.00 % N=23 C=0 S=5 D=18 I=0
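compute_wer_lm_logprob.py itself is not shown in this note; presumably the rescoring boils down to picking, per utterance, the n-best hypothesis with the largest am_score + alpha * lm_logprob. A minimal sketch (the tuple layout and the scores are hypothetical; only the alpha-weighted sum mirrors the setup above):

    # Hypothetical n-best rescoring: per utterance, pick the hypothesis that maximizes
    # am_score + alpha * lm_logprob. Field names and numbers are illustrative only.
    def rescore(nbest, alpha):
        """nbest: list of (text, am_score, lm_logprob) for one utterance."""
        return max(nbest, key=lambda h: h[1] + alpha * h[2])[0]

    # toy example: 3 hypotheses for one utterance
    hyps = [
        ("今 天 天 气 不 错", -12.3, -20.1),
        ("今 天 天 汽 不 错", -12.1, -25.7),
        ("今 天 天 气 不 粗", -12.8, -24.9),
    ]
    for alpha in (0.0, 0.1, 0.3):
        print(alpha, rescore(hyps, alpha))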

  • ==Character-based LM==

    Use the lexicon of Hua's acoustic model as the LM lexicon. Normally the LM lexicon is larger than the acoustic model's: it contains every character the acoustic model can output (to avoid OOVs) and can also cover characters that may appear later. Here I simply use the same lexicon first (or a slightly larger one covering all characters in the training text); in any case the English part has to be re-tokenized, so start with BPE. A quick coverage check is sketched after the paths below.

    Paths: (port 60000) 10.22.24.4:/home/data/yelong/kaldi/egs/librispeech1/s5, (port 51720) 10.22.22.2:/home/yelong/data/wenet/examples/multi_cn/s0
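    As a sanity check that the LM lexicon really covers every unit the acoustic model can emit, something like the following can be used (the file names units.txt / lexicon are placeholders, not the actual paths of this experiment):

    # Check which acoustic-model units are missing from the LM lexicon.
    # Paths are placeholders; both files are assumed to have the unit/word in the first column.
    am_units_file = "units.txt"      # acoustic-model modeling units (chars / BPE pieces)
    lm_lexicon_file = "lexicon"      # LM lexicon, one entry per line

    def first_column(path):
        with open(path, encoding="utf8") as f:
            return {line.split()[0] for line in f if line.strip()}

    missing = first_column(am_units_file) - first_column(lm_lexicon_file)
    print("%d AM units not covered by the LM lexicon" % len(missing))
    for u in sorted(missing):
        print(u)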

    # 10.22.24.2:~/data/wenet/examples/multi_cn/s0/data_4000_add_we/test_7g_train/text_space : sentences are space-separated here, but the English part has not been BPE-tokenized yet
    [unreliable!] tools/text2token.py -s 0 -n 1 data_4000_add_we/test_7g_train/text.... --trans_type phn > data_4000_add_we/test_7g_train/text_space
    # Strip leading spaces: sed -i 's/^ *//' ....



    # Then turn the spaces between English words into ▁. [new] With split_sentence.py this is NOT needed, i.e. the step below can be skipped.
    sed 's/\([A-Z]\) \([A-Z]\)/\1▁\2/g' data_4000_add_we/test_7g_train/text_space_eng | sed 's/\([A-Z]\) \([A-Z]\)/\1▁\2/g' | sed 's/\xEF\xBB\xBF//' > data_4000_add_we/test_7g_train/text_space_eng_add_line

    # Apply BPE:
    #tools/text2token.py -s 0 -n 1 -m /home/yelong/data/wenet/examples/aishell/s0/exp/aban-c007/100-101_unigram5000.model #data_4000_add_we/test_7g_train/text_space_eng_add_line --trans_type cn_char_en_bpe > data_4000_add_we/test_7g_train/text_space_eng_bpe_100-101_unigram5000
    [old] tools/text2token.py -s 0 -n 1 -m /home/yelong/data/wenet/examples/aishell/s0/exp/aban-c009/bpe.model data_4000_add_we/test_7g_train/text_space_eng_add_line --trans_type cn_char_en_bpe > data_4000_add_we/test_7g_train/text_space_eng_bpe_aban_c009
    # Strip leading spaces: sed -i 's/^ *//' ....
    [update] tools/text2token.py is unreliable; use the self-written split_sentence.py instead!
    [new] python split_sentence.py > data_4000_add_we/test_7g_train/text_space_eng_bpe_aban_c009


    # Some lines contain 'space' / garbled characters and have to be deleted
    sed -i '/space/d' data_4000_add_we/test_7g_train/text_space_eng_bpe_aban_c009

    # Then concatenate with the Chinese part
    cat data_4000_add_we/test_7g_train/text_space_cn data_4000_add_we/test_7g_train/text_space_eng_bpe_aban_c009 > data_4000_add_we/test_7g_train/text_space_eng_bpe_aban_c009_and_cn

    # The test set has to be processed the same way.


    # Train an LM on the 7M Chinese-English sentences (English as BPE subwords, Chinese as characters):
    local/train_lms_1gram.sh ngram_7g_train_hua_lexicon_eng/lexicon ngram_7g_train_hua_lexicon_eng/text_space_eng_bpe_aban_c009 ngram_7g_train_hua_lexicon_eng/lm

    # Compute the test-set ppl:
    # ngram_testset_char/text here contains 14k sentences
    ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_hua_lexicon_eng/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/text_space_eng_bpe_aban_c009 > ngram_testset_char/ppl/lm1.ppl
    # reading 12358 1-grams
    # reading 5639846 2-grams
    # reading 31839516 3-grams
    # 0 zeroprobs, logprob= -1052204 ppl= 54.52272 ppl1= 60.16961

    # Train an LM on the 40M Chinese and Chinese-English sentences (English as BPE subwords, Chinese as characters):
    local/train_lms_1gram.sh ngram_7g_train_hua_lexicon/lexicon ngram_7g_train_hua_lexicon/text_space_eng_bpe_aban_c009_and_cn ngram_7g_train_hua_lexicon/lm
    # Compute the test-set ppl:
    ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_hua_lexicon/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/text_space_eng_bpe_aban_c009 > ngram_testset_char/ppl/lm2.ppl


    # Compute ppl on the xueyuan test_20best hypotheses (already containing ▁), for rescoring
    # First add spaces to the decoded BPE output (which has no spaces) and re-tokenize it with BPE:
    tools/text2token.py -s 0 -n 1 -m /home/data/yelong/wenet/examples/aishell/s0/aban-c009/bpe.model /ngram_testset_char/aban_c009_xueyuan/text --trans_type cn_char_en_bpe > /ngram_testset_char/aban_c009_xueyuan/text_space_eng_bpe_aban_c009
    # Strip leading spaces: sed -i 's/^ *//' ....

    # Compute ppl on this test set:
    ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_hua_lexicon_eng/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/aban_c009_xueyuan/text_space_eng_bpe_aban_c009 > ngram_7g_train_hua_lexicon_eng/lm/ppl_aban_c009_xueyuan

    # Compute WER
    python compute_wer_lm.py --char=1 --v=1 data/xueyuan/text exp/aban-c009/test_xueyuan/ppl_value_aban_c009_xueyuan_test > exp/aban-c009/test_xueyuan/wer_am_lm_alpha_0.1
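    The step that turns the SRILM -ppl ... -debug 2 output into the per-hypothesis values consumed by compute_wer_lm.py is not shown here; a minimal sketch, assuming the usual per-sentence summary lines ("... logprob= X ppl= Y ppl1= Z") with the file-level summary appearing as the last match:

    # Extract one logprob/ppl pair per sentence from SRILM "-ppl ... -debug 2" output.
    # Assumes the standard per-sentence summary line format; the last match is the
    # whole-file summary and is dropped.
    import re, sys

    summary = re.compile(r"logprob=\s*(\S+)\s+ppl=\s*(\S+)\s+ppl1=\s*(\S+)")
    values = []
    with open(sys.argv[1], encoding="utf8") as f:    # e.g. the ppl_aban_c009_xueyuan file
        for line in f:
            m = summary.search(line)
            if m:
                values.append((m.group(1), m.group(2)))

    for logprob, ppl in values[:-1]:                 # drop the file-level summary
        print(logprob, ppl)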

    split_sentence.py:

    import re
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor()
    #sp.load("/home/yelong/data/wenet/examples/aishell/s0/exp/aban-c009/bpe.model")
    sp.load("/home/data/yelong/wenet/examples/aishell/s0/aban-c009/bpe.model")

    def __tokenize_by_bpe_model(sp, txt):
        tokens = []
        # CJK (China Japan Korea) unicode range is [U+4E00, U+9FFF], ref:
        # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
        pattern = re.compile(r'([\u4e00-\u9fff])')
        # Example:
        # txt = "你好 ITS'S OKAY 的"
        # chars = ["你", "好", " ITS'S OKAY ", "的"]
        chars = pattern.split(txt.upper())
        mix_chars = [w for w in chars if len(w.strip()) > 0]
        for ch_or_w in mix_chars:
            # ch_or_w is a single CJK character (e.g. "你"): keep it as-is.
            if pattern.fullmatch(ch_or_w) is not None:
                tokens.append(ch_or_w)
            # ch_or_w contains non-CJK characters (e.g. " IT'S OKAY "):
            # encode it with the BPE model.
            else:
                for p in sp.encode_as_pieces(ch_or_w):
                    tokens.append(p)
        return tokens

    # txt = "你好▁LET'S▁GO你 好"
    # print(" ".join(__tokenize_by_bpe_model(sp, txt)))
    #src_file='data_4000_add_we/test_7g_train/text_space_eng_add_line'
    #src_file='data_4000_add_we/test_7g_train/text_space_eng'
    #src_file='ngram_testset_char/text'
    #src_file='ngram_testset_char/text_add_line'
    #src_file='ngram_testset_char/aban_c009_xueyuan/text'
    src_file = 'ngram_testset_char/aban_c009_xueyuan/text_bpe'
    with open(src_file, "r", encoding="utf8") as fs:
        for line in fs:
            line = line.strip()
            print(" ".join(__tokenize_by_bpe_model(sp, line)))

    Result analysis:

    # Compare decoding with and without ngram rescoring:

    # baseline:aban-c009 ctc prefix beam search
    Overall -> 4.47 % N=591030 C=569205 S=17387 D=4438 I=4594
    Mandarin -> 4.28 % N=587425 C=566713 S=16692 D=4020 I=4405
    English -> 36.01 % N=3599 C=2492 S=694 D=413 I=189
    Other -> 100.00 % N=6 C=0 S=1 D=5 I=0

    # Oracle best case (the room for improvement that rescoring could exploit):
    python compute_wer_best.py --char=1 --v=1 data/xueyuan/text exp/aban-c009/test_xueyuan/text > exp/aban-c009/test_xueyuan/wer_20best

    Overall -> 2.56 % N=591032 C=578565 S=9711 D=2756 I=2678
    Mandarin -> 2.43 % N=587425 C=575731 S=9235 D=2459 I=2571
    English -> 24.27 % N=3601 C=2834 S=475 D=292 I=107
    Other -> 100.00 % N=6 C=0 S=1 D=5 I=0


    # Rescoring results:
    # ngram_7g_train_hua_lexicon_eng/text2token/lm:
    # ngram_7g_train_hua_lexicon_eng_lm_text2token/wer_am_lm_alpha_0.2
    Overall -> 4.12 % N=591032 C=571202 S=14610 D=5220 I=4515
    Mandarin -> 3.93 % N=587425 C=568694 S=14035 D=4696 I=4364
    English -> 34.55 % N=3601 C=2508 S=574 D=519 I=151
    Other -> 100.00 % N=6 C=0 S=1 D=5 I=0


    In the run above, tokens such as LET'S were not split, which is a bit of a problem. The experiment was redone with split_sentence.py; the new results are below.

    # Train an LM on the 7M Chinese-English sentences (English as BPE subwords, Chinese as characters):
    local/train_lms_1gram.sh ngram_7g_train_hua_lexicon_eng/lexicon ngram_7g_train_hua_lexicon_eng/text_space_eng_bpe_aban_c009 ngram_7g_train_hua_lexicon_eng/lm

    # Compute the test-set ppl:
    # ngram_testset_char/text here contains 14k sentences
    ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_hua_lexicon_eng/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/text_space_eng_bpe_aban_c009 > ngram_testset_char/ppl/lm1.ppl
    # reading 12358 1-grams
    # reading 5641167 2-grams
    # reading 31842760 3-grams
    # 0 zeroprobs, logprob= -1052185 ppl= 54.49828 ppl1= 60.1414



    # Train an LM on the 40M Chinese and Chinese-English sentences (English as BPE subwords, Chinese as characters):
    local/train_lms_1gram.sh ngram_7g_train_hua_lexicon/lexicon ngram_7g_train_hua_lexicon/text_space_eng_bpe_aban_c009_and_cn ngram_7g_train_hua_lexicon/lm
    # Compute the test-set ppl:
    ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_hua_lexicon/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/text_space_eng_bpe_aban_c009 > ngram_testset_char/ppl/lm2.ppl
    # reading 12358 1-grams
    # reading 6916323 2-grams
    # reading 50028207 3-grams
    # 0 zeroprobs, logprob= -1036973 ppl= 51.43724 ppl1= 56.6826


    # Compute ppl on the (20-best) test set:
    python word_sentence.py > ngram_testset_char/aban_c009_xueyuan/text_space_eng_bpe_aban_c009

    ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_hua_lexicon_eng/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/aban_c009_xueyuan/text_space_eng_bpe_aban_c009 > ngram_7g_train_hua_lexicon_eng/lm/ppl_aban_c009_xueyuan

    ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_hua_lexicon/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/aban_c009_xueyuan/text_space_eng_bpe_aban_c009 > ngram_7g_train_hua_lexicon/lm/ppl_aban_c009_xueyuan

    Results:

    # ngram_7g_train_hua_lexicon_eng_lm:
    # wer_am_lm_alpha_0.2
    Overall -> 4.12 % N=591032 C=571208 S=14605 D=5219 I=4515
    Mandarin -> 3.93 % N=587425 C=568695 S=14034 D=4696 I=4364
    English -> 34.41 % N=3601 C=2513 S=570 D=518 I=151
    Other -> 100.00 % N=6 C=0 S=1 D=5 I=0

    # ngram_7g_train_hua_lexicon_lm:
    # wer_am_lm_alpha_0.18
    Overall -> 4.10 % N=591032 C=571249 S=14659 D=5124 I=4453
    Mandarin -> 3.91 % N=587425 C=568747 S=14077 D=4601 I=4303
    English -> 34.68 % N=3601 C=2502 S=581 D=518 I=150
    Other -> 100.00 % N=6 C=0 S=1 D=5 I=0

  • ==Chinese as characters, English as words, so the LM does not have to be retrained whenever the bpe.model changes==

    # Split the test-set sentences into space-separated Chinese characters and space-separated English words
    python split_sentence_nobpe.py > ngram_testset_char/text_space

    # Train an LM on the 7M Chinese-English sentences (English as words, Chinese as characters):
    local/train_lms_1gram.sh ngram_7g_train_char/lexicon ngram_7g_train_char/text_space_eng ngram_7g_train_char/lm_eng

    # Compute the test-set ppl:
    # ngram_testset_char/text_space here contains 14k sentences
    ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_char/lm_eng/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/text_space > ngram_testset_char/ppl_char/lm1.ppl
    # reading 34111 1-grams
    # reading 6114244 2-grams
    # reading 31739092 3-grams
    # 0 zeroprobs, logprob= -1051788 ppl= 54.54516 ppl1= 60.19802


    # Train an LM on the 40M Chinese and Chinese-English sentences (English as words, Chinese as characters):
    local/train_lms_1gram.sh ngram_7g_train_char/lexicon ngram_7g_train_char/text_space ngram_7g_train_char/lm_eng_cn

    # Compute the test-set ppl:
    # ngram_testset_char/text_space here contains 14k sentences
    ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_char/lm_eng_cn/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/text_space > ngram_testset_char/ppl_char/lm2.ppl
    # reading 34111 1-grams
    # reading 7389346 2-grams
    # reading 49923410 3-grams
    # 0 zeroprobs, logprob= -1036779 ppl= 51.5195 ppl1= 56.77883

    # Compute ppl on the 20-best test set (290k lines)
    python split_sentence_nobpe.py > ngram_testset_char/aban_c009_xueyuan/text_space

    ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_char/lm_eng/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/aban_c009_xueyuan/text_space > ngram_7g_train_char/lm_eng/ppl_aban_c009_xueyuan

    ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_char/lm_eng_cn/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/aban_c009_xueyuan/text_space > ngram_7g_train_char/lm_eng_cn/ppl_aban_c009_xueyuan
    # Some sentences produce no ngram output at all; a blank line has to be inserted manually for each of them
    # qq is a file containing a single blank line
    sed -i '5600 r qq' ppl_value_aban_c009_xueyuan
    sed -i '11812 r qq' ppl_value_aban_c009_xueyuan
    sed -i '35862 r qq' ppl_value_aban_c009_xueyuan
    sed -i '58228 r qq' ppl_value_aban_c009_xueyuan
    sed -i '107511 r qq' ppl_value_aban_c009_xueyuan
    sed -i '117260 r qq' ppl_value_aban_c009_xueyuan
    sed -i '121642 r qq' ppl_value_aban_c009_xueyuan
    sed -i '208564 r qq' ppl_value_aban_c009_xueyuan
    sed -i '245827 r qq' ppl_value_aban_c009_xueyuan
    sed -i '290460 r qq' ppl_value_aban_c009_xueyuan

    Results:

    # Compare decoding with and without ngram rescoring:

    # baseline:aban-c009 ctc prefix beam search
    Overall -> 4.47 % N=591030 C=569205 S=17387 D=4438 I=4594
    Mandarin -> 4.28 % N=587425 C=566713 S=16692 D=4020 I=4405
    English -> 36.01 % N=3599 C=2492 S=694 D=413 I=189
    Other -> 100.00 % N=6 C=0 S=1 D=5 I=0

    # Oracle best case (the room for improvement that rescoring could exploit):
    python compute_wer_best.py --char=1 --v=1 data/xueyuan/text exp/aban-c009/test_xueyuan/text > exp/aban-c009/test_xueyuan/wer_20best

    Overall -> 2.56 % N=591032 C=578565 S=9711 D=2756 I=2678
    Mandarin -> 2.43 % N=587425 C=575731 S=9235 D=2459 I=2571
    English -> 24.27 % N=3601 C=2834 S=475 D=292 I=107
    Other -> 100.00 % N=6 C=0 S=1 D=5 I=0

    # ngram_7g_train_char/lm_eng:
    # ngram_7g_train_char_lm_eng/wer_am_lm_alpha_0.2
    Overall -> 4.12 % N=591032 C=571210 S=14609 D=5213 I=4518
    Mandarin -> 3.93 % N=587425 C=568703 S=14030 D=4692 I=4363
    English -> 34.68 % N=3601 C=2507 S=578 D=516 I=155
    Other -> 100.00 % N=6 C=0 S=1 D=5 I=0
    # Slightly worse on English than the BPE-subword LM, which suggests the English word vocabulary is still a bit too large; but the degradation is tiny and the pipeline is much simpler, so modeling English directly at the word level is also OK.

    # ngram_7g_train_char/lm_eng_cn:
    Overall -> 4.10 % N=591032 C=571281 S=14582 D=5169 I=4470
    Mandarin -> 3.91 % N=587425 C=568779 S=13999 D=4647 I=4319
    English -> 34.71 % N=3601 C=2502 S=582 D=517 I=151
    Other -> 100.00 % N=6 C=0 S=1 D=5 I=0

  • ==Summary==:

    • Building the English part of the LM directly on words is slightly worse than using BPE subwords, which suggests the English word vocabulary is still a bit too large; but the degradation is tiny and the computation becomes much simpler, so word-level modeling for English is also acceptable.

      However, with English words the test set can contain words that are missing from the lexicon, i.e. OOVs; BPE does not have this problem.

    • With more pure-Chinese training text (7M Chinese-English -> 40M pure Chinese + Chinese-English), the Chinese results improve while English degrades only slightly.

Results

  1. The character-based ngram LM gives a lower WER than the word-based ngram LM.
  2. The text matching the 3500 h of speech alone is not enough; the 7g in-domain (vertical-domain) text helps, and additional Chinese-English data from other domains still helps on top of that.
  3. The WER improvement is in fact small, and even smaller for English.
  4. The character-level tokenization probably still has some issues.

Reason analysis

  1. Why does character-level modeling work better than word-level modeling?

    1. First rule out OOV effects. Once it is clear the problem is not word OOVs, look at how the ngram is estimated: the 2-gram probability is $\large P(w_i|w_{i-1})=\frac{count(w_{i-1}w_i)}{count(w_{i-1})}$, where the numerator is the co-occurrence count and the denominator is the count of the preceding word. Word-level ngrams therefore have the following problems (see the sketch after this list):
      1. The vocabulary is huge (hundreds of thousands of words), so after a word such as "我" far too many words can follow and every probability is low.
      2. The training data is not large enough; the counts are spread thinly across word pairs, so all probabilities are low and the n-best hypotheses are not clearly separated.
      3. A wrong word that falls apart into characters can even score higher: e.g. "黝黑" gets logprob -4, while the mis-recognized "呦 嘿" gets a larger logprob, so the hypothesis with the wrong characters ends up with the smaller ppl.
    2. OOV effects: with many OOVs the probabilities are unreliable. When ngram scores an OOV its logprob is -inf, but this -inf is not added into the sentence logprob, which distorts the comparison.
  2. (explained by Hua) From the perspective of the recognized n-best lists: an LM judges whether a sentence reads like natural language, and its ability to judge word order is better than its ability to catch wrong characters. The n-best hypotheses usually differ by only one or two characters and hardly at all in word order, so they are hard to tell apart. Also, a word such as "中华人民共和国" is a single lexicon entry; if one character in an n-best hypothesis is wrong, it is no longer that word, its word probability vanishes and the text gets segmented into smaller pieces, so the probability advantage of the correct hypothesis no longer stands out.

  3. In principle, can ppl be used to measure which n-best hypothesis reads most like natural language?

    ppl is well suited to comparing how well a language model matches a data set; it can show, for example, that one LM matches a given data set better than another LM does.

    It may not be suited to comparing, under the same LM, the ppl of two sentences in order to decide which one is more natural.

ngram -unk -map-unk "" -lm 11/srilm/srilm.o3g.kn.gz -order 2 -ppl 2 -debug 2
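To make point 1.3 above concrete with toy arithmetic (made-up unigram counts, context ignored): a rare but correct word can receive a lower total log-probability than the two common characters it is mis-recognized as.

    # Toy numbers illustrating point 1.3: a rare but correct word can get a lower
    # log-probability than the two wrong characters it is confused with.
    import math

    N = 1_000_000                                  # total tokens in a hypothetical training corpus
    count = {"黝黑": 1, "呦": 1000, "嘿": 2000}     # made-up unigram counts

    logp = lambda w: math.log10(count[w] / N)

    print("logprob('黝黑')               =", round(logp("黝黑"), 2))                   # one rare word
    print("logprob('呦') + logprob('嘿') =", round(logp("呦") + logp("嘿"), 2))        # two common fillers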

Using more data

Use the 24M Chinese-English sentences from Xianxiang plus the 7M Chinese-English sentences of 7g_train, 31M sentences in total (ngram_7g_train_30w_2017_seewo_eng_word/text), on 24.4 at /home/data/yelong/kaldi/egs/librispeech1/s5/

Build an LM with Lin's lexicon.

cat ngram_7g_train_eng_word/text 30w_2017_seewo/30w_2017_seewo_nopunc_freq_cn_en > ngram_7g_train_30w_2017_seewo_eng_word/text
# Then merge the characters of the AM lexicon into Lin's lexicon (about 600 characters were missing from it), giving ngram_7g_train_30w_2017_seewo_eng_word/lexicon
# Train the LM
local/train_lms_1gram.sh ngram_7g_train_30w_2017_seewo_eng_word/lexicon ngram_7g_train_30w_2017_seewo_eng_word/text ngram_7g_train_30w_2017_seewo_eng_word/lm

## Compute ppl on the 14k-sentence test set
ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_30w_2017_seewo_eng_word/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset/text_word > ngram_testset/ppl_word/lm1.ppl
# reading 134090 1-grams
# reading 43612597 2-grams
# reading 55833679 3-grams
# 0 zeroprobs, logprob= -1066023 ppl= 514.9295 ppl1= 654.8651

# Compute ppl on the (20-best) test set:
python word_segment_again.py > ngram_testset/aban_c009_xueyuan/text_word
sed -i 's/ '\''/'\''/g' ngram_testset/aban_c009_xueyuan/text_word
sed -i 's/'\'' /'\''/g' ngram_testset/aban_c009_xueyuan/text_word
ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_30w_2017_seewo_eng_word/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset/aban_c009_xueyuan/text_word > ngram_7g_train_30w_2017_seewo_eng_word/lm/ppl_aban_c009_xueyuan

for alpha in 0.02 0.005 0.01 0.015 0.018 0.022 0.03 0.04 0.05 0.06 0.07 0.08 0.09; do
python compute_wer_lm.py --char=1 --v=1 data/xueyuan/text exp/aban-c009/test_xueyuan/ngram_7g_train_30w_2017_seewo_eng_word_lm/ppl_value_aban_c009_xueyuan_test $alpha > exp/aban-c009/test_xueyuan/ngram_7g_train_30w_2017_seewo_eng_word_lm/wer_am_lm_alpha_$alpha
done
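After the sweep, the best alpha can be read off the "Overall" line of each wer_am_lm_alpha_* file; a minimal helper, assuming the directory layout and the WER line format shown in this note:

    # Pick the alpha with the lowest overall WER from the wer_am_lm_alpha_* files.
    # Assumes each file contains a line like "Overall -> 4.32 % N=... C=..." as shown below.
    import glob, re

    best = None
    for path in glob.glob("exp/aban-c009/test_xueyuan/ngram_7g_train_30w_2017_seewo_eng_word_lm/wer_am_lm_alpha_*"):
        with open(path, encoding="utf8") as f:
            for line in f:
                m = re.search(r"Overall\s*->\s*([\d.]+)\s*%", line)
                if m:
                    wer = float(m.group(1))
                    if best is None or wer < best[0]:
                        best = (wer, path)
                    break

    print("best:", best)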

Results

# baseline:aban-c009 ctc prefix beam search
Overall -> 4.47 % N=591030 C=569205 S=17387 D=4438 I=4594
Mandarin -> 4.28 % N=587425 C=566713 S=16692 D=4020 I=4405
English -> 36.01 % N=3599 C=2492 S=694 D=413 I=189
Other -> 100.00 % N=6 C=0 S=1 D=5 I=0

# ngram_7g_train_30w_2017_seewo_eng_word_lm:
# wer_am_lm_alpha_0.007
Overall -> 4.32 % N=591032 C=570190 S=15903 D=4939 I=4702
Mandarin -> 4.14 % N=587425 C=567617 S=15300 D=4508 I=4521
English -> 33.55 % N=3601 C=2573 S=602 D=426 I=180
Other -> 116.67 % N=6 C=0 S=1 D=5 I=1
# Chinese improves by ~3% relative and English by ~6% relative. Adding the extra Chinese-English data helps English a little; the Chinese gain is smaller than before, but Chinese was already quite good.

ppl of the three word-level LMs on the 14k-sentence test set:

ngram -unk -map-unk "嫫" -order 3 -debug 2 -lm ngram_7g_train_eng_word/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset/text_word >  ngram_testset/ppl/lm1.ppl
# reading 2316918 1-grams
# reading 61085443 2-grams
# reading 50367347 3-grams
# 0 zeroprobs, logprob= -1099945 ppl= 628.1117 ppl1= 804.9393

ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_word/lm_train/srilm/srilm.o3g.kn.gz -ppl ngram_testset/text_word > ngram_testset/ppl/lm2.ppl
# reading 767195 1-grams
# reading 81671836 2-grams
# reading 91884948 3-grams
# 0 zeroprobs, logprob= -1087195 ppl= 582.9138 ppl1= 744.8724


ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_30w_2017_seewo_eng_word/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset/text_word > ngram_testset/ppl_word/lm1.ppl
# reading 134090 1-grams
# reading 43612597 2-grams
# reading 55833679 3-grams
# 0 zeroprobs, logprob= -1066023 ppl= 514.9295 ppl1= 654.8651

Scripts

Split sentences into space-separated tokens, with BPE segmentation

(text2token's BPE segmentation cannot split tokens such as LET'S)

split_sentence.py

import re
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("/home/yelong/data/wenet/examples/aishell/s0/exp/aban-c009/bpe.model")

def __tokenize_by_bpe_model(sp, txt):
    tokens = []
    # CJK (China Japan Korea) unicode range is [U+4E00, U+9FFF], ref:
    # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
    pattern = re.compile(r'([\u4e00-\u9fff])')
    # Example:
    # txt = "你好 ITS'S OKAY 的"
    # chars = ["你", "好", " ITS'S OKAY ", "的"]
    chars = pattern.split(txt.upper())
    mix_chars = [w for w in chars if len(w.strip()) > 0]
    for ch_or_w in mix_chars:
        # ch_or_w is a single CJK character (e.g. "你"): keep it as-is.
        if pattern.fullmatch(ch_or_w) is not None:
            tokens.append(ch_or_w)
        # ch_or_w contains non-CJK characters (e.g. " IT'S OKAY "):
        # encode it with the BPE model.
        else:
            for p in sp.encode_as_pieces(ch_or_w):
                tokens.append(p)
    return tokens

# txt = "你好▁LET'S▁GO你 好"
# print(" ".join(__tokenize_by_bpe_model(sp, txt)))
src_file = '....'
with open(src_file, "r", encoding="utf8") as fs:
    for line in fs:
        line = line.strip()
        print(" ".join(__tokenize_by_bpe_model(sp, line)))

Split sentences into space-separated tokens, also splitting English into words, without BPE segmentation (assuming there is no punctuation)

split_sentence_nobpe.py

import re
#import sentencepiece as spm
#sp = spm.SentencePieceProcessor()
#sp.load("/home/yelong/data/wenet/examples/aishell/s0/exp/aban-c009/bpe.model")

def __tokenize_by_bpe_model(txt):
    tokens = []
    # CJK (China Japan Korea) unicode range is [U+4E00, U+9FFF], ref:
    # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
    pattern = re.compile(r'([\u4e00-\u9fff])')
    # Example:
    # txt = "你好 ITS'S OKAY 的"
    # chars = ["你", "好", " ITS'S OKAY ", "的"]
    chars = pattern.split(txt.upper())
    mix_chars = [w for w in chars if len(w.strip()) > 0]
    for ch_or_w in mix_chars:
        # Unlike split_sentence.py, keep every chunk as-is: single CJK characters stay
        # characters, and English chunks keep their existing spaces between words (no BPE).
        #if pattern.fullmatch(ch_or_w) is not None:
        tokens.append(ch_or_w)
        #else:
        #    for p in sp.encode_as_pieces(ch_or_w):
        #        tokens.append(p)
    return tokens

# txt = "你好▁LET'S▁GO你 好"
# print(" ".join(__tokenize_by_bpe_model(sp, txt)))
#src_file='data_4000_add_we/test_7g_train/text_space_eng_add_line'
#src_file='data_4000_add_we/test_7g_train/text_space_eng'
src_file = 'ngram_testset_char/text'
with open(src_file, "r", encoding="utf8") as fs:
    for line in fs:
        line = line.strip()
        print(" ".join(__tokenize_by_bpe_model(line)))
        #print(" ".join(__tokenize_by_bpe_model(sp, line)))