Ngram LM Experiments (2)

Train language model G1 on the 40M-sentence corpus and language model G2 on the 7M Chinese-English sentences, then interpolate the two models; the interpolation weight is determined by the overall ppl on the test set.
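The interpolation is a per-word linear mixture of the two models; the weight $\lambda$ is the one that maximizes the likelihood (i.e. minimizes the ppl) of the held-out test text, which is what compute-best-mix estimates below:

$\large P_{mix}(w_i|h)=\lambda P_{G2}(w_i|h)+(1-\lambda)P_{G1}(w_i|h)$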

Compute the test-set ppl separately with G1 and G2.

  • ==Word-based LM==
# Train an LM on the 7M Chinese-English sentences:
local/train_lms_1gram.sh ngram_7g_train_eng_word/lexicon ngram_7g_train_eng_word/text ngram_7g_train_eng_word/lm
# Compute the test-set ppl:
# ngram_testset/text_word contains 14k sentences, word-segmented with jieba
ngram -unk -map-unk "嫫" -order 3 -debug 2 -lm ngram_7g_train_eng_word/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset/text_word > ngram_testset/ppl/lm1.ppl
# reading 2316918 1-grams
# reading 61085443 2-grams
# reading 50367347 3-grams
# 0 zeroprobs, logprob= -979696.3 ppl= 536.1027 ppl1= 699.4325
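# (note) ppl vs ppl1 in the SRILM output: ppl normalizes the logprob over words plus end-of-sentence tokens,
# ppl1 over words only, i.e. ppl = 10^(-logprob/(words - OOVs + sentences)), ppl1 = 10^(-logprob/(words - OOVs)).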

# LM trained on the 40M Chinese and Chinese-English sentences (lexicon from ngram_train_word/lexicon_word):
ngram_7g_train_word/lm_train/srilm/srilm.o3g.kn.gz
# Compute the test-set ppl:
ngram -unk -map-unk "嫫" -order 3 -debug 2 -lm ngram_7g_train_word/lm_train/srilm/srilm.o3g.kn.gz -ppl ngram_testset/text_word > ngram_testset/ppl/lm2.ppl
# reading 767195 1-grams
# reading 81671836 2-grams
# reading 91884948 3-grams
# 0 zeroprobs, logprob= -971651.7 ppl= 509.14 ppl1= 662.8064


# Estimate the interpolation weights
compute-best-mix ngram_testset/ppl/* > ngram_testset/ppl/best_mix
# ngram_testset/ppl/best_mix:
# 3596414 non-oov words, best lambda (0.447615 0.552385)
# pairwise cumulative lambda (1 0.552385)

# Merge (interpolate) the two models
ngram -order 3 -lm ngram_7g_train_eng_word/lm/srilm/srilm.o3g.kn.gz -lambda 0.447615 -mix-lm ngram_7g_train_word/lm_train/srilm/srilm.o3g.kn.gz -write-lm ngram_merge_from_7g_train_eng_word_and_7g_train_word/merge_lm.arpa

# Compute the test-set ppl with the merged model
ngram -unk -map-unk "龢" -order 3 -debug 2 -lm ngram_merge_from_7g_train_eng_word_and_7g_train_word/merge_lm.arpa -ppl ngram_testset/text_word > ngram_testset/ppl/lm_merge.ppl
# 0 zeroprobs, logprob= -1.013367e+07 ppl= 657.2279 ppl1= 864.3976
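For intuition, a minimal sketch of the idea behind compute-best-mix (the per-word probabilities below are made-up numbers, not taken from the runs above): given the probability each LM assigns to every word of the same held-out text, the mixture weight is re-estimated by EM until it converges to the value that maximizes the likelihood of the mixture.

    # Sketch of the EM estimate behind compute-best-mix (toy numbers, not from the runs above).
    # p_g1[i], p_g2[i]: probability that G1 / G2 assigns to the i-th word of the held-out text.
    p_g1 = [0.012, 0.30, 0.004, 0.08, 0.021, 0.0007]
    p_g2 = [0.020, 0.10, 0.009, 0.05, 0.002, 0.0030]

    lam = 0.5                      # initial weight of G2
    for _ in range(100):           # EM iterations
        # E-step: posterior probability that each word was "generated" by G2
        post = [lam * b / (lam * b + (1 - lam) * a) for a, b in zip(p_g1, p_g2)]
        # M-step: new weight = average posterior
        lam = sum(post) / len(post)

    print("lambda(G2) = %.4f, lambda(G1) = %.4f" % (lam, 1 - lam))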

Results:

python compute_wer_lm_logprob.py --char=1 --v=1 data/xueyuan/text exp/seewo/conformer/test_xueyuan2/ppl_ngram_7g_train_eng_word > exp/seewo/conformer/test_xueyuan2/wer_am_lm_alpha_1


# wer_am_lm_alpha_0.7
Overall -> 4.23 % N=591056 C=569560 S=15493 D=6003 I=3529
Mandarin -> 3.77 % N=587425 C=568766 S=14212 D=4447 I=3497
English -> 78.88 % N=3608 C=794 S=1278 D=1536 I=32
Other -> 100.00 % N=23 C=0 S=3 D=20 I=0

# Original (wer_am, before rescoring)
# wer_am
Overall -> 4.59 % N=591056 C=568410 S=18137 D=4509 I=4499
Mandarin -> 4.15 % N=587425 C=567509 S=16695 D=3221 I=4449
English -> 76.41 % N=3608 C=901 S=1437 D=1270 I=50
Other -> 100.00 % N=23 C=0 S=5 D=18 I=0
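compute_wer_lm_logprob.py itself is not shown in this note; presumably the rescoring boils down to picking, per utterance, the n-best hypothesis with the largest am_score + alpha * lm_logprob. A minimal sketch (the tuple layout and the scores are hypothetical; only the alpha-weighted sum mirrors the setup above):

    # Hypothetical n-best rescoring: per utterance, pick the hypothesis that maximizes
    # am_score + alpha * lm_logprob. Field names and numbers are illustrative only.
    def rescore(nbest, alpha):
        """nbest: list of (text, am_score, lm_logprob) for one utterance."""
        return max(nbest, key=lambda h: h[1] + alpha * h[2])[0]

    # toy example: 3 hypotheses for one utterance
    hyps = [
        ("今 天 天 气 不 错", -12.3, -20.1),
        ("今 天 天 汽 不 错", -12.1, -25.7),
        ("今 天 天 气 不 粗", -12.8, -24.9),
    ]
    for alpha in (0.0, 0.1, 0.3):
        print(alpha, rescore(hyps, alpha))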

  • ==Character-based LM==

    Use the lexicon of Hua's acoustic model as the LM lexicon. Normally the LM lexicon is larger than the acoustic model's: it contains every character the acoustic model can output (to avoid OOVs) and can also cover characters that may appear later. Here I simply use the same lexicon first (or a slightly larger one covering all characters in the training text); in any case the English part has to be re-tokenized, so start with BPE. A quick coverage check is sketched after the paths below.

    Paths: (port 60000) 10.22.24.4:/home/data/yelong/kaldi/egs/librispeech1/s5, (port 51720) 10.22.22.2:/home/yelong/data/wenet/examples/multi_cn/s0
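    As a sanity check that the LM lexicon really covers every unit the acoustic model can emit, something like the following can be used (the file names units.txt / lexicon are placeholders, not the actual paths of this experiment):

    # Check which acoustic-model units are missing from the LM lexicon.
    # Paths are placeholders; both files are assumed to have the unit/word in the first column.
    am_units_file = "units.txt"      # acoustic-model modeling units (chars / BPE pieces)
    lm_lexicon_file = "lexicon"      # LM lexicon, one entry per line

    def first_column(path):
        with open(path, encoding="utf8") as f:
            return {line.split()[0] for line in f if line.strip()}

    missing = first_column(am_units_file) - first_column(lm_lexicon_file)
    print("%d AM units not covered by the LM lexicon" % len(missing))
    for u in sorted(missing):
        print(u)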

    # 10.22.24.2:~/data/wenet/examples/multi_cn/s0/data_4000_add_we/test_7g_train/text_space : sentences are space-separated here, but the English part has not been BPE-tokenized yet
    [unreliable!] tools/text2token.py -s 0 -n 1 data_4000_add_we/test_7g_train/text.... --trans_type phn > data_4000_add_we/test_7g_train/text_space
    # Strip leading spaces: sed -i 's/^ *//' ....



    # Then turn the spaces between English words into ▁. [new] With split_sentence.py this is NOT needed, i.e. the step below can be skipped.
    sed 's/\([A-Z]\) \([A-Z]\)/\1▁\2/g' data_4000_add_we/test_7g_train/text_space_eng | sed 's/\([A-Z]\) \([A-Z]\)/\1▁\2/g' | sed 's/\xEF\xBB\xBF//' > data_4000_add_we/test_7g_train/text_space_eng_add_line

    # Apply BPE:
    #tools/text2token.py -s 0 -n 1 -m /home/yelong/data/wenet/examples/aishell/s0/exp/aban-c007/100-101_unigram5000.model #data_4000_add_we/test_7g_train/text_space_eng_add_line --trans_type cn_char_en_bpe > data_4000_add_we/test_7g_train/text_space_eng_bpe_100-101_unigram5000
    [old] tools/text2token.py -s 0 -n 1 -m /home/yelong/data/wenet/examples/aishell/s0/exp/aban-c009/bpe.model data_4000_add_we/test_7g_train/text_space_eng_add_line --trans_type cn_char_en_bpe > data_4000_add_we/test_7g_train/text_space_eng_bpe_aban_c009
    # Strip leading spaces: sed -i 's/^ *//' ....
    [update] tools/text2token.py is unreliable; use the self-written split_sentence.py instead!
    [new] python split_sentence.py > data_4000_add_we/test_7g_train/text_space_eng_bpe_aban_c009


    # Some lines contain 'space' / garbled characters and have to be deleted
    sed -i '/space/d' data_4000_add_we/test_7g_train/text_space_eng_bpe_aban_c009

    # Then concatenate with the Chinese part
    cat data_4000_add_we/test_7g_train/text_space_cn data_4000_add_we/test_7g_train/text_space_eng_bpe_aban_c009 > data_4000_add_we/test_7g_train/text_space_eng_bpe_aban_c009_and_cn

    # The test set has to be processed the same way.


    # Train an LM on the 7M Chinese-English sentences (English as BPE subwords, Chinese as characters):
    local/train_lms_1gram.sh ngram_7g_train_hua_lexicon_eng/lexicon ngram_7g_train_hua_lexicon_eng/text_space_eng_bpe_aban_c009 ngram_7g_train_hua_lexicon_eng/lm

    # Compute the test-set ppl:
    # ngram_testset_char/text here contains 14k sentences
    ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_hua_lexicon_eng/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/text_space_eng_bpe_aban_c009 > ngram_testset_char/ppl/lm1.ppl
    # reading 12358 1-grams
    # reading 5639846 2-grams
    # reading 31839516 3-grams
    # 0 zeroprobs, logprob= -1052204 ppl= 54.52272 ppl1= 60.16961

    # Train an LM on the 40M Chinese and Chinese-English sentences (English as BPE subwords, Chinese as characters):
    local/train_lms_1gram.sh ngram_7g_train_hua_lexicon/lexicon ngram_7g_train_hua_lexicon/text_space_eng_bpe_aban_c009_and_cn ngram_7g_train_hua_lexicon/lm
    # Compute the test-set ppl:
    ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_hua_lexicon/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/text_space_eng_bpe_aban_c009 > ngram_testset_char/ppl/lm2.ppl


    # Compute ppl on the xueyuan test_20best hypotheses (already containing ▁), for rescoring
    # First add spaces to the decoded BPE output (which has no spaces) and re-tokenize it with BPE:
    tools/text2token.py -s 0 -n 1 -m /home/data/yelong/wenet/examples/aishell/s0/aban-c009/bpe.model /ngram_testset_char/aban_c009_xueyuan/text --trans_type cn_char_en_bpe > /ngram_testset_char/aban_c009_xueyuan/text_space_eng_bpe_aban_c009
    # Strip leading spaces: sed -i 's/^ *//' ....

    # Compute ppl on this test set:
    ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_hua_lexicon_eng/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/aban_c009_xueyuan/text_space_eng_bpe_aban_c009 > ngram_7g_train_hua_lexicon_eng/lm/ppl_aban_c009_xueyuan

    # Compute WER
    python compute_wer_lm.py --char=1 --v=1 data/xueyuan/text exp/aban-c009/test_xueyuan/ppl_value_aban_c009_xueyuan_test > exp/aban-c009/test_xueyuan/wer_am_lm_alpha_0.1
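    The step that turns the SRILM -ppl ... -debug 2 output into the per-hypothesis values consumed by compute_wer_lm.py is not shown here; a minimal sketch, assuming the usual per-sentence summary lines ("... logprob= X ppl= Y ppl1= Z") with the file-level summary appearing as the last match:

    # Extract one logprob/ppl pair per sentence from SRILM "-ppl ... -debug 2" output.
    # Assumes the standard per-sentence summary line format; the last match is the
    # whole-file summary and is dropped.
    import re, sys

    summary = re.compile(r"logprob=\s*(\S+)\s+ppl=\s*(\S+)\s+ppl1=\s*(\S+)")
    values = []
    with open(sys.argv[1], encoding="utf8") as f:    # e.g. the ppl_aban_c009_xueyuan file
        for line in f:
            m = summary.search(line)
            if m:
                values.append((m.group(1), m.group(2)))

    for logprob, ppl in values[:-1]:                 # drop the file-level summary
        print(logprob, ppl)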

    split_sentence.py:

    import re
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor()
    #sp.load("/home/yelong/data/wenet/examples/aishell/s0/exp/aban-c009/bpe.model")
    sp.load("/home/data/yelong/wenet/examples/aishell/s0/aban-c009/bpe.model")

    def __tokenize_by_bpe_model(sp, txt):
        tokens = []
        # CJK (China Japan Korea) unicode range is [U+4E00, U+9FFF], ref:
        # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
        pattern = re.compile(r'([\u4e00-\u9fff])')
        # Example:
        # txt = "你好 ITS'S OKAY 的"
        # chars = ["你", "好", " ITS'S OKAY ", "的"]
        chars = pattern.split(txt.upper())
        mix_chars = [w for w in chars if len(w.strip()) > 0]
        for ch_or_w in mix_chars:
            # ch_or_w is a single CJK character (e.g. "你"): keep it as-is.
            if pattern.fullmatch(ch_or_w) is not None:
                tokens.append(ch_or_w)
            # ch_or_w contains non-CJK characters (e.g. " IT'S OKAY "):
            # encode it with the BPE model.
            else:
                for p in sp.encode_as_pieces(ch_or_w):
                    tokens.append(p)
        return tokens

    # txt = "你好▁LET'S▁GO你 好"
    # print(" ".join(__tokenize_by_bpe_model(sp, txt)))
    #src_file='data_4000_add_we/test_7g_train/text_space_eng_add_line'
    #src_file='data_4000_add_we/test_7g_train/text_space_eng'
    #src_file='ngram_testset_char/text'
    #src_file='ngram_testset_char/text_add_line'
    #src_file='ngram_testset_char/aban_c009_xueyuan/text'
    src_file = 'ngram_testset_char/aban_c009_xueyuan/text_bpe'
    with open(src_file, "r", encoding="utf8") as fs:
        for line in fs:
            line = line.strip()
            print(" ".join(__tokenize_by_bpe_model(sp, line)))

    Result analysis:

    # Compare decoding with and without ngram rescoring:

    # baseline:aban-c009 ctc prefix beam search
    Overall -> 4.47 % N=591030 C=569205 S=17387 D=4438 I=4594
    Mandarin -> 4.28 % N=587425 C=566713 S=16692 D=4020 I=4405
    English -> 36.01 % N=3599 C=2492 S=694 D=413 I=189
    Other -> 100.00 % N=6 C=0 S=1 D=5 I=0

    # Oracle best case (the room for improvement that rescoring could exploit):
    python compute_wer_best.py --char=1 --v=1 data/xueyuan/text exp/aban-c009/test_xueyuan/text > exp/aban-c009/test_xueyuan/wer_20best

    Overall -> 2.56 % N=591032 C=578565 S=9711 D=2756 I=2678
    Mandarin -> 2.43 % N=587425 C=575731 S=9235 D=2459 I=2571
    English -> 24.27 % N=3601 C=2834 S=475 D=292 I=107
    Other -> 100.00 % N=6 C=0 S=1 D=5 I=0


    # Rescoring results:
    # ngram_7g_train_hua_lexicon_eng/text2token/lm:
    # ngram_7g_train_hua_lexicon_eng_lm_text2token/wer_am_lm_alpha_0.2
    Overall -> 4.12 % N=591032 C=571202 S=14610 D=5220 I=4515
    Mandarin -> 3.93 % N=587425 C=568694 S=14035 D=4696 I=4364
    English -> 34.55 % N=3601 C=2508 S=574 D=519 I=151
    Other -> 100.00 % N=6 C=0 S=1 D=5 I=0


    In the run above, tokens such as LET'S were not split, which is a bit of a problem. The experiment was redone with split_sentence.py; the new results are below.

    # Train an LM on the 7M Chinese-English sentences (English as BPE subwords, Chinese as characters):
    local/train_lms_1gram.sh ngram_7g_train_hua_lexicon_eng/lexicon ngram_7g_train_hua_lexicon_eng/text_space_eng_bpe_aban_c009 ngram_7g_train_hua_lexicon_eng/lm

    # Compute the test-set ppl:
    # ngram_testset_char/text here contains 14k sentences
    ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_hua_lexicon_eng/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/text_space_eng_bpe_aban_c009 > ngram_testset_char/ppl/lm1.ppl
    # reading 12358 1-grams
    # reading 5641167 2-grams
    # reading 31842760 3-grams
    # 0 zeroprobs, logprob= -1052185 ppl= 54.49828 ppl1= 60.1414



    # Train an LM on the 40M Chinese and Chinese-English sentences (English as BPE subwords, Chinese as characters):
    local/train_lms_1gram.sh ngram_7g_train_hua_lexicon/lexicon ngram_7g_train_hua_lexicon/text_space_eng_bpe_aban_c009_and_cn ngram_7g_train_hua_lexicon/lm
    # Compute the test-set ppl:
    ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_hua_lexicon/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/text_space_eng_bpe_aban_c009 > ngram_testset_char/ppl/lm2.ppl
    # reading 12358 1-grams
    # reading 6916323 2-grams
    # reading 50028207 3-grams
    # 0 zeroprobs, logprob= -1036973 ppl= 51.43724 ppl1= 56.6826


    # Compute ppl on the (20-best) test set:
    python word_sentence.py > ngram_testset_char/aban_c009_xueyuan/text_space_eng_bpe_aban_c009

    ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_hua_lexicon_eng/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/aban_c009_xueyuan/text_space_eng_bpe_aban_c009 > ngram_7g_train_hua_lexicon_eng/lm/ppl_aban_c009_xueyuan

    ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_hua_lexicon/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/aban_c009_xueyuan/text_space_eng_bpe_aban_c009 > ngram_7g_train_hua_lexicon/lm/ppl_aban_c009_xueyuan

    Results:

    # ngram_7g_train_hua_lexicon_eng_lm:
    # wer_am_lm_alpha_0.2
    Overall -> 4.12 % N=591032 C=571208 S=14605 D=5219 I=4515
    Mandarin -> 3.93 % N=587425 C=568695 S=14034 D=4696 I=4364
    English -> 34.41 % N=3601 C=2513 S=570 D=518 I=151
    Other -> 100.00 % N=6 C=0 S=1 D=5 I=0

    # ngram_7g_train_hua_lexicon_lm:
    # wer_am_lm_alpha_0.18
    Overall -> 4.10 % N=591032 C=571249 S=14659 D=5124 I=4453
    Mandarin -> 3.91 % N=587425 C=568747 S=14077 D=4601 I=4303
    English -> 34.68 % N=3601 C=2502 S=581 D=518 I=150
    Other -> 100.00 % N=6 C=0 S=1 D=5 I=0

  • ==Chinese as characters, English as words, so the LM does not have to be retrained whenever the bpe.model changes==

    # Split the test-set sentences into space-separated Chinese characters and space-separated English words
    python split_sentence_nobpe.py > ngram_testset_char/text_space

    # Train an LM on the 7M Chinese-English sentences (English as words, Chinese as characters):
    local/train_lms_1gram.sh ngram_7g_train_char/lexicon ngram_7g_train_char/text_space_eng ngram_7g_train_char/lm_eng

    # Compute the test-set ppl:
    # ngram_testset_char/text_space here contains 14k sentences
    ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_char/lm_eng/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/text_space > ngram_testset_char/ppl_char/lm1.ppl
    # reading 34111 1-grams
    # reading 6114244 2-grams
    # reading 31739092 3-grams
    # 0 zeroprobs, logprob= -1051788 ppl= 54.54516 ppl1= 60.19802


    # Train an LM on the 40M Chinese and Chinese-English sentences (English as words, Chinese as characters):
    local/train_lms_1gram.sh ngram_7g_train_char/lexicon ngram_7g_train_char/text_space ngram_7g_train_char/lm_eng_cn

    # Compute the test-set ppl:
    # ngram_testset_char/text_space here contains 14k sentences
    ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_char/lm_eng_cn/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/text_space > ngram_testset_char/ppl_char/lm2.ppl
    # reading 34111 1-grams
    # reading 7389346 2-grams
    # reading 49923410 3-grams
    # 0 zeroprobs, logprob= -1036779 ppl= 51.5195 ppl1= 56.77883

    # Compute ppl on the 20-best test set (290k lines)
    python split_sentence_nobpe.py > ngram_testset_char/aban_c009_xueyuan/text_space

    ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_char/lm_eng/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/aban_c009_xueyuan/text_space > ngram_7g_train_char/lm_eng/ppl_aban_c009_xueyuan

    ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_char/lm_eng_cn/srilm/srilm.o3g.kn.gz -ppl ngram_testset_char/aban_c009_xueyuan/text_space > ngram_7g_train_char/lm_eng_cn/ppl_aban_c009_xueyuan
    # Some sentences produce no ngram output at all; a blank line has to be inserted manually for each of them
    # qq is a file containing a single blank line
    sed -i '5600 r qq' ppl_value_aban_c009_xueyuan
    sed -i '11812 r qq' ppl_value_aban_c009_xueyuan
    sed -i '35862 r qq' ppl_value_aban_c009_xueyuan
    sed -i '58228 r qq' ppl_value_aban_c009_xueyuan
    sed -i '107511 r qq' ppl_value_aban_c009_xueyuan
    sed -i '117260 r qq' ppl_value_aban_c009_xueyuan
    sed -i '121642 r qq' ppl_value_aban_c009_xueyuan
    sed -i '208564 r qq' ppl_value_aban_c009_xueyuan
    sed -i '245827 r qq' ppl_value_aban_c009_xueyuan
    sed -i '290460 r qq' ppl_value_aban_c009_xueyuan

    Results:

    # Compare decoding with and without ngram rescoring:

    # baseline:aban-c009 ctc prefix beam search
    Overall -> 4.47 % N=591030 C=569205 S=17387 D=4438 I=4594
    Mandarin -> 4.28 % N=587425 C=566713 S=16692 D=4020 I=4405
    English -> 36.01 % N=3599 C=2492 S=694 D=413 I=189
    Other -> 100.00 % N=6 C=0 S=1 D=5 I=0

    # Oracle best case (the room for improvement that rescoring could exploit):
    python compute_wer_best.py --char=1 --v=1 data/xueyuan/text exp/aban-c009/test_xueyuan/text > exp/aban-c009/test_xueyuan/wer_20best

    Overall -> 2.56 % N=591032 C=578565 S=9711 D=2756 I=2678
    Mandarin -> 2.43 % N=587425 C=575731 S=9235 D=2459 I=2571
    English -> 24.27 % N=3601 C=2834 S=475 D=292 I=107
    Other -> 100.00 % N=6 C=0 S=1 D=5 I=0

    # ngram_7g_train_char/lm_eng:
    # ngram_7g_train_char_lm_eng/wer_am_lm_alpha_0.2
    Overall -> 4.12 % N=591032 C=571210 S=14609 D=5213 I=4518
    Mandarin -> 3.93 % N=587425 C=568703 S=14030 D=4692 I=4363
    English -> 34.68 % N=3601 C=2507 S=578 D=516 I=155
    Other -> 100.00 % N=6 C=0 S=1 D=5 I=0
    # Slightly worse on English than the BPE-subword LM, which suggests the English word vocabulary is still a bit too large; but the degradation is tiny and the pipeline is much simpler, so modeling English directly at the word level is also OK.

    # ngram_7g_train_char/lm_eng_cn:
    Overall -> 4.10 % N=591032 C=571281 S=14582 D=5169 I=4470
    Mandarin -> 3.91 % N=587425 C=568779 S=13999 D=4647 I=4319
    English -> 34.71 % N=3601 C=2502 S=582 D=517 I=151
    Other -> 100.00 % N=6 C=0 S=1 D=5 I=0

  • ==Summary==:

    • Building the English part of the LM directly on words is slightly worse than using BPE subwords, which suggests the English word vocabulary is still a bit too large; but the degradation is tiny and the computation becomes much simpler, so word-level modeling for English is also acceptable.

      However, with English words the test set can contain words that are missing from the lexicon, i.e. OOVs; BPE does not have this problem.

    • With more pure-Chinese training text (7M Chinese-English -> 40M pure Chinese + Chinese-English), the Chinese results improve while English degrades only slightly.

Results

  1. The character-based ngram LM gives a lower WER than the word-based ngram LM.
  2. The text matching the 3500 h of speech alone is not enough; the 7g in-domain (vertical-domain) text helps, and additional Chinese-English data from other domains still helps on top of that.
  3. The WER improvement is in fact small, and even smaller for English.
  4. The character-level tokenization probably still has some issues.

Reason analysis

  1. Why does character-level modeling work better than word-level modeling?

    1. First rule out OOV effects. Once it is clear the problem is not word OOVs, look at how the ngram is estimated: the 2-gram probability is $\large P(w_i|w_{i-1})=\frac{count(w_{i-1}w_i)}{count(w_{i-1})}$, where the numerator is the co-occurrence count and the denominator is the count of the preceding word. Word-level ngrams therefore have the following problems (see the sketch after this list):
      1. The vocabulary is huge (hundreds of thousands of words), so after a word such as "我" far too many words can follow and every probability is low.
      2. The training data is not large enough; the counts are spread thinly across word pairs, so all probabilities are low and the n-best hypotheses are not clearly separated.
      3. A wrong word that falls apart into characters can even score higher: e.g. "黝黑" gets logprob -4, while the mis-recognized "呦 嘿" gets a larger logprob, so the hypothesis with the wrong characters ends up with the smaller ppl.
    2. OOV effects: with many OOVs the probabilities are unreliable. When ngram scores an OOV its logprob is -inf, but this -inf is not added into the sentence logprob, which distorts the comparison.
  2. (explained by Hua) From the perspective of the recognized n-best lists: an LM judges whether a sentence reads like natural language, and its ability to judge word order is better than its ability to catch wrong characters. The n-best hypotheses usually differ by only one or two characters and hardly at all in word order, so they are hard to tell apart. Also, a word such as "中华人民共和国" is a single lexicon entry; if one character in an n-best hypothesis is wrong, it is no longer that word, its word probability vanishes and the text gets segmented into smaller pieces, so the probability advantage of the correct hypothesis no longer stands out.

  3. In principle, can ppl be used to measure which n-best hypothesis reads most like natural language?

    ppl is well suited to comparing how well a language model matches a data set; it can show, for example, that one LM matches a given data set better than another LM does.

    It may not be suited to comparing, under the same LM, the ppl of two sentences in order to decide which one is more natural.

ngram -unk -map-unk "" -lm 11/srilm/srilm.o3g.kn.gz -order 2 -ppl 2 -debug 2
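To make point 1.3 above concrete with toy arithmetic (made-up unigram counts, context ignored): a rare but correct word can receive a lower total log-probability than the two common characters it is mis-recognized as.

    # Toy numbers illustrating point 1.3: a rare but correct word can get a lower
    # log-probability than the two wrong characters it is confused with.
    import math

    N = 1_000_000                                  # total tokens in a hypothetical training corpus
    count = {"黝黑": 1, "呦": 1000, "嘿": 2000}     # made-up unigram counts

    logp = lambda w: math.log10(count[w] / N)

    print("logprob('黝黑')               =", round(logp("黝黑"), 2))                   # one rare word
    print("logprob('呦') + logprob('嘿') =", round(logp("呦") + logp("嘿"), 2))        # two common fillers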

Using more data

Use the 24M Chinese-English sentences from Xianxiang plus the 7M Chinese-English sentences of 7g_train, 31M sentences in total (ngram_7g_train_30w_2017_seewo_eng_word/text), on 24.4 at /home/data/yelong/kaldi/egs/librispeech1/s5/

Build an LM with Lin's lexicon.

cat ngram_7g_train_eng_word/text 30w_2017_seewo/30w_2017_seewo_nopunc_freq_cn_en > ngram_7g_train_30w_2017_seewo_eng_word/text
# Then merge the characters of the AM lexicon into Lin's lexicon (about 600 characters were missing from it), giving ngram_7g_train_30w_2017_seewo_eng_word/lexicon
# Train the LM
local/train_lms_1gram.sh ngram_7g_train_30w_2017_seewo_eng_word/lexicon ngram_7g_train_30w_2017_seewo_eng_word/text ngram_7g_train_30w_2017_seewo_eng_word/lm

## Compute ppl on the 14k-sentence test set
ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_30w_2017_seewo_eng_word/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset/text_word > ngram_testset/ppl_word/lm1.ppl
# reading 134090 1-grams
# reading 43612597 2-grams
# reading 55833679 3-grams
# 0 zeroprobs, logprob= -1066023 ppl= 514.9295 ppl1= 654.8651

# Compute ppl on the (20-best) test set:
python word_segment_again.py > ngram_testset/aban_c009_xueyuan/text_word
sed -i 's/ '\''/'\''/g' ngram_testset/aban_c009_xueyuan/text_word
sed -i 's/'\'' /'\''/g' ngram_testset/aban_c009_xueyuan/text_word
ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_30w_2017_seewo_eng_word/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset/aban_c009_xueyuan/text_word > ngram_7g_train_30w_2017_seewo_eng_word/lm/ppl_aban_c009_xueyuan

for alpha in 0.02 0.005 0.01 0.015 0.018 0.022 0.03 0.04 0.05 0.06 0.07 0.08 0.09; do
python compute_wer_lm.py --char=1 --v=1 data/xueyuan/text exp/aban-c009/test_xueyuan/ngram_7g_train_30w_2017_seewo_eng_word_lm/ppl_value_aban_c009_xueyuan_test $alpha > exp/aban-c009/test_xueyuan/ngram_7g_train_30w_2017_seewo_eng_word_lm/wer_am_lm_alpha_$alpha
done
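After the sweep, the best alpha can be read off the "Overall" line of each wer_am_lm_alpha_* file; a minimal helper, assuming the directory layout and the WER line format shown in this note:

    # Pick the alpha with the lowest overall WER from the wer_am_lm_alpha_* files.
    # Assumes each file contains a line like "Overall -> 4.32 % N=... C=..." as shown below.
    import glob, re

    best = None
    for path in glob.glob("exp/aban-c009/test_xueyuan/ngram_7g_train_30w_2017_seewo_eng_word_lm/wer_am_lm_alpha_*"):
        with open(path, encoding="utf8") as f:
            for line in f:
                m = re.search(r"Overall\s*->\s*([\d.]+)\s*%", line)
                if m:
                    wer = float(m.group(1))
                    if best is None or wer < best[0]:
                        best = (wer, path)
                    break

    print("best:", best)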

Results

# baseline:aban-c009 ctc prefix beam search
Overall -> 4.47 % N=591030 C=569205 S=17387 D=4438 I=4594
Mandarin -> 4.28 % N=587425 C=566713 S=16692 D=4020 I=4405
English -> 36.01 % N=3599 C=2492 S=694 D=413 I=189
Other -> 100.00 % N=6 C=0 S=1 D=5 I=0

# ngram_7g_train_30w_2017_seewo_eng_word_lm:
# wer_am_lm_alpha_0.007
Overall -> 4.32 % N=591032 C=570190 S=15903 D=4939 I=4702
Mandarin -> 4.14 % N=587425 C=567617 S=15300 D=4508 I=4521
English -> 33.55 % N=3601 C=2573 S=602 D=426 I=180
Other -> 116.67 % N=6 C=0 S=1 D=5 I=1
# Chinese improves by ~3% relative and English by ~6% relative. Adding the extra Chinese-English data helps English a little; the Chinese gain is smaller than before, but Chinese was already quite good.

ppl of the three word-level LMs on the 14k-sentence test set:

ngram -unk -map-unk "嫫" -order 3 -debug 2 -lm ngram_7g_train_eng_word/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset/text_word >  ngram_testset/ppl/lm1.ppl
# reading 2316918 1-grams
# reading 61085443 2-grams
# reading 50367347 3-grams
# 0 zeroprobs, logprob= -1099945 ppl= 628.1117 ppl1= 804.9393

ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_word/lm_train/srilm/srilm.o3g.kn.gz -ppl ngram_testset/text_word > ngram_testset/ppl/lm2.ppl
# reading 767195 1-grams
# reading 81671836 2-grams
# reading 91884948 3-grams
# 0 zeroprobs, logprob= -1087195 ppl= 582.9138 ppl1= 744.8724


ngram -unk -map-unk "厶" -order 3 -debug 2 -lm ngram_7g_train_30w_2017_seewo_eng_word/lm/srilm/srilm.o3g.kn.gz -ppl ngram_testset/text_word > ngram_testset/ppl_word/lm1.ppl
# reading 134090 1-grams
# reading 43612597 2-grams
# reading 55833679 3-grams
# 0 zeroprobs, logprob= -1066023 ppl= 514.9295 ppl1= 654.8651

Scripts

Split sentences into space-separated tokens, with BPE segmentation

(text2token's BPE segmentation cannot split tokens such as LET'S)

split_sentence.py

import re
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("/home/yelong/data/wenet/examples/aishell/s0/exp/aban-c009/bpe.model")

def __tokenize_by_bpe_model(sp, txt):
    tokens = []
    # CJK (China Japan Korea) unicode range is [U+4E00, U+9FFF], ref:
    # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
    pattern = re.compile(r'([\u4e00-\u9fff])')
    # Example:
    # txt = "你好 ITS'S OKAY 的"
    # chars = ["你", "好", " ITS'S OKAY ", "的"]
    chars = pattern.split(txt.upper())
    mix_chars = [w for w in chars if len(w.strip()) > 0]
    for ch_or_w in mix_chars:
        # ch_or_w is a single CJK character (e.g. "你"): keep it as-is.
        if pattern.fullmatch(ch_or_w) is not None:
            tokens.append(ch_or_w)
        # ch_or_w contains non-CJK characters (e.g. " IT'S OKAY "):
        # encode it with the BPE model.
        else:
            for p in sp.encode_as_pieces(ch_or_w):
                tokens.append(p)
    return tokens

# txt = "你好▁LET'S▁GO你 好"
# print(" ".join(__tokenize_by_bpe_model(sp, txt)))
src_file = '....'
with open(src_file, "r", encoding="utf8") as fs:
    for line in fs:
        line = line.strip()
        print(" ".join(__tokenize_by_bpe_model(sp, line)))

Split sentences into space-separated tokens, also splitting English into words, without BPE segmentation (assuming there is no punctuation)

split_sentence_nobpe.py

import re
#import sentencepiece as spm
#sp = spm.SentencePieceProcessor()
#sp.load("/home/yelong/data/wenet/examples/aishell/s0/exp/aban-c009/bpe.model")

def __tokenize_by_bpe_model(txt):
    tokens = []
    # CJK (China Japan Korea) unicode range is [U+4E00, U+9FFF], ref:
    # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
    pattern = re.compile(r'([\u4e00-\u9fff])')
    # Example:
    # txt = "你好 ITS'S OKAY 的"
    # chars = ["你", "好", " ITS'S OKAY ", "的"]
    chars = pattern.split(txt.upper())
    mix_chars = [w for w in chars if len(w.strip()) > 0]
    for ch_or_w in mix_chars:
        # Unlike split_sentence.py, keep every chunk as-is: single CJK characters stay
        # characters, and English chunks keep their existing spaces between words (no BPE).
        #if pattern.fullmatch(ch_or_w) is not None:
        tokens.append(ch_or_w)
        #else:
        #    for p in sp.encode_as_pieces(ch_or_w):
        #        tokens.append(p)
    return tokens

# txt = "你好▁LET'S▁GO你 好"
# print(" ".join(__tokenize_by_bpe_model(sp, txt)))
#src_file='data_4000_add_we/test_7g_train/text_space_eng_add_line'
#src_file='data_4000_add_we/test_7g_train/text_space_eng'
src_file = 'ngram_testset_char/text'
with open(src_file, "r", encoding="utf8") as fs:
    for line in fs:
        line = line.strip()
        print(" ".join(__tokenize_by_bpe_model(line)))
        #print(" ".join(__tokenize_by_bpe_model(sp, line)))