Ngram LM Experiments (Part 1)

Ngram LM

https://zhuanlan.zhihu.com/p/273606445

  1. Train an ngram LM. Data: the train set (27.15M utterances, 2.12M of which contain English):
local/train_lms_1gram.sh ngram_test/lang_char.txt.bpe_500_eng1000_chi5200_all6200 ngram_test/text_token_bpe500 ngram_test/lm
  2. Compute sentence PPL:

    1. First take the 10-best output and restore it with spm.decode (note: do not replace ▁ back here), then run it through bpe.model to encode it again, giving the encoded text (ngram_test/test/text)
    2. Compute the overall PPL on the 14k-utterance test set:

    file ngram_test/test/text: 14504 sentences, 587117 words, 0 OOVs
    0 zeroprobs, logprob= -1166876 ppl= 87.00666 ppl1= 97.15536

ngram -debug 2 -lm ngram_test/lm/srilm/srilm.o3g.kn.gz -ppl ngram_test/test/text > ngram_test/test/ppl

# (characters separated by spaces)
#echo "你 好 啊" | ngram -debug 2 -lm ngram_test/lm/srilm/srilm.o3g.kn.gz -ppl -
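As a sanity check on the figures above: SRILM defines ppl = 10^(-logprob / (words - OOVs - zeroprobs + sentences)) and ppl1 = 10^(-logprob / (words - OOVs - zeroprobs)). A minimal sketch reproducing the reported numbers:

```python
# Reproduce SRILM's ppl/ppl1 from the 14k test-set summary line above.
sentences, words, oovs, zeroprobs = 14504, 587117, 0, 0
logprob = -1166876

ppl = 10 ** (-logprob / (words - oovs - zeroprobs + sentences))   # ~87.0  (ppl= 87.00666)
ppl1 = 10 ** (-logprob / (words - oovs - zeroprobs))              # ~97.2  (ppl1= 97.15536)
print(ppl, ppl1)
```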

Notes on computing PPL with ngram:

ngram
## Function
# Evaluate a language model, or score specific sentences; used for analyzing ASR recognition results.
## Parameters
# Scoring:
# -order   model order; defaults to 3
# -lm      language model to use
# -ppl     file of sentences to score (one per line, already tokenized); ppl counts all tokens, ppl1 excludes </s>
# -debug 0 print only the overall statistics
# -debug 1 per-sentence detail
# -debug 2 per-word probabilities
# Sentence generation:
# -gen     number of sentences to generate
# -seed    random seed used for generation
ngram -lm ${lm} -order 2 -ppl ${file} -debug 1 > ${ppl}
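With -debug 1, each sentence is followed by a stats block whose last line has the same `logprob= ... ppl= ... ppl1= ...` layout as the file-level summary shown earlier. A minimal parsing sketch for collecting per-sentence PPLs (the file name is an assumption, and sentences with zeroprobs/undefined ppl are not handled):

```python
import re

# Assumed to be the output of: ngram -lm ... -ppl sentences.txt -debug 1 > ppl_debug1.txt
ppl_pattern = re.compile(r'logprob=\s*(\S+)\s+ppl=\s*(\S+)\s+ppl1=\s*(\S+)')

scores = []
with open('ppl_debug1.txt') as fh:
    for line in fh:
        m = ppl_pattern.search(line)
        if m:
            scores.append(tuple(float(x) for x in m.groups()))

# The final match is the whole-file summary; everything before it is per-sentence.
per_sentence, overall = scores[:-1], scores[-1]
print(len(per_sentence), 'sentences, overall ppl =', overall[1])
```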

Compute per-sentence PPL for the 10-best hypotheses:

First split the English in the 10-best BPE sentences on spaces. (The step below can also be skipped entirely: just run tools/text2token.py -s 1 -n 1 -m data/lang_char/train_unigram500.model exp/seewo/conformer/test_xueyuan1/text_bpe --trans_type cn_char_en_bpe, which may even work better.)

(For text that is not in BPE format, add spaces with: tools/text2token.py -s 0 -n 1 text --trans_type phn > ...)

add_space.py

import sentencepiece as spm
import codecs
import re, sys, unicodedata

spacelist = [' ', '\t', '\r', '\n']
puncts = ['!', ',', '?',
          '、', '。', '!', ',', ';', '?',
          ':', '「', '」', '︰', '『', '』', '《', '》']

def characterize(string):
    res = []
    i = 0
    while i < len(string):
        char = string[i]
        if char in puncts:
            i += 1
            continue
        cat1 = unicodedata.category(char)
        # https://unicodebook.readthedocs.io/unicode.html#unicode-categories
        if cat1 == 'Zs' or cat1 == 'Cn' or char in spacelist:  # space or not assigned
            i += 1
            continue
        if cat1 == 'Lo':  # letter-other
            res.append(char)
            i += 1
        else:
            # some input looks like: <unk><noise>, we want to separate it to two words.
            sep = ' '
            if char == '<':
                sep = '>'
            j = i + 1
            while j < len(string):
                c = string[j]
                if ord(c) >= 128 or (c in spacelist) or (c == sep):
                    break
                j += 1
            if j < len(string) and string[j] == '>':
                j += 1
            res.append(string[i:j])
            i = j
    return res

sp = spm.SentencePieceProcessor()
sp.Load('data/lang_char/train_unigram500.model')
with codecs.open('exp/seewo/conformer/test_xueyuan1/text_bpe', 'r', 'utf-8') as fh:
    for line in fh:
        array = characterize(line)
        print(' '.join(array))

Then compute per-sentence PPL and save it to a file.

  1. Per-sentence PPL, called from inside a Python script: not figured out yet (the -debug 1 parsing sketch under the ngram notes above is one workaround)

  2. Compute WER:

python compute_wer_lm.py --char=1 --v=1 data/xueyuan/wav/text exp/seewo/conformer/test_xueyuan1/ppl_ngram > exp/seewo/conformer/test_xueyuan1/wer_am_lm_alpha_14

The best result is wer_am_lm_alpha_0.1:

Overall -> 4.15 % N=57114 C=55071 S=1576 D=467 I=326
Mandarin -> 3.84 % N=56894 C=55018 S=1506 D=370 I=308
English -> 84.09 % N=220 C=53 S=70 D=97 I=18

# original wer_1best:
Overall -> 4.31 % N=57114 C=54985 S=1685 D=444 I=331
Mandarin -> 4.00 % N=56894 C=54932 S=1604 D=358 I=315
English -> 83.18 % N=220 C=53 S=81 D=86 I=16

# Only 3.7% relative improvement; too small!!

Using a 5-gram:

local/train_lms_1gram.sh ngram_test/lang_char.txt.bpe_500_eng1000_chi5200_all6200 ngram_test/text_token_bpe500 ngram_test/lm_5gram

Results:

  1. am_logprob + lm_logprob: wer_am_lm_alpha_0.3
Overall -> 4.18 % N=57114 C=55005 S=1557 D=552 I=276
Mandarin -> 3.87 % N=56894 C=54954 S=1490 D=450 I=259
English -> 84.55 % N=220 C=51 S=67 D=102 I=17

Worse than the 3-gram; is the data sparser at higher order? But the PPL actually improves a little, which suggests it is not sparsity. Should the LM's PPL be used instead of its logprob?

  2. am_logprob + lm_ppl: wer_am_lm_alpha_0.06
Overall -> 4.13 % N=57114 C=55078 S=1578 D=458 I=324
Mandarin -> 3.83 % N=56894 C=55023 S=1506 D=365 I=307
English -> 82.73 % N=220 C=55 S=72 D=93 I=17
# 4.17% relative improvement
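For context on the two combinations compared above: the rescoring used throughout these notes is a per-hypothesis weighted sum over the 10-best list, with alpha swept to produce the wer_am_lm_alpha_* files. compute_wer_lm.py itself is not listed here, so the helper below is only a hypothetical sketch of that idea (lm_term is the LM logprob in the first variant and the negative PPL in the second, so that a lower PPL raises the score):

```python
# Hypothetical n-best rescoring: per utterance, pick the hypothesis that
# maximizes am_score + alpha * lm_term. 'nbest' maps utt-id to a list of
# (text, am_score, lm_logprob, lm_ppl) tuples, assembled elsewhere from the
# decoder output and the ngram -debug output.
def rescore(nbest, alpha, use_ppl=False):
    best = {}
    for utt, hyps in nbest.items():
        def combined(hyp):
            text, am_score, lm_logprob, lm_ppl = hyp
            lm_term = -lm_ppl if use_ppl else lm_logprob
            return am_score + alpha * lm_term
        best[utt] = max(hyps, key=combined)[0]
    return best
```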

Adding more data:

Using the 7G test set (13.02M utterances, 4.37M containing English):

tools/text2token.py -s 1 -n 1 -m data_4000_add_we/lang_char/train_unigram500.model data_4000_add_we_bpe/test/text --trans_type cn_char_en_bpe > data_4000_add_we_bpe/test/text_token_bpe500

That gives 40.18M utterances in total, 6.5M containing English.

Trying different dictionaries:

  • English BPE-500, i.e. 1000 subword units, plus 5200 Chinese characters (corresponding to ngram_test/lang_char.txt.bpe_500_eng1000_chi5200_all6200)
local/train_lms_1gram.sh ngram_test/lang_char.txt.bpe_500_eng1000_chi5200_all6200 ngram_7g_train/text_token_bpe500 ngram_7g_train/lm

# Compute PPL on the 14k test set:
ngram -debug 2 -lm ngram_7g_train/lm/srilm/srilm.o3g.kn.gz -ppl ngram_test/test/text > ngram_7g_train/lm/ppl_1.4w

#
python compute_wer_lm.py --char=1 --v=1 data/xueyuan/wav/text exp/seewo/conformer/test_xueyuan1/ppl_lm_7g_train > exp/seewo/conformer/test_xueyuan1/wer_am_lm_alpha_0.01

Results:


Overall -> 3.95 % N=57162 C=55197 S=1486 D=479 I=292
Mandarin -> 3.62 % N=56894 C=55125 S=1406 D=363 I=289
English -> 74.06 % N=266 C=72 S=80 D=114 I=3
Other -> 100.00 % N=2 C=0 S=0 D=2 I=0

# baseline
Overall -> 4.32 % N=57162 C=55001 S=1695 D=466 I=306
Mandarin -> 3.98 % N=56894 C=54932 S=1604 D=358 I=304
English -> 74.81 % N=266 C=69 S=91 D=106 I=2
Other -> 100.00 % N=2 C=0 S=0 D=2 I=0

# 8.5% relative improvement
  • English BPE-500 (1000 subword units); for Chinese, all characters deduplicated from the 40.18M utterances (corresponding to ngram_7g_train/lexicon_bpe500)
local/train_lms_1gram.sh ngram_7g_train/lexicon_bpe500 ngram_7g_train/text_token_bpe500 ngram_7g_train/lm_all_cn/

# Compute PPL on the 14k test set:
ngram -debug 2 -lm ngram_7g_train/lm_all_cn/srilm/srilm.o3g.kn.gz -ppl ngram_test/test/text > ngram_7g_train/lm_all_cn/ppl_1.4w

#
python compute_wer_lm.py --char=1 --v=1 data/xueyuan/wav/text exp/seewo/conformer/test_xueyuan1/ppl_lm_all_cn_7g_train > exp/seewo/conformer/test_xueyuan1/wer_am_lm_alpha_0.01

Results:

# wer_am_lm_alpha_0.13
Overall -> 3.95 % N=57162 C=55197 S=1486 D=479 I=292
Mandarin -> 3.62 % N=56894 C=55125 S=1406 D=363 I=289
English -> 74.06 % N=266 C=72 S=80 D=114 I=3
Other -> 100.00 % N=2 C=0 S=0 D=2 I=0
# 8.5% relative improvement

==The 7G data helps==

Possibly the extra Chinese vocabulary happens to make no difference on this particular test set, which is why the second experiment gives the same result as the first.

Ngram LM with English modeled as words (no BPE) and Chinese as characters

Directly take /home/yelong/data/wenet/examples/multi_cn/s0/data_4000_add_we/test_7g/text and run add_space to get Chinese characters and English words separated by spaces; see test_7g/text_space.

Plus the training set above.

For the lexicon, simply use all Chinese characters / English words:

echo  "<UNK> 0" >> lexicon_cn_char_en_char
cat text_space | tr " " "\n" | sort | uniq | grep -a -v -e '^\s*$' | grep -v '·' | grep -v '“' | grep -v "”" | grep -v "\[" | grep -v "\]" | grep -v "…" | awk '{print $0 " " NR}' >> lexicon_cn_char_en_char

The English vocabulary is a bit large at 26k words? Actually it is acceptable; Chinese is only 8k characters.

Generate the ngram LM:

local/train_lms_1gram.sh ngram_test/lexicon_cn_char_en_char ngram_test/text_space ngram_test/lm_cn_char_en_char_3gram

Compute PPL on the 14k test set:

ngram -debug 2 -lm ngram_test/lm_cn_char_en_char_3gram/srilm/srilm.o3g.kn.gz -ppl ngram_test/test/text_space >  ngram_test/test/ppl_space

Compute WER:

python compute_wer_lm.py --char=1 --v=1 data/xueyuan/wav/text exp/seewo/conformer/test_xueyuan1/ppl_3gram_space > exp/seewo/conformer/test_xueyuan1/wer_am_lm_alpha_0.01

Mediocre: WER 4.17%.

Not great; the model is probably too sparse? Otherwise, model English with a larger BPE (500 was used above, giving a dict of 1000), say 1000 units (an actual dict of about 1500)?

Training a 3-gram LM

Ngram LM with English modeled as words (no BPE) and Chinese as words; segment the decoded n-best with jieba first, then compute PPL with ngram

That is way too much: 700k Chinese words? Better to start with another lexicon. 花哥's suggestion: use DaCidian first, and a public dictionary for English as well.

DaCidian has 500k words, which is also a lot.

  1. Using only the training set, with our own lexicon (760k words) for now, generate the ngram LM:
local/train_lms_1gram.sh ngram_train_word/lexicon_word ngram_train_word/text ngram_train_word/lm

# Compute PPL on the 14k test set:
#ngram -debug 2 -lm ngram_train_word/lm/srilm/srilm.o3g.kn.gz -ppl ngram_train_word/text_1.4w > ngram_train_word/lm/ppl_1.4w
# Note: text_10best_word here was segmented against the given word lexicon, to avoid OOVs
ngram -debug 2 -lm ngram_train_word/lm/srilm/srilm.o3g.kn.gz -ppl ngram_train_word/text_10best_word > ngram_train_word/lm/ppl_value
# Compute WER:
python compute_wer_lm.py --char=1 --v=1 data/xueyuan/wav/text exp/seewo/conformer/test_xueyuan1/ppl_train_word > exp/seewo/conformer/test_xueyuan1/wer_am_lm_alpha_0.01
  • Segment the test set directly with jieba:
    • Test directly: results are poor; segmentation introduces too many OOVs when computing PPL, so it seems better to give jieba the lexicon before segmenting [still bad: plenty of OOVs remain]
    • Add the test-set OOVs into words.txt: but then the LM must be retrained! Skipping retraining is not an option, otherwise the LM does not change;
  • Segment the test set, and when a word is not in words.txt, segment it again (script word_segment_again.py):
    • Split straight into characters: result: not good
    • Maximum-matching algorithm, re-split into words: result: TODO (see the sketch after word_segment_again.py below)
python word_segment_again.py  > ngram_train_word/text_10best_word_again

ngram -debug 2 -lm ngram_train_word/lm/srilm/srilm.o3g.kn.gz -ppl ngram_train_word/text_10best_word_again > ngram_train_word/lm/ppl_value_again

python compute_wer_lm.py --char=1 --v=1 data/xueyuan/wav/text exp/seewo/conformer/test_xueyuan1/ppl_train_word_again > exp/seewo/conformer/test_xueyuan1/wer_am_lm_alpha_0.01

word_segment_again.py

import jieba
import re

#trans_file='/home/yelong/data/wenet/examples/multi_cn/s0/data_4000_add_we/train/split_word/text_bigletter'
#trans_file='/home/data/yelong/kaldi/egs/librispeech1/s5/ngram_train_word/text_10best_smallletter'
#trans_file='/home/data/yelong/kaldi/egs/librispeech1/s5/ngram_train_word/text_10best'
trans_file='/home/data/yelong/kaldi/egs/librispeech1/s5/ngram_testset/text'
# jieba.set_dictionary('lm_seewo/words.txt')

lexicon = {}
#for line in open('ngram_train_word/lexicon_word'):
#for line in open('ngram_7g_train_30w_2017_seewo_eng_word/lexicon'):
#for line in open('lm_seewo/words.txt'):
#for line in open('lm_from_wanglin/dict_seewo/20190729_lexicon.txt'):
for line in open('ngram_7g_train_30w_2017_seewo_eng_word/1'):
    trans = line.strip().split()
    # print(trans)
    lexicon[trans[0]] = trans[1]

pattern_zn = re.compile(r'([\u4e00-\u9fa5])')
pattern_en = re.compile(r'([a-zA-Z])')
for line in open(trans_file):
    trans = line.strip()
    seg_list = jieba.cut(trans)  # default: precise mode
    for i in seg_list:
        if i not in lexicon:
            # OOV word: split into individual Chinese characters
            chars = pattern_zn.split(i)
            chars = [w for w in chars if len(w.strip()) > 0]
            print(" ".join(chars), end=' ')
        else:
            print(i, end=' ')
    print(end='\n')
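For the maximum-matching TODO in the list above, a hypothetical forward-maximum-matching fallback (not the author's script): instead of splitting an OOV word into single characters, greedily take the longest lexicon entry at each position.

```python
# Hypothetical forward maximum matching against the lexicon dict used above.
def max_match(word, lexicon, max_len=6):
    out, i = [], 0
    while i < len(word):
        for length in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + length]
            if length == 1 or piece in lexicon:   # single char is the last resort
                out.append(piece)
                i += length
                break
    return out

# inside the segmentation loop, the character split could become:
#   print(' '.join(max_match(i, lexicon)), end=' ')
```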

word_segment_again.old.py: [Wrong! Do not split! Keep the OOV as-is; that way the incorrect word gets a low probability!!]

[2022.7.25] Stop using jieba; use LAC instead

import jieba
import re

# trans_file='/home/yelong/data/wenet/examples/multi_cn/s0/data_4000_add_we/train/split_word/text_bigletter'
trans_file='/home/data/yelong/kaldi/egs/librispeech1/s5/ngram_train_word/text_10best'

# jieba.set_dictionary('lm_seewo/words.txt')

lexicon = {}
for line in open('ngram_train_word/lexicon_word'):
    trans = line.strip().split()
    # print(trans)
    lexicon[trans[0]] = trans[1]

pattern_zn = re.compile(r'([\u4e00-\u9fa5])')
pattern_en = re.compile(r'([a-zA-Z])')
for line in open(trans_file):
    trans = line.strip()
    seg_list = jieba.cut(trans)  # default: precise mode
    for i in seg_list:
        if i not in lexicon:
            chars = pattern_zn.split(i)
            chars = [w for w in chars if len(w.strip()) > 0]
            for j in chars:
                if j in lexicon:
                    print(j, end=' ')
                else:
                    chars_en = pattern_en.split(j)
                    chars_en = [w for w in chars_en if len(w.strip()) > 0]
                    for k in chars_en:
                        if k in lexicon:
                            print(k, end=' ')
                        else:
                            print("error", end=' ')
        else:
            print(i, end=' ')
    print(end='\n')
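For the [2022.7.25] note above (drop jieba in favor of LAC), a minimal segmentation sketch assuming Baidu's LAC pip package; the input path is a placeholder:

```python
from LAC import LAC

# 'seg' mode does word segmentation only (no POS tags).
lac = LAC(mode='seg')

with open('ngram_train_word/text_10best') as fh:   # placeholder path
    for line in fh:
        words = lac.run(line.strip())   # list of tokens for a single string
        print(' '.join(words))
```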
  2. Using the training set + 7G data: 10.22.24.2:multi_cn/s0/data_4000_add_we/test_7g_train/text; or 24.4:librispeech1/s5/ngram_7g_train_word

Given lexicons:

  • Lexicon: lm_seewo/words.txt

    The text needs to be re-segmented, otherwise there are too many OOVs.

    • local/train_lms_1gram.sh lm_seewo/words.txt ngram_7g_train_word/text ngram_7g_train_word/lm_seewo
      
      python word_segment_again.py  > ngram_7g_train_word/text_10best_word_seewo
      
      ngram -debug 2 -lm ngram_7g_train_word/lm/srilm/srilm.o3g.kn.gz -ppl ngram_7g_train_word/text_10best_word_seewo >  ngram_7g_train_word/lm_seewo/ppl_value
      

      - Deduplicated lexicon (for now use the lexicon deduplicated from train: ngram_train_word/lexicon_word)

      - ```shell
      local/train_lms_1gram.sh ngram_train_word/lexicon_word ngram_7g_train_word/text ngram_7g_train_word/lm_train

      python word_segment_again.py > ngram_7g_train_word/text_10best_word_train

      ngram -debug 2 -lm ngram_7g_train_word/lm_train/srilm/srilm.o3g.kn.gz -ppl ngram_7g_train_word/text_10best_word_train > ngram_7g_train_word/lm_train/ppl_value

      python compute_wer_lm.py --char=1 --v=1 data/xueyuan/wav/text exp/seewo/conformer/test_xueyuan1/ppl_lm_7g_train_word > exp/seewo/conformer/test_xueyuan1/wer_am_lm_alpha_0.01
      ```
    • Results:

      # wer_am_lm_alpha_0.005
      Overall -> 4.13 % N=57162 C=55106 S=1575 D=481 I=303
      Mandarin -> 3.79 % N=56894 C=55038 S=1491 D=365 I=300
      English -> 75.56 % N=266 C=68 S=84 D=114 I=3
      Other -> 100.00 % N=2 C=0 S=0 D=2 I=0
      # 4.6% relative improvement
    • Redo the segmentation of the test text: do not run word_segment_again all the way down to the smallest units, just segment; if a segmented word is not in the lexicon, leave it alone! Do not force-split it!! The OOV gets a low probability, so the sentence gets a high PPL, which is exactly what we want!!

    • python word_segment.py > ngram_train_word/text_10best_word_train
      
      ngram -debug 2 -lm ngram_7g_train_word/lm_train/srilm/srilm.o3g.kn.gz -ppl ngram_train_word/text_10best_word_train  >  ngram_7g_train_word/lm_train/ppl_value_oov
      
      python compute_wer_lm.py --char=1 --v=1 data/xueyuan/wav/text exp/seewo/conformer/test_xueyuan1/ppl_7g_train_word_oov > exp/seewo/conformer/test_xueyuan1/wer_am_lm_alpha_0.01
      
      - Modify ngram_7g_train_word/lm_train/srilm/srilm.o3g.kn.gz, because the OOV logprob is -inf and never gets accumulated! So map the OOV to some arbitrary word with a very low probability!

      Note: it works a bit better not to give jieba a dictionary (do not call jieba.set_dictionary('ngram_train_word/lexicon_word')), otherwise you get very poor segmentations such as "如果 想收 集 画板".

      Use the probability, not the PPL.

      - ```shell
      # Note: it works a bit better not to give jieba a dictionary (no jieba.set_dictionary('ngram_train_word/lexicon_word')), otherwise you get very poor segmentations such as "如果 想收 集 画板"
      ngram -debug 2 -map-unk ABELL -lm ngram_7g_train_word/lm_train/srilm/srilm.o3g.kn.gz -ppl ngram_7g_train_word/text_10best_word_train > ngram_7g_train_word/lm_train/ppl_value_oov_unk

      python compute_wer_lm.py --char=1 --v=1 data/xueyuan/wav/text exp/seewo/conformer/test_xueyuan1/ppl_7g_train_word_oov_unk > exp/seewo/conformer/test_xueyuan1/wer_am_lm_alpha_0.01
      ```
    • Results:

    • ```shell
      # wer_am_lm_alpha_0.6
      Overall -> 4.08 % N=57162 C=55066 S=1489 D=607 I=236
      Mandarin -> 3.74 % N=56894 C=54999 S=1416 D=479 I=233
      English -> 75.94 % N=266 C=67 S=73 D=126 I=3
      Other -> 100.00 % N=2 C=0 S=0 D=2 I=0
      ```


      **Their words.txt: the G in their TLG is word-level, with 590k words (90k English, 500k Chinese words).** Path: 24.2:/home/yelong/data/wenet/examples/aishell/s0/lm_seewo

      Segment the 7G + train set with their words.txt:

      ```shell
      python word_segment_again.py > /home/yelong/data/wenet/examples/multi_cn/s0/data_4000_add_we/train/split_word_by_seewo_words/2
      ```

Ideally there should be no OOVs at all; add every test-set word missing from the lexicon into the lexicon?
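If the no-OOV route is taken, a minimal sketch for listing the test-set tokens missing from the lexicon (both paths are assumptions); as noted earlier, the LM would then have to be retrained after extending words.txt:

```python
# List every token in the segmented test text that is absent from the
# first column of the lexicon file.
lexicon = set()
with open('lm_seewo/words.txt') as fh:                   # assumed lexicon
    for line in fh:
        if line.strip():
            lexicon.add(line.split()[0])

oov = set()
with open('ngram_train_word/text_10best_word') as fh:    # assumed segmented text
    for line in fh:
        oov.update(tok for tok in line.split() if tok not in lexicon)

for word in sorted(oov):
    print(word)
```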

Testing the ngram LM trained by 琳姐 (wanglin)

Path: 10.22.24.4:/home/data/data_to_yelong/lm_from_wanglin//bigfat_30w10wlwwCE_merge_20190709_7e-10.lm; lexicon: 8k English words and 120k Chinese words; 5-gram.

python word_segment_again.py  > ngram_lin/text_10best_word_lin
sed -i "s/\([a-zA-Z]\) ' \([a-zA-Z]\)/\1'\2/g" ngram_lin/text_10best_word_lin # BPE splits ' into separate tokens, so rejoin contractions

ngram -order 5 -debug 2 -lm lm_from_wanglin/bigfat_30w10wlwwCE_merge_20190709_7e-10.lm -ppl ngram_lin/text_10best_word_lin > ngram_lin/ppl_value

python compute_wer_lm.py --char=1 --v=1 data/xueyuan/wav/text exp/seewo/conformer/test_xueyuan1/ppl_lin > exp/seewo/conformer/test_xueyuan1/wer_am_lm_alpha_0.01

Results:

# wer_am_lm_alpha_0.005
Overall -> 4.14 % N=57162 C=55082 S=1597 D=483 I=289
Mandarin -> 3.81 % N=56894 C=55014 S=1507 D=373 I=286
English -> 75.56 % N=266 C=68 S=90 D=108 I=3
Other -> 100.00 % N=2 C=0 S=0 D=2 I=0
# 4.3% relative improvement

Ngram LM with both Chinese and English tokenized by BPE; segment the decoded n-best with jieba first, then apply BPE, then compute PPL with ngram

My hunch is the result would fall between the character- and word-level setups, so I will not bother.

Ngram LM with English modeled with BPE-1000 and Chinese as words; segment the decoded n-best with jieba first, then compute PPL with ngram

After the best CTC rescoring, rescore again with attention

# attention score + ctc*0.5 + \lambda PPL
python compute_wer_lm.py --char=1 --v=1 data/xueyuan/wav/text exp/seewo/conformer/test_xueyuan1/ppl_lm_all_cn_7g_train_attention > exp/seewo/conformer/test_xueyuan1/wer_am_lm_alpha_0.01

Results:

# wer_am_lm_alpha_0.13
Overall -> 3.92 % N=57162 C=55206 S=1462 D=494 I=285
Mandarin -> 3.58 % N=56894 C=55139 S=1378 D=377 I=282
English -> 75.94 % N=266 C=67 S=84 D=115 I=3
Other -> 100.00 % N=2 C=0 S=0 D=2 I=0

# Only a tiny improvement? (without attention, ctc + ppl alone gives 3.95%)
# Without the LM, pure ctc + attention rescore:
Overall -> 4.10 % N=57162 C=55109 S=1564 D=489 I=289
Mandarin -> 3.75 % N=56894 C=55045 S=1478 D=371 I=286
English -> 77.07 % N=266 C=64 S=86 D=116 I=3
Other -> 100.00 % N=2 C=0 S=0 D=2 I=0

Decode with a TLG built from the word-level ngram (as a constraint), then rescore with an NN LM

TODO

Decode with a TLG built from the word-level ngram (as a constraint), then rescore with an ngram LM

10.22.23.17 docker

./tools/decode.sh --nj 1 --beam 10.0 --lattice_beam 7.5 --max_active 7000  --blank_skip_thresh 0.98 --ctc_weight 0.5 --rescoring_weight 1.0 --chunk_size -1  --fst_path /home/data/yelong/docker_seewo/seewo/lm_seewo/TLG.fst  /home/data/yelong/docker_seewo/corpus/seewo/wav.scp /home/data/yelong/docker_seewo/corpus/seewo/text /home/data/yelong/docker_seewo/seewo/final.zip  /home/data/yelong/docker_seewo/seewo/lm_seewo/words.txt exp/seewo/conformer/lm_with_runtime

Word-level ngram, character-level ngram, and the acoustic model score: take a weighted sum of all three as the final score

As explained by 旺旺.

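For the three-way weighted sum described above, a hypothetical sketch (the weights are placeholders to be tuned on a dev set):

```python
# Final hypothesis score: acoustic model score plus weighted word-level and
# character-level ngram LM scores.
def final_score(am_score, word_lm_logprob, char_lm_logprob, w_word=0.3, w_char=0.1):
    return am_score + w_word * word_lm_logprob + w_char * char_lm_logprob
```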