Xinzhiyuan Recommended
Jointly first released by Xinzhiyuan and the Yunqi community
Author: Kun Cheng, Algorithm Engineer, Alibaba Cloud Data Division
On the day of Double Eleven, the overall service volume of Ant Financial's customer center exceeded 5 million, and more than 94% of requests were resolved through self-service driven by artificial intelligence. A very important part of that self-service is the call center's speech-to-text service, a typical telephone speech recognition problem.
Telephone speech recognition is one of the most complex and difficult problems in speech recognition today. The speaker's natural, casual style, accents, disfluencies (repetitions, self-corrections), and complex, diverse transmission channels all converge in this scenario. With the development of deep learning and related technologies, telephone speech recognition has now reached a level of accuracy that was unimaginable just a few years ago.
We use a speech recognition acoustic model based on an LC-BLSTM-DNN hybrid structure. To test its effect, we invited colleagues with standard Mandarin accents to dial Alipay's 95188 customer service hotline and experience the latest speech recognition technology that Alibaba iDST recently upgraded and launched; the results were impressive. To the best of our knowledge, this is also the first industrial application of this model structure in speech recognition. This article introduces the background of this acoustic model and our implementation work.
Traditionally, the acoustic model in speech recognition was built with GMM-HMM. In recent years, with the development of deep learning, DNN-HMM modeling has made great progress: compared with the traditional approach it can improve recognition accuracy by 20%-30%, and it has replaced GMM-HMM as the mainstream configuration in both academia and industry. The strength of DNNs is that increasing the number of layers and nodes expands the network's capacity to abstract and model complex data. But DNNs also have shortcomings. For example, a DNN generally relies on frame splicing to take neighboring speech frames into account, which is not the best way to capture the correlations within a speech sequence. Recurrent neural networks (RNNs) address this to some extent: through self-connections between network nodes they exploit the correlations between elements of a sequence. Going further, researchers proposed the long short-term memory network (LSTM-RNN), which effectively alleviates the exploding- and vanishing-gradient problems that simple RNNs are prone to, and then extended it to the bidirectional long short-term memory network (BLSTM-RNN) for acoustic modeling, so that context in both directions is fully taken into account.
BLSTM can effectively improve the accuracy of speech recognition, with a relative performance gain of 15%-20% over DNN models. But BLSTM also brings two serious problems:
1. Updates are performed at the sentence level, so the model usually converges slowly; and because of the large amount of frame-by-frame computation, parallel computing hardware such as GPUs cannot be used effectively, making training very time-consuming;
2. Because the entire sentence is needed to recursively compute the posterior probability of each frame, decoding latency and real-time factor cannot be effectively guaranteed, making it hard to apply in real services.
To address these two problems, [1] first proposed the Context-Sensitive-Chunk BLSTM (CSC-BLSTM), and [2] later proposed an improved version, Latency Controlled BLSTM (LC-BLSTM), which mitigates both issues more effectively and efficiently. Building on this, we used an LC-BLSTM-DNN hybrid structure, together with training and optimization techniques such as multi-machine multi-GPU training and 16-bit quantization, to build the acoustic model, achieving a relative recognition error rate reduction of roughly 17%-24% over the DNN model. This model has already been deployed first in telephone speech recognition, and will gradually roll out to the other speech recognition services we support.
The typical LSTM node structure is shown in the figure below. Unlike the simple activation-function nodes used in an ordinary DNN or simple RNN, an LSTM node consists of three gates (input gate, forget gate, output gate) and a cell. The input and output nodes and the cell are connected to each gate; the input gate and forget gate are also connected to the cell, and the cell has a self-connection. By controlling the states of the different gates, the network can store long- and short-term information and propagate errors more effectively.
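As a rough illustration (not the production implementation), one time step of such an LSTM node, with its three gates and self-connected cell state, can be sketched in NumPy as follows; this simplified variant omits the gate-to-cell (peephole) connections mentioned above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold the stacked parameters for the
    input gate, forget gate, output gate, and cell candidate (4 blocks)."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b            # all four pre-activations at once
    i = sigmoid(z[0 * n:1 * n])           # input gate
    f = sigmoid(z[1 * n:2 * n])           # forget gate
    o = sigmoid(z[2 * n:3 * n])           # output gate
    g = np.tanh(z[3 * n:4 * n])           # cell candidate
    c = f * c_prev + i * g                # self-connected cell state
    h = o * np.tanh(c)                    # gated cell output
    return h, c
```

Adding the peephole connections would mean extra diagonal weights feeding the previous cell state into each gate's pre-activation.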
Like a DNN, LSTM layers can be stacked into a Deep LSTM. To make better use of context, BLSTM layers can likewise be stacked into a Deep BLSTM, whose structure is shown in the figure below: the network propagates information both forward and backward along the time axis, so the computation at each time frame depends on the results of all preceding and all following frames. For a time series such as a speech signal, this model fully accounts for the influence of context on the current frame and can greatly improve the classification accuracy of phoneme states.
However, because the standard BLSTM models an entire utterance, its training and decoding suffer from slow convergence, high latency, and low real-time factor. We address these drawbacks with Latency Controlled BLSTM. Unlike the standard BLSTM, which trains and decodes on whole sentences, LC-BLSTM uses an update scheme similar to truncated BPTT, with its own way of handling the intermediate cell state and the data. As shown in the figure below, each training update uses a small segment of data consisting of a center chunk and a right-context chunk; the right-context chunk is used only to compute the intermediate cell state, and errors are propagated only over the center chunk. For the network moving forward along the time axis, the cell state at the end of the previous segment's center chunk is used as the initial state of the next segment; for the network moving backward along the time axis, the cell state is reset to 0 at the start of each segment. This greatly accelerates convergence and helps achieve better performance. Decoding processes the data in essentially the same way as training, except that the sizes of the center chunk and right-context chunk can be adjusted as needed and need not match the training configuration.
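The segmentation described above can be sketched as follows. The chunk sizes `nc` and `nr` here are illustrative placeholders, not the configuration the article actually uses:

```python
def lc_blstm_segments(frames, nc=20, nr=10):
    """Split an utterance into LC-BLSTM update segments.

    Each segment is a (center, right_context) pair: errors propagate only
    over the center chunk; the right-context chunk only warms up the
    backward cell state. The forward direction carries its cell state
    across segment boundaries, while the backward direction resets its
    cell state to zero at the start of each segment.
    """
    segments = []
    for start in range(0, len(frames), nc):
        center = frames[start:start + nc]
        right = frames[start + nc:start + nc + nr]  # may be empty at the end
        segments.append((center, right))
    return segments
```

Note that successive right-context chunks overlap the next segment's center chunk, which is the extra computation LC-BLSTM pays for bounded latency.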
A typical DNN-based speech recognition acoustic model is shown in the figure below. The DNN input is usually obtained by splicing frames of traditional spectral features or their refinements (such as MFCC, PLP, or filterbank features); the spliced window is generally 9-15 frames, with frames spaced about 10 ms apart. The output generally uses phoneme acoustic units of various granularities, such as monophones, monophone states, or tied triphone states. The output-layer labels are generally obtained by forced alignment with a GMM-HMM baseline system.
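As a small sketch of the frame-splicing step (the 5+1+5 window here is one choice within the 9-15 frame range the text mentions, not necessarily the production setting):

```python
import numpy as np

def splice_frames(feats, left=5, right=5):
    """Stack each frame with `left` and `right` neighbors (edge-padded),
    so the DNN sees an 11-frame context window per input vector."""
    T, D = feats.shape
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    window = left + right + 1
    return np.stack([padded[t:t + window].reshape(-1) for t in range(T)])
```

With 40-dimensional filterbank features and an 11-frame window, each DNN input vector would be 440-dimensional.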
Similar to a DNN, we can stack LC-BLSTM layers to obtain a Deep LC-BLSTM. But an acoustic model built purely from stacked LC-BLSTM layers not only puts heavy pressure on computational complexity but, more importantly, does not achieve the best recognition performance. After many experiments, we settled on an LC-BLSTM-DNN hybrid structure: the input speech features first pass through 3 LC-BLSTM layers with 1000 nodes each (forward + backward), then through 2 fully connected DNN layers with 2048 nodes each and a softmax layer to produce the output, as shown in the figure below. The comparison of recognition results against our best DNN baseline is shown in the following table; the network size and parameter configuration are still being optimized.
CER% (relative reduction in parentheses):

Model          Product Line A   Product Line B   Product Line C
DNN            15.4             11.1             12.4
LC-BLSTM-DNN   12.7 (17.5%)     8.51 (23.4%)     9.4 (24.2%)
We trained the LC-BLSTM-DNN acoustic model with a multi-machine multi-GPU training tool developed within our team. With 6 machines and 12 GPUs on a training set of 2100 hours, the speedup is shown in the following table:
Configuration                            Time per epoch (hours)
Single machine, single GPU (baseline)    65.6
6 machines, 12 GPUs (middleware)         6.5 (10.1x)
As the table shows, 6 machines with 12 GPUs give a speedup of about 10.1x. Training usually converges after 5-6 epochs, so a model can be trained in under a day and a half, and we expect that even if the training data grows to 10,000 hours in the future, training can still be completed within a week. Adding more machines and GPUs would raise the speedup further. The high training cost of LC-BLSTM-DNN can thus be said to be well solved by the multi-machine multi-GPU training tool.
Because a DNN uses frame splicing, its decoding latency is usually 5-10 frames, about 50-100 ms. A standard BLSTM requires whole-sentence latency, but in LC-BLSTM-DNN the chunked computation keeps decoding latency at around 20 frames, about 200 ms. For online service tasks with stricter latency requirements, the latency can be reduced to 100 ms at a small cost in recognition performance (roughly 0.2%-0.3% absolute), which fully meets the needs of various tasks.
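As a back-of-the-envelope check, the quoted latencies follow directly from the chunk sizes, assuming the usual 10 ms frame shift:

```python
FRAME_SHIFT_MS = 10  # typical frame shift; assumed, not stated explicitly

def decode_latency_ms(lookahead_frames):
    """Decoding latency contributed by the frames the model must wait for."""
    return lookahead_frames * FRAME_SHIFT_MS

print(decode_latency_ms(20))  # default chunk: 200 ms
print(decode_latency_ms(10))  # stricter online setting: 100 ms
```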
On decoding real-time factor: although our LC-BLSTM-DNN structure already cuts much of the computational complexity of an ordinary Deep BLSTM, an LC-BLSTM-DNN without any optimization still could not meet our requirements, so we attacked the problem from two directions. On the engineering side, we applied many optimizations, including 16-bit quantization of feature data and network weights, which greatly improves computational efficiency with essentially no loss in recognition performance; we also used a lazy-evaluation technique to optimize the computation of the final softmax layer, further improving the real-time factor. The current model achieves roughly 1.1x real time on an ordinary laptop. On the model-structure and algorithm side, we are trying frame skipping, projection layers, SVD decomposition, and model resolution adjustment to accelerate computation; preliminary results are promising and much of this work is still in progress.
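The article does not detail its quantization scheme, but the idea behind 16-bit weight quantization can be sketched as a simple symmetric linear mapping to int16, which halves memory traffic versus float32 while keeping the rounding error tiny relative to the weight range:

```python
import numpy as np

def quantize_int16(w):
    """Symmetric linear quantization of a float32 array to int16.
    Illustrative only; the production scheme may differ."""
    scale = np.max(np.abs(w)) / 32767.0   # map the largest weight to +/-32767
    q = np.round(w / scale).astype(np.int16)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int16(w)
err = np.max(np.abs(dequantize(q, s) - w))  # worst-case error ~ scale / 2
```

In practice the int16 matrices are used directly in fixed-point matrix products, with a single rescale at the end, rather than being dequantized element by element as above.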
In the past year or two, speech recognition has made encouraging progress on top of DNN technology. Take CTC: it is generally believed that combining CTC with LSTM helps the model learn future context and achieve recognition accuracy comparable to BLSTM; since BLSTM already learns future context, combining CTC with BLSTM brings no significant further gain. LSTM+CTC has certain advantages over BLSTM in computational complexity and decoding speed, but many researchers have also found that LSTM+CTC models estimate phoneme time boundaries inaccurately, which is a real problem for tasks that need accurate phoneme boundaries, such as desensitizing (redacting) voice data. Other techniques, such as attention-based models, have also made good progress in research. We believe that even better model structures and algorithms will continue to improve speech recognition accuracy, and that intelligent voice interaction will keep advancing on that foundation.
References
[1] Chen K, Yan ZJ, Huo Q. A context-sensitive-chunk BPTT approach to training deep LSTM/BLSTM recurrent neural networks for offline handwriting recognition[C]. 2015 13th International Conference on Document Analysis and Recognition (ICDAR). IEEE Computer Society, 2015:411-415.
[2] Yu Zhang, Guoguo Chen, Dong Yu, Kaisheng Yao, Sanjeev Khudanpur, James Glass. Highway Long Short-Term Memory RNNs for Distant Speech Recognition. http://arxiv.org/abs/1510.08983