Chinese word segmentation is the task of having a computer automatically insert spaces or other boundary markers between the words of a Chinese text, which is written without explicit word delimiters.
1. The main difficulties of Chinese word segmentation are the lack of a segmentation standard, segmentation ambiguity, and the recognition of unregistered (out-of-vocabulary) words.
First, there is still no universally accepted standard for what counts as a word.
Second, resolving segmentation ambiguity is harder: it often requires contextual semantic analysis and sometimes even prosodic analysis.
Ambiguity falls into three types: intersection (overlapping) ambiguity, which has been studied the most; combination ambiguity, which has to be resolved using the whole sentence; and true ambiguity, which can only be resolved using other sentences in the context.
Third, on large-scale real-world text, unregistered words hurt segmentation accuracy far more than segmentation ambiguity does.
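To make the intersection ambiguity described above concrete, the sketch below enumerates every way a string can be split into dictionary words. The function name `all_segmentations` and the toy dictionary are assumptions made for illustration only; real segmenters do not enumerate candidates exhaustively like this.

```python
def all_segmentations(text, vocab):
    """Return every way to split `text` into words that all appear in `vocab`."""
    if not text:
        return [[]]  # one valid segmentation of the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            # Keep this prefix and segment the remainder recursively.
            for rest in all_segmentations(text[end:], vocab):
                results.append([head] + rest)
    return results


# Hypothetical toy dictionary, chosen only to trigger an overlap:
# "研究生" (graduate student) and "生命" (life) share the character "生".
vocab = {"研究", "研究生", "生命", "命", "起源"}
candidates = all_segmentations("研究生命起源", vocab)
for words in candidates:
    print(" / ".join(words))
```

With this dictionary the string has two valid segmentations, 研究 / 生命 / 起源 and 研究生 / 命 / 起源. Both are consistent with the dictionary, so choosing between them needs information beyond the dictionary itself.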
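2. Two concepts are worth knowing: stop words (stopWord), which are very frequent words such as particles that carry little meaning on their own and are usually dropped after segmentation; and filter words (filterWord), which are typically words an application chooses to exclude explicitly.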
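As a rough illustration of how these two word lists are applied, the sketch below removes both kinds of words from an already segmented token list. The list contents and the name `clean_tokens` are hypothetical, and treating filter words as an application-specific block list is an assumption, since the notes do not define the term further.

```python
# Hypothetical word lists, for illustration only.
STOP_WORDS = {"的", "了", "是"}   # frequent function words
FILTER_WORDS = {"广告"}           # assumed: words the application wants to block


def clean_tokens(tokens, stop_words=STOP_WORDS, filter_words=FILTER_WORDS):
    """Drop stop words and filter words from a list of segmented tokens."""
    return [t for t in tokens if t not in stop_words and t not in filter_words]


tokens = ["我", "的", "研究", "是", "广告"]
cleaned = clean_tokens(tokens)
print(cleaned)
```

Keeping the two lists separate lets an application change its block list without touching the linguistic stop-word list.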
3. Word segmentation algorithms fall into three broad categories:
First, algorithms based on string matching
Second, algorithms based on understanding
Third, algorithms based on statistics
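As one example of the string-matching category, here is a minimal sketch of forward maximum matching, a common dictionary-based approach: at each position it greedily takes the longest dictionary word and falls back to a single character when nothing matches. The function name, the `max_len` parameter, and the toy dictionary are assumptions for illustration, not the implementation used by any particular tokenizer.

```python
def forward_max_match(text, vocab, max_len=None):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    if max_len is None:
        # Longest word in the dictionary; 1 if the dictionary is empty.
        max_len = max((len(w) for w in vocab), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted, so unknown text still advances.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical toy dictionary.
vocab = {"研究", "研究生", "生命", "命", "起源"}
segmented = forward_max_match("研究生命起源", vocab)
print(segmented)
```

On this input the greedy choice yields 研究生 / 命 / 起源, although 研究 / 生命 / 起源 is the more natural reading. This is the kind of intersection ambiguity that pure string matching cannot resolve by itself.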
4. Commonly used open source tokenizers
IK Analyzer, Ansj