OpenNLP: Controlling text and segmenting words

Author: Bai Ningchao

March 27, 2016 19:55:03

Summary: Strings, character arrays, and other text representations form the basis of most text processing programs. Most languages include basic libraries for handling them, and such handling is also the necessary groundwork for text processing and natural language processing; typical examples are word segmentation (tokenization), part-of-speech tagging, and sentence detection. The tools introduced in this article are mainly aimed at English tokenization. There are many English tokenization tools; the author compared them for efficiency and ease of use and settled on Apache OpenNLP, which also provides an open-source Java API. The article begins with an introduction to OpenNLP, then introduces six commonly used models, and finally summarizes the use and Java implementation of each model. Some readers may ask what to do about Chinese word segmentation; a later chapter will separately introduce NLPIR (ICTCLAS), a Chinese word segmentation tool based on hidden Markov models developed by a research team at the Chinese Academy of Sciences. The content has been compiled from many documents and books, and the code runs without errors. (This article is original; please indicate the source when reprinting: OpenNLP: Controlling text and segmenting words)

Contents


[Text Mining (0)] Quickly understand what natural language processing is

[Text Mining (1)] OpenNLP: Controlling text and segmenting words

[Text Mining (2)] [NLP] Tika text preprocessing: extracting the contents of files in various formats

[Text Mining (3)] Build a search tool by yourself

1 What is OpenNLP, and what kind of “internal power” does it have?

What is OpenNLP?

Wikipedia: The Apache OpenNLP library is a machine-learning-based toolkit for processing natural language text. It supports common language processing tasks such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction (recognizing proper nouns in sentences, such as people’s names), chunking (shallow parsing of sentences into phrase blocks), parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services.

Official documentation: Apache’s OpenNLP library is a machine-learning-based toolkit for processing natural language text. It supports the most common NLP tasks such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron-based machine learning. The goal of the OpenNLP project is to create a mature toolkit for the tasks mentioned above. An additional goal is to provide a large number of pre-built models for various languages, derived from annotated text resources.

  • Developer: Apache Software Foundation
  • Stable version: 1.5.2-incubating (November 28, 2011, 5 years ago)
  • Development status: Active
  • Programming language: Java
  • Type: Natural language processing
  • Website: http://incubator.apache.org/opennlp/

Usage: OpenNLP supports multiple operating systems such as Windows and Linux. This article mainly covers its use on Windows:

1 Command line interface (CLI): The OpenNLP script uses the JAVA_CMD and JAVA_HOME variables to determine which command is used to execute the Java virtual machine, and the OPENNLP_HOME variable to determine the location of the OpenNLP binary distribution. It is recommended to point this variable at the current OpenNLP version and to update the PATH variable to include $OPENNLP_HOME/bin or %OPENNLP_HOME%\bin. This configuration allows OpenNLP to be invoked conveniently; the following example assumes it has been completed. Usage is as follows: when a tool is executed this way, the model is loaded and the tool waits for input from standard input. The input is processed and printed to standard output.

$ opennlp ToolName lang-model-name.bin < input.txt > output.txt
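For example, assuming the English sentence detector model en-sent.bin has already been downloaded from the model page, sentence detection can be run from the command line roughly as follows (the file names here are illustrative):

$ opennlp SentenceDetector en-sent.bin < input.txt > output.txt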

2 Calling the Java API: all of the code demonstrations below use this method.

  • Download the apache-opennlp-1.5.3 toolkit from the official website
  • Unzip the archive and copy the jars under its lib directory (e.g. savepath\apache-opennlp-1.5.3\lib) into the project
  • Go to the official model page and download the required model .bin files, as follows:
  • Then create a Java program to do the corresponding processing, such as tokenization, part-of-speech tagging, etc.

3 Example: this completes the Java configuration. Next we segment the sentence “The quick, red fox jumped over the lazy, brown dogs.” The conventional method splits on spaces, as in the following code:

// English segmentation: split on spaces or newlines
public static void ENSplit(String str)
{
    String[] result = str.split("\\s+");
    for (String s : result) {
        System.out.println(s + " ");
    }
    System.out.println();
}

Word segmentation result: each whitespace-separated token is printed on its own line; punctuation such as “quick,” and “dogs.” remains attached to the word.

As requirements change, suppose we also want to separate out the punctuation. This is actually meaningful: punctuation can be used to determine sentence boundaries. The following sections show how OpenNLP accomplishes this, which brings us to the topic of this article.

2 Sentence Detector

Feature introduction:

The sentence detector detects sentence boundaries in a paragraph and returns the sentences as an array of strings.

API: The sentence detector also provides an API to train a new sentence detection model. Three basic steps are necessary to train it (a sketch follows the list below):

  • The application must open a sample data stream
  • Call the SentenceDetectorME.train method
  • Save the SentenceModel to a file, or use it directly
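A minimal training sketch that follows these three steps, assuming the OpenNLP 1.5.x training API (the training file en-sent.train and output file my-sent.bin are illustrative names, and the exact SentenceDetectorME.train overload may vary slightly between releases):

// Sketch: train a new sentence detector model (requires imports from
// opennlp.tools.sentdetect, opennlp.tools.util, java.io and java.nio.charset)
Charset charset = Charset.forName("UTF-8");
ObjectStream<String> lineStream =
        new PlainTextByLineStream(new FileInputStream("en-sent.train"), charset);
ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);

SentenceModel model;
try {
    // language code, sample stream, useTokenEnd, abbreviation dictionary, training parameters
    model = SentenceDetectorME.train("en", sampleStream, true, null,
            TrainingParameters.defaultParams());
} finally {
    sampleStream.close();
}

// Persist the model so it can later be reloaded with new SentenceModel(InputStream)
OutputStream modelOut = new BufferedOutputStream(new FileOutputStream("my-sent.bin"));
try {
    model.serialize(modelOut);
} finally {
    modelOut.close();
}

The code below, by contrast, loads the pre-trained en-sent.bin model and uses it for detection.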

Code implementation:

/**
 * 1. Sentence Detector
 * Detects sentence boundaries. Given the paragraph:
 *   Hi. How are you? This is Mike.
 * it returns:
 *   Hi. How are you?
 *   This is Mike.
 * @throws IOException
 * @throws InvalidFormatException
 */
public static void SentenceDetector(String str) throws InvalidFormatException, IOException
{
    // Always start with a model; a model is learned from training data
    InputStream is = new FileInputStream("./nlpbin/en-sent.bin");
    SentenceModel model = new SentenceModel(is);
    SentenceDetectorME sdetector = new SentenceDetectorME(model);

    String sentences[] = sdetector.sentDetect(str);

    // Print every detected sentence
    for (String sentence : sentences)
        System.out.println(sentence);
    is.close();
    System.out.println("---------------1------------");
}

Run results:

3 Tokenizer

Function introduction: The OpenNLP tokenizer segments an input character sequence into tokens. Tokens are usually words separated by spaces, but there are exceptions: for example, “isn’t” is split into “is” and “n’t”, since it is a short form of “is not”. Tokens are typically words, punctuation marks, numbers, and so on. OpenNLP provides several tokenizer implementations:

  • Whitespace tokenizer: treats every non-whitespace character sequence as a token
  • Simple tokenizer: a character-class tokenizer; sequences of the same character class form a token
  • Learnable tokenizer: a maximum-entropy tokenizer that detects token boundaries with a probability model

API: The tokenizers can be integrated into an application through the API. The shared instance of WhitespaceTokenizer can be retrieved from the static field WhitespaceTokenizer.INSTANCE, and the shared instance of SimpleTokenizer likewise from SimpleTokenizer.INSTANCE. To instantiate TokenizerME (the learnable tokenizer), a token model must be created first.
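Before the model-based tokenizer is shown, here is a minimal sketch of the two rule-based tokenizers, which need no model file (the sample sentence is only an illustration):

// WhitespaceTokenizer splits on whitespace only, so punctuation stays attached:
String[] byWhitespace = WhitespaceTokenizer.INSTANCE.tokenize("Hi. How are you? This is Mike.");
// -> "Hi." "How" "are" "you?" "This" "is" "Mike."

// SimpleTokenizer splits by character class, so punctuation becomes separate tokens:
String[] byCharClass = SimpleTokenizer.INSTANCE.tokenize("Hi. How are you? This is Mike.");
// -> "Hi" "." "How" "are" "you" "?" "This" "is" "Mike" "."

The learnable TokenizerME, shown below, additionally requires the en-token.bin model.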

Code implementation:

/**
 * 2. Tokenizer
 * Tokens are usually words separated by spaces, but there are exceptions. For example,
 * "isn't" gets split into "is" and "n't", since it is a short form of "is not".
 * @param str
 */
public static void Tokenize(String str) throws InvalidFormatException, IOException {
    InputStream is = new FileInputStream("./nlpbin/en-token.bin");
    TokenizerModel model = new TokenizerModel(is);
    Tokenizer tokenizer = new TokenizerME(model);
    String tokens[] = tokenizer.tokenize(str);
    for (String a : tokens)
        System.out.println(a);
    is.close();
    System.out.println("--------------2-------------");
}

Run result:

4 Name Finder

Function introduction: The name finder detects named entities and numbers in text. To detect entities, a name finder model is required; the model depends on the language and the entity type it was trained for. The OpenNLP project provides a number of pre-trained name finder models trained on various freely available corpora, which can be downloaded from the model download page. To run the name finder on raw text, the text must first be split into sentences and tokens, as described in the sentence detector and tokenizer sections. It is important that the tokenization of the training data and of the input text is the same. Depending on the model, entities such as person names and place names can be found.

API: It is recommended to use the training API instead of the command line tool to train the name finder from the application. Three basic steps are necessary to train it:

  • The application must open a sample data stream
  • Call the NameFinderME.train method
  • Save the TokenNameFinderModel to a file or database

Code implementation:

/**
 * 3. Name Finder
 * As its name suggests, the name finder finds names in context. It accepts an
 * array of tokens and finds the names inside it.
 */
public static void findName() throws IOException {
    InputStream is = new FileInputStream("./nlpbin/en-ner-person.bin");
    TokenNameFinderModel model = new TokenNameFinderModel(is);
    is.close();
    NameFinderME nameFinder = new NameFinderME(model);
    String[] sentence = new String[]{
        "Mike",
        "Tom",
        "Smith",
        "is",
        "a",
        "good",
        "person"
    };
    Span nameSpans[] = nameFinder.find(sentence);
    for (Span s : nameSpans)
        System.out.println(s.toString());
    System.out.println("--------------3-------------");
}

Run result:
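As a side note (not part of the original method), the Span objects printed above only carry token indexes and the entity type; a small sketch using the Span utility converts them back into the matched name strings:

// Turn each detected span back into the covered tokens, i.e. the name itself
String[] names = Span.spansToStrings(nameSpans, sentence);
for (String name : names)
    System.out.println(name);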

5 POS tagger

Function introduction: The part-of-speech tagger marks tokens with their corresponding word type, based on the token itself and its context. A token may have several possible POS tags depending on the token and the context. The OpenNLP POS tagger uses a probability model to predict the correct POS tag from the tag set. To limit the possible tags for a token, a tag dictionary can be used, which improves the tagging and runtime performance of the tagger.

API: The part-of-speech tagger training API supports training a new POS model. Three basic steps are necessary to train it (a sketch follows the list below):

  • The application must open a sample data stream
  • Call the POSTaggerME.train method
  • Save the POSModel to a file or database
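A rough training sketch for these steps, again assuming the 1.5.x training API (en-pos.train and my-pos-maxent.bin are illustrative file names, the training data is expected in word_TAG format, and the exact POSTaggerME.train overload may differ between releases):

// Sketch: train a new maximum-entropy POS model (assumes OpenNLP 1.5.x APIs)
InputStream dataIn = new FileInputStream("en-pos.train");
ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);

// language code, samples, training parameters, tag dictionary, ngram dictionary
POSModel model = POSTaggerME.train("en", sampleStream,
        TrainingParameters.defaultParams(), null, null);
dataIn.close();

// Persist the model so it can later be reloaded with new POSModel(InputStream)
OutputStream modelOut = new BufferedOutputStream(new FileOutputStream("my-pos-maxent.bin"));
model.serialize(modelOut);
modelOut.close();

The two code implementations below instead load the pre-trained en-pos-maxent.bin model and use it for tagging.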

Code implementation 1:

/**
 * 4. POS Tagger
 * Example output: Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NNP
 * @param str
 */
public static void POSTag(String str) throws IOException {
    POSModel model = new POSModelLoader().load(new File("./nlpbin/en-pos-maxent.bin"));
    PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent"); // reports throughput
    POSTaggerME tagger = new POSTaggerME(model);
    ObjectStream<String> lineStream = new PlainTextByLineStream(new StringReader(str));
    perfMon.start();
    String line;
    while ((line = lineStream.read()) != null) {
        String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE.tokenize(line);
        String[] tags = tagger.tag(whitespaceTokenizerLine);
        POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
        System.out.println(sample.toString());
        perfMon.incrementCounter();
    }
    perfMon.stopAndPrintFinalResult();
    System.out.println("--------------4-------------");
}

Run result 1:

Code Implementation 2:


/**
 * Example of the OpenNLP part-of-speech tagger: maximum entropy tagger (en-pos-maxent)
 * JJ adjective, JJS adjective superlative, JJR adjective comparative
 * RB adverb, RBS adverb superlative, RBR adverb comparative
 * DT determiner
 * NN noun, NNS noun plural, NNP proper noun, NNPS proper noun plural
 * PRP personal pronoun, PRP$ possessive pronoun
 * VB verb base form, VBD past tense, VBN past participle, VBZ present third-person singular,
 * VBP present non-third-person, VBG gerund or present participle
 */
public static void POSMaxent(String str) throws InvalidFormatException, IOException
{
    // Path to the part-of-speech model
    File posModeFile = new File("./nlpbin/en-pos-maxent.bin");
    FileInputStream posModeStream = new FileInputStream(posModeFile);
    POSModel model = new POSModel(posModeStream);
    // Split the sentence into words
    POSTaggerME tagger = new POSTaggerME(model);
    String[] words = SimpleTokenizer.INSTANCE.tokenize(str);
    // Pass the tokenized sentence to the tagger
    String[] result = tagger.tag(words);
    // Print each word with its predicted tag
    for (int i = 0; i < words.length; i++) {
        System.out.print(words[i] + "/" + result[i] + " ");
    }
    System.out.println();
}

  Run result 2:

6 Chunker

Function introduction: Chunking divides text into syntactically related groups of words, such as noun groups and verb groups, but does not specify their internal structure or their role in the main sentence.

API: The chunker also provides an API to train a new chunker model. The following sample code demonstrates how to use the chunker with a pre-trained model:

Code implementation:

/**
 * 5. Chunker
 * Partitions a sentence into a set of chunks (e.g. noun and verb groups), using the
 * tokens produced by the tokenizer and the tags produced by the POS tagger.
 * @param str
 */
public static void chunk(String str) throws IOException {
    POSModel model = new POSModelLoader().load(new File("./nlpbin/en-pos-maxent.bin"));
    POSTaggerME tagger = new POSTaggerME(model);
    ObjectStream<String> lineStream = new PlainTextByLineStream(new StringReader(str));
    String line;
    String whitespaceTokenizerLine[] = null;
    String[] tags = null;
    while ((line = lineStream.read()) != null) {
        whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE.tokenize(line);
        tags = tagger.tag(whitespaceTokenizerLine);
        POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
        System.out.println(sample.toString());
    }

    // Chunker: assign a chunk label (B-NP, I-NP, B-VP, ...) to every token
    InputStream is = new FileInputStream("./nlpbin/en-chunker.bin");
    ChunkerModel cModel = new ChunkerModel(is);
    ChunkerME chunkerME = new ChunkerME(cModel);
    String result[] = chunkerME.chunk(whitespaceTokenizerLine, tags);
    for (String s : result)
        System.out.println(s);
    Span[] span = chunkerME.chunkAsSpans(whitespaceTokenizerLine, tags);
    for (Span s : span)
        System.out.println(s.toString());
    System.out.println("--------------5-------------");
}

Run result:
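Similarly to the name finder, the Span objects returned by chunkAsSpans can be converted into phrase strings; a small illustrative sketch (not part of the original method):

// Print each chunk (e.g. a noun group or verb group) as a phrase string
String[] phrases = Span.spansToStrings(span, whitespaceTokenizerLine);
for (String phrase : phrases)
    System.out.println(phrase);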

7 Parser

Feature introduction: The easiest way to try out the parser is the command line tool, which is intended only for demonstration and testing. Download the English chunking parser model from the website and start the parsing tool with a command like the one shown below.
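A possible command-line invocation, assuming the en-parser-chunking.bin model and a file of tokenized sentences (the file names are illustrative):

$ opennlp Parser en-parser-chunking.bin < sentences.txt > sentences.parsed.txt

The Java equivalent is shown in the code implementation below.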

Code implementation:

/**
 * 6. Parser
 * Given the sentence "Programcreek is a very huge and useful website.", the parser returns:
 * (TOP (S (NP (NN Programcreek)) (VP (VBZ is) (NP (DT a) (ADJP (RB very) (JJ huge) (CC and) (JJ useful)))) (. website.)))
 * (TOP
 *   (S
 *     (NP
 *       (NN Programcreek)
 *     )
 *     (VP
 *       (VBZ is)
 *       (NP
 *         (DT a)
 *         (ADJP
 *           (RB very)
 *           (JJ huge)
 *           (CC and)
 *           (JJ useful)
 *         )
 *       )
 *     )
 *     (. website.)
 *   )
 * )
 */
public static void Parse() throws InvalidFormatException, IOException {
    // http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Parser#Training_Tool
    InputStream is = new FileInputStream("./nlpbin/en-parser-chunking.bin");
    ParserModel model = new ParserModel(is);
    Parser parser = ParserFactory.create(model);
    String sentence = "Programcreek is a very huge and useful website.";
    opennlp.tools.parser.Parse topParses[] = ParserTool.parseLine(sentence, parser, 1);
    for (opennlp.tools.parser.Parse p : topParses)
        p.show();
    is.close();
}

Run result:

8 References

1 Official tutorial: Apache OpenNLP Developer Documentation

2 The various models in OpenNLP

3 The OpenNLP open-source toolkit

4 Wikipedia: OpenNLP

5 Taming Text, Chapter 2, Section 2

6 OpenNLP tool source code share: access password 37f6

7 Models (bin files) used in this article: access password 1d65

