Package org.apache.lucene.analysis.standard

The org.apache.lucene.analysis.standard package contains three fast grammar-based tokenizers constructed with JFlex:

StandardTokenizer: as of Lucene 3.1, implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. Unlike UAX29URLEmailTokenizer, it does not keep URLs and email addresses as single tokens; they are split into tokens according to the UAX#29 word break rules.
StandardAnalyzer includes StandardTokenizer, StandardFilter, LowerCaseFilter and StopFilter. When the Version specified in the constructor is lower than 3.1, the ClassicTokenizer implementation is invoked.
ClassicTokenizer: this class was named StandardTokenizer prior to Lucene 3.1. Its tokenization rules are not based on the Unicode Text Segmentation algorithm. ClassicAnalyzer includes ClassicTokenizer, ClassicFilter, LowerCaseFilter and StopFilter.
UAX29URLEmailTokenizer: implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
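The analyzers above chain a tokenizer with a sequence of filters. The following is a minimal conceptual sketch of that chain in plain Java. It does not use any Lucene classes: `AnalyzerChainSketch`, its methods and its stop list are hypothetical names introduced for illustration, the JDK's word `BreakIterator` stands in for the tokenizer (its boundaries are not guaranteed to match the UAX#29 rules that StandardTokenizer implements), and the token normalization step performed by StandardFilter or ClassicFilter is omitted.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalyzerChainSketch {
    // Hypothetical stop list for illustration; this is not Lucene's English stop set.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    /** Stage 1 (tokenizer stand-in): split at word boundaries, drop segments without a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            if (containsLetterOrDigit(segment)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    /** Returns true if any code point in s is a letter or a digit. */
    static boolean containsLetterOrDigit(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    /** Stages 2 and 3 (filter stand-ins): lowercase each token, then remove stop words. */
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<String>();
        for (String token : tokenize(text)) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                out.add(lower);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick Brown Fox")); // prints [quick, brown, fox]
    }
}
```

Each stage consumes the output of the previous one, which mirrors the order in which the package description lists the components: tokenizer first, then lowercasing, then stop word removal.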

Classes

ClassicAnalyzer	Filters `ClassicTokenizer` with `ClassicFilter`, `LowerCaseFilter` and `StopFilter`, using a list of English stop words.
ClassicFilter	Normalizes tokens extracted with `ClassicTokenizer`.
ClassicTokenizer	A grammar-based tokenizer constructed with JFlex. This should be a good tokenizer for most European-language documents: it splits words at punctuation characters, removing punctuation.
StandardAnalyzer	Filters `StandardTokenizer` with `StandardFilter`, `LowerCaseFilter` and `StopFilter`, using a list of English stop words.
StandardFilter	Normalizes tokens extracted with `StandardTokenizer`.
StandardTokenizer	A grammar-based tokenizer constructed with JFlex.
StandardTokenizerImpl	This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. Tokens produced are of the following types: <ALPHANUM>, a sequence of alphabetic and numeric characters; <NUM>, a number; <SOUTHEAST_ASIAN>, a sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer; <IDEOGRAPHIC>, a single CJKV ideographic character; <HIRAGANA>, a single hiragana character.
UAX29URLEmailTokenizer	This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
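The token types listed for StandardTokenizerImpl can be approximated with a rough classification by Unicode script. The sketch below is an assumption-laden illustration in plain Java, not the JFlex grammar that Lucene uses: `TokenTypeSketch` and `typeOf` are hypothetical names, only the four scripts named above are mapped to <SOUTHEAST_ASIAN>, and <NUM> is assigned only to tokens made up entirely of digits, which is narrower than a real number grammar (a token such as "3.14" falls through to <ALPHANUM> here).

```java
import java.lang.Character.UnicodeScript;

public class TokenTypeSketch {
    /** Returns an approximate type label for an already-extracted token. */
    static String typeOf(String token) {
        boolean allDigits = !token.isEmpty();
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            // The script of the first matching code point decides the type.
            switch (UnicodeScript.of(cp)) {
                case HAN:
                    return "<IDEOGRAPHIC>";
                case HIRAGANA:
                    return "<HIRAGANA>";
                case THAI:
                case LAO:
                case MYANMAR:
                case KHMER:
                    return "<SOUTHEAST_ASIAN>";
                default:
                    break;
            }
            if (!Character.isDigit(cp)) {
                allDigits = false;
            }
            i += Character.charCount(cp);
        }
        // Simplification: digits only means <NUM>, anything else is <ALPHANUM>.
        return allDigits ? "<NUM>" : "<ALPHANUM>";
    }

    public static void main(String[] args) {
        System.out.println(typeOf("lucene31")); // prints <ALPHANUM>
        System.out.println(typeOf("2011"));     // prints <NUM>
        System.out.println(typeOf("\u4e2d"));   // prints <IDEOGRAPHIC>
        System.out.println(typeOf("\u3042"));   // prints <HIRAGANA>
    }
}
```

In this sketch the type is decided after the token has been extracted. A grammar-based tokenizer determines the type as part of matching the token, so the two can disagree on mixed-script input.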