Package org.apache.lucene.analysis.standard

The org.apache.lucene.analysis.standard package contains three fast grammar-based tokenizers constructed with JFlex.

Classes

ClassicAnalyzer  Filters ClassicTokenizer with ClassicFilter, LowerCaseFilter and StopFilter, using a list of English stop words.

ClassicFilter  Normalizes tokens extracted with ClassicTokenizer.

ClassicTokenizer  A grammar-based tokenizer constructed with JFlex. This should be a good tokenizer for most European-language documents:

  • Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
  • Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
  • Recognizes email addresses and internet hostnames as one token.
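
For illustration, here is a minimal sketch of driving ClassicTokenizer directly. It assumes the Lucene 5.x+ API, where a tokenizer is built with a no-argument constructor and fed input via setReader(); in 4.x and earlier the constructor takes a Version and a Reader instead.

    import org.apache.lucene.analysis.standard.ClassicTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    import java.io.IOException;
    import java.io.StringReader;

    public class ClassicTokenizerDemo {
        public static void main(String[] args) throws IOException {
            ClassicTokenizer tokenizer = new ClassicTokenizer();
            tokenizer.setReader(new StringReader("red, green; blue!"));
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                // Punctuation is dropped: prints "red", "green", "blue"
                System.out.println(term.toString());
            }
            tokenizer.end();
            tokenizer.close();
        }
    }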
StandardAnalyzer  Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.

StandardFilter  Normalizes tokens extracted with StandardTokenizer.

StandardTokenizer  A grammar-based tokenizer constructed with JFlex.

StandardTokenizerImpl  This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.

Tokens produced are of the following types:
  • <ALPHANUM>: A sequence of alphabetic and numeric characters
  • <NUM>: A number
  • <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
  • <IDEOGRAPHIC>: A single CJKV ideographic character
  • <HIRAGANA>: A single hiragana character
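
As an illustration of these types, the following sketch runs StandardAnalyzer's full chain (StandardTokenizer plus the filters described above) and prints each term with its type. It assumes the Lucene 5.x+ no-argument constructor; older releases require a Version argument.

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

    import java.io.IOException;

    public class StandardAnalyzerDemo {
        public static void main(String[] args) throws IOException {
            try (Analyzer analyzer = new StandardAnalyzer();
                 TokenStream ts = analyzer.tokenStream("body", "Lucene 4.2 rocks")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                TypeAttribute type = ts.addAttribute(TypeAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Prints: "lucene <ALPHANUM>", "4.2 <NUM>", "rocks <ALPHANUM>"
                    // (terms are lowercased by LowerCaseFilter)
                    System.out.println(term + " " + type.type());
                }
                ts.end();
            }
        }
    }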
 
UAX29URLEmailTokenizer  This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
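
For example (with the same Lucene 5.x+ construction caveat as the sketches above), the following shows that an email address and a URL each survive as a single token, tagged <EMAIL> and <URL> respectively:

    import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

    import java.io.IOException;
    import java.io.StringReader;

    public class UAX29URLEmailDemo {
        public static void main(String[] args) throws IOException {
            UAX29URLEmailTokenizer tokenizer = new UAX29URLEmailTokenizer();
            tokenizer.setReader(new StringReader(
                    "Mail jane.doe@example.com or see https://example.com/docs"));
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                // "jane.doe@example.com" comes out as one <EMAIL> token,
                // "https://example.com/docs" as one <URL> token
                System.out.println(term + " " + type.type());
            }
            tokenizer.end();
            tokenizer.close();
        }
    }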