Package org.apache.lucene.analysis.cn

Analyzer for Chinese, which indexes unigrams (individual Chinese characters).

Three analyzers are provided for Chinese, each of which treats Chinese text in a different way.

  • StandardAnalyzer: Index unigrams (individual Chinese characters) as tokens.
  • CJKAnalyzer (in the analyzers/cjk package): Index bigrams (overlapping groups of two adjacent Chinese characters) as tokens.
  • SmartChineseAnalyzer (in the analyzers/smartcn package): Index words (attempt to segment Chinese text into words) as tokens.

Example phrase: "我是中国人"
  1. StandardAnalyzer: 我-是-中-国-人
  2. CJKAnalyzer: 我是-是中-中国-国人
  3. SmartChineseAnalyzer: 我-是-中国-人
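
The three tokenization strategies above can be illustrated with a small, library-free sketch. It does not use the Lucene API: the class name ChineseTokenSketch, its methods, and the one-entry dictionary are all hypothetical names introduced for illustration. The word-level method is a greedy longest-match dictionary lookup, which is only a stand-in for real word segmentation and is not how SmartChineseAnalyzer is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative only: plain Java, no Lucene classes involved.
class ChineseTokenSketch {

    // Unigrams: one token per Unicode code point.
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints()
            .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    // Bigrams: overlapping pairs of adjacent characters.
    // In this sketch, input shorter than two characters yields no tokens.
    static List<String> bigrams(String text) {
        List<String> chars = unigrams(text);
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < chars.size(); i++) {
            tokens.add(chars.get(i) + chars.get(i + 1));
        }
        return tokens;
    }

    // Words: greedy longest match against a caller-supplied dictionary
    // (an assumption of this sketch). Characters that start no known
    // word are emitted as single-character tokens.
    static List<String> words(String text, Set<String> dictionary, int maxWordLength) {
        List<String> chars = unigrams(text);
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < chars.size()) {
            int matched = 1;
            int longest = Math.min(maxWordLength, chars.size() - i);
            for (int len = longest; len > 1; len--) {
                String candidate = String.join("", chars.subList(i, i + len));
                if (dictionary.contains(candidate)) {
                    matched = len;
                    break;
                }
            }
            tokens.add(String.join("", chars.subList(i, i + matched)));
            i += matched;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String phrase = "我是中国人";
        System.out.println(String.join("-", unigrams(phrase)));
        System.out.println(String.join("-", bigrams(phrase)));
        // Hypothetical one-word dictionary, chosen to match the example.
        System.out.println(String.join("-", words(phrase, Set.of("中国"), 2)));
    }
}
```

Run on the example phrase, the three calls in main produce the same three token sequences listed above. The trade-off is visible even at this scale: unigrams and bigrams need no linguistic resources, while word-level output depends entirely on what the dictionary (or, in a real segmenter, the statistical model) knows.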

Classes

  • ChineseAnalyzer: Deprecated. Use StandardAnalyzer instead, which has the same functionality. This analyzer will be removed in Lucene 5.0.
  • ChineseFilter: Deprecated. Use StopFilter instead, which has the same functionality. This filter will be removed in Lucene 5.0.
  • ChineseTokenizer: Deprecated. Use StandardTokenizer instead, which has the same functionality. This tokenizer will be removed in Lucene 5.0.