class |
ClassicTokenizerDescriptor |
This tokenizer has heuristics for special treatment of acronyms, company names, email addresses, and internet host
names.
|
class |
KeywordTokenizerDescriptor |
The keyword tokenizer is a “no-op” tokenizer that accepts whatever text it is given and outputs the exact same text
as a single term.
|
class |
LetterTokenizerDescriptor |
The letter tokenizer breaks text into terms whenever it encounters a character that is not a letter.
|
class |
NGramTokenizerDescriptor |
A tokenizer that produces a stream of n-grams.
|
class |
PathHierarchyTokenizerDescriptor |
A tokenizer for path-like hierarchies: it splits the input on a path separator and emits one term for each level of the hierarchy.
|
class |
PatternTokenizerDescriptor |
The pattern tokenizer uses a regular expression to either split text into terms whenever it matches a word separator,
or to capture matching text as terms.
|
class |
StandardTokenizerDescriptor |
A standard tokenizer based on the Unicode Text Segmentation algorithm (Unicode Standard Annex #29).
|
class |
UAXURLEmailTokenizerDescriptor |
This tokenizer is like the standard tokenizer, except that it recognises URLs and email addresses as single tokens.
|
class |
WhitespaceTokenizerDescriptor |
The whitespace tokenizer breaks text into terms whenever it encounters a whitespace character.
|
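To make the descriptions above concrete, the following is a minimal Python sketch of the behaviour several of these tokenizers are described as having. It is not the library's API: every function name, parameter name, and default value here (`min_gram`, `max_gram`, `delimiter`, `capture`) is hypothetical and chosen for illustration only, and the real tokenizers handle many more cases than these approximations do.

```python
import re


def keyword_tokenize(text):
    """Keyword: emit the whole input unchanged as a single term."""
    return [text]


def letter_tokenize(text):
    """Letter: start a new term whenever a non-letter character is seen."""
    terms, current = [], []
    for ch in text:
        if ch.isalpha():
            current.append(ch)
        elif current:
            terms.append("".join(current))
            current = []
    if current:
        terms.append("".join(current))
    return terms


def whitespace_tokenize(text):
    """Whitespace: split on runs of whitespace characters."""
    return text.split()


def ngram_tokenize(text, min_gram=1, max_gram=2):
    """N-gram: emit every substring of length min_gram..max_gram.

    The default lengths are an assumption made for this sketch.
    """
    return [
        text[i:i + n]
        for i in range(len(text))
        for n in range(min_gram, max_gram + 1)
        if i + n <= len(text)
    ]


def path_hierarchy_tokenize(path, delimiter="/"):
    """Path hierarchy: emit one term per level of the path."""
    if not path:
        return []
    terms = []
    pos = path.find(delimiter, 1)
    while pos != -1:
        terms.append(path[:pos])
        pos = path.find(delimiter, pos + 1)
    terms.append(path)
    return terms


def pattern_tokenize(text, pattern=r"\W+", capture=False):
    """Pattern: split on matches, or (capture=True) keep the matches as terms."""
    if capture:
        return [m.group(0) for m in re.finditer(pattern, text)]
    return [t for t in re.split(pattern, text) if t]


if __name__ == "__main__":
    print(keyword_tokenize("New York"))                 # ['New York']
    print(letter_tokenize("The 2 QUICK Brown-Foxes"))   # ['The', 'QUICK', 'Brown', 'Foxes']
    print(whitespace_tokenize("The 2 QUICK Brown-Foxes"))
    print(ngram_tokenize("fox"))                        # ['f', 'fo', 'o', 'ox', 'x']
    print(path_hierarchy_tokenize("/one/two/three"))
    print(pattern_tokenize("foo,bar baz"))
```

Comparing `letter_tokenize` with `whitespace_tokenize` on the same input shows the practical difference between the two: the letter variant drops the digit and splits `Brown-Foxes` at the hyphen, while the whitespace variant keeps both intact.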