com.atlassian.confluence.search.lucene.extractor
Class HTMLSearchableTextExtractor

java.lang.Object
  extended by com.atlassian.confluence.search.lucene.extractor.HTMLSearchableTextExtractor

public final class HTMLSearchableTextExtractor
extends java.lang.Object

A utility class that takes a String formatted as HTML and removes all tags and attributes, leaving only the text nodes and CDATA content intact. When link tags are stripped, key attributes (such as content-title) replace the tag rather than the tag being removed entirely. Inline elements are simply stripped; the start of a block element such as 'p' is replaced with a newline.

The tag stripper also knows which elements in the Confluence schema should be removed entirely for indexing.
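
The behaviour described above can be illustrated with a small, self-contained sketch. This is not the Confluence implementation: the class name TagStripSketch, the BLOCK set, and the reading of elementsToIgnore as "elements dropped together with their content" are all assumptions made for illustration. It uses only the JDK's SAX parser, so the input must be well-formed XHTML.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for illustration only; NOT HTMLSearchableTextExtractor. */
final class TagStripSketch {

    // Assumption: a small set of block-level elements whose start becomes a newline.
    private static final Set<String> BLOCK =
            new HashSet<>(Arrays.asList("p", "div", "li", "h1", "h2", "h3"));

    static String stripTags(String xhtml, String[] elementsToIgnore) throws SAXException {
        final Set<String> ignored = new HashSet<>(Arrays.asList(elementsToIgnore));
        final StringBuilder out = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            // Greater than zero while inside an ignored element (or its descendants).
            private int ignoreDepth = 0;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (ignoreDepth > 0 || ignored.contains(qName)) {
                    ignoreDepth++;
                    return;
                }
                if (BLOCK.contains(qName)) {
                    out.append('\n'); // start of a block element becomes a newline
                }
                // A key attribute replaces the stripped tag instead of vanishing with it.
                String title = atts.getValue("content-title");
                if (title != null) {
                    out.append(title);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (ignoreDepth > 0) {
                    ignoreDepth--;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text nodes and CDATA content both arrive here.
                if (ignoreDepth == 0) {
                    out.append(ch, start, length);
                }
            }
        };

        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // Wrap the fragment in a synthetic root so several top-level elements parse.
            String doc = "<root>" + xhtml + "</root>";
            parser.parse(new InputSource(new StringReader(doc)), handler);
        } catch (ParserConfigurationException | IOException e) {
            throw new SAXException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) throws SAXException {
        String html = "<p>Hello <b>world</b></p><script>var x;</script>";
        System.out.println(stripTags(html, new String[] {"script"}));
    }
}
```

In this sketch the inline 'b' element disappears without trace, each 'p' contributes a leading newline, and everything inside 'script' is dropped. The real class may use a more lenient HTML parser and a different set of block and ignored elements; only the stripTags signatures and the SAXException in the method details are taken from this page.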


Constructor Summary
HTMLSearchableTextExtractor()
           
 
Method Summary
static java.lang.String stripTags(java.lang.String htmlSource)
           
static java.lang.String stripTags(java.lang.String pageTitle, java.lang.String htmlSource)
           
static java.lang.String stripTags(java.lang.String htmlSource, java.lang.String[] elementsToIgnore)
           
static java.lang.String stripTags(java.lang.String pageTitle, java.lang.String htmlSource, java.lang.String[] elementsToIgnore)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HTMLSearchableTextExtractor

public HTMLSearchableTextExtractor()
Method Detail

stripTags

public static java.lang.String stripTags(java.lang.String htmlSource)
                                  throws org.xml.sax.SAXException
Throws:
org.xml.sax.SAXException

stripTags

public static java.lang.String stripTags(java.lang.String htmlSource,
                                         java.lang.String[] elementsToIgnore)
                                  throws org.xml.sax.SAXException
Throws:
org.xml.sax.SAXException

stripTags

public static java.lang.String stripTags(java.lang.String pageTitle,
                                         java.lang.String htmlSource)
                                  throws org.xml.sax.SAXException
Throws:
org.xml.sax.SAXException

stripTags

public static java.lang.String stripTags(java.lang.String pageTitle,
                                         java.lang.String htmlSource,
                                         java.lang.String[] elementsToIgnore)
                                  throws org.xml.sax.SAXException
Throws:
org.xml.sax.SAXException


Copyright © 2003-2014 Atlassian. All Rights Reserved.