Class DuplicateNestedTagsRemoverImpl

  • All Implemented Interfaces:
    DuplicateNestedTagsRemover

    public class DuplicateNestedTagsRemoverImpl
    extends Object
    implements DuplicateNestedTagsRemover

    Removes all duplicate nested tags. See the corresponding ticket, CONFSERVER-54754.

    The responsibility of this class is to:
    • Convert the input string to a stream of XML events.
    • For every top-level tree, call the parser that builds a special tree convenient for analysis.
    • For every top-level tree, call the analyser that removes duplicates.
    • Convert the final list of XML events back to a regular string and return it.

    Note that a Confluence document can have multiple roots, so each tree has to be analysed independently.
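
    The first step (splitting a multi-root document into independent top-level trees) could be sketched roughly as follows. This is an illustration only, not the class's actual code: it uses the standard StAX API, the class name TopLevelTreeSplitter is hypothetical, and wrapping the input in a synthetic <wrapper> element is an assumption made so that a standard parser accepts several roots.

```java
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Confluence API.
final class TopLevelTreeSplitter {

    /** Returns one list of XML events per top-level element of the fragment. */
    static List<List<XMLEvent>> split(String fragment) {
        // Assumption: a synthetic root lets a standard parser read several roots.
        String wrapped = "<wrapper>" + fragment + "</wrapper>";
        List<List<XMLEvent>> trees = new ArrayList<>();
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(wrapped));
            List<XMLEvent> current = null;
            int depth = 0;
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    depth++;
                    if (depth == 2) {
                        // Depth 1 is the synthetic wrapper, depth 2 is a real root.
                        current = new ArrayList<>();
                    }
                }
                if (current != null) {
                    current.add(event);
                }
                if (event.isEndElement()) {
                    if (depth == 2) {
                        trees.add(current);
                        current = null;
                    }
                    depth--;
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Invalid XML fragment", e);
        }
        return trees;
    }
}
```

    For brevity, this sketch drops text that sits between top-level elements; a real implementation would have to preserve it so that the output can be serialised back unchanged.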

    To mitigate the risk of removing data other than duplicate tags, the algorithm has a few limitations:

    • Only elements at a nesting level of 4 (by default) are analysed. This helps avoid problems when users have added two similar tags intentionally.
    • Only tags from the allowed list are analysed. By default, those tags are span and div.
    • The algorithm analyses repeated groups of one, two, or three tags. For performance reasons, groups of four or more elements are not analysed. Note that customers have only been affected by single or double duplicate tags.

    Note that all the parameters above are configurable via system variables.
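
    To make the allowed-tag and group-size limits concrete, here is a minimal sketch of collapsing repeated groups along one nested branch. It is not the actual analyser: the class RepeatedGroupCollapser and its parameters are hypothetical, a branch is modelled simply as a list of tag names (outermost first), attributes are ignored, and the nesting-level threshold is not modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the Confluence API.
final class RepeatedGroupCollapser {

    /**
     * Collapses consecutive repeats of groups of 1..maxGroupSize tag names.
     * Groups that contain a tag outside allowedTags are left untouched.
     */
    static List<String> collapse(List<String> chain, Set<String> allowedTags, int maxGroupSize) {
        List<String> result = new ArrayList<>(chain);
        for (int size = 1; size <= maxGroupSize; size++) {
            int i = 0;
            while (i + 2 * size <= result.size()) {
                List<String> group = result.subList(i, i + size);
                List<String> next = result.subList(i + size, i + 2 * size);
                boolean duplicate = allowedTags.containsAll(group) && group.equals(next);
                if (duplicate) {
                    // Drop the repeated copy and re-check the same position.
                    result.subList(i + size, i + 2 * size).clear();
                } else {
                    i++;
                }
            }
        }
        return result;
    }
}
```

    With the default allowed tags and a maximum group size of 3, the chain span, span, span, div collapses to span, div, and div, span, div, span, p collapses to div, span, p. A chain of p, p is left alone because p is not in the allowed list.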

    Since:
    7.19.14
    • Constructor Detail

      • DuplicateNestedTagsRemoverImpl

        public DuplicateNestedTagsRemoverImpl(XmlOutputFactory xmlFragmentOutputFactory,
                                              XmlEventReaderFactory xmlEventReaderFactory,
                                              com.atlassian.confluence.impl.content.duplicatetags.internal.SingleXmlBranchReader singleXmlBranchReader,
                                              com.atlassian.confluence.impl.content.duplicatetags.internal.SingleXmlBranchDuplicateAnalyser singleXmlBranchDuplicateAnalyser,
                                              boolean disableAlgorithm)
    • Method Detail

      • cleanQuietly

        public String cleanQuietly(String inputXml)
        Removes all nested duplicates and returns the cleaned-up text. Can process an XML document with multiple root nodes (Confluence storage format). For each top-level tree, it creates an instance of SingleXmlBranchProcessor, which analyses a single-root tree and removes the duplicates. If an exception occurs, for example because the input XML is invalid, it returns the original text untouched. Never throws exceptions.
        Specified by:
        cleanQuietly in interface DuplicateNestedTagsRemover
        Parameters:
        inputXml - input xml as string
        Returns:
        cleaned up xml as string
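
        The "never throws" contract described above can be illustrated with a small sketch. It is not the real implementation: the class QuietCleaner and the injected cleaner function are hypothetical, and treating the disabled flag as "return the input unchanged" is an assumption about what disableAlgorithm means.

```java
import java.util.function.UnaryOperator;

// Hypothetical stand-in that shows the fail-safe pattern only.
final class QuietCleaner {
    private final UnaryOperator<String> cleaner;
    private final boolean disabled;

    QuietCleaner(UnaryOperator<String> cleaner, boolean disabled) {
        this.cleaner = cleaner;
        this.disabled = disabled;
    }

    /** Returns the cleaned text, or the original input if cleaning fails. */
    String cleanQuietly(String inputXml) {
        if (disabled) {
            // Assumption: a disabled algorithm passes the input through.
            return inputXml;
        }
        try {
            return cleaner.apply(inputXml);
        } catch (Exception e) {
            // Any failure (for example invalid XML) falls back to the input.
            return inputXml;
        }
    }
}
```

        A caller can therefore pass content through without its own try/catch, because the worst case is that the duplicates stay in place.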