Class BM25LSimilarity


  • public class BM25LSimilarity
    extends org.apache.lucene.search.similarities.Similarity
    Extension of BM25 which shifts the term frequency normalization formula to boost scores of very long documents.

    Moved from the confluence-search plugin into core.

    • Nested Class Summary

      • Nested classes/interfaces inherited from class org.apache.lucene.search.similarities.Similarity

        org.apache.lucene.search.similarities.Similarity.SimScorer, org.apache.lucene.search.similarities.Similarity.SimWeight
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected boolean discountOverlaps
      True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
    • Constructor Summary

      Constructors 
      Constructor Description
      BM25LSimilarity()
      BM25 with these default values: k1 = 1.25, b = 0.4, d = 0.5.
      BM25LSimilarity​(float k1, float b, float d)
      BM25 with the supplied parameter values.
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      protected float avgFieldLength​(org.apache.lucene.search.CollectionStatistics collectionStats)  
      long computeNorm​(org.apache.lucene.index.FieldInvertState state)  
      org.apache.lucene.search.similarities.Similarity.SimWeight computeWeight​(float queryBoost, org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics... termStats)  
      protected float decodeNormValue​(byte b)
      The default implementation returns 1 / f^2 where f is SmallFloat.byte315ToFloat(byte).
      protected byte encodeNormValue​(float boost, int fieldLength)
      The default implementation encodes boost / sqrt(length) with SmallFloat.floatToByte315(float).
      float getB()  
      float getDelta()  
      boolean getDiscountOverlaps()  
      float getK1()  
      protected float idf​(long docFreq, long numDocs)
      Implemented as log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)).
      org.apache.lucene.search.Explanation idfExplain​(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics termStats)
      Computes a score factor for a simple term and returns an explanation for that score factor.
      org.apache.lucene.search.Explanation idfExplain​(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics[] termStats)
      Computes a score factor for a phrase.
      protected float scorePayload​(int doc, int start, int end, org.apache.lucene.util.BytesRef payload)
      The default implementation returns 1.
      void setDiscountOverlaps​(boolean v)
      Sets whether overlap tokens (tokens with a position increment of 0) are ignored when computing the norm.
      org.apache.lucene.search.similarities.Similarity.SimScorer simScorer​(org.apache.lucene.search.similarities.Similarity.SimWeight stats, org.apache.lucene.index.AtomicReaderContext context)  
      protected float sloppyFreq​(int distance)
      Implemented as 1 / (distance + 1).
      String toString()  
      • Methods inherited from class org.apache.lucene.search.similarities.Similarity

        coord, queryNorm
    • Field Detail

      • discountOverlaps

        protected boolean discountOverlaps
        True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
    • Constructor Detail

      • BM25LSimilarity

        public BM25LSimilarity​(float k1,
                               float b,
                               float d)
        BM25 with the supplied parameter values.
        Parameters:
        k1 - Controls non-linear term frequency normalization (saturation).
        b - Controls to what degree document length normalizes tf values.
        d - Shift parameter (delta), as returned by getDelta().
      • BM25LSimilarity

        public BM25LSimilarity()
        BM25 with these default values:
        • k1 = 1.25
        • b = 0.4
        • d = 0.5
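To make the roles of k1, b and d concrete, here is a minimal sketch of the term-frequency component as BM25L is commonly formulated in the literature. It is an assumption that this class computes exactly this expression; the class `Bm25lTfSketch` and its method are hypothetical and not part of the Lucene or Confluence API.

```java
// Hypothetical illustration only; this is not the BM25LSimilarity source.
final class Bm25lTfSketch {
    private Bm25lTfSketch() {}

    /**
     * Term-frequency component of BM25L as commonly published (assumed here):
     *   c      = tf / (1 - b + b * docLen / avgDocLen)
     *   result = (k1 + 1) * (c + d) / (k1 + c + d)
     */
    static float tfNorm(float tf, float docLen, float avgDocLen,
                        float k1, float b, float d) {
        // Length-normalized term frequency: longer documents produce a smaller c.
        float c = tf / (1f - b + b * docLen / avgDocLen);
        // The shift d keeps c away from zero, so a match in a very long
        // document is penalized less than under plain BM25.
        return (k1 + 1f) * (c + d) / (k1 + c + d);
    }
}
```

With d = 0 the expression reduces to the plain BM25 term-frequency component. With the defaults listed above (k1 = 1.25, b = 0.4, d = 0.5), a single occurrence in a document ten times the average length yields a larger value than it would with d = 0.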
    • Method Detail

      • idf

        protected float idf​(long docFreq,
                            long numDocs)
        Implemented as log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)).
        Parameters:
        docFreq - docFreq
        numDocs - numDocs
        Returns:
        log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)).
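The formula above can be sketched as standalone Java for experimentation. The class name `IdfSketch` is hypothetical; only the expression itself is taken from this page.

```java
// Hypothetical standalone helper mirroring the documented idf formula.
final class IdfSketch {
    private IdfSketch() {}

    /** log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5)), using the natural logarithm. */
    static float idf(long docFreq, long numDocs) {
        // 0.5D forces double arithmetic, so the division is not truncated.
        return (float) Math.log(1 + (numDocs - docFreq + 0.5D) / (docFreq + 0.5D));
    }
}
```

Rare terms receive a larger factor than common ones: for a collection of 100 documents, `IdfSketch.idf(5, 100)` is greater than `IdfSketch.idf(50, 100)`. Because of the `1 +` inside the logarithm, the result stays positive even when a term occurs in every document.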
      • sloppyFreq

        protected float sloppyFreq​(int distance)
        Implemented as 1 / (distance + 1).
        Parameters:
        distance - distance
        Returns:
        1 / (distance + 1).
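As a quick illustration of how the factor decays with edit distance, the documented expression can be written as the following standalone sketch. The class name `SloppyFreqSketch` is hypothetical.

```java
// Hypothetical standalone helper mirroring the documented sloppyFreq formula.
final class SloppyFreqSketch {
    private SloppyFreqSketch() {}

    /** 1 / (distance + 1): an exact match (distance 0) counts fully, looser matches count less. */
    static float sloppyFreq(int distance) {
        // 1.0f forces floating-point division; integer division would return 0 for distance > 0.
        return 1.0f / (distance + 1);
    }
}
```

A distance of 0 gives a factor of 1, a distance of 1 gives 0.5, and a distance of 3 gives 0.25.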
      • scorePayload

        protected float scorePayload​(int doc,
                                     int start,
                                     int end,
                                     org.apache.lucene.util.BytesRef payload)
        The default implementation returns 1
        Parameters:
        doc - doc
        start - start index
        end - end index
        payload - payload
        Returns:
        1
      • avgFieldLength

        protected float avgFieldLength​(org.apache.lucene.search.CollectionStatistics collectionStats)
        Parameters:
        collectionStats - collectionStats
        Returns:
        the average as sumTotalTermFreq / maxDoc, or returns 1 if the index does not store sumTotalTermFreq (Lucene 3.x indexes or any field that omits frequency information).
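The fallback described above can be sketched with plain values in place of a CollectionStatistics object. The sketch assumes that a missing sumTotalTermFreq is reported as a value less than or equal to zero (Lucene 4.x reports -1); the class name `AvgFieldLengthSketch` is hypothetical.

```java
// Hypothetical standalone helper; takes raw statistics instead of CollectionStatistics.
final class AvgFieldLengthSketch {
    private AvgFieldLengthSketch() {}

    /** Average field length, or 1 when the total term frequency is unavailable. */
    static float avgFieldLength(long sumTotalTermFreq, long maxDoc) {
        if (sumTotalTermFreq <= 0) {
            // Assumption: a non-positive value means "statistic not stored".
            return 1f;
        }
        // Divide in double precision, then narrow to float.
        return (float) (sumTotalTermFreq / (double) maxDoc);
    }
}
```

For example, 300 total term occurrences across 100 documents give an average length of 3.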
      • encodeNormValue

        protected byte encodeNormValue​(float boost,
                                       int fieldLength)
        The default implementation encodes boost / sqrt(length) with SmallFloat.floatToByte315(float). This is compatible with Lucene's default implementation. If you change this, then you should change decodeNormValue(byte) to match.
        Parameters:
        boost - boost
        fieldLength - fieldLength
        Returns:
        boost / sqrt(length)
      • decodeNormValue

        protected float decodeNormValue​(byte b)
        The default implementation returns 1 / f^2 where f is SmallFloat.byte315ToFloat(byte).
        Parameters:
        b - byte
        Returns:
        1 / f^2
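The relationship between the encoded and decoded values can be seen in a small sketch that deliberately omits the single-byte quantization performed by SmallFloat. Since f = boost / sqrt(length), decoding with 1 / f^2 yields length / boost^2, which is the field length itself when the boost is 1. The class `NormSketch` and its method names are hypothetical.

```java
// Hypothetical illustration; the lossy byte encoding done by SmallFloat is left out.
final class NormSketch {
    private NormSketch() {}

    /** The value that would be handed to the byte encoder: boost / sqrt(length). */
    static float valueToEncode(float boost, int fieldLength) {
        return boost / (float) Math.sqrt(fieldLength);
    }

    /** The decoded quantity 1 / f^2, which equals length / boost^2. */
    static float decodedValue(float f) {
        return 1f / (f * f);
    }
}
```

With a boost of 1 and a field length of 16, the value to encode is 0.25 and decoding it gives 16 again. In the real class the intermediate byte has limited precision, so the recovered length is only approximate.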
      • setDiscountOverlaps

        public void setDiscountOverlaps​(boolean v)
        Sets whether overlap tokens (tokens with a position increment of 0) are ignored when computing the norm. By default this is true, meaning overlap tokens do not count when computing norms.
        Parameters:
        v - discountOverlaps
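The effect of the flag on the length used for the norm can be sketched as follows. The sketch assumes the norm is derived from the field's token count minus its overlap count when discounting is enabled; the class `OverlapSketch` and its parameters are hypothetical stand-ins for the values carried by FieldInvertState.

```java
// Hypothetical illustration of how discountOverlaps changes the counted length.
final class OverlapSketch {
    private OverlapSketch() {}

    /**
     * @param length            total number of tokens indexed for the field
     * @param numOverlap        tokens with a position increment of 0 (e.g. injected synonyms)
     * @param discountOverlaps  whether overlap tokens are excluded from the length
     */
    static int effectiveLength(int length, int numOverlap, boolean discountOverlaps) {
        return discountOverlaps ? length - numOverlap : length;
    }
}
```

A field with 10 tokens, 3 of which are overlaps, is treated as having length 7 when the flag is true and length 10 when it is false.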
      • getDiscountOverlaps

        public boolean getDiscountOverlaps()
        Returns:
        true if overlap tokens are discounted from the document's length.
        See Also:
        setDiscountOverlaps(boolean)
      • computeNorm

        public final long computeNorm​(org.apache.lucene.index.FieldInvertState state)
        Specified by:
        computeNorm in class org.apache.lucene.search.similarities.Similarity
      • idfExplain

        public org.apache.lucene.search.Explanation idfExplain​(org.apache.lucene.search.CollectionStatistics collectionStats,
                                                               org.apache.lucene.search.TermStatistics termStats)
        Computes a score factor for a simple term and returns an explanation for that score factor.

        The default implementation uses:

         idf(docFreq, searcher.maxDoc());
         

        Note that CollectionStatistics.maxDoc() is used instead of IndexReader#numDocs() because TermStatistics.docFreq() is also used: when the latter is inaccurate, so is CollectionStatistics.maxDoc(), and in the same direction. In addition, CollectionStatistics.maxDoc() is more efficient to compute.

        Parameters:
        collectionStats - collection-level statistics
        termStats - term-level statistics for the term
        Returns:
        an Explanation object that includes both an idf score factor and an explanation for the term.
      • idfExplain

        public org.apache.lucene.search.Explanation idfExplain​(org.apache.lucene.search.CollectionStatistics collectionStats,
                                                               org.apache.lucene.search.TermStatistics[] termStats)
        Computes a score factor for a phrase.

        The default implementation sums the idf factor for each term in the phrase.

        Parameters:
        collectionStats - collection-level statistics
        termStats - term-level statistics for the terms in the phrase
        Returns:
        an Explanation object that includes both an idf score factor for the phrase and an explanation for each term.
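The summation described for phrases can be sketched with plain document frequencies in place of TermStatistics objects. The class `PhraseIdfSketch` is hypothetical, and the per-term formula is the one documented for idf(long, long) on this page.

```java
// Hypothetical illustration: phrase idf as the sum of the per-term idf factors.
final class PhraseIdfSketch {
    private PhraseIdfSketch() {}

    /** Per-term idf as documented: log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5)). */
    static float idf(long docFreq, long numDocs) {
        return (float) Math.log(1 + (numDocs - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    /** Sums the idf factor of every term in the phrase. */
    static float phraseIdf(long[] docFreqs, long maxDoc) {
        float sum = 0f;
        for (long docFreq : docFreqs) {
            sum += idf(docFreq, maxDoc);
        }
        return sum;
    }
}
```

A two-term phrase therefore scores with the sum of both terms' factors, so a phrase made of rare terms carries more weight than one made of common terms.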
      • computeWeight

        public final org.apache.lucene.search.similarities.Similarity.SimWeight computeWeight​(float queryBoost,
                                                                                              org.apache.lucene.search.CollectionStatistics collectionStats,
                                                                                              org.apache.lucene.search.TermStatistics... termStats)
        Specified by:
        computeWeight in class org.apache.lucene.search.similarities.Similarity
      • simScorer

        public final org.apache.lucene.search.similarities.Similarity.SimScorer simScorer​(org.apache.lucene.search.similarities.Similarity.SimWeight stats,
                                                                                          org.apache.lucene.index.AtomicReaderContext context)
                                                                                   throws IOException
        Specified by:
        simScorer in class org.apache.lucene.search.similarities.Similarity
        Throws:
        IOException
      • getDelta

        public float getDelta()