Class BM25LSimilarity
- java.lang.Object
-
- org.apache.lucene.search.similarities.Similarity
-
- com.atlassian.confluence.internal.search.v2.lucene.BM25LSimilarity
-
public class BM25LSimilarity extends org.apache.lucene.search.similarities.Similarity
Extension of BM25 which shifts the term frequency normalization formula to boost scores of very long documents. Moved from the confluence-search plugin into core.
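The shift described above can be illustrated with a minimal, standalone sketch. It assumes the published BM25L formulation, in which the length-normalized term frequency is shifted by d before saturation. The class name Bm25lSketch and its score method are hypothetical and are not part of this API; the actual implementation in this class may differ in detail.

```java
/** Hypothetical standalone sketch of a BM25L-style term score; not the Confluence source. */
class Bm25lSketch {

    /**
     * @param tf        term frequency within the document
     * @param docLen    length of the document field, in tokens
     * @param avgDocLen average field length across the collection
     * @param idf       inverse document frequency of the term
     * @param k1        saturation parameter
     * @param b         length normalization parameter
     * @param d         shift parameter
     */
    public static double score(double tf, double docLen, double avgDocLen,
                               double idf, double k1, double b, double d) {
        if (tf <= 0) {
            return 0.0; // assumption: the shift only applies to terms that occur
        }
        // Length-normalized term frequency: shrinks as the document gets longer.
        double normTf = tf / (1.0 - b + b * (docLen / avgDocLen));
        // Shifting by d keeps the tf component away from zero for very long documents.
        double shifted = normTf + d;
        return idf * ((k1 + 1.0) * shifted) / (k1 + shifted);
    }

    public static void main(String[] args) {
        // Same term frequency, average-length document vs. a very long document.
        System.out.println(score(2, 100, 100, 1.0, 1.25, 0.4, 0.5));
        System.out.println(score(2, 10000, 100, 1.0, 1.25, 0.4, 0.5));
        // The same very long document without the shift (d = 0) scores lower.
        System.out.println(score(2, 10000, 100, 1.0, 1.25, 0.4, 0.0));
    }
}
```

With d = 0 the sketch reduces to the usual BM25 term frequency component, so comparing the last two printed values shows how much the shift lifts a very long document.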
-
-
Field Summary
Fields
Modifier and Type | Field | Description
protected boolean
discountOverlaps
True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
-
Constructor Summary
Constructors
Constructor | Description
BM25LSimilarity()
BM25 with these default values: k1 = 1.25, b = 0.4, d = 0.5.
BM25LSimilarity(float k1, float b, float d)
BM25 with the supplied parameter values.
-
Method Summary
Modifier and Type | Method | Description
protected float
avgFieldLength(org.apache.lucene.search.CollectionStatistics collectionStats)
long
computeNorm(org.apache.lucene.index.FieldInvertState state)
org.apache.lucene.search.similarities.Similarity.SimWeight
computeWeight(float queryBoost, org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics... termStats)
protected float
decodeNormValue(byte b)
The default implementation returns 1 / f^2, where f is SmallFloat.byte315ToFloat(byte).
protected byte
encodeNormValue(float boost, int fieldLength)
The default implementation encodes boost / sqrt(length) with SmallFloat.floatToByte315(float).
float
getB()
float
getDelta()
boolean
getDiscountOverlaps()
float
getK1()
protected float
idf(long docFreq, long numDocs)
Implemented as log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)).
org.apache.lucene.search.Explanation
idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics termStats)
Computes a score factor for a simple term and returns an explanation for that score factor.
org.apache.lucene.search.Explanation
idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics[] termStats)
Computes a score factor for a phrase.
protected float
scorePayload(int doc, int start, int end, org.apache.lucene.util.BytesRef payload)
The default implementation returns 1.
void
setDiscountOverlaps(boolean v)
Sets whether overlap tokens (tokens with a position increment of zero) are ignored when computing the norm.
org.apache.lucene.search.similarities.Similarity.SimScorer
simScorer(org.apache.lucene.search.similarities.Similarity.SimWeight stats, org.apache.lucene.index.AtomicReaderContext context)
protected float
sloppyFreq(int distance)
Implemented as 1 / (distance + 1).
String
toString()
-
-
-
Constructor Detail
-
BM25LSimilarity
public BM25LSimilarity(float k1, float b, float d)
BM25 with the supplied parameter values.
- Parameters:
k1 - Controls non-linear term frequency normalization (saturation).
b - Controls to what degree document length normalizes tf values.
d - Shift parameter applied in the term frequency normalization.
-
BM25LSimilarity
public BM25LSimilarity()
BM25 with these default values: k1 = 1.25, b = 0.4, d = 0.5.
-
-
Method Detail
-
idf
protected float idf(long docFreq, long numDocs)
Implemented as log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)).
- Parameters:
docFreq - the number of documents that contain the term
numDocs - the total number of documents in the collection
- Returns:
log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)).
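As a quick illustration of the formula, the following standalone sketch evaluates it with plain Java arithmetic. The class name IdfSketch is hypothetical; only the expression itself comes from the description above.

```java
/** Hypothetical standalone sketch of the documented idf expression. */
class IdfSketch {

    /** Evaluates log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5)). */
    public static float idf(long docFreq, long numDocs) {
        return (float) Math.log(1 + (numDocs - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    public static void main(String[] args) {
        // A rare term (1 of 1000 documents) receives a large idf.
        System.out.println(idf(1, 1000));
        // A term present in every document still receives a small positive idf,
        // because of the "1 +" inside the logarithm.
        System.out.println(idf(1000, 1000));
    }
}
```

The "1 +" term keeps the result positive even when docFreq equals numDocs, so very common terms never contribute a negative score factor.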
-
sloppyFreq
protected float sloppyFreq(int distance)
Implemented as 1 / (distance + 1).
- Parameters:
distance - the edit distance of the sloppy phrase match
- Returns:
1 / (distance + 1).
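A small sketch of how this factor decays with distance, using a hypothetical standalone class SloppyFreqSketch rather than the protected method itself:

```java
/** Hypothetical standalone sketch of the documented sloppy frequency factor. */
class SloppyFreqSketch {

    /** Evaluates 1 / (distance + 1) in floating point. */
    public static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        // An exact phrase match (distance 0) counts fully; looser matches count less.
        for (int distance = 0; distance <= 3; distance++) {
            System.out.println(distance + " -> " + sloppyFreq(distance));
        }
    }
}
```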
-
scorePayload
protected float scorePayload(int doc, int start, int end, org.apache.lucene.util.BytesRef payload)
The default implementation returns 1.
- Parameters:
doc - doc
start - start index
end - end index
payload - payload
- Returns:
1
-
avgFieldLength
protected float avgFieldLength(org.apache.lucene.search.CollectionStatistics collectionStats)
- Parameters:
collectionStats - collectionStats
- Returns:
the average as sumTotalTermFreq / maxDoc, or 1 if the index does not store sumTotalTermFreq (Lucene 3.x indexes or any field that omits frequency information).
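The fallback rule can be sketched with plain numbers. This standalone example assumes that a missing sumTotalTermFreq is reported as a non-positive value (Lucene's CollectionStatistics.sumTotalTermFreq() returns -1 when the statistic is unavailable). The class name AvgFieldLengthSketch is hypothetical.

```java
/** Hypothetical standalone sketch of the documented average field length rule. */
class AvgFieldLengthSketch {

    /**
     * @param sumTotalTermFreq total number of tokens in the field, or a
     *                         non-positive value if the index does not store it
     * @param maxDoc           number of documents in the collection
     */
    public static float avgFieldLength(long sumTotalTermFreq, long maxDoc) {
        if (sumTotalTermFreq <= 0) {
            return 1f; // statistic unavailable: fall back to 1
        }
        // Divide in double precision, then narrow to float.
        return (float) (sumTotalTermFreq / (double) maxDoc);
    }

    public static void main(String[] args) {
        System.out.println(avgFieldLength(500, 100)); // 500 tokens over 100 documents
        System.out.println(avgFieldLength(-1, 100));  // statistic missing
    }
}
```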
-
encodeNormValue
protected byte encodeNormValue(float boost, int fieldLength)
The default implementation encodes boost / sqrt(length) with SmallFloat.floatToByte315(float). This is compatible with Lucene's default implementation. If you change this, then you should change decodeNormValue(byte) to match.
- Parameters:
boost - boost
fieldLength - fieldLength
- Returns:
boost / sqrt(length)
-
decodeNormValue
protected float decodeNormValue(byte b)
The default implementation returns 1 / f^2, where f is SmallFloat.byte315ToFloat(byte).
- Parameters:
b - byte
- Returns:
1 / f^2
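The relationship between the two norm methods can be sketched without the byte encoding. The example below is a simplification: it keeps f as a full-precision float instead of passing it through SmallFloat, so it shows only why decoding with 1 / f^2 recovers the field length when the boost is 1. The class and method names are hypothetical.

```java
/** Hypothetical sketch of the norm arithmetic, without the lossy one-byte encoding. */
class NormSketch {

    /** The value that would be encoded: boost / sqrt(fieldLength). */
    public static float normValue(float boost, int fieldLength) {
        return boost / (float) Math.sqrt(fieldLength);
    }

    /** The decoded value: 1 / f^2, which equals fieldLength / boost^2. */
    public static float decodedValue(float f) {
        return 1.0f / (f * f);
    }

    public static void main(String[] args) {
        float f = normValue(1.0f, 100);
        // Close to 100, up to float rounding. The real methods quantize f
        // to a single byte, so the recovered length is only approximate.
        System.out.println(decodedValue(f));
    }
}
```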
-
setDiscountOverlaps
public void setDiscountOverlaps(boolean v)
Sets whether overlap tokens (tokens with a position increment of zero) are ignored when computing the norm. By default this is true, meaning overlap tokens do not count when computing norms.
- Parameters:
v - the new value of discountOverlaps
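The effect of this flag on the length used for the norm can be sketched as follows. The example assumes the length is derived the way Lucene's own BM25Similarity does it, from FieldInvertState.getLength() and FieldInvertState.getNumOverlap(); this page does not document how computeNorm is implemented here. The class name OverlapLengthSketch is hypothetical.

```java
/** Hypothetical sketch of how discountOverlaps could affect the field length. */
class OverlapLengthSketch {

    /**
     * @param length           total number of tokens indexed for the field
     * @param numOverlap       number of tokens with a position increment of zero
     * @param discountOverlaps whether overlap tokens are excluded from the length
     */
    public static int effectiveLength(int length, int numOverlap, boolean discountOverlaps) {
        return discountOverlaps ? length - numOverlap : length;
    }

    public static void main(String[] args) {
        // 10 tokens, 3 of them synonyms injected at the same position.
        System.out.println(effectiveLength(10, 3, true));  // overlaps discounted
        System.out.println(effectiveLength(10, 3, false)); // overlaps counted
    }
}
```

Discounting overlaps keeps synonym expansion at index time from making a document look longer than its original text.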
-
getDiscountOverlaps
public boolean getDiscountOverlaps()
- Returns:
- true if overlap tokens are discounted from the document's length.
- See Also:
setDiscountOverlaps(boolean)
-
computeNorm
public final long computeNorm(org.apache.lucene.index.FieldInvertState state)
- Specified by:
computeNorm in class org.apache.lucene.search.similarities.Similarity
-
idfExplain
public org.apache.lucene.search.Explanation idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics termStats)
Computes a score factor for a simple term and returns an explanation for that score factor.
The default implementation uses:
idf(docFreq, searcher.maxDoc());
Note that CollectionStatistics.maxDoc() is used instead of IndexReader#numDocs() because TermStatistics.docFreq() is also used, and when the latter is inaccurate, so is CollectionStatistics.maxDoc(), and in the same direction. In addition, CollectionStatistics.maxDoc() is more efficient to compute.
- Parameters:
collectionStats - collection-level statistics
termStats - term-level statistics for the term
- Returns:
an Explain object that includes both an idf score factor and an explanation for the term.
-
idfExplain
public org.apache.lucene.search.Explanation idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics[] termStats)
Computes a score factor for a phrase.
The default implementation sums the idf factor for each term in the phrase.
- Parameters:
collectionStats - collection-level statistics
termStats - term-level statistics for the terms in the phrase
- Returns:
an Explain object that includes both an idf score factor for the phrase and an explanation for each term.
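The summation for a phrase can be sketched with plain arrays in place of the Lucene statistics objects. The class PhraseIdfSketch and its methods are hypothetical; the per-term formula is the one documented for idf(long, long), and maxDoc is passed as the document count, as described for the single-term variant.

```java
/** Hypothetical standalone sketch of a phrase idf as the sum of per-term idf values. */
class PhraseIdfSketch {

    /** Per-term idf: log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5)). */
    public static float idf(long docFreq, long numDocs) {
        return (float) Math.log(1 + (numDocs - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    /** Sums the idf of every term in the phrase, using maxDoc as the document count. */
    public static float phraseIdf(long[] docFreqs, long maxDoc) {
        float sum = 0f;
        for (long docFreq : docFreqs) {
            sum += idf(docFreq, maxDoc);
        }
        return sum;
    }

    public static void main(String[] args) {
        // A two-term phrase: one rare term (5 documents) and one common term (50 documents).
        System.out.println(phraseIdf(new long[] {5, 50}, 1000));
    }
}
```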
-
computeWeight
public final org.apache.lucene.search.similarities.Similarity.SimWeight computeWeight(float queryBoost, org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics... termStats)
- Specified by:
computeWeight in class org.apache.lucene.search.similarities.Similarity
-
simScorer
public final org.apache.lucene.search.similarities.Similarity.SimScorer simScorer(org.apache.lucene.search.similarities.Similarity.SimWeight stats, org.apache.lucene.index.AtomicReaderContext context) throws IOException
- Specified by:
simScorer in class org.apache.lucene.search.similarities.Similarity
- Throws:
IOException
-
getK1
public float getK1()
- Returns:
the k1 parameter
- See Also:
BM25LSimilarity(float, float, float)
-
getB
public float getB()
- Returns:
the b parameter
- See Also:
BM25LSimilarity(float, float, float)
-
getDelta
public float getDelta()
-
-