public class BM25LSimilarity
extends org.apache.lucene.search.similarities.Similarity
Moved from the confluence-search plugin into core
Modifier and Type | Field and Description |
---|---|
protected boolean | discountOverlaps - True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length. |
Constructor and Description |
---|
BM25LSimilarity() - BM25 with these default values: k1 = 1.25, b = 0.4, d = 0.5. |
BM25LSimilarity(float k1, float b, float d) - BM25 with the supplied parameter values. |
Modifier and Type | Method and Description |
---|---|
protected float | avgFieldLength(org.apache.lucene.search.CollectionStatistics collectionStats) |
long | computeNorm(org.apache.lucene.index.FieldInvertState state) |
org.apache.lucene.search.similarities.Similarity.SimWeight | computeWeight(float queryBoost, org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics... termStats) |
protected float | decodeNormValue(byte b) - The default implementation returns 1 / f², where f is SmallFloat.byte315ToFloat(byte). |
protected byte | encodeNormValue(float boost, int fieldLength) - The default implementation encodes boost / sqrt(length) with SmallFloat.floatToByte315(float). |
float | getB() |
float | getDelta() |
boolean | getDiscountOverlaps() |
float | getK1() |
protected float | idf(long docFreq, long numDocs) - Implemented as log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)). |
org.apache.lucene.search.Explanation | idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics termStats) - Computes a score factor for a simple term and returns an explanation for that score factor. |
org.apache.lucene.search.Explanation | idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics[] termStats) - Computes a score factor for a phrase. |
protected float | scorePayload(int doc, int start, int end, org.apache.lucene.util.BytesRef payload) - The default implementation returns 1. |
void | setDiscountOverlaps(boolean v) - Sets whether overlap tokens (tokens with 0 position increment) are ignored when computing norm. |
org.apache.lucene.search.similarities.Similarity.SimScorer | simScorer(org.apache.lucene.search.similarities.Similarity.SimWeight stats, org.apache.lucene.index.AtomicReaderContext context) |
protected float | sloppyFreq(int distance) - Implemented as 1 / (distance + 1). |
String | toString() |
protected boolean discountOverlaps
public BM25LSimilarity(float k1, float b, float d)
BM25 with the supplied parameter values.
Parameters:
k1 - Controls non-linear term frequency normalization (saturation).
b - Controls to what degree document length normalizes tf values.
d - Shift parameter.
public BM25LSimilarity()
BM25 with these default values: k1 = 1.25, b = 0.4, d = 0.5
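To make the roles of k1, b and d concrete, here is a minimal sketch of the BM25L term-frequency component as published by Lv and Zhai. It assumes this class follows that published formula; the class's own scoring code is not shown on this page, and the names Bm25lSketch and tfNorm are hypothetical.

```java
final class Bm25lSketch {
    /**
     * Hypothetical helper: length-normalized, shifted, saturated term frequency.
     * Assumes the published BM25L form, not this class's actual implementation.
     */
    static double tfNorm(double tf, double docLen, double avgDocLen,
                         double k1, double b, double d) {
        if (tf <= 0) {
            return 0.0; // a term that does not occur contributes nothing
        }
        // b controls how strongly document length normalizes the raw tf
        double c = tf / (1 - b + b * docLen / avgDocLen);
        // d shifts the normalized tf; k1 controls how quickly the value saturates
        return ((k1 + 1) * (c + d)) / (k1 + c + d);
    }

    public static void main(String[] args) {
        // Same raw tf, average-length vs. ten-times-longer document, default parameters
        System.out.println(tfNorm(2, 100, 100, 1.25, 0.4, 0.5));
        System.out.println(tfNorm(2, 1000, 100, 1.25, 0.4, 0.5));
    }
}
```

Under these assumptions the value never exceeds k1 + 1, and the shift d keeps very long documents from being pushed towards a zero contribution.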
protected float idf(long docFreq, long numDocs)
Implemented as log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)).
Parameters:
docFreq - docFreq
numDocs - numDocs
Returns:
log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5))
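The idf expression above can be sketched in plain Java as follows. The class name IdfSketch is hypothetical, and the method is a standalone illustration of the documented formula rather than the class's source.

```java
final class IdfSketch {
    /** log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5)), using the natural logarithm. */
    static float idf(long docFreq, long numDocs) {
        return (float) Math.log(1 + (numDocs - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    public static void main(String[] args) {
        // Rare terms get a larger factor than common ones in a 10-document collection
        System.out.println(idf(1, 10));
        System.out.println(idf(5, 10));
        System.out.println(idf(10, 10));
    }
}
```

Because of the leading 1 inside the logarithm, the factor stays positive even when a term occurs in every document.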
protected float sloppyFreq(int distance)
Implemented as 1 / (distance + 1).
Parameters:
distance - distance
Returns:
1 / (distance + 1)
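As a quick illustration of the sloppyFreq expression above, the sketch below evaluates it for a few distances. SloppyFreqSketch is a hypothetical standalone class, not part of this API.

```java
final class SloppyFreqSketch {
    /** 1 / (distance + 1): an exact match (distance 0) counts fully, looser matches count less. */
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        for (int distance = 0; distance <= 3; distance++) {
            System.out.println(distance + " -> " + sloppyFreq(distance));
        }
    }
}
```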
protected float scorePayload(int doc, int start, int end, org.apache.lucene.util.BytesRef payload)
The default implementation returns 1.
Parameters:
doc - doc
start - start index
end - end index
payload - payload
Returns:
1
protected float avgFieldLength(org.apache.lucene.search.CollectionStatistics collectionStats)
Parameters:
collectionStats - collectionStats
Returns:
sumTotalTermFreq / maxDoc, or 1 if the index does not store sumTotalTermFreq (Lucene 3.x indexes or any field that omits frequency information).
protected byte encodeNormValue(float boost, int fieldLength)
The default implementation encodes boost / sqrt(length) with SmallFloat.floatToByte315(float). This is compatible with Lucene's default implementation. If you change this, then you should change decodeNormValue(byte) to match.
Parameters:
boost - boost
fieldLength - fieldLength
Returns:
boost / sqrt(length)
protected float decodeNormValue(byte b)
The default implementation returns 1 / f², where f is SmallFloat.byte315ToFloat(byte).
Parameters:
b - byte
Returns:
1 / f²
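The pairing of encodeNormValue and decodeNormValue can be sketched as a round trip. To keep the example runnable without Lucene on the classpath, the two SmallFloat conversions are re-implemented here from the library's 3-mantissa-bit, zero-exponent-15 byte format; treat them as an illustrative stand-in and use the real SmallFloat class in practice. NormSketch is a hypothetical name.

```java
final class NormSketch {
    // Stand-in for SmallFloat.floatToByte315(float): keeps 3 mantissa bits, zero exponent 15.
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // zero or negative -> 0, underflow -> 1
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1; // overflow: clamp to the largest encodable value
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // Stand-in for SmallFloat.byte315ToFloat(byte).
    static float byte315ToFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    /** Encodes boost / sqrt(length) into a single byte. */
    static byte encodeNormValue(float boost, int fieldLength) {
        return floatToByte315(boost / (float) Math.sqrt(fieldLength));
    }

    /** Decodes to 1 / f^2, which approximates the field length when boost is 1. */
    static float decodeNormValue(byte b) {
        float f = byte315ToFloat(b);
        return 1 / (f * f);
    }

    public static void main(String[] args) {
        for (int length : new int[] {4, 16, 100}) {
            System.out.println(length + " -> " + decodeNormValue(encodeNormValue(1f, length)));
        }
    }
}
```

With a boost of 1, decoding recovers the field length only approximately, because a single byte cannot hold the full float. This is why the two methods must be changed together.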
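public void setDiscountOverlaps(boolean v)
Sets whether overlap tokens (tokens with 0 position increment) are ignored when computing norm.
Parameters:
v - discountOverlaps
public boolean getDiscountOverlaps()
See Also:
setDiscountOverlaps(boolean)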
public final long computeNorm(org.apache.lucene.index.FieldInvertState state)
Specified by:
computeNorm in class org.apache.lucene.search.similarities.Similarity
public org.apache.lucene.search.Explanation idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics termStats)
Computes a score factor for a simple term and returns an explanation for that score factor.
The default implementation uses:
idf(docFreq, searcher.maxDoc());
Note that CollectionStatistics.maxDoc() is used instead of IndexReader#numDocs() because TermStatistics.docFreq() is also used, and when the latter is inaccurate, so is CollectionStatistics.maxDoc(), and in the same direction. In addition, CollectionStatistics.maxDoc() is more efficient to compute.
Parameters:
collectionStats - collection-level statistics
termStats - term-level statistics for the term
public org.apache.lucene.search.Explanation idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics[] termStats)
Computes a score factor for a phrase.
The default implementation sums the idf factor for each term in the phrase.
Parameters:
collectionStats - collection-level statistics
termStats - term-level statistics for the terms in the phrase
public final org.apache.lucene.search.similarities.Similarity.SimWeight computeWeight(float queryBoost, org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics... termStats)
Specified by:
computeWeight in class org.apache.lucene.search.similarities.Similarity
public final org.apache.lucene.search.similarities.Similarity.SimScorer simScorer(org.apache.lucene.search.similarities.Similarity.SimWeight stats, org.apache.lucene.index.AtomicReaderContext context) throws IOException
Specified by:
simScorer in class org.apache.lucene.search.similarities.Similarity
Throws:
IOException
public float getK1()
Returns the k1 parameter.
See Also:
BM25LSimilarity(float, float, float)
public float getB()
Returns the b parameter.
See Also:
BM25LSimilarity(float, float, float)
public float getDelta()
Copyright © 2003–2020 Atlassian. All rights reserved.