public class BM25LSimilarity
extends org.apache.lucene.search.similarities.Similarity
Moved from the confluence-search plugin into core
| Modifier and Type | Field and Description |
|---|---|
| protected boolean | discountOverlaps: True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length. |
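
To make the effect of discountOverlaps concrete, here is a minimal sketch of how the length used for normalization could be derived. It assumes the class mirrors Lucene's BM25Similarity, where overlap tokens are subtracted from the field length when the flag is true. The class and method names below are hypothetical stand-ins and are not part of BM25LSimilarity.

```java
// Hypothetical helper: stands in for the two counters that
// FieldInvertState exposes (total length and overlap count).
final class FieldLengthSketch {

    /**
     * Returns the field length used for normalization.
     *
     * @param totalTokens      all tokens indexed for the field
     * @param overlapTokens    tokens whose position increment is zero
     * @param discountOverlaps whether overlap tokens are excluded
     */
    static int normLength(int totalTokens, int overlapTokens, boolean discountOverlaps) {
        return discountOverlaps ? totalTokens - overlapTokens : totalTokens;
    }

    public static void main(String[] args) {
        // 10 tokens, 3 of which are synonyms stacked on an existing position.
        System.out.println(normLength(10, 3, true));
        System.out.println(normLength(10, 3, false));
    }
}
```

With the flag on, injected synonyms do not make a document look longer, so they do not lower its score through length normalization.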
| Constructor and Description |
|---|
| BM25LSimilarity(): BM25 with these default values: k1 = 1.25, b = 0.4, d = 0.5. |
| BM25LSimilarity(float k1, float b, float d): BM25 with the supplied parameter values. |
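
This page lists k1, b, and d but does not state how they combine. The following is a rough sketch only, assuming the class follows the published BM25L formulation: term frequency is normalized by document length, shifted by d, and then saturated with k1. The class and method names are hypothetical, and the real implementation may differ (for example, in how it uses encoded norms).

```java
// Hypothetical helper: not part of BM25LSimilarity or Lucene.
final class Bm25lTermWeightSketch {
    // Defaults taken from the no-argument constructor listed above.
    static final float DEFAULT_K1 = 1.25f;
    static final float DEFAULT_B = 0.4f;
    static final float DEFAULT_D = 0.5f;

    /**
     * Term-frequency component of a BM25L-style score (idf not included).
     * Assumed formulation, not confirmed by this page.
     */
    static float termWeight(float freq, float docLen, float avgDocLen,
                            float k1, float b, float d) {
        if (freq <= 0f) {
            return 0f; // a term that does not occur contributes nothing
        }
        // Length-normalized term frequency: b controls how strongly
        // document length affects the result.
        float normalizedTf = freq / (1f - b + b * docLen / avgDocLen);
        // The shift d lifts the contribution of matches in long documents.
        float shiftedTf = normalizedTf + d;
        // k1 controls saturation; the result approaches k1 + 1 from below.
        return (k1 + 1f) * shiftedTf / (k1 + shiftedTf);
    }

    public static void main(String[] args) {
        float avgDocLen = 100f;
        for (float docLen : new float[] {50f, 100f, 400f}) {
            float w = termWeight(1f, docLen, avgDocLen, DEFAULT_K1, DEFAULT_B, DEFAULT_D);
            System.out.printf("docLen=%.0f weight=%.4f%n", docLen, w);
        }
    }
}
```

Setting d to 0 in this sketch reduces it to the usual BM25 term-frequency component, which is one way to see what the shift parameter adds.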
| Modifier and Type | Method and Description |
|---|---|
| protected float | avgFieldLength(org.apache.lucene.search.CollectionStatistics collectionStats) |
| long | computeNorm(org.apache.lucene.index.FieldInvertState state) |
| org.apache.lucene.search.similarities.Similarity.SimWeight | computeWeight(float queryBoost, org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics... termStats) |
| protected float | decodeNormValue(byte b): The default implementation returns 1 / f^2, where f is SmallFloat.byte315ToFloat(byte). |
| protected byte | encodeNormValue(float boost, int fieldLength): The default implementation encodes boost / sqrt(length) with SmallFloat.floatToByte315(float). |
| float | getB() |
| float | getDelta() |
| boolean | getDiscountOverlaps() |
| float | getK1() |
| protected float | idf(long docFreq, long numDocs): Implemented as log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)). |
| org.apache.lucene.search.Explanation | idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics termStats): Computes a score factor for a simple term and returns an explanation for that score factor. |
| org.apache.lucene.search.Explanation | idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics[] termStats): Computes a score factor for a phrase. |
| protected float | scorePayload(int doc, int start, int end, org.apache.lucene.util.BytesRef payload): The default implementation returns 1. |
| void | setDiscountOverlaps(boolean v): Sets whether overlap tokens (tokens with a position increment of 0) are ignored when computing the norm. |
| org.apache.lucene.search.similarities.Similarity.SimScorer | simScorer(org.apache.lucene.search.similarities.Similarity.SimWeight stats, org.apache.lucene.index.AtomicReaderContext context) |
| protected float | sloppyFreq(int distance): Implemented as 1 / (distance + 1). |
| String | toString() |
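
The idf and sloppyFreq formulas in the summary above can be tried out in isolation. The sketch below transcribes them into plain Java, together with the phrase case, which this page describes as the sum of the idf factor for each term. The class name and the phraseIdf helper are hypothetical and are not part of BM25LSimilarity.

```java
// Hypothetical helper: plain-Java transcription of the documented formulas.
final class Bm25lIdfSketch {

    /** log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5)). */
    static float idf(long docFreq, long numDocs) {
        return (float) Math.log(1 + (numDocs - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    /** Phrase idf as the sum of the per-term idf values. */
    static float phraseIdf(long[] docFreqs, long numDocs) {
        float sum = 0f;
        for (long docFreq : docFreqs) {
            sum += idf(docFreq, numDocs);
        }
        return sum;
    }

    /** 1 / (distance + 1): exact matches count fully, sloppier ones less. */
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        long numDocs = 1000;
        System.out.println("rare term:   " + idf(1, numDocs));
        System.out.println("common term: " + idf(500, numDocs));
        System.out.println("phrase:      " + phraseIdf(new long[] {1, 500}, numDocs));
        System.out.println("sloppy(3):   " + sloppyFreq(3));
    }
}
```

Because of the leading 1 inside the logarithm, this idf stays positive even for a term that occurs in every document.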
protected boolean discountOverlaps

public BM25LSimilarity(float k1,
    float b,
    float d)
BM25 with the supplied parameter values.
Parameters:
k1 - Controls non-linear term frequency normalization (saturation).
b - Controls to what degree document length normalizes tf values.
d - Shift parameter.

public BM25LSimilarity()
BM25 with these default values:
k1 = 1.25,
b = 0.4,
d = 0.5.

protected float idf(long docFreq,
    long numDocs)
Implemented as log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)).
Parameters:
docFreq - number of documents containing the term
numDocs - total number of documents in the collection
Returns:
log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)).

protected float sloppyFreq(int distance)
Implemented as 1 / (distance + 1).
Parameters:
distance - edit distance of the sloppy phrase match
Returns:
1 / (distance + 1).

protected float scorePayload(int doc,
    int start,
    int end,
    org.apache.lucene.util.BytesRef payload)
The default implementation returns 1.
Parameters:
doc - doc
start - start index
end - end index
payload - payload
Returns:
1

protected float avgFieldLength(org.apache.lucene.search.CollectionStatistics collectionStats)
Parameters:
collectionStats - collection-level statistics
Returns:
sumTotalTermFreq / maxDoc, or 1 if the index does not store sumTotalTermFreq (Lucene 3.x indexes or any field that omits frequency information).

protected byte encodeNormValue(float boost,
    int fieldLength)
The default implementation encodes boost / sqrt(length) with SmallFloat.floatToByte315(float). This is compatible with Lucene's default implementation. If you change this, you should change decodeNormValue(byte) to match.
Parameters:
boost - boost
fieldLength - field length
Returns:
boost / sqrt(length)

protected float decodeNormValue(byte b)
The default implementation returns 1 / f^2, where f is SmallFloat.byte315ToFloat(byte).
Parameters:
b - byte
Returns:
1 / f^2

public void setDiscountOverlaps(boolean v)
Parameters:
v - discountOverlaps

public boolean getDiscountOverlaps()
See Also:
setDiscountOverlaps(boolean)

public final long computeNorm(org.apache.lucene.index.FieldInvertState state)
Specified by:
computeNorm in class org.apache.lucene.search.similarities.Similarity

public org.apache.lucene.search.Explanation idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats,
    org.apache.lucene.search.TermStatistics termStats)
Computes a score factor for a simple term and returns an explanation for that score factor.
The default implementation uses:
idf(docFreq, searcher.maxDoc());
Note that CollectionStatistics.maxDoc() is used instead of IndexReader#numDocs() because TermStatistics.docFreq() is also used, and when the latter is inaccurate, so is CollectionStatistics.maxDoc(), and in the same direction. In addition, CollectionStatistics.maxDoc() is more efficient to compute.
Parameters:
collectionStats - collection-level statistics
termStats - term-level statistics for the term

public org.apache.lucene.search.Explanation idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats,
    org.apache.lucene.search.TermStatistics[] termStats)
Computes a score factor for a phrase.
The default implementation sums the idf factor for each term in the phrase.
Parameters:
collectionStats - collection-level statistics
termStats - term-level statistics for the terms in the phrase

public final org.apache.lucene.search.similarities.Similarity.SimWeight computeWeight(float queryBoost,
    org.apache.lucene.search.CollectionStatistics collectionStats,
    org.apache.lucene.search.TermStatistics... termStats)
Specified by:
computeWeight in class org.apache.lucene.search.similarities.Similarity

public final org.apache.lucene.search.similarities.Similarity.SimScorer simScorer(org.apache.lucene.search.similarities.Similarity.SimWeight stats,
    org.apache.lucene.index.AtomicReaderContext context)
    throws IOException
Specified by:
simScorer in class org.apache.lucene.search.similarities.Similarity
Throws:
IOException

public float getK1()
Returns:
k1 parameter
See Also:
BM25LSimilarity(float, float, float)

public float getB()
Returns:
b parameter
See Also:
BM25LSimilarity(float, float, float)

public float getDelta()
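
The encodeNormValue and decodeNormValue descriptions are easier to follow with numbers. The sketch below shows why the two formulas are inverses: encoding stores boost / sqrt(length), and decoding with 1 / f^2 recovers length / boost^2, which is the field length itself when boost is 1. It also includes the avgFieldLength fallback described above. This is an illustration under assumptions: the real methods quantize to a single byte through SmallFloat, which is lossy and omitted here, and treating a non-positive sumTotalTermFreq as "not stored" is an assumption. All names are hypothetical.

```java
// Hypothetical helper: unquantized version of the documented norm formulas.
final class NormSketch {

    /** boost / sqrt(length), without the SmallFloat byte encoding. */
    static float encodeUnquantized(float boost, int fieldLength) {
        return boost / (float) Math.sqrt(fieldLength);
    }

    /** 1 / f^2, the inverse of the encoding when boost is 1. */
    static float decodeUnquantized(float f) {
        return 1f / (f * f);
    }

    /**
     * sumTotalTermFreq / maxDoc, or 1 when the statistic is unavailable.
     * Assumption: an unavailable statistic is reported as a value <= 0.
     */
    static float avgFieldLength(long sumTotalTermFreq, long maxDoc) {
        if (sumTotalTermFreq <= 0) {
            return 1f;
        }
        return (float) (sumTotalTermFreq / (double) maxDoc);
    }

    public static void main(String[] args) {
        float encoded = encodeUnquantized(1f, 16);
        System.out.println("encoded norm:    " + encoded);
        System.out.println("decoded length:  " + decodeUnquantized(encoded));
        System.out.println("avg length:      " + avgFieldLength(1000, 10));
        System.out.println("avg (no stats):  " + avgFieldLength(-1, 10));
    }
}
```

In the real implementation the byte encoding means the decoded value only approximates the original length, which is why encodeNormValue and decodeNormValue have to be changed together.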
Copyright © 2003–2019 Atlassian. All rights reserved.