Package org.apache.lucene.search.grouping

This module enables search result grouping with Lucene, where hits with the same value in the specified single-valued group field are grouped together. For example, if you group by the author field, then all documents with the same value in the author field fall into a single group.

Grouping requires a number of inputs:

  • groupField: this is the field used for grouping. For example, if you use the author field then each group has all books by the same author. Documents that don't have this field are grouped under a single group with a null group value.
  • groupSort: how the groups are sorted. For sorting purposes, each group is "represented" by the highest-sorted document within it according to the groupSort. For example, if you specify "price" (ascending) then the first group is the one containing the lowest-priced book. Or if you sort groups by relevance, then the first group is the one containing the highest-scoring book.
  • topNGroups: how many top groups to keep. For example, 10 means the top 10 groups are computed.
  • groupOffset: which "slice" of top groups you want to retrieve. For example, 3 means you'll get 7 groups back (assuming topNGroups is 10). This is useful for paging, where you might show 5 groups per page.
  • withinGroupSort: how the documents within each group are sorted. This can be different from the group sort.
  • maxDocsPerGroup: how many top documents within each group to keep.
  • withinGroupOffset: which "slice" of top documents you want to retrieve from each group.
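To make the interplay of these inputs concrete, here is a minimal plain-Java sketch that applies them to an in-memory list. It does not use or mirror Lucene's classes: Book, BY_PRICE, group and sampleHits are hypothetical names introduced for illustration only. It follows the description above, where the offset is taken out of the top N that were kept, and it assumes that maxDocsPerGroup and withinGroupOffset slice each group the same way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of the grouping inputs; it does not use Lucene's classes. */
class GroupingInputsSketch {

  /** Hypothetical stand-in for an indexed document. */
  static final class Book {
    final String author; // group field value; null if the document lacks the field
    final String title;
    final double price;

    Book(String author, String title, double price) {
      this.author = author;
      this.title = title;
      this.price = price;
    }
  }

  /** Ascending price, used here both as groupSort and as withinGroupSort. */
  static final Comparator<Book> BY_PRICE = Comparator.comparingDouble(b -> b.price);

  /** Returns group value -> titles, after sorting and slicing groups and their documents. */
  static Map<String, List<String>> group(
      List<Book> hits,
      Comparator<Book> groupSort, int groupOffset, int topNGroups,
      Comparator<Book> withinGroupSort, int withinGroupOffset, int maxDocsPerGroup) {

    // groupField: bucket hits by author; documents with a null author share one group.
    Map<String, List<Book>> buckets = new LinkedHashMap<>();
    for (Book b : hits) {
      List<Book> docs = buckets.get(b.author);
      if (docs == null) {
        docs = new ArrayList<>();
        buckets.put(b.author, docs);
      }
      docs.add(b);
    }

    // groupSort: each group is represented by its best document under groupSort.
    List<Map.Entry<String, List<Book>>> groups = new ArrayList<>(buckets.entrySet());
    groups.sort((x, y) -> groupSort.compare(
        Collections.min(x.getValue(), groupSort),
        Collections.min(y.getValue(), groupSort)));

    // topNGroups and groupOffset: keep the top N groups, then skip the first groupOffset.
    int groupEnd = Math.min(topNGroups, groups.size());
    Map<String, List<String>> result = new LinkedHashMap<>();
    for (int i = groupOffset; i < groupEnd; i++) {
      List<Book> docs = new ArrayList<>(groups.get(i).getValue());
      docs.sort(withinGroupSort);
      // maxDocsPerGroup and withinGroupOffset: the same slicing, applied per group.
      int docEnd = Math.min(maxDocsPerGroup, docs.size());
      List<String> titles = new ArrayList<>();
      for (int j = withinGroupOffset; j < docEnd; j++) {
        titles.add(docs.get(j).title);
      }
      result.put(groups.get(i).getKey(), titles);
    }
    return result;
  }

  /** Small hypothetical data set: three authors plus one book without an author. */
  static List<Book> sampleHits() {
    return Arrays.asList(
        new Book("Adams", "A1", 12.0),
        new Book("Baker", "B1", 5.0),
        new Book("Adams", "A2", 7.0),
        new Book("Clark", "C1", 9.0),
        new Book(null, "N1", 20.0),
        new Book("Baker", "B2", 15.0));
  }
}
```

With the sample data, sorting groups by ascending price ranks them Baker, Adams, Clark, then the null group. Asking for topNGroups = 3 with groupOffset = 1 therefore yields the Adams and Clark groups.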

The implementation is two-pass: the first pass (FirstPassGroupingCollector) gathers the top groups, and the second pass (SecondPassGroupingCollector) gathers documents within those groups. If the search is costly to run you may want to use the CachingCollector class, which caches hits and can (quickly) replay them for the second pass. This way you only run the query once, but you pay a RAM cost to (briefly) hold all hits. Results are returned as a TopGroups instance.
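The control flow of the two passes, including the fallback when the cache overflows, can be sketched in plain Java. This is an illustration of the idea only, not Lucene's implementation: TwoPassSketch, CachingSink, runQuery and search are hypothetical names, hits are simple {author, title} arrays assumed to arrive in relevance order, and the cache budget is a hit count rather than a RAM size in MB.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Plain-Java illustration of two passes over cached hits; not Lucene's implementation. */
class TwoPassSketch {

  /** Forwards hits to the first pass and buffers them, up to a hypothetical hit budget. */
  static final class CachingSink implements Consumer<String[]> {
    private final Consumer<String[]> firstPass;
    private final int maxCachedHits;
    private List<String[]> buffer = new ArrayList<>();

    CachingSink(Consumer<String[]> firstPass, int maxCachedHits) {
      this.firstPass = firstPass;
      this.maxCachedHits = maxCachedHits;
    }

    @Override
    public void accept(String[] hit) {
      firstPass.accept(hit);
      if (buffer == null) {
        return; // cache already dropped
      }
      if (buffer.size() < maxCachedHits) {
        buffer.add(hit);
      } else {
        buffer = null; // over budget: drop the cache entirely
      }
    }

    boolean isCached() {
      return buffer != null;
    }

    void replay(Consumer<String[]> secondPass) {
      if (buffer == null) {
        throw new IllegalStateException("cache was dropped");
      }
      for (String[] hit : buffer) {
        secondPass.accept(hit);
      }
    }
  }

  /** Counts how often the (simulated) query was executed. */
  static int queryRuns = 0;

  /** Simulated query execution: feeds every hit {author, title} to the sink. */
  static void runQuery(List<String[]> hits, Consumer<String[]> sink) {
    queryRuns++;
    for (String[] hit : hits) {
      sink.accept(hit);
    }
  }

  static Map<String, List<String>> search(List<String[]> hits, int topNGroups, int maxCachedHits) {
    // First pass: remember the first topNGroups distinct authors (relevance group sort).
    Map<String, List<String>> groups = new LinkedHashMap<>();
    Consumer<String[]> firstPass = hit -> {
      if (groups.size() < topNGroups && !groups.containsKey(hit[0])) {
        groups.put(hit[0], new ArrayList<>());
      }
    };
    CachingSink cache = new CachingSink(firstPass, maxCachedHits);
    runQuery(hits, cache);

    // Second pass: collect documents, but only for the groups chosen in the first pass.
    Consumer<String[]> secondPass = hit -> {
      List<String> docs = groups.get(hit[0]);
      if (docs != null) {
        docs.add(hit[1]);
      }
    };
    if (cache.isCached()) {
      cache.replay(secondPass); // hits fit in the budget: no second query run
    } else {
      runQuery(hits, secondPass); // cache overflowed: execute the query again
    }
    return groups;
  }
}
```

In this sketch a generous budget leaves queryRuns at 1, while a budget smaller than the number of hits forces a second run. The grouped result is the same either way; only the cost differs.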

Known limitations:

  • The group field must be a single-valued indexed field. FieldCache is used to load the FieldCache.StringIndex for this field.
  • Unlike Solr's implementation, this module cannot group by function query values nor by arbitrary queries.
  • Sharding is not directly supported, though it is not too difficult to add if you can merge the top groups and the top documents per group yourself.
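As a rough idea of what merging top groups across shards could look like, the plain-Java sketch below combines per-shard results. It is one possible approach under stated assumptions, not something this module provides: ShardMergeSketch and mergeTopGroups are hypothetical names, each shard is assumed to report its top groups as a map from group value to the sort value of the group's best document (lower is better, as with ascending price), and group values are assumed to be non-null.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of merging per-shard top groups; not part of this module. */
class ShardMergeSketch {

  /**
   * Merges per-shard maps of group value -> best (lowest) sort value and
   * returns the overall top N group values, best first.
   * Assumes non-null group values.
   */
  static List<String> mergeTopGroups(List<Map<String, Double>> shardTopGroups, int topNGroups) {
    // A group may appear on several shards; keep its best sort value overall.
    Map<String, Double> best = new HashMap<>();
    for (Map<String, Double> shard : shardTopGroups) {
      for (Map.Entry<String, Double> e : shard.entrySet()) {
        best.merge(e.getKey(), e.getValue(), Double::min);
      }
    }

    // Rank merged groups by sort value; break ties by group value for determinism.
    List<Map.Entry<String, Double>> merged = new ArrayList<>(best.entrySet());
    merged.sort((a, b) -> {
      int c = Double.compare(a.getValue(), b.getValue());
      return c != 0 ? c : a.getKey().compareTo(b.getKey());
    });

    List<String> top = new ArrayList<>();
    for (int i = 0; i < Math.min(topNGroups, merged.size()); i++) {
      top.add(merged.get(i).getKey());
    }
    return top;
  }
}
```

A second round would then ask each shard for its top documents within the merged groups and merge those per group in the same fashion.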

Typical usage looks like this (using the CachingCollector):

  FirstPassGroupingCollector c1 = new FirstPassGroupingCollector("author", groupSort, groupOffset+topNGroups);

  boolean cacheScores = true;
  double maxCacheRAMMB = 4.0;
  CachingCollector cachedCollector = CachingCollector.create(c1, cacheScores, maxCacheRAMMB);
  s.search(new TermQuery(new Term("content", searchTerm)), cachedCollector);

  // fillFields must be declared before its first use in getTopGroups:
  boolean fillFields = true;
  Collection<SearchGroup> topGroups = c1.getTopGroups(groupOffset, fillFields);

  if (topGroups == null) {
    // No groups matched
    return;
  }

  boolean getScores = true;
  boolean getMaxScores = true;
  SecondPassGroupingCollector c2 = new SecondPassGroupingCollector("author", topGroups, groupSort, docSort, docOffset+docsPerGroup, getScores, getMaxScores, fillFields);

  // Optionally compute total group count. MultiCollector.wrap returns a plain
  // Collector, so c2 keeps its own type for the later getTopGroups call.
  Collector secondPass = c2;
  AllGroupsCollector allGroupsCollector = null;
  if (requiredTotalGroupCount) {
    allGroupsCollector = new AllGroupsCollector("author");
    secondPass = MultiCollector.wrap(c2, allGroupsCollector);
  }

  if (cachedCollector.isCached()) {
    // Cache fit within maxCacheRAMMB, so we can replay it:
    cachedCollector.replay(secondPass);
  } else {
    // Cache was too large; must re-execute query:
    s.search(new TermQuery(new Term("content", searchTerm)), secondPass);
  }

  TopGroups groupsResult = c2.getTopGroups(docOffset);
  if (requiredTotalGroupCount) {
    groupsResult = new TopGroups(groupsResult, allGroupsCollector.getGroupCount());
  }

  // Render groupsResult...

Classes

  • AllGroupsCollector: A collector that collects all groups that match the query.
  • FirstPassGroupingCollector: FirstPassGroupingCollector is the first of two passes necessary to collect grouped hits.
  • GroupDocs: Represents one group in the results.
  • SearchGroup
  • SecondPassGroupingCollector: SecondPassGroupingCollector is the second of two passes necessary to collect grouped docs.
  • TopGroups: Represents the result returned by a grouping search.