Skip to content

SOLR-8127 Distributed Luke#4149

Open
kotman12 wants to merge 25 commits intoapache:mainfrom
kotman12:distributed-luke
Open

SOLR-8127 Distributed Luke#4149
kotman12 wants to merge 25 commits intoapache:mainfrom
kotman12:distributed-luke

Conversation

@kotman12
Copy link
Contributor

@kotman12 kotman12 commented Feb 19, 2026

https://issues.apache.org/jira/browse/SOLR-8127

Description

Currently, LukeRequestHandler fetches its data from the local shard handling the request. Thus all the statistics it produces are for a single shard only. This is a bit misleading for a multi-shard cloud because an API that exists at the collection level produces shard-specific response. The solution proposed here distributes the request to one replica from each shard and aggregates the responses whenever possible. I introduce a shards response collection for data that can't be aggregated, or at least cannot be practically aggregated.

Solution

I used Claude Opus 4.6 to generate the implementation with many manual refinements/iterations.

Design Decisions

  1. One-replica-per-shard aggregation
    The distributed Luke handler queries one replica from each shard and aggregates the results. An alternative approach would be to query every replica in the cluster, which could be useful for inspecting low-level Lucene characteristics (directory paths, segment files, version numbers). This implementation takes the simpler one-replica-per-shard approach to match how other distributed handlers in Solr work and to keep the scope manageable. Querying all replicas can be considered as a future enhancement if there's community interest.

  2. distrib=false as default
    The handler defaults to distrib=false (local mode), preserving backward compatibility since the response structure changes for distrib=true. When distrib=true, the response adds a shards section containing per-shard data that is either not mergeable or difficult to merge. This includes full index metadata (directory, segmentsFile, version, userData, lastModified) and detailed per-field statistics (topTerms, distinct, histogram).

  3. docsAsLong field in LukeResponse
    The existing docs field in LukeResponse.FieldInfo is an int. When summing document counts across many shards, this can overflow. A new docsAsLong accessor was added to LukeResponse to handle collection-wide doc counts safely. The original docs field is preserved for backward compatibility.

  4. show=doc lucene response has shard-local docFreq
    When looking up a document with distrib=true, the response includes both a high-level solr section (stored fields) and a low-level lucene section (per-field index analysis including docFreq). The docFreq values in the lucene section are not aggregated across shards, they reflect only the shard where the document was found. Aggregating docFreq across shards would require additional fan-out requests to every shard for each term, which is expensive and I'm not sure what the demand is for this kind of logic. I acknowledge that it may feel inconsistent since the docs logic per field does aggregate across shards.

  5. distrib=true with show=doc only only supports id request param, not docId
    Lucene document IDs (docId) are internal to each shard's index and have no meaning across shards, i.e. the same docId value refers to different documents on different shards. Only the logical id parameter (using the schema's uniqueKeyField) is supported for distributed doc lookup, since it can be unambiguously resolved across the cluster. Passing docId with distrib=true returns a clear error. Additionally, if the same id is found on multiple shards (indicating index corruption), the handler returns an error rather than silently picking one.

  6. Single-shard fallback to local mode
    When distrib=true is set on a collection with only one shard, the handler falls back to local mode returning the standard local Luke response without the shards field. This matches a common pattern in other Solr handlers. However, this creates a response structure inconsistency: clients parsing a distributed response need to handle two shapes depending on shard count. For a single-shard collection, per-shard index metadata (like directory, segmentsFile) appears at the top level; for multi-shard collections, it appears under shards. This may be annoying for client code that needs to look in two places for the same data. An alternative would be to always use the distributed response structure when distrib=true is explicitly requested, even for single-shard collections. This is an open question worth community input.

Tests

Added LukeRequestHandlerDistribTest to test this feature. Also have built this locally to ensure it works E2E.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide
  • I have added a changelog entry for my change

@github-actions github-actions bot added documentation Improvements or additions to documentation client:solrj tests labels Feb 19, 2026
@kotman12 kotman12 closed this Feb 19, 2026
@kotman12 kotman12 reopened this Feb 19, 2026
@kotman12 kotman12 changed the title Distributed luke SOLR-8127 Distributed luke Feb 19, 2026
@kotman12 kotman12 marked this pull request as ready for review February 25, 2026 20:27
@kotman12 kotman12 changed the title SOLR-8127 Distributed luke SOLR-8127 Distributed Luke Feb 26, 2026
String v = uniqueKey.getType().toInternal(params.get(ID));
Term t = new Term(uniqueKey.getName(), v);
docId = searcher.getFirstMatch(t);
if (docId < 0) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This a slight change to the existing contract. Previously if a docId was missing it would throw. Now I propose returning an empty response (more in line with how search works). This is to handle the distributed case more elegantly. It is theoretically a breaking change although perhaps unlikely someone was relying on this error? One way I could see this being a problem is someone assuming the doc field will be in the response if the response code is a 200 without null checking first. I suppose this could be problematic and perhaps warrants more care...

final String originalShardAddr;
final LukeResponse.FieldInfo originalFieldInfo;
private Object indexFlags;
private String indexFlagsShardAddr;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The source of the field info and the index flags can actually be different. If a shard has all the documents of a particular field deleted it will have field info but without the index flags, those can theoretically come from yet another shard.


// docId is a Lucene-internal integer, not meaningful across shards
if (reqParams.getInt(DOC_ID) != null) {
throw new SolrException(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current API returns a single doc in case of show=doc, thus requesting a Lucene docId in distributed mode makes no sense.


String[] shards = rb.shards;
if (shards == null || shards.length == 0) {
return false;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The distributed response includes extra fields, thus delegating to the non-distributed API in case of a single sharded is a bit annoying API-wise because the client may have to handle different response structure per, say, collection or cloud.

// Deal with the chance that the first bunch of terms are in deleted documents. Is there a
// better way?
for (int idx = 0; idx < 1000 && postingsEnum == null; ++idx) {
for (int idx = 0; idx < 1000; ++idx) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was explained in #4157

* different Lucene FieldInfo (and thus different index flags strings) on each shard.
*/
@Test
public void testInconsistentIndexFlagsAcrossShards() throws Exception {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this test too brittle and possibly overkill?

String schema;
int docs;
int distinct;
Long docsAsLong;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit sad I had to add this. It would have been nice to be able to change docs. If I got this in before Solr 10 it could have been possible?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

client:solrj documentation Improvements or additions to documentation tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant