
fix(router): increase inference validation token budget#432

Open
geelen wants to merge 1 commit into NVIDIA:main from geelen:codex/increase-inference-validation-token-budget

Conversation

geelen commented Mar 18, 2026

Summary

Increase the inference validation probe token budget from 1 to 32 so OpenAI-compatible backends that reject extremely small output budgets can still pass verification.
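The probe request itself isn't shown in this PR, but the change amounts to bumping one field in a minimal request body. A sketch, assuming the OpenAI-compatible chat completions payload shape (the function name and "ping" prompt are hypothetical, not from the router code):

```python
# Hypothetical sketch of a chat-completions validation probe.
# Payload fields follow the OpenAI-compatible chat completions API;
# the actual router implementation may differ.

PROBE_MAX_TOKENS = 32  # raised from 1; some backends reject tiny output budgets


def build_chat_probe(model: str) -> dict:
    """Build a minimal probe request used only to verify a backend responds."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": PROBE_MAX_TOKENS,
    }
```

The same budget constant would apply to the completions, Anthropic messages, and responses probes listed below.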

Related Issue

N/A

Changes

  • Increased the validation probe token budget from 1 to 32 for chat completions, completions, Anthropic messages, and responses probes
  • Updated the router-side validation test to expect the new probe budget
  • Updated the server-side inference verification test to match the new probe request shape

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@geelen geelen requested a review from a team as a code owner March 18, 2026 09:53
github-actions bot commented Mar 18, 2026

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

github-actions bot commented

Thank you for your interest in contributing to OpenShell, @geelen.

This project uses a vouch system for first-time contributors. Before submitting a pull request, you need to be vouched by a maintainer.

To get vouched:

  1. Open a Vouch Request discussion.
  2. Describe what you want to change and why.
  3. Write in your own words — do not have an AI generate the request.
  4. A maintainer will comment /vouch if approved.
  5. Once vouched, open a new PR (preferred) or reopen this one after a few minutes.

See CONTRIBUTING.md for details.

@github-actions github-actions bot closed this Mar 18, 2026
geelen (author) commented Mar 18, 2026

I have read the DCO document and I hereby sign the DCO.

@drew drew reopened this Mar 18, 2026
@github-actions github-actions bot closed this Mar 18, 2026
@drew drew requested a review from pimlock March 18, 2026 16:08
@pimlock pimlock reopened this Mar 18, 2026
@NVIDIA NVIDIA deleted a comment from github-actions bot Mar 18, 2026
@pimlock pimlock added the test:e2e Requires end-to-end coverage label Mar 18, 2026
geelen (author) commented Mar 18, 2026

FYI I have now tested this against the particular endpoint and it does indeed pass validation automatically. Also the value of 32 was just plucked out of thin air, but seemed like a safe default (my endpoint returned 11 tokens in response).

pimlock (collaborator) commented Mar 18, 2026

> FYI I have now tested this against the particular endpoint and it does indeed pass validation automatically. Also the value of 32 was just plucked out of thin air, but seemed like a safe default (my endpoint returned 11 tokens in response).

I think 32-ish makes sense and shouldn't impact response time too much. Flakiness and potential timeouts were a reason to include the --no-verify flag, so the check is not a blocker.

I just checked how openclaw does verification, and they also use 1 for max_tokens: https://github.com/openclaw/openclaw/blob/757c2cc2deb9a1157a0b5685eaff33bd4bb70485/src/commands/onboard-custom.ts#L269


Out of curiosity - what's the validation on the inference-api side? I'm assuming this is some kind of default that litellm is enforcing?

pimlock (collaborator) commented Mar 19, 2026

@geelen I did more research on this and tried different models; depending on the model, I got the error or not. I looped through all the models, and 5 was enough to pass the check for all of them.

I'd say let's update this to 5 and merge? This way the check would be faster, with less risk of running into a timeout (in case someone uses it with a super slow setup; it would have to be <1 tps).
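The model sweep described above could be sketched as follows. This is a hypothetical helper, not code from this repo; the `passes(model, budget)` callback stands in for one real probe request (a network call in practice) and is injected so the search itself stays testable:

```python
def min_passing_budget(models, passes, candidates=(1, 2, 5, 8, 16, 32)):
    """Return the smallest candidate budget accepted by every model.

    `passes(model, budget)` performs a single validation probe and
    returns True if the backend accepted that max_tokens value.
    Returns None if no candidate works for all models.
    """
    for budget in candidates:
        if all(passes(model, budget) for model in models):
            return budget
    return None
```

Under this sketch, the finding above corresponds to `min_passing_budget` returning 5 across the tested models.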
