Skip to content

v2.4.0 — sharpen the token moat: lean tools, doctor, easy install, TESTED_BY fix#559

Open
tirth8205 wants to merge 17 commits into
mainfrom
release/v2.4.0
Open

v2.4.0 — sharpen the token moat: lean tools, doctor, easy install, TESTED_BY fix#559
tirth8205 wants to merge 17 commits into
mainfrom
release/v2.4.0

Conversation

@tirth8205

Copy link
Copy Markdown
Owner

Implements the 4-phase "sharpen the token moat" plan. Nothing here changes the design (local SQLite graph, Tree-sitter, MCP) — it makes default responses smaller, makes the savings provable, removes Python-install friction, and widens the review lead. 1474 tests pass; ruff + mypy clean.

⚠️ Holding for your review — do not auto-merge. Several defaults change (lean tool set, minimal detail). Read "Behavior changes" in CHANGELOG before merging.

Phase 1 — bigger token moat

  • Lean tool set by default: serve exposes 7 curated tools instead of 30 (~8k fewer tokens/session, fewer agent mis-picks). --tools all / CRG_TOOLS=all restores everything; --tools lean|<csv> available.
  • Token-budgeted review: get_review_context + detect_changes take max_tokens (default 6000), keep highest-risk findings, report omissions — a review can no longer overflow the context window (the audit found responses up to 400k tokens).
  • Minimal by default for query_graph / semantic_search / impact_radius / detect_changes; max_results cap (100) on query_graph; CRG_DETAIL_LEVEL + serve --detail overrides.

Phase 2 — prove the savings

  • code-review-graph doctor: health checklist + your latest Token Savings number as proof it's working (answers "is it running?"). Verified live here: flags a stale graph, shows ~91% saved.
  • Weekly eval CI surfaces the median token reduction; README "benchmarks: reproducible" badge.

Phase 3 — easy install

  • install.sh / install.ps1: curl … | sh bootstraps uv then installs the CLI — no manual Python setup. pip/pipx/uvx stay as alternatives. sh -n + PowerShell-parse lint job added.
  • First-run guard: read tools return a clear not_built status instead of silently making an empty DB and returning "0 results".

Phase 4 — widen the review lead

Bumps to v2.4.0. Full "Behavior changes" list in CHANGELOG.md.

🤖 Generated with Claude Code

tirth8205 and others added 17 commits June 14, 2026 17:55
The GitHub Action's PR comment rendered absolute CI-runner paths like
/home/runner/work/repo/repo/code_review_graph/embeddings.py::get_provider,
which is ugly and leaks the runner's directory layout.

Add relativize_path(): strip the GITHUB_WORKSPACE prefix (set by Actions
checkout), fall back to the doubled <repo>/<repo>/ segment derived from
GITHUB_REPOSITORY when the env is absent, and otherwise leave the path
untouched rather than guess-mangle it. The ::symbol suffix is preserved,
and already-relative paths pass through unchanged.

Wire it into the Symbol and Location columns and the test-gap list. Risk
math and sticky-comment marker logic are untouched.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Update the FAQ comparison section with verified specifics: CRG ships
risk-scored change review, execution-flow criticality, impact radius, a
PR-review GitHub Action, dead-code, wiki, and hybrid keyword+vector
search locally and free; note (factually, without name-calling) that
some popular retrieval-focused graph tools are keyword/FTS-only today
and position deeper PR-review intelligence as a forthcoming
hosted/commercial product.

Add a one-line wedge — "CRG focuses on review; retrieval tools focus on
navigation; both are useful, often together" — and per-job guidance.

Add one wedge sentence near the top of the README: review-native,
token-efficient, local, free, no waitlist.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add a `doctor` subcommand (and code_review_graph/doctor.py) that runs a
fast, read-only health checklist and prints ✓/✗ lines with next-step hints:

1. graph.db exists and has nodes (critical → run build)
2. freshness: stored git_head_sha vs current HEAD via a safe git subprocess
   (list args, stdin=DEVNULL, bounded timeout) — warns when stale
3. MCP config present for repo-local platforms (reuses skills.PLATFORMS)
4. serve command resolves (reuses skills._detect_serve_command)
5. MCP server import/boot smoke — confirms registered tool count > 0
6. platform-native / git hooks installed
7. embeddings present, or a note that semantic search falls back to keyword

Exits non-zero only on critical failures; warnings never fail the exit code.
When a graph exists, the closing line surfaces the latest detect-changes Token
Savings number as a 'see your savings' proof. Generalizes the PASS/FAIL,
exit-code-driven pattern of scripts/diagnose_pypi_connectivity.py without
touching it.

Wires the command into the banner, install Next-steps output, CLAUDE.md,
docs/COMMANDS.md, and the README Quick Start. Adds tests/test_doctor.py
(unbuilt repo flags missing graph; built repo is healthy; exit codes).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The parser canonically writes TESTED_BY edges as source=production,
target=test (reads as "X is tested by Y"). Several read-side consumers
traversed them in the inverted direction; the naming-convention fallback
(test_<name> / Test<name>) silently masked the bug for conventionally
named tests, so the suite stayed green while real coverage lookups were
wrong.

Flip every inverted consumer to read source=production / target=test:
- graph.py::get_transitive_tests: the three TESTED_BY queries (direct,
  bare-name fallback, transitive) now select target_qualified where
  source_qualified = ?.
- graph.py::load_flow_adjacency: track the production node (src) in
  has_tested_by so criticality scoring reflects real coverage.
- changes.py: test-gap detection looks up a changed production function's
  tests by source.
- tools/query.py: query_graph(pattern="tests_for") follows edges by source
  and returns the target test node.
- enrich.py: symbol enrichment traverses TESTED_BY by source and reads the
  target test name.

Already-correct readers (analysis.py, tools/build.py, tools/review.py)
were left untouched. No DB migration is needed: stored edges were always
canonical, only the read side was inverted.

Regression tests use unconventional test names so the naming-convention
fallback cannot mask future regressions, and a producer-side guard
asserts the parser emits source=production / target=test (not merely that
TESTED_BY edges exist) so a future parser flip cannot silently re-break it.

Fixes #515

Co-Authored-By: Pratik Jagzap <pratikjagzap530@gmail.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The producer-side parser test and the consumer-side fixture tests each
exercise only one half of the TESTED_BY contract. Add a single test that
parses a real production/test fixture pair through the parser, stores the
emitted nodes and edges, and asserts get_transitive_tests surfaces the
test as covering the production source. This couples the parser's
canonical direction to the consumer query, so a future parser flip breaks
this test even if every hand-seeded fixture test still passes.

Refs #515

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
find_dead_code() inferred test coverage from *incoming* edges, but the
parser writes TESTED_BY as source=production, target=test, so a tested
production node is the *source* of its TESTED_BY edge. The inverted read
(plus a bare-name fallback that searched TESTED_BY targets) never matched,
so a function whose only reference was its test was wrongly flagged as
dead code.

Compute has_test_refs from outgoing TESTED_BY edges (by qualified name,
class-qualified name, and a plausible-caller-filtered bare-name source
fallback), aligning dead-code detection with the canonical direction.
This consumer site was missed by #515's original sweep.

Refs #515

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
First-run UX fix. Read tools used to silently create an empty graph.db and
return a misleading "Found 0 results" when the graph had never been built.
They now return {status: "not_built", message, next_tool_suggestions:
["build_or_update_graph_tool"]} instead.

- Add _graph_is_built / _not_built_response / _get_store_for_read to
  tools/_common.py. "Built" means any node OR a build-marker metadata row
  (last_updated / last_build_type), so a legitimately empty-but-built graph
  is distinguished from a never-built one. A missing graph.db short-circuits
  before GraphStore would create one.
- Route read tools (query, search, impact, stats, flows, communities,
  architecture, review/detect-changes/affected-flows, refactor preview,
  minimal-context, wiki, hub/bridge/gaps/surprising/suggested) through the
  guard. Build/update/postprocess/embed paths keep using _get_store and are
  never blocked.
- not_built response is smaller than the old empty result and avoids the
  stray empty-db side-effect, preserving the token-savings invariant.
- Update existing tests that patched _get_store on guarded modules.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Lower the front door without changing the engine. Add install.sh (macOS/Linux,
POSIX sh, no bashisms) and install.ps1 (Windows) at repo root that:
  1. detect uv and install it via the official Astral installer
     (https://astral.sh/uv/install.{sh,ps1}, echoed before running) if missing,
  2. uv tool install code-review-graph (fallback pipx, then pip --user),
  3. print next steps (install / build / status).

Scripts are idempotent and fail-loud with helpful messages. Comments state
honestly that this installs uv (a Python toolchain manager), not a bundled
runtime. README gains a "Quick install (no Python setup)" one-liner at the top
of Quick Start, above the pip instructions, with pip/uvx kept as alternatives.

Add tests/test_install_scripts.py (sh -n syntax check, bashism guard, pinned-URL
and content asserts; PowerShell parse when pwsh is available) and a CI
script-lint job that validates both scripts' syntax.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add reporter.median_token_reduction / median_token_reduction_table: the
median of per-question (1 - graph_tokens / baseline_tokens) * 100 over the
agent_baseline CSV rows (status=ok only), falling back to token_efficiency
when no agent-baseline rows exist. Empty/failed runs report n/a rather than
fabricating a number.

eval.yml now leads its job summary with that headline median table (in
addition to the full report) and already uploads the CSV artifact. The
workflow yaml parses and the summary heredoc was validated end-to-end against
both empty and populated results dirs.

Add a static 'benchmarks: reproducible' shields badge to the README linking
to the eval workflow runs. The badge asserts the pipeline is reproducible;
the canonical numbers live in the README Benchmarks section and
docs/REPRODUCING.md, never auto-committed from CI. Documents the local
reproduction recipe in REPRODUCING.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add opt-in payload guards to the two tool functions that could blow the
token promise:

- get_review_context / detect_changes_func gain a max_tokens parameter
  (default 0 = off at the function layer). When set, the lowest-risk
  source snippets / changed functions / flows / test gaps are dropped
  first (highest-risk and review_priorities kept longest) and an honest
  "omitted" note is added. Budgeting uses the existing chars/4
  context_savings estimate; the Token Savings panel math is untouched.
- query_graph gains a max_results cap (default 0 = off) so callers_of /
  callees_of on a hot symbol cannot return an unbounded payload; trims
  edges to the surviving set and reports truncated/total_results.

Both default to no-op here; the MCP wrappers set the user-facing
defaults. Backward compatible for all internal/CLI/eval callers.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Shrink the per-session and per-call token cost of the MCP surface.

Lean tool mode (default): CRG registers 30 tools (~8k description tokens
per turn). The server now loads a curated lean set of 7 tools by default
(get_minimal_context, query_graph, semantic_search_nodes, detect_changes,
get_review_context, get_impact_radius, get_affected_flows) — the smallest
set covering every documented workflow. The full set stays one step away:
`serve --tools all` / `CRG_TOOLS=all`, `--tools lean`, or a custom CSV;
unknown names are ignored gracefully. A one-line stderr notice prints
whenever tools are trimmed (stdout stays clean for JSON-RPC).

Minimal-first MCP defaults: query_graph_tool, semantic_search_nodes_tool,
get_impact_radius_tool and detect_changes_tool now default
detail_level="minimal". get_review_context_tool stays "standard" but is
capped by max_tokens (its content IS the context; the budget is the safe
lever). A server-wide override (CRG_DETAIL_LEVEL env + `serve --detail`)
can force every tool's verbosity and beats the per-call argument.

Payload caps wired to the wrappers: query_graph_tool max_results=100,
get_review_context_tool / detect_changes_tool max_tokens=6000.

The underlying tool functions keep standard/0 defaults so CLI, eval and
internal callers are unchanged. Docs (README tool-filtering, COMMANDS
signatures + serve examples) updated.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ED_BY fix

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

code-review-graph review

Overall risk: 0.80 (HIGH) — 232 changed function(s)/class(es), 0 affected flow(s), 131 test gap(s)

Risk-scored changes

Risk Level Symbol Location Tested
0.80 high code_review_graph/doctor.py::latest_token_savings code_review_graph/doctor.py:434 no
0.75 high code_review_graph/cli.py::_handle_init code_review_graph/cli.py:191 no
0.75 high code_review_graph/eval/reporter.py::median_token_reduction code_review_graph/eval/reporter.py:107 no
0.75 high code_review_graph/main.py::query_graph_tool code_review_graph/main.py:276 no
0.70 high code_review_graph/eval/reporter.py::median_token_reduction_table code_review_graph/eval/reporter.py:165 no
0.65 medium code_review_graph/doctor.py::CheckResult code_review_graph/doctor.py:43 no
0.65 medium code_review_graph/enrich.py::_format_node_context code_review_graph/enrich.py:98 no
0.65 medium code_review_graph/main.py::_resolve_detail_level code_review_graph/main.py:123 no
0.65 medium code_review_graph/tools/review.py::_apply code_review_graph/tools/review.py:636 no
0.65 medium code_review_graph/tools/review.py::_fits code_review_graph/tools/review.py:640 no

Test gaps

  • code_review_graph/changes.py::analyze_changes (code_review_graph/changes.py:277)
  • code_review_graph/cli.py::_print_banner (code_review_graph/cli.py:92)
  • code_review_graph/cli.py::_handle_init (code_review_graph/cli.py:191)
  • code_review_graph/cli.py::_handle_doctor (code_review_graph/cli.py:354)
  • code_review_graph/cli.py::main (code_review_graph/cli.py:365)
  • ...and 126 more without direct tests

Token savings: this graph-backed report used ~383,975 fewer tokens (~92%) than reading every changed file in full (estimated, chars/4 approximation).


Powered by code-review-graph — local-first analysis; no code leaves the CI runner.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant