v2.4.0 — sharpen the token moat: lean tools, doctor, easy install, TESTED_BY fix by tirth8205 · Pull Request #559 · tirth8205/code-review-graph

tirth8205 · 2026-06-14T17:21:04Z

Implements the 4-phase "sharpen the token moat" plan. Nothing here changes the design (local SQLite graph, Tree-sitter, MCP) — it makes default responses smaller, makes the savings provable, removes Python-install friction, and widens the review lead. 1474 tests pass; ruff + mypy clean.

⚠️ Holding for your review — do not auto-merge. Several defaults change (lean tool set, minimal detail). Read "Behavior changes" in CHANGELOG before merging.

Phase 1 — bigger token moat

Lean tool set by default: serve exposes 7 curated tools instead of 30 (~8k fewer tokens/session, fewer agent mis-picks). --tools all / CRG_TOOLS=all restores everything; --tools lean|<csv> available.
Token-budgeted review: get_review_context + detect_changes take max_tokens (default 6000), keep highest-risk findings, report omissions — a review can no longer overflow the context window (the audit found responses up to 400k tokens).
Minimal by default for query_graph / semantic_search / impact_radius / detect_changes; max_results cap (100) on query_graph; CRG_DETAIL_LEVEL + serve --detail overrides.

Phase 2 — prove the savings

code-review-graph doctor: health checklist + your latest Token Savings number as proof it's working (answers "is it running?"). Verified live here: flags a stale graph, shows ~91% saved.
Weekly eval CI surfaces the median token reduction; README "benchmarks: reproducible" badge.

Phase 3 — easy install

install.sh / install.ps1: curl … | sh bootstraps uv then installs the CLI — no manual Python setup. pip/pipx/uvx stay as alternatives. sh -n + PowerShell-parse lint job added.
First-run guard: read tools return a clear not_built status instead of silently making an empty DB and returning "0 results".

Phase 4 — widen the review lead

TESTED_BY direction fix (Bug: tests_for query uses wrong edge direction for TESTED_BY, always falls back to naming convention #515): integrates community PR fix(graph): correct TESTED_BY edge direction in tests_for queries #527 by @Devilthelegend, with the producer-side regression test we asked for added, credited via Co-Authored-By. Fixes tests_for, get_transitive_tests, test-gap detection, flow criticality, enrichment, dead-code — and makes the Action's "Tested" column correct. No DB migration.
GitHub Action now renders repo-relative paths (not absolute CI-runner paths).
Positioning: docs/FAQ.md + a README wedge line — factual, non-disparaging (review-native + local + free vs retrieval-focused tools that gate PR-review behind hosted products).

Bumps to v2.4.0. Full "Behavior changes" list in CHANGELOG.md.

🤖 Generated with Claude Code

The GitHub Action's PR comment rendered absolute CI-runner paths like /home/runner/work/repo/repo/code_review_graph/embeddings.py::get_provider, which is ugly and leaks the runner's directory layout. Add relativize_path(): strip the GITHUB_WORKSPACE prefix (set by Actions checkout), fall back to the doubled <repo>/<repo>/ segment derived from GITHUB_REPOSITORY when the env is absent, and otherwise leave the path untouched rather than guess-mangle it. The ::symbol suffix is preserved, and already-relative paths pass through unchanged. Wire it into the Symbol and Location columns and the test-gap list. Risk math and sticky-comment marker logic are untouched. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Update the FAQ comparison section with verified specifics: CRG ships risk-scored change review, execution-flow criticality, impact radius, a PR-review GitHub Action, dead-code, wiki, and hybrid keyword+vector search locally and free; note (factually, without name-calling) that some popular retrieval-focused graph tools are keyword/FTS-only today and position deeper PR-review intelligence as a forthcoming hosted/commercial product. Add a one-line wedge — "CRG focuses on review; retrieval tools focus on navigation; both are useful, often together" — and per-job guidance. Add one wedge sentence near the top of the README: review-native, token-efficient, local, free, no waitlist. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Add a `doctor` subcommand (and code_review_graph/doctor.py) that runs a fast, read-only health checklist and prints ✓/✗ lines with next-step hints: 1. graph.db exists and has nodes (critical → run build) 2. freshness: stored git_head_sha vs current HEAD via a safe git subprocess (list args, stdin=DEVNULL, bounded timeout) — warns when stale 3. MCP config present for repo-local platforms (reuses skills.PLATFORMS) 4. serve command resolves (reuses skills._detect_serve_command) 5. MCP server import/boot smoke — confirms registered tool count > 0 6. platform-native / git hooks installed 7. embeddings present, or a note that semantic search falls back to keyword Exits non-zero only on critical failures; warnings never fail the exit code. When a graph exists, the closing line surfaces the latest detect-changes Token Savings number as a 'see your savings' proof. Generalizes the PASS/FAIL, exit-code-driven pattern of scripts/diagnose_pypi_connectivity.py without touching it. Wires the command into the banner, install Next-steps output, CLAUDE.md, docs/COMMANDS.md, and the README Quick Start. Adds tests/test_doctor.py (unbuilt repo flags missing graph; built repo is healthy; exit codes). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The parser canonically writes TESTED_BY edges as source=production, target=test (reads as "X is tested by Y"). Several read-side consumers traversed them in the inverted direction; the naming-convention fallback (test_<name> / Test<name>) silently masked the bug for conventionally named tests, so the suite stayed green while real coverage lookups were wrong. Flip every inverted consumer to read source=production / target=test: - graph.py::get_transitive_tests: the three TESTED_BY queries (direct, bare-name fallback, transitive) now select target_qualified where source_qualified = ?. - graph.py::load_flow_adjacency: track the production node (src) in has_tested_by so criticality scoring reflects real coverage. - changes.py: test-gap detection looks up a changed production function's tests by source. - tools/query.py: query_graph(pattern="tests_for") follows edges by source and returns the target test node. - enrich.py: symbol enrichment traverses TESTED_BY by source and reads the target test name. Already-correct readers (analysis.py, tools/build.py, tools/review.py) were left untouched. No DB migration is needed: stored edges were always canonical, only the read side was inverted. Regression tests use unconventional test names so the naming-convention fallback cannot mask future regressions, and a producer-side guard asserts the parser emits source=production / target=test (not merely that TESTED_BY edges exist) so a future parser flip cannot silently re-break it. Fixes #515 Co-Authored-By: Pratik Jagzap <pratikjagzap530@gmail.com> Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The producer-side parser test and the consumer-side fixture tests each exercise only one half of the TESTED_BY contract. Add a single test that parses a real production/test fixture pair through the parser, stores the emitted nodes and edges, and asserts get_transitive_tests surfaces the test as covering the production source. This couples the parser's canonical direction to the consumer query, so a future parser flip breaks this test even if every hand-seeded fixture test still passes. Refs #515 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

find_dead_code() inferred test coverage from *incoming* edges, but the parser writes TESTED_BY as source=production, target=test, so a tested production node is the *source* of its TESTED_BY edge. The inverted read (plus a bare-name fallback that searched TESTED_BY targets) never matched, so a function whose only reference was its test was wrongly flagged as dead code. Compute has_test_refs from outgoing TESTED_BY edges (by qualified name, class-qualified name, and a plausible-caller-filtered bare-name source fallback), aligning dead-code detection with the canonical direction. This consumer site was missed by #515's original sweep. Refs #515 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

First-run UX fix. Read tools used to silently create an empty graph.db and return a misleading "Found 0 results" when the graph had never been built. They now return {status: "not_built", message, next_tool_suggestions: ["build_or_update_graph_tool"]} instead. - Add _graph_is_built / _not_built_response / _get_store_for_read to tools/_common.py. "Built" means any node OR a build-marker metadata row (last_updated / last_build_type), so a legitimately empty-but-built graph is distinguished from a never-built one. A missing graph.db short-circuits before GraphStore would create one. - Route read tools (query, search, impact, stats, flows, communities, architecture, review/detect-changes/affected-flows, refactor preview, minimal-context, wiki, hub/bridge/gaps/surprising/suggested) through the guard. Build/update/postprocess/embed paths keep using _get_store and are never blocked. - not_built response is smaller than the old empty result and avoids the stray empty-db side-effect, preserving the token-savings invariant. - Update existing tests that patched _get_store on guarded modules. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Lower the front door without changing the engine. Add install.sh (macOS/Linux, POSIX sh, no bashisms) and install.ps1 (Windows) at repo root that: 1. detect uv and install it via the official Astral installer (https://astral.sh/uv/install.{sh,ps1}, echoed before running) if missing, 2. uv tool install code-review-graph (fallback pipx, then pip --user), 3. print next steps (install / build / status). Scripts are idempotent and fail-loud with helpful messages. Comments state honestly that this installs uv (a Python toolchain manager), not a bundled runtime. README gains a "Quick install (no Python setup)" one-liner at the top of Quick Start, above the pip instructions, with pip/uvx kept as alternatives. Add tests/test_install_scripts.py (sh -n syntax check, bashism guard, pinned-URL and content asserts; PowerShell parse when pwsh is available) and a CI script-lint job that validates both scripts' syntax. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Add reporter.median_token_reduction / median_token_reduction_table: the median of per-question (1 - graph_tokens / baseline_tokens) * 100 over the agent_baseline CSV rows (status=ok only), falling back to token_efficiency when no agent-baseline rows exist. Empty/failed runs report n/a rather than fabricating a number. eval.yml now leads its job summary with that headline median table (in addition to the full report) and already uploads the CSV artifact. The workflow yaml parses and the summary heredoc was validated end-to-end against both empty and populated results dirs. Add a static 'benchmarks: reproducible' shields badge to the README linking to the eval workflow runs. The badge asserts the pipeline is reproducible; the canonical numbers live in the README Benchmarks section and docs/REPRODUCING.md, never auto-committed from CI. Documents the local reproduction recipe in REPRODUCING.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Add opt-in payload guards to the two tool functions that could blow the token promise: - get_review_context / detect_changes_func gain a max_tokens parameter (default 0 = off at the function layer). When set, the lowest-risk source snippets / changed functions / flows / test gaps are dropped first (highest-risk and review_priorities kept longest) and an honest "omitted" note is added. Budgeting uses the existing chars/4 context_savings estimate; the Token Savings panel math is untouched. - query_graph gains a max_results cap (default 0 = off) so callers_of / callees_of on a hot symbol cannot return an unbounded payload; trims edges to the surviving set and reports truncated/total_results. Both default to no-op here; the MCP wrappers set the user-facing defaults. Backward compatible for all internal/CLI/eval callers. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Shrink the per-session and per-call token cost of the MCP surface. Lean tool mode (default): CRG registers 30 tools (~8k description tokens per turn). The server now loads a curated lean set of 7 tools by default (get_minimal_context, query_graph, semantic_search_nodes, detect_changes, get_review_context, get_impact_radius, get_affected_flows) — the smallest set covering every documented workflow. The full set stays one step away: `serve --tools all` / `CRG_TOOLS=all`, `--tools lean`, or a custom CSV; unknown names are ignored gracefully. A one-line stderr notice prints whenever tools are trimmed (stdout stays clean for JSON-RPC). Minimal-first MCP defaults: query_graph_tool, semantic_search_nodes_tool, get_impact_radius_tool and detect_changes_tool now default detail_level="minimal". get_review_context_tool stays "standard" but is capped by max_tokens (its content IS the context; the budget is the safe lever). A server-wide override (CRG_DETAIL_LEVEL env + `serve --detail`) can force every tool's verbosity and beats the per-call argument. Payload caps wired to the wrappers: query_graph_tool max_results=100, get_review_context_tool / detect_changes_tool max_tokens=6000. The underlying tool functions keep standard/0 defaults so CLI, eval and internal callers are unchanged. Docs (README tool-filtering, COMMANDS signatures + serve examples) updated. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ED_BY fix Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions · 2026-06-14T17:21:33Z

code-review-graph review

Overall risk: 0.80 (HIGH) — 232 changed function(s)/class(es), 0 affected flow(s), 131 test gap(s)

Risk-scored changes

Risk	Level	Symbol	Location	Tested
0.80	high	code_review_graph/doctor.py::latest_token_savings	code_review_graph/doctor.py:434	no
0.75	high	code_review_graph/cli.py::_handle_init	code_review_graph/cli.py:191	no
0.75	high	code_review_graph/eval/reporter.py::median_token_reduction	code_review_graph/eval/reporter.py:107	no
0.75	high	code_review_graph/main.py::query_graph_tool	code_review_graph/main.py:276	no
0.70	high	code_review_graph/eval/reporter.py::median_token_reduction_table	code_review_graph/eval/reporter.py:165	no
0.65	medium	code_review_graph/doctor.py::CheckResult	code_review_graph/doctor.py:43	no
0.65	medium	code_review_graph/enrich.py::_format_node_context	code_review_graph/enrich.py:98	no
0.65	medium	code_review_graph/main.py::_resolve_detail_level	code_review_graph/main.py:123	no
0.65	medium	code_review_graph/tools/review.py::_apply	code_review_graph/tools/review.py:636	no
0.65	medium	code_review_graph/tools/review.py::_fits	code_review_graph/tools/review.py:640	no

Test gaps

code_review_graph/changes.py::analyze_changes (code_review_graph/changes.py:277)
code_review_graph/cli.py::_print_banner (code_review_graph/cli.py:92)
code_review_graph/cli.py::_handle_init (code_review_graph/cli.py:191)
code_review_graph/cli.py::_handle_doctor (code_review_graph/cli.py:354)
code_review_graph/cli.py::main (code_review_graph/cli.py:365)
...and 126 more without direct tests

Token savings: this graph-backed report used ~383,975 fewer tokens (~92%) than reading every changed file in full (estimated, chars/4 approximation).

Powered by code-review-graph — local-first analysis; no code leaves the CI runner.

tirth8205 and others added 17 commits June 14, 2026 17:55

Merge branch 'moat/lean-tools-and-budget' into release/v2.4.0

56dfa9d

Merge branch 'trust/doctor-and-benchmark-ci' into release/v2.4.0

a511741

Merge branch 'adopt/frictionless-install' into release/v2.4.0

3bc2d49

Merge branch 'review/action-polish-and-positioning' into release/v2.4.0

21701a4

Merge branch 'review/tested-by-keystone' into release/v2.4.0

06e2a73

release: v2.4.0 — token-moat improvements, doctor, easy install, TEST…

8ac8cdc

…ED_BY fix Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

tirth8205 mentioned this pull request Jun 14, 2026

fix(graph): correct TESTED_BY edge direction in tests_for queries #527

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v2.4.0 — sharpen the token moat: lean tools, doctor, easy install, TESTED_BY fix#559

v2.4.0 — sharpen the token moat: lean tools, doctor, easy install, TESTED_BY fix#559
tirth8205 wants to merge 17 commits into
mainfrom
release/v2.4.0

tirth8205 commented Jun 14, 2026

Uh oh!

github-actions Bot commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tirth8205 commented Jun 14, 2026

Phase 1 — bigger token moat

Phase 2 — prove the savings

Phase 3 — easy install

Phase 4 — widen the review lead

Uh oh!

github-actions Bot commented Jun 14, 2026

code-review-graph review

Risk-scored changes

Test gaps

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant