v2.4.0 — sharpen the token moat: lean tools, doctor, easy install, TESTED_BY fix#559
Open
tirth8205 wants to merge 17 commits into
Open
v2.4.0 — sharpen the token moat: lean tools, doctor, easy install, TESTED_BY fix#559tirth8205 wants to merge 17 commits into
tirth8205 wants to merge 17 commits into
Conversation
The GitHub Action's PR comment rendered absolute CI-runner paths like /home/runner/work/repo/repo/code_review_graph/embeddings.py::get_provider, which is ugly and leaks the runner's directory layout. Add relativize_path(): strip the GITHUB_WORKSPACE prefix (set by Actions checkout), fall back to the doubled <repo>/<repo>/ segment derived from GITHUB_REPOSITORY when the env is absent, and otherwise leave the path untouched rather than guess-mangle it. The ::symbol suffix is preserved, and already-relative paths pass through unchanged. Wire it into the Symbol and Location columns and the test-gap list. Risk math and sticky-comment marker logic are untouched. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Update the FAQ comparison section with verified specifics: CRG ships risk-scored change review, execution-flow criticality, impact radius, a PR-review GitHub Action, dead-code, wiki, and hybrid keyword+vector search locally and free; note (factually, without name-calling) that some popular retrieval-focused graph tools are keyword/FTS-only today and position deeper PR-review intelligence as a forthcoming hosted/commercial product. Add a one-line wedge — "CRG focuses on review; retrieval tools focus on navigation; both are useful, often together" — and per-job guidance. Add one wedge sentence near the top of the README: review-native, token-efficient, local, free, no waitlist. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add a `doctor` subcommand (and code_review_graph/doctor.py) that runs a fast, read-only health checklist and prints ✓/✗ lines with next-step hints: 1. graph.db exists and has nodes (critical → run build) 2. freshness: stored git_head_sha vs current HEAD via a safe git subprocess (list args, stdin=DEVNULL, bounded timeout) — warns when stale 3. MCP config present for repo-local platforms (reuses skills.PLATFORMS) 4. serve command resolves (reuses skills._detect_serve_command) 5. MCP server import/boot smoke — confirms registered tool count > 0 6. platform-native / git hooks installed 7. embeddings present, or a note that semantic search falls back to keyword Exits non-zero only on critical failures; warnings never fail the exit code. When a graph exists, the closing line surfaces the latest detect-changes Token Savings number as a 'see your savings' proof. Generalizes the PASS/FAIL, exit-code-driven pattern of scripts/diagnose_pypi_connectivity.py without touching it. Wires the command into the banner, install Next-steps output, CLAUDE.md, docs/COMMANDS.md, and the README Quick Start. Adds tests/test_doctor.py (unbuilt repo flags missing graph; built repo is healthy; exit codes). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The parser canonically writes TESTED_BY edges as source=production, target=test (reads as "X is tested by Y"). Several read-side consumers traversed them in the inverted direction; the naming-convention fallback (test_<name> / Test<name>) silently masked the bug for conventionally named tests, so the suite stayed green while real coverage lookups were wrong. Flip every inverted consumer to read source=production / target=test: - graph.py::get_transitive_tests: the three TESTED_BY queries (direct, bare-name fallback, transitive) now select target_qualified where source_qualified = ?. - graph.py::load_flow_adjacency: track the production node (src) in has_tested_by so criticality scoring reflects real coverage. - changes.py: test-gap detection looks up a changed production function's tests by source. - tools/query.py: query_graph(pattern="tests_for") follows edges by source and returns the target test node. - enrich.py: symbol enrichment traverses TESTED_BY by source and reads the target test name. Already-correct readers (analysis.py, tools/build.py, tools/review.py) were left untouched. No DB migration is needed: stored edges were always canonical, only the read side was inverted. Regression tests use unconventional test names so the naming-convention fallback cannot mask future regressions, and a producer-side guard asserts the parser emits source=production / target=test (not merely that TESTED_BY edges exist) so a future parser flip cannot silently re-break it. Fixes #515 Co-Authored-By: Pratik Jagzap <pratikjagzap530@gmail.com> Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The producer-side parser test and the consumer-side fixture tests each exercise only one half of the TESTED_BY contract. Add a single test that parses a real production/test fixture pair through the parser, stores the emitted nodes and edges, and asserts get_transitive_tests surfaces the test as covering the production source. This couples the parser's canonical direction to the consumer query, so a future parser flip breaks this test even if every hand-seeded fixture test still passes. Refs #515 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
find_dead_code() inferred test coverage from *incoming* edges, but the parser writes TESTED_BY as source=production, target=test, so a tested production node is the *source* of its TESTED_BY edge. The inverted read (plus a bare-name fallback that searched TESTED_BY targets) never matched, so a function whose only reference was its test was wrongly flagged as dead code. Compute has_test_refs from outgoing TESTED_BY edges (by qualified name, class-qualified name, and a plausible-caller-filtered bare-name source fallback), aligning dead-code detection with the canonical direction. This consumer site was missed by #515's original sweep. Refs #515 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
First-run UX fix. Read tools used to silently create an empty graph.db and
return a misleading "Found 0 results" when the graph had never been built.
They now return {status: "not_built", message, next_tool_suggestions:
["build_or_update_graph_tool"]} instead.
- Add _graph_is_built / _not_built_response / _get_store_for_read to
tools/_common.py. "Built" means any node OR a build-marker metadata row
(last_updated / last_build_type), so a legitimately empty-but-built graph
is distinguished from a never-built one. A missing graph.db short-circuits
before GraphStore would create one.
- Route read tools (query, search, impact, stats, flows, communities,
architecture, review/detect-changes/affected-flows, refactor preview,
minimal-context, wiki, hub/bridge/gaps/surprising/suggested) through the
guard. Build/update/postprocess/embed paths keep using _get_store and are
never blocked.
- not_built response is smaller than the old empty result and avoids the
stray empty-db side-effect, preserving the token-savings invariant.
- Update existing tests that patched _get_store on guarded modules.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Lower the front door without changing the engine. Add install.sh (macOS/Linux,
POSIX sh, no bashisms) and install.ps1 (Windows) at repo root that:
1. detect uv and install it via the official Astral installer
(https://astral.sh/uv/install.{sh,ps1}, echoed before running) if missing,
2. uv tool install code-review-graph (fallback pipx, then pip --user),
3. print next steps (install / build / status).
Scripts are idempotent and fail-loud with helpful messages. Comments state
honestly that this installs uv (a Python toolchain manager), not a bundled
runtime. README gains a "Quick install (no Python setup)" one-liner at the top
of Quick Start, above the pip instructions, with pip/uvx kept as alternatives.
Add tests/test_install_scripts.py (sh -n syntax check, bashism guard, pinned-URL
and content asserts; PowerShell parse when pwsh is available) and a CI
script-lint job that validates both scripts' syntax.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add reporter.median_token_reduction / median_token_reduction_table: the median of per-question (1 - graph_tokens / baseline_tokens) * 100 over the agent_baseline CSV rows (status=ok only), falling back to token_efficiency when no agent-baseline rows exist. Empty/failed runs report n/a rather than fabricating a number. eval.yml now leads its job summary with that headline median table (in addition to the full report) and already uploads the CSV artifact. The workflow yaml parses and the summary heredoc was validated end-to-end against both empty and populated results dirs. Add a static 'benchmarks: reproducible' shields badge to the README linking to the eval workflow runs. The badge asserts the pipeline is reproducible; the canonical numbers live in the README Benchmarks section and docs/REPRODUCING.md, never auto-committed from CI. Documents the local reproduction recipe in REPRODUCING.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add opt-in payload guards to the two tool functions that could blow the token promise: - get_review_context / detect_changes_func gain a max_tokens parameter (default 0 = off at the function layer). When set, the lowest-risk source snippets / changed functions / flows / test gaps are dropped first (highest-risk and review_priorities kept longest) and an honest "omitted" note is added. Budgeting uses the existing chars/4 context_savings estimate; the Token Savings panel math is untouched. - query_graph gains a max_results cap (default 0 = off) so callers_of / callees_of on a hot symbol cannot return an unbounded payload; trims edges to the surviving set and reports truncated/total_results. Both default to no-op here; the MCP wrappers set the user-facing defaults. Backward compatible for all internal/CLI/eval callers. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Shrink the per-session and per-call token cost of the MCP surface. Lean tool mode (default): CRG registers 30 tools (~8k description tokens per turn). The server now loads a curated lean set of 7 tools by default (get_minimal_context, query_graph, semantic_search_nodes, detect_changes, get_review_context, get_impact_radius, get_affected_flows) — the smallest set covering every documented workflow. The full set stays one step away: `serve --tools all` / `CRG_TOOLS=all`, `--tools lean`, or a custom CSV; unknown names are ignored gracefully. A one-line stderr notice prints whenever tools are trimmed (stdout stays clean for JSON-RPC). Minimal-first MCP defaults: query_graph_tool, semantic_search_nodes_tool, get_impact_radius_tool and detect_changes_tool now default detail_level="minimal". get_review_context_tool stays "standard" but is capped by max_tokens (its content IS the context; the budget is the safe lever). A server-wide override (CRG_DETAIL_LEVEL env + `serve --detail`) can force every tool's verbosity and beats the per-call argument. Payload caps wired to the wrappers: query_graph_tool max_results=100, get_review_context_tool / detect_changes_tool max_tokens=6000. The underlying tool functions keep standard/0 defaults so CLI, eval and internal callers are unchanged. Docs (README tool-filtering, COMMANDS signatures + serve examples) updated. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ED_BY fix Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
code-review-graph reviewOverall risk: 0.80 (HIGH) — 232 changed function(s)/class(es), 0 affected flow(s), 131 test gap(s) Risk-scored changes
Test gaps
Token savings: this graph-backed report used ~383,975 fewer tokens (~92%) than reading every changed file in full (estimated, chars/4 approximation). Powered by code-review-graph — local-first analysis; no code leaves the CI runner. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Implements the 4-phase "sharpen the token moat" plan. Nothing here changes the design (local SQLite graph, Tree-sitter, MCP) — it makes default responses smaller, makes the savings provable, removes Python-install friction, and widens the review lead. 1474 tests pass; ruff + mypy clean.
Phase 1 — bigger token moat
serveexposes 7 curated tools instead of 30 (~8k fewer tokens/session, fewer agent mis-picks).--tools all/CRG_TOOLS=allrestores everything;--tools lean|<csv>available.get_review_context+detect_changestakemax_tokens(default 6000), keep highest-risk findings, report omissions — a review can no longer overflow the context window (the audit found responses up to 400k tokens).max_resultscap (100) on query_graph;CRG_DETAIL_LEVEL+serve --detailoverrides.Phase 2 — prove the savings
code-review-graph doctor: health checklist + your latest Token Savings number as proof it's working (answers "is it running?"). Verified live here: flags a stale graph, shows ~91% saved.Phase 3 — easy install
install.sh/install.ps1:curl … | shbootstrapsuvthen installs the CLI — no manual Python setup. pip/pipx/uvx stay as alternatives.sh -n+ PowerShell-parse lint job added.not_builtstatus instead of silently making an empty DB and returning "0 results".Phase 4 — widen the review lead
Co-Authored-By. Fixestests_for,get_transitive_tests, test-gap detection, flow criticality, enrichment, dead-code — and makes the Action's "Tested" column correct. No DB migration.Bumps to v2.4.0. Full "Behavior changes" list in CHANGELOG.md.
🤖 Generated with Claude Code