
feat(parser): Markdown and YAML knowledge extractors #539

Open
otalrapha wants to merge 1 commit into tirth8205:main from otalrapha:feat/markdown-yaml-knowledge-extractors


@otalrapha


What

Adds two regex-based extractors so the graph can index knowledge files that have no bundled tree-sitter grammar:

  • Markdown (.md/.mdx/.qmd): headings → Section nodes, nesting → CONTAINS edges, links (backtick filenames, [[wikilinks]], [text](x.md)) → REFERENCES edges. Fenced code blocks are skipped.
  • YAML (.yaml/.yml): top-level keys → Section nodes, registry-style id:/name: list entries → Type nodes, with CONTAINS edges. Pure regex, no PyYAML dependency.
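As a rough illustration of the Markdown rules listed above, here is a minimal sketch and not the code in this PR. `Node` and `Edge` are simplified stand-ins for the project's NodeInfo/EdgeInfo, and the regexes, the `extract_markdown` name, and the choice to attach links to the nearest enclosing heading are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


# Simplified stand-ins (assumption): the real NodeInfo/EdgeInfo have their own fields.
@dataclass(frozen=True)
class Node:
    kind: str
    name: str
    line: int


@dataclass(frozen=True)
class Edge:
    kind: str
    source: str
    target: str


HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*#*\s*$")
FENCE_RE = re.compile(r"^\s*(`{3,}|~{3,})")
# One pattern per link style named in the description (patterns are assumptions).
LINK_RES = (
    re.compile(r"`([\w./-]+\.\w+)`"),             # backtick filename
    re.compile(r"\[\[([^\]|#]+)"),                # [[wikilink]]
    re.compile(r"\[[^\]]*\]\(([^)\s#]+\.mdx?)"),  # [text](x.md)
)


def extract_markdown(text, file_name):
    """Return (nodes, edges) for one Markdown document."""
    nodes, edges = [], []
    stack = []  # (level, name) of the headings currently open
    in_fence = False
    for lineno, line in enumerate(text.splitlines(), start=1):
        if FENCE_RE.match(line):
            in_fence = not in_fence  # naive toggle; ignores fence length and type
            continue
        if in_fence:
            continue  # fenced code is skipped entirely
        heading = HEADING_RE.match(line)
        if heading:
            level, name = len(heading.group(1)), heading.group(2)
            # Close every open section at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            parent = stack[-1][1] if stack else file_name
            nodes.append(Node("Section", name, lineno))
            edges.append(Edge("CONTAINS", parent, name))
            stack.append((level, name))
            continue
        source = stack[-1][1] if stack else file_name
        for pattern in LINK_RES:
            for target in pattern.findall(line):
                edges.append(Edge("REFERENCES", source, target.strip()))
    return nodes, edges
```

The heading stack is what turns nesting into CONTAINS edges: a level-2 heading is attached to the most recent level-1 heading still open, and a top-level heading is attached to the file itself.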

Why

Both reuse the existing NodeInfo/EdgeInfo schema and the shared CONTAINS/REFERENCES edge kinds, so documentation and config land in the same graph as code: a spec.md can carry a REFERENCES edge to the source file it describes, and a CRON-123 registry entry becomes a queryable node. This helps repos that mix code with specs, docs and config registries.
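To make the CRON-123 case concrete, the following is a hedged sketch of how a regex-only pass might turn a YAML registry into nodes and edges. It is not the PR's implementation: the `extract_yaml` name, both regexes, and the plain-tuple node and edge shapes are assumptions chosen for illustration.

```python
import re

# Assumed patterns: an unindented "key:" is a top-level key, and a list item
# that starts with "id:" or "name:" is a registry entry.
TOP_KEY_RE = re.compile(r"^([A-Za-z_][\w.-]*):")
ENTRY_RE = re.compile(
    r"^\s*-\s+(?:id|name):\s*['\"]?([^'\"#\n]+?)['\"]?\s*(?:#.*)?$"
)


def extract_yaml(text, file_name):
    """Return (nodes, edges) as plain tuples standing in for NodeInfo/EdgeInfo."""
    nodes = []  # (kind, name)
    edges = []  # (kind, source, target)
    current = None  # most recent top-level key
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # blank line or comment
        top = TOP_KEY_RE.match(line)
        if top:
            current = top.group(1)
            nodes.append(("Section", current))
            edges.append(("CONTAINS", file_name, current))
            continue
        entry = ENTRY_RE.match(line)
        if entry:
            name = entry.group(1).strip()
            nodes.append(("Type", name))
            # Entries belong to the enclosing top-level key, else to the file.
            edges.append(("CONTAINS", current or file_name, name))
    return nodes, edges


registry = """\
jobs:
  - id: CRON-123
    schedule: daily
  - name: nightly-backup  # weekly in staging
settings:
  retries: 3
"""
nodes, edges = extract_yaml(registry, "registry.yaml")
```

A line-oriented pass like this avoids a PyYAML dependency at the cost of ignoring anchors, flow-style mappings and multi-document files, which is usually an acceptable trade for indexing purposes.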

Approach

Follows the existing pattern for grammar-less inputs (_parse_rescript, _parse_sql): extension mapping in EXTENSION_TO_LANGUAGE, dispatch in parse_bytes, dedicated _parse_markdown / _parse_yaml methods.
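The wiring described above might look roughly like the sketch below. Only the names EXTENSION_TO_LANGUAGE, parse_bytes, _parse_markdown and _parse_yaml come from this PR; the `SketchParser` class, the "markdown"/"yaml" language tags, the method signatures and the placeholder return values are assumptions, and the real parser handles many more languages.

```python
from pathlib import Path

# Hypothetical subset of the mapping; the language tag strings are assumed.
EXTENSION_TO_LANGUAGE = {
    ".md": "markdown",
    ".mdx": "markdown",
    ".qmd": "markdown",
    ".yaml": "yaml",
    ".yml": "yaml",
}


class SketchParser:
    """Illustrative stand-in for the project's parser class."""

    def parse_bytes(self, path, source):
        language = EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())
        # Grammar-less languages go to a dedicated regex-based method.
        handlers = {
            "markdown": self._parse_markdown,
            "yaml": self._parse_yaml,
        }
        handler = handlers.get(language)
        if handler is None:
            return [], []  # unknown extension: nothing to index in this sketch
        return handler(path, source.decode("utf-8", errors="replace"))

    def _parse_markdown(self, path, text):
        # Placeholder: a real extractor would return Section nodes and edges.
        return [("markdown", path)], []

    def _parse_yaml(self, path, text):
        # Placeholder: a real extractor would return Section/Type nodes and edges.
        return [("yaml", path)], []
```

Keeping the extension table separate from the dispatch means a new grammar-less format needs one mapping entry, one handler entry and one `_parse_*` method.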

Tests

tests/test_knowledge_extractors.py adds 4 cases (headings/nesting, fenced-code skip, references, YAML entries). ruff check passes cleanly and pytest is green locally.

Adds regex-based extractors for two formats that have no bundled tree-sitter
grammar (same approach already used for ReScript and SQL CREATE PROCEDURE):

- Markdown (.md/.mdx/.qmd): each heading becomes a Section node; heading nesting
  produces CONTAINS edges; backtick filenames / [[wikilinks]] / [text](x.md)
  links produce REFERENCES edges. Fenced code blocks are skipped.
- YAML (.yaml/.yml): top-level keys become Section nodes and registry-style
  list entries (id:/name:) become Type nodes, with CONTAINS edges. Pure regex,
  no PyYAML dependency.

Because both reuse the existing NodeInfo/EdgeInfo schema and the shared
CONTAINS/REFERENCES edge kinds, documentation and config now live in the SAME
graph as code — e.g. a spec.md REFERENCES the source file it describes, and a
CRON-123 registry entry is a queryable node.

Tests: tests/test_knowledge_extractors.py (4 cases — headings/nesting, code
fence skip, references, yaml entries). ruff clean, pytest green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>