feat(cursor): ingest sessions from state.vscdb #413

Open
wesm wants to merge 8 commits into main from badliveware/cursor-compatability

Conversation

@wesm
Owner

@wesm wesm commented Apr 29, 2026

Summary

Adds Cursor support backed by Cursor's global state.vscdb SQLite database in addition to the existing JSONL transcript ingestion. The vscdb source is the canonical one for live Cursor sessions; the parser reconstructs message threads from cursorDiskKV bubble headers and bubbles, persists tool calls with their results, and resolves project names from the adjacent workspaceStorage/ directories.

When both vscdb and JSONL data exist for the same session, vscdb wins. SyncAll, SyncPaths, and the explicit per-session resync path all consult the stored file_path so a JSONL change does not overwrite the richer vscdb messages, and a state.vscdb#<id> virtual path now resyncs through syncSingleCursorVscdb when no JSONL fallback exists.

Other notable changes:

  • Default state.vscdb location is platform-aware (Library/Application Support/Cursor/... on macOS, AppData/Roaming/Cursor/... on Windows, .config/Cursor/... on Linux), overridable via CURSOR_STATE_DB.
  • SQLite paths are opened through file: URIs so reserved characters (?, #, %, spaces) in the filesystem path don't get parsed as DSN parameters.
  • extractWorkspaceProject decodes file:// workspace folder URLs with Windows drive-letter and UNC handling.
  • vscdb change detection now reparses sessions when the stored data_version falls behind the parser's current version, matching the file-sync invariant.

Origin and follow-ups

Rebased from @BadLiveware's original branch; review iterations addressed platform handling, dedup completeness across non-SyncAll paths, DSN safety, and data-version reparsing.

This still needs validation by Cursor users:

  • Live updates from vscdb currently depend on the scheduled SyncAll, which runs every 15 minutes. The file watcher observes only agent dirs; adding CursorStateDB to the watch set so that vscdb writes propagate immediately is a known follow-up.
  • The schema work in internal/parser/cursor_vscdb.go was developed against synthetic fixtures, so real-world Cursor schemas across versions still need to be verified against actual installs. The goal is for it to be a drop-in replacement for the standalone vscdb parser in ~/code/tkmx-client.

BadLiveware and others added 7 commits April 28, 2026 20:45
…cation

Track per-session success/failure when writing cursor vscdb sessions
in SyncAll, mirroring the existing OpenCode pattern, so excluded and
preserved sessions are counted correctly and write errors are reflected
in stats rather than silently dropped.

Also restore the "mcp-" prefix classification in NormalizeToolCategory
that vscdb uses for MCP tool invocations (e.g. mcp-github,
mcp-linear-search), which was lost during the rebase resolution.
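
A minimal sketch of the prefix rule described above. `toolCategory` is a hypothetical stand-in for NormalizeToolCategory, and the labels "MCP" and "Other" as well as the tool name `read_file` are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// toolCategory classifies any tool whose name starts with "mcp-" as an
// MCP invocation. The returned labels are assumed, not taken from the PR.
func toolCategory(name string) string {
	if strings.HasPrefix(name, "mcp-") {
		return "MCP"
	}
	return "Other"
}

func main() {
	for _, n := range []string{"mcp-github", "mcp-linear-search", "read_file"} {
		fmt.Printf("%s -> %s\n", n, toolCategory(n))
	}
}
```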
…ool results

Address three medium-severity findings from the branch review:

* Default state.vscdb path is now platform-aware (macOS/Linux/Windows)
  via parser.DefaultCursorStateDBPath. Previously the Linux-only
  default silently disabled vscdb sync on macOS and Windows installs
  unless CURSOR_STATE_DB was set.

* processCursor now consults the stored session file_path when the
  in-memory cursorVscdbSynced set is empty (SyncPaths,
  SyncSingleSession, watcher events). Without this, JSONL transcript
  writes could overwrite the richer vscdb messages, tool calls, and
  the virtual file_path that syncCursorVscdb uses for change
  detection. Adds IsCursorVscdbVirtualPath helper plus a
  SyncSingleSession+SyncPaths regression test.

* buildCursorVscdbMessages now persists ToolFormerData.Result as a
  ParsedToolResult on the assistant message, matching how the JSONL
  parser surfaces tool result content. ContentRaw stores the raw
  JSON; ContentLength reflects the decoded textual length so search
  and analytics see comparable numbers.
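
The dedup rule in the second finding can be sketched as follows. This is an illustration under assumptions: `isVscdbVirtualPath` and `shouldSkipJSONL` are hypothetical names (the PR's helper is IsCursorVscdbVirtualPath, whose body is not shown), and the virtual path is assumed to end in `state.vscdb#<id>`.

```go
package main

import (
	"fmt"
	"strings"
)

const vscdbName = "state.vscdb"

// isVscdbVirtualPath reports whether p looks like "<dir>/state.vscdb#<id>"
// (or a bare "state.vscdb#<id>") with a non-empty session ID.
func isVscdbVirtualPath(p string) bool {
	i := strings.LastIndex(p, "#")
	if i < 0 || i == len(p)-1 {
		return false // no separator, or empty session ID
	}
	file := p[:i]
	if j := strings.LastIndexAny(file, `/\`); j >= 0 {
		file = file[j+1:]
	}
	return file == vscdbName
}

// shouldSkipJSONL decides whether a JSONL transcript write should be
// ignored because the session is already owned by the vscdb source.
// syncedThisRun models the in-memory set, which is empty outside SyncAll.
func shouldSkipJSONL(syncedThisRun map[string]bool, sessionID, storedPath string) bool {
	if syncedThisRun[sessionID] {
		return true
	}
	return isVscdbVirtualPath(storedPath)
}

func main() {
	fmt.Println(shouldSkipJSONL(nil, "s1", "/data/state.vscdb#s1")) // stored path decides
	fmt.Println(shouldSkipJSONL(nil, "s2", "/data/s2.jsonl"))
}
```

Splitting at the last `#` matters because the directory portion of the path may itself contain that character.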
…okup

extractWorkspaceProject was using url.Parse(...).Path verbatim, which
loses Windows drive-letter and UNC semantics — file:///C:/repo would
become /C:/repo and file://host/share would become /share. Now that
the platform-aware default enables vscdb on Windows, these URLs need
proper conversion.

Add fileURLToPath helper that:
- Strips the leading slash before a Windows drive letter
- Reconstructs UNC paths from u.Host on Windows
- Decodes percent-encoded path segments via url.PathUnescape
- Passes non-file:// inputs through unchanged

Add fileURLToPath unit tests covering POSIX paths, percent-encoded
spaces, and Unicode. Windows-specific drive and UNC behavior can only
be exercised when the tests run on Windows, so those assertions are
skipped on other platforms.
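
The conversion rules listed above can be sketched like this. It is an illustration, not the PR's fileURLToPath: the target OS is passed as a `goos` parameter (an assumption, made so both branches can be exercised on any platform), and it relies on `url.Parse` having already percent-decoded `u.Path` instead of calling `url.PathUnescape` separately.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

func isDriveLetter(c byte) bool {
	return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
}

// fileURLToPath converts a file:// URL to a filesystem path for goos.
// Inputs that are not file:// URLs are returned unchanged.
func fileURLToPath(raw, goos string) string {
	u, err := url.Parse(raw)
	if err != nil || u.Scheme != "file" {
		return raw
	}
	p := u.Path // already percent-decoded by url.Parse
	if goos != "windows" {
		return p
	}
	// "/C:/repo" -> "C:/repo": drop the slash in front of a drive letter.
	if len(p) >= 3 && p[0] == '/' && isDriveLetter(p[1]) && p[2] == ':' {
		p = p[1:]
	}
	p = strings.ReplaceAll(p, "/", `\`)
	if u.Host != "" {
		// file://host/share -> \\host\share (UNC path).
		return `\\` + u.Host + p
	}
	return p
}

func main() {
	fmt.Println(fileURLToPath("file:///C:/repo", "windows"))
	fmt.Println(fileURLToPath("file://host/share", "windows"))
	fmt.Println(fileURLToPath("file:///home/me/my%20repo", "linux"))
}
```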

openCursorVscdb and extractWorkspaceComposerIDs were concatenating
the database path with "?mode=ro&..." DSN parameters. The mattn
sqlite3 driver splits at the first '?' to separate path from query
params, so any Cursor path containing '?', '#', or '%' (legal on
macOS/Linux and increasingly common given the new platform-aware
defaults) would fail to open the intended database. The risk is
real because the feature itself addresses sessions as
"state.vscdb#<sessionID>".

Build the DSN through cursorVscdbDSN, which serializes a
url.URL{Scheme:"file"} so reserved path characters are properly
percent-encoded. Forward slashes are forced via filepath.ToSlash so
Windows drive letters serialize correctly under the file URI.

Drop _journal_mode=WAL from the read-only DSN: with mode=ro now
correctly honored under URI mode, the WAL switch fails on non-WAL
databases. Cursor already runs its global state.vscdb in WAL mode,
and SQLite reads WAL journals fine without an explicit pragma.

Add tests for cursorVscdbDSN encoding and an end-to-end open test
that uses a directory whose name contains both '?' and '#'.

syncCursorVscdb was treating a session as unchanged whenever
the stored file_mtime matched the vscdb meta. Without a
data_version check, a Cursor session ingested under one
parser version stayed frozen across an agentsview upgrade
until Cursor itself wrote a new lastUpdatedAt — likely never
for archived sessions.

Mirror the file-sync invariant: include
GetDataVersionByPath(virtualPath) >= CurrentDataVersion in the
unchanged condition. Add an integration test that sets the
stored data_version one below current and asserts the next
SyncAll reparses the session.
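
The unchanged condition can be sketched as a pure predicate. The struct, the function name, and the version number below are assumptions for illustration; in the PR the stored version comes from GetDataVersionByPath and is compared against CurrentDataVersion.

```go
package main

import "fmt"

// currentDataVersion is a hypothetical value for this sketch.
const currentDataVersion = 2

// storedSession holds what the database remembers about a session.
type storedSession struct {
	FileMtime   int64
	DataVersion int
}

// isUnchanged reports whether a session can be skipped: the stored mtime
// must match the vscdb meta AND the stored data version must not be
// older than the parser's current version.
func isUnchanged(s storedSession, metaMtime int64) bool {
	return s.FileMtime == metaMtime && s.DataVersion >= currentDataVersion
}

func main() {
	archived := storedSession{FileMtime: 1000, DataVersion: currentDataVersion - 1}
	fmt.Println(isUnchanged(archived, 1000)) // same mtime, stale version: reparse
}
```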
…essions

When a Cursor session was ingested only from state.vscdb (no
JSONL fallback) the stored file_path is a virtual path of the form
state.vscdb#<id>. SyncSingleSession went through FindSourceFile,
which os.Stat'd the virtual path, fell back to scanning agent
dirs, found nothing, and returned "source file not found" — so
the /sessions/:id/resync UI flow simply broke for vscdb-only
sessions.

Add a syncSingleCursorVscdb helper that re-parses a single
session directly from CursorStateDB, mirroring the
syncSingleOpenCode pattern for OpenCode SQLite virtual paths,
and dispatch to it in SyncSingleSession when the stored source
is a vscdb virtual path. Includes a regression test that resyncs
a vscdb-only session before and after a vscdb mutation.
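
The dispatch described above might look roughly like this. All names are hypothetical: `splitVirtualPath` and `resync` stand in for logic inside SyncSingleSession, and the two callbacks stand in for syncSingleCursorVscdb and the existing file-based path.

```go
package main

import (
	"fmt"
	"strings"
)

// splitVirtualPath splits "<db path>#<id>" at the last '#'.
func splitVirtualPath(p string) (dbPath, id string, ok bool) {
	i := strings.LastIndex(p, "#")
	if i <= 0 || i == len(p)-1 {
		return "", "", false
	}
	return p[:i], p[i+1:], true
}

// resyncFunc re-parses a session identified by its argument.
type resyncFunc func(arg string) error

// resync routes vscdb virtual paths to fromVscdb (called with the
// session ID) and everything else to fromFile (called with the path).
func resync(storedPath string, fromVscdb, fromFile resyncFunc) error {
	if db, id, ok := splitVirtualPath(storedPath); ok && strings.HasSuffix(db, "state.vscdb") {
		return fromVscdb(id)
	}
	return fromFile(storedPath)
}

func main() {
	viaVscdb := func(id string) error { fmt.Println("reparse from vscdb:", id); return nil }
	viaFile := func(p string) error { fmt.Println("reparse from file:", p); return nil }
	_ = resync("state.vscdb#abc123", viaVscdb, viaFile)
	_ = resync("/tmp/transcript.jsonl", viaVscdb, viaFile)
}
```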
@roborev-ci

roborev-ci Bot commented Apr 29, 2026

roborev: Combined Review (2ff84d3)

Summary verdict: two Medium issues remain around Cursor vscdb parent/child relationship syncing; no Critical or High findings were reported.

Medium

  • internal/sync/engine.go:1840: syncCursorVscdb skips unchanged child sessions based only on the child session’s own lastUpdatedAt, but the parent/child relationship comes from the parent session’s subComposerIds. If a parent later adds or removes a child while the child row itself is unchanged, the child is marked synced and never rewritten, leaving ParentSessionID / RelationshipType stale.

    • Fix: Include relationship changes in vscdb change detection, or apply parent/child relationship updates for all metas even when child content is unchanged.
  • internal/sync/engine.go:4298: syncSingleCursorVscdb reparses and writes a single vscdb session without reconstructing the childToParent mapping used by syncCursorVscdb. Explicitly resyncing a child session will write it without ParentSessionID and RelationshipType, likely clearing an existing subagent relationship.

    • Fix: Build the parent mapping from metas in syncSingleCursorVscdb and set sess.ParentSessionID / sess.RelationshipType before writeSessionFull, matching the full-sync path.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@wesm
Owner Author

wesm commented Apr 29, 2026

@srosro do you know some cursor users that could help test this?

Move the Cursor state.vscdb path off the top-level Config and
EngineConfig structs and into the existing per-agent
AgentDirs[AgentCursor] slot. The cursor-specific path no longer
needs cursor-specific plumbing through the engine and command
entry points; it lives alongside the legacy JSONL transcripts
root in the same dirs slice and is selected by basename match.

The agent registry's DefaultDirs lists all three platform
defaults (macOS, Windows, Linux); FindCursorVscdb stats them and
returns the one that exists. CURSOR_STATE_DB still works as an
override — it filters any state.vscdb-named entries out of
AgentDirs[AgentCursor] and appends the user-supplied path.

Discovery and classifyOnePath skip vscdb-named entries so the
JSONL transcripts walker doesn't try to descend into a SQLite
file. The sync engine reaches the vscdb via FindCursorVscdb on
demand instead of caching it in a struct field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@roborev-ci

roborev-ci Bot commented Apr 29, 2026

roborev: Combined Review (fcb2564)

Verdict: One Medium issue found; no High or Critical findings reported.

Medium

  • internal/sync/engine.go:1840 - syncCursorVscdb skips unchanged child sessions based only on the child session's own lastUpdatedAt / data version, but parent-child relationships are derived from the parent session's SubComposerIDs. If Cursor adds or changes a child relationship on the parent while the child content is unchanged, the child row is never rewritten, so parent_session_id / relationship_type can remain stale.

    Suggested fix: Compare the computed parent mapping against the stored child session relationship and either update those relationship columns independently or force that child into the changed set when the relationship differs.
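
The second option in the suggested fix (forcing a child into the changed set when its relationship differs) could be sketched as below. The function name and the map-based inputs are assumptions; an empty string stands for "no parent".

```go
package main

import "fmt"

// staleChildren returns, in input order, the session IDs whose stored
// parent differs from the parent computed from the current vscdb metas.
// A missing map entry reads as "", meaning the session has no parent.
func staleChildren(ids []string, storedParent, computedParent map[string]string) []string {
	var stale []string
	for _, id := range ids {
		if storedParent[id] != computedParent[id] {
			stale = append(stale, id)
		}
	}
	return stale
}

func main() {
	stored := map[string]string{"c1": "p1"}
	computed := map[string]string{"c1": "p1", "c2": "p1"} // parent gained c2
	fmt.Println(staleChildren([]string{"c1", "c2"}, stored, computed))
}
```

The same comparison also catches a removed relationship, since a child that is no longer listed by any parent computes to an empty parent ID.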


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)
