feat(kiloclaw): add Cloudflare Analytics Engine instrumentation #1311

Open
pandemicsyn wants to merge 4 commits into main from florian/chore/telemetry

pandemicsyn commented Mar 20, 2026

Summary

Add Cloudflare Analytics Engine instrumentation to KiloClaw across HTTP routes, DO lifecycle, and reconciliation paths, then refactor reconcile telemetry to use a unified ReconcileContext that dual-writes console reconcile logs and AE events.

  • HTTP telemetry: global timingMiddleware; instrumented() wrappers for /api/admin/* and /api/kiloclaw/*; platform middleware emits request events and now records validated query/body-derived user context.
  • DO lifecycle telemetry: emitEvent() in KiloClawInstance records provision/start/stop/destroy lifecycle events with duration/value metrics.
  • Reconcile telemetry: reconcile call sites now emit consistently via rctx.log(...) with reconcile.{action} naming and common state-derived dimensions; includes duration/error/value fields where applicable.
  • Hardening: reconcile analytics error serialization is now guarded so unserializable error payloads cannot break best-effort analytics ([unserializable error] fallback).
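
The hardening bullet can be illustrated with a small sketch. This is not the PR's code: `safeSerializeError` is a hypothetical name, and the branching is an assumption about what "guarded" serialization might look like. Only the `[unserializable error]` fallback string comes from the description above.

```typescript
// Hypothetical helper; the PR's actual function name and logic are not shown here.
const UNSERIALIZABLE = "[unserializable error]";

function safeSerializeError(err: unknown): string {
  try {
    if (err instanceof Error) return err.message;
    if (typeof err === "string") return err;
    const json: unknown = JSON.stringify(err);
    // JSON.stringify yields undefined for functions, symbols, and undefined.
    return typeof json === "string" ? json : UNSERIALIZABLE;
  } catch {
    // Circular structures, BigInt values, or throwing getters end up here,
    // so a bad payload can never break the best-effort analytics write.
    return UNSERIALIZABLE;
  }
}
```

The point of the guard is that the caller never needs its own try/catch around serialization: whatever is thrown, a string comes back.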

Verification

  • pnpm typecheck (in kiloclaw) — pass
  • pnpm test (in kiloclaw) — pass (42 files / 936 tests)
  • Pre-push hook (repo root) — pass:
    • pnpm format:check
    • pnpm lint (monorepo)
    • pnpm typecheck (monorepo)

Visual Changes

N/A

Reviewer Notes

  • Blob layout is centralized in src/utils/analytics.ts and shared by HTTP, DO, and reconcile emitters.
  • Platform middleware skips non-error events for routes without user context (for example version metadata endpoints) to reduce low-signal telemetry noise.
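
As a rough sketch of the skip rule in the second note, the decision could be reduced to a single predicate. `RequestOutcome` and `shouldEmit` are hypothetical names, and treating "error" as a thrown error or a 5xx status is an assumption, not something the PR states.

```typescript
// Hypothetical shape of what the platform middleware knows after a request.
interface RequestOutcome {
  userId?: string; // present only when a validated user id was recorded
  status: number;
  error?: string;
}

// Assumption: an "error event" is a thrown error or a 5xx response.
function shouldEmit(outcome: RequestOutcome): boolean {
  const isError = outcome.error !== undefined || outcome.status >= 500;
  // Errors are always recorded; successes only when tied to a user.
  return isError || outcome.userId !== undefined;
}
```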

pandemicsyn added 2 commits March 19, 2026 21:30
Add analytics tracking across three layers: HTTP routes, DO lifecycle
events, and reconciliation corrective actions. Uses a single dataset
(kiloclaw_events) with 13 blobs, 2 doubles, and 1 index.
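
To make the "single dataset, fixed layout" idea concrete, here is a shortened sketch of a centralized layout. It uses 4 blobs instead of the PR's 13, and every field name, the choice of `userId` as the index, and the `writeEvent` signature are assumptions for illustration. `writeDataPoint({ blobs, doubles, indexes })` is the Analytics Engine binding call; the local `Dataset` interface is only a stand-in for the Workers type.

```typescript
// Minimal local stand-in for the Workers Analytics Engine binding type.
interface DataPoint {
  blobs: (string | null)[];
  doubles: number[];
  indexes: string[];
}
interface Dataset {
  writeDataPoint(point: DataPoint): void;
}

// Hypothetical, shortened layout. Position in this array is the blob slot,
// so every emitter that goes through toDataPoint() agrees on the ordering.
const BLOB_FIELDS = ["event", "userId", "route", "error"] as const;
type BlobField = (typeof BLOB_FIELDS)[number];

interface AnalyticsEvent extends Partial<Record<BlobField, string>> {
  event: string;
  durationMs?: number; // assumed to map to the first double
  value?: number; // assumed to map to the second double
}

function toDataPoint(e: AnalyticsEvent): DataPoint {
  return {
    blobs: BLOB_FIELDS.map((field) => e[field] ?? ""),
    doubles: [e.durationMs ?? 0, e.value ?? 0],
    indexes: [e.userId ?? ""], // assumption: the single index is the user id
  };
}

// Best-effort write: a missing binding or a throwing write is ignored.
function writeEvent(dataset: Dataset | undefined, e: AnalyticsEvent): void {
  try {
    dataset?.writeDataPoint(toDataPoint(e));
  } catch {
    // Analytics must never affect the caller.
  }
}
```

Keeping the slot order in one array means a query written against blob positions stays valid as long as fields are only appended.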

HTTP layer: timingMiddleware + per-route instrumented() wrappers for
admin/kiloclaw routes, Hono middleware for platform routes. parseBody()
sets userId on context so the middleware captures it for POST/PATCH.
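
A framework-free sketch of what a per-route `instrumented()` wrapper could look like. The real one wraps Hono handlers; the generic handler type, the `HttpEvent` shape, the injectable clock, and mapping thrown errors to status 500 are all assumptions. The 200-character truncation mirrors the limit mentioned later in this PR.

```typescript
// Hypothetical event shape; the PR's real fields live in src/utils/analytics.ts.
interface HttpEvent {
  event: string;
  status: number;
  durationMs: number;
  error?: string;
}

type Emit = (e: HttpEvent) => void;

// Wraps a handler so each call emits one event, whether it resolves or throws.
function instrumented<C, R extends { status: number }>(
  name: string,
  handler: (c: C) => Promise<R>,
  emit: Emit,
  now: () => number = () => Date.now(), // injectable clock, handy in tests
): (c: C) => Promise<R> {
  const safeEmit = (e: HttpEvent): void => {
    try {
      emit(e);
    } catch {
      // Analytics is best-effort: never let it change the response.
    }
  };
  return async (c: C): Promise<R> => {
    const start = now();
    try {
      const res = await handler(c);
      safeEmit({ event: name, status: res.status, durationMs: now() - start });
      return res;
    } catch (err) {
      const message = err instanceof Error ? err.message : "unknown error";
      safeEmit({
        event: name,
        status: 500, // assumption: thrown errors are reported as 500
        durationMs: now() - start,
        error: message.slice(0, 200),
      });
      throw err; // rethrow so normal error handling still runs
    }
  };
}
```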

DO lifecycle: emitEvent() helper with Omit<> pattern. Tracks provision
(with duration), start (with startup time), stop/destroy (with machine
uptime via value double).
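
The "Omit<> pattern" presumably means callers pass only the fields that vary while the instance fills in identity fields from its own state. A minimal sketch under that assumption follows; `InstanceSketch`, `sandboxId`, and the event name used below are hypothetical, not taken from `KiloClawInstance`.

```typescript
interface LifecycleEvent {
  event: string;
  userId: string;
  sandboxId: string; // hypothetical identity field
  durationMs?: number;
  value?: number;
}

// Callers never supply identity fields; the type system enforces that.
type CallerFields = Omit<LifecycleEvent, "userId" | "sandboxId">;

class InstanceSketch {
  private readonly identity: { userId: string; sandboxId: string };
  private readonly sink: (e: LifecycleEvent) => void;

  constructor(
    identity: { userId: string; sandboxId: string },
    sink: (e: LifecycleEvent) => void,
  ) {
    this.identity = identity;
    this.sink = sink;
  }

  emitEvent(fields: CallerFields): void {
    try {
      // Identity is spread last so it always comes from instance state.
      this.sink({ ...fields, ...this.identity });
    } catch {
      // Best-effort: lifecycle operations must not fail on analytics.
    }
  }
}
```

A stop handler would then call something like `instance.emitEvent({ event: "lifecycle.stop", value: uptimeSeconds })`, with the uptime carried in the value double as described above.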

Reconciliation: events for status drift, volume repair, metadata
recovery, API key refresh, stale provision destroy, bound machine
recovery, and destroy finalization. Only emitted on corrective actions.

Add performance.now() timing to four reconcile corrective action events:
- reconcile.api_key_refreshed: mint + Fly update + push flow
- reconcile.metadata_recovery: listMachines + candidate selection + persist
- reconcile.volume_repaired: ensureVolume replacement flow
- reconcile.bound_machine_recovery: getVolume + state persistence
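
The timing described in this commit can be sketched as a small helper around `performance.now()`. `timed` and `repairVolume` are hypothetical names, and the `ensureVolume` and `log` signatures are assumptions; only the event name and the use of `performance.now()` come from the commit message.

```typescript
// Measures how long an async step takes; the clock is injectable for tests.
async function timed<T>(
  fn: () => Promise<T>,
  now: () => number = () => performance.now(),
): Promise<{ result: T; durationMs: number }> {
  const start = now();
  const result = await fn();
  return { result, durationMs: now() - start };
}

// Hypothetical call site: the event is emitted only after the corrective
// action has actually run, with the measured duration attached.
async function repairVolume(
  ensureVolume: () => Promise<string>,
  log: (event: string, fields: { durationMs: number }) => void,
): Promise<string> {
  const { result: volumeId, durationMs } = await timed(ensureVolume);
  log("reconcile.volume_repaired", { durationMs });
  return volumeId;
}
```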

pandemicsyn marked this pull request as ready for review March 20, 2026 13:57

kilo-code-bot bot commented Mar 20, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Files Reviewed (13 files)
  • kiloclaw/src/durable-objects/kiloclaw-instance/index.ts
  • kiloclaw/src/durable-objects/kiloclaw-instance/log.ts
  • kiloclaw/src/durable-objects/kiloclaw-instance/postgres.ts
  • kiloclaw/src/durable-objects/kiloclaw-instance/reconcile.ts
  • kiloclaw/src/index.ts
  • kiloclaw/src/middleware/analytics.ts
  • kiloclaw/src/routes/api.ts
  • kiloclaw/src/routes/kiloclaw.ts
  • kiloclaw/src/routes/platform.ts
  • kiloclaw/src/types.ts
  • kiloclaw/src/utils/analytics.ts
  • kiloclaw/worker-configuration.d.ts
  • kiloclaw/wrangler.jsonc

Reviewed by gpt-5.4-20260305 · 195,735 tokens

Introduce ReconcileContext that dual-writes every reconcileLog call to
both console JSON and Cloudflare Analytics Engine. This replaces the
previous approach of selectively instrumenting ~10 reconcile events
with full coverage of all ~43 events.

Key changes:
- Add ReconcileContext type and createReconcileContext() factory in log.ts
- Replace reason+env parameter threading with single rctx across all
  reconcile functions
- Remove all manual writeEvent calls from reconcile.ts
- Validate userId via setValidatedQueryUserId() in platform GET/DELETE
  routes instead of reading raw query params in analytics middleware
- Skip analytics for non-user-scoped platform routes (e.g. /versions)
- Truncate error messages to 200 chars in analytics middleware
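
A minimal sketch of the dual-write idea behind `ReconcileContext`, assuming a factory that closes over the reason, the instance state, and an analytics writer. The parameter list, the `ReconcileEvent` shape, and the injectable `writeLine` are assumptions; the real `createReconcileContext()` in log.ts may look quite different.

```typescript
// Hypothetical event shape with a few state-derived dimensions.
interface ReconcileEvent {
  event: string;
  reason: string;
  userId: string;
  durationMs?: number;
  error?: string;
  value?: number;
}

interface ReconcileContext {
  log(action: string, fields?: { durationMs?: number; error?: string; value?: number }): void;
}

function createReconcileContext(
  reason: string,
  state: { userId: string },
  writeEvent: (e: ReconcileEvent) => void,
  writeLine: (line: string) => void = (line) => console.log(line),
): ReconcileContext {
  return {
    log(action, fields = {}) {
      const event: ReconcileEvent = {
        event: `reconcile.${action}`,
        reason,
        userId: state.userId, // read at log time so state changes are reflected
        ...fields,
      };
      // 1) Console JSON line, as before.
      writeLine(JSON.stringify(event));
      // 2) Analytics Engine write, guarded so it stays best-effort.
      try {
        writeEvent(event);
      } catch {
        // Ignore analytics failures; the console log has already been written.
      }
    },
  };
}
```

With this shape, reconcile functions take a single `rctx` argument and call `rctx.log("volume_repaired", { durationMs })` instead of threading `reason` and `env` separately.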