Add telemetry instrumentation for azd tool first-run and operations#8261
Add telemetry instrumentation for azd tool first-run and operations#8261hemarina wants to merge 7 commits into
Conversation
…na/tool-telemetry-8168
There was a problem hiding this comment.
Pull request overview
Adds OpenTelemetry usage-attribute instrumentation for the azd tool command group and its first-run onboarding middleware, enabling measurement of first-run funnel behavior and tool operation outcomes while avoiding raw error strings in tool-specific telemetry.
Changes:
- Introduces new
tool.*telemetry field constants and documents the emitted schema. - Instruments
tool install/upgrade/check/showand the tool first-run middleware to emit usage attributes on the command span. - Extends telemetry coverage tests and adds unit tests for new tool telemetry helpers/paths.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| cli/azd/internal/tracing/fields/fields.go | Adds new tool.* attribute key constants for first-run and tool operations telemetry. |
| cli/azd/docs/tracing-in-azd.md | Documents the new tool telemetry schema and coverage expectations. |
| cli/azd/cmd/tool.go | Emits tool operation telemetry attributes and adds helpers to aggregate install/upgrade results. |
| cli/azd/cmd/tool_test.go | Adds unit tests validating aggregate tool install/upgrade telemetry emission behavior. |
| cli/azd/cmd/telemetry_coverage_test.go | Registers tool subcommands for telemetry coverage and asserts new field constants. |
| cli/azd/cmd/middleware/tool_first_run.go | Emits first-run funnel telemetry attributes (skip reason, opt-in, selection, install outcomes). |
| cli/azd/cmd/middleware/tool_first_run_test.go | Adds tests validating first-run skip-reason emission and silent skip paths. |
| cli/azd/.vscode/cspell.yaml | Adds spelling dictionary entries for newly introduced telemetry terms. |
Comments suppressed due to low confidence (1)
cli/azd/cmd/tool.go:1063
tool.idis set from the raw CLI arg beforeFindToolvalidates it. If the arg is invalid, this can emit arbitrary user input into telemetry and inflate cardinality. Move theSetUsageAttributes(fields.ToolIdKey...)call to afterFindToolsucceeds (or set it fromtoolDef.Id).
toolID := a.args[0]
tracing.SetUsageAttributes(fields.ToolIdKey.String(toolID))
toolDef, err := a.manager.FindTool(toolID)
if err != nil {
return nil, fmt.Errorf("finding tool: %w", err)
…na/tool-telemetry-8168
wbreza
left a comment
There was a problem hiding this comment.
Thanks for this — it's a thoughtful telemetry pass, especially the tool.install.* vs tool.firstrun.install_* namespacing to dodge last-write-wins on a single span. That's the kind of detail experienced telemetry engineers regularly miss. The privacy posture (no error strings, no paths, classification declared on every key) and the telemetry_coverage_test.go round-trip assertions also stand out. 👍
The findings below are organized by severity, with code shapes wherever a fix isn't obvious. I deliberately skipped the things Copilot flagged that you've already addressed (raw-arg leak, empty-string tools_selected_names, the doc table syntax) — those are good as-is, and you're correct to push back on the || table comments.
🔴 Critical — blocks the merge as currently coded
C1. emitToolInstallTelemetry records "0 failures" when the batch call itself fails
cli/azd/cmd/tool.go (install action ~L514, upgrade action ~L770)
installResults, rawResults, _ := runToolOperation(ctx, tools, operationFn, "Installing", "install", a.console)
emitToolInstallTelemetry(rawResults, time.Since(start))runToolOperation returns (results, opErr). When manager.InstallTools returns (nil, err) — network failure, unsupported platform, permission denied — rawResults is nil and you emit:
tool.install.success_count = 0
tool.install.failure_count = 0
tool.install.failed_ids = (unset)
A user who runs azd tool install az-cli docker and gets a hard error produces a span that's indistinguishable from a no-op. The whole point of #8168 ("success rate") breaks on exactly the cases you most want to see.
The third return is right there — please capture it and synthesize failures when the batch infrastructure itself failed:
start := time.Now()
installResults, rawResults, opErr := runToolOperation(ctx, tools, operationFn, "Installing", "install", a.console)
emitToolInstallTelemetry(rawResults, time.Since(start), opErr, tools)func emitToolInstallTelemetry(
results []*tool.InstallResult, elapsed time.Duration, opErr error, requested []*tool.ToolDefinition,
) {
successCount, failureCount := 0, 0
failedIDs := make([]string, 0, len(requested))
if opErr != nil && len(results) == 0 {
// Operation infrastructure failure — every requested tool effectively failed.
// Maintain the invariant: success_count + failure_count == len(requested).
failureCount = len(requested)
for _, t := range requested {
failedIDs = append(failedIDs, t.Id)
}
} else {
for _, r := range results { /* existing loop */ }
}
...
}Alternatively (cleaner long-term): add a tool.install.outcome string enum ("ok", "partial", "operation_error") — that's how cmd/login.go and cmd/deploy.go model the same shape (auth.outcome, deploy.outcome).
🟠 High — please address before merge
H1. Comma-joined ID lists are unbounded-cardinality (your own doc forbids this)
Files & lines:
cli/azd/cmd/tool.go~L509 (tool.idsinstall), ~L758 (tool.idsupgrade), ~L665 (tool.install.failed_idsinemitToolInstallTelemetry)cli/azd/cmd/middleware/tool_first_run.go~L238/241 (tool.firstrun.tools_selected_names/_deselected_names), ~L289 (tool.firstrun.install_failed_ids)
Each individual ID is bounded (~12 built-ins today), but strings.Join(ids, ",") preserves user-supplied order, so the value space is permutations of subsets — Σ P(N,k), which is in the tens of thousands for 12 tools and millions for 20. cli/azd/docs/tracing-in-azd.md (the very doc this PR edits) requires attributes to avoid "high-cardinality values," and OTel/AppInsights typically warns above a few hundred distinct values per key. The cost is twofold: indexing/storage bloat downstream, and defeated faceting (every install becomes its own bucket).
One-line fix at each site: sort before joining. The IDs come from a fixed manifest, so this is safe and the analytical signal ("which tools were selected together") is preserved while collapsing N! into 2^N:
sort.Strings(resolvedIDs)
tracing.SetUsageAttributes(fields.ToolIdsKey.String(strings.Join(resolvedIDs, ",")))Apply identically to failedIDs, selectedIDs, deselectedIDs. Also update the Notes column in tracing-in-azd.md to say "comma-separated sorted built-in tool IDs" so the invariant is documented next to the schema.
For background on why this matters in OTel, see the OpenTelemetry Sampling docs on attribute cardinality — and the existing pattern in azd: pkg/extensions/manager.go sorts extension IDs before emitting extension.ids for exactly this reason.
H2. tool.upgrade.from_version is never emitted on the most common upgrade path
cli/azd/cmd/tool.go ~L736-749
fromVersions := make(map[string]string)
if len(a.args) > 0 {
for _, id := range a.args {
toolDef, findErr := a.manager.FindTool(id)
...
toolsToUpgrade = append(toolsToUpgrade, toolDef)
// ← no fromVersions write here
}
} else {
for _, s := range statuses { /* populates fromVersions */ }
}fromVersions only gets populated on the no-args (auto-detect) branch. The canonical single-target case — azd tool upgrade az-cli — never writes to the map, so tool.upgrade.from_version is always unset there. The PR description and tracing-in-azd.md both advertise this attribute as emitted on "Single-target upgrade," so either:
- Preferred: in the explicit branch, fetch the current status for each ID (using whatever detection method already lives in the upgrade auto-detect path) and populate
fromVersions[toolDef.Id] = status.InstalledVersion. The signal is the same on both branches. - Cheaper: narrow the doc to "auto-detect path only" — but that loses signal exactly where you most need it (the user knew what they were upgrading).
H3. tool.upgrade.to_version is emitted even when the upgrade failed
cli/azd/cmd/tool.go ~L783-786
if r.InstalledVersion != "" {
singleAttrs = append(singleAttrs, fields.ToolUpgradeToVersionKey.String(r.InstalledVersion))
}InstallResult.InstalledVersion is "the version detected after installation" (per its doc in pkg/tool/installer.go). On failure that's typically either the unchanged old version or empty, and an analyst seeing from=1.0.0 to=1.0.0 success=false can't tell whether the version actually didn't move or the installer just didn't re-detect.
if r.Success && r.InstalledVersion != "" {
singleAttrs = append(singleAttrs, fields.ToolUpgradeToVersionKey.String(r.InstalledVersion))
}(from_version is fine to emit regardless — it's a pre-state.)
H4. tool.firstrun.completed is overloaded and partially misdocumented
cli/azd/cmd/middleware/tool_first_run.go L190 (opt-in declined) and L259 (success path); cli/azd/internal/tracing/fields/fields.go ~L486
// opt-in declined:
tracing.SetUsageAttributes(
fields.ToolFirstRunOptInKey.Bool(false),
fields.ToolFirstRunCompletedKey.Bool(true), // ← always true
)
m.markCompleted()
// success path: same Bool(true)Three smaller things converging:
- The key is
Boolbut only evertrue. Dashboards filteringtool.firstrun.completed = falsewill silently return zero rows forever. - It's now set both when the user declined opt-in and when the flow succeeded. Most readers will see
completed=true ∧ optin=falseas a contradiction. - The multi-select error path and
installSpinner.Runerror path still fall through tomarkCompleted+completed=true, despite "failed during install" not really being "completed."
Two clean options — pick one:
- Option A (minimal): rename to
tool.firstrun.persisted(always true oncemarkCompletedruns). Honest about what it actually measures: "we wrote the flag to user config so we won't ask again." - Option B (cleaner long-term): replace with
tool.firstrun.outcomestring enum:"completed" | "declined" | "cancelled" | "detect_failed" | "install_failed". One low-cardinality field, mutually exclusive withskip_reason, self-documenting in dashboards.
Either way: update the field comment in fields.go and the Notes column in tracing-in-azd.md to match the chosen semantics.
H5. tool.id and tool.ids co-emitted for single-tool installs (docs say mutually exclusive)
cli/azd/cmd/tool.go ~L509-521
tracing.SetUsageAttributes(fields.ToolIdsKey.String(strings.Join(resolvedIDs, ",")))
...
if len(rawResults) == 1 {
tracing.SetUsageAttributes(singleResultCommonAttrs(rawResults[0])...) // appends tool.id
}So azd tool install az-cli produces both tool.id="az-cli" and tool.ids="az-cli". tracing-in-azd.md describes them as mutually exclusive ("Single-target" vs "Batch"). Anyone writing a downstream query against tool.ids will double-count single-tool installs.
if len(resolvedIDs) > 1 {
tracing.SetUsageAttributes(fields.ToolIdsKey.String(strings.Join(resolvedIDs, ",")))
}🟡 Medium
M1. tool.dry_run is emitted before arg validation
cli/azd/cmd/tool.go ~L484: tool.dry_run is set immediately, but tool.ids (correctly) waits until after FindTool resolves the arguments. If a user runs azd tool install bogus-name, the span gets tool.dry_run=false with no tool.ids — analytically muddled. Move the emit to alongside tool.ids after the FindTool loop so the contract is uniform ("we emit usage attrs once we've confirmed we're operating on built-ins").
M2. First-run install error path emits install_duration_ms without counts
cli/azd/cmd/middleware/tool_first_run.go ~L371-376: on installSpinner.Run error you emit duration but not install_success_count / install_failure_count / install_failed_ids (even though results may contain partial data). Downstream consumers seeing duration > 0 with no counts will treat the record as malformed. Either also emit zero/partial counts on the error path, or move the duration emit into the same installAttrs block after success.
M3. Aggregate install loop is duplicated between command and middleware
cli/azd/cmd/tool.go:625-655 (emitToolInstallTelemetry) vs cli/azd/cmd/middleware/tool_first_run.go:369-391. Same successCount/failureCount/failedIDs aggregation, only different field keys. The two sites already differ subtly today and will drift on the next change. A small parametric helper:
type installTelemetryKeys struct {
SuccessCount, FailureCount, DurationMs, FailedIDs fields.AttributeKey
}
func emitInstallAggregate(results []*tool.InstallResult, elapsed time.Duration, k installTelemetryKeys) { ... }Then both call sites collapse to one line each.
M4. Tests mutate global tracing baggage with sentinel-survives-the-call assertions
cli/azd/cmd/tool_test.go (TestEmitToolInstallTelemetry_*) and cli/azd/cmd/middleware/tool_first_run_test.go: the __sentinel__ pattern is clever but tracing baggage is process-global. These tests (a) leak state to subsequent tests in the package, (b) can't be t.Parallel(), and (c) tightly couple to baggage internals. Worth adding a tracing.ResetUsageAttributesForTest(t *testing.T) helper that calls t.Cleanup to restore prior state, then dropping the sentinel idiom. Belt-and-braces is fine, but it shouldn't be the only safeguard.
M5. Behavioral test coverage for the emitted attributes is sparse
Of ~25 newly emitted attributes, only 7 have a behavioral test that exercises the production emit path (TestEmitToolInstallTelemetry_* covers success/failure counts/duration/failed_ids; coverage test covers key strings only). Notably uncovered: every tool.firstrun.tools_* and install_* attribute, plus tool.id/tool.ids/tool.dry_run/tool.install.strategy/tool.install.success/tool.upgrade.from_version/tool.upgrade.to_version/tool.check.updates_available.
The whole point of this PR is the signal — if a future refactor silently stops emitting tools_offered, nothing breaks today. One right-sized fix: a single TestToolFirstRunMiddleware_FullFlow end-to-end test that walks opt-in=true → detection → offer → select N → install and asserts each of the ~12 first-run attributes is present on the span (one test, ~12 attributes covered).
Also: the PR description claims a test for "all-failure batch" but the diff has _FailureWithNilTool (which deliberately tests the opposite — that failed_ids is not overwritten when Tool is nil). Add the missing all-failure case (or update the PR description).
M6. PR description / doc / code mismatches worth tightening
-
Privacy claim: "Only built-in tool IDs ... and semver version strings."
tool.upgrade.from_version/to_versionare not validated against a semver regex — they're "whatevertool.VersionRegexcaptures" perpkg/tool/detector.go. Today's manifests are well-behaved, butpkg/tool/manifest.gohas@azure/mcpwith a prerelease pattern(\d+\.\d+\.\d+(?:-[^"]*)?)—-[^"]*allows arbitrary suffixes up to a quote. Defense-in-depth: sanitize at the emit site:var semverRe = regexp.MustCompile(`^\d+\.\d+\.\d+(?:-[A-Za-z0-9.]+)?$`) func sanitizeVersion(v string) string { if semverRe.MatchString(v) { return v } return "non-semver" }
-
Per-tool duration dropped: issue #8168 lists
tool.install.duration_msper-tool; you emit it as a batch wall-clock aggregate. That's defensible, but please add a bullet to the "Design notes / deviations" section explaining the intentional collapse so future readers know it wasn't an oversight. -
failed_idsshorter thanfailure_count: when a failure has a nilTool, the count goes up but no ID is appended (your_FailureWithNilTooltest proves this). Add one sentence to the docs' Notes column: "failed_idsmay be shorter thanfailure_countwhen a failure has no resolved tool reference."
🟢 Nits / smaller polish (take or leave)
- N1
tool.firstrun.optin→tool.firstrun.opt_infor consistency withskip_reason,tools_detected,from_version, etc. Side benefit: drops theoptinentry from.vscode/cspell.yaml. - N2
cli/azd/cmd/middleware/tool_first_run.goimport block:go.opentelemetry.io/otel/attributeis currently grouped with azd internals — should be in the external block pergoimportsdefaults and AGENTS.md import ordering.mage preflightshould fix it. - N3
failedIDs := make([]string, 0)→make([]string, 0, len(results)). You know the upper bound — saves reallocations. - N4
singleResultCommonAttrs(r *tool.InstallResult)dereferencesr.Successwithout nil-checking. Today's callers gate onlen(rawResults) == 1, but the helper itself should be either value-receiver (r tool.InstallResult) or guardnil— small footgun otherwise. - N5
runToolOperationreturns three values, all sites now write_ , _ , _ := …— a named result struct (toolOpOutcome { Items, Results, Err }) would self-document and avoid future-proofing churn. - N6
tool.firstrun.skip_reasonvalues include bothno_promptandnon_interactive. Worth a one-line field-comment clarification on whether they're analytically distinct or whether they should collapse. - N7 Doc table style: new tables are 4-column (
Attribute | Type | Emitted when | Notes); existing### Extension Attributesand### Hook Attributesuse 3-column (Attribute | Description | Example). Not wrong, but worth either folding examples into Notes or matching the existing column shape. Also### First-run experience(sentence case) vs### Extension Attributes(Title Case) is a tiny style nit. - N8 Add a comment near the first-run install emit site noting that
tool.id/tool.ids/tool.dry_runare deliberately not emitted from this middleware to avoid clobbering the subsequent subcommand's writes — that's a load-bearing design choice and future contributors should know. - Doc table syntax (Copilot lines 229 & 248): you're right, the bot misread. Feel free to resolve those threads with a one-liner pointing at the actual rendered bytes — the tables are well-formed.
✅ Things that are great and worth keeping
- Namespace separation (
tool.firstrun.install_*vstool.install.*) to defeat last-write-wins on the same span — this is the kind of design detail experienced telemetry engineers regularly miss, and you called it out explicitly in the PR description. 👏 - Privacy posture — using
failed_ids(low-cardinality IDs) instead of raw error strings, and explicitly deferring error attribution to the global error middleware, is the right boundary. Classification+Purpose+IsMeasurementare set uniformly on every new field. That's the kind of consistency that compounds.- Telemetry coverage test asserts both key string and value type round-trip — catches a whole class of refactor regressions.
- Field doc comments explain when the field is emitted, not just what it means. Exactly what future query authors need.
- Out-of-scope honored — no leaks into
pkg/tool/installer.go,detector.go, orupdate_checker.goas promised. - Engagement with prior feedback — the
resolvedIDs/toolDef.Iddiscipline post-Copilot's first round was the right response.
CI
At time of review, gh pr checks 8261 shows 0 failing / 19 pending (the earlier azure-dev - cli + CodeCoverage_Upload failures appear to have been a transient run that's been restarted). Recommend monitoring the pending run; if CodeCoverage_Upload re-fails it's almost certainly infra/flake, but a re-failed main build would be worth gh run view --log-failed follow-up.
Great work overall. C1 is the only finding that I'd consider a hard blocker — once that's addressed and the H-series are at least decided on (fix vs. document-the-deviation), this lands solidly. Happy to dig into any of the suggested code shapes if they're unclear.
Azure Dev CLI Install InstructionsInstall scriptsMacOS/Linux
bash: pwsh: WindowsPowerShell install MSI install Standalone Binary
MSI
Documentationlearn.microsoft.com documentationtitle: Azure Developer CLI reference
|
Closes #8168.
Summary
Instruments the
azd toolfeature so we can answer the questions in the issue: adoption of the first-run experience, tool preferences, skip rate, install success rate, and command engagement outside first-run.All attributes are attached as usage attributes via
tracing.SetUsageAttributes, so they land on the user''s actual command span (e.g.cmd.tool.install) rather than on a separate child span — matching existing azd telemetry conventions.What''s emitted
First-run experience (
cmd/middleware/tool_first_run.go)tool.firstrun.skip_reasonenv_var,no_prompt,ci_cd,non_interactive,already_completed,config_error)tool.firstrun.optintrue= accepted)tool.firstrun.tools_detectedtool.firstrun.tools_offeredtool.firstrun.tools_selectedtool.firstrun.tools_selected_namestool.firstrun.tools_deselected_namestool.firstrun.completedtool.firstRunCompletedwas persisted to user configtool.firstrun.install_success_counttool.firstrun.install_failure_counttool.firstrun.install_failed_idstool.firstrun.install_duration_msTool commands (
cmd/tool.go—install,upgrade,check,show)tool.idtool.idstool.dry_runinstall/upgradewith--dry-runtool.install.strategytool.install.successtool.install.success_count/tool.install.failure_counttool.install.failed_idstool.install.duration_mstool.upgrade.from_version/tool.upgrade.to_versiontool.check.updates_availabletool checkDesign notes / deviations from the issue''s proposed schema
The "Proposed Telemetry Events" table in #8168 was a sketch; the implementation refines a few things:
tool.id/tool.idsinstead of separatetool.install.tool_idandtool.upgrade.tool_id. The command name already disambiguates; one set of keys keeps the schema small.tool.install.countintosuccess_count+failure_count(+failed_ids). Lets backends compute success rate directly without joining events.tool.install.errorstrings in favor oftool.install.failed_ids. Error strings risk leaking paths, usernames, or proxy URLs; the global error middleware still captures error class on the same span.tool.firstrun.optinbool instead of two separateoptin_accepted/optin_declinedattributes. Same signal, half the schema.tool.firstrun.install_*so that when a user''s very firstazdcommand isazd tool install <x>, the first-run signal is not clobbered by the command-leveltool.install.*write on the same span (usage attributes are last-write-wins).tool.dry_runsince bothinstallandupgradeaccept--dry-runand the flag materially changes what the command did.Privacy
Only built-in tool IDs (e.g.
az-cli,docker) and semver version strings are captured. No file paths, no user-identifiable data, no raw error text.Tests
cmd/middleware/tool_first_run_test.go—TestToolFirstRunMiddleware_EmitsSkipReason(env-var / no-prompt / ci_cd / non-interactive / already-completed) +TestToolFirstRunMiddleware_NoSkipReasonForSilentPaths.cmd/tool_test.go— 4 cases foremitToolInstallTelemetry(all-success batch, all-failure batch, mixed, empty result set).cmd/telemetry_coverage_test.go—tool,tool list,tool check,tool install,tool show,tool upgraderegistered; newToolFieldssubtest asserts every newAttributeKeyresolves to its expectedtool.*string.Docs
docs/tracing-in-azd.md— new "Tool Command Telemetry" section.Out of scope
pkg/tool/installer.go,pkg/tool/detector.go,pkg/tool/update_checker.go— instrumentation lives at the command/middleware layer (where decisions are made) rather than inside the orchestration types. That matches how other azd domains are instrumented.