Add cloud auth and observability guidance by bechols · Pull Request #4351 · temporalio/documentation

bechols · 2026-03-26T23:10:57Z

Summary

Adds opinionated guidance for Cloud auth operating models and for combining Cloud metrics with SDK metrics in observability setups.

Why

The setup docs explain mechanics, but users still need clearer guidance on how to structure service accounts, rotate credentials, and combine Cloud-side and Worker-side signals in practice.

Changes

expands Cloud access control best practices with service-account and rotation guidance
adds cross-links from API key and service account setup docs
clarifies how to combine Cloud metrics, SDK metrics, and Worker health guidance

Validation

yarn build

┆Attachments: EDU-6119 Add cloud auth and observability guidance

vercel · 2026-03-26T23:11:02Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
temporal-documentation	Ready	Preview, Comment	Mar 26, 2026 11:28pm

github-actions · 2026-03-26T23:11:23Z

📖 Docs PR preview links

Best Practices
- Managing Cloud access control
Cloud
- Manage API keys
- Manage service accounts
- Metrics
  - Openmetrics
    - Metrics integrations
- Monitor worker health

bechols · 2026-03-26T23:15:54Z

docs/cloud/metrics/openmetrics/metrics-integrations.mdx


 Datadog provides a serverless integration with the OpenMetrics endpoint. This integration will scrape metrics, store them in Datadog, and provides a default dashboard with some built in monitors. See the [integration page](https://docs.datadoghq.com/integrations/temporal-cloud-openmetrics/) for more details.

+For Datadog users, treat this integration as the Cloud-side half of your observability setup:


@dustin-temporal please review

bechols · 2026-03-26T23:16:21Z

docs/cloud/metrics/index.mdx

 Cloud metrics monitor Temporal behavior.
 When used together, Temporal Cloud and SDK metrics measure the health and performance of your full Temporal infrastructure, including the Temporal Cloud Service and user-supplied Temporal Workers.

+Use the following rule of thumb when deciding which signal to rely on:


@dustin-temporal please review

bechols · 2026-03-26T23:16:31Z

docs/cloud/worker-health.mdx

 - [How to detect misconfigured Workers](#detect-misconfigured-workers)
 - [How to configure Sticky cache](#configure-sticky-cache)

+This page assumes you are monitoring both Worker-side SDK metrics and Cloud-side metrics. Use SDK metrics to understand


@dustin-temporal please review

dustin-temporal · 2026-03-27T13:47:42Z

docs/cloud/metrics/openmetrics/metrics-integrations.mdx

+For Datadog users, treat this integration as the Cloud-side half of your observability setup:
+
+- Use OpenMetrics in Datadog to monitor Temporal Cloud behavior such as Task Queue backlog, poll success, and rate limiting.
+- Use SDK metrics from your Workers to monitor saturation, Schedule-To-Start latency, slot availability, and sticky cache behavior.


Suggested change

- Use SDK metrics from your Workers to monitor saturation, Schedule-To-Start latency, slot availability, and sticky cache behavior.

- Use a Datadog agent to collect [SDK metrics](/cloud/metrics/sdk-metrics-setup) from your Workers to monitor saturation, Schedule-To-Start latency, slot availability, and sticky cache behavior.

dustin-temporal · 2026-03-27T13:49:38Z

docs/cloud/metrics/openmetrics/metrics-integrations.mdx

+
+- Use OpenMetrics in Datadog to monitor Temporal Cloud behavior such as Task Queue backlog, poll success, and rate limiting.
+- Use SDK metrics from your Workers to monitor saturation, Schedule-To-Start latency, slot availability, and sticky cache behavior.
+- Use tracing separately when you need execution-path debugging through your application and Activity code.


Suggested change

- Use tracing separately when you need execution-path debugging through your application and Activity code.

We don't have a good Datadog tracing integration, so I think this is misleading

Datadog supports ingesting OTLP traces directly to their backend. It is just in private preview. Private documentation here: https://docs.datadoghq.com/opentelemetry/setup/otlp_ingest/traces/?tab=javascript

We can have a quick chat with them to get them "whitelist" Temporal for trace ingestion (they just need to add HTTP header for Temporal, very quick thing). And here we can just say, contact Datadog Opentelemetry Team for instructions of ingesting Trace to Datadog.

(the Trace endpoint has been available for more than 1 year, the blocker for DD to announce public availability is the pricing. I heard they are looking to announce Datadog being an OTel native backend this year DASH. This very very very much likely will be part of that announcement)

dustin-temporal · 2026-03-27T13:56:29Z

docs/cloud/metrics/index.mdx


+Use the following rule of thumb when deciding which signal to rely on:
+
+| Question | Primary signal |


This reads like it's for an LLM to reference, but if that's what we're going for then I'm good with it.

dustin-temporal · 2026-03-27T14:06:13Z

docs/cloud/worker-health.mdx


+This page assumes you are monitoring both Worker-side SDK metrics and Cloud-side metrics. Use SDK metrics to understand
+what your Workers are doing, and Cloud metrics to understand what Temporal Cloud is seeing at the Task Queue and service
+level. For an overview of how these signals fit together, see [Temporal Cloud observability and metrics](/cloud/metrics).


Suggested change

level. For an overview of how these signals fit together, see [Temporal Cloud observability and metrics](/cloud/metrics).

level. For an overview of how these signals fit together, see [Temporal Cloud metrics](/cloud/metrics).

LutaoX · 2026-03-27T14:58:53Z