
Add cloud auth and observability guidance #4351

Open
bechols wants to merge 2 commits into main from docs/cloud-auth-and-observability-guidance

Conversation

@bechols bechols commented Mar 26, 2026

Summary

Adds opinionated guidance for Cloud auth operating models and for combining Cloud metrics with SDK metrics in observability setups.

Why

The setup docs explain mechanics, but users still need clearer guidance on how to structure service accounts, rotate credentials, and combine Cloud-side and Worker-side signals in practice.

Changes

  • expands Cloud access control best practices with service-account and rotation guidance
  • adds cross-links from API key and service account setup docs
  • clarifies how to combine Cloud metrics, SDK metrics, and Worker health guidance

Validation

  • yarn build

Attachments: EDU-6119 Add cloud auth and observability guidance

@bechols bechols requested a review from a team as a code owner March 26, 2026 23:10

vercel bot commented Mar 26, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| temporal-documentation | Ready | Preview, Comment | Mar 26, 2026 11:28pm |



github-actions bot commented Mar 26, 2026

📖 Docs PR preview links


Datadog provides a serverless integration with the OpenMetrics endpoint. This integration scrapes metrics, stores them in Datadog, and provides a default dashboard with built-in monitors. See the [integration page](https://docs.datadoghq.com/integrations/temporal-cloud-openmetrics/) for more details.

For Datadog users, treat this integration as the Cloud-side half of your observability setup:
bechols (author):

@dustin-temporal please review

Cloud metrics monitor Temporal behavior.
When used together, Temporal Cloud and SDK metrics measure the health and performance of your full Temporal infrastructure, including the Temporal Cloud Service and user-supplied Temporal Workers.

Use the following rule of thumb when deciding which signal to rely on:
bechols (author):

@dustin-temporal please review

- [How to detect misconfigured Workers](#detect-misconfigured-workers)
- [How to configure Sticky cache](#configure-sticky-cache)

This page assumes you are monitoring both Worker-side SDK metrics and Cloud-side metrics. Use SDK metrics to understand
bechols (author):

@dustin-temporal please review

For Datadog users, treat this integration as the Cloud-side half of your observability setup:

- Use OpenMetrics in Datadog to monitor Temporal Cloud behavior such as Task Queue backlog, poll success, and rate limiting.
- Use SDK metrics from your Workers to monitor saturation, Schedule-To-Start latency, slot availability, and sticky cache behavior.

Suggested change
- Use SDK metrics from your Workers to monitor saturation, Schedule-To-Start latency, slot availability, and sticky cache behavior.
- Use a Datadog agent to collect [SDK metrics](/cloud/metrics/sdk-metrics-setup) from your Workers to monitor saturation, Schedule-To-Start latency, slot availability, and sticky cache behavior.
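As a concrete illustration of the Worker-side half discussed above: a Datadog agent (or any Prometheus-compatible scraper) collects SDK metrics by reading a plain-text scrape endpoint exposed by the Worker process. The sketch below is a stdlib-only stand-in for such an endpoint, not the Temporal SDK's own telemetry setup; the metric names echo the SDK's naming (e.g. `temporal_worker_task_slots_available`) but the values and exact name forms here are fabricated for illustration.

```python
# Minimal stand-in for a Worker-side Prometheus-format scrape target.
# A real setup would enable the Temporal SDK's metrics exporter instead;
# metric names and values below are illustrative only.
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

METRICS = (
    "# TYPE temporal_worker_task_slots_available gauge\n"
    'temporal_worker_task_slots_available{worker_type="WorkflowWorker"} 98\n'
    "# TYPE temporal_activity_schedule_to_start_latency_seconds histogram\n"
    "temporal_activity_schedule_to_start_latency_seconds_count 42\n"
)

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the exposition-format payload a scraper would consume.
        body = METRICS.encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Keep the demo quiet.
        pass

def serve_metrics(port: int = 0) -> HTTPServer:
    """Start the scrape target on a background thread; returns the server."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Pointing an agent's scrape config at this endpoint yields the Worker-side series (saturation, Schedule-To-Start latency, slot availability), which then sit alongside the Cloud-side OpenMetrics data in the same Datadog account.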


- Use OpenMetrics in Datadog to monitor Temporal Cloud behavior such as Task Queue backlog, poll success, and rate limiting.
- Use SDK metrics from your Workers to monitor saturation, Schedule-To-Start latency, slot availability, and sticky cache behavior.
- Use tracing separately when you need execution-path debugging through your application and Activity code.

Suggested change
- Use tracing separately when you need execution-path debugging through your application and Activity code.

We don't have a good Datadog tracing integration, so I think this is misleading

LutaoX commented Mar 27, 2026:

Datadog supports ingesting OTLP traces directly into their backend; it's just in private preview. Private documentation here: https://docs.datadoghq.com/opentelemetry/setup/otlp_ingest/traces/?tab=javascript

We can have a quick chat with them to get them to "whitelist" Temporal for trace ingestion (they just need to add an HTTP header for Temporal, a very quick thing). And here we can just say: contact the Datadog OpenTelemetry team for instructions on ingesting traces into Datadog.

(The trace endpoint has been available for more than a year; the blocker for DD announcing public availability is pricing. I heard they are looking to announce Datadog as an OTel-native backend at this year's DASH, and this will very likely be part of that announcement.)


Use the following rule of thumb when deciding which signal to rely on:

| Question | Primary signal |

This reads like it's for an LLM to reference, but if that's what we're going for then I'm good with it.


This page assumes you are monitoring both Worker-side SDK metrics and Cloud-side metrics. Use SDK metrics to understand
what your Workers are doing, and Cloud metrics to understand what Temporal Cloud is seeing at the Task Queue and service
level. For an overview of how these signals fit together, see [Temporal Cloud observability and metrics](/cloud/metrics).

Suggested change
level. For an overview of how these signals fit together, see [Temporal Cloud observability and metrics](/cloud/metrics).
level. For an overview of how these signals fit together, see [Temporal Cloud metrics](/cloud/metrics).

- Automation (e.g. Temporal Cloud [Operations API](https://docs.temporal.io/ops), [Terraform provider](https://docs.temporal.io/cloud/terraform-provider), [Temporal CLI](https://docs.temporal.io/cli/setup-cli))

By default, it is recommended for teams to use API keys and [service accounts](https://docs.temporal.io/cloud/service-accounts) for both operations because API keys are easier to manage and rotate for most teams. In addition, you can control account-level and namespace-level roles for service accounts.

Suggested change
By default, it is recommended for teams to use API keys and [service accounts](https://docs.temporal.io/cloud/service-accounts) for both operations because API keys are easier to manage and rotate for most teams. In addition, you can control account-level and namespace-level roles for service accounts.
By default, it is recommended for teams to use API keys and [service accounts](https://docs.temporal.io/cloud/service-accounts) for both operations. API keys are more straightforward to set up, rotate and get started with for most teams. With API keys, you can automate account-level and namespace-level access control for service accounts.

By default, it is recommended for teams to use API keys and [service accounts](https://docs.temporal.io/cloud/service-accounts) for both operations because API keys are easier to manage and rotate for most teams. In addition, you can control account-level and namespace-level roles for service accounts.

If your organization requires mutual authentication and stronger cryptographic guarantees, then it is encouraged for your teams to use mTLS certificates to authenticate Temporal clients to Temporal Cloud and use API keys for automation (because Temporal Cloud [Operations API](https://docs.temporal.io/ops) and [Terraform provider](https://docs.temporal.io/cloud/terraform-provider) only supports API key for authentication).

Suggested change
If your organization requires mutual authentication and stronger cryptographic guarantees, then it is encouraged for your teams to use mTLS certificates to authenticate Temporal clients to Temporal Cloud and use API keys for automation (because Temporal Cloud [Operations API](https://docs.temporal.io/ops) and [Terraform provider](https://docs.temporal.io/cloud/terraform-provider) only supports API key for authentication).
If your organization requires mutual authentication and stronger cryptographic guarantees, use [mTLS certificates](https://docs.temporal.io/cloud/certificates) to authenticate Temporal clients to Temporal Cloud, alongside API keys for control plane automation via Temporal Cloud [Operations API](https://docs.temporal.io/ops) and [Terraform provider](https://docs.temporal.io/cloud/terraform-provider).
Note that mTLS certificates do not carry a service account identity or map to Temporal Cloud's RBAC model. Namespace access is granted based on CA trust alone, with optional [Certificate Filters](https://docs.temporal.io/cloud/certificates#manage-certificate-filters) for narrowing access by Common Name.


The way you partition Namespaces should usually match the way you partition machine identities.

- If multiple services share a Namespace, you may still want one Service Account per service so that each deployment can

Suggested change
- If multiple services share a Namespace, you may still want one Service Account per service so that each deployment can
- If multiple services share a Namespace, you may still want one Service Account per service so that each deployment can rotate credentials independently.
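On the rotate-independently recommendation above: one Service Account (and API key) per service means each deployment's key age can be tracked and rotated on its own schedule. The sketch below is a hypothetical stdlib helper, not a Temporal API; the service names and the 90-day window are made-up assumptions for illustration.

```python
# Hypothetical helper (not a Temporal API): given per-service API key issue
# dates, report which service accounts are due for rotation. Assumes one
# Service Account and key per service, as the guidance above suggests.
from datetime import date, timedelta

def keys_due_for_rotation(
    issued: dict[str, date],
    today: date,
    max_age: timedelta = timedelta(days=90),
) -> list[str]:
    """Return service names whose API key is older than max_age."""
    return sorted(name for name, day in issued.items() if today - day > max_age)

# Illustrative per-service issue dates.
issued = {
    "billing-worker": date(2025, 12, 1),
    "search-worker": date(2026, 3, 20),
}
due = keys_due_for_rotation(issued, today=date(2026, 3, 26))
# due == ["billing-worker"]: only its key is past the 90-day window.
```

Because each service has its own key, rotating `billing-worker`'s credential here would not disturb `search-worker`, which is the point of per-service Service Accounts.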

The way you partition Namespaces should usually match the way you partition machine identities.

- If multiple services share a Namespace, you may still want one Service Account per service so that each deployment can
rotate credentials independently.

Suggested change
rotate credentials independently.


- If multiple services share a Namespace, you may still want one Service Account per service so that each deployment can
rotate credentials independently.
- If you split workloads into separate Namespaces for security, capacity, or team ownership reasons, those Namespaces

Suggested change
- If you split workloads into separate Namespaces for security, capacity, or team ownership reasons, those Namespaces
- If you split workloads into separate Namespaces for security, capacity, or team ownership reasons, those Namespaces should usually have separate Service Accounts and API keys as well.

- If multiple services share a Namespace, you may still want one Service Account per service so that each deployment can
rotate credentials independently.
- If you split workloads into separate Namespaces for security, capacity, or team ownership reasons, those Namespaces
should usually have separate Service Accounts and API keys as well.

Suggested change
should usually have separate Service Accounts and API keys as well.

rotate credentials independently.
- If you split workloads into separate Namespaces for security, capacity, or team ownership reasons, those Namespaces
should usually have separate Service Accounts and API keys as well.
- If you use Namespace-per-tenant isolation, expect your credential model and RBAC model to become correspondingly more

Suggested change
- If you use Namespace-per-tenant isolation, expect your credential model and RBAC model to become correspondingly more
- If you use Namespace-per-tenant isolation, expect your credential model and RBAC model to become correspondingly more granular.

- If you split workloads into separate Namespaces for security, capacity, or team ownership reasons, those Namespaces
should usually have separate Service Accounts and API keys as well.
- If you use Namespace-per-tenant isolation, expect your credential model and RBAC model to become correspondingly more
granular.

Suggested change
granular.
