Datadog provides a serverless integration with the OpenMetrics endpoint. The integration scrapes metrics, stores them in Datadog, and provides a default dashboard with built-in monitors. See the [integration page](https://docs.datadoghq.com/integrations/temporal-cloud-openmetrics/) for more details.
Cloud metrics monitor Temporal Cloud behavior, and SDK metrics monitor your Workers. When used together, Temporal Cloud and SDK metrics measure the health and performance of your full Temporal infrastructure, including the Temporal Cloud Service and user-supplied Temporal Workers.
This page covers:

- [How to detect misconfigured Workers](#detect-misconfigured-workers)
- [How to configure Sticky cache](#configure-sticky-cache)
For Datadog users, treat this integration as the Cloud-side half of your observability setup:

- Use OpenMetrics in Datadog to monitor Temporal Cloud behavior such as Task Queue backlog, poll success, and rate limiting.
- Use a Datadog agent to collect [SDK metrics](/cloud/metrics/sdk-metrics-setup) from your Workers to monitor saturation, Schedule-To-Start latency, slot availability, and sticky cache behavior.
- Use tracing separately when you need execution-path debugging through your application and Activity code.

> **Review note:** We don't have a good Datadog tracing integration, so this bullet may be misleading.
>
> Datadog supports ingesting OTLP traces directly into its backend; this is currently in private preview ([setup documentation](https://docs.datadoghq.com/opentelemetry/setup/otlp_ingest/traces/?tab=javascript)). We can have a quick chat with Datadog to get them to allow-list Temporal for trace ingestion (they only need to add an HTTP header for Temporal), and this page can then tell readers to contact the Datadog OpenTelemetry team for instructions on ingesting traces into Datadog. The trace endpoint has been available for more than a year; the blocker to announcing public availability is pricing, and Datadog is reportedly looking to announce itself as an OTel-native backend at this year's DASH.
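To collect the Worker-side SDK metrics above, one option is the Datadog Agent's OpenMetrics check pointed at the Prometheus endpoint your Worker exposes. A minimal sketch, assuming your Worker serves metrics on port 9464 and emits the default SDK metric names (both the port and the metric list are assumptions; adjust them to your setup):

```yaml
# datadog-agent/conf.d/openmetrics.d/conf.yaml
init_config:

instances:
    # Assumed Worker Prometheus endpoint; configure it via your SDK's telemetry options.
  - openmetrics_endpoint: http://localhost:9464/metrics
    # Prefix applied to these metrics in Datadog.
    namespace: temporal_sdk
    # Assumed default SDK metric names; verify against what your SDK actually emits.
    metrics:
      - temporal_worker_task_slots_available
      - temporal_workflow_task_schedule_to_start_latency
      - temporal_activity_schedule_to_start_latency
      - temporal_sticky_cache_size
```

The `metrics` allowlist keeps ingestion costs predictable; widen it once you know which Worker signals you alert on.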
Use the following rule of thumb when deciding which signal to rely on:
| Question | Primary signal |
| --- | --- |

> **Review note:** This reads like it's for an LLM to reference, but if that's what we're going for then I'm good with it.
This page assumes you are monitoring both Worker-side SDK metrics and Cloud-side metrics. Use SDK metrics to understand what your Workers are doing, and Cloud metrics to understand what Temporal Cloud is seeing at the Task Queue and service level. For an overview of how these signals fit together, see [Temporal Cloud metrics](/cloud/metrics).
- Automation (e.g. Temporal Cloud [Operations API](https://docs.temporal.io/ops), [Terraform provider](https://docs.temporal.io/cloud/terraform-provider), [Temporal CLI](https://docs.temporal.io/cli/setup-cli))
By default, it is recommended for teams to use API keys and [service accounts](https://docs.temporal.io/cloud/service-accounts) for both operations. API keys are more straightforward to set up, rotate, and get started with for most teams, and you can control account-level and namespace-level roles for service accounts.
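For the automation half, the Terraform provider authenticates with an API key. A minimal sketch, assuming the provider's `api_key` argument; the variable name is illustrative:

```hcl
terraform {
  required_providers {
    temporalcloud = {
      source = "temporalio/temporalcloud"
    }
  }
}

# API key for a service account. Keep it out of source control,
# e.g. set TF_VAR_temporal_api_key in the environment.
variable "temporal_api_key" {
  type      = string
  sensitive = true
}

provider "temporalcloud" {
  api_key = var.temporal_api_key
}
```

Rotating the key then only means updating the environment secret; no Terraform code changes are needed.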
If your organization requires mutual authentication and stronger cryptographic guarantees, use [mTLS certificates](https://docs.temporal.io/cloud/certificates) to authenticate Temporal clients to Temporal Cloud, alongside API keys for control plane automation via the Temporal Cloud [Operations API](https://docs.temporal.io/ops) and [Terraform provider](https://docs.temporal.io/cloud/terraform-provider).

Note that mTLS certificates do not carry a service account identity or map to Temporal Cloud's RBAC model. Namespace access is granted based on CA trust alone, with optional [Certificate Filters](https://docs.temporal.io/cloud/certificates#manage-certificate-filters) for narrowing access by Common Name.
The way you partition Namespaces should usually match the way you partition machine identities.

- If multiple services share a Namespace, you may still want one Service Account per service so that each deployment can rotate credentials independently.
- If you split workloads into separate Namespaces for security, capacity, or team ownership reasons, those Namespaces should usually have separate Service Accounts and API keys as well.
- If you use Namespace-per-tenant isolation, expect your credential model and RBAC model to become correspondingly more granular.
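The partitioning above can be expressed in Terraform so that each service gets its own Service Account and rotatable API key. A sketch under stated assumptions: the resource and attribute names below are drawn from the Temporal Cloud Terraform provider but should be verified against the provider reference, and the namespace, names, and expiry are illustrative:

```hcl
# One Service Account per service, scoped to the Namespace it uses.
# Attribute names are assumptions; check the provider reference.
resource "temporalcloud_service_account" "billing_worker" {
  name           = "billing-worker"
  account_access = "read"
  namespace_accesses = [
    {
      namespace_id = temporalcloud_namespace.billing.id
      permission   = "write"
    }
  ]
}

# A rotatable API key owned by that Service Account.
resource "temporalcloud_apikey" "billing_worker" {
  display_name = "billing-worker-key"
  owner_type   = "service-account"
  owner_id     = temporalcloud_service_account.billing_worker.id
  expiry_time  = "2026-01-01T00:00:00Z"
}
```

Because each service owns its key resource, rotating one service's credentials never touches another service's deployment.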
## Summary

Adds opinionated guidance for Cloud auth operating models and for combining Cloud metrics with SDK metrics in observability setups.

## Why

The setup docs explain mechanics, but users still need clearer guidance on how to structure service accounts, rotate credentials, and combine Cloud-side and Worker-side signals in practice.

## Changes

## Validation

`yarn build`

Attachments: EDU-6119 Add cloud auth and observability guidance