docs: add Python data handling section #4378
lennessyy wants to merge 3 commits into `large-payload-prerelease` from
Conversation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
> | | [PayloadConverter](/develop/python/data-handling/data-conversion) | [PayloadCodec](/develop/python/data-handling/data-encryption) | [ExternalStorage](/develop/python/data-handling/large-payload-storage) |
> | --- | --- | --- | --- |
> | **Purpose** | Serialize application data to bytes | Transform encoded payloads (encrypt, compress) | Offload large payloads to external store |
> | **Must be deterministic** | Yes | No | No |
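For readers skimming the table, the three layers can be sketched as a plain-Python pipeline. All names here are hypothetical stand-ins (`json`/`zlib` for the converter and codec, a `store` object for external storage), not the temporalio API:

```python
import json
import zlib

def handle(value, threshold=256 * 1024, store=None):
    # Sketch of the three layers in the order they run
    # (hypothetical names, not the temporalio API):
    data = json.dumps(value).encode()    # PayloadConverter: value -> bytes
    data = zlib.compress(data)           # PayloadCodec: transform the bytes
    if store is not None and len(data) > threshold:
        return {"ref": store.put(data)}  # ExternalStorage: offload, keep a claim
    return {"inline": data}              # small payloads stay inline
```

Only the first step is performed by the required layer; the other two are optional transforms applied afterward.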
Not sure about this line. What are we trying to help the user with here? And @jmaeagle99 could you take a look?
- For codec, I think we say that due to content hashing, codec should be deterministic for cases when the workflow task fails.
Ah, so this came from the TypeScript page: https://docs.temporal.io/develop/typescript/converters-and-encryption
When I was creating the table, I used the TS page, which had specific instructions on whether or not these components can access external services or employ non-deterministic modules. I think the main thing was to tell users they cannot do that in the payload converter, and thus cannot do any encryption there either.
If you think that line about codec is worth adding, we can change it. Otherwise, I'm okay with removing this row.
I find it a bit abstract as well. Not sure it's doing much good in such a concise form so prominently in the doc. But when I'm actually building a custom payload converter, I'd like to know that it should be deterministic/not access network.
I don't think having a "Must be deterministic? Yes/No" explains much and might just create more questions. I think that this information is more for the authors of converters, codecs, and storage drivers rather than the authors of workflows. Even if workflow authors have to think about determinism, just a different kind of determinism.
I think there are two aspects to think about when talking about determinism of these things (converters, codecs, and external storage):
- For a given input, the output should be reproducible when the operation is successful.
- Whether the operation is allowed to fail.
For example, a payload converter cannot raise/throw/return errors. That is because converters run within the workflow code execution. The workflow code can handle those errors and compensate with another workflow command, which will cause workflow non-determinism on replay.
In Python, codecs can raise/throw/return errors. That is because they are executed before the workflow code executes and after the workflow code has yielded. In either case, the workflow code has no ability to handle the error. Raising/throwing/returning errors here will cause the WFT to be retried and has no impact on workflow determinism. The same is allowed for external storage.
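The reproducibility aspect can be shown without any Temporal APIs. A minimal sketch of a deterministic converter-style helper, where stable key ordering makes the bytes, and therefore any content hash, identical across retries:

```python
import hashlib
import json

def to_bytes(value) -> bytes:
    # sort_keys + fixed separators: identical input always yields
    # identical bytes, so a content hash is stable across replays.
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode()

a = to_bytes({"b": 2, "a": 1})
b = to_bytes({"a": 1, "b": 2})
assert a == b
assert hashlib.sha256(a).hexdigest() == hashlib.sha256(b).hexdigest()
```

Without `sort_keys`, two logically equal dicts could serialize to different bytes, and a content hash computed on replay would not match the original.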
> Of these three layers, only the PayloadConverter is required. Temporal uses a default PayloadConverter that handles JSON serialization. The PayloadCodec and ExternalStorage layers are optional. You only need to customize these layers when
Should we link to encyclopedia for external storage somewhere?
```python
data_converter = dataclasses.replace(
    temporalio.converter.default(),
    external_storage=ExternalStorage(
        drivers=[MyStorageDriver()],
    ),
)
```
Use `LocalDiskStorageDriver` here? (Is there a snipsync?)
Yes, I will snipsync all the code blocks once all the content is approved.
> ## Configure payload size threshold
>
> You can configure the payload size threshold that triggers external storage. By default, payloads larger than 256 KiB are offloaded to external storage. You can adjust this with the `payload_size_threshold` parameter, or set it to 1 to
```diff
- are offloaded to external storage. You can adjust this with the `payload_size_threshold` parameter, or set it to 1 to
+ are offloaded to external storage. You can adjust this with the `payload_size_threshold` parameter, even setting it to 0 to
```
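The offload decision itself is easy to illustrate in plain Python. This is a sketch, not the SDK's implementation; `256 * 1024` mirrors the documented 256 KiB default, and the parameter name is borrowed for illustration only:

```python
DEFAULT_THRESHOLD = 256 * 1024  # 256 KiB, the documented default

def should_offload(payload: bytes,
                   payload_size_threshold: int = DEFAULT_THRESHOLD) -> bool:
    # With a threshold of 0, every non-empty payload is offloaded.
    return len(payload) > payload_size_threshold

assert not should_offload(b"x" * 1024)                    # small: stays inline
assert should_offload(b"x" * (256 * 1024 + 1))            # over 256 KiB: offloaded
assert should_offload(b"x", payload_size_threshold=0)     # 0: offload everything
```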
> | | [PayloadConverter](/develop/python/data-handling/data-conversion) | [PayloadCodec](/develop/python/data-handling/data-encryption) | [ExternalStorage](/develop/python/data-handling/large-payload-storage) |
> | --- | --- | --- | --- |
> | **Purpose** | Serialize application data to bytes | Transform encoded payloads (encrypt, compress) | Offload large payloads to external store |
> | **Must be deterministic** | Yes | No | No |
> | **Default** | JSON serialization | None (passthrough) | None (passthrough) |
```diff
- | **Default** | JSON serialization | None (passthrough) | None (passthrough) |
+ | **Default** | JSON serialization | None (passthrough) | None (all payloads will be stored in Workflow History) |
```
I feel that "passthrough" is correct. Whatever comes out of this "pipeline" of data handling is what is stored in Workflow History, and that shouldn't be tied to the external storage step.
> @@ -0,0 +1,252 @@
> ---
> id: large-payload-storage
"external storage" and "large payload storage" are being used inconsistently throughout these docs. I think we should stick with one, namely external storage.
> ### Prerequisites
>
> - An Amazon S3 bucket that you have write access to. Refer to [lifecycle management](/external-storage#lifecycle) to
```diff
- - An Amazon S3 bucket that you have write access to. Refer to [lifecycle management](/external-storage#lifecycle) to
+ - An Amazon S3 bucket that you have read and write access to. Refer to [lifecycle management](/external-storage#lifecycle) to
```
If you want to be even more prescriptive, the identity needs at least `s3:PutObject` and `s3:GetObject`. It would be unlikely that you can get away with just `s3:PutObject`.
> - An Amazon S3 bucket that you have write access to. Refer to [lifecycle management](/external-storage#lifecycle) to ensure that your payloads remain available for the entire lifetime of the Workflow.
> - The `aioboto3` library is installed and available.
The Python SDK has an extra that installs this library (and the types for it):

```shell
python -m pip install "temporalio[aioboto3]"
```
```python
os.makedirs(self._store_dir, exist_ok=True)

prefix = self._store_dir
sc = context.serialization_context
```
FYI, this is changing in this PR. Haven't been able to merge it yet due to failures impacting the repository.
> Store payloads durably so that they survive process crashes and remain available for debugging and auditing after the Workflow completes. Refer to [lifecycle management](/external-storage#lifecycle) for retention requirements.
>
> The following example shows a complete custom driver implementation that uses local disk as the backing store:
Should we caveat that this example should not be used in production? It works for local development and demoing on one machine, but would not work for multi-worker environments.
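To make that caveat concrete, here is a minimal local-disk store sketch (a hypothetical class, not the temporalio driver interface). The returned key is only meaningful to a worker with access to the same filesystem, which is exactly why this pattern breaks down with multiple workers:

```python
import os
import tempfile
import uuid

class LocalDiskStore:
    """Demo-only sketch: fine for local development on one machine,
    but other workers cannot read this machine's disk."""

    def __init__(self, root: str):
        self._root = root
        os.makedirs(root, exist_ok=True)

    def put(self, data: bytes) -> str:
        key = uuid.uuid4().hex
        with open(os.path.join(self._root, key), "wb") as f:
            f.write(data)
        return key  # the claim needed to retrieve the payload later

    def get(self, key: str) -> bytes:
        with open(os.path.join(self._root, key), "rb") as f:
            return f.read()

store = LocalDiskStore(tempfile.mkdtemp())
key = store.put(b"large payload")
assert store.get(key) == b"large payload"
```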
> and [Payload Codec](/develop/python/data-handling/data-encryption) before it reaches the driver. See the [Components of a Data Converter](/dataconversion#data-converter-components) for more details.
>
> Return a `StorageDriverClaim` for each payload with enough information to retrieve it later. Structure your storage keys
I think how driver authors want to structure their keys is up to them. They could just use CAS (content-addressed storage) and not use any prefixing if they don't need sophisticated lifecycle management. So I think these are recommendations rather than requirements.
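The CAS option can be sketched in a few lines (plain Python, hypothetical helper): the key is derived from the payload's content, so no prefix scheme is required and identical payloads dedupe for free.

```python
import hashlib

def cas_key(data: bytes, prefix: str = "") -> str:
    # Content-addressed key: the same bytes always map to the same
    # object, so writes are idempotent and duplicates collapse.
    return prefix + hashlib.sha256(data).hexdigest()

assert cas_key(b"payload") == cas_key(b"payload")
assert cas_key(b"payload", prefix="wf-123/").startswith("wf-123/")
```

The trade-off is that content-only keys carry no workflow identity, which makes per-workflow lifecycle policies (like the lifecycle management linked above) harder to express.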
> ## Use multiple storage drivers
>
> When you have multiple drivers, such as for hot and cold storage tiers, pass a `driver_selector` function that chooses
I'm thinking we might be able to use a better example of why you'd have multiple drivers:
- Your worker needs to support receiving workflow starts that were created by far clients that don't use the same driver that you prefer for your worker. Register that far-client driver and your preferred driver, and use the selector to always pick your driver.
- Maybe some of your workflows could be optimized with local caching (like Redis) instead of going to a far storage service; you'd be trading durability for lower latency, but maybe that workflow type is allowed to be less durable. Register your Redis driver and S3 driver, and use the selector to pick based on workflow type (coming in this PR).
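A selector along the lines of the second bullet could look like this. It is a plain-Python sketch; the real `driver_selector` signature isn't shown in this thread, so the argument names and driver values here are illustrative:

```python
def driver_selector(workflow_type: str, drivers: dict):
    # Hypothetical policy: a low-latency cache driver for workflow
    # types that tolerate lower durability, durable storage otherwise.
    if workflow_type in {"RealtimeScoring"}:
        return drivers["redis"]
    return drivers["s3"]

drivers = {"redis": "RedisDriver", "s3": "S3Driver"}
assert driver_selector("RealtimeScoring", drivers) == "RedisDriver"
assert driver_selector("Billing", drivers) == "S3Driver"
```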
🤖 Generated with Claude Code
Attachments: EDU-6148 docs: add Python data handling section