-
Notifications
You must be signed in to change notification settings - Fork 2
Cid/attachments #581
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Cid/attachments #581
Conversation
| if name: | ||
| processed_msg["name"] = name | ||
| processed_messages.append(processed_msg) | ||
| else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this cause issues?
when the content type is unknown?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we encounter an unknown content type, we preserve the original message unchanged. If OpenAI introduces a new content type we don't recognize, we don't break the trace, we just pass it through as-is. The trace will still work, we just won't have special handling for that new type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it log a warning or emit an alert so we can learn about this new type and add it later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed. added a log statement
| elif item_type == "file": | ||
| file_data = item.get("file", {}) | ||
| file_id = file_data.get("file_id") | ||
| file_data_b64 = file_data.get("file_data") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if the base64 is too big? I don't know if it will cause issues while sending like for example if the request size exceeds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question! The base64 data is stored temporarily in the Attachment object, but it's not sent in the trace payload. The flow is:
- Base64 data is extracted and stored in
Attachment.data_base64 - Before sending the trace,
upload_trace_attachments()uploads the data to our blob storage and gets astorage_uri - After upload,
data_base64is cleared (set toNone) inattachment_uploader.py - Only the
storage_uriis serialized into the trace payload
So the actual trace request only contains the URI reference, not the heavy base64 data.
| elif file_id: | ||
| # Just reference the file ID (can't download without API call) | ||
| attachment = Attachment(name=filename) | ||
| attachment.metadata["openai_file_id"] = file_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we just have openai_file_id, what will we show in the UI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we only have openai_file_id, we create an attachment with just the filename and store the file_id in metadata. This is a design decision and in the UI, I thought we could show:
- The
filename(if provided, otherwise "file") - The
openai_file_idreference in metadata - No preview/content (since we don't have access to the actual data)
This is a limitation. We could potentially add an option to fetch the file content using the client, but that adds latency and complexity. For now, I thought we could just capture the reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets add this as a comment somewhere so @eluzzi5 can come up with a good design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a Linear issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| The attachment with storage_uri populated (if upload was needed). | ||
| """ | ||
| # Skip if already uploaded | ||
| if attachment.is_uploaded(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we still upload data from openai or anywhere else to our storage? they might remove that data after sometime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, if an attachment has an external URL (like a hosted image URL), we keep that URL and don't re-upload to our storage.
Pros of current approach:
- Faster (no download + re-upload)
- Less storage cost
Cons:
- External URLs may expire
- Data could become unavailable
I think this is worth discussing as a follow-up improvement. We could add an option like persist_external_urls=True that would download and re-upload external content to ensure long-term availability. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think storage is cheap. We can do it async on our backend later but we should surely store it with us. I am sure that particular file is important and if it expires we miss critical information. We can do it later as a followup task but make it default.
Thoughts @gustavocidornelas @whoseoyster ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a compliance risk if we copy data to Openlayer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! The attachment upload feature is opt-in. The user needs to explicitly set attachment_upload_enabled=True when they configure the tracer.
However, we should document this clearly. Worth discussing further
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree with us copying files over to our storage wherever possible, but allowing opt out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| try: | ||
| from .attachment_uploader import upload_trace_attachments | ||
|
|
||
| upload_count = upload_trace_attachments(trace) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this _upload_and_publish_trace will be called when we do have an attachment to upload - right?
because if theres nothing to upload then upload_count will be 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_upload_and_publish_trace is called for every trace, regardless of whether there are attachments. It handles both uploading attachments (if any exist and attachment_upload_enabled=True) and publishing the trace data to Openlayer.
If there are no attachments, upload_count will be 0, but the function still proceeds to publish the trace. The attachment upload is just one step in the trace completion process. The naming could be clearer - it's really "process and publish trace" with attachment upload being an optional first step.
matheusportela
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gustavocidornelas I left a bunch of comments but none is blocking (some are literally just because I'm curious)! Let me know if you have any questions
| # TODO: Implement metadata extraction from file | ||
| pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still relevant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imo, it's very misc. This method was created by AI and it wanted to extract some metadata for each attachment type (e.g., file size, duration, image dimensions, etc.)
At that point, I didn't want it to focus too much on these details, so I deleted everything and wrote a #TODO
a02c309 to
046af02
Compare
matheusportela
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per my comments, the PR looks good! (I did not perform manual tests, though, so it might be good to have someone else also approving the PR.)
Pull Request
Summary
Introduces:
Changes
Attachments
Attachmentabstraction, which represents media data uploaded to a blob store. TheAttachmentobject stores metadata and thestorageUri, which can be used by the Openlayer platform to fetch the media.attachmentsfield, which is an array ofAttachmentobjects. This allows users to log arbitrary media to a step. For example:OpenAI multimodal
attachments, this PR also instruments thetrace_openaiwrapper to parse image/audio/files in the input or output of OpenAI LLM calls.Note that if the
typeis one ofimage,audio, orfile, the other object field is anattachment, which is a serializedAttachmentobject.Context
OPEN-8683: Multimodal attachment support for the Python SDK and OPEN-8684: Enhance OpenAI tracer to support multimodal inputs/outputs
Testing