
voyage-multimodal-3.5 (video) support #384

Open

fzowl wants to merge 5 commits into deepset-ai:main from voyage-ai:feat/embedding-model-voyage-multimodal-3.5

Conversation

Contributor

@fzowl fzowl commented Dec 21, 2025

voyage-multimodal-3.5 (video) support

@fzowl fzowl requested a review from a team as a code owner on December 21, 2025 at 16:56
import os
import voyageai

client = voyageai.Client(api_key=os.environ.get("VOYAGE_API_KEY"))

# Text-only embedding
result = client.multimodal_embed(
    inputs=[["A sunset over the ocean"]], model="voyage-multimodal-3.5")
Contributor

Is the multimodal functionality supported by any embedders in the voyage-embedders-haystack package? If so, please use those components

Contributor Author

@bilgeyucel Yes, multimodal embeddings are now supported, so I updated the PR. Can you please take a look?

@fzowl fzowl force-pushed the feat/embedding-model-voyage-multimodal-3.5 branch from e7bae21 to b6c56ff on January 30, 2026 at 13:12
Contributor

@bilgeyucel bilgeyucel left a comment

Hi @fzowl, thanks for the PR! I only have one comment, and it applies to most of the code examples you shared. It looks like all embedder components act as generator components, generating natural-language outputs. Can you also use these models to just create embeddings?


from haystack.dataclasses import ByteStream

# Mixed text and image embedding; `embedder` is a previously
# constructed Voyage multimodal embedder component
image_bytes = ByteStream.from_file_path("image.jpg")
result = embedder.run(inputs=[["Describe this image:", image_bytes]])
Contributor

Is this an image embedder or a generative model? It looks like it acts as a generative model (a VLM) 🤔
If that's the case, it's worth renaming it to VoyageMultimodalChatGenerator with a similar API to this:

from haystack.dataclasses import ImageContent, ChatMessage
from haystack.components.generators.chat import OpenAIChatGenerator

image_content = ImageContent.from_file_path("image.jpg")
user_message = ChatMessage.from_user(content_parts=["Describe the image in short.", image_content])
llm = VoyageMultimodalChatGenerator(model="voyage-multimodal-3.5")

You can check OpenAIChatGenerator for details.
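
For reference, a minimal sketch of how such a component might be invoked, assuming it mirrors OpenAIChatGenerator's run() contract (VoyageMultimodalChatGenerator and this output shape are assumptions here, not an existing API):

# Hypothetical call, mirroring OpenAIChatGenerator's run() contract
result = llm.run(messages=[user_message])
print(result["replies"][0].text)  # "replies" would hold ChatMessage objects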

Contributor Author

fzowl commented Feb 13, 2026

@bilgeyucel Thanks for catching that! You're right that the text strings were misleading.

To clarify: voyage-multimodal-3.5 is an embedding model — it returns float vectors (list[list[float]]), not generated text. The text strings in multimodal inputs are content that gets co-embedded with images/videos into a shared vector space for semantic similarity search. They are not prompts or instructions to the model.
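
For illustration, a minimal sketch of that similarity-search usage with the raw voyageai client (variable names are illustrative; assumes VOYAGE_API_KEY is set, numpy is installed, and that the SDK response exposes .embeddings as in the Voyage Python docs):

import numpy as np
import voyageai

client = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
res = client.multimodal_embed(
    inputs=[["A sunset over the ocean"], ["beach at dusk"]],
    model="voyage-multimodal-3.5",
)
# Two vectors in the same embedding space; compare by cosine similarity
a, b = (np.asarray(v) for v in res.embeddings)
print(float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))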

I've updated the three confusing example strings:

  • "What is in this image?" → "A sunset over the ocean" (descriptive content, not a question)
  • "Describe this image:" → "Product photo for online store" (label, not an instruction)
  • "Describe this video:" → "Machine learning tutorial" (topic description, not a prompt)

The other examples ("Document about machine learning", "Technical diagram") already read as descriptive content so those are unchanged.
