Describe the feature or problem you'd like to solve
When a text-only model like DeepSeek is selected as the primary model, image inputs are either ignored or cause errors, because DeepSeek lacks vision capabilities. Users must manually switch models with /model every time they want to include an image, which breaks their workflow.
Proposed solution
Automatic model routing based on input modality. When a user's prompt includes images (pasted, drag-and-dropped, or @-referenced), Copilot CLI should:
- Detect that the primary model doesn''t support vision
- Auto-route the image(s) to a vision-capable model (e.g., GPT-4o, Claude Sonnet 4.5) configured as the "vision fallback"
- Receive a text description of the image(s) from the vision model
- Inject that text description into the prompt sent to the primary (text-only) model
- Show the user only the final response from their chosen primary model, with the image description transparently provided as context
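A minimal sketch of what this pre-processing step could look like, in TypeScript. The helper names (supportsVision, invokeModel) and the Attachment shape are assumptions for illustration, not actual Copilot CLI internals:

```ts
// Stubs standing in for real CLI internals; these names are assumptions,
// not the actual Copilot CLI API.
declare function supportsVision(model: string): boolean;
declare function invokeModel(
  model: string,
  input: { prompt: string; image?: Uint8Array },
): Promise<string>;

interface Attachment {
  kind: "image" | "file";
  name: string;
  data: Uint8Array;
}

// Pre-processing step: if the primary model is text-only and the prompt
// carries images, describe them via the fallback model and prepend the
// descriptions as plain-text context.
async function preprocessPrompt(
  prompt: string,
  attachments: Attachment[],
  primaryModel: string,
  visionFallbackModel: string,
): Promise<string> {
  const images = attachments.filter((a) => a.kind === "image");
  // Nothing to do: no images, or the primary model handles vision itself.
  if (images.length === 0 || supportsVision(primaryModel)) {
    return prompt;
  }

  // Ask the vision-capable fallback to describe each image in text.
  const descriptions = await Promise.all(
    images.map((img) =>
      invokeModel(visionFallbackModel, {
        prompt: "Describe this image in detail for a coding assistant.",
        image: img.data,
      }),
    ),
  );

  // Inject the descriptions ahead of the user's original prompt.
  const context = descriptions
    .map((desc, i) => `[Image ${i + 1}: ${images[i].name}]\n${desc}`)
    .join("\n\n");
  return `${context}\n\n${prompt}`;
}
```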
Configuration (example .copilot/config or copilot-instructions.md):
```yaml
vision_fallback_model: "gpt-4o"
vision_fallback_behavior: "describe_and_forward"
```
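One way the CLI might resolve these settings with sensible defaults, sketched in TypeScript (the field names mirror the keys above; the function itself is hypothetical):

```ts
// Hypothetical config shape; key names mirror the YAML example above.
interface VisionRoutingConfig {
  visionFallbackModel: string; // e.g. "gpt-4o" or "claude-sonnet-4.5"
  visionFallbackBehavior: "describe_and_forward" | "disabled";
}

// Apply defaults so the feature works out of the box but stays overridable.
function resolveVisionConfig(
  raw: Partial<VisionRoutingConfig>,
): VisionRoutingConfig {
  return {
    visionFallbackModel: raw.visionFallbackModel ?? "gpt-4o",
    visionFallbackBehavior:
      raw.visionFallbackBehavior ?? "describe_and_forward",
  };
}
```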
Example prompts or workflows
- User has DeepSeek selected. Pastes a screenshot of a UI bug and types "Fix this layout issue". CLI detects image + text-only model -> sends image to GPT-4o -> GPT-4o returns a text description -> that description is prepended to DeepSeek's prompt -> DeepSeek fixes the code.
- User @-references a diagram.png file: "Implement this architecture diagram". Same routing flow: the architecture is described in text, and DeepSeek implements it.
- /model deepseek is active. User pastes an error screenshot. Without needing to switch models, the user gets a code fix.
- Routing is skipped when it isn't needed: if the primary model IS vision-capable, no routing occurs and the image is sent directly.
- Configurable: users can disable auto-routing or choose which vision model to use as the fallback. (A concrete walkthrough of the first workflow follows this list.)
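To make the first workflow concrete, here is how it would run through the hypothetical preprocessPrompt sketch from the Proposed solution section (the screenshot bytes and the returned description are invented for illustration):

```ts
declare const screenshotBytes: Uint8Array; // the pasted screenshot, for illustration

const finalPrompt = await preprocessPrompt(
  "Fix this layout issue",
  [{ kind: "image", name: "screenshot.png", data: screenshotBytes }],
  "deepseek", // primary model, text-only
  "gpt-4o",   // vision fallback
);
// finalPrompt would begin with something like:
//   [Image 1: screenshot.png]
//   A web page where the sidebar overlaps the main content at narrow widths...
//
//   Fix this layout issue
// and is then sent to DeepSeek as ordinary text.
```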
Additional context
- This is inspired by the sub-agent and delegation architecture already present in Copilot CLI
- Reduces friction for users who prefer text-only models for cost/performance but occasionally need vision capabilities
- Could be implemented as a lightweight pre-processing step before the main model invocation
- Related: the /fleet and custom agents infrastructure could be leveraged for this