Copilot AI commented Dec 11, 2025

Summary

Adds quickstart documentation for the Microsoft.Extensions.DataIngestion library, demonstrating a complete ETL pipeline for RAG scenarios.

Contributes to #50534

Changes

Documentation

  • New quickstart: docs/ai/quickstarts/process-data.md
    • Document reading with MarkdownReader
    • AI-powered enrichment (image alt-text, summaries)
    • Semantic chunking with embedding-based similarity
    • Vector storage using SQLite
    • Interactive semantic search
  • Dual-platform support via zone pivots (OpenAI/Azure OpenAI)
  • Added to "Chat with your data (RAG)" section in TOC

Code Snippets

  • Complete compilable C# projects for both platforms
  • Demonstrates pipeline composition: reader → enricher → chunker → writer
  • Includes sample data and region markers for doc references

// Compose data ingestion pipeline
using IngestionPipeline<string> pipeline = new(reader, chunker, writer, loggerFactory)
{
    DocumentProcessors = { imageAlternativeTextEnricher },
    ChunkProcessors = { summaryEnricher }
};

await foreach (var result in pipeline.ProcessAsync(new DirectoryInfo("./data"), searchPattern: "*.md"))
{
    Console.WriteLine($"Completed processing '{result.DocumentId}'. Succeeded: '{result.Succeeded}'.");
}
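
The "semantic chunking with embedding-based similarity" step listed under Documentation can be pictured with a small standalone sketch. It is not the Microsoft.Extensions.DataIngestion chunker and does not use that library's API: the `SemanticChunkSketch` class, its `Chunk` and `Cosine` methods, the 0.8 threshold, and the two-dimensional toy vectors are all hypothetical, chosen only to illustrate the idea of starting a new chunk when adjacent sentences stop being similar.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical toy embeddings. In a real pipeline these vectors would come
// from an embedding model rather than being written by hand.
var sentences = new (string Text, float[] Vector)[]
{
    ("Cats purr.",             new[] { 1.0f, 0.0f }),
    ("Cats nap often.",        new[] { 0.9f, 0.1f }),
    ("SQLite stores vectors.", new[] { 0.0f, 1.0f }),
    ("SQLite is embedded.",    new[] { 0.1f, 0.9f }),
};

List<List<string>> chunks = SemanticChunkSketch.Chunk(sentences, threshold: 0.8f);

foreach (List<string> chunk in chunks)
{
    Console.WriteLine(string.Join(" ", chunk));
}
// Prints:
// Cats purr. Cats nap often.
// SQLite stores vectors. SQLite is embedded.

// Illustrative helper only; the name and shape are assumptions, not library API.
static class SemanticChunkSketch
{
    // Cosine similarity between two vectors of equal length.
    public static float Cosine(float[] a, float[] b)
    {
        float dot = 0f, normA = 0f, normB = 0f;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
    }

    // Walks the sentences in order and starts a new chunk whenever the
    // similarity between neighbouring sentences drops below the threshold.
    public static List<List<string>> Chunk(
        IReadOnlyList<(string Text, float[] Vector)> items, float threshold)
    {
        var result = new List<List<string>>();
        if (items.Count == 0)
        {
            return result;
        }

        var current = new List<string> { items[0].Text };
        for (int i = 1; i < items.Count; i++)
        {
            if (Cosine(items[i - 1].Vector, items[i].Vector) < threshold)
            {
                result.Add(current);
                current = new List<string>();
            }
            current.Add(items[i].Text);
        }
        result.Add(current);
        return result;
    }
}
```

Comparing only adjacent sentences keeps the sketch short; a production chunker would typically also enforce a maximum chunk size in tokens.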

Based on sample from https://github.com/luisquintanilla/DataIngestion and blog announcement at https://devblogs.microsoft.com/dotnet/introducing-data-ingestion-building-blocks-preview/

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • devblogs.microsoft.com
    • Triggering command: /usr/bin/curl curl -s REDACTED (dns block)
    • Triggering command: /usr/bin/wget wget -q -O /tmp/blog.html REDACTED (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Original prompt

Add a quickstart or tutorial for processing custom data based on the data ingestion sample described in https://devblogs.microsoft.com/dotnet/introducing-data-ingestion-building-blocks-preview/. It should live under the "Chat with your data" section of the AI TOC and be titled "Process data" or something similar. This task contributes to #50534.



Internal previews

| 📄 File | 🔗 Preview link |
| --- | --- |
| docs/ai/quickstarts/process-data.md | Process custom data for AI applications |
| docs/ai/quickstarts/snippets/process-data/data/sample.md | docs/ai/quickstarts/snippets/process-data/data/sample |
| docs/ai/quickstarts/structured-output.md | Request a response with structured output |
| docs/ai/quickstarts/text-to-image.md | Quickstart - Generate images from text using AI |
| docs/ai/toc.yml | docs/ai/toc |

Copilot AI changed the title from "[WIP] Add quickstart tutorial for processing custom data" to "Add data ingestion quickstart for processing custom data" Dec 11, 2025
Copilot AI requested a review from gewarren December 11, 2025 23:42
gewarren marked this pull request as ready for review December 17, 2025 01:52
gewarren requested a review from a team as a code owner December 17, 2025 01:52
Copilot AI left a comment


Pull request overview

This pull request adds a new quickstart tutorial for the Microsoft.Extensions.DataIngestion library, demonstrating how to build an ETL pipeline for RAG scenarios. The quickstart shows users how to read Markdown documents, enrich them with AI, chunk them semantically, and store them in a vector database for semantic search. Additionally, the PR includes cleanup changes to other quickstart files: model names are no longer stored in user secrets and are instead hardcoded as inline string values.

Key Changes

  • New quickstart documentation showing end-to-end data ingestion pipeline for AI applications
  • Sample code demonstrating pipeline composition with readers, enrichers, chunkers, and vector storage
  • Code cleanup across existing quickstarts (text-to-image, structured-output) to simplify configuration

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| docs/ai/toc.yml | Adds new quickstart entry under "Chat with your data (RAG)" section |
| docs/ai/quickstarts/process-data.md | New quickstart documentation for data ingestion pipeline |
| docs/ai/quickstarts/snippets/process-data/Program.cs | Complete C# example implementing data ingestion with Azure OpenAI |
| docs/ai/quickstarts/snippets/process-data/ProcessData.csproj | Project file with required NuGet packages |
| docs/ai/quickstarts/snippets/process-data/data/sample.md | Sample Markdown document for testing the pipeline |
| docs/ai/quickstarts/text-to-image.md | Removed unnecessary model name from user secrets configuration |
| docs/ai/quickstarts/structured-output.md | Removed unnecessary model name from user secrets configuration |
| docs/ai/quickstarts/snippets/text-to-image/azure-openai/Program.cs | Hardcoded model name instead of reading from user secrets |
| docs/ai/quickstarts/snippets/structured-output/Program.cs | Hardcoded model name instead of reading from user secrets |
| docs/ai/how-to/snippets/access-data/ArgumentsExample.cs | Hardcoded model name instead of reading from user secrets |
