2 changes: 1 addition & 1 deletion docs/ai/how-to/snippets/access-data/ArgumentsExample.cs
@@ -38,7 +38,7 @@ public static async Task UseFICC()

string endpoint = config["AZURE_OPENAI_ENDPOINT"];
string apiKey = config["AZURE_OPENAI_API_KEY"];
-string model = config["AZURE_OPENAI_GPT_NAME"];
+string model = "gpt-4o";

// <SnippetUseAdditionalProperties>
FunctionInvokingChatClient client = new FunctionInvokingChatClient(
177 changes: 177 additions & 0 deletions docs/ai/quickstarts/process-data.md
@@ -0,0 +1,177 @@
---
title: Quickstart - Process custom data for AI
description: Create a data ingestion pipeline to process and prepare custom data for AI applications using Microsoft.Extensions.DataIngestion.
ms.date: 12/11/2025
ms.topic: quickstart
ai-usage: ai-assisted
---

# Process custom data for AI applications

In this quickstart, you learn how to create a data ingestion pipeline to process and prepare custom data for AI applications. The app uses the <xref:Microsoft.Extensions.DataIngestion> library to read documents, enrich content with AI, chunk text semantically, and store embeddings in a vector database for semantic search.

Data ingestion is essential for retrieval-augmented generation (RAG) scenarios where you need to process large amounts of unstructured data and make it searchable for AI applications.

[!INCLUDE [azure-openai-prereqs](includes/prerequisites-azure-openai.md)]

## Create the app

Complete the following steps to create a .NET console app.

1. In an empty directory on your computer, use the `dotnet new` command to create a new console app:

```dotnetcli
dotnet new console -o ProcessDataAI
```

1. Change directory into the app folder:

```dotnetcli
cd ProcessDataAI
```

1. Install the required packages:

```bash
dotnet add package Azure.AI.OpenAI
dotnet add package Microsoft.Extensions.AI.OpenAI --prerelease
dotnet add package Microsoft.Extensions.Configuration
dotnet add package Microsoft.Extensions.Configuration.UserSecrets
dotnet add package Microsoft.Extensions.DataIngestion --prerelease
dotnet add package Microsoft.Extensions.DataIngestion.Markdig --prerelease
dotnet add package Microsoft.Extensions.Logging.Console
dotnet add package Microsoft.ML.Tokenizers.Data.O200kBase
dotnet add package Microsoft.SemanticKernel.Connectors.SqliteVec --prerelease
```

## Create the AI service

1. To provision an Azure OpenAI service and models, complete the steps in the [Create and deploy an Azure OpenAI Service resource](/azure/ai-services/openai/how-to/create-resource) article. For this quickstart, you need to deploy two models: `gpt-4o` and `text-embedding-3-small`. The sample code passes these names to the client as deployment names, so name each deployment after its model or update the code to match.

1. From a terminal or command prompt, navigate to the root of your project directory.

1. Run the following commands to configure your Azure OpenAI endpoint and API key for the sample app:

```bash
dotnet user-secrets init
dotnet user-secrets set AZURE_OPENAI_ENDPOINT <your-Azure-OpenAI-endpoint>
dotnet user-secrets set AZURE_OPENAI_API_KEY <your-Azure-OpenAI-API-key>
```
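
The sample reads both secrets from configuration at startup, and a missing value would otherwise only surface later as a null-argument error. The following is a minimal sketch of failing fast instead. `RequiredSettings` and `Get` are hypothetical names that aren't part of any library, and the lookup delegate is an assumption that stands in for the `config[key]` indexer used in `Program.cs`.

```csharp
using System;

// Hypothetical helper; not part of Microsoft.Extensions.Configuration.
public static class RequiredSettings
{
    // Returns the value for 'key', or throws a descriptive error
    // when the value is missing or blank.
    public static string Get(Func<string, string?> lookup, string key)
    {
        string? value = lookup(key);
        if (string.IsNullOrWhiteSpace(value))
        {
            throw new InvalidOperationException(
                $"Missing setting '{key}'. " +
                $"Set it with: dotnet user-secrets set {key} <value>");
        }

        return value;
    }
}
```

With the configuration object built in `Program.cs`, a call might look like `RequiredSettings.Get(key => config[key], "AZURE_OPENAI_ENDPOINT")`.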

## Open the app in an editor

Open the app in Visual Studio Code (or your editor of choice).

```bash
code .
```

## Create the sample data

1. Copy the [sample.md](https://raw.githubusercontent.com/dotnet/docs/refs/heads/main/docs/ai/quickstarts/snippets/process-data/data/sample.md) file to a folder named `data` in your project directory.
1. Configure the project to copy this file to the output directory. If you're using Visual Studio, right-click the file in Solution Explorer, select **Properties**, and then set **Copy to Output Directory** to **Copy if newer**. In other editors, add a `CopyToOutputDirectory` entry for `data\sample.md` to the project file.

## Add the app code

The data ingestion pipeline consists of several components that work together to process documents:

- **Document reader**: Reads Markdown files from a directory.
- **Document processor**: Enriches images with AI-generated alternative text.
- **Chunker**: Splits documents into semantic chunks using embeddings.
- **Chunk processor**: Generates AI summaries for each chunk.
- **Vector store writer**: Stores chunks with embeddings in a SQLite database.

1. In the `Program.cs` file, delete any existing code and add the following code to configure the document reader:

:::code language="csharp" source="snippets/process-data/Program.cs" id="ConfigureReader":::

The <xref:Microsoft.Extensions.DataIngestion.MarkdownReader> class reads Markdown documents and converts them into a unified format that works well with large language models.

1. Add code to configure logging for the pipeline:

:::code language="csharp" source="snippets/process-data/Program.cs" id="ConfigureLogging":::

1. Add code to configure the AI client for enrichment and chat:

:::code language="csharp" source="snippets/process-data/Program.cs" id="ConfigureChatClient":::

1. Add code to configure the document processor that enriches images with AI-generated descriptions:

:::code language="csharp" source="snippets/process-data/Program.cs" id="ConfigureDocumentProcessor":::

The <xref:Microsoft.Extensions.DataIngestion.ImageAlternativeTextEnricher> uses large language models to generate descriptive alternative text for images within documents. The generated text makes the images more accessible and gives later steps, such as chunking and search, a textual description of what each image shows.

1. Add code to configure the embedding generator for creating vector representations:

:::code language="csharp" source="snippets/process-data/Program.cs" id="ConfigureEmbeddingGenerator":::

[Embeddings](../conceptual/embeddings.md) are numerical representations of the semantic meaning of text, which enables vector similarity search.
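
To make "vector similarity" concrete, here's a minimal sketch of cosine similarity, a common way to compare two embeddings. It's an illustration only: `VectorMath` is a hypothetical name, and the quickstart itself relies on the vector store to do this comparison.

```csharp
using System;

// Hypothetical helper for illustration; the quickstart doesn't call it.
public static class VectorMath
{
    // Cosine similarity: the dot product of the two vectors divided by
    // the product of their lengths. The result ranges from -1 to 1.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
        {
            throw new ArgumentException("Vectors must have the same length.");
        }

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }

        // Treat a zero-length vector as having no similarity to anything.
        if (normA == 0 || normB == 0)
        {
            return 0;
        }

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

Vectors that point in the same direction score 1, and perpendicular vectors score 0, so texts with related meaning are expected to score closer to 1.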

1. Add code to configure the chunker that splits documents into semantic chunks:

:::code language="csharp" source="snippets/process-data/Program.cs" id="ConfigureChunker":::

The <xref:Microsoft.Extensions.DataIngestion.Chunkers.SemanticSimilarityChunker> intelligently splits documents by analyzing the semantic similarity between sentences, ensuring that related content stays together. This process produces chunks that preserve meaning and context better than simple character or token-based chunking.
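
The general idea of similarity-based chunking can be sketched as follows. This is a simplified illustration and not the library's actual algorithm: `SimilarityChunkingSketch` is a hypothetical name, the vectors are assumed to be precomputed and normalized to unit length (so the dot product equals cosine similarity), and token limits are ignored.

```csharp
using System;
using System.Collections.Generic;

// Simplified illustration; not how SemanticSimilarityChunker is implemented.
public static class SimilarityChunkingSketch
{
    // Groups consecutive sentences into one chunk until the similarity
    // between neighboring sentences drops below the threshold.
    public static List<List<string>> Chunk(
        IReadOnlyList<string> sentences,
        IReadOnlyList<float[]> unitVectors,
        double threshold)
    {
        if (sentences.Count != unitVectors.Count)
        {
            throw new ArgumentException("Each sentence needs exactly one vector.");
        }

        var chunks = new List<List<string>>();
        for (int i = 0; i < sentences.Count; i++)
        {
            bool startNewChunk =
                i == 0 || Dot(unitVectors[i - 1], unitVectors[i]) < threshold;
            if (startNewChunk)
            {
                chunks.Add(new List<string>());
            }

            chunks[chunks.Count - 1].Add(sentences[i]);
        }

        return chunks;
    }

    // Dot product; equals cosine similarity for unit-length vectors.
    private static double Dot(float[] a, float[] b)
    {
        double sum = 0;
        for (int i = 0; i < Math.Min(a.Length, b.Length); i++)
        {
            sum += (double)a[i] * b[i];
        }

        return sum;
    }
}
```

A higher threshold produces more, smaller chunks because a smaller drop in similarity is enough to start a new one.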

1. Add code to configure the chunk processor that generates summaries:

:::code language="csharp" source="snippets/process-data/Program.cs" id="ConfigureChunkProcessor":::

The <xref:Microsoft.Extensions.DataIngestion.SummaryEnricher> automatically generates concise summaries for each chunk, which can improve retrieval accuracy by providing a high-level overview of the content.

1. Add code to configure the SQLite vector store for storing embeddings:

:::code language="csharp" source="snippets/process-data/Program.cs" id="ConfigureVectorStore":::

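The vector store saves each chunk together with its embedding so that the chunks can be searched by semantic similarity. The `dimensionCount` value of 1536 matches the default embedding size of the `text-embedding-3-small` model.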

1. Add code to compose all the components into a complete pipeline:

:::code language="csharp" source="snippets/process-data/Program.cs" id="ComposePipeline":::

The <xref:Microsoft.Extensions.DataIngestion.IngestionPipeline`1> combines all the components into a cohesive workflow that processes documents from start to finish.

1. Add code to process documents from a directory:

:::code language="csharp" source="snippets/process-data/Program.cs" id="ProcessDocuments":::

The pipeline processes all Markdown files in the `./data` directory and reports the status of each document.

1. Add code to enable interactive search of the processed documents:

:::code language="csharp" source="snippets/process-data/Program.cs" id="SearchVectorStore":::

The search functionality converts user queries into embeddings and finds the most semantically similar chunks in the vector store.
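
Conceptually, the search step scores every stored vector against the query vector and keeps the best matches. The following brute-force sketch illustrates that ranking in memory. It's an assumption-laden illustration: `TopKSearchSketch` is a hypothetical name, vectors are assumed to be unit length, and in the quickstart this work happens inside the vector store rather than in app code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustration of top-k ranking; the quickstart delegates this to the vector store.
public static class TopKSearchSketch
{
    // Scores each record against the query and returns the 'top' best matches,
    // highest score first.
    public static List<(string Content, double Score)> Search(
        float[] query,
        IEnumerable<(string Content, float[] Vector)> records,
        int top)
    {
        return records
            .Select(r => (r.Content, Score: Dot(query, r.Vector)))
            .OrderByDescending(r => r.Score)
            .Take(top)
            .ToList();
    }

    // Dot product; equals cosine similarity for unit-length vectors.
    private static double Dot(float[] a, float[] b)
    {
        double sum = 0;
        for (int i = 0; i < Math.Min(a.Length, b.Length); i++)
        {
            sum += (double)a[i] * b[i];
        }

        return sum;
    }
}
```

The `top` parameter plays the same role as the `top: 3` argument that the sample passes to `SearchAsync`.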

## Run the app

1. Use the `dotnet run` command to run the app:

```dotnetcli
dotnet run
```

The app processes all Markdown files in the `./data` directory and displays the processing status for each document. Once processing is complete, you can enter natural language questions to search the processed content.

1. Enter a question at the prompt to search the data:

```output
Enter your question (or 'exit' to quit): What is data ingestion?
```

The app returns the three most relevant chunks from your documents along with their similarity scores.

1. Type `exit` to quit the application.

## Clean up resources

If you no longer need them, delete the Azure OpenAI resource and model deployment.

1. In the [Azure portal](https://aka.ms/azureportal), navigate to the Azure OpenAI resource.
1. Select the Azure OpenAI resource, and then select **Delete**.

## Next steps

- [Data ingestion concepts](../conceptual/data-ingestion.md)
- [Implement RAG using vector search](../tutorials/tutorial-ai-vector-search.md)
- [Build a .NET AI vector search app](build-vector-search-app.md)
29 changes: 29 additions & 0 deletions docs/ai/quickstarts/snippets/process-data/ProcessData.csproj
@@ -0,0 +1,29 @@
<Project Sdk="Microsoft.NET.Sdk">

<PropertyGroup>
<OutputType>Exe</OutputType>
<TargetFramework>net10.0</TargetFramework>
<ImplicitUsings>enable</ImplicitUsings>
<Nullable>enable</Nullable>
<UserSecretsId>2e2133d7-2b33-48e1-9938-79092b54ead4</UserSecretsId>
</PropertyGroup>

<ItemGroup>
<PackageReference Include="Azure.AI.OpenAI" Version="2.8.0-beta.1" />
<PackageReference Include="Microsoft.Extensions.AI.OpenAI" Version="10.1.1-preview.1.25612.2" />
<PackageReference Include="Microsoft.Extensions.Configuration" Version="10.0.1" />
<PackageReference Include="Microsoft.Extensions.Configuration.UserSecrets" Version="10.0.1" />
<PackageReference Include="Microsoft.Extensions.DataIngestion" Version="10.1.1-preview.1.25612.2" />
<PackageReference Include="Microsoft.Extensions.DataIngestion.Markdig" Version="10.1.1-preview.1.25612.2" />
<PackageReference Include="Microsoft.Extensions.Logging.Console" Version="10.0.1" />
<PackageReference Include="Microsoft.ML.Tokenizers.Data.O200kBase" Version="2.0.0" />
<PackageReference Include="Microsoft.SemanticKernel.Connectors.SqliteVec" Version="1.68.0-preview" />
</ItemGroup>

<ItemGroup>
<None Update="data\sample.md">
<CopyToOutputDirectory>Always</CopyToOutputDirectory>
</None>
</ItemGroup>

</Project>
141 changes: 141 additions & 0 deletions docs/ai/quickstarts/snippets/process-data/Program.cs
@@ -0,0 +1,141 @@
using Azure;
using Azure.AI.OpenAI;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DataIngestion;
using Microsoft.Extensions.DataIngestion.Chunkers;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.VectorData;
using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel.Connectors.SqliteVec;

class DataIngestionExample
{
public static async Task Main()
{
// <ConfigureReader>
// Configure document reader.
IngestionDocumentReader reader = new MarkdownReader();
// </ConfigureReader>

// <ConfigureLogging>
using ILoggerFactory loggerFactory =
LoggerFactory.Create(builder => builder.AddSimpleConsole());
// </ConfigureLogging>

// <ConfigureChatClient>
// Configure IChatClient to use Azure OpenAI.
IConfigurationRoot config = new ConfigurationBuilder()
.AddUserSecrets<DataIngestionExample>()
.Build();

string endpoint = config["AZURE_OPENAI_ENDPOINT"];
string apiKey = config["AZURE_OPENAI_API_KEY"];
string chatModel = "gpt-4o";
string embeddingModel = "text-embedding-3-small";

AzureOpenAIClient azureClient = new(
new Uri(endpoint),
new AzureKeyCredential(apiKey));

IChatClient chatClient =
azureClient.GetChatClient(chatModel).AsIChatClient();
// </ConfigureChatClient>

// <ConfigureDocumentProcessor>
// Configure document processor.
EnricherOptions enricherOptions = new(chatClient)
{
// Enricher failures should not fail the whole ingestion pipeline,
// as they are best-effort enhancements.
// This logger factory can create loggers to log such failures.
LoggerFactory = loggerFactory
};

IngestionDocumentProcessor imageAlternativeTextEnricher =
new ImageAlternativeTextEnricher(enricherOptions);
// </ConfigureDocumentProcessor>

// <ConfigureEmbeddingGenerator>
// Configure embedding generator.
IEmbeddingGenerator<string, Embedding<float>> embeddingGenerator =
azureClient.GetEmbeddingClient(embeddingModel).AsIEmbeddingGenerator();
// </ConfigureEmbeddingGenerator>

// <ConfigureChunker>
// Configure chunker to split text into semantic chunks.
IngestionChunkerOptions chunkerOptions = new(TiktokenTokenizer.CreateForModel(chatModel))
{
MaxTokensPerChunk = 2000,
OverlapTokens = 0
};

IngestionChunker<string> chunker =
new SemanticSimilarityChunker(embeddingGenerator, chunkerOptions);
// </ConfigureChunker>

// <ConfigureChunkProcessor>
// Configure chunk processor to generate summaries for each chunk.
IngestionChunkProcessor<string> summaryEnricher = new SummaryEnricher(enricherOptions);
// </ConfigureChunkProcessor>

// <ConfigureVectorStore>
// Configure SQLite Vector Store.
using SqliteVectorStore vectorStore = new(
"Data Source=vectors.db;Pooling=false",
new()
{
EmbeddingGenerator = embeddingGenerator
});

// The writer requires the embedding dimension count to be specified.
using VectorStoreWriter<string> writer = new(
vectorStore,
dimensionCount: 1536,
new VectorStoreWriterOptions { CollectionName = "data" });
// </ConfigureVectorStore>

// <ComposePipeline>
// Compose data ingestion pipeline
using IngestionPipeline<string> pipeline =
new(reader, chunker, writer, loggerFactory: loggerFactory)
{
DocumentProcessors = { imageAlternativeTextEnricher },
ChunkProcessors = { summaryEnricher }
};
// </ComposePipeline>

// <ProcessDocuments>
await foreach (IngestionResult result in pipeline.ProcessAsync(
new DirectoryInfo("./data"),
searchPattern: "*.md"))
{
Console.WriteLine($"Completed processing '{result.DocumentId}'. " +
$"Succeeded: '{result.Succeeded}'.");
}
// </ProcessDocuments>

// <SearchVectorStore>
// Search the vector store collection and display results
VectorStoreCollection<object, Dictionary<string, object?>> collection =
writer.VectorStoreCollection;

while (true)
{
Console.Write("Enter your question (or 'exit' to quit): ");
string? searchValue = Console.ReadLine();
if (string.IsNullOrEmpty(searchValue) || searchValue == "exit")
{
break;
}

Console.WriteLine("Searching...\n");
await foreach (VectorSearchResult<Dictionary<string, object?>> result in
collection.SearchAsync(searchValue, top: 3))
{
Console.WriteLine($"Score: {result.Score}\n\tContent: {result.Record["content"]}");
}
}
// </SearchVectorStore>
}
}
18 changes: 18 additions & 0 deletions docs/ai/quickstarts/snippets/process-data/data/sample.md
@@ -0,0 +1,18 @@
# Sample Document

This is a sample document for testing the data ingestion pipeline.

## Introduction

Data ingestion is the process of collecting and preparing data for AI applications.

## Key Features

- Document reading
- AI-powered enrichment
- Semantic chunking
- Vector storage

## Conclusion

These building blocks make it easy to create data ingestion pipelines.
3 changes: 2 additions & 1 deletion docs/ai/quickstarts/snippets/structured-output/Program.cs
@@ -9,9 +9,10 @@
.Build();

string endpoint = config["AZURE_OPENAI_ENDPOINT"];
-string model = config["AZURE_OPENAI_GPT_NAME"];
string tenantId = config["AZURE_TENANT_ID"];
+
+string model = "gpt-4o";

// Get a chat client for the Azure OpenAI endpoint.
AzureOpenAIClient azureClient =
new(
Original file line number Diff line number Diff line change
@@ -10,7 +10,7 @@

string endpoint = config["AZURE_OPENAI_ENDPOINT"];
string apiKey = config["AZURE_OPENAI_API_KEY"];
-string model = config["AZURE_OPENAI_GPT_NAME"];
+string model = "gpt-image-1";

// Create the Azure OpenAI client and convert to IImageGenerator.
AzureOpenAIClient azureClient = new(