Build a Document Q&A App with RAG and Semantic Kernel in C#
Keyword search breaks down the moment your question and the answer use different words. You ask "how does the system handle errors?" but the document says "exception management strategy" -- and you get nothing. Building a document Q&A app with RAG and Semantic Kernel in C# solves this by converting both your question and your documents into vectors and finding answers by meaning, not exact text match.
This is a complete walkthrough of a working .NET 9 console app that loads documents, chunks and embeds them using ITextEmbeddingGenerationService, stores them in an InMemoryVectorStore, retrieves the most relevant chunks via vector similarity search, and generates a grounded answer using a Semantic Kernel prompt function. Every file in the project is shown with real code -- no pseudocode, no gaps.
The full source is in the semantic-kernel-examples/ai-document-qa folder. You can clone it and run it against your own documents with a single command.
What RAG Does (and Why It Belongs in Your Toolkit)
RAG (Retrieval-Augmented Generation) with Semantic Kernel in C# is a pattern where you retrieve relevant context from your own data, inject it into a prompt, and let the model answer from that context only. The prompt constrains the model to the context you supply -- which means grounded answers traceable to your own documents instead of hallucinated generalities.
The pipeline has three stages:
- Indexing -- Load documents, split into chunks, embed each chunk, store in a vector database
- Retrieval -- Embed the question, find the top-k most similar chunks by cosine distance
- Generation -- Build a prompt that includes the retrieved chunks and ask the model to answer from that context only
Every step maps directly to SK types and packages you already have. If you want a deeper look at the concepts before diving into code, the Semantic Kernel in C# complete guide covers RAG alongside plugins and agents at a higher level.
Project Setup and Packages
The project targets .NET 9 and uses four packages:
<PackageReference Include="Microsoft.SemanticKernel" Version="1.72.0" />
<PackageReference Include="Microsoft.SemanticKernel.Connectors.InMemory" Version="1.72.0-preview" />
<PackageReference Include="Microsoft.Extensions.Configuration.Json" Version="9.0.2" />
<PackageReference Include="Microsoft.Extensions.Configuration.Binder" Version="9.0.2" />
<PackageReference Include="Microsoft.Extensions.VectorData.Abstractions" Version="10.0.0" />
Microsoft.SemanticKernel brings in the kernel, chat completion, and text embedding interfaces. Connectors.InMemory provides a zero-infrastructure vector store that lives entirely in process -- no Docker, no Azure resource, no setup. VectorData.Abstractions v10 is pulled as a transitive dependency but must be listed directly so the compiler can find the updated type names (VectorStoreKey, VectorStoreData, VectorStoreVector).
Configuration is split across two sections -- one for chat completion, one for embeddings -- because these typically live in different model deployments:
{
"ChatAI": {
"Type": "azureopenai",
"ModelId": "gpt-4.1",
"Endpoint": "https://your-resource.openai.azure.com/",
"ApiKey": ""
},
"EmbeddingAI": {
"Type": "azureopenai",
"ModelId": "text-embedding-ada-002",
"Endpoint": "https://your-resource.openai.azure.com/",
"ApiKey": ""
}
}
The text embeddings with Semantic Kernel in C# article covers why chat models and embedding models are different deployments and how to configure each. For this app, all secrets go in a gitignored appsettings.Development.json.
Defining the Vector Store Record
The vector store needs a data model with annotated properties. In Microsoft.Extensions.VectorData v10, the attribute names changed from earlier SK versions:
using Microsoft.Extensions.VectorData;
public sealed class DocumentChunk
{
[VectorStoreKey]
public string ChunkId { get; set; } = "";
[VectorStoreData]
public string DocumentName { get; set; } = "";
[VectorStoreData]
public string Content { get; set; } = "";
// 1536 dimensions: compatible with text-embedding-ada-002 and text-embedding-3-small
[VectorStoreVector(1536, DistanceFunction = DistanceFunction.CosineSimilarity)]
public ReadOnlyMemory<float> Embedding { get; set; }
}
[VectorStoreKey] marks the unique identifier. [VectorStoreData] marks filterable data fields. [VectorStoreVector] marks the embedding property -- the first argument is the dimension count, which must match your embedding model's output (1536 for text-embedding-ada-002 and text-embedding-3-small). The DistanceFunction is set as a named property because the positional syntax changed in v10.
The Semantic Kernel vector store in C# article covers the full attribute reference and how to swap InMemoryVectorStore for Azure AI Search or Qdrant when you move to production.
Loading and Chunking Documents
Splitting documents into the right chunk size is one of the most impactful decisions in a RAG pipeline. Too large and the retrieved context swamps the model. Too small and each chunk loses the surrounding meaning.
For this app, DocumentLoader splits on paragraph boundaries (double newlines) with a 400-word ceiling:
public static class DocumentLoader
{
private const int MaxWordsPerChunk = 400;
public static IEnumerable<DocumentChunk> LoadFromDirectory(string directoryPath)
{
var files = Directory.GetFiles(directoryPath, "*.txt")
.Concat(Directory.GetFiles(directoryPath, "*.md"))
.OrderBy(f => f);
foreach (var filePath in files)
{
var documentName = Path.GetFileName(filePath);
var text = File.ReadAllText(filePath);
foreach (var chunk in SplitIntoChunks(text, documentName))
yield return chunk;
}
}
private static IEnumerable<DocumentChunk> SplitIntoChunks(string text, string documentName)
{
var paragraphs = text
.Split(["\r\n\r\n", "\n\n"], StringSplitOptions.RemoveEmptyEntries)
.Select(p => p.Trim())
.Where(p => p.Length > 0);
var currentChunk = new List<string>();
int currentWordCount = 0;
int chunkIndex = 0;
foreach (var paragraph in paragraphs)
{
int paragraphWords = paragraph
.Split([' ', '\t', '\n', '\r'], StringSplitOptions.RemoveEmptyEntries).Length;
if (currentWordCount + paragraphWords > MaxWordsPerChunk && currentChunk.Count > 0)
{
yield return CreateChunk(documentName, chunkIndex++, currentChunk);
currentChunk.Clear();
currentWordCount = 0;
}
currentChunk.Add(paragraph);
currentWordCount += paragraphWords;
}
if (currentChunk.Count > 0)
yield return CreateChunk(documentName, chunkIndex, currentChunk);
}
private static DocumentChunk CreateChunk(string name, int index, List<string> paragraphs) =>
new()
{
ChunkId = $"{name}::chunk-{index:D4}",
DocumentName = name,
Content = string.Join("\n\n", paragraphs)
};
}
Paragraph-based chunking keeps semantic units intact. A paragraph about "connection pooling" stays together instead of being split mid-explanation. The chunking strategies for RAG with Semantic Kernel in C# article compares paragraph, fixed-size, and semantic chunking with benchmark data -- worth reading when you tune for accuracy on larger document sets.
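A quick way to sanity-check the chunker before embedding anything is to print what it produces. This is an illustrative snippet, not part of the repo (./sample-docs is the sample folder from the project):

```csharp
// Dump each chunk's id, word count, and source file to eyeball chunk sizes
foreach (var chunk in DocumentLoader.LoadFromDirectory("./sample-docs"))
{
    int words = chunk.Content.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;
    Console.WriteLine($"{chunk.ChunkId} -- {words} words from {chunk.DocumentName}");
}
```

If most chunks land well under the 400-word ceiling, your documents have short paragraphs and you may want a larger ceiling; if every chunk hits exactly the ceiling, paragraphs are being packed tightly and retrieval granularity suffers.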
Embedding and Indexing with ITextEmbeddingGenerationService
DocumentIndexer handles the two-part embedding pipeline: batch embed all chunks during indexing, embed the query at search time. Both use ITextEmbeddingGenerationService:
using Microsoft.Extensions.VectorData;
using Microsoft.SemanticKernel.Embeddings;
public sealed class DocumentIndexer
{
private const string CollectionName = "documents";
private readonly ITextEmbeddingGenerationService _embeddingService;
private readonly VectorStoreCollection<string, DocumentChunk> _collection;
public DocumentIndexer(
ITextEmbeddingGenerationService embeddingService,
VectorStore vectorStore)
{
_embeddingService = embeddingService;
_collection = vectorStore.GetCollection<string, DocumentChunk>(CollectionName);
}
public async Task IndexDocumentsAsync(
IEnumerable<DocumentChunk> chunks,
CancellationToken cancellationToken = default)
{
await _collection.EnsureCollectionExistsAsync(cancellationToken);
var chunkList = chunks.ToList();
// Batch embed all content strings in a single API call
var contents = chunkList.Select(c => c.Content).ToList();
var embeddings = await _embeddingService
.GenerateEmbeddingsAsync(contents, cancellationToken: cancellationToken);
for (int i = 0; i < chunkList.Count; i++)
{
chunkList[i].Embedding = embeddings[i];
await _collection.UpsertAsync(chunkList[i], cancellationToken: cancellationToken);
}
}
}
GenerateEmbeddingsAsync accepts a list and returns embeddings in the same order -- one API call for all chunks rather than N separate calls. This matters for startup latency when you have dozens of documents. VectorStoreCollection<string, DocumentChunk> maps to the documents collection in whichever vector store is registered. When you swap InMemoryVectorStore for Azure AI Search in production, this class doesn't change.
VectorStore.GetCollection<K,V>(name) in v10 takes only the collection name when the data model uses attributes -- no schema argument needed. EnsureCollectionExistsAsync is a no-op for InMemoryVectorStore but is the correct call when using a persistent backend.
Vector Similarity Search
Search works by embedding the query and calling SearchAsync on the collection cast to IVectorSearchable<T>:
public async Task<IReadOnlyList<DocumentChunk>> SearchAsync(
string query,
int topK = 3,
CancellationToken cancellationToken = default)
{
var queryEmbeddings = await _embeddingService
.GenerateEmbeddingsAsync([query], cancellationToken: cancellationToken);
var queryVector = queryEmbeddings[0];
// Cast to IVectorSearchable to access SearchAsync in VectorData v10
var searchable = (IVectorSearchable<DocumentChunk>)_collection;
var results = searchable.SearchAsync(
queryVector,
topK, // top is the second positional argument in v10
cancellationToken: cancellationToken);
var chunks = new List<DocumentChunk>();
await foreach (var result in results.WithCancellation(cancellationToken))
chunks.Add(result.Record);
return chunks;
}
IVectorSearchable<T>.SearchAsync<TVector>(vector, top, options?, ct) is the v10 signature. top is now a positional parameter, not a property of VectorSearchOptions<T> -- a breaking change from earlier previews. The method returns IAsyncEnumerable<VectorSearchResult<T>> directly. Each result has a .Record property (the DocumentChunk) and a .Score property (cosine similarity -- for typical embedding vectors this is effectively 0 to 1, where values near 1 mean a near-identical match).
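The store computes that score for you, but the underlying math is plain cosine similarity. As a reference sketch (not code from the app):

```csharp
// Cosine similarity: dot product divided by the product of vector magnitudes.
// ~1.0 = same direction (near-identical meaning), ~0.0 = orthogonal (unrelated).
static float CosineSimilarity(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    float dot = 0f, magA = 0f, magB = 0f;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(magA) * MathF.Sqrt(magB));
}
```

This is why embeddings from different models can never be compared against each other: the similarity is only meaningful between vectors produced in the same embedding space.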
Generating Grounded Answers
QuestionAnswerer takes the retrieved chunks, builds a context string, and invokes a prompt function that constrains the model to answer only from that context:
public sealed class QuestionAnswerer
{
private readonly DocumentIndexer _indexer;
private readonly Kernel _kernel;
private readonly KernelFunction _answerFn;
public QuestionAnswerer(DocumentIndexer indexer, Kernel kernel)
{
_indexer = indexer;
_kernel = kernel;
// Inline prompt: model answers ONLY from the provided context
_answerFn = KernelFunctionFactory.CreateFromPrompt(
"""
You are a helpful assistant answering questions based only on the provided context.
If the context does not contain enough information to answer, say so clearly.
Do not use any knowledge outside the provided context.
Context:
{{$context}}
Question: {{$question}}
Answer:
""",
functionName: "AnswerFromContext");
}
public async Task<string> AnswerAsync(
string question,
int topK = 3,
CancellationToken cancellationToken = default)
{
var relevantChunks = await _indexer.SearchAsync(question, topK, cancellationToken);
if (relevantChunks.Count == 0)
return "No relevant context found in the indexed documents.";
// Format chunks with source attribution
var context = string.Join("\n---\n",
relevantChunks.Select((c, i) => $"[Source {i + 1}: {c.DocumentName}]\n{c.Content}"));
var args = new KernelArguments
{
["context"] = context,
["question"] = question
};
var result = await _kernel.InvokeAsync(_answerFn, args, cancellationToken);
return result.GetValue<string>() ?? "Unable to generate an answer.";
}
}
The "answer only from context" instruction is the critical grounding constraint -- it prevents the model from supplementing retrieved content with training knowledge. The source attribution in the context string ([Source 1: dependency-injection.md]) lets the model reference which document it drew from, which helps when debugging retrieval accuracy.
KernelFunctionFactory.CreateFromPrompt() is the same pattern from the AI task planner and the AI code review bot -- a consistent SK pattern for any prompt-only function. The difference here is the dynamic {{$context}} argument populated from retrieval at runtime.
Wiring the Kernel and Running the App
Program.cs builds the kernel with both AI providers and registers the vector store via DI:
var builder = Kernel.CreateBuilder();
// Chat completion -- for generating answers
builder.AddAzureOpenAIChatCompletion(
deploymentName: chatConfig.ModelId,
endpoint: chatConfig.Endpoint,
apiKey: chatConfig.ApiKey);
// Text embedding -- for indexing and searching
builder.AddAzureOpenAITextEmbeddingGeneration(
deploymentName: embeddingConfig.ModelId,
endpoint: embeddingConfig.Endpoint,
apiKey: embeddingConfig.ApiKey);
// InMemoryVectorStore: zero setup, all in-process
builder.Services.AddSingleton<VectorStore, InMemoryVectorStore>();
var kernel = builder.Build();
// Resolve services from kernel's DI container
var embeddingService = kernel.GetRequiredService<ITextEmbeddingGenerationService>();
var vectorStore = kernel.Services.GetRequiredService<VectorStore>();
var indexer = new DocumentIndexer(embeddingService, vectorStore);
await indexer.IndexDocumentsAsync(DocumentLoader.LoadFromDirectory(docsPath));
var answerer = new QuestionAnswerer(indexer, kernel);
Both ChatAI and EmbeddingAI use the same AIProviderConfig class and the same Type/ModelId/Endpoint/ApiKey shape. The only difference is which builder method you call. AddOpenAITextEmbeddingGeneration and AddOpenAIChatCompletion work the same way for the non-Azure path.
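For the non-Azure path, the registration looks like this -- a sketch with illustrative model names, assuming the same config shape as above:

```csharp
// OpenAI (non-Azure): same builder pattern, no endpoint needed --
// the model name replaces the deployment name
builder.AddOpenAIChatCompletion(
    modelId: "gpt-4.1",
    apiKey: chatConfig.ApiKey);

builder.AddOpenAITextEmbeddingGeneration(
    modelId: "text-embedding-3-small",
    apiKey: embeddingConfig.ApiKey);
```

Everything downstream of the builder -- DocumentIndexer, QuestionAnswerer, the vector store -- is unchanged, since those classes only see the resolved service interfaces.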
With the kernel built, run the app in single-question mode:
dotnet run -- --docs ./sample-docs --question "What is dependency injection?"
Or in interactive mode for iterative Q&A:
dotnet run -- --docs ./sample-docs
# Interactive mode -- type a question and press Enter. Type 'exit' to quit.
RAG Pipeline at a Glance
Every component maps to a specific SK abstraction:
| Stage | Component | SK Type |
|---|---|---|
| Load + Chunk | DocumentLoader | .NET standard file I/O |
| Embed + Store | DocumentIndexer.IndexDocumentsAsync | ITextEmbeddingGenerationService, VectorStoreCollection<K,V> |
| Search | DocumentIndexer.SearchAsync | IVectorSearchable<T>.SearchAsync |
| Answer | QuestionAnswerer.AnswerAsync | KernelFunctionFactory.CreateFromPrompt, Kernel.InvokeAsync |
The InMemoryVectorStore can be swapped for Azure AI Search, Qdrant, or any other SK vector store connector by changing one line in Program.cs -- the indexer and searcher are written against VectorStore and IVectorSearchable<T>, not against any specific backend. The Semantic Kernel vector store article has the exact code for each backend.
Frequently Asked Questions
What embedding model should I use with this app?
text-embedding-ada-002 and text-embedding-3-small both output 1536-dimensional vectors and work without changes to DocumentChunk. text-embedding-3-large outputs 3072 dimensions and requires updating the [VectorStoreVector(3072)] argument. For most document Q&A workloads, text-embedding-3-small gives accuracy comparable to ada-002 at lower cost.
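Concretely, the only code change for text-embedding-3-large is the dimension argument on the vector property -- shown here as a fragment of the DocumentChunk record, with the rest unchanged:

```csharp
// 3072 dimensions to match text-embedding-3-large output
[VectorStoreVector(3072, DistanceFunction = DistanceFunction.CosineSimilarity)]
public ReadOnlyMemory<float> Embedding { get; set; }
```

Remember to re-index after switching models: vectors from different embedding models live in different spaces and cannot be mixed in one collection.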
Can I use a different vector store in production instead of InMemoryVectorStore?
Yes. InMemoryVectorStore is for development and demos -- all data is lost when the process exits. To persist data, swap the registration in Program.cs for Azure AI Search (builder.Services.AddSingleton<VectorStore, AzureAISearchVectorStore>(...)), Qdrant, or any other SK connector. The vector store article covers the connector options.
How do I improve retrieval accuracy when the app returns wrong answers?
Start with chunk size. If chunks are too large (400+ words), the retrieved context dilutes the relevant passage. If too small (under 50 words), each chunk loses surrounding context. Also check the topK value -- increasing from 3 to 5 gives the model more context but increases prompt cost. The chunking strategies article covers fixed-size vs. paragraph vs. semantic chunking and their trade-offs in detail.
Why must the prompt say to use only the provided context?
This is the grounding constraint. Without it, the model will blend retrieved context with training data, producing plausible-sounding answers that aren't actually from your documents. For document Q&A, you always want answers that trace back to specific source chunks -- if the context doesn't contain the answer, the model should say so rather than guess. This constraint is more important than it looks.
Can this app handle PDFs and Word documents?
Not out of the box -- DocumentLoader reads .txt and .md files. For PDFs, add PdfSharpCore or iTextSharp to extract text before passing to SplitIntoChunks. For Word documents, add DocumentFormat.OpenXml. The chunking and embedding logic stays the same either way.
What is ITextEmbeddingGenerationService and why is it marked obsolete?
ITextEmbeddingGenerationService is the SK embedding interface. In SK 1.72.0 it is marked [Obsolete] -- Microsoft is migrating toward IEmbeddingGenerator<string, Embedding<float>> from Microsoft.Extensions.AI. The current interface still works and is the simplest path for this app. To suppress the warning, add <NoWarn>CS0618</NoWarn> to your project file. The eventual replacement is covered in the text embeddings article.
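As a rough sketch of where that migration leads -- assuming the Microsoft.Extensions.AI abstractions; registration details will vary by SK version:

```csharp
using Microsoft.Extensions.AI;

// IEmbeddingGenerator<string, Embedding<float>> is the planned replacement.
// GenerateAsync batches a list of inputs, like GenerateEmbeddingsAsync does today.
static async Task<ReadOnlyMemory<float>> EmbedFirstAsync(
    IEmbeddingGenerator<string, Embedding<float>> generator)
{
    var embeddings = await generator.GenerateAsync(["first chunk", "second chunk"]);
    return embeddings[0].Vector; // same ReadOnlyMemory<float> shape as before
}
```

The batching semantics carry over, so DocumentIndexer's structure survives the migration; only the injected interface and method names change.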
How does this compare to the AI task planner built with Semantic Kernel?
The AI task planner uses SK purely for LLM-to-text pipelines with KernelFunctionFactory.CreateFromPrompt and structured JSON output -- no vector store, no embeddings. This app adds the full retrieval layer. The QuestionAnswerer class uses the same CreateFromPrompt pattern, but the input is dynamically assembled from vector search results rather than a fixed goal string. The two patterns complement each other: pipeline for structured output, RAG for knowledge retrieval.
Build the App, Then Extend It
Clone semantic-kernel-examples/ai-document-qa, copy appsettings.Development.json.example to appsettings.Development.json, fill in your Azure OpenAI or OpenAI credentials (you need both a chat deployment and an embedding deployment), and run:
dotnet run -- --docs ./sample-docs --question "What is semantic kernel?"
The sample docs cover dependency injection and Semantic Kernel basics. Replace them with your own documentation, architecture decision records, or internal knowledge base. The pipeline handles chunking, embedding, and retrieval automatically.
To move to production: swap InMemoryVectorStore for Azure AI Search, index documents once rather than on every startup (a persistent store keeps the collection between restarts, so IndexDocumentsAsync only needs to run when documents change), and add a caching layer for repeated questions. The core RAG logic in DocumentIndexer and QuestionAnswerer does not change.
Building a document Q&A app with RAG and Semantic Kernel in C# is genuinely one of the faster AI integrations to stand up with SK. The hardest part is usually provisioning the embedding deployment -- once that's done, indexing a hundred-page knowledge base and getting accurate answers in under 500 lines of C# is straightforward.

