Middleware in Microsoft Agent Framework: Logging, Caching, and Custom Pipelines

When you start building AI agents that handle real user traffic, raw model calls are rarely enough. You need logging to understand what's happening inside each turn. You need caching to cut API costs on repeated queries. You need rate limiting to protect your infrastructure from runaway requests. This is exactly where middleware pipelines come in -- and the middleware in Microsoft Agent Framework (MAF) gives you a composable, extensible way to layer these concerns cleanly on top of your IChatClient calls. Note that MAF is in public preview at version 1.0.0-rc1, so APIs may shift before the general availability release.

If you've worked with ASP.NET Core before, the concept of middleware will feel immediately familiar. Just like HTTP requests flow through a sequence of middleware components before hitting your controller -- and responses flow back through the same chain -- AI agent turns flow through a pipeline of IChatClient wrappers before reaching the model. Each wrapper can inspect, modify, short-circuit, or augment both the request and the response. This architecture makes it straightforward to add cross-cutting concerns without tangling them into your core agent logic.

What Middleware Means in AI Agent Pipelines

In a typical AI agent setup, you have a chat client that sends messages to a language model and returns completions. Middleware sits between your application code and that model call. Each middleware component wraps the next one in the chain, forming a pipeline.

Each step in the pipeline gets to:

  • Inspect the incoming messages before they reach the model
  • Modify the messages (add system context, sanitize content, inject metadata)
  • Short-circuit the call entirely (return a cached response, block a rate-limited request)
  • Post-process the response before it's returned to the caller
  • Observe timing and telemetry without altering behavior

This is the decorator or interceptor pattern, and it maps cleanly to what ASP.NET Core developers already know from IMiddleware. The key difference in the AI agent world is that the "request" is a list of ChatMessage objects and the "response" is a ChatResponse -- but the structural pattern is identical.
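The decorator pattern behind this is small enough to sketch in plain C#. The types below (IChat, EchoChat, TimingChat) are simplified stand-ins invented for illustration -- not MAF or Microsoft.Extensions.AI types -- but they show the same structure: a wrapper holds an inner client, runs logic around the forwarded call, and the response flows back out through the same layer.

```csharp
using System;
using System.Diagnostics;

// Simplified stand-in for a chat client: one call in, one response out.
public interface IChat
{
    string Send(string message);
}

// Terminal "model" at the end of the chain.
public sealed class EchoChat : IChat
{
    public string Send(string message) => $"echo: {message}";
}

// A decorator: wraps an inner IChat, observes timing, forwards the call.
public sealed class TimingChat : IChat
{
    private readonly IChat _inner;

    public TimingChat(IChat inner) => _inner = inner;

    public string Send(string message)
    {
        var sw = Stopwatch.StartNew();
        var response = _inner.Send(message);   // forward down the chain
        sw.Stop();
        Console.WriteLine($"turn took {sw.ElapsedMilliseconds} ms");
        return response;                       // response flows back out
    }
}
```

Composing `new TimingChat(new EchoChat())` gives a two-layer pipeline; every additional decorator is just another wrap around the previous one, which is exactly what the builder pattern below automates.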

If you've explored Building AI Agents with Semantic Kernel, you'll recognize this composability concept. Semantic Kernel has its own filter and hook system. MAF approaches the same problem through the IChatClient abstraction from Microsoft.Extensions.AI, which keeps the middleware layer technology-agnostic and familiar to the broader .NET ecosystem.

MAF's Middleware Pipeline Types

Middleware in Microsoft Agent Framework flows through two main layers. The first is the IChatClient pipeline provided by Microsoft.Extensions.AI. The second is MAF-specific agent run processing layered around agent.RunAsync() calls and the AgentThread that carries conversation state.

The IChatClient Pipeline (UseXxx Pattern)

The IChatClient pipeline is the most powerful option for cross-cutting infrastructure concerns. You build it using the fluent AsBuilder() + UseXxx() + Build() pattern. Each UseXxx() call wraps the previous client in a new middleware layer.

The built-in middleware available from Microsoft.Extensions.AI includes:

  • UseLogging() -- structured logging of every request and response turn
  • UseDistributedCache() -- response caching keyed on the request content, using any IDistributedCache implementation
  • UseFunctionInvocation() -- automatic tool and function call execution

You can also write completely custom middleware by extending DelegatingChatClient -- a base class that forwards calls to an inner client and lets you override GetResponseAsync and GetStreamingResponseAsync.

Agent Run Hooks

For agent-level concerns that span multiple turns, MAF tracks conversation state in an AgentThread that you pass into agent.RunAsync(). You can attach pre- and post-turn processing logic at this layer to track conversational context, validate inputs across a conversation, or emit telemetry events per agent turn rather than per raw model call.
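The shape of a run hook is easy to sketch without depending on any MAF API. The helper below is a hypothetical illustration (WithHooks is not a real framework method): it wraps whatever function executes an agent turn with callbacks that fire before and after.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical sketch of the run-hook pattern, not a real MAF API:
// wrap an agent's turn function with pre- and post-turn callbacks.
public static class AgentRunHooks
{
    public static Func<string, Task<string>> WithHooks(
        Func<string, Task<string>> runTurn,
        Action<string> onTurnStarting,
        Action<string, string> onTurnCompleted)
    {
        return async prompt =>
        {
            onTurnStarting(prompt);                // validate/inspect input
            var response = await runTurn(prompt);  // the actual agent turn
            onTurnCompleted(prompt, response);     // emit telemetry, audit
            return response;
        };
    }
}
```

Whatever the concrete MAF surface ends up being at GA, this is the mental model: hooks fire once per agent turn, not once per underlying model call.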

Session Hooks and Event Handling in GitHub Copilot SDK covers a similar concept in the GitHub Copilot SDK -- the pattern of intercepting lifecycle events is common across modern .NET AI frameworks, and the mental model transfers well.

Implementing Logging Middleware

Logging is the first middleware most teams add. Without it, debugging agent behavior in production is nearly impossible. The UseLogging() extension from Microsoft.Extensions.AI handles this automatically when you wire it into the pipeline -- it logs request messages and response content at configurable verbosity levels using the standard ILogger abstraction you already know from ASP.NET Core.

Here's how to set up structured logging middleware for an MAF agent:

using System;
using System.Threading.Tasks;
using Microsoft.Agents.AI;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.Logging;
using OpenAI;

public class LoggingAgentSetup
{
    public static IChatClient BuildLoggingClient(
        string openAiApiKey,
        ILoggerFactory loggerFactory)
    {
        // Build a chat client with logging middleware injected
        IChatClient client = new OpenAIClient(openAiApiKey)
            .GetChatClient("gpt-4o")
            .AsIChatClient()
            .AsBuilder()
            .UseLogging(loggerFactory)   // captures every request + response
            .UseFunctionInvocation()     // handles tool calls automatically
            .Build();

        return client;
    }

    public static async Task RunWithLogging(
        string openAiApiKey,
        ILoggerFactory loggerFactory)
    {
        var client = BuildLoggingClient(openAiApiKey, loggerFactory);

        // Wrap as an MAF agent with instructions
        var agent = client.CreateAIAgent(
            instructions: "You are a helpful assistant that answers concisely.");

        var thread = agent.GetNewThread();
        AgentRunResponse response = await agent.RunAsync(
            "Explain middleware in three sentences.",
            thread);

        Console.WriteLine(response.Text);
    }
}

The UseLogging(loggerFactory) call does the heavy lifting. Every GetResponseAsync invocation going through this client logs the outbound messages, the model used, completion tokens consumed, and the returned content. When you route those logs to a structured sink -- Application Insights, Seq, or any other provider -- you get a full audit trail of every agent interaction without writing a single line of custom logging code.

One detail worth noting: UseLogging() emits invocation events at Debug level, and the full request and response content -- which can contain sensitive data -- only at Trace level. Adjust your logging provider's minimum level accordingly so you don't flood production logs while still capturing what you need for diagnostics.
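In an ASP.NET Core host, this per-category tuning usually lives in appsettings.json. A sketch of what that might look like, assuming the middleware logs under a category beginning with "Microsoft.Extensions.AI" (worth verifying against your emitted logs before relying on it):

```json
{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.Extensions.AI": "Debug"
    }
  }
}
```

In production you would typically leave the category at Information or Debug and only drop to Trace temporarily while diagnosing a specific issue.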

Implementing Caching Middleware

Caching is where middleware pays off quickly in AI applications. LLM API calls carry non-trivial cost in both latency and money. When users ask the same question repeatedly -- think FAQ bots, classification pipelines, or deterministic summarization tasks -- returning a cached response saves real resources.

MAF supports distributed caching through UseDistributedCache(), which hooks into any IDistributedCache implementation. Redis, SQL Server, in-memory, or any custom provider you wire up via dependency injection will work. The cache key is derived from the input messages, so identical prompts get cache hits automatically.

using Microsoft.Agents.AI;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.Caching.Distributed;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;
using OpenAI;

// Static class so AddCachingAgent can be an extension method
public static class CachingAgentSetup
{
    public static IChatClient BuildCachingClient(
        string openAiApiKey,
        IDistributedCache cache,
        ILoggerFactory loggerFactory)
    {
        IChatClient client = new OpenAIClient(openAiApiKey)
            .GetChatClient("gpt-4o")
            .AsIChatClient()
            .AsBuilder()
            .UseLogging(loggerFactory)
            .UseDistributedCache(cache)  // serve repeated prompts from cache
            .UseFunctionInvocation()
            .Build();

        return client;
    }

    public static IServiceCollection AddCachingAgent(
        this IServiceCollection services,
        string openAiApiKey)
    {
        // Register Redis as the distributed cache backend
        // (requires the Microsoft.Extensions.Caching.StackExchangeRedis package)
        services.AddStackExchangeRedisCache(options =>
        {
            options.Configuration = "localhost:6379";
            options.InstanceName = "AgentCache:";
        });

        services.AddSingleton<IChatClient>(sp =>
        {
            var cache = sp.GetRequiredService<IDistributedCache>();
            var loggerFactory = sp.GetRequiredService<ILoggerFactory>();
            return BuildCachingClient(openAiApiKey, cache, loggerFactory);
        });

        return services;
    }
}

The caching middleware checks the distributed cache before forwarding the request downstream. On a cache hit, the full cached response is returned immediately -- no API call, no latency, no token cost. On a miss, the request proceeds through the pipeline as normal and the response is stored for future use.

By default, cached entries don't expire unless you configure expiration. For most agent use cases, you'll want a sliding expiration or an absolute TTL that matches your data freshness requirements -- configure this through UseDistributedCache()'s configuration delegate or by applying entry options at the cache layer itself.
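One way to apply a TTL regardless of what the middleware exposes is a small decorator at the cache layer. The sketch below uses a simplified stand-in interface (ISimpleCache, TtlCache are illustrative names, not the real IDistributedCache surface) so the expiration logic itself is clear; the same idea transfers to a decorator over IDistributedCache.

```csharp
using System;
using System.Collections.Concurrent;

// Simplified stand-in cache interface -- the real pipeline uses IDistributedCache.
public interface ISimpleCache
{
    byte[]? Get(string key);
    void Set(string key, byte[] value);
}

// Decorator that stamps every entry with an absolute TTL and treats
// expired entries as misses -- the same idea as configuring entry
// options on the caching middleware.
public sealed class TtlCache : ISimpleCache
{
    private readonly TimeSpan _ttl;
    private readonly Func<DateTimeOffset> _clock;   // injectable for testing
    private readonly ConcurrentDictionary<string, (byte[] Value, DateTimeOffset Expires)> _store = new();

    public TtlCache(TimeSpan ttl, Func<DateTimeOffset>? clock = null)
    {
        _ttl = ttl;
        _clock = clock ?? (() => DateTimeOffset.UtcNow);
    }

    public byte[]? Get(string key)
    {
        if (_store.TryGetValue(key, out var entry) && entry.Expires > _clock())
            return entry.Value;
        _store.TryRemove(key, out _);   // lazily evict expired entries
        return null;
    }

    public void Set(string key, byte[] value) =>
        _store[key] = (value, _clock() + _ttl);
}
```

The injectable clock makes the expiration behavior deterministic in tests, which is worth copying into any real cache decorator you write.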

For more on building tools that cooperate with these pipelines, see Custom AI Tools with AIFunctionFactory -- those tools plug directly into UseFunctionInvocation() in the same pipeline.

Custom Middleware: Rate Limiting and Content Filtering

The built-in middleware handles common infrastructure cases. For everything else, you extend DelegatingChatClient. This base class intercepts GetResponseAsync calls and lets you run arbitrary logic before and after forwarding to the inner client.

Here's a custom middleware that combines per-client rate limiting with basic input sanitization to guard against prompt injection:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;

public class RateLimitingChatClient : DelegatingChatClient
{
    private readonly int _maxRequestsPerMinute;
    private readonly ConcurrentQueue<DateTimeOffset> _requestTimestamps = new();

    public RateLimitingChatClient(IChatClient innerClient, int maxRequestsPerMinute)
        : base(innerClient)
    {
        _maxRequestsPerMinute = maxRequestsPerMinute;
    }

    public override async Task<ChatResponse> GetResponseAsync(
        IEnumerable<ChatMessage> messages,
        ChatOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        EnforceRateLimit();

        // Sanitize user inputs before forwarding downstream
        var sanitized = messages
            .Select(m => new ChatMessage(
                m.Role,
                SanitizeContent(m.Text ?? string.Empty)))
            .ToList();

        return await base.GetResponseAsync(sanitized, options, cancellationToken);
    }
    }

    private void EnforceRateLimit()
    {
        var now = DateTimeOffset.UtcNow;
        var windowStart = now.AddMinutes(-1);

        // Prune timestamps older than the rolling window
        while (_requestTimestamps.TryPeek(out var oldest) && oldest < windowStart)
            _requestTimestamps.TryDequeue(out _);

        if (_requestTimestamps.Count >= _maxRequestsPerMinute)
        {
            throw new InvalidOperationException(
                $"Rate limit exceeded -- max {_maxRequestsPerMinute} requests per minute.");
        }

        _requestTimestamps.Enqueue(now);
    }

    private static string SanitizeContent(string input)
    {
        // Strip common prompt injection patterns -- extend for your threat model
        return input
            .Replace("ignore previous instructions", string.Empty,
                StringComparison.OrdinalIgnoreCase)
            .Replace("disregard all prior", string.Empty,
                StringComparison.OrdinalIgnoreCase)
            .Trim();
    }
}

This pattern is flexible. You can extend DelegatingChatClient to implement content classification (route sensitive queries to a human review queue), retry logic with exponential backoff, custom observability (emit OpenTelemetry spans), or A/B testing (split traffic between model versions based on feature flags). Each concern stays in its own class and composes cleanly with everything else in the pipeline.
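Retry with exponential backoff is worth sketching because the delay schedule is easy to get wrong. The helper below is generic stdlib C# (Retry is an illustrative name, not a framework type); in a DelegatingChatClient you would call ExecuteAsync from an overridden GetResponseAsync around the base.GetResponseAsync call.

```csharp
using System;
using System.Threading.Tasks;

// Generic retry-with-backoff helper, usable from any custom middleware.
public static class Retry
{
    // Pure delay schedule: base * 2^attempt seconds, capped -- kept as a
    // separate function so the schedule itself is trivially testable.
    public static TimeSpan DelayForAttempt(int attempt, int baseSeconds = 1, int capSeconds = 30) =>
        TimeSpan.FromSeconds(Math.Min(capSeconds, baseSeconds * (1 << attempt)));

    public static async Task<T> ExecuteAsync<T>(
        Func<Task<T>> action, int maxAttempts = 3, Func<int, Task>? delay = null)
    {
        delay ??= attempt => Task.Delay(DelayForAttempt(attempt));
        for (var attempt = 0; ; attempt++)
        {
            try
            {
                return await action();
            }
            catch (Exception) when (attempt < maxAttempts - 1)
            {
                await delay(attempt);   // back off before the next try
            }
        }
    }
}
```

In real middleware you would narrow the catch to transient failures (timeouts, 429s) rather than catching every Exception, and honor the caller's CancellationToken inside the delay.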

Streaming Responses with GitHub Copilot SDK shows how token-level streaming works in a similar SDK context. If you need streaming support in your custom middleware, override GetStreamingResponseAsync in addition to GetResponseAsync -- otherwise streaming calls pass straight through to the inner client via the base implementation, skipping your custom logic entirely.

Middleware Registration and Ordering

Order matters in middleware pipelines. The order you register middleware in the builder is the order it executes on the inbound request path. On the response path, execution unwinds in reverse order. This means the outermost middleware sees both the first outbound request and the last inbound response.

For practical agent pipelines, the recommended ordering is:

  1. Logging -- outermost, so it captures everything including cache hits and rate limit rejections
  2. Rate limiting -- block excessive requests early before they consume cache or downstream resources
  3. Caching -- short-circuit before making expensive API calls
  4. Function invocation -- innermost before the raw model call, so tool execution happens closest to the model

Here's a complete pipeline setup that wires everything together with correct ordering, including an extension method for reusability:

using Microsoft.Agents.AI;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.Caching.Distributed;
using Microsoft.Extensions.Logging;
using OpenAI;

// Reusable extension method for consistent pipeline setup across services
public static class ChatClientBuilderExtensions
{
    public static ChatClientBuilder UseProductionDefaults(
        this ChatClientBuilder builder,
        IDistributedCache cache,
        ILoggerFactory loggerFactory,
        int maxRequestsPerMinute = 120)
    {
        return builder
            .UseLogging(loggerFactory)                        // 1 - outermost
            .Use(inner => new RateLimitingChatClient(         // 2 - reject early
                inner, maxRequestsPerMinute))
            .UseDistributedCache(cache)                       // 3 - cache before API
            .UseFunctionInvocation();                         // 4 - innermost
    }
}

public static class AgentPipelineFactory
{
    public static IChatClient CreateProductionPipeline(
        string openAiApiKey,
        IDistributedCache cache,
        ILoggerFactory loggerFactory,
        int maxRequestsPerMinute = 120)
    {
        return new OpenAIClient(openAiApiKey)
            .GetChatClient("gpt-4o")
            .AsIChatClient()
            .AsBuilder()
            .UseProductionDefaults(cache, loggerFactory, maxRequestsPerMinute)
            .Build();
    }

    public static async Task<string> RunAgentAsync(
        IChatClient pipeline,
        string userPrompt,
        string systemInstructions)
    {
        var agent = pipeline.CreateAIAgent(instructions: systemInstructions);
        var thread = agent.GetNewThread();

        AgentRunResponse response = await agent.RunAsync(userPrompt, thread);
        return response.Text ?? string.Empty;
    }
}

The UseProductionDefaults extension method encapsulates the entire standard pipeline. Any team member can adopt it with a single call and inherit all the infrastructure guarantees -- logging, rate limiting, caching, and tool execution -- without having to understand the internals.

A detail worth double-checking: if you put logging inside the cache layer (closer to the model), you only log actual API calls and miss cache hits entirely. Put logging outside the cache (as shown above) and every request gets logged regardless of whether it was served from cache. Most production teams want the outer position to get a complete picture of total agent traffic and cache hit rates.

How Middleware Composes in Practice

The AsBuilder() pattern is the glue that makes all of this work. Calling AsBuilder() on any IChatClient returns a ChatClientBuilder, which maintains an ordered list of factory delegates. Each UseXxx() call adds a factory to that list, and Build() applies the factories so that the original model client sits innermost and the first middleware you registered ends up outermost, with each instance receiving the next as its inner client.
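The mechanics are worth seeing in miniature. MiniBuilder below is an illustrative reconstruction in plain C# (not Microsoft.Extensions.AI code), using Func&lt;string&gt; as a stand-in for a client: factories are collected in registration order, and Build() walks them backwards so the first-registered factory wraps everything else.

```csharp
using System;
using System.Collections.Generic;

// Mini reconstruction of the builder mechanics, for illustration only.
public sealed class MiniBuilder
{
    private readonly List<Func<Func<string>, Func<string>>> _factories = new();
    private readonly Func<string> _terminal;

    public MiniBuilder(Func<string> terminal) => _terminal = terminal;

    public MiniBuilder Use(Func<Func<string>, Func<string>> factory)
    {
        _factories.Add(factory);   // collected in registration order
        return this;
    }

    public Func<string> Build()
    {
        var pipeline = _terminal;
        // Walk backwards: the last-registered factory wraps the terminal
        // first, so the first-registered factory ends up outermost.
        for (var i = _factories.Count - 1; i >= 0; i--)
            pipeline = _factories[i](pipeline);
        return pipeline;
    }
}
```

Registering a "log" factory first and a "cache" factory second yields log(cache(model)) -- logging on the outside, exactly the ordering the previous section recommends.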

This composability also means you can create environment-specific pipelines easily. A development pipeline might use only UseLogging() with verbose settings and an in-memory cache, while production gets Redis, rate limiting, and structured telemetry. Both share the same UseProductionDefaults extension -- or you define separate extension methods for each environment and swap them at startup.

For teams building more complex systems, Build a Multi-Agent Analysis System covers how multiple agents coordinate work. Having a consistent middleware pipeline across all agents in a multi-agent system is important for coherent observability and uniform policy enforcement -- every agent should log, cache, and rate-limit using the same standards.

If you're coming from a Semantic Kernel background, Semantic Kernel Agents in C# and Semantic Kernel Plugin Best Practices cover Semantic Kernel's filter-based approach to cross-cutting concerns. MAF's IChatClient pipeline maps more directly to the ASP.NET Core middleware mental model, which many .NET developers will find intuitive right away.

Keep in mind: MAF is in public preview at 1.0.0-rc1. The UseXxx API surface, the DelegatingChatClient base class, and the agent run APIs are all subject to change before general availability. Pin your package version explicitly and watch the Microsoft Agent Framework repository for release notes and breaking changes as the framework matures toward GA.


Frequently Asked Questions

What is the difference between IChatClient middleware and agent run hooks in MAF?

IChatClient middleware operates at the model call level -- it wraps every request to GetResponseAsync, regardless of which agent or thread initiated it. Agent run hooks operate at the agent turn level -- they fire once per full agent interaction (a RunAsync call against an AgentThread), not once per underlying model call. For infrastructure concerns like logging and caching, IChatClient middleware is the right layer. For conversational logic like turn validation or session-scoped tracking, hooks are more appropriate.

Does UseDistributedCache work with any IDistributedCache provider?

Yes. UseDistributedCache() accepts any IDistributedCache implementation registered in your DI container. This includes the built-in in-memory provider (AddDistributedMemoryCache()), Redis via StackExchange.Redis, SQL Server, and any third-party provider that implements the interface. The cache key is derived from the input message list, so the middleware works the same way regardless of the backing store.
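Conceptually, a content-derived cache key is just a stable hash of the canonicalized message list. The sketch below illustrates the idea with a hypothetical helper (the real middleware's key derivation is an internal implementation detail and may differ):

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Conceptual sketch: identical message lists produce identical keys,
// so repeated prompts hit the cache regardless of the backing store.
public static class CacheKeys
{
    public static string ForMessages(params (string Role, string Text)[] messages)
    {
        // Canonicalize role + text per message, then hash the whole turn.
        var canonical = string.Join("\n", messages.Select(m => $"{m.Role}:{m.Text}"));
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(canonical));
        return Convert.ToHexString(hash);
    }
}
```

The practical consequence: any change to the message list -- a different system prompt, injected metadata, or sanitized content -- produces a different key, so middleware that mutates messages should sit outside the cache if you want those variants cached separately.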

How do I add streaming support to a custom DelegatingChatClient?

Override GetStreamingResponseAsync in addition to GetResponseAsync. If you only override GetResponseAsync, streaming callers will skip your middleware and use the base DelegatingChatClient implementation, which just forwards to the inner client without your custom logic. For rate limiting, this means your limits won't apply to streaming requests -- so always override both methods if you want complete coverage.

What happens when middleware throws an exception?

Exceptions thrown inside middleware propagate up the call stack normally. Middleware registered above (closer to the caller) can catch them. Middleware below (closer to the model) won't see them. If you want centralized error handling -- logging exceptions, translating error types, or triggering fallback behavior -- put that logic in your outermost middleware layer where it can intercept exceptions from the entire pipeline.

Can I unit test custom DelegatingChatClient middleware in isolation?

Yes, and it's straightforward. DelegatingChatClient takes an IChatClient as its constructor argument, so you can inject a mock or a fake implementation during tests. Use a simple stub that returns a fixed ChatResponse to verify that your middleware logic (rate limiting, sanitization, caching) behaves correctly without making real API calls.

How does middleware ordering affect cache behavior and logging completeness?

If logging is registered after the cache in the builder chain (closer to the model), it only fires on cache misses -- you lose visibility into cache hits. If logging is registered before the cache (farther from the model, as recommended), it fires on every request and you can measure your cache hit rate accurately. As a rule: metrics and logging belong on the outside; short-circuit optimizations like caching belong closer to the model.

Is MAF middleware compatible with Semantic Kernel pipelines?

They operate at different layers. MAF middleware wraps IChatClient from Microsoft.Extensions.AI. Semantic Kernel uses its own filter interfaces -- IFunctionInvocationFilter and IPromptRenderFilter -- for a similar purpose. If you're using Semantic Kernel on top of MAF (which is possible since Semantic Kernel can consume any IChatClient), MAF middleware fires first at the transport layer, and Semantic Kernel filters fire at the orchestration layer above it. They don't conflict -- they complement each other for different concerns.

