PTU Spillover and Token Tracking in Microsoft Foundry

Why spillover matters for cost attribution, chargeback, and observability in provisioned AI deployments

In this post, I’ll walk through how PTU spillover works in Microsoft Foundry, why it complicates token tracking, and what you can do to improve cost attribution when traffic overflows from a provisioned deployment to a standard one.

Let’s start with the basics. When you deploy models in Microsoft Foundry, you can choose different deployment types depending on your throughput, latency, and data residency requirements. At a high level, the common options are:

Global: requests can be processed in any supported Azure region where the model is available.
Data zone: requests are processed within a defined geographic data zone, such as the United States or European Union, depending on the model offering.
Regional: requests are processed only in the specific Azure region where the model is deployed.

For standard deployments, capacity is shared and billing is based on token consumption. For provisioned deployments, you reserve model processing capacity up front using Provisioned Throughput Units (PTUs). PTUs are best suited for workloads that need predictable throughput and latency, especially when traffic is steady enough to justify reserved capacity. Microsoft describes a PTU as a unit of model-processing capacity; the throughput you get from each PTU varies with the model, prompt size, completion size, and overall call shape.

That difference in billing is important. Standard deployments are billed per token, while provisioned deployments are billed hourly based on the number of PTUs deployed. For longer-running production workloads, reservations can reduce PTU costs, but a reservation is a billing construct rather than a guarantee of capacity. Because PTU cost is set by what you deploy rather than by what you consume, many teams total it over a fixed interval, such as a month, and then distribute that cost across applications based on measured token usage.

A simple chargeback model might look like this:

Cost per application = (Application token usage ÷ Total token usage) × Total PTU cost
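To make the formula concrete, here is a minimal Python sketch of the proportional split. The function name, application names, token counts, and the 1000.00 cost figure are all hypothetical and chosen only for illustration.

```python
from decimal import Decimal


def allocate_ptu_cost(token_usage: dict[str, int], total_ptu_cost: Decimal) -> dict[str, Decimal]:
    """Split a fixed PTU cost across applications in proportion to token usage."""
    total_tokens = sum(token_usage.values())
    if total_tokens == 0:
        # Nothing was measured, so there is no basis for a proportional split.
        return {app: Decimal("0") for app in token_usage}
    return {
        app: (Decimal(tokens) / Decimal(total_tokens)) * total_ptu_cost
        for app, tokens in token_usage.items()
    }


# Hypothetical monthly usage and PTU cost.
usage = {"app-a": 600_000, "app-b": 300_000, "app-c": 100_000}
shares = allocate_ptu_cost(usage, Decimal("1000.00"))

for app, cost in shares.items():
    print(f"{app}: {cost:.2f}")
```

Decimal is used here so the shares add back up to the total without floating-point drift in simple cases; a real pipeline would still need a rounding policy for shares that do not divide evenly.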

If you use reservations, the reserved rate should be reflected in your total PTU cost before you allocate that cost back to applications. That can materially change your chargeback model, especially at larger scale. Microsoft’s guidance also notes that reservations for Global, Data Zone, and Regional provisioned deployments are not interchangeable, so your financial model should match the deployment type you actually run.
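As a rough sketch of how the total PTU cost might be assembled before allocation, the Python below combines a reservation charge with hourly billing for any deployed PTUs the reservation does not cover. It assumes the deployment ran unchanged for the whole period and that the reservation is priced per PTU per month. The function name and every rate in it are hypothetical placeholders, not Microsoft prices.

```python
def total_ptu_cost(
    deployed_ptus: int,
    reserved_ptus: int,
    hourly_rate_per_ptu: float,
    monthly_reservation_rate_per_ptu: float,
    hours_in_period: int,
) -> float:
    """Estimate PTU cost for one period: reservation charge plus hourly overage."""
    # The reservation is paid for whether or not the reserved PTUs are deployed.
    reservation_cost = reserved_ptus * monthly_reservation_rate_per_ptu
    # PTUs deployed beyond the reservation are billed at the hourly rate.
    uncovered_ptus = max(deployed_ptus - reserved_ptus, 0)
    hourly_cost = uncovered_ptus * hourly_rate_per_ptu * hours_in_period
    return reservation_cost + hourly_cost


# Hypothetical figures: 100 PTUs deployed, 80 reserved, a 720-hour month.
cost = total_ptu_cost(
    deployed_ptus=100,
    reserved_ptus=80,
    hourly_rate_per_ptu=1.0,
    monthly_reservation_rate_per_ptu=500.0,
    hours_in_period=720,
)
print(cost)
```

Because reservations are not interchangeable across deployment types, a calculation like this would need to run separately for each type you operate.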

Now for the wrinkle: spillover. Spillover lets a provisioned deployment automatically send requests to a paired standard deployment when the provisioned deployment can’t process the request. According to Microsoft’s current documentation, spillover can occur not only when PTU capacity is exhausted and a request would otherwise receive a 429, but also for certain long-context requests that would fail on PTU and for some server-side errors such as 500 or 503. That makes spillover a useful resiliency and continuity feature, but it also creates a cost attribution challenge.

If you rely only on your existing LLM request logs, spillover traffic can be easy to misattribute. A request may be sent to the PTU deployment, but the actual inference can be served by the standard deployment after spillover occurs. Microsoft exposes response headers to help identify this scenario, including x-ms-spillover-from-deployment and x-ms-deployment-name. Logging those headers gives you the data you need to distinguish between traffic served by PTU and traffic that spilled to standard capacity.
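Here is a small Python sketch of how a client or log processor might read those headers. It uses the header names exactly as quoted above, so check them against the responses you actually receive. The ServingInfo class, the classify_response function, and the deployment names are hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Header names as quoted in this post; verify them against live responses.
SPILLOVER_FROM_HEADER = "x-ms-spillover-from-deployment"
SERVED_BY_HEADER = "x-ms-deployment-name"


@dataclass(frozen=True)
class ServingInfo:
    served_by: str                # deployment that actually processed the request
    spilled_from: Optional[str]   # PTU deployment the request overflowed from, if any

    @property
    def is_spillover(self) -> bool:
        return self.spilled_from is not None


def classify_response(headers: Mapping[str, str], requested_deployment: str) -> ServingInfo:
    """Work out which deployment served a request from its response headers."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    spilled_from = lowered.get(SPILLOVER_FROM_HEADER)
    # Fall back to the requested deployment when no serving header is present.
    served_by = lowered.get(SERVED_BY_HEADER, requested_deployment)
    return ServingInfo(served_by=served_by, spilled_from=spilled_from)


# Hypothetical deployment names.
normal = classify_response({"x-ms-deployment-name": "gpt-ptu"}, "gpt-ptu")
spilled = classify_response(
    {
        "X-MS-Spillover-From-Deployment": "gpt-ptu",
        "X-MS-Deployment-Name": "gpt-standard",
    },
    "gpt-ptu",
)
print(normal.is_spillover, spilled.is_spillover, spilled.served_by)
```

Treating the mere presence of the spillover header as the signal keeps the check independent of how deployments happen to be named.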

One practical way to capture those headers is through Azure API Management diagnostics.

In APIM, you can enable diagnostics and configure Azure Monitor logging for frontend and backend responses, then explicitly add the spillover headers to the list of headers to log.

Because APIM requires you to specify individual headers, this gives you a focused way to capture only the metadata you need for downstream reporting and reconciliation. From there, you can update your token accounting pipeline so requests that spill to a standard deployment are attributed correctly instead of being overstated against PTU usage.
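One way the accounting step could look is sketched below in Python. It assumes every record is a request that was originally sent to the PTU deployment, and that each record carries an application name, a token count, and the logged response headers. The record shape, field names, and function name are hypothetical and are not the APIM log schema.

```python
from collections import defaultdict

# Header name as quoted in this post; verify it against your logged data.
SPILLOVER_FROM_HEADER = "x-ms-spillover-from-deployment"


def split_tokens_by_capacity(records):
    """Sum tokens per application, separated into PTU-served and spilled-to-standard."""
    totals = defaultdict(lambda: {"ptu": 0, "standard": 0})
    for record in records:
        headers = {k.lower(): v for k, v in record.get("response_headers", {}).items()}
        # The presence of the spillover header marks the request as served by standard.
        pool = "standard" if SPILLOVER_FROM_HEADER in headers else "ptu"
        totals[record["app"]][pool] += record["total_tokens"]
    return dict(totals)


# Hypothetical log records.
records = [
    {"app": "app-a", "total_tokens": 1200, "response_headers": {}},
    {
        "app": "app-a",
        "total_tokens": 800,
        "response_headers": {"x-ms-spillover-from-deployment": "gpt-ptu"},
    },
    {"app": "app-b", "total_tokens": 500, "response_headers": {}},
]
totals = split_tokens_by_capacity(records)
print(totals)
```

With the split in hand, only the "ptu" totals would feed the proportional PTU allocation, while the "standard" totals would be charged at the per-token rate of the standard deployment.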

I’ve also been updating my supporting automation to account for this pattern so the reporting pipeline reflects spillover-aware attribution. If you’re using PTU for predictable performance but still need burst handling, spillover is a valuable feature—but only if your observability and chargeback model keep pace with how requests are actually served.

Here is a link to my APIM-with-LLMDIagnostics repo, which accounts for this pattern in a new feature branch, feature/spilloverHeader. I’ll be updating the PTUChargeBackWorkflow repo to use this new telemetry in the near future.