Architecting in Azure

Explore effective Azure architecture strategies, learn what works and what breaks, and understand the reasons behind success and failure in real-world applications.

Implementing Token‑Based Chargeback for Azure OpenAI PTU Deployments Using API Management

Disclaimer: The views and opinions expressed here are my own and do not necessarily reflect those of my employer or any organization with which I am affiliated.

Hello, my name is Scott Ray, and I’m a Principal Cloud Solution Architect at Microsoft. Recently, I’ve been working with customers who are adopting Provisioned Throughput Unit (PTU) reservations to reduce cost and deliver more consistent performance for their AI workloads.

There is existing guidance to help customers size PTUs, deploy models, and analyze overall throughput utilization. However, one topic that repeatedly surfaces—especially for large enterprises—is how to handle chargeback when multiple applications share a single PTU deployment.

In this article, I’ll walk through a practical, repeatable solution for implementing token‑based chargeback using Azure API Management (APIM), Log Analytics, and a Logic App to automate reporting.


The Chargeback Challenge with PTUs

Many of the customers I work with need to allocate costs associated with their PTU deployments back to individual applications or business units. This becomes challenging because Azure OpenAI—like other Cognitive Services—only exposes usage metrics at the deployment level.

Out of the box, there is no native mechanism to determine which application consumed which tokens when multiple callers share the same deployment. For FinOps and platform teams, this lack of visibility makes accurate cost allocation difficult.


Unlocking Token‑Level Visibility with API Management

This is where Azure API Management comes into play.

By placing Azure OpenAI behind an APIM instance, you can enable LLM logging on the API gateway. This allows you to capture detailed telemetry for every request, including:

  • Prompt tokens
  • Completion tokens
  • Total token usage
  • Prompt and response payloads
  • Model name

In addition to this LLM‑specific data, APIM provides standard gateway telemetry that includes information such as the caller’s IP address, backend ID (useful when failing over to Pay‑As‑You‑Go deployments), region, and custom trace records—such as client or application identifiers.

My colleague Matt Felton has an excellent walkthrough on enabling prompt and response logging in APIM, and his repository covers enabling LLM logging, which is the first step on this journey. I won’t repeat that material here; instead, I’ll focus on how to use the logged data to build chargeback reports.


Understanding the LLM Gateway Logs

After enabling LLM logging in APIM, a new table becomes available in your Log Analytics workspace: ApiManagementGatewayLogsLLM.

This table provides granular insight into each request, including token counts and payload details. One important nuance to be aware of is that these log records can be chunked, meaning a single request may generate multiple entries with different sequence numbers.

The key detail for chargeback purposes is that sequence number 0 always contains token usage values. Subsequent records related to the same request will have zero values for token counts.
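You can confirm this chunking behavior in your own workspace with a quick sanity-check query (a sketch; adjust the time window to suit your traffic). It counts log records per request and shows that, among chunked requests, only one record carries token counts:

KQL

ApiManagementGatewayLogsLLM
| where TimeGenerated >= ago(1h)
| summarize
    Records = count(),
    TokenBearingRecords = countif(TotalTokens > 0)
    by CorrelationId
| where Records > 1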


Correlating Token Usage with APIM Products

To attribute token consumption to applications, we take advantage of two APIM features:

  • Products, which group APIs into logical offerings
  • Subscriptions, which tie callers to specific products

By enabling a diagnostic setting for API Management Gateway logs and sending that data to Log Analytics, we can correlate gateway telemetry with LLM logs using the CorrelationId property.

This allows us to join the two tables and summarize token usage per product and per model, which maps cleanly to application or team ownership.


Summarizing Token Usage with KQL

Below is an example KQL query that joins the gateway and LLM logs, filters on successful requests, and produces a per‑product token summary over the last 24 hours:

KQL

ApiManagementGatewayLogs
| where TimeGenerated >= ago(24h)
| join kind=inner ApiManagementGatewayLogsLLM on CorrelationId
| where SequenceNumber == 0 and IsRequestSuccess == true
| summarize
    TotalTokens = sum(TotalTokens),
    CompletionTokens = sum(CompletionTokens),
    PromptTokens = sum(PromptTokens),
    FirstSeen = min(TimeGenerated),
    LastSeen = max(TimeGenerated),
    Regions = make_set(Region, 8),
    CallerIpAddresses = make_set(CallerIpAddress, 8),
    Caches = make_set(Cache, 8),
    BackendIds = make_set(BackendId, 8),
    Calls = count()
    by ProductId, ModelName
| project
    ProductId,
    ModelName,
    PromptTokens,
    CompletionTokens,
    TotalTokens,
    Calls,
    FirstSeen,
    LastSeen,
    Regions,
    CallerIpAddresses,
    Caches,
    BackendIds
| order by TotalTokens desc

The result is a clean, auditable dataset showing exact token consumption per APIM product, which can be used directly for chargeback calculations.
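To make the chargeback calculation itself concrete, here is a minimal Python sketch that prorates a fixed PTU reservation cost across products by their share of total token usage. The product names, token totals, and monthly cost are all hypothetical; in practice the token totals would come from the query results above.

```python
# Hypothetical per-product token totals, e.g. exported from the KQL query results.
product_tokens = {
    "sales-assistant": 1_200_000,
    "support-bot": 600_000,
    "internal-search": 200_000,
}

# Hypothetical fixed monthly cost of the shared PTU reservation.
ptu_monthly_cost = 10_000.00

total_tokens = sum(product_tokens.values())

# Prorate the fixed reservation cost by each product's share of total tokens.
chargeback = {
    product: round(ptu_monthly_cost * tokens / total_tokens, 2)
    for product, tokens in product_tokens.items()
}

print(chargeback)
# {'sales-assistant': 6000.0, 'support-bot': 3000.0, 'internal-search': 1000.0}
```

Because the PTU cost is fixed, the allocations always sum back to the reservation cost, which keeps the model defensible for FinOps review.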


Automating Chargeback Reporting with Logic Apps

Once the data is aggregated, the final step is automation.

Using Azure Logic Apps, we can:

  1. Run the Log Analytics query on a schedule
  2. Create a CSV from the query results
  3. Store the output in Azure Blob Storage
  4. Handle failures gracefully at each step

This enables daily or hourly chargeback reporting without manual intervention. The output can easily be consumed by downstream systems such as Power BI, FinOps tooling, or internal billing workflows.
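The CSV-creation step can also be prototyped outside the Logic App. The sketch below uses only the Python standard library and assumes the query rows have already been retrieved as dictionaries; the rows shown are hypothetical and mirror a subset of the KQL query's projected columns.

```python
import csv
import io

# Hypothetical rows, shaped like the KQL query's projected columns.
rows = [
    {"ProductId": "sales-assistant", "ModelName": "gpt-4o",
     "PromptTokens": 900_000, "CompletionTokens": 300_000, "TotalTokens": 1_200_000},
    {"ProductId": "support-bot", "ModelName": "gpt-4o",
     "PromptTokens": 450_000, "CompletionTokens": 150_000, "TotalTokens": 600_000},
]

# Write the rows to an in-memory CSV; in the Logic App, the equivalent
# output would be written to Azure Blob Storage instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Keeping the CSV columns identical to the KQL projection means downstream consumers such as Power BI need no extra mapping layer.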


Deployment and Automation Resources

I’ve published a GitHub repository that includes:

  • Bicep modules for deploying the required resources
  • A deployment script to configure the Logic App and API connections
  • Guidance on parameterizing the solution for existing environments

The deployment assumes you already have an API configured in APIM with LLM logging enabled and diagnostic settings pointing to a Log Analytics workspace.

You simply provide:

  • The target resource group
  • Azure region
  • The resource group containing the existing workspace
  • The workspace name
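With those inputs in hand, the deployment is a standard Bicep deployment at resource-group scope. The invocation below is only a sketch: the template file and parameter names are assumptions for illustration, not the repository's actual names, so check the repo's deployment script for the real ones.

az deployment group create \
  --resource-group rg-chargeback \
  --template-file main.bicep \
  --parameters location=eastus2 \
               workspaceResourceGroup=rg-monitoring \
               workspaceName=law-shared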

Repository: sbray779/PTUChargeBackWorkflow (Logic App workflow that provides token utilization data for chargeback)


Closing Thoughts

PTU reservations are a powerful way to control costs and improve performance for Azure OpenAI workloads, but shared capacity introduces chargeback complexity. By combining API Management, LLM logging, Log Analytics, and Logic Apps, you can implement a transparent, defensible, and automated chargeback model based on actual token usage.

This approach scales well across multiple applications, supports FinOps initiatives, and provides the data foundation for deeper reporting and optimization over time.