Why LLM Cost Tracking Matters: The Hidden Billing Problem
If you're building an AI product, you already know the basics: OpenAI charges per token, Anthropic charges per token, and your monthly bill keeps climbing. What most teams discover too late is that tracking LLM costs by customer is a fundamentally different problem than tracking your total spend. Your OpenAI dashboard shows you one number. It tells you nothing about which customer is responsible for 40% of that bill, or which feature is quietly burning your margins.
This isn't a small gap. A SaaS company running GPT-4o for 500 customers might see a $12,000 monthly AI bill that looks manageable — until they realize their top 10 accounts consume 70% of those tokens, and none of that usage is reflected in their pricing tiers. Without customer-level AI billing data, you're flying blind on unit economics.
Understanding Cost Attribution Layers: From API to Customer
LLM cost attribution operates across three distinct layers, and most teams only instrument one of them.
Layer 1 — Provider level: What OpenAI, Anthropic, or Cohere charges you. This is the raw token spend, available through API usage logs or billing exports.
Layer 2 — Framework level: What LangChain, LlamaIndex, or your custom orchestration layer tracks. Some frameworks expose callback hooks that log model calls, but this data is typically aggregated per session, not per customer.
Layer 3 — Application level: Who triggered the call, what feature they used, and what business context surrounds the cost. This is where OpenAI cost attribution becomes actionable — and this is the layer almost no off-the-shelf tool reaches.
The gap between Layer 1 and Layer 3 is where AI margins get destroyed. You need infrastructure that bridges them.
Setting Up Cost Tracking Infrastructure: Tools and Architecture
A practical AI cost tracking stack has four components:
- A tagging layer that attaches customer_id, feature_id, and session_id to every LLM call before it leaves your application
- A cost calculation engine that converts raw token counts into dollar figures using current provider pricing
- A storage layer — a simple Postgres table or a time-series database like TimescaleDB works well
- A reporting layer for dashboards, alerts, and billing exports
The critical design decision is where the tagging layer lives. It must sit above your LLM framework, not inside it. If you instrument only at the LangChain callback level, you lose context the moment a request crosses a service boundary. Tag at the request entry point — your API gateway or your background job handler — and propagate that context through every downstream call.
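One way to implement entry-point tagging in Python is with the standard-library `contextvars` module, whose values follow the request across function calls and `await` boundaries without threading parameters everywhere. A minimal sketch, assuming illustrative helper names (`tag_request`, `current_tags`, `log_llm_call` are not part of any SDK):

```python
import contextvars

# Context set once at the request entry point; visible to all downstream calls
_customer_id = contextvars.ContextVar("customer_id", default=None)
_feature_id = contextvars.ContextVar("feature_id", default=None)

def tag_request(customer_id: str, feature_id: str) -> None:
    """Call at the API gateway or job handler, before any LLM work starts."""
    _customer_id.set(customer_id)
    _feature_id.set(feature_id)

def current_tags() -> dict:
    """Read the tags from anywhere downstream -- no parameter threading needed."""
    return {"customer_id": _customer_id.get(), "feature_id": _feature_id.get()}

# The entry point tags once...
tag_request("acme-corp", "document_review")

# ...and a deeply nested LLM wrapper still attaches the right attribution
def log_llm_call(tokens: int) -> dict:
    return {**current_tags(), "tokens": tokens}

print(log_llm_call(1200))
# {'customer_id': 'acme-corp', 'feature_id': 'document_review', 'tokens': 1200}
```

Because `contextvars` values are isolated per task, concurrent requests in the same process cannot leak each other's tags.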
Implementing Customer-Level Cost Attribution with Code Examples
Here's a minimal implementation using Apeiros Protocol's open-source Python SDK. Install it with pip install apeiros-sdk, then wrap any task with three calls:
import openai
from apeiros import ApeirosAgent
# Tag the task to a customer and model
agent = ApeirosAgent(customer_id="acme-corp", model="gpt-4o")
agent.start_task("summarize document", priority="low")
# Your existing LLM call — unchanged
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"Summarize: {text}"}]
)
# Report the token count Apeiros needs to calculate cost
agent.update_tokens(
    response.usage.prompt_tokens + response.usage.completion_tokens
)
agent.end_task()
# See the per-task cost
print(agent.cost_estimate) # → $0.0023
Three calls. No changes to your underlying LLM calls. No framework lock-in. Apeiros works with any provider that returns token counts in its response — OpenAI, Anthropic, Gemini, and others.
After running tasks across multiple customers, call customer_report() to see the full breakdown:
# After processing tasks for several customers
print(ApeirosAgent.customer_report(plan_price=299.0))
# Output:
# ══════════════════════════════════════════════════════════════════════
# CUSTOMER COST REPORT
# ══════════════════════════════════════════════════════════════════════
# Customer Tasks Tokens Est. Cost Signal
# ──────────────────────────────────────────────────────────────────────
# hyperscale-co 18,700 74,800k $598.40 Burst ⚠ high throttle
# nova-ventures 5,580 229,800k $184.00 Burst
# titan-retail 2,640 8,448k $ 67.58 Burst
# greenleaf-inc 880 61,600k $ 49.28 Burst
# acme-corp 396 7,128k $ 5.70 Burst
# ──────────────────────────────────────────────────────────────────────
# TOTAL 28,196 381,776k $904.96
#
# MARGIN ANALYSIS ($299.00/month flat plan)
# Customer            AI Cost       Plan    Margin
# ──────────────────────────────────────────────────────────────────────
# hyperscale-co $598.40 $299.00 -100.1% ✗ underwater
# nova-ventures $184.00 $299.00 38.5% ⚠ thin
# titan-retail $ 67.58 $299.00 77.4% ✓ healthy
# greenleaf-inc $ 49.28 $299.00 83.5% ✓ healthy
# acme-corp $ 5.70 $299.00 98.1% ✓ healthy
The margin analysis tells you immediately who is profitable and who isn't — before the invoice arrives.
Tracking Costs Per Feature: Granular Cost Breakdown Strategies
Per-feature LLM cost tracking unlocks decisions that aggregate data simply cannot support. A team at a legal tech company discovered their AI contract review feature cost $0.18 per document on average — but their AI Q&A chat feature cost $0.004 per query. Both looked fine in aggregate. At scale, the chat feature was the profitable one. The review feature needed a pricing change.
Tag every feature your product exposes: document_review, email_draft, search_augmentation, onboarding_assistant. When you query your cost data by feature, you get a direct read on gross margin per product surface. That's the input your pricing team actually needs.
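Once every logged call carries a feature tag, per-feature cost is a short aggregation. A minimal in-memory sketch — the event shape is illustrative, and in production this is the SQL query against your cost store:

```python
from collections import defaultdict

def cost_by_feature(events):
    """Sum dollar cost per feature tag across logged LLM calls."""
    totals = defaultdict(float)
    for e in events:
        totals[e["feature_id"]] += e["cost"]
    return dict(totals)

# Illustrative events mirroring the legal-tech example above
events = [
    {"feature_id": "document_review", "cost": 0.18},
    {"feature_id": "document_review", "cost": 0.21},
    {"feature_id": "qa_chat", "cost": 0.004},
]
print(cost_by_feature(events))
```

Divide each feature's total by its request count and you have cost per product surface, the number a pricing team can act on.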
Common Cost Tracking Pitfalls and How to Avoid Them
Pitfall 1 — Trusting framework-level token counts exclusively. LangChain's token callback fires per chain step, not per user request. One user interaction can trigger multiple chain steps, and without aggregation logic, you double-count.
Pitfall 2 — Ignoring retry and fallback costs. If your retry logic silently retries a failed GPT-4o call with GPT-4o-mini, that second call still costs money. Both calls need to land on the same customer record.
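One way to keep retry and fallback costs on one record is to sum tokens across every attempt and report a single total. A sketch under stated assumptions: the `PartialFailure` exception and the attempt callables are hypothetical stand-ins for your own retry machinery, not any real SDK API.

```python
class PartialFailure(Exception):
    """Hypothetical failure that still reports tokens consumed before the error."""
    def __init__(self, tokens: int):
        super().__init__("call failed")
        self.tokens = tokens

def call_with_fallback(attempts):
    """Try each callable in order; return (text, total_tokens) where the total
    includes tokens burned by failed attempts, so one reporting call charges
    everything to the same customer record."""
    total = 0
    for call in attempts:
        try:
            text, tokens = call()
            return text, total + tokens
        except PartialFailure as e:
            total += e.tokens  # the failed attempt still cost money
    raise RuntimeError("all attempts failed")

def flaky_gpt4o():
    raise PartialFailure(tokens=850)  # e.g. timed out after the prompt was billed

def gpt4o_mini():
    return "summary text", 400

text, tokens = call_with_fallback([flaky_gpt4o, gpt4o_mini])
print(tokens)  # 1250 -- both attempts land on one record
```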
Pitfall 3 — Sampling instead of capturing everything. Unlike application traces, LLM costs are not uniform. A single outlier call with a 50,000-token prompt skews your customer cost data significantly. Capture 100% of calls.
Pitfall 4 — No alert thresholds. Without per-customer spend alerts, a single misconfigured prompt loop can run up hundreds of dollars before anyone notices. Set daily spend limits per customer at the tracking layer.
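A per-customer daily limit can be enforced with a few lines at the tracking layer. A minimal in-memory sketch — a real deployment would back this with Redis or your cost store so the counter survives restarts, and the `SpendGuard` name is illustrative:

```python
from datetime import date
from collections import defaultdict

class SpendGuard:
    """Track per-customer daily spend; refuse calls past a dollar limit."""
    def __init__(self, daily_limit_usd: float):
        self.limit = daily_limit_usd
        self.spend = defaultdict(float)  # (customer_id, date) -> dollars

    def record(self, customer_id: str, cost_usd: float) -> None:
        self.spend[(customer_id, date.today())] += cost_usd

    def allow(self, customer_id: str) -> bool:
        return self.spend[(customer_id, date.today())] < self.limit

guard = SpendGuard(daily_limit_usd=25.0)
guard.record("cust_1042", 24.50)
print(guard.allow("cust_1042"))  # True -- still under the $25 cap
guard.record("cust_1042", 1.00)
print(guard.allow("cust_1042"))  # False -- a runaway loop gets cut off here
```

Check `allow()` before each LLM call and raise an alert when it first returns False; that converts a silent overspend into a same-day page.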
Real-World Example: Building a Multi-Tenant LLM Cost Dashboard
A B2B SaaS team using Apeiros Protocol built a cost dashboard with three views: cost by customer (last 30 days), cost by feature (last 7 days), and cost per 1,000 API calls by model. Within two weeks of deployment, they identified one enterprise customer consuming 3x more tokens than any comparable account — not because of heavy usage volume, but because of a frontend bug that was sending full conversation history on every message instead of a rolling window. Fixing the bug reduced that customer's token spend by 68% and improved response latency by 400ms.
That kind of discovery is only possible when you can query SELECT feature, SUM(cost) FROM llm_events WHERE customer_id = 'cust_1042' GROUP BY feature.
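The query above presumes an events table in your cost store. A sketch of what that table might look like, using an in-memory SQLite stand-in for Postgres — the column set is illustrative, not a required schema:

```python
import sqlite3

# In-memory stand-in for the Postgres cost store
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE llm_events (
        ts          TEXT,     -- event timestamp
        customer_id TEXT,
        feature     TEXT,
        model       TEXT,
        tokens      INTEGER,
        cost        REAL      -- dollars, computed at write time
    )
""")
db.executemany(
    "INSERT INTO llm_events VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2025-01-10", "cust_1042", "document_review", "gpt-4o", 9000, 0.18),
        ("2025-01-10", "cust_1042", "qa_chat", "gpt-4o-mini", 800, 0.004),
        ("2025-01-11", "cust_1042", "qa_chat", "gpt-4o-mini", 900, 0.005),
    ],
)

# The per-feature breakdown for one customer
rows = db.execute(
    "SELECT feature, SUM(cost) FROM llm_events "
    "WHERE customer_id = 'cust_1042' GROUP BY feature"
).fetchall()
print(rows)
```

Storing raw token counts alongside computed cost is what lets you reprocess history when provider prices change.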
Integrating with OpenAI and Other LLM Providers
Apeiros Protocol integrates with OpenAI, Anthropic Claude, Google Gemini, Mistral, and any provider that exposes token usage in its API response. For providers that don't return usage data by default (some Bedrock configurations, for example), Apeiros falls back to tiktoken-based estimation with configurable pricing overrides.
For OpenAI cost attribution specifically, Apeiros reads the usage object from every completion response and maps it to current tier pricing — including cached input token discounts introduced in late 2024. Your stored costs stay accurate as pricing changes, because Apeiros separates raw token counts from computed cost so you can reprocess historical data with updated rates.
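When a response omits usage data entirely, you can estimate from the text itself. A sketch using tiktoken when it is installed, falling back to the rough 4-characters-per-token heuristic for English text — the heuristic is approximate, so treat these numbers as estimates, not billing-grade counts:

```python
def estimate_tokens(text: str, model: str = "gpt-4o") -> int:
    """Best-effort token count: exact with tiktoken, rough heuristic without."""
    try:
        import tiktoken
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))
    except Exception:
        # ~4 characters per token is a common approximation for English text
        return max(1, len(text) // 4)

print(estimate_tokens("Summarize this contract clause for me."))
```

Flag estimated rows in your cost store (e.g. an `is_estimate` column) so reports can distinguish measured spend from approximated spend.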
Automating Cost Allocation and Billing
Once you have per-customer cost data, automation is straightforward. A daily cron job queries your cost store, computes each customer's AI spend, and writes it to your billing platform. If you use Stripe, the Stripe Billing API accepts usage records that feed directly into metered subscriptions.
import time
import stripe
from apeiros import ApeirosAgent
# stripe.api_key and STRIPE_ITEM_IDS (customer_id -> metered subscription
# item ID) are assumed to be configured elsewhere

# Get the full cost breakdown at the end of the billing period
report = ApeirosAgent._registry  # raw per-customer task records (internal API)
for customer_id, tasks in report.items():
    ai_spend = sum(t["cost"] for t in tasks)
    stripe.SubscriptionItem.create_usage_record(
        STRIPE_ITEM_IDS[customer_id],
        quantity=int(ai_spend * 1000),  # dollars -> mills (tenths of a cent)
        timestamp=int(time.time()),
    )
This closes the loop: your LLM costs feed your billing system automatically, and every dollar of AI spend maps to a paying customer record. The customer_report() output becomes your billing source of truth.
Optimizing Costs Once You Can Track Them
Tracking enables optimization. With per-customer, per-feature cost data, four strategies become available immediately:
- Model routing: Route low-complexity requests to cheaper models. If your quick_reply feature averages 300 tokens and doesn't require deep reasoning, GPT-4o-mini at $0.15/M input tokens outperforms GPT-4o at $2.50/M for the classification step.
- Prompt compression: Identify which features send the longest prompts and trim system instructions that don't demonstrably improve output quality.
- Caching: Flag features with high query repetition rates as caching candidates. Even 20% cache hit rate on a high-traffic feature can cut monthly costs significantly.
- Customer tier adjustments: Customers whose AI cost consistently exceeds their plan price are candidates for usage-based pricing or tier upgrades — a conversation your sales team can have with data instead of guesswork.
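The routing strategy above can be sketched as a simple threshold on prompt size and task type. The token cutoff is illustrative and the prices mirror the per-million figures quoted in the list; tune both to your workload:

```python
# Illustrative per-million-token input prices from the list above
PRICES = {"gpt-4o": 2.50, "gpt-4o-mini": 0.15}

def route_model(prompt_tokens: int, needs_deep_reasoning: bool) -> str:
    """Send short, simple requests to the cheaper model."""
    if needs_deep_reasoning or prompt_tokens > 2000:
        return "gpt-4o"
    return "gpt-4o-mini"

def input_cost(model: str, tokens: int) -> float:
    """Input-side cost in dollars for a call of the given size."""
    return PRICES[model] * tokens / 1_000_000

model = route_model(prompt_tokens=300, needs_deep_reasoning=False)
print(model, input_cost(model, 300))  # routes to the cheaper model
```

Because the router is driven by the same per-feature cost data you already track, you can verify after deployment that the cheap path actually moved the margin.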
Cost visibility doesn't just protect margins. It turns pricing into a data problem you can actually solve.