Guide · updated 2026-07-04

How to estimate LLM API costs (before you get the bill)

LLM pricing looks simple — dollars per million tokens — but real bills surprise teams because the cost drivers hide in the workload, not the price sheet. Here is the estimation method that survives contact with production.

1. Measure a real request, don't guess

Take ten representative requests from your prototype and log actual token counts from the API response's usage field. The common mistake is counting only the user's message: your system prompt, retrieved context, tool definitions, and conversation history all bill as input tokens on every single call. A "short question" in a RAG app is routinely 3,000–8,000 input tokens.
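As a rough sketch of this measurement step, the snippet below summarizes token counts over a handful of logged requests. The record shape and the field names input_tokens and output_tokens are assumptions for illustration; providers name the fields in their usage object differently, so adapt the keys to whatever your API returns. The numbers are invented.

```python
from statistics import mean

# Hypothetical usage records copied from API responses. The field names are
# assumptions and differ between providers; the values are made up.
logged_usage = [
    {"input_tokens": 3200, "output_tokens": 180},
    {"input_tokens": 7900, "output_tokens": 240},
    {"input_tokens": 4100, "output_tokens": 95},
    {"input_tokens": 5600, "output_tokens": 310},
]


def summarize_usage(records):
    """Return mean and max token counts for input and output."""
    inputs = [r["input_tokens"] for r in records]
    outputs = [r["output_tokens"] for r in records]
    return {
        "mean_input": mean(inputs),
        "max_input": max(inputs),
        "mean_output": mean(outputs),
        "max_output": max(outputs),
    }


summary = summarize_usage(logged_usage)
print(summary)
```

Keeping the maximum next to the mean is a deliberate choice: the largest request in the sample is often the one with a long conversation history or a big retrieved context, and it is the one a guess would miss.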

2. Input and output are different prices

Output tokens typically cost 4–8× as much as input tokens. A summarization workload (huge input, tiny output) and a generation workload (tiny input, huge output) can differ 5× in cost on the same model, even though both pay the same listed prices. Estimate the two separately:

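monthly = (in_tokens × in_price + out_tokens × out_price) / 1e6
          × requests_per_day × 30

Here in_tokens and out_tokens are average tokens per request, and in_price and out_price are dollars per million tokens.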

Our calculator does this across every model we track in one shot.
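For readers who want to run the formula by hand, here is a minimal Python sketch of it. The function name monthly_cost and the prices used (3.00 and 15.00 dollars per million tokens) are placeholders chosen for illustration, not any provider's real rates.

```python
def monthly_cost(in_tokens, out_tokens, in_price, out_price,
                 requests_per_day, days=30):
    """Estimated monthly cost in dollars.

    in_tokens / out_tokens: average tokens per request.
    in_price / out_price: dollars per million tokens.
    """
    per_request = (in_tokens * in_price + out_tokens * out_price) / 1e6
    return per_request * requests_per_day * days


# Placeholder prices (assumptions, not real rates).
IN_PRICE = 3.0
OUT_PRICE = 15.0

# Same model, same traffic, opposite token mix.
summarization = monthly_cost(10_000, 500, IN_PRICE, OUT_PRICE,
                             requests_per_day=1_000)
generation = monthly_cost(500, 10_000, IN_PRICE, OUT_PRICE,
                          requests_per_day=1_000)

print(f"summarization: ${summarization:,.2f} per month")
print(f"generation:    ${generation:,.2f} per month")
```

With these placeholder numbers the generation workload comes out roughly four times more expensive than the summarization one, despite identical request volume and identical listed prices.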

3. Model the traffic shape, not the average

Averages hide both growth and concentration: in most consumer apps, 5–10% of users generate more than half of all requests. If you charge a flat subscription, estimate cost for a user in your heaviest decile, not for the mean user; that number decides whether your unit economics survive.
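To make the gap between the mean and the heavy tail concrete, the sketch below compares the two on invented data. The function name heaviest_decile_stats, the per-user request counts, and the 0.02 dollar cost per request are all assumptions for illustration.

```python
def heaviest_decile_stats(requests_per_user, cost_per_request):
    """Compare the mean user with the top 10% of users by request count."""
    counts = sorted(requests_per_user, reverse=True)
    decile_size = max(1, len(counts) // 10)  # at least one user
    top = counts[:decile_size]
    total = sum(counts)
    return {
        # Share of all requests made by the heaviest 10% of users.
        "top_decile_share": sum(top) / total,
        # Monthly cost of the average user.
        "mean_user_cost": total / len(counts) * cost_per_request,
        # Monthly cost of the average user within the heaviest 10%.
        "top_decile_user_cost": sum(top) / decile_size * cost_per_request,
    }


# Invented monthly request counts: 2 heavy users, 18 light ones.
monthly_requests = [2000, 1500] + [100] * 18

stats = heaviest_decile_stats(monthly_requests, cost_per_request=0.02)
print(stats)
```

In this made-up sample the mean user costs about 5.30 dollars a month while a heaviest-decile user costs about 35 dollars, so a flat subscription priced against the mean would lose money on exactly the users who like the product most.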


4. Then cut the bill

Prompt caching — providers discount repeated prefixes (system prompts, static context) by up to 90%. If your system prompt is long, this is the single biggest lever.
Batch APIs — non-interactive jobs (embeddings, backfills, evals) usually qualify for ~50% off with a 24-hour window.
Model routing — send the easy 80% of traffic to a fast-tier model and escalate the hard 20%. A frontier model for classification tasks is burned money.
Output limits — cap max output tokens per endpoint. Unbounded generation is unbounded spend.
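The four levers above can be folded into the estimate with a few small helpers. This is a sketch under stated assumptions: the function names are hypothetical, the discount defaults simply mirror the figures quoted in the list (up to 90% for cached prefixes, about 50% for batch, an 80/20 routing split), and the caching helper ignores any cache-write surcharge a provider may apply.

```python
def cached_input_price(in_price, cached_fraction, cache_discount=0.90):
    """Effective input price when a fraction of input tokens hits the cache.

    cached_fraction: share of input tokens that are a repeated prefix (0..1).
    cache_discount: discount on those tokens (0.90 is the best case quoted).
    """
    return in_price * (1 - cached_fraction * cache_discount)


def batch_cost(cost, batch_discount=0.50):
    """Cost of a non-interactive job moved to a discounted batch API."""
    return cost * (1 - batch_discount)


def routed_cost(cheap_cost, frontier_cost, easy_share=0.80):
    """Blended per-request cost when easy traffic goes to the cheaper model."""
    return easy_share * cheap_cost + (1 - easy_share) * frontier_cost


def max_output_cost(max_output_tokens, out_price):
    """Worst-case output cost per request under a hard output-token cap."""
    return max_output_tokens * out_price / 1e6


# Placeholder inputs (assumptions, not real rates).
effective_in_price = cached_input_price(3.0, cached_fraction=0.7)
backfill = batch_cost(100.0)
blended = routed_cost(cheap_cost=0.002, frontier_cost=0.02)
output_ceiling = max_output_cost(1_000, out_price=15.0)

print(effective_in_price, backfill, blended, output_ceiling)
```

Feeding the effective input price and the blended per-request cost back into the monthly estimate shows which lever matters most for a given workload before any engineering time is spent on it.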

5. Re-check monthly — prices actually move

This market reprices constantly and almost always downward: frontier-class pricing has fallen roughly 10× in two years. A model choice that was correct in March can be wrong by June. Our change log records every move we detect, and the newsletter sends an alert when one lands.
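A monthly re-check can be as simple as re-running the estimate against a fresh price table and seeing whether the cheapest option has changed. In the sketch below the model names, the prices, and the function cheapest_model are all invented for illustration.

```python
def cheapest_model(prices, in_tokens, out_tokens):
    """Return the name of the cheapest model for one request shape.

    prices maps a model name to (in_price, out_price) in dollars per
    million tokens.
    """
    def cost(name):
        in_price, out_price = prices[name]
        return (in_tokens * in_price + out_tokens * out_price) / 1e6

    return min(prices, key=cost)


# Two hypothetical price snapshots; model_b is repriced between them.
march = {"model_a": (3.0, 15.0), "model_b": (5.0, 10.0)}
june = {"model_a": (3.0, 15.0), "model_b": (2.0, 8.0)}

# Measured request shape: 4,000 input tokens, 300 output tokens.
choice_march = cheapest_model(march, in_tokens=4_000, out_tokens=300)
choice_june = cheapest_model(june, in_tokens=4_000, out_tokens=300)

if choice_march != choice_june:
    print(f"cheapest model changed: {choice_march} -> {choice_june}")
```

Price is only one input to a model choice, so a change here is a prompt to re-run quality evals on the cheaper model, not a reason to switch automatically.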