> For the complete documentation index, see [llms.txt](https://docs.tensorx.ai/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://docs.tensorx.ai/api-reference/prompt-caching.md).

# Prompt Caching

## Does TensorX offer prompt caching?

Yes. Our serving stack uses KV caching at the inference layer, which is how modern inference engines work in general. When repeated requests share a common prefix, or land on the same replica that has handled a similar prefix recently, reuse of cached state makes those requests faster.

This is **best-effort rather than guaranteed**. We run multiple replicas of popular models for load and resilience, so a follow-up request can land on a different replica without that prefix warm, in which case it is recomputed from scratch.

## Cache-hit pricing

Models that support prompt caching have separate, lower rates for cached input tokens. When a request hits a warm replica with a matching prefix, the cached portion of the input is billed at the model's **cache-read** rate rather than the standard input rate. Across the catalogue the cache-read rate is roughly a quarter of the input rate (about a 75% discount); the exact per-model figure is available from the API shown below. Populating the cache on the first hit uses the **cache-write** rate. You don't need to do anything to opt in; pricing is applied automatically based on what the serving layer reports back to the billing layer.

Because cache hits are best-effort across replicas, the saving applies on a per-request basis rather than as a guaranteed discount on every call. In steady traffic with consistent prefixes you can expect the cache-read rate to apply most of the time; under heavy load or after idle periods, more of your requests will be billed at the standard input rate.

We will surface the per-model cache rates on the public pricing page shortly. In the meantime, you can retrieve them yourself today directly from the API:

```bash
curl -H "Authorization: Bearer $TENSORX_API_KEY" \
  https://api.tensorx.ai/v1/model/info \
  | jq '.data[] | select(.model_info.supports_prompt_caching == true) | {
      model: .model_name,
      input_cost_per_token: .model_info.input_cost_per_token,
      cache_read_input_token_cost: .model_info.cache_read_input_token_cost,
      cache_creation_input_token_cost: .model_info.cache_creation_input_token_cost
    }'
```

The `supports_prompt_caching` flag tells you whether a given model has caching enabled. Audio models and a small number of LLMs don't expose caching today.

## Measuring cache usage

Because hits are best-effort, you can confirm what actually got cached from the `usage` object the API returns on each response. On chat completions, `prompt_tokens_details.cached_tokens` reports how many of the prompt tokens were served from cache; comparing that against `prompt_tokens` tells you the share of the input that hit the cache (and so was billed at the cheaper cache-read rate). Aggregate these fields across calls to see how often steady, shared-prefix traffic is hitting the cache.

## What this means in practice

* **Repeated identical or near-identical prompts**, especially when sent in close succession, are likely to benefit from KV cache reuse on a warm replica.
* **Long shared system prompts** across many calls (a common pattern for agentic workflows) will often hit the cache when traffic is steady, and will miss when load spreads requests across replicas.
* **Cold starts** (the first request to a freshly scaled replica, or after a long idle period) will not benefit from caching.

## Caching and zero data retention

Prompt caching reuses a matching prompt prefix when a recent request has warmed it, on a best-effort basis across the serving fleet, and is unrelated to our zero-data-retention policy. ZDR governs what we do not retain after a request completes (prompts, completions, logs of either).

KV state lives in GPU memory only and is never written to disk or included in any retained dataset. Prompt caching works by reusing this state, so cached blocks do persist in memory beyond the single request that created them: a later request that shares a prefix can reuse them. They remain transient in-memory state, evicted under memory pressure or after idle periods, and nothing about the prompt or completion is persisted to storage at any point.


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.tensorx.ai/api-reference/prompt-caching.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
