# Quantisation

How models are quantised on TensorX, and how to check the quantisation for any specific model.

***

## Overview

Every large language model (LLM) on the TensorX platform is served in a **quantised** format. There is no separate "full-precision" or "non-quantised" variant: the published deployment is the quantised one, at the precision chosen for production serving on TensorX hardware.

This page explains which quantisation levels are in use, why, and how to read the value for any specific model.

{% hint style="info" %}
**TL;DR.** Almost every LLM on TensorX runs at `fp8`. The Kimi K2 family runs at native `int4`, and the `deepseek-v4-flash-backup` model runs at `fp4`. Audio models do not carry a quantisation label.
{% endhint %}

***

## Quantisation levels in use

The TensorX catalogue uses three quantisation levels today:

| Level                            | Used by                                                                                                                                                                                                                                         | Why                                                                                                                                                                                         |
| -------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **`fp8`** (8-bit floating point) | Almost every LLM on the platform, including the GLM family (`z-ai/glm-4.6`, `z-ai/glm-4.7`, `z-ai/glm-5`, `z-ai/glm-5-turbo`, `z-ai/glm-5.1`, `z-ai/glm-5v-turbo`), DeepSeek, Llama, Qwen, Mixtral, MiniMax, and the open-source GPT-OSS models | Best precision-vs-throughput trade-off on NVIDIA B200 and B300 hardware. Quality difference versus higher precisions is negligible on standard evaluation suites.                           |
| **`int4`** (4-bit integer)       | The Kimi K2 family only (`moonshotai/Kimi-K2.6`, `moonshotai/kimi-k2.5`, `moonshotai/kimi-k2.7-code`)                                                                                                                                           | These models ship with Moonshot's native INT4 quantisation-aware-trained (QAT) weights. Serving the QAT weights directly preserves quality, which post-training quantisation to the same width may not, while significantly reducing memory footprint and latency. |
| **`fp4`** (4-bit floating point) | The `deepseek-v4-flash-backup` model only                                                                                                                                                                                                       | Served at `fp4` for maximum throughput on the backup deployment.                                                                                                                            |

Non-LLM deployments (text-to-speech, speech-to-text) do not carry a quantisation label.

A small number of LLM deployments also report no quantisation label on `/v1/model/info` (the field is `null`).
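The levels above, plus the unlabelled case, can be tallied from a catalogue payload. The following is a minimal Python sketch, assuming the payload shape (`data[].model_name` and `data[].model_info.quantization`) used by the catalogue query on this page. The helper name `group_by_quantization` and the sample payload are illustrative and not part of the TensorX API.

```python
from collections import defaultdict


def group_by_quantization(payload):
    """Group model names by quantisation label (None means no label)."""
    groups = defaultdict(list)
    for entry in payload.get("data", []):
        # Assumption: model_info may be absent or null; treat both as "no label".
        info = entry.get("model_info") or {}
        groups[info.get("quantization")].append(entry["model_name"])
    return dict(groups)


# Illustrative sample only, not real API output.
sample = {
    "data": [
        {"model_name": "z-ai/glm-4.7", "model_info": {"quantization": "fp8"}},
        {"model_name": "moonshotai/kimi-k2.5", "model_info": {"quantization": "int4"}},
        {"model_name": "deepseek-v4-flash-backup", "model_info": {"quantization": "fp4"}},
        {"model_name": "chatterbox-turbo", "model_info": {"quantization": None}},
    ]
}

groups = group_by_quantization(sample)
for label, models in groups.items():
    print(label, models)
```

Keeping `None` as its own key makes unlabelled deployments visible instead of silently dropping them.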

***

## Checking the quantisation for a specific model

The model catalogue endpoint exposes the quantisation per model. This is the canonical answer and will reflect any future change to how a model is served.

```bash
# List every model with its quantisation label, collected into one JSON array
curl -sS https://api.tensorx.ai/v1/model/info \
  -H "Authorization: Bearer $TENSORX_API_KEY" \
  | jq '[.data[] | {model: .model_name, quantization: .model_info.quantization}]'
```

### Example output (excerpt)

```json
[
  { "model": "z-ai/glm-4.7",         "quantization": "fp8"  },
  { "model": "z-ai/glm-5.1",         "quantization": "fp8"  },
  { "model": "minimax/minimax-m2.5", "quantization": "fp8"  },
  { "model": "moonshotai/kimi-k2.5", "quantization": "int4" },
  { "model": "moonshotai/Kimi-K2.6", "quantization": "int4" },
  { "model": "chatterbox-turbo",     "quantization": null   }
]
```

Each entry in the full `/v1/model/info` response carries a `model_info.quantization` field. If the field is `null`, the model is either a non-LLM deployment (audio) or one of the small number of LLM deployments that report no quantisation label.
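The same lookup can be done from application code. Below is a hedged Python sketch using only the standard library. The URL, the `Authorization` header, and the field names come from the `curl` and `jq` command above; the helper names `fetch_model_info` and `quantization_of`, the timeout, and the sample payload are assumptions made for illustration.

```python
import json
import urllib.request

MODEL_INFO_URL = "https://api.tensorx.ai/v1/model/info"


def fetch_model_info(api_key, url=MODEL_INFO_URL, timeout=30):
    """GET the model catalogue and return the decoded JSON body."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def quantization_of(payload, model_name):
    """Return the quantisation label for model_name (None if it has no label).

    Raises KeyError when the model is not listed, so a missing model is
    never confused with a null label.
    """
    for entry in payload.get("data", []):
        if entry.get("model_name") == model_name:
            return (entry.get("model_info") or {}).get("quantization")
    raise KeyError(model_name)


# Offline check against an illustrative payload (not real API output).
sample = {
    "data": [
        {"model_name": "moonshotai/Kimi-K2.6", "model_info": {"quantization": "int4"}},
        {"model_name": "chatterbox-turbo", "model_info": {"quantization": None}},
    ]
}
print(quantization_of(sample, "moonshotai/Kimi-K2.6"))
```

With a key exported as `TENSORX_API_KEY`, calling `quantization_of(fetch_model_info(os.environ["TENSORX_API_KEY"]), "z-ai/glm-5.1")` would return the label currently served for that model.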

***

## FAQ

### Do you offer this model at a different quantisation level?

Not on the shared platform. TensorX does not publish the same model at multiple quantisation levels: there is one served version per model ID, at the quantisation shown on `/v1/model/info`.

If you have a specific requirement (for example, a particular model served at `bf16` or `fp16` for a regulated workload or for a benchmark methodology that requires it), that is a [Dedicated Inference](https://tensorx.ai/dedicated-inference) engagement rather than a parallel SKU on the shared platform.

### Is there a quality difference between `fp8` and higher precisions?

For the models in the TensorX catalogue, the quality difference between the `fp8` deployment and a higher-precision version of the same weights is negligible on standard evaluation suites (MMLU, HumanEval, GSM8K, AIME, and similar). When we run benchmarks against a quantised deployment, we report numbers from the served `fp8` version, not from a hypothetical higher-precision copy.

### Why is Kimi K2 served at `int4` rather than `fp8`?

The Kimi K2 family is published by Moonshot with native INT4 quantisation-aware-trained (QAT) weights. Using the QAT version preserves the model's quality while reducing memory footprint and latency significantly. Converting those weights up to `fp8` would offer no quality benefit and would cost throughput.
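The memory side of this trade-off follows from simple arithmetic: weight storage scales with bits per weight. The sketch below is a rough, weights-only estimate in Python. The parameter count is a hypothetical round number chosen for illustration, not a figure for any TensorX or Moonshot model, and real quantised formats also store scale factors, so actual footprints are somewhat higher.

```python
# Bits used to store one weight in each format.
BITS_PER_WEIGHT = {"bf16": 16, "fp16": 16, "fp8": 8, "int4": 4, "fp4": 4}


def weight_gigabytes(num_parameters, fmt):
    """Weights-only size in GB (10**9 bytes).

    Ignores quantisation scales, KV cache, and activations.
    """
    return num_parameters * BITS_PER_WEIGHT[fmt] / 8 / 1e9


# Hypothetical parameter count, for illustration only.
HYPOTHETICAL_PARAMS = 100e9

sizes = {
    fmt: weight_gigabytes(HYPOTHETICAL_PARAMS, fmt)
    for fmt in ("bf16", "fp8", "int4")
}
for fmt, gigabytes in sizes.items():
    print(f"{fmt}: {gigabytes:.0f} GB")
```

Under these assumptions, `int4` weights take half the space of `fp8` and a quarter of `bf16`.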

### Will the quantisation for a model ever change?

It can. If a model provider releases an updated weight format, or if a new quantisation method gives a clear quality-vs-throughput improvement, we will update the served version. The `model_info.quantization` field on `/v1/model/info` is always the source of truth.
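Because the served label can change, a client that depends on a particular level may want to compare what it expects against the catalogue, for example at startup. The following Python sketch shows one way to do that on an already-fetched payload. The helper name `find_quantization_changes`, the `NOT_LISTED` marker, and the sample data are illustrative assumptions, not part of the TensorX API.

```python
NOT_LISTED = "<not listed>"  # marker for models absent from the catalogue


def find_quantization_changes(payload, expected):
    """Return {model: (expected_label, served_label)} for every mismatch."""
    served = {
        entry["model_name"]: (entry.get("model_info") or {}).get("quantization")
        for entry in payload.get("data", [])
    }
    changes = {}
    for model, expected_label in expected.items():
        served_label = served.get(model, NOT_LISTED)
        if served_label != expected_label:
            changes[model] = (expected_label, served_label)
    return changes


# Illustrative payload and expectations (not real API output).
sample = {
    "data": [
        {"model_name": "z-ai/glm-5.1", "model_info": {"quantization": "fp8"}},
        {"model_name": "moonshotai/kimi-k2.5", "model_info": {"quantization": "int4"}},
    ]
}
expected = {
    "z-ai/glm-5.1": "fp8",
    "moonshotai/kimi-k2.5": "fp8",  # deliberately wrong, to show a mismatch
    "example/unknown-model": "fp8",  # hypothetical ID that is not in the sample
}

changes = find_quantization_changes(sample, expected)
for model, (want, got) in changes.items():
    print(f"{model}: expected {want}, served {got}")
```

An empty result means every expected label still matches what the catalogue reports.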

***

## See also

* [Models](/api-reference/models.md) - Available models on the platform
* [Chat Completions](/api-reference/chat-completions.md) - API endpoint documentation
* [Rate Limits](/api-reference/rate-limits.md) - Throughput and request limits

