# Model Serving - Active Models

This page lists the AI models currently live on the Phoeniqs Model Service. All models are served on Phoeniqs infrastructure through an OpenAI-compatible API and are ready for production use.
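Because the API is OpenAI-compatible, any OpenAI-style client can talk to it by pointing at the service's base URL. The sketch below builds a `/chat/completions` request with the Python standard library; the base URL and API key are placeholders, not real service values.

```python
import json
import urllib.request

# Placeholder values -- substitute your Phoeniqs Model Service endpoint and key.
BASE_URL = "https://api.example.com/v1"
API_KEY = "YOUR_API_KEY"

def build_chat_request(model: str, messages: list) -> urllib.request.Request:
    """Build an OpenAI-compatible /chat/completions POST request."""
    payload = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=payload,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request(
    "inference-gpt-oss-120b",
    [{"role": "user", "content": "Summarize RAG in one sentence."}],
)
# With a live endpoint, you would then send it:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
```

The same pattern applies to the other endpoints (`/embeddings`, `/audio/transcriptions`, etc.); only the path and payload change.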


# Active Models

| Model Name | Model | Type | Input (Credits/M Tokens) | Output (Credits/M Tokens) | TTPS | Description |
|---|---|---|---|---|---|---|
| inference-apertus-70b | RedHatAI/Apertus-70B-Instruct-2509-quantized.w4a16 | Chat | 0.6195 | 2.2201 | 3134 | Optimized for multilingual dialogue use cases. |
| inference-bge-m3 | BAAI/bge-m3 | Embedding | 0.4309 | --- | 1053 | Optimized for embeddings and sparse retrieval with support for multi-functionality, multi-linguality, and multi-granularity. |
| inference-bge-reranker | BAAI/bge-reranker-v2-m3 | Reranker | 0.0077 | --- | 9048 | Optimized for reranking to compute relevance scores. |
| inference-deepseek-ocr | inference-deepseek-ocr | OCR | 0.3848 | 1.5391 | 1121 | Optimized for Contexts Optical Compression. |
| inference-deepseek-v32 | deepseek-ai/DeepSeek-V3.2 | Chat | 0.6156 | 1.8469 | 5981 | Optimized for reasoning chat completions. |
| inference-gemma-12b-it | RedHatAI/gemma-3-12b-it-quantized.w4a16 | Multimodal | 0.2693 | 0.4309 | 7900 | Optimized for handling text and image input and generating text output. |
| inference-gemma4-31b | RedHatAI/gemma-4-31B-it-FP8-block | Multimodal | 0.118 | 0.325 | 4524 | Optimized for handling text and image input and generating text output. |
| inference-glm45-air-110b | zai-org/GLM-4.5-Air-FP8 | Chat | 0.4232 | 1.6853 | 5470 | Optimized for reasoning chat completions. |
| inference-glm-46-357B | zai-org/GLM-4.6V-FP8 | Chat | TBD | TBD | TBD | Optimized for reasoning chat completions. Model will be deployed on demand. |
| inference-gpt-oss-120b | openai/gpt-oss-120b | Chat | 0.1154 | 0.4617 | 4636 | Optimized for powerful reasoning, agentic tasks, and versatile developer use cases. |
| inference-granite-33-8b | ibm-granite/granite-3.3-8b-instruct | Chat | 0.1539 | 0.1539 | 9998 | Optimized for reasoning and instruction-following capabilities. |
| inference-granite-emb-278m | ibm-granite/granite-embedding-278m-multilingual | Embedding | 0.0770 | --- | 911 | Optimized for embeddings. |
| inference-granite-vision-2b | ibm-granite/granite-vision-3.2-2b | Multimodal | 0.0770 | 0.0770 | 7660 | A compact and efficient vision-language model. |
| inference-llama4-maverick | RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16 | Chat and multimodal | 0.2693 | 1.0773 | 7118 | Optimized for text and multimodal experiences. Maximum of 4 images per prompt; video input is not supported. |
| inference-llama4-scout-17b | RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 | Chat and multimodal | 0.1924 | 0.6387 | 4678 | Optimized for text and multimodal experiences. |
| inference-mistral-v03-7b | mistralai/Mistral-7B-Instruct-v0.3 | Chat only | 0.1539 | 0.1539 | 11436 | Optimized for multilingual dialogue use cases. |
| inference-miner-u25 | opendatalab/MinerU2.5-2509-1.2B | Vision-language | 0.38 | 0.23 | 913 | Optimized for document parsing, achieving state-of-the-art accuracy with high computational efficiency. |
| inference-qwen3-8b | RedHatAI/Qwen3-8B-quantized.w4a16 | Reasoning | 0.0269 | 0.1062 | 9707 | Optimized for thinking and reasoning. |
| inference-qwq-32b | RedHatAI/QwQ-32B-quantized.w8a8 | Reasoning | 0.9234 | 0.9234 | 8304 | Optimized for thinking and reasoning. |
| inference-qwen3-vl-235b | RedHatAI/Qwen3-VL-235B-A22B-Instruct-FP8-dynamic | Multimodal | 0.7003 | 2.0000 | 8111 | Optimized for text and multimodal experiences. |
| inference-whisper-large-v3 | openai/whisper-large-v3 | Speech to Text | 0.006 per minute | NA | TBD | Optimized for automatic speech recognition (ASR) and speech translation. The WhisperX serving stack supports high-throughput ASR, long audio files, speaker diarization and attribution, and word-level alignments. |

**Token Throughput Disclaimer:** The token-per-second throughput figures provided are based on controlled testing conditions and are intended for benchmarking and comparison purposes only. Actual performance in production environments may vary significantly depending on workload characteristics, system configuration, model hosting provider, network conditions, and other operational factors. These results should not be interpreted as a guarantee of real-world performance.

**Model Updates and Deprecation Disclaimer:** We reserve the right to modify, upgrade, or replace any AI models used in our services at any time. This may include deprecating older models and introducing newer versions as we deem necessary to maintain performance, security, and service quality. While we aim to provide notice when feasible, changes may occur without prior notification.

**Note:** Pricing is subject to change at our discretion.
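Since pricing is quoted in credits per million tokens, the credit cost of a single request is straightforward arithmetic. A minimal sketch, using the `inference-gpt-oss-120b` rates from the table above:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_rate: float, out_rate: float) -> float:
    """Credits for one request; rates are credits per million tokens."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# inference-gpt-oss-120b: 0.1154 credits/M input, 0.4617 credits/M output.
# A request with 10,000 input tokens and 2,000 output tokens:
cost = request_cost(10_000, 2_000, in_rate=0.1154, out_rate=0.4617)
# roughly 0.0021 credits
```

Note that `inference-whisper-large-v3` is billed per minute of audio rather than per token, so this formula does not apply to it.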


# Active Models by Use Case

- **Agent Workflows**: Models designed for autonomous agent operations:
  - DeepSeek V3.2
  - Qwen3 VL 235B
  - GPT OSS 120B
- **Conversational, Code Generation & Multilingual Interactions**: Models optimized for chat, coding, and multilingual tasks:
  - Llama 4 (Maverick, Scout)
  - Qwen (8B, 32B)
  - Gemma (12B)
  - Apertus (70B)
  - Granite (8B)
- **RAG (Retrieval-Augmented Generation) Workflows**: Models tailored for memory and document retrieval:
  - BGE models (M3, Reranker)
  - Granite 278M
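In a RAG workflow, embedding models such as `inference-bge-m3` turn the query and documents into vectors, and retrieval ranks documents by vector similarity. The sketch below shows the similarity step with toy 4-dimensional vectors standing in for real embeddings (which are much longer); the vectors themselves are illustrative, not actual model output.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings from an /embeddings call.
query = [0.1, 0.3, 0.5, 0.1]
docs = {
    "doc_a": [0.1, 0.3, 0.5, 0.1],  # same direction as the query
    "doc_b": [0.5, 0.1, 0.1, 0.3],
}
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
```

In a full pipeline, the top candidates from this similarity pass would then be sent to `inference-bge-reranker` to compute a more precise relevance score before generation.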

# Using the Models

Looking for ready-to-run examples? See the Model Service Guides: