# Model Serving - Active Models

This page lists the AI models currently live on the Phoeniqs Model Service. All models are served on Phoeniqs infrastructure through an OpenAI-compatible API and are ready for production use.
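Because the API is OpenAI-compatible, any OpenAI-style client can talk to it by pointing at the service's base URL. The sketch below builds a `/chat/completions` request with the Python standard library; the base URL and API key are placeholders, not real service values.

```python
import json
import urllib.request

# Placeholder values -- substitute your Phoeniqs Model Service endpoint and key.
BASE_URL = "https://api.example.com/v1"
API_KEY = "YOUR_API_KEY"

def build_chat_request(model: str, messages: list) -> urllib.request.Request:
    """Build an OpenAI-compatible /chat/completions POST request."""
    payload = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=payload,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request(
    "inference-gpt-oss-120b",
    [{"role": "user", "content": "Summarize RAG in one sentence."}],
)
# With a live endpoint, you would then send it:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
```

The same pattern applies to the other endpoints (`/embeddings`, `/audio/transcriptions`, etc.); only the path and payload change.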


# Active Models

| Model Name | Model | Type | Input (Credits/M Tokens) | Output (Credits/M Tokens) | TTPS | Description |
|---|---|---|---|---|---|---|
| inference-apertus-70b | RedHatAI/Apertus-70B-Instruct-2509-quantized.w4a16 | Chat | 0.6195 | 2.2201 | 3134 | Optimized for multilingual dialogue use cases. |
| inference-bge-m3 | BAAI/bge-m3 | Embedding | 0.4309 | --- | 1053 | Optimized for embeddings and sparse retrieval with support for multi-functionality, multi-linguality, and multi-granularity. |
| inference-bge-reranker | BAAI/bge-reranker-v2-m3 | Reranker | 0.0077 | --- | 9048 | Optimized for reranking to compute relevance scores. |
| inference-deepseek-ocr | inference-deepseek-ocr | OCR | 0.3848 | 1.5391 | 1121 | Optimized for Contexts Optical Compression. |
| inference-deepseek-v32 | deepseek-ai/DeepSeek-V3.2 | Chat | 0.6156 | 1.8469 | 5981 | Optimized for reasoning chat completions. |
| inference-gemma-12b-it | RedHatAI/gemma-3-12b-it-quantized.w4a16 | Multimodal | 0.2693 | 0.4309 | 7900 | Optimized for handling text and image input and generating text output. |
| inference-gemma4-31b | RedHatAI/gemma-4-31B-it-FP8-block | Multimodal | 0.118 | 0.325 | 4524 | Optimized for handling text and image input and generating text output. |
| inference-glm45-air-110b | zai-org/GLM-4.5-Air-FP8 | Chat | 0.4232 | 1.6853 | 5470 | Optimized for reasoning chat completions. |
| inference-glm-46-357B | zai-org/GLM-4.6V-FP8 | Chat | TBD | TBD | TBD | Optimized for reasoning chat completions. Model will be deployed on demand. |
| inference-gpt-oss-120b | openai/gpt-oss-120b | Chat | 0.1154 | 0.4617 | 4636 | Optimized for powerful reasoning, agentic tasks, and versatile developer use cases. |
| inference-granite-33-8b | ibm-granite/granite-3.3-8b-instruct | Chat | 0.1539 | 0.1539 | 9998 | Optimized for reasoning and instruction-following capabilities. |
| inference-granite-emb-278m | ibm-granite/granite-embedding-278m-multilingual | Embedding | 0.0770 | --- | 911 | Optimized for embeddings. |
| inference-granite-vision-2b | ibm-granite/granite-vision-3.2-2b | Multimodal | 0.0770 | 0.0770 | 7660 | A compact and efficient vision-language model. |
| inference-llama4-maverick | RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16 | Chat and multimodal | 0.2693 | 1.0773 | 7118 | Optimized for text and multimodal experiences. Maximum of 4 images per prompt; video input is not supported. |
| inference-llama4-scout-17b | RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 | Chat and multimodal | 0.1924 | 0.6387 | 4678 | Optimized for text and multimodal experiences. |
| inference-mistral-v03-7b | mistralai/Mistral-7B-Instruct-v0.3 | Chat only | 0.1539 | 0.1539 | 11436 | Optimized for multilingual dialogue use cases. |
| inference-miner-u25 | opendatalab/MinerU2.5-2509-1.2B | Vision-language | 0.38 | 0.23 | 913 | Optimized for document parsing, achieving state-of-the-art accuracy with high computational efficiency. |
| inference-qwen3-8b | RedHatAI/Qwen3-8B-quantized.w4a16 | Reasoning | 0.0269 | 0.1062 | 9707 | Optimized for thinking and reasoning. |
| inference-qwq-32b | RedHatAI/QwQ-32B-quantized.w8a8 | Reasoning | 0.9234 | 0.9234 | 8304 | Optimized for thinking and reasoning. |
| inference-qwen3-vl-235b | RedHatAI/Qwen3-VL-235B-A22B-Instruct-FP8-dynamic | Multimodal | 0.7003 | 2.0000 | 8111 | Optimized for text and multimodal experiences. |
| inference-whisper-large-v3 | openai/whisper-large-v3 | Speech to Text | 0.006 per minute | NA | TBD | Optimized for automatic speech recognition (ASR) and speech translation. The WhisperX serving stack supports high-throughput ASR, long audio files, speaker diarization and attribution, and word-level alignments. |

**Token Throughput Disclaimer:** The token-per-second throughput figures provided are based on controlled testing conditions and are intended for benchmarking and comparison purposes only. Actual performance in production environments may vary significantly depending on workload characteristics, system configuration, model hosting provider, network conditions, and other operational factors. These results should not be interpreted as a guarantee of real-world performance.

**Model Updates and Deprecation Disclaimer:** We reserve the right to modify, upgrade, or replace any AI models used in our services at any time. This may include deprecating older models and introducing newer versions as we deem necessary to maintain performance, security, and service quality. While we aim to provide notice when feasible, changes may occur without prior notification.

**Note:** Pricing is subject to change at our discretion.
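Since pricing is quoted in credits per million tokens, the credit cost of a single request is straightforward arithmetic. A minimal sketch, using the `inference-gpt-oss-120b` rates from the table above:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_rate: float, out_rate: float) -> float:
    """Credits for one request; rates are credits per million tokens."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# inference-gpt-oss-120b: 0.1154 credits/M input, 0.4617 credits/M output.
# A request with 10,000 input tokens and 2,000 output tokens:
cost = request_cost(10_000, 2_000, in_rate=0.1154, out_rate=0.4617)
# roughly 0.0021 credits
```

Note that `inference-whisper-large-v3` is billed per minute of audio rather than per token, so this formula does not apply to it.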


# Active Models by Use Case

- **Agent Workflows**: Models designed for autonomous agent operations:
  - DeepSeek V3.2
  - Qwen3 VL 235B
  - GPT OSS 120B
- **Conversational, Code Generation & Multilingual Interactions**: Models optimized for chat, coding, and multilingual tasks:
  - Llama 4 (Maverick, Scout)
  - Qwen (8B, 32B)
  - Gemma (12B)
  - Apertus (70B)
  - Granite (8B)
- **RAG (Retrieval-Augmented Generation) Workflows**: Models tailored for memory and document retrieval:
  - BGE models (M3, Reranker)
  - Granite 278M
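In a RAG workflow, embedding models such as `inference-bge-m3` turn the query and documents into vectors, and retrieval ranks documents by vector similarity. The sketch below shows the similarity step with toy 4-dimensional vectors standing in for real embeddings (which are much longer); the vectors themselves are illustrative, not actual model output.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings from an /embeddings call.
query = [0.1, 0.3, 0.5, 0.1]
docs = {
    "doc_a": [0.1, 0.3, 0.5, 0.1],  # same direction as the query
    "doc_b": [0.5, 0.1, 0.1, 0.3],
}
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
```

In a full pipeline, the top candidates from this similarity pass would then be sent to `inference-bge-reranker` to compute a more precise relevance score before generation.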

# Using the Models

Looking for ready-to-run examples? See the Model Service Guides: