# Model Serving - Supported AI Models

# Overview

This section outlines the AI models deployed on the Phoeniqs infrastructure and exposed through an OpenAI-compatible API. It is intended for engineers, data scientists, and DevOps teams involved in managing model integrations.


# Supported Models

| Model Name | Model | Type | Input (CHF/M Tokens) | Output (CHF/M Tokens) | TTPS | Description |
|---|---|---|---|---|---|---|
| inference-apertus-8b | swiss-ai/Apertus-8B-Instruct-2509 | Chat | 0,1324 | 0,1431 | 11041 | Optimized for multilingual dialogue use cases. |
| inference-apertus-70b | RedHatAI/Apertus-70B-Instruct-2509-quantized.w4a16 | Chat | 0,6195 | 2,2201 | 3134 | Optimized for multilingual dialogue use cases. |
| inference-bge-m3 | BAAI/bge-m3 | Embedding | 0,4309 | --- | 1053 | Optimized for embeddings and sparse retrieval, with support for multi-functionality, multi-linguality, and multi-granularity. |
| inference-bge-reranker | BAAI/bge-reranker-v2-m3 | Reranker | 0,0077 | --- | 9048 | Optimized for reranking to produce relevance scores. |
| inference-deepseekr1-70b | RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16 | Chat | 0,4617 | 0,4617 | 3159 | Optimized for reasoning chat completions. |
| inference-deepseekr1-670b | RedHatAI/DeepSeek-R1-0528-quantized.w4a16 | Chat | 1,96 | 4,57 | 5233 | Optimized for reasoning chat completions. |
| inference-deepseek-ocr | inference-deepseek-ocr | OCR | 0,3848 | 1,5391 | 1121 | Optimized for Contexts Optical Compression. |
| inference-deepseek-v32 | deepseek-ai/DeepSeek-V3.2 | Chat | 0,6156 | 1,8469 | 5981 | Optimized for reasoning chat completions. |
| inference-gemma-12b-it | RedHatAI/gemma-3-12b-it-quantized.w4a16 | Multimodal | 0,2693 | 0,4309 | 7900 | Optimized for handling text and image input and generating text output. |
| inference-gemma4-31b | RedHatAI/gemma-4-31B-it-FP8-block | Multimodal | 0,118 | 0,325 | 4524 | Optimized for handling text and image input and generating text output. |
| inference-glm45-air-110b | zai-org/GLM-4.5-Air-FP8 | Chat | 0,4232 | 1,6853 | 5470 | Optimized for reasoning chat completions. |
| inference-glm-46-357B | zai-org/GLM-4.6V-FP8 | Chat | TBD | TBD | TBD | Optimized for reasoning chat completions. Model will be deployed on demand. |
| inference-gpt-oss-120b | openai/gpt-oss-120b | Chat | 0,1154 | 0,4617 | 4636 | Optimized for powerful reasoning, agentic tasks, and versatile developer use cases. |
| inference-granite-33-8b | ibm-granite/granite-3.3-8b-instruct | Chat | 0,1539 | 0,1539 | 9998 | Optimized for reasoning and instruction-following capabilities. |
| inference-granite-emb-278m | ibm-granite/granite-embedding-278m-multilingual | Embedding | 0,0770 | --- | 911 | Optimized for embeddings. |
| inference-granite-vision-2b | ibm-granite/granite-vision-3.2-2b | Multimodal | 0,0770 | 0,0770 | 7660 | Compact and efficient vision-language model. |
| inference-kimi-k2 | RedHatAI/Kimi-K2-Instruct-quantized.w4a16 | Chat only | 0,77 | 2,31 | 3035 | Optimized for multilingual dialogue use cases. |
| inference-llama33-70b | RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16 | Chat only | 0,5464 | 0,5464 | 3035 | Optimized for multilingual dialogue use cases. |
| inference-llama4-maverick | RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16 | Chat and multimodal | 0,2693 | 1,0773 | 7118 | Optimized for text and multimodal experiences. Max images per prompt is 4; video prompts are not supported. |
| inference-llama4-scout-17b | RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 | Chat and multimodal | 0,1924 | 0,6387 | 4678 | Optimized for text and multimodal experiences. |
| inference-mistral-v03-7b | mistralai/Mistral-7B-Instruct-v0.3 | Chat only | 0,1539 | 0,1539 | 11436 | Optimized for multilingual dialogue use cases. |
| inference-qwen3-8b | RedHatAI/Qwen3-8B-quantized.w4a16 | Reasoning | 0,0269 | 0,1062 | 9707 | Optimized for thinking and reasoning. |
| inference-qwq-32b | RedHatAI/QwQ-32B-quantized.w8a8 | Reasoning | 0,9234 | 0,9234 | 8304 | Optimized for thinking and reasoning. |
| inference-qwq25-vl-72b | RedHatAI/Qwen2.5-VL-72B-Instruct-quantized.w4a16 | Multimodal | 0,8465 | 0,8465 | 3076 | Compact and efficient vision-language model. |
| inference-qwen3-vl-235b | RedHatAI/Qwen3-VL-235B-A22B-Instruct-FP8-dynamic | Multimodal | 0,7003 | 2,0000 | 8111 | Optimized for text and multimodal experiences. |
| whisperx | openai/whisper-large-v3 | Speech to Text | 0.006 per minute | NA | TBD | Optimized for automatic speech recognition (ASR) and speech translation. The WhisperX deployment supports high throughput, long audio files, speaker diarization and attribution, and word-level alignments. |

**Token Throughput Disclaimer:** The token-per-second throughput (TTPS) figures provided are based on controlled testing conditions and are intended for benchmarking and comparison purposes only. Actual performance in production environments may vary significantly depending on workload characteristics, system configuration, model hosting provider, network conditions, and other operational factors. These results should not be interpreted as a guarantee of real-world performance.

**Model Updates and Deprecation Disclaimer:** We reserve the right to modify, upgrade, or replace any AI models used in our services at any time. This may include deprecating older models and introducing newer versions as we deem necessary to maintain performance, security, and service quality. While we aim to provide notice when feasible, changes may occur without prior notification.

**NOTE:** Pricing is subject to change at our discretion.
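Because prices in the table are quoted in CHF per million tokens, the cost of a single request can be estimated by scaling the token counts. The sketch below uses a hypothetical helper, `estimate_cost_chf`, with the `inference-apertus-8b` prices from the table (decimal commas rewritten as decimal points):

```python
# Estimate the CHF cost of one request from per-million-token prices.
def estimate_cost_chf(input_tokens, output_tokens,
                      input_price_per_m, output_price_per_m):
    """Cost in CHF, given prices quoted as CHF per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Example: inference-apertus-8b (0.1324 in / 0.1431 out, CHF per M tokens)
cost = estimate_cost_chf(2_000, 500, 0.1324, 0.1431)
```

Actual billing may round or meter differently; treat this as a rough estimate only.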


# Supported Models by Use Case

- **Agent Workflows** - Models designed for autonomous agent operations:
  - DeepSeek V3.2
  - Qwen3 VL 235B
  - GPT OSS 120B
- **Conversational, Code Generation & Multilingual Interactions** - Models optimized for chat, coding, and multilingual tasks:
  - Llama4 (Maverick, Scout), Llama3 (70B)
  - DeepSeek (70B)
  - Qwen (8B, 32B)
  - Gemma (12B)
  - Apertus (8B, 70B)
  - Granite (8B)
- **RAG (Retrieval-Augmented Generation) Workflows** - Models tailored for memory and document retrieval:
  - BGE Models (M3, Re-ranker)
  - Granite 278M

# How to Run Inference on an AI Model

To perform inference (i.e., generate responses or predictions) using a deployed AI model, you typically need the following components:


  1. Model Base URL

This is the API endpoint or base URL where your inference requests are sent.

**Base API URL:** https://maas.phoeniqs.com/


  2. Model Name

This specifies the exact model you're using.

Use the Model Name value from the Supported Models table above in your API calls to select the desired model.

Examples:

  - `inference-llama4-maverick`
  - `inference-bge-m3`

  3. API Key

A secure token used to authenticate your requests.

  - Pass it in the HTTP header: `Authorization: Bearer YOUR_API_KEY`
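The three components above can be assembled into a ready-to-send request spec. `build_request` below is a hypothetical helper, not part of any SDK; it only combines the base URL, path, and API key:

```python
# Assemble URL and headers for an inference call from the three components.
BASE_URL = "https://maas.phoeniqs.com"  # Base API URL from above

def build_request(path, api_key, body=None):
    """Return (url, headers, body) for a call to the inference API."""
    url = f"{BASE_URL.rstrip('/')}/{path.lstrip('/')}"
    headers = {
        "Authorization": f"Bearer {api_key}",   # API key in the Bearer header
        "Content-Type": "application/json",
    }
    return url, headers, body

url, headers, _ = build_request("v1/models", "YOUR_API_KEY")
```

Any HTTP client can then send the resulting request; the curl examples below show the same calls directly.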

# Different Sample Calls

# 1. Call to list available models

```shell
curl --location 'https://maas.phoeniqs.com/v1/models' \
--header 'Authorization: Bearer <API_Key>'
```

# 2. Sample Calls to Chat Completion

```shell
curl --location 'https://maas.phoeniqs.com/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <API_Key>' \
--data '{
  "model": "inference-llama4-maverick",
  "messages": [
    { "role": "user", "content": "How do I make sourdough bread?" }
  ],
  "temperature": 0.7
}'
```

# 3. Sample Calls to Embeddings

**Option A**

```shell
curl --location 'https://maas.phoeniqs.com/v1/embeddings' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <API_Key>' \
--data '{
  "model": "inference-bge-m3",
  "input": "OpenAI develops AI models that understand and generate text."
}'
```

**Option B**

```shell
curl --location 'https://maas.phoeniqs.com/embeddings' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <API_Key>' \
--data '{
  "model": "inference-bge-m3",
  "input": "OpenAI develops AI models that understand and generate text."
}'
```

**NOTE:** Normally `v1/embeddings` should work; if it does not, try `embeddings` in the model URL path instead.
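The try-then-fallback order from the note can be encoded as an ordered list of candidate URLs, attempted in sequence until one succeeds. `candidate_embedding_urls` is a hypothetical helper:

```python
# Candidate embedding endpoints, in the order the note suggests trying them.
BASE_URL = "https://maas.phoeniqs.com"

def candidate_embedding_urls(base_url=BASE_URL):
    """Return embedding URLs to try: /v1/embeddings first, then /embeddings."""
    base = base_url.rstrip("/")
    return [f"{base}/v1/embeddings", f"{base}/embeddings"]
```

A client would POST to each URL in order and stop at the first non-error response.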

# 4. Sample Calls to MultiModal

```shell
curl --location 'https://maas.phoeniqs.com/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <API_Key>' \
--data '{
  "model": "inference-granite-vision-2b",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What is shown in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Example.jpg/800px-Example.jpg"
          }
        }
      ]
    }
  ],
  "temperature": 0.7
}'
```
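The multimodal payload above mixes a text part and an image part inside one user message. A minimal Python sketch of building that structure, using a hypothetical helper `vision_message`:

```python
# Build an OpenAI-style user message that pairs text with an image URL.
def vision_message(text, image_url):
    """Return a user message containing a text part and an image_url part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

payload = {
    "model": "inference-granite-vision-2b",
    "messages": [vision_message(
        "What is shown in this image?",
        "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Example.jpg/800px-Example.jpg",
    )],
    "temperature": 0.7,
}
```

Serialized to JSON, this payload matches the body of the curl call above.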

# 5. Sample Calls to OCR model

```shell
curl --location 'https://maas.phoeniqs.com/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <API_Key>' \
--data '{
  "model": "inference-deepseek-ocr",
  "messages": [
    {
      "type": "image_url" == null || true ? "" : "",
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
          }
        },
        {
          "type": "text",
          "text": "Free OCR."
        }
      ]
    }
  ],
  "max_tokens": 2048,
  "temperature": 0.0
}'
```

# 6. Sample Calls to Kimi-K2 model

Kimi-K2 has a special requirement: calls to it must include an additional argument, `stop_token_ids`, which must be passed with the value `163586`.

```shell
curl --location 'https://maas.phoeniqs.com/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <API_Key>' \
--data '{
     "model": "inference-kimi-k2",
     "messages": [
        {"role": "user", "content": "How are you?"}
     ],
     "temperature": 0.7,
     "stop_token_ids": [163586]
}'
```
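Model-specific requirements like this are easy to forget when a client targets several models; a small sketch of centralizing them, using a hypothetical wrapper `with_model_quirks` that injects `stop_token_ids` only for `inference-kimi-k2`:

```python
# Inject model-specific required fields before sending a payload.
KIMI_STOP_TOKEN_IDS = [163586]  # required value documented above

def with_model_quirks(payload):
    """Return a copy of the payload with model-specific required fields added."""
    payload = dict(payload)  # shallow copy; do not mutate the caller's dict
    if payload.get("model") == "inference-kimi-k2":
        payload["stop_token_ids"] = KIMI_STOP_TOKEN_IDS
    return payload
```

Other models pass through unchanged, so the wrapper can sit in front of every chat-completion call.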