# Model Serving - Supported AI Models

# Overview

This section outlines the AI models deployed on the Phoeniqs infrastructure and exposed through an OpenAI-compatible API. It is intended for engineers, data scientists, and DevOps teams involved in managing model integrations.


# Supported Models

| Model Name | Model | Type | Input (CHF/M Tokens) | Output (CHF/M Tokens) | TTPS | Description |
|---|---|---|---|---|---|---|
| inference-apertus-8b | swiss-ai/Apertus-8B-Instruct-2509 | Chat | 0,1324 | 0,1431 | 11041 | Optimized for multilingual dialogue use cases. |
| inference-apertus-70b | RedHatAI/Apertus-70B-Instruct-2509-quantized.w4a16 | Chat | 0,6195 | 2,2201 | 3134 | Optimized for multilingual dialogue use cases. |
| inference-bge-m3 | BAAI/bge-m3 | Embedding | 0,4309 | --- | 1053 | Optimized for embeddings and sparse retrieval, with support for multi-functionality, multi-linguality, and multi-granularity. |
| inference-bge-reranker | BAAI/bge-reranker-v2-m3 | Reranker | 0,0077 | --- | 9048 | Optimized for reranking to produce relevance scores. |
| inference-deepseekr1-70b | RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16 | Chat | 0,4617 | 0,4617 | 3159 | Optimized for reasoning chat completions. |
| inference-deepseekr1-670b | RedHatAI/DeepSeek-R1-0528-quantized.w4a16 | Chat | 1,96 | 4,57 | 5233 | Optimized for reasoning chat completions. |
| inference-deepseek-ocr | inference-deepseek-ocr | OCR | 0,3848 | 1,5391 | 1121 | Optimized for Contexts Optical Compression. |
| inference-deepseek-v32 | deepseek-ai/DeepSeek-V3.2 | Chat | 0,6156 | 1,8469 | 5981 | Optimized for reasoning chat completions. |
| inference-gemma-12b-it | RedHatAI/gemma-3-12b-it-quantized.w4a16 | Multimodal | 0,2693 | 0,4309 | 7900 | Optimized for handling text and image input and generating text output. |
| inference-gemma4-31b | RedHatAI/gemma-4-31B-it-FP8-block | Multimodal | 0,118 | 0,325 | 4524 | Optimized for handling text and image input and generating text output. |
| inference-glm45-air-110b | zai-org/GLM-4.5-Air-FP8 | Chat | 0,4232 | 1,6853 | 5470 | Optimized for reasoning chat completions. |
| inference-glm-46-357B | zai-org/GLM-4.6V-FP8 | Chat | TBD | TBD | TBD | Optimized for reasoning chat completions. Model will be deployed on demand. |
| inference-gpt-oss-120b | openai/gpt-oss-120b | Chat | 0,1154 | 0,4617 | 4636 | Optimized for powerful reasoning, agentic tasks, and versatile developer use cases. |
| inference-granite-33-8b | ibm-granite/granite-3.3-8b-instruct | Chat | 0,1539 | 0,1539 | 9998 | Optimized for reasoning and instruction-following capabilities. |
| inference-granite-emb-278m | ibm-granite/granite-embedding-278m-multilingual | Embedding | 0,0770 | --- | 911 | Optimized for embeddings. |
| inference-granite-vision-2b | ibm-granite/granite-vision-3.2-2b | Multimodal | 0,0770 | 0,0770 | 7660 | Compact and efficient vision-language model. |
| inference-kimi-k2 | RedHatAI/Kimi-K2-Instruct-quantized.w4a16 | Chat only | 0,77 | 2,31 | 3035 | Optimized for multilingual dialogue use cases. |
| inference-llama33-70b | RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16 | Chat only | 0,5464 | 0,5464 | 3035 | Optimized for multilingual dialogue use cases. |
| inference-llama4-maverick | RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16 | Chat and multimodal | 0,2693 | 1,0773 | 7118 | Optimized for text and multimodal experiences. Max images per prompt is 4; video prompts are not supported. |
| inference-llama4-scout-17b | RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 | Chat and multimodal | 0,1924 | 0,6387 | 4678 | Optimized for text and multimodal experiences. |
| inference-mistral-v03-7b | mistralai/Mistral-7B-Instruct-v0.3 | Chat only | 0,1539 | 0,1539 | 11436 | Optimized for multilingual dialogue use cases. |
| inference-qwen3-8b | RedHatAI/Qwen3-8B-quantized.w4a16 | Reasoning | 0,0269 | 0,1062 | 9707 | Optimized for thinking and reasoning. |
| inference-qwq-32b | RedHatAI/QwQ-32B-quantized.w8a8 | Reasoning | 0,9234 | 0,9234 | 8304 | Optimized for thinking and reasoning. |
| inference-qwq25-vl-72b | RedHatAI/Qwen2.5-VL-72B-Instruct-quantized.w4a16 | Multimodal | 0,8465 | 0,8465 | 3076 | Compact and efficient vision-language model. |
| inference-qwen3-vl-235b | RedHatAI/Qwen3-VL-235B-A22B-Instruct-FP8-dynamic | Multimodal | 0,7003 | 2,0000 | 8111 | Optimized for text and multimodal experiences. |
| whisperx | openai/whisper-large-v3 | Speech to Text | 0.006 per minute | NA | TBD | Optimized for automatic speech recognition (ASR) and speech translation. The WhisperX deployment supports high throughput, long audio files, speaker diarization and attribution, and word-level alignments. |

**Token Throughput Disclaimer:** The token-per-second throughput (TTPS) figures provided are based on controlled testing conditions and are intended for benchmarking and comparison purposes only. Actual performance in production environments may vary significantly depending on workload characteristics, system configuration, model hosting provider, network conditions, and other operational factors. These results should not be interpreted as a guarantee of real-world performance.

**Model Updates and Deprecation Disclaimer:** We reserve the right to modify, upgrade, or replace any AI models used in our services at any time. This may include deprecating older models and introducing newer versions as we deem necessary to maintain performance, security, and service quality. While we aim to provide notice when feasible, changes may occur without prior notification.

**NOTE:** Pricing is subject to change at our discretion.
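Because prices in the table are quoted in CHF per million tokens, the cost of a single request can be estimated by scaling the token counts. The sketch below uses a hypothetical helper, `estimate_cost_chf`, with the `inference-apertus-8b` prices from the table (decimal commas rewritten as decimal points):

```python
# Estimate the CHF cost of one request from per-million-token prices.
def estimate_cost_chf(input_tokens, output_tokens,
                      input_price_per_m, output_price_per_m):
    """Cost in CHF, given prices quoted as CHF per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Example: inference-apertus-8b (0.1324 in / 0.1431 out, CHF per M tokens)
cost = estimate_cost_chf(2_000, 500, 0.1324, 0.1431)
```

Actual billing may round or meter differently; treat this as a rough estimate only.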


# Supported Models by Use Case

- **Agent Workflows** - Models designed for autonomous agent operations:
  - DeepSeek V3.2
  - Qwen3 VL 235B
  - GPT OSS 120B
- **Conversational, Code Generation & Multilingual Interactions** - Models optimized for chat, coding, and multilingual tasks:
  - Llama4 (Maverick, Scout), Llama3 (70B)
  - DeepSeek (70B)
  - Qwen (8B, 32B)
  - Gemma (12B)
  - Apertus (8B, 70B)
  - Granite (8B)
- **RAG (Retrieval-Augmented Generation) Workflows** - Models tailored for memory and document retrieval:
  - BGE Models (M3, Re-ranker)
  - Granite 278M

# How to Run Inference on an AI Model

To perform inference (i.e., generate responses or predictions) using a deployed AI model, you typically need the following components:


  1. Model Base URL

This is the API endpoint or base URL where your inference requests are sent.

**Base API URL:** https://maas.phoeniqs.com/


  2. Model Name

This specifies the exact model you're using.

Use the Model Name value from the Supported Models table above in your API calls to select the desired model.

Examples:

  - `inference-llama4-maverick`
  - `inference-bge-m3`

  3. API Key

A secure token used to authenticate your requests.

  - Pass it in the HTTP header: `Authorization: Bearer YOUR_API_KEY`
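The three components above can be assembled into a ready-to-send request spec. `build_request` below is a hypothetical helper, not part of any SDK; it only combines the base URL, path, and API key:

```python
# Assemble URL and headers for an inference call from the three components.
BASE_URL = "https://maas.phoeniqs.com"  # Base API URL from above

def build_request(path, api_key, body=None):
    """Return (url, headers, body) for a call to the inference API."""
    url = f"{BASE_URL.rstrip('/')}/{path.lstrip('/')}"
    headers = {
        "Authorization": f"Bearer {api_key}",   # API key in the Bearer header
        "Content-Type": "application/json",
    }
    return url, headers, body

url, headers, _ = build_request("v1/models", "YOUR_API_KEY")
```

Any HTTP client can then send the resulting request; the curl examples below show the same calls directly.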

# Different Sample Calls

# 1. Call to list available models

```shell
curl --location 'https://maas.phoeniqs.com/v1/models' \
--header 'Authorization: Bearer <API_Key>'
```

# 2. Sample Calls to Chat Completion

```shell
curl --location 'https://maas.phoeniqs.com/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <API_Key>' \
--data '{
  "model": "inference-llama4-maverick",
  "messages": [
    { "role": "user", "content": "How do I make sourdough bread?" }
  ],
  "temperature": 0.7
}'
```

# 3. Sample Calls to Embeddings

**Option A**

```shell
curl --location 'https://maas.phoeniqs.com/v1/embeddings' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <API_Key>' \
--data '{
  "model": "inference-bge-m3",
  "input": "OpenAI develops AI models that understand and generate text."
}'
```

**Option B**

```shell
curl --location 'https://maas.phoeniqs.com/embeddings' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <API_Key>' \
--data '{
  "model": "inference-bge-m3",
  "input": "OpenAI develops AI models that understand and generate text."
}'
```

**NOTE:** Normally `v1/embeddings` should work; if it does not, try `embeddings` in the model URL path instead.
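The try-then-fallback order from the note can be encoded as an ordered list of candidate URLs, attempted in sequence until one succeeds. `candidate_embedding_urls` is a hypothetical helper:

```python
# Candidate embedding endpoints, in the order the note suggests trying them.
BASE_URL = "https://maas.phoeniqs.com"

def candidate_embedding_urls(base_url=BASE_URL):
    """Return embedding URLs to try: /v1/embeddings first, then /embeddings."""
    base = base_url.rstrip("/")
    return [f"{base}/v1/embeddings", f"{base}/embeddings"]
```

A client would POST to each URL in order and stop at the first non-error response.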

# 4. Sample Calls to MultiModal

```shell
curl --location 'https://maas.phoeniqs.com/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <API_Key>' \
--data '{
  "model": "inference-granite-vision-2b",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What is shown in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Example.jpg/800px-Example.jpg"
          }
        }
      ]
    }
  ],
  "temperature": 0.7
}'
```
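The multimodal payload above mixes a text part and an image part inside one user message. A minimal Python sketch of building that structure, using a hypothetical helper `vision_message`:

```python
# Build an OpenAI-style user message that pairs text with an image URL.
def vision_message(text, image_url):
    """Return a user message containing a text part and an image_url part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

payload = {
    "model": "inference-granite-vision-2b",
    "messages": [vision_message(
        "What is shown in this image?",
        "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Example.jpg/800px-Example.jpg",
    )],
    "temperature": 0.7,
}
```

Serialized to JSON, this payload matches the body of the curl call above.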

# 5. Sample Calls to OCR model

```shell
curl --location 'https://maas.phoeniqs.com/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <API_Key>' \
--data '{
  "model": "inference-deepseek-ocr",
  "messages": [
    {
      "type": "image_url" == null || true ? "" : "",
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
          }
        },
        {
          "type": "text",
          "text": "Free OCR."
        }
      ]
    }
  ],
  "max_tokens": 2048,
  "temperature": 0.0
}'
```

# 6. Sample Calls to Kimi-K2 model

Kimi-K2 has a special requirement: calls to it must include an additional argument, `stop_token_ids`, which must be passed with the value `163586`.

```shell
curl --location 'https://maas.phoeniqs.com/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <API_Key>' \
--data '{
     "model": "inference-kimi-k2",
     "messages": [
        {"role": "user", "content": "How are you?"}
     ],
     "temperature": 0.7,
     "stop_token_ids": [163586]
}'
```
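Model-specific requirements like this are easy to forget when a client targets several models; a small sketch of centralizing them, using a hypothetical wrapper `with_model_quirks` that injects `stop_token_ids` only for `inference-kimi-k2`:

```python
# Inject model-specific required fields before sending a payload.
KIMI_STOP_TOKEN_IDS = [163586]  # required value documented above

def with_model_quirks(payload):
    """Return a copy of the payload with model-specific required fields added."""
    payload = dict(payload)  # shallow copy; do not mutate the caller's dict
    if payload.get("model") == "inference-kimi-k2":
        payload["stop_token_ids"] = KIMI_STOP_TOKEN_IDS
    return payload
```

Other models pass through unchanged, so the wrapper can sit in front of every chat-completion call.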