In Part 1 we provisioned a Foundry resource and deployed GPT-4.1 mini. In Part 2 we hardened the infrastructure with private endpoints, RBAC, and explored deployment types. But we never asked: is GPT-4.1 mini the right model for the job?

The Foundry model catalog contains dozens of models โ from OpenAI’s latest GPT-5 family, to GPT-4.1, to open-weight models like Meta Llama, Mistral, and DeepSeek. In this post we will:
- Map the catalog โ what’s available and how models are categorised
- Compare the GPT families โ GPT-4.1 vs GPT-5 vs GPT-4o, capabilities and trade-offs
- Explore open-weight models โ Llama, Mistral, DeepSeek, Phi and when they shine
- Benchmark and price โ cost per million tokens, latency, and quality
- Build a decision framework โ a practical flowchart for picking the right model
- Deploy multiple models with Bicep โ extend our template to support side-by-side deployments
All code samples from this series are available in this repository (coming soon).
The Foundry model catalog at a glance #
The model catalog organises models into two buckets:
| Category | Description | Examples |
|---|---|---|
| Models sold directly by Azure | OpenAI models hosted and billed by Microsoft โ deepest integration, highest SLA | GPT-5, GPT-4.1, GPT-4o, o3, o4-mini |
| Models sold by partners | Open-weight and partner models deployed via serverless API or managed compute | Llama 4, Mistral Large, DeepSeek-R1, Phi-4 |
Full catalog: Foundry model catalog
Models sold directly by Azure (OpenAI) #
These are the “first-party” models. You deploy them as format: 'OpenAI' in Bicep and call them via the familiar OpenAI-compatible API.
| Model family | Latest models | Strengths |
|---|---|---|
| GPT-5 | gpt-5, gpt-5-mini | Most capable, strongest reasoning, tools, multi-modal |
| GPT-4.1 | gpt-4.1, gpt-4.1-mini, gpt-4.1-nano | Long context (1M tokens), fast, great for coding |
| GPT-4o | gpt-4o, gpt-4o-mini | Balanced multi-modal, image + audio, widely deployed |
| o-series | o3, o4-mini | Deep reasoning (chain-of-thought), math, science |
Models sold by partners (open-weight) #
These models are available via serverless API (pay-per-token, no infrastructure to manage) or managed compute (dedicated VMs):
| Model | Provider | Parameters | Strengths |
|---|---|---|---|
| Llama 4 Maverick | Meta | 400B (MoE) | Open-weight flagship, strong multilingual |
| Llama 4 Scout | Meta | 109B (MoE) | 10M token context, efficient |
| Mistral Large (25.07) | Mistral AI | 123B | Strong coding and European language support |
| DeepSeek-R1 | DeepSeek | 671B (MoE) | Exceptional reasoning and math |
| Phi-4 | Microsoft | 14B | Small, fast, competitive with much larger models |
| Phi-4-reasoning | Microsoft | 14B | Phi-4 optimised for chain-of-thought reasoning |
GPT family comparison #
Let’s zoom in on the three GPT generations currently available in Foundry.
Capability matrix #
| Capability | GPT-4.1 | GPT-4.1 mini | GPT-5 | GPT-5 mini | GPT-4o |
|---|---|---|---|---|---|
| Max context (input) | 1,047,576 tokens | 1,047,576 tokens | 1,047,576 tokens | 1,047,576 tokens | 128,000 tokens |
| Max output | 32,768 tokens | 32,768 tokens | 64,000 tokens | 16,000 tokens | 16,384 tokens |
| Multi-modal (image) | โ | โ | โ | โ | โ |
| Audio input/output | โ | โ | โ | โ | โ |
| Tool / function calling | โ | โ | โ | โ | โ |
| Structured output (JSON) | โ | โ | โ | โ | โ |
| Reasoning (thinking) | โ | โ | โ (built-in) | โ (built-in) | โ |
| Coding strength | โ โ โ โ โ | โ โ โ โ | โ โ โ โ โ | โ โ โ โ | โ โ โ โ |
| Instruction following | โ โ โ โ โ | โ โ โ โ | โ โ โ โ โ | โ โ โ โ โ | โ โ โ โ |
GPT-5 is the most capable model overall โ it incorporates reasoning natively (no separate o-series call needed) and supports audio. GPT-4.1 excels at coding and long-context tasks with lower cost. GPT-4o remains a solid general-purpose choice with broad multi-modal support.
When to use which GPT #
1 โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
2 โ What's your primary need? โ
3 โโโโโโโโฌโโโโโโโโโโโโโฌโโโโโโโโฌโโโโ
4 โ โ โ
5 Reasoning Coding/Long Multi-modal
6 & complex context (audio/image)
7 tasks
8 โ โ โ
9 โผ โผ โผ
10 GPT-5 GPT-4.1 GPT-5 or
11 GPT-4o
12 โ โ
13 โโโโโโโโโโโโดโโโ โโโโโโโโดโโโโโโโ
14 โ โ โ โ
15 High budget Budget High volume Budget
16 โ aware โ aware
17 โผ โ โผ โ
18 GPT-5 โผ GPT-4.1 โผ
19 GPT-5 (full) GPT-4.1
20 mini mini/nano
Pricing comparison #
Pricing changes frequently โ always check the official pricing page. The table below shows Global Standard pay-per-token rates as of June 2026:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best for |
|---|---|---|---|
| gpt-5 | $10.00 | $30.00 | Hardest tasks, agentic workflows |
| gpt-5-mini | $1.50 | $6.00 | Reasoning on a budget |
| gpt-4.1 | $2.00 | $8.00 | Coding, long-context, production |
| gpt-4.1-mini | $0.40 | $1.60 | High-volume, cost-sensitive |
| gpt-4.1-nano | $0.10 | $0.40 | Classification, extraction, simple tasks |
| gpt-4o | $2.50 | $10.00 | Multi-modal, image understanding |
| gpt-4o-mini | $0.15 | $0.60 | Lightweight multi-modal |
Key insight: GPT-4.1 mini is 25ร cheaper than GPT-5 on input and 18ร cheaper on output. For our product description generator from Part 1, GPT-4.1 mini delivers excellent quality at a fraction of the cost.
Open-weight model pricing (serverless API) #
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|
| Llama 4 Maverick | $0.50 | $0.70 | Competitive with GPT-4.1 mini |
| Llama 4 Scout | $0.27 | $0.35 | Cheapest large-context model |
| Mistral Large (25.07) | $2.00 | $6.00 | Strong European language support |
| DeepSeek-R1 | $0.55 | $2.19 | Best open-weight reasoning |
| Phi-4 | $0.07 | $0.14 | Extremely cheap, good for simple tasks |
When to pick open-weight models #
Open-weight models make sense in specific scenarios:
| Scenario | Recommended model | Why |
|---|---|---|
| Data sovereignty / on-prem needed | Llama 4 or Mistral | Can be deployed on managed compute in your region |
| Extreme cost optimisation | Phi-4 or Llama 4 Scout | Lowest cost per token |
| Advanced reasoning (open-source) | DeepSeek-R1 | Rivals o3 in math/science benchmarks |
| European language focus | Mistral Large | Trained with strong French, German, Spanish support |
| Custom fine-tuning | Phi-4 or Llama 4 | Open weights allow LoRA / full fine-tuning |
| Regulatory / licence requirements | Varies | Some orgs require inspectable model weights |
When to stick with OpenAI models #
- Tightest Azure integration โ Responses API, Agents, built-in content safety
- SLA and support โ Microsoft-backed SLA for GPT models
- Agentic workflows โ GPT-5’s native reasoning + tool use is hard to beat
- Audio and real-time โ only GPT-5 and GPT-4o support audio in/out
Deploying multiple models with Bicep #
In production, you often want multiple models deployed side-by-side โ a powerful model for complex tasks and a cheap one for simple classification. Let’s extend our Bicep template from Part 2.
Define models as an array parameter #
1@description('Models to deploy')
2param models array = [
3 {
4 name: 'gpt-4.1-mini'
5 version: '2025-04-14'
6 sku: 'GlobalStandard'
7 capacity: 30
8 }
9 {
10 name: 'gpt-4.1'
11 version: '2025-04-14'
12 sku: 'GlobalStandard'
13 capacity: 10
14 }
15 {
16 name: 'gpt-5-mini'
17 version: '2025-06-02'
18 sku: 'GlobalStandard'
19 capacity: 10
20 }
21]
Loop over deployments #
1@batchSize(1) // deploy sequentially to avoid race conditions
2resource deployments 'Microsoft.CognitiveServices/accounts/deployments@2025-04-01-preview' = [
3 for model in models: {
4 parent: foundry
5 name: model.name
6 sku: {
7 name: model.sku
8 capacity: model.capacity
9 }
10 properties: {
11 model: {
12 name: model.name
13 format: 'OpenAI'
14 version: model.version
15 }
16 }
17 }
18]
@batchSize(1)is important โ Foundry deployments must be created sequentially. Without it, parallel creation may cause conflicts.
Output all deployment names #
1output deploymentNames array = [for (model, i) in models: deployments[i].name]
Updated parameter file #
1using 'main.bicep'
2
3param baseName = 'foundry-demo'
4param location = 'swedencentral'
5param models = [
6 {
7 name: 'gpt-4.1-mini'
8 version: '2025-04-14'
9 sku: 'GlobalStandard'
10 capacity: 30
11 }
12 {
13 name: 'gpt-5-mini'
14 version: '2025-06-02'
15 sku: 'GlobalStandard'
16 capacity: 10
17 }
18]
Routing requests to the right model in Python #
With multiple models deployed, you can route requests based on task complexity. Here’s a simple router:
1import os
2import json
3from dotenv import load_dotenv
4from azure.identity import DefaultAzureCredential
5from azure.ai.projects import AIProjectClient
6
7load_dotenv()
8
9project = AIProjectClient(
10 endpoint=os.environ["PROJECT_ENDPOINT"],
11 credential=DefaultAzureCredential(),
12)
13openai = project.get_openai_client()
14
15# Model routing map
16MODELS = {
17 "simple": "gpt-4.1-nano", # classification, extraction
18 "standard": "gpt-4.1-mini", # product descriptions, summaries
19 "complex": "gpt-5-mini", # multi-step reasoning, planning
20}
21
22
23def classify_complexity(task: str) -> str:
24 """Use the cheapest model to classify task complexity."""
25 response = openai.responses.create(
26 model=MODELS["simple"],
27 instructions=(
28 "Classify the following task as 'simple', 'standard', or 'complex'. "
29 "Return only the classification word."
30 ),
31 input=task,
32 )
33 classification = response.output_text.strip().lower()
34 return classification if classification in MODELS else "standard"
35
36
37def execute_task(task: str, system_prompt: str) -> str:
38 """Route to the appropriate model based on complexity."""
39 complexity = classify_complexity(task)
40 model = MODELS[complexity]
41
42 print(f"Task complexity: {complexity} โ using {model}")
43
44 response = openai.responses.create(
45 model=model,
46 instructions=system_prompt,
47 input=task,
48 )
49 return response.output_text
50
51
52if __name__ == "__main__":
53 # Simple task โ routed to nano
54 print(execute_task(
55 task="Classify this product as Electronics, Clothing, or Home: 'Wireless Bluetooth Headphones'",
56 system_prompt="You are a product classifier. Return only the category name.",
57 ))
58
59 print("---")
60
61 # Standard task โ routed to mini
62 print(execute_task(
63 task="Write a product description for 'TrailBlazer Pro Hiking Boots' made of leather with Vibram soles",
64 system_prompt="You are an e-commerce copywriter. Write 2-3 compelling sentences.",
65 ))
66
67 print("---")
68
69 # Complex task โ routed to GPT-5 mini
70 print(execute_task(
71 task="Analyse our product catalog and suggest a pricing strategy that maximises revenue while maintaining competitiveness. Consider seasonal trends, competitor pricing, and customer segments.",
72 system_prompt="You are a retail pricing strategist. Provide a structured analysis with recommendations.",
73 ))
Example output:
1Task complexity: simple โ using gpt-4.1-nano
2Electronics
3---
4Task complexity: standard โ using gpt-4.1-mini
5Conquer any terrain with the TrailBlazer Pro Hiking Boots, crafted from premium
6leather for durability that lasts. Equipped with Vibram soles for unmatched grip
7on rocky trails, these boots are your ultimate companion for every adventure.
8---
9Task complexity: complex โ using gpt-5-mini
10## Pricing Strategy Analysis
11...
This pattern can cut costs by 50โ80% compared to sending every request to GPT-5 โ the nano model handles classification for fractions of a cent.
Model comparison cheat sheet #
Here’s a quick-reference decision table:
| Use case | Recommended model | Why |
|---|---|---|
| Product descriptions (our series) | gpt-4.1-mini | Great quality, low cost |
| Code generation / review | gpt-4.1 | Best coding benchmarks at the price |
| Complex multi-step reasoning | gpt-5 | Native reasoning, strongest overall |
| Classification / extraction | gpt-4.1-nano | Cheapest, fast, accurate for simple tasks |
| Math / science problems | o3 or DeepSeek-R1 | Chain-of-thought specialised |
| Chat with images | gpt-4o or gpt-5 | Multi-modal input/output |
| Budget-sensitive high volume | Phi-4 or Llama 4 Scout | Lowest cost per token |
| European language content | Mistral Large | Strong multilingual training |
| Must inspect model weights | Llama 4 / Mistral / Phi-4 | Open-weight licence |
Clean up #
1az group delete --name rg-foundry-demo --yes --no-wait
What’s next? #
In Part 4 we will explore Foundry services โ the Agents framework, Responses API, tools, memory, and how to build real-world agentic applications.
Full series outline #
| # | Topic |
|---|---|
| 1 | Getting started โ Provision with Bicep, deploy GPT, generate descriptions |
| 2 | Bicep deep dive โ networking, RBAC, deployment types, region selection |
| 3 | Foundry model catalog โ comparing GPT-4.1, GPT-5, open-weight models (this post) |
| 4 | Foundry services overview โ agents, Responses API, tools, memory and real-world use cases |
| 5 | Prompt engineering and structured JSON output for product descriptions |
| 6 | Building the Python API โ FastAPI backend with Foundry SDK |
| 7 | Adding a database โ product catalog with PostgreSQL and RAG via Azure AI Search |
| 8 | Content safety, guardrails and Responsible AI |
| 9 | Building the Vue.js frontend โ a full-stack product description generator |
| 10 | CI/CD with GitLab, cost optimization and monitoring |
Stay tuned!