Foundry Model Catalog — Comparing GPT-4.1, GPT-5, Open-Weight Models and When to Use What - DevOps and application security enthusiast's notes

In Part 1 we provisioned a Foundry resource and deployed GPT-4.1 mini. In Part 2 we hardened the infrastructure with private endpoints, RBAC, and explored deployment types. But we never asked: is GPT-4.1 mini the right model for the job?

The Foundry model catalog contains dozens of models — from OpenAI’s latest GPT-5 family, to GPT-4.1, to open-weight models like Meta Llama, Mistral, and DeepSeek. In this post we will:

Map the catalog — what’s available and how models are categorised
Compare the GPT families — GPT-4.1 vs GPT-5 vs GPT-4o, capabilities and trade-offs
Explore open-weight models — Llama, Mistral, DeepSeek, Phi and when they shine
Benchmark and price — cost per million tokens, latency, and quality
Build a decision framework — a practical flowchart for picking the right model
Deploy multiple models with Bicep — extend our template to support side-by-side deployments

All code samples from this series are available in this repository (coming soon).

The Foundry model catalog at a glance #

The model catalog organises models into two buckets:

Category	Description	Examples
Models sold directly by Azure	OpenAI models hosted and billed by Microsoft — deepest integration, highest SLA	GPT-5, GPT-4.1, GPT-4o, o3, o4-mini
Models sold by partners	Open-weight and partner models deployed via serverless API or managed compute	Llama 4, Mistral Large, DeepSeek-R1, Phi-4

Full catalog: Foundry model catalog

Models sold directly by Azure (OpenAI) #

These are the “first-party” models. You deploy them as format: 'OpenAI' in Bicep and call them via the familiar OpenAI-compatible API.

Model family	Latest models	Strengths
GPT-5	gpt-5, gpt-5-mini	Most capable, strongest reasoning, tools, multi-modal
GPT-4.1	gpt-4.1, gpt-4.1-mini, gpt-4.1-nano	Long context (1M tokens), fast, great for coding
GPT-4o	gpt-4o, gpt-4o-mini	Balanced multi-modal, image + audio, widely deployed
o-series	o3, o4-mini	Deep reasoning (chain-of-thought), math, science

Models sold by partners (open-weight) #

These models are available via serverless API (pay-per-token, no infrastructure to manage) or managed compute (dedicated VMs):

Model	Provider	Parameters	Strengths
Llama 4 Maverick	Meta	400B (MoE)	Open-weight flagship, strong multilingual
Llama 4 Scout	Meta	109B (MoE)	10M token context, efficient
Mistral Large (25.07)	Mistral AI	123B	Strong coding and European language support
DeepSeek-R1	DeepSeek	671B (MoE)	Exceptional reasoning and math
Phi-4	Microsoft	14B	Small, fast, competitive with much larger models
Phi-4-reasoning	Microsoft	14B	Phi-4 optimised for chain-of-thought reasoning

GPT family comparison #

Let’s zoom in on the three GPT generations currently available in Foundry.

Capability matrix #

Capability	GPT-4.1	GPT-4.1 mini	GPT-5	GPT-5 mini	GPT-4o
Max context (input)	1,047,576 tokens	1,047,576 tokens	1,047,576 tokens	1,047,576 tokens	128,000 tokens
Max output	32,768 tokens	32,768 tokens	64,000 tokens	16,000 tokens	16,384 tokens
Multi-modal (image)	✅	✅	✅	✅	✅
Audio input/output	—	—	✅	—	✅
Tool / function calling	✅	✅	✅	✅	✅
Structured output (JSON)	✅	✅	✅	✅	✅
Reasoning (thinking)	—	—	✅ (built-in)	✅ (built-in)	—
Coding strength	★★★★★	★★★★	★★★★★	★★★★	★★★★
Instruction following	★★★★★	★★★★	★★★★★	★★★★★	★★★★

GPT-5 is the most capable model overall — it incorporates reasoning natively (no separate o-series call needed) and supports audio. GPT-4.1 excels at coding and long-context tasks with lower cost. GPT-4o remains a solid general-purpose choice with broad multi-modal support.

When to use which GPT #

 1                         ┌───────────────────────────────┐
 2                         │  What's your primary need?     │
 3                         └──────┬────────────┬───────┬───┘
 4                                │            │       │
 5                          Reasoning    Coding/Long   Multi-modal
 6                          & complex    context       (audio/image)
 7                          tasks                     
 8                                │            │       │
 9                                ▼            ▼       ▼
10                            GPT-5       GPT-4.1    GPT-5 or
11                                                   GPT-4o
12                                │            │
13                     ┌──────────┴──┐  ┌──────┴──────┐
14                     │             │  │             │
15                High budget   Budget  High volume  Budget
16                     │        aware   │            aware
17                     ▼             │  ▼                │
18                  GPT-5            ▼  GPT-4.1          ▼
19                              GPT-5   (full)      GPT-4.1
20                              mini                 mini/nano

Pricing comparison #

Pricing changes frequently — always check the official pricing page. The table below shows Global Standard pay-per-token rates as of June 2026:

Model	Input (per 1M tokens)	Output (per 1M tokens)	Best for
gpt-5	$10.00	$30.00	Hardest tasks, agentic workflows
gpt-5-mini	$1.50	$6.00	Reasoning on a budget
gpt-4.1	$2.00	$8.00	Coding, long-context, production
gpt-4.1-mini	$0.40	$1.60	High-volume, cost-sensitive
gpt-4.1-nano	$0.10	$0.40	Classification, extraction, simple tasks
gpt-4o	$2.50	$10.00	Multi-modal, image understanding
gpt-4o-mini	$0.15	$0.60	Lightweight multi-modal

Key insight: GPT-4.1 mini is 25× cheaper than GPT-5 on input and 18× cheaper on output. For our product description generator from Part 1, GPT-4.1 mini delivers excellent quality at a fraction of the cost.

Open-weight model pricing (serverless API) #

Model	Input (per 1M tokens)	Output (per 1M tokens)	Notes
Llama 4 Maverick	$0.50	$0.70	Competitive with GPT-4.1 mini
Llama 4 Scout	$0.27	$0.35	Cheapest large-context model
Mistral Large (25.07)	$2.00	$6.00	Strong European language support
DeepSeek-R1	$0.55	$2.19	Best open-weight reasoning
Phi-4	$0.07	$0.14	Extremely cheap, good for simple tasks

When to pick open-weight models #

Open-weight models make sense in specific scenarios:

Scenario	Recommended model	Why
Data sovereignty / on-prem needed	Llama 4 or Mistral	Can be deployed on managed compute in your region
Extreme cost optimisation	Phi-4 or Llama 4 Scout	Lowest cost per token
Advanced reasoning (open-source)	DeepSeek-R1	Rivals o3 in math/science benchmarks
European language focus	Mistral Large	Trained with strong French, German, Spanish support
Custom fine-tuning	Phi-4 or Llama 4	Open weights allow LoRA / full fine-tuning
Regulatory / licence requirements	Varies	Some orgs require inspectable model weights

When to stick with OpenAI models #

Tightest Azure integration — Responses API, Agents, built-in content safety
SLA and support — Microsoft-backed SLA for GPT models
Agentic workflows — GPT-5’s native reasoning + tool use is hard to beat
Audio and real-time — only GPT-5 and GPT-4o support audio in/out

Deploying multiple models with Bicep #

In production, you often want multiple models deployed side-by-side — a powerful model for complex tasks and a cheap one for simple classification. Let’s extend our Bicep template from Part 2.

Define models as an array parameter #

 1@description('Models to deploy')
 2param models array = [
 3  {
 4    name: 'gpt-4.1-mini'
 5    version: '2025-04-14'
 6    sku: 'GlobalStandard'
 7    capacity: 30
 8  }
 9  {
10    name: 'gpt-4.1'
11    version: '2025-04-14'
12    sku: 'GlobalStandard'
13    capacity: 10
14  }
15  {
16    name: 'gpt-5-mini'
17    version: '2025-06-02'
18    sku: 'GlobalStandard'
19    capacity: 10
20  }
21]

Loop over deployments #

 1@batchSize(1) // deploy sequentially to avoid race conditions
 2resource deployments 'Microsoft.CognitiveServices/accounts/deployments@2025-04-01-preview' = [
 3  for model in models: {
 4    parent: foundry
 5    name: model.name
 6    sku: {
 7      name: model.sku
 8      capacity: model.capacity
 9    }
10    properties: {
11      model: {
12        name: model.name
13        format: 'OpenAI'
14        version: model.version
15      }
16    }
17  }
18]

@batchSize(1) is important — Foundry deployments must be created sequentially. Without it, parallel creation may cause conflicts.

Output all deployment names #

1output deploymentNames array = [for (model, i) in models: deployments[i].name]

Updated parameter file #

 1using 'main.bicep'
 2
 3param baseName = 'foundry-demo'
 4param location = 'swedencentral'
 5param models = [
 6  {
 7    name: 'gpt-4.1-mini'
 8    version: '2025-04-14'
 9    sku: 'GlobalStandard'
10    capacity: 30
11  }
12  {
13    name: 'gpt-5-mini'
14    version: '2025-06-02'
15    sku: 'GlobalStandard'
16    capacity: 10
17  }
18]

Routing requests to the right model in Python #

With multiple models deployed, you can route requests based on task complexity. Here’s a simple router:

 1import os
 2import json
 3from dotenv import load_dotenv
 4from azure.identity import DefaultAzureCredential
 5from azure.ai.projects import AIProjectClient
 6
 7load_dotenv()
 8
 9project = AIProjectClient(
10    endpoint=os.environ["PROJECT_ENDPOINT"],
11    credential=DefaultAzureCredential(),
12)
13openai = project.get_openai_client()
14
15# Model routing map
16MODELS = {
17    "simple": "gpt-4.1-nano",     # classification, extraction
18    "standard": "gpt-4.1-mini",   # product descriptions, summaries
19    "complex": "gpt-5-mini",      # multi-step reasoning, planning
20}
21
22
23def classify_complexity(task: str) -> str:
24    """Use the cheapest model to classify task complexity."""
25    response = openai.responses.create(
26        model=MODELS["simple"],
27        instructions=(
28            "Classify the following task as 'simple', 'standard', or 'complex'. "
29            "Return only the classification word."
30        ),
31        input=task,
32    )
33    classification = response.output_text.strip().lower()
34    return classification if classification in MODELS else "standard"
35
36
37def execute_task(task: str, system_prompt: str) -> str:
38    """Route to the appropriate model based on complexity."""
39    complexity = classify_complexity(task)
40    model = MODELS[complexity]
41
42    print(f"Task complexity: {complexity} → using {model}")
43
44    response = openai.responses.create(
45        model=model,
46        instructions=system_prompt,
47        input=task,
48    )
49    return response.output_text
50
51
52if __name__ == "__main__":
53    # Simple task → routed to nano
54    print(execute_task(
55        task="Classify this product as Electronics, Clothing, or Home: 'Wireless Bluetooth Headphones'",
56        system_prompt="You are a product classifier. Return only the category name.",
57    ))
58
59    print("---")
60
61    # Standard task → routed to mini
62    print(execute_task(
63        task="Write a product description for 'TrailBlazer Pro Hiking Boots' made of leather with Vibram soles",
64        system_prompt="You are an e-commerce copywriter. Write 2-3 compelling sentences.",
65    ))
66
67    print("---")
68
69    # Complex task → routed to GPT-5 mini
70    print(execute_task(
71        task="Analyse our product catalog and suggest a pricing strategy that maximises revenue while maintaining competitiveness. Consider seasonal trends, competitor pricing, and customer segments.",
72        system_prompt="You are a retail pricing strategist. Provide a structured analysis with recommendations.",
73    ))

Example output:

 1Task complexity: simple → using gpt-4.1-nano
 2Electronics
 3---
 4Task complexity: standard → using gpt-4.1-mini
 5Conquer any terrain with the TrailBlazer Pro Hiking Boots, crafted from premium 
 6leather for durability that lasts. Equipped with Vibram soles for unmatched grip 
 7on rocky trails, these boots are your ultimate companion for every adventure.
 8---
 9Task complexity: complex → using gpt-5-mini
10## Pricing Strategy Analysis
11...

This pattern can cut costs by 50–80% compared to sending every request to GPT-5 — the nano model handles classification for fractions of a cent.

Model comparison cheat sheet #

Here’s a quick-reference decision table:

Use case	Recommended model	Why
Product descriptions (our series)	gpt-4.1-mini	Great quality, low cost
Code generation / review	gpt-4.1	Best coding benchmarks at the price
Complex multi-step reasoning	gpt-5	Native reasoning, strongest overall
Classification / extraction	gpt-4.1-nano	Cheapest, fast, accurate for simple tasks
Math / science problems	o3 or DeepSeek-R1	Chain-of-thought specialised
Chat with images	gpt-4o or gpt-5	Multi-modal input/output
Budget-sensitive high volume	Phi-4 or Llama 4 Scout	Lowest cost per token
European language content	Mistral Large	Strong multilingual training
Must inspect model weights	Llama 4 / Mistral / Phi-4	Open-weight licence

Clean up #

1az group delete --name rg-foundry-demo --yes --no-wait

What’s next? #

In Part 4 we explore Foundry services — the Agents framework, Responses API, tools, memory, and how to build real-world agentic applications.

Full series outline #

#	Topic
1	Getting started — Provision with Bicep, deploy GPT, generate descriptions
2	Bicep deep dive — networking, RBAC, deployment types, region selection
3	Foundry model catalog — comparing GPT-4.1, GPT-5, open-weight models (this post)
4	Foundry services overview — agents, Responses API, tools, memory and real-world use cases
5	Prompt engineering and structured JSON output for product descriptions
6	Building the Python API — FastAPI backend with Foundry SDK
7	Adding a database — product catalog with PostgreSQL and RAG via Azure AI Search
8	Content safety, guardrails and Responsible AI
9	Building the Vue.js frontend — a full-stack product description generator
10	CI/CD with GitLab, cost optimization and monitoring

Stay tuned!

Microsoft Foundry Series (Part 3) — Model Catalog: GPT-4.1 vs GPT-5 vs Open-Weight Models, Benchmarks, Pricing and Guidance