Microsoft Foundry Series (Part 3) - Model Catalog: GPT-4.1 vs GPT-5 vs Open-Weight Models, Benchmarks, Pricing and Guidance

In Part 1 we provisioned a Foundry resource and deployed GPT-4.1 mini. In Part 2 we hardened the infrastructure with private endpoints and RBAC, and explored deployment types. But we never asked: is GPT-4.1 mini the right model for the job?


The Foundry model catalog contains dozens of models, from OpenAI’s latest GPT-5 family, to GPT-4.1, to open-weight models like Meta Llama, Mistral, and DeepSeek. In this post we will:

  1. Map the catalog: what’s available and how models are categorised
  2. Compare the GPT families: GPT-4.1 vs GPT-5 vs GPT-4o, capabilities and trade-offs
  3. Explore open-weight models: Llama, Mistral, DeepSeek, Phi and when they shine
  4. Benchmark and price: cost per million tokens, latency, and quality
  5. Build a decision framework: a practical flowchart for picking the right model
  6. Deploy multiple models with Bicep: extend our template to support side-by-side deployments

All code samples from this series are available in this repository (coming soon).

The Foundry model catalog at a glance #

The model catalog organises models into two buckets:

| Category | Description | Examples |
| --- | --- | --- |
| Models sold directly by Azure | OpenAI models hosted and billed by Microsoft; deepest integration, highest SLA | GPT-5, GPT-4.1, GPT-4o, o3, o4-mini |
| Models sold by partners | Open-weight and partner models deployed via serverless API or managed compute | Llama 4, Mistral Large, DeepSeek-R1, Phi-4 |

Full catalog: Foundry model catalog

Models sold directly by Azure (OpenAI) #

These are the “first-party” models. You deploy them as format: 'OpenAI' in Bicep and call them via the familiar OpenAI-compatible API.

| Model family | Latest models | Strengths |
| --- | --- | --- |
| GPT-5 | gpt-5, gpt-5-mini | Most capable, strongest reasoning, tools, multi-modal |
| GPT-4.1 | gpt-4.1, gpt-4.1-mini, gpt-4.1-nano | Long context (1M tokens), fast, great for coding |
| GPT-4o | gpt-4o, gpt-4o-mini | Balanced multi-modal, image + audio, widely deployed |
| o-series | o3, o4-mini | Deep reasoning (chain-of-thought), math, science |

Models sold by partners (open-weight) #

These models are available via serverless API (pay-per-token, no infrastructure to manage) or managed compute (dedicated VMs):

| Model | Provider | Parameters | Strengths |
| --- | --- | --- | --- |
| Llama 4 Maverick | Meta | 400B (MoE) | Open-weight flagship, strong multilingual |
| Llama 4 Scout | Meta | 109B (MoE) | 10M token context, efficient |
| Mistral Large (25.07) | Mistral AI | 123B | Strong coding and European language support |
| DeepSeek-R1 | DeepSeek | 671B (MoE) | Exceptional reasoning and math |
| Phi-4 | Microsoft | 14B | Small, fast, competitive with much larger models |
| Phi-4-reasoning | Microsoft | 14B | Phi-4 optimised for chain-of-thought reasoning |

GPT family comparison #

Let’s zoom in on the three GPT generations currently available in Foundry.

Capability matrix #

| Capability | GPT-4.1 | GPT-4.1 mini | GPT-5 | GPT-5 mini | GPT-4o |
| --- | --- | --- | --- | --- | --- |
| Max context (input) | 1,047,576 tokens | 1,047,576 tokens | 1,047,576 tokens | 1,047,576 tokens | 128,000 tokens |
| Max output | 32,768 tokens | 32,768 tokens | 64,000 tokens | 16,000 tokens | 16,384 tokens |
| Multi-modal (image) | ✅ | ✅ | ✅ | ✅ | ✅ |
| Audio input/output | No | No | ✅ | No | ✅ |
| Tool / function calling | ✅ | ✅ | ✅ | ✅ | ✅ |
| Structured output (JSON) | ✅ | ✅ | ✅ | ✅ | ✅ |
| Reasoning (thinking) | No | No | ✅ (built-in) | ✅ (built-in) | No |
Coding strength and instruction following are also rated per model on a five-star (★) scale.

GPT-5 is the most capable model overall: it incorporates reasoning natively (no separate o-series call needed) and supports audio. GPT-4.1 excels at coding and long-context tasks at a lower cost. GPT-4o remains a solid general-purpose choice with broad multi-modal support.
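The context and output limits in the matrix lend themselves to a simple pre-flight check before you send a large request. The sketch below is illustrative only: the `LIMITS` dictionary copies the token figures from the matrix above, and the helper name `models_that_fit` is hypothetical. Verify the limits against the current model documentation before relying on them.

```python
# Token limits copied from the capability matrix above (assumed current).
LIMITS = {
    "gpt-4.1":      {"max_input": 1_047_576, "max_output": 32_768},
    "gpt-4.1-mini": {"max_input": 1_047_576, "max_output": 32_768},
    "gpt-5":        {"max_input": 1_047_576, "max_output": 64_000},
    "gpt-5-mini":   {"max_input": 1_047_576, "max_output": 16_000},
    "gpt-4o":       {"max_input": 128_000,   "max_output": 16_384},
}


def models_that_fit(input_tokens: int, output_tokens: int) -> list[str]:
    """Return the models whose limits cover the requested token budget."""
    return [
        name
        for name, limit in LIMITS.items()
        if input_tokens <= limit["max_input"]
        and output_tokens <= limit["max_output"]
    ]


if __name__ == "__main__":
    # A 200k-token prompt that needs up to 20k tokens of output.
    print(models_that_fit(200_000, 20_000))
```

A check like this is cheap to run before each call and avoids paying for a request that the service would reject or truncate.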

When to use which GPT #

    ┌───────────────────────────────────────────────────────────────┐
    │                   What's your primary need?                   │
    └─────────┬─────────────────────────────┬───────────────────┬───┘
              │                             │                   │
         Reasoning &                    Coding /           Multi-modal
        complex tasks                 long context        (audio/image)
              │                             │                   │
              ▼                             ▼                   ▼
            GPT-5                        GPT-4.1         GPT-5 or GPT-4o
              │                             │
      ┌───────┴───────┐             ┌───────┴───────┐
      │               │             │               │
 High budget    Budget aware   High volume    Budget aware
      │               │             │               │
      ▼               ▼             ▼               ▼
    GPT-5        GPT-5 mini      GPT-4.1         GPT-4.1
                                 (full)         mini/nano
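The same decision tree can be written as a small lookup function, which is handy if you want to encode the choice in configuration or tests. This is a sketch rather than part of the series code: the function name `pick_gpt_candidates` and the `need` labels are hypothetical, and the branches simply mirror the chart.

```python
def pick_gpt_candidates(need: str, budget_aware: bool = False) -> list[str]:
    """Return candidate deployments for a primary need, following the chart."""
    if need == "reasoning":
        # Reasoning and complex tasks: GPT-5, or GPT-5 mini on a budget.
        return ["gpt-5-mini"] if budget_aware else ["gpt-5"]
    if need == "coding_long_context":
        # Coding / long context: full GPT-4.1, or mini/nano on a budget.
        return ["gpt-4.1-mini", "gpt-4.1-nano"] if budget_aware else ["gpt-4.1"]
    if need == "multimodal":
        # The chart has no budget split on the multi-modal branch.
        return ["gpt-5", "gpt-4o"]
    raise ValueError(f"Unknown need: {need!r}")


if __name__ == "__main__":
    print(pick_gpt_candidates("reasoning", budget_aware=True))
    print(pick_gpt_candidates("coding_long_context"))
```

Returning a list rather than a single name keeps the two branches where the chart offers a choice (GPT-5 or GPT-4o, mini or nano) honest about that choice.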

Pricing comparison #

Pricing changes frequently, so always check the official pricing page. The table below shows Global Standard pay-per-token rates as of June 2026:

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best for |
| --- | --- | --- | --- |
| gpt-5 | $10.00 | $30.00 | Hardest tasks, agentic workflows |
| gpt-5-mini | $1.50 | $6.00 | Reasoning on a budget |
| gpt-4.1 | $2.00 | $8.00 | Coding, long-context, production |
| gpt-4.1-mini | $0.40 | $1.60 | High-volume, cost-sensitive |
| gpt-4.1-nano | $0.10 | $0.40 | Classification, extraction, simple tasks |
| gpt-4o | $2.50 | $10.00 | Multi-modal, image understanding |
| gpt-4o-mini | $0.15 | $0.60 | Lightweight multi-modal |

Key insight: GPT-4.1 mini is 25× cheaper than GPT-5 on input ($0.40 vs $10.00) and roughly 19× cheaper on output ($1.60 vs $30.00). For our product description generator from Part 1, GPT-4.1 mini delivers excellent quality at a fraction of the cost.
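To make these ratios concrete, here is a minimal sketch that recomputes them from the table. The prices are copied from the table above; the workload (1,000 descriptions at 300 input and 150 output tokens each) is an assumption chosen for illustration, not a measurement, and the function names are hypothetical.

```python
# Global Standard list prices from the table above: (input, output) in USD per 1M tokens.
PRICES = {
    "gpt-5":        (10.00, 30.00),
    "gpt-5-mini":   (1.50, 6.00),
    "gpt-4.1":      (2.00, 8.00),
    "gpt-4.1-mini": (0.40, 1.60),
    "gpt-4.1-nano": (0.10, 0.40),
    "gpt-4o":       (2.50, 10.00),
    "gpt-4o-mini":  (0.15, 0.60),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the listed per-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def price_ratio(expensive: str, cheap: str) -> tuple[float, float]:
    """How many times cheaper `cheap` is than `expensive` (input, output)."""
    exp_in, exp_out = PRICES[expensive]
    cheap_in, cheap_out = PRICES[cheap]
    return exp_in / cheap_in, exp_out / cheap_out


if __name__ == "__main__":
    in_ratio, out_ratio = price_ratio("gpt-5", "gpt-4.1-mini")
    print(f"input: {in_ratio:.2f}x, output: {out_ratio:.2f}x")

    # Assumed workload: 1,000 descriptions, 300 input + 150 output tokens each.
    for model in ("gpt-4.1-mini", "gpt-5"):
        total = 1_000 * request_cost(model, 300, 150)
        print(f"{model}: ${total:.2f} per 1,000 descriptions")
```

With these assumed token counts, the batch costs about $0.36 on gpt-4.1-mini and $7.50 on gpt-5. Your own numbers will depend on prompt length and output size.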

Open-weight model pricing (serverless API) #

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
| --- | --- | --- | --- |
| Llama 4 Maverick | $0.50 | $0.70 | Competitive with GPT-4.1 mini |
| Llama 4 Scout | $0.27 | $0.35 | Cheapest large-context model |
| Mistral Large (25.07) | $2.00 | $6.00 | Strong European language support |
| DeepSeek-R1 | $0.55 | $2.19 | Best open-weight reasoning |
| Phi-4 | $0.07 | $0.14 | Extremely cheap, good for simple tasks |

When to pick open-weight models #

Open-weight models make sense in specific scenarios:

| Scenario | Recommended model | Why |
| --- | --- | --- |
| Data sovereignty / on-prem needed | Llama 4 or Mistral | Can be deployed on managed compute in your region |
| Extreme cost optimisation | Phi-4 or Llama 4 Scout | Lowest cost per token |
| Advanced reasoning (open-source) | DeepSeek-R1 | Rivals o3 in math/science benchmarks |
| European language focus | Mistral Large | Trained with strong French, German, Spanish support |
| Custom fine-tuning | Phi-4 or Llama 4 | Open weights allow LoRA / full fine-tuning |
| Regulatory / licence requirements | Varies | Some orgs require inspectable model weights |

When to stick with OpenAI models #

  • Tightest Azure integration: Responses API, Agents, built-in content safety
  • SLA and support: Microsoft-backed SLA for GPT models
  • Agentic workflows: GPT-5’s native reasoning + tool use is hard to beat
  • Audio and real-time: only GPT-5 and GPT-4o support audio in/out

Deploying multiple models with Bicep #

In production, you often want multiple models deployed side-by-side: a powerful model for complex tasks and a cheap one for simple classification. Let’s extend our Bicep template from Part 2.

Define models as an array parameter #

@description('Models to deploy')
param models array = [
  {
    name: 'gpt-4.1-mini'
    version: '2025-04-14'
    sku: 'GlobalStandard'
    capacity: 30
  }
  {
    name: 'gpt-4.1'
    version: '2025-04-14'
    sku: 'GlobalStandard'
    capacity: 10
  }
  {
    name: 'gpt-5-mini'
    version: '2025-06-02'
    sku: 'GlobalStandard'
    capacity: 10
  }
]

Loop over deployments #

@batchSize(1) // deploy sequentially to avoid race conditions
resource deployments 'Microsoft.CognitiveServices/accounts/deployments@2025-04-01-preview' = [
  for model in models: {
    parent: foundry
    name: model.name // the deployment name doubles as the model name in API calls
    sku: {
      name: model.sku
      capacity: model.capacity
    }
    properties: {
      model: {
        name: model.name
        format: 'OpenAI'
        version: model.version
      }
    }
  }
]

@batchSize(1) is important: Foundry deployments must be created sequentially. Without it, parallel creation may cause conflicts.

Output all deployment names #

output deploymentNames array = [for (model, i) in models: deployments[i].name]

Updated parameter file #

using 'main.bicep'

param baseName = 'foundry-demo'
param location = 'swedencentral'
param models = [
  {
    name: 'gpt-4.1-mini'
    version: '2025-04-14'
    sku: 'GlobalStandard'
    capacity: 30
  }
  {
    name: 'gpt-5-mini'
    version: '2025-06-02'
    sku: 'GlobalStandard'
    capacity: 10
  }
]

Routing requests to the right model in Python #

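With multiple models deployed, you can route requests based on task complexity. Here’s a simple router. It assumes deployments named gpt-4.1-nano, gpt-4.1-mini and gpt-5-mini exist, so add a gpt-4.1-nano entry to the models parameter if you have not deployed it yet: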

import os
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient

load_dotenv()

# Connect to the Foundry project and get an OpenAI-compatible client
project = AIProjectClient(
    endpoint=os.environ["PROJECT_ENDPOINT"],
    credential=DefaultAzureCredential(),
)
openai = project.get_openai_client()

# Model routing map (values are deployment names)
MODELS = {
    "simple": "gpt-4.1-nano",     # classification, extraction
    "standard": "gpt-4.1-mini",   # product descriptions, summaries
    "complex": "gpt-5-mini",      # multi-step reasoning, planning
}


def classify_complexity(task: str) -> str:
    """Use the cheapest model to classify task complexity."""
    response = openai.responses.create(
        model=MODELS["simple"],
        instructions=(
            "Classify the following task as 'simple', 'standard', or 'complex'. "
            "Return only the classification word."
        ),
        input=task,
    )
    classification = response.output_text.strip().lower()
    # Fall back to the middle tier if the reply is not a known label
    return classification if classification in MODELS else "standard"


def execute_task(task: str, system_prompt: str) -> str:
    """Route to the appropriate model based on complexity."""
    complexity = classify_complexity(task)
    model = MODELS[complexity]

    print(f"Task complexity: {complexity} → using {model}")

    response = openai.responses.create(
        model=model,
        instructions=system_prompt,
        input=task,
    )
    return response.output_text


if __name__ == "__main__":
    # Simple task → routed to nano
    print(execute_task(
        task="Classify this product as Electronics, Clothing, or Home: 'Wireless Bluetooth Headphones'",
        system_prompt="You are a product classifier. Return only the category name.",
    ))

    print("---")

    # Standard task → routed to mini
    print(execute_task(
        task="Write a product description for 'TrailBlazer Pro Hiking Boots' made of leather with Vibram soles",
        system_prompt="You are an e-commerce copywriter. Write 2-3 compelling sentences.",
    ))

    print("---")

    # Complex task → routed to GPT-5 mini
    print(execute_task(
        task="Analyse our product catalog and suggest a pricing strategy that maximises revenue while maintaining competitiveness. Consider seasonal trends, competitor pricing, and customer segments.",
        system_prompt="You are a retail pricing strategist. Provide a structured analysis with recommendations.",
    ))

Example output:

Task complexity: simple → using gpt-4.1-nano
Electronics
---
Task complexity: standard → using gpt-4.1-mini
Conquer any terrain with the TrailBlazer Pro Hiking Boots, crafted from premium
leather for durability that lasts. Equipped with Vibram soles for unmatched grip
on rocky trails, these boots are your ultimate companion for every adventure.
---
Task complexity: complex → using gpt-5-mini
## Pricing Strategy Analysis
...
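One fragile spot in the router is the exact-match check on the classifier's reply: a response such as "Simple." or "'complex'" falls through to the "standard" default. A more forgiving parser might look like the sketch below. The helper `normalise_label` is hypothetical and not part of the Foundry SDK; it only post-processes the text returned by the classification call.

```python
import re

ALLOWED = ("simple", "standard", "complex")


def normalise_label(raw: str, default: str = "standard") -> str:
    """Map a free-text classifier reply onto one of the allowed labels."""
    # Extract lowercase words so punctuation and quotes are ignored.
    words = re.findall(r"[a-z]+", raw.lower())
    found = {word for word in words if word in ALLOWED}
    # Accept the reply only when exactly one distinct label appears;
    # anything ambiguous or empty falls back to the default tier.
    return found.pop() if len(found) == 1 else default


if __name__ == "__main__":
    for reply in ("Simple.", "'complex'", "simple or complex?", ""):
        print(repr(reply), "->", normalise_label(reply))
```

Inside `classify_complexity`, this could replace the `strip().lower()` line, with the membership check no longer needed because the helper always returns a valid key.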

Depending on the task mix, this pattern can cut costs by 50–80% or more compared with sending every request to GPT-5, and the nano model handles the classification step for a fraction of a cent per request.
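How large the saving is depends on the traffic mix, the token counts, and the baseline you compare against, so it is worth estimating for your own workload. The sketch below is a rough model under stated assumptions: prices come from the pricing table earlier in this post, the traffic mix and token counts are invented for illustration, the classifier call is assumed to add about 40 instruction tokens and return 2 tokens, and system-prompt tokens for the main call are ignored. All function names are hypothetical.

```python
# (input, output) prices in USD per 1M tokens, from the pricing table.
PRICES = {
    "gpt-4.1-nano": (0.10, 0.40),
    "gpt-4.1-mini": (0.40, 1.60),
    "gpt-5-mini": (1.50, 6.00),
    "gpt-5": (10.00, 30.00),
}
ROUTES = {"simple": "gpt-4.1-nano", "standard": "gpt-4.1-mini", "complex": "gpt-5-mini"}

CLASSIFIER_PROMPT_TOKENS = 40  # assumed size of the classification instructions
CLASSIFIER_REPLY_TOKENS = 2    # assumed size of the one-word reply


def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single request."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def classification_cost(task_tokens: int) -> float:
    """Cost of the extra nano call that labels one task."""
    return cost(ROUTES["simple"], task_tokens + CLASSIFIER_PROMPT_TOKENS,
                CLASSIFIER_REPLY_TOKENS)


def routed_cost(mix: dict[str, int], task_tokens: int, output_tokens: int) -> float:
    """Total cost when each request is classified, then sent to its tier."""
    total = 0.0
    for tier, requests in mix.items():
        per_request = classification_cost(task_tokens) + cost(
            ROUTES[tier], task_tokens, output_tokens)
        total += requests * per_request
    return total


def baseline_cost(mix: dict[str, int], model: str, task_tokens: int,
                  output_tokens: int) -> float:
    """Total cost when every request goes to a single model."""
    return sum(mix.values()) * cost(model, task_tokens, output_tokens)


if __name__ == "__main__":
    mix = {"simple": 600, "standard": 300, "complex": 100}  # assumed traffic
    routed = routed_cost(mix, task_tokens=300, output_tokens=150)
    for baseline in ("gpt-5", "gpt-5-mini"):
        single = baseline_cost(mix, baseline, 300, 150)
        print(f"vs {baseline}: saving {1 - routed / single:.0%}")
    print(f"classification overhead per request: ${classification_cost(300):.6f}")
```

Comparing against more than one baseline is deliberate: the saving looks very different depending on whether the alternative is the flagship model or the mid-tier one.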

Model comparison cheat sheet #

Here’s a quick-reference decision table:

| Use case | Recommended model | Why |
| --- | --- | --- |
| Product descriptions (our series) | gpt-4.1-mini | Great quality, low cost |
| Code generation / review | gpt-4.1 | Best coding benchmarks at the price |
| Complex multi-step reasoning | gpt-5 | Native reasoning, strongest overall |
| Classification / extraction | gpt-4.1-nano | Cheapest, fast, accurate for simple tasks |
| Math / science problems | o3 or DeepSeek-R1 | Chain-of-thought specialised |
| Chat with images | gpt-4o or gpt-5 | Multi-modal input/output |
| Budget-sensitive high volume | Phi-4 or Llama 4 Scout | Lowest cost per token |
| European language content | Mistral Large | Strong multilingual training |
| Must inspect model weights | Llama 4 / Mistral / Phi-4 | Open-weight licence |

Clean up #

az group delete --name rg-foundry-demo --yes --no-wait

What’s next? #

In Part 4 we will explore Foundry services: the Agents framework, Responses API, tools, memory, and how to build real-world agentic applications.

Full series outline #

| # | Topic |
| --- | --- |
| 1 | Getting started: provision with Bicep, deploy GPT, generate descriptions |
| 2 | Bicep deep dive: networking, RBAC, deployment types, region selection |
| 3 | Foundry model catalog: comparing GPT-4.1, GPT-5, open-weight models (this post) |
| 4 | Foundry services overview: agents, Responses API, tools, memory and real-world use cases |
| 5 | Prompt engineering and structured JSON output for product descriptions |
| 6 | Building the Python API: FastAPI backend with Foundry SDK |
| 7 | Adding a database: product catalog with PostgreSQL and RAG via Azure AI Search |
| 8 | Content safety, guardrails and Responsible AI |
| 9 | Building the Vue.js frontend: a full-stack product description generator |
| 10 | CI/CD with GitLab, cost optimization and monitoring |

Stay tuned!
