329: Azure Front Door: Please Use the Side Entrance

Update: 2025-11-12

Description

Welcome to episode 329 of The Cloud Pod, where the forecast is always cloudy! Justin, Jonathan, and special guest Elise are in the studio to bring you all the latest in AI and cloud news, including – you guessed it – more outages, and more OpenAI team-ups. We've also got GPUs, K8s news, and Cursor updates. Let's get started!


Titles we almost went with this week



  • Azure Front Door: Please Use the Side Entrance – el, jb

  • Azure and NVIDIA: A Match Made in GPU Heaven – mk

  • Azure Goes Down Under the Weight of Its Own Configuration – el

  • GitHub Turns Your Copilot Subscription Into an All-You-Can-Eat Agent Buffet – mk, el

  • Microsoft Goes Full Blackwell: No Regrets, Just GPUs

  • Jules Verne Would Be Proud: Google’s CLI Goes 20,000 Bugs Under the Codebase

  • RAG to Riches: AWS Makes Retrieval Augmented Generation Turnkey

  • Kubectl Gets a Gemini Twin: Google Teaches AI to Speak Kubernetes

  • I’m Not a Robot: Azure WAF Finally Learns to Ask the Important Questions

  • OpenAI Puts 38 Billion Eggs in Amazon’s Basket: Multi-Cloud Gets Complicated

  • The Root Cause They’ll Never Root Out: Why Attrition Stays Off the RCA

  • Google’s New Extension Lets You Deploy Kubernetes by Just Asking Nicely

  • Cursor 2.0: Now With More Agents Than a Hollywood Talent Agency


Follow Up 


04:46 Massive Azure outage is over, but problems linger – here’s what happened | ZDNET 



  • Azure experienced a global outage on October 29, affecting all regions simultaneously, unlike the recent AWS outage that was limited to a single region. 

  • The incident lasted approximately eight hours from noon to 8 PM ET, impacting major services including Microsoft 365, Teams, Xbox Live, and critical infrastructure for Alaska Airlines, Vodafone UK, and Heathrow Airport, among others.

  • The root cause was an inadvertent tenant configuration change in Azure Front Door that bypassed safety validations due to a software defect. Microsoft’s protection mechanisms failed to catch the erroneous deployment, allowing invalid configurations to propagate across the global fleet and cause HTTP timeouts, server errors, and elevated packet loss at network edges.

  • Recovery required rolling back to the last known good configuration and gradually rebalancing traffic across nodes to prevent overload conditions. 

  • Some customers experienced lingering issues even after the official recovery time, with Microsoft temporarily blocking configuration changes to Azure Front Door while completing the restoration process.

  • The incident highlights concentration risk in cloud infrastructure, as this marks the second major cloud provider outage in October 2025. 

  • Despite Azure revenue growing 40 percent in the latest quarterly report, Microsoft’s stock declined in after-hours trading as the company acknowledged capacity constraints in meeting AI and cloud demands.

  • Affected Azure services included App Service, Azure SQL Database, Microsoft Entra ID, Container Registry, Azure Databricks, and approximately 15 other core platform services. Microsoft has implemented additional validation and rollback controls to prevent similar configuration deployment failures, though the full post-incident report remains pending.
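The outage summary above describes two safeguards: validation that should have stopped the bad tenant configuration before it spread, and rollback to the last known good configuration. The sketch below illustrates that general pattern in Python under stated assumptions: `EdgeFleet`, `validate`, `deploy`, the config keys `origin` and `timeout_s`, the wave fractions, and the `is_healthy` probe are all hypothetical, and this is not how Azure Front Door is actually built. It covers only validation, staged rollout, and rollback, not the traffic rebalancing Microsoft also performed.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence

# Hypothetical config shape; real edge/CDN configs are far richer.
Config = Dict[str, object]


@dataclass
class EdgeFleet:
    """A toy fleet where each edge node serves exactly one active config."""
    nodes: List[str]
    last_known_good: Config
    active: Dict[str, Config] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every node starts on the last known good config.
        for node in self.nodes:
            self.active[node] = dict(self.last_known_good)


def validate(config: Config) -> List[str]:
    """Illustrative safety checks (assumed rules, not Azure's)."""
    errors: List[str] = []
    origin = config.get("origin")
    if not isinstance(origin, str) or not origin:
        errors.append("origin must be a non-empty string")
    timeout = config.get("timeout_s")
    if not isinstance(timeout, (int, float)) or not 0 < timeout <= 60:
        errors.append("timeout_s must be in (0, 60]")
    return errors


def deploy(
    fleet: EdgeFleet,
    config: Config,
    is_healthy: Callable[[str], bool],
    waves: Sequence[float] = (0.01, 0.10, 0.50, 1.0),
) -> bool:
    """Validate, then roll out in cumulative waves; roll back on failure."""
    # The gate is unconditional: there is no flag that skips validation.
    errors = validate(config)
    if errors:
        raise ValueError("rejected before rollout: " + "; ".join(errors))

    updated: List[str] = []
    for fraction in waves:
        # Cumulative number of nodes that should be on the new config (min 1).
        target = max(1, math.ceil(len(fleet.nodes) * fraction))
        for node in fleet.nodes[len(updated):target]:
            fleet.active[node] = dict(config)
            updated.append(node)
        # Health-check every node touched so far before widening the blast radius.
        if not all(is_healthy(node) for node in updated):
            for node in updated:
                fleet.active[node] = dict(fleet.last_known_good)
            return False
    # Only a fully healthy rollout becomes the new rollback target.
    fleet.last_known_good = dict(config)
    return True
```

The key design choice is that `validate` runs before any node changes and cannot be bypassed, since a check that a software defect can skip offers little protection. The wave schedule then limits how many nodes a config that passes validation but still misbehaves can reach before the health checks trigger a rollback.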


07:06 Matt – “The fact that you’re plus one week and still can’t actually make changes or even do simple things like purge a cache makes me think this is a lot bigger on the backend than they let on at the beginning.”


AI Is Going Great – Or How ML Makes Money


08:30 AWS and OpenAI announce multi-year strategic partnership | OpenAI



  • AWS and OpenAI formalized a $38 billion multi-year partnership that gives OpenAI immediate access to hundreds of thousands of NVIDIA GPUs (GB200s and GB300s) clustered via Amazon EC2 UltraServers, with all capacity targeted for deployment by the end of 2026.

  • The infrastructure supports both ChatGPT inference serving and next-generation model training with the ability to scale to tens of millions of CPUs for agentic workloads.

  • The partnership builds on an existing integration: OpenAI's open-weight foundation models became available on Amazon Bedrock earlier this year, making OpenAI one of the most popular model providers on the platform. Thousands of customers, including Thomson Reuters, Peloton, and Verana Health, are already using these models for agentic workflows, coding, and scientific analysis.

  • AWS positions this as validation of their large-scale AI infrastructure capabilities, noting they have experience running clusters exceeding 500,000 chips with the security, reliability, and scale required for frontier model development. 

  • Clustering the GPUs on the same EC2 UltraServer network provides low-latency communication across the interconnected systems.

  • This represents a significant shift in OpenAI’s infrastructure strategy, moving substantial compute workloads to AWS while maintaining its existing Microsoft Azure relationship. 

  • The seven-year commitment timeline with continued growth provisions indicates long-term capacity planning for increasingly compute-intensive AI model development.


09:53 Elise – “It sort of feels like OpenAI has a strategic partnership with everyone right now, so I’m sure this will help them, just like everything else that they have done will help them. We’re banking a lot on OpenAI being very successful.” 


17:11 Google removes Gemma models from AI Studio after GOP senator's complaint – Ars Technica



  • Google removed its open Gemma AI models from AI Studio following a complaint from Senator Marsha Blackburn, who reported the model hallucinated false sexual misconduct allegations against her when prompted with leading questions. 

  • The model allegedly fabricated detailed false claims and generated fake news article links, demonstrating the persistent hallucination problem across generative AI systems.

  • The removal only affects non-developer access through AI Studio's user interface, where tools for tweaking model behavior could make hallucinations more likely.

  • Developers can still access Gemma through the API and download models for local development, suggesting Google is limiting casual experimentation rather than pulling the model entirely.

  • This incident highlights the ongoing challenge of AI hallucinations in production systems, which no AI firm has successfully eliminated despite mitigation efforts. 

  • Google’s response indicates a shift toward restricting open model access when inflammatory outputs could result from user

Justin Brodley, Jonathan Baker, Ryan Lucas and Matt Kohn