AWS Outage

In this episode, we discuss the October 2025 Amazon outage. The conversation took place during the outage, and though it’s been a few months now, the insights and discussions are still very interesting. We trace how a DynamoDB and DNS-related failure cascaded through core AWS services and had a larger blast radius than expected. We also look at whether the outage was accidental or malicious and compare it to previous large cloud outages caused by internal errors or cascading failures. Some really interesting ideas come up around redundancy, failover, local infrastructure, and how data-centered business models change priorities around accountability, compliance, and valuation.

Transcript: https://otter.ai/u/0SHTGqt3cmSEDX5v8YLsK7eDyIE?utm_source=copy_url

Back After a Break

In this episode, we discuss the rising cost of using AI and how usage-based pricing, model changes, and capacity limits are affecting daily work as AI moves from experimentation into operational use. We also talk about multi-model workflows, hybrid infrastructure, and examples of using hosted models alongside open models locally for tasks such as writing and named entity resolution. We get into the need for enterprises to run their own AI infrastructure, including questions around GPU pooling, routing, reservation, data sovereignty, and service levels.

MCP Agents and Context

In this episode, we continue our journey even deeper into agentic vibe coding and other AI-based automation. This time we focus on the Model Context Protocol (MCP) and its application in our bare metal automation solution, Digital Rebar. We examine deterministic versus stochastic AI approaches and the importance of reliable system integration without competing with other agentic systems. We highlight MCP’s role in streamlining interactions across data sources, with a focus on practical applications in finance and infrastructure resilience. The episode ends with a preview of future conversations on user experience transformation in infrastructure operations. Enjoy!

Transcript: otter.ai/u/LmtQ9QAc79izN0PacE…=transcript&tab=chat

Rob Weinhold: The Art of Crisis Leadership [Cloud 2030 Book Club]

In this episode, we talk about Rob Weinhold’s book, “The Art of Crisis Leadership.” We explore the vital principle of “owning your narrative” in crisis management, and we share some personal stories related to the themes in the book. We analyze the differences between personal and organizational crises, emphasizing storytelling, transparency, and trust as keys to effective leadership. Even if you haven’t read the book, there’s a lot to get out of this great conversation.

Transcript: otter.ai/u/acAsrwSOpObslvjSVg…?utm_source=copy_url

Vibe Coding for Ops [TechOps]

In this episode, we do some live vibe coding: using AI to write code. We share tips and tricks on having the best vibe coding experience and avoiding some common pitfalls. You’ll get to hear what we do, how we discover the steps, and just how easy it is to interact with the system to set up a basic environment. We also start to explore the limitations of vibe coding. We encourage you to listen along and try it on your own!

Transcript: otter.ai/u/CqKdtWZWYb3AdPtcb-…?utm_source=copy_url

Model Context Protocol Exploration

Today we continue our exploration of vibe coding by digging into the Model Context Protocol, or MCP. We look at how MCPs connect chatbots to backend systems, why natural language matters for complex queries, and what it takes to build smarter, more adaptable interfaces. The discussion covers practical strategies for refining and automating these systems using API docs, making this a solid deep dive into the future of human-to-machine interaction.

Transcript: otter.ai/u/5zn42OkdumP-HXIdi5…?utm_source=copy_url

TechOps Scaling Challenges

In this episode, we talk about scale and the hard realities of system failure in large tech operations. We explore why rare failures become common at scale, and what it takes to build systems that can handle that pressure. From predictive diagnostics to component redundancy, we share practical insights on keeping high-performance and AI infrastructure resilient. This is not theory; it is grounded in real-world lessons from managing complex environments and learning how to plan, isolate, and adapt when things go wrong.

Transcript: otter.ai/u/X8JYiADfPPLEfQ-gge…?utm_source=copy_url

The Opportunity for OpenShift Infrastructure

Today we tackle the generational infrastructure shift that’s keeping IT leaders awake at night: OpenShift virtualization adoption. We dig deep into why organizations are struggling to migrate from traditional VM-focused infrastructure to Kubernetes-managed infrastructure. We explore the real hurdles blocking this transition and unpack the strategic positioning that matters when you’re moving to container-orchestrated infrastructure. This isn’t about dumping everything into Kubernetes and calling it done. We examine what it really takes to use Kubernetes as your infrastructure abstraction layer while navigating the operational realities that make or break these migrations.

Transcript: otter.ai/u/IY2Y0a4aFN99ILg9da…?utm_source=copy_url