In this episode, we debrief several industry events I attended last year, including Supercomputing, KubeCon, Stack, the AI Infrastructure Show, and the Red Hat AI Infrastructure Summit. We dive into observations from the shows and what they tell us about the gaps and fractures in how we are building AI infrastructure. We focus on how observability is being used for evaluation, tuning, performance issues, GPU dropouts, and cluster management, while anomaly detection and root cause analysis remain less common. We also note that networking is still underserved. We then get into the shift from building clusters to observing and fixing them after deployment, especially for agentic systems, and we end by highlighting the need for observability across the application, identity, networking, and infrastructure layers.
Transcript: https://otter.ai/u/y6FNvERJRe_8qnmAgVlmvd6kwb8?utm_source=copy_url