HA Troubleshooting [Tech Ops]

This episode of the TechOps series covers high availability troubleshooting: not just high availability, and not just troubleshooting, but what it actually takes to manage, maintain, and fix HA systems. It is part of a longer discussion we’ve been having, and there are some really interesting ideas in the middle of it that I hope will shape your thinking as you build high availability systems, diagnostics, and troubleshooting for people working in very complex high availability environments.

Transcript: otter.ai/u/wM__4w1YIzZnhVdgLu…?utm_source=copy_url

References:
status.openai.com/incidents/ctrsv3lwd797

Writing Great Test Scripts [TechOps]

We take a deep dive into something seemingly very small that has a lot of repercussions for how you manage and run a data center: test scripts for servers.

As you’re going through a production cycle or a provisioning cycle, how do you test? What do you test? This topic came from a Reddit thread that we answered, which then led to a whole hour of conversation about just how important and impactful this type of script is.

Transcript: otter.ai/u/Cb3yac8JHvlM2yqh72…?utm_source=copy_url

High Availability Technology in DRP [TechOps]

Today we dive into RackN high availability technology and what we did to build consensus-based Raft HA capabilities directly into Digital Rebar. This is one of those episodes where we talk specifically and only about Digital Rebar, so it is a vendor-specific conversation from that perspective.

If you are building HA systems, or are interested in how HA systems work, this is a great session to learn firsthand from our experience!

Transcript: otter.ai/u/9lA9djczp5GkJbj12k…?utm_source=copy_url

Gitops and Immutability [TechOps Series]

The cloud2030 TechOps series is an ongoing discussion in which we create what I think of as 200-level content for tech and operations leaders, exploring complex, deep topics in a thoughtful way to extend your knowledge base and capabilities in the data center and infrastructure space.

Today’s episode talks about GitOps and immutability. We connect the operational concepts of controls and desired-state communication with how they get executed in infrastructure, taking an operations approach rather than a developer approach. So if you are interested in how to manage immutability and what that means in infrastructure, this discussion is for you.

BootC created Bare Metal Containers [TechOps]

We dive deep into the technical details of BootC – a Red Hat-led technology that uses container-like definitions to describe machine boot processes. BootC is an important development, especially as companies embrace containers and seek a unified approach to machine configuration.

RackN CTO Greg Althaus provides an in-depth overview of how BootC works, its key capabilities, and the potential benefits and challenges for operations teams. The discussion explores topics like BootC’s relationship to containers, the concept of immutability, different deployment methods, and the operational considerations around managing BootC at scale.

This conversation offers a balanced, non-Red Hat perspective on BootC, highlighting both its technical merits and the significant operational work required to successfully adopt and integrate it. Listeners will come away with a nuanced understanding of this emerging technology and the factors organizations should weigh as they evaluate BootC for their infrastructure.

Logging [TechOps Series]

We dive deep into logging, tracing, metrics, and observability, with a specific focus on automation, systems, and infrastructure.

The real challenge here is how to capture information from a running system in a way that provides the right information at the right time. That is fundamentally the question we work to answer throughout this really fascinating discussion about logging.

Transcript: otter.ai/u/msNO2gn1b0FP2lK7rS…?utm_source=copy_url

Reading Logs and Events [TechOps]

This TechOps episode explores the challenges of processing events and logs in technical operations.

The discussion covers the importance of understanding the intent and purpose of building systems downstream from eventing and logging systems. Key topics include the trade-offs between real-time and delayed event processing, the principle of least privilege, and strategies for handling event buffering and dropping. The conversation also touches on security concerns related to event and log data.
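
As a rough illustration of one buffering-and-dropping strategy of the kind mentioned above (this is not taken from the episode itself), here is a minimal Python sketch of a fixed-size buffer that drops the oldest event when it is full and keeps a count of what was lost. The class name `BoundedEventBuffer` and its methods are hypothetical, chosen only for this example.

```python
from collections import deque


class BoundedEventBuffer:
    """Hypothetical fixed-size event buffer that drops the oldest event when full."""

    def __init__(self, capacity):
        # deque(maxlen=...) discards items from the opposite end once it is full.
        self._events = deque(maxlen=capacity)
        self.dropped = 0  # how many events were discarded, so the loss stays visible

    def publish(self, event):
        # Count the drop before appending, because the deque discards silently.
        if len(self._events) == self._events.maxlen:
            self.dropped += 1
        self._events.append(event)

    def drain(self):
        # Hand all buffered events to the consumer and empty the buffer.
        events = list(self._events)
        self._events.clear()
        return events
```

Dropping the oldest event favors fresh data for real-time consumers; a system that cares more about a complete history might instead block the producer or spill to disk, which is the kind of trade-off between real-time and delayed processing described above.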

The episode concludes with plans for future discussions on adding events and logging to scripts to make them more useful.

Containers Manager [TechOps]

In this episode, we continue our TechOps series, diving deep into the topic of container management. As containers become increasingly mainstream, the need to effectively manage and orchestrate these lightweight, purpose-built environments is crucial.

We’ll explore the distinctions between container management and orchestration, discussing the different tools, techniques and trade-offs involved. We’ll also hear insights from the RackN team on how they’ve approached container lifecycle management within their own infrastructure management platform, Digital Rebar.

This is a rich discussion that touches on everything from Kubernetes to system design trade-offs. So let’s jump in and learn how to wrangle those containers!

API Consumption [TechOps 003]

TechOps series episode 3 covers how to automate against APIs. We discuss the ways in which you can use APIs effectively, and the ways you can run into trouble. We also discuss how we should be consuming APIs, both as consumers and as producers of APIs. Many of the ideas discussed came from learning how people consume our APIs and what we can do to help make them better and safer.

Enjoy this broader TechOps series, where we dive deep into tips and techniques that improve your journey as an automator.

Transcript: otter.ai/u/5akxcG83FBS1m9PBUn…?utm_source=copy_url
Image by Dall-E