Cassidy Reynolds

Logging [TechOps Series]

We dive deep into logging, tracing, metrics, and observability, with a specific focus on automation, systems, and infrastructure.

The real challenge here is how you capture information from a running system in a way that provides the right information at the right time. That is fundamentally the question we work to answer throughout this fascinating discussion about logging.

Transcript: otter.ai/u/msNO2gn1b0FP2lK7rS…?utm_source=copy_url

Eve Of A Nuclear Renaissance?

Reference: en.wikipedia.org/wiki/Pebble-bed_reactor

Could nuclear power, and a potential nuclear renaissance driven by the voracious power demands of data centers, become accepted, local, and an economic boom for communities? If you’re scratching your head thinking, no way, maybe this conversation will change your mind. Enjoy!

Transcript: otter.ai/u/yUJapxBhVAhGnPUKgJ…?utm_source=copy_url

Reading Logs and Events [TechOps]

This TechOps episode explores the challenges of processing events and logs in technical operations.

The discussion covers the importance of understanding the intent and purpose of building systems downstream from eventing and logging systems. Key topics include the trade-offs between real-time and delayed event processing, the principle of least privilege, and strategies for handling event buffering and dropping. The conversation also touches on security concerns related to event and log data.

The episode concludes with plans for future discussions on adding events and logging to scripts to make them more useful.

Training Small LLMs

In this episode, we dive deep into the emerging world of building and training small language models. We’ll discuss the benefits, risks, and challenges companies face as they work to create more targeted and efficient AI models. From managing hardware and power requirements to ensuring data privacy and governance, we’ll cover the key considerations for enterprises looking to leverage the power of small language models. Join us as we unpack this fascinating topic and consider the implications for the future of AI and infrastructure operations.

Transcript: otter.ai/u/xJ5T-x70WUFQ55ZAsRQr57q6zwE
Reference: www.composabl.com/

Process: Good, Bad And Ugly

This podcast episode explores the challenges of process improvement in IT operations, using examples from data centers, automotive, and cybersecurity.

The discussion covers the slow evolution of secure boot, the difficulties cloud providers face in translating their processes to the broader market, and the emergence of vehicle-to-anything ecosystems. The group delves into the need for standardization and security in vehicle ecosystems, as well as the policy management and automation challenges enterprises face.

The conversation also examines the balance of trust in technology versus human expertise, particularly around the use of AI and the risks of generative AI. The CrowdStrike incident is analyzed, with debate around the responsibility of CrowdStrike, Microsoft, and Delta’s operational controls. The impact on cyber insurance and the need for broader risk management approaches are also discussed, highlighting the interconnectedness of process improvement and risk management, and the call for greater industry collaboration to address these challenges.

Transcript: otter.ai/u/93JhNjmekqf0ttX21g…?utm_source=copy_url

Containers Manager [TechOps]

In this episode, we continue our TechOps series, diving deep into the topic of container management. As containers become increasingly mainstream, the need to effectively manage and orchestrate these lightweight, purpose-built environments is crucial.

We’ll explore the distinctions between container management and orchestration, discussing the different tools, techniques and trade-offs involved. We’ll also hear insights from the RackN team on how they’ve approached container lifecycle management within their own infrastructure management platform, Digital Rebar.

This is a rich discussion that touches on everything from Kubernetes to system design trade-offs. So let’s jump in and learn how to wrangle those containers!

Supply Chain Security [TechOps]

In this episode, we dive deep into a recent and highly sophisticated SSH intrusion attack that was discovered in the Linux ecosystem. We’ll discuss how the attackers were able to inject a backdoor into a critical compression library, leveraging social engineering tactics to become a trusted maintainer over several years.

Software Bill of Materials [TechOps]

A software bill of materials (SBOM) is the idea that we can define and document exactly what goes into a system. We look at governance today and at how SBOMs are put together, from both the software and the operations side.

Advanced SSH [TechOps]

SSH, or Secure Shell, is one of those topics that people take for granted because it is a ubiquitous way to log in and access systems. True to form for the TechOps series, though, we break it down into much more detailed and granular components.

We talk about how to secure it and what the best practices are. We also discuss how to use it for tunneling, or, more specifically, how not to use it for tunneling, and why all of this matters to your operations environment. Listen to hear about the new things we’re doing that avoid the need for network access at all.

Transcript: otter.ai/u/XSRBfnifZOF0-nlNU5…?utm_source=copy_url

High Availability [TechOps Series]

Is high availability always a good thing? Today our discussion takes an operations perspective. We look at places where you may be over- or under-committing to high availability, confusing disaster recovery with high availability, or perhaps even securing the wrong service or looking at it the wrong way. We cover all of these scenarios with practical, hands-on examples that I know you will get a lot out of.

This is good prep for talking about HA clusters, because the idea of coordinating and monitoring systems is core to HA and HA clusters. In our journey with RackN, a lot of customers who thought they needed very aggressive HA systems, once confronted with the overhead of maintaining an HA system, have had to ask whether they really need it. We started with an active/passive HA implementation that used third-party monitoring to detect when the system failed and spin up the second system, with a live streaming backup to the failover system.

Transcript: otter.ai/u/vOVZadHvRTFCZGqcI2…?utm_source=copy_url