Is Open Source Working?

Is open source driving innovation? And is it a necessary component of Right to Repair and ownership? Are there commercial drivers where people want those open capabilities?

We transition into a deeper conversation about what’s going on with open source. Is it being innovative? Who is leading? How is it working?

Transcript: otter.ai/u/vto0yPpBuZtqngkc_zqMDp9J39M
Photo by Jeffrey Czum from Pexels [ID 4118958]

Terraform Usage Patterns (Gitops, IaC, Templates)

Cloud provisioning is very difficult when you go beyond simple provisioning and start thinking about how to stitch together infrastructure in a repeatable way!

Specifically, today’s episode is a deep dive into Terraform usage patterns.

We get very hands on as we talk about how you manage state files and how you connect things together with Terraform.

We will spend a significant amount of time discussing this in the fall because building infrastructure in a scalable, automatable way is a critical topic for the group.

This is an ongoing topic for us – stay tuned for more episodes!

Transcript: otter.ai/u/A-NgZOfa1xeIPA1uQOh8_bSStck
Photo by Artem Beliaikin from Pexels [ID 1079033]

That’s Not Terraform Orchestration!

This episode is about Terraform orchestration, what some people might call a TACO, in which we actually tried to do cloud provisioning in an orchestrated way. But this is a really challenging thing to do!

Orchestration is really hard, so our discussion kept coming back to saying that this isn’t orchestration at all: it’s Infrastructure as Code and management.

We need to find a consistent way to run a workflow or a control plane. We’re not even getting to the point where we’re coordinating or orchestrating aspects of different systems and using remote or API-driven infrastructure.

Even if you use Terraform, you will get a lot out of this discussion!

Transcript: otter.ai/u/Ohbfr0Uprm95WYYI4357IdUodOU
Photo by Gabriel Santos Fotografia from Pexels [ID 2102568]

Distributed Infrastructure

With Distributed Infrastructure and the Edge, we cover the challenges of managing applications that are, by definition, spread out across heterogeneous infrastructure.

Distributed Control is designed to control systems that are not in cloud data centers, with localized compute and storage. But then how do we manage it?

We discussed details about how these systems get built, and kept coming back to “do we need to have localized processing?” If we do, how do we manage it?

Transcript: otter.ai/u/BkxvOrQMmmQiYQpxa-OogrMyNNw
Photo by KEHN HERMANO from Pexels [ID 3881034]

Edge Impact of Digital Twins

We talk about Digital Twins and the Edge with Simon Crosby from Swim.AI. They are literally building digital twins in edge locations so he has a lot to share.

We work to expand and understand how Simon’s experience translates into general cases and what we’re seeing in the edge. The systems that we’re trying to build are at the intersection of models and “connectedness” of all the components for the edge.

These designs don’t fit traditional models, and that is what makes edge unique. Edge is not a single application, but a connected system that is going to have to emerge to make all this work together.

Transcript: otter.ai/u/-uFSclONwRhhc4QlFywiSJAIF10
Photo by Dmitriy Ganin from Pexels [ID 7538096]

Topics for a Security Training Course

This DevOps Lunch and Learn was about security practices. Specifically, we built an outline of topics in security that we think are necessary for developers and operators to build secure applications.

We basically built a week-long course curriculum!

As we go through this course curriculum, we walk through who needs to know this information and why.

If you want to see all of the detail here, please see: docs.google.com/document/d/1x5QLP…ng=h.c2phqte5q4pl

Transcript: otter.ai/u/UyMAmiHi-rRAreMa0FjxaVNomhQ
Photo by PhotoMIX Company from Pexels [ID 226746]

Are you impatient enough to be an SRE?

Our focus on SRE series continues… At RackN, we see a coming infrastructure explosion in both complexity and scale. Unless our industry radically rethinks operational processes, current backlogs will escalate and stability, security and sharing will suffer.

SRE-minded teams are very impatient about eliminating manual, routine and non-differentiated work.

I’ve been talking to a lot of people about SRE lately in the context of helping Ops get out of the way while coping with increasing load and complexity. Why are they so impatient? Because they know that ops demand is constantly increasing, so there’s no “good enough” when it comes to finding ways to automate tasks and move up the stack. Without consistent improvement in automation, teams will get buried (my post about Ops Debt).

The core SRE mantra needs to be “Own Ops, don’t be owned by Ops.”

Yet, outsourcing ops responsibility to a service is equally problematic for an SRE. They cannot give up responsibility for the integrated system. In fact, that’s one of the basic reasons why Google’s SRE teams went from just “web site reliability” to full system thinking. Every aspect of the infrastructure stack needs to be considered when looking at system performance and reliability. For example, something deep like SSD drive write behavior or GPU BIOS could make a critical difference. SREs need to be able to root-cause issues, and black-box infrastructure (a.k.a. Cloud) can get in the way.

SRE teams must balance owning the full stack versus focusing on what makes their job unique.

That’s why we have been rethinking how SRE teams approach infrastructure. Instead of trying to turn infrastructure into a black-box service, we’ve designed the Digital Rebar composable Ops platform that embraces and contains heterogeneity with a high degree of transparency and control. This is critical because SREs cannot afford to keep reinventing automation at the bottom of the stack. We must be able to share and leverage best practices on infrastructure provisioning and platform deployment.

Like the hardware that runs it, the foundation automation layer must be commoditized.

That means that Operators should be able to buy infrastructure (physical and cloud) from any vendor and run it in a consistent way.  Instead of days or weeks to get infrastructure running, it should take hours and be fully automated from power-on.  We should be able to rehearse on cloud and transfer that automation directly to (and from) physical without modification.  That practice and pace should be the norm instead of the exception.

That’s what we are building at RackN.  Our primary goal is to reuse automation whenever possible.  That was our top design priority for Digital Rebar and it drives our customer engagement models.  If you’d like to hear more, download our SRE white paper.

Shouldn’t we have Standard Automation for Commodity Infrastructure?

An entire chapter of the Google SRE book was dedicated to the benefits of improving data center provisioning via automation; however, the description was abstract with a focus on the importance of validation testing and self-healing. That lack of detail is not surprising: Google’s infrastructure automation is highly specialized and considered a competitive advantage.

Shouldn’t everyone be able to do this?

After all, data centers are built from the same basic components with the same protocols.

Unfortunately, the stack of small (but critical) variations between these components makes it very difficult to build a universal solution. Reasonable variations like hardware configuration, vendor out-of-band management protocol, operating system, support systems and networking topologies add up quickly. Even Google, with their tremendous SRE talent and time investments, only built a solution for their specific needs.

To handle this variation, our SRE teams bake assumptions about their infrastructure directly into their automation. That’s expedient because there’s generally little operational reward for creating generic solutions for specific problems. I see this all the time in data centers where server naming conventions and IP address schemes are the automation glue between tools and processes. While this may be a practical tactic for integration, it is fragile and site-specific.

Hard coding your operational environment into automation has serious downsides.

First, it creates operational debt [reference] just like hard coding values in regular development. Please don’t mistake this as a call for yak shaving provisioning scripts into open-ended models! There’s a happy medium where the scripts can be robust about infrastructure details like IPs, NIC ordering, system names and operating system behavior without compromising readability and development time.
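To make that happy medium concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the SiteProfile class, its fields, the naming pattern and the example networks are assumptions, not part of any real tool. The idea is simply that site facts live in one small data object while the provisioning logic stays generic.

```python
from dataclasses import dataclass
from ipaddress import IPv4Network
from typing import Tuple


@dataclass(frozen=True)
class SiteProfile:
    """Site-specific facts, kept apart from the provisioning logic (hypothetical)."""
    name_prefix: str            # e.g. a data center or rack naming convention
    mgmt_network: IPv4Network   # management network for this site
    first_host_offset: int = 10  # how far into the network node addresses start


def node_identity(site: SiteProfile, index: int) -> Tuple[str, str]:
    """Derive (hostname, management IP) for the index-th node, counting from 0."""
    if index < 0:
        raise ValueError("index must be non-negative")
    offset = site.first_host_offset + index
    # Stay below the last address in the network (the broadcast address).
    if offset >= site.mgmt_network.num_addresses - 1:
        raise ValueError("node index does not fit in the management network")
    address = site.mgmt_network.network_address + offset
    return f"{site.name_prefix}-{index + 1:03d}", str(address)


# Two sites reuse the same function; only the profile data differs.
austin = SiteProfile("atx-node", IPv4Network("10.20.0.0/24"))
lab = SiteProfile("lab", IPv4Network("192.168.50.0/28"), first_host_offset=2)
```

With this shape, moving the automation to a new site means adding a profile rather than editing or forking the script, and the naming and addressing assumptions are visible in one place instead of hidden throughout the code.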

Second, it eliminates reuse because code that works in one place must be forked (or copied) to be used again. Forking creates a proliferation of truth and technical debt. Unlike a shared script, the forked scripts do not benefit from mutual improvements. This is true for both internal use and when external communities advance. I have seen many cases where a company’s decision to fork away from open source code to “adjust it for their needs” caused them to forever lose the benefits accrued in the upstream community.

Consequently, Ops debt is quickly created when these infrastructure-specific items are coded into the scripts, because you have to touch a lot of code to make small changes. You also end up with hidden dependencies.

However, until recently, we have not given SRE teams an alternative to site customization.

Of course, the alternative requires some additional investment up front.  Hard coding and forking are faster out of the gate; however, the SRE mandate is to aggressively reduce ongoing maintenance tasks wherever possible.  When core automation is site customized, Ops loses the benefits of reuse both internally and externally.

That’s why we believe SRE teams should work to reuse automation whenever possible.

Digital Rebar was built from our frustration watching the OpenStack community struggle with exactly this lesson. We felt that having a platform for sharing code was essential; however, we also observed that differences between sites made it impossible to share code. Our solution was to isolate those changes into composable units. That isolation allowed us to take a system integration view that did not break when inevitable changes were introduced.
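As a rough illustration of what isolating site differences into composable units can look like, here is a small Python sketch. It is not Digital Rebar code or its API; the compose function, the layer names and every parameter value are assumptions made up for this example. Shared defaults and site overrides are kept as separate layers and merged only when they are used.

```python
from typing import Any, Dict, Mapping


def compose(*layers: Mapping[str, Any]) -> Dict[str, Any]:
    """Merge parameter layers left to right; later layers override earlier ones.

    Nested mappings are merged key by key. Any other value, including a
    list, is replaced wholesale. The input layers are not modified.
    """
    merged: Dict[str, Any] = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, Mapping):
                base = merged.get(key)
                merged[key] = compose(base if isinstance(base, dict) else {}, value)
            else:
                merged[key] = value
    return merged


# Shared unit: defaults that every site can reuse unchanged (hypothetical values).
COMMON = {"os": {"family": "linux", "install_disk": "/dev/sda"}, "dns": ["192.0.2.53"]}

# Site unit: only the values that actually differ at this site (hypothetical values).
SITE_A = {"os": {"install_disk": "/dev/nvme0n1"}, "dns": ["198.51.100.53"]}

site_a_params = compose(COMMON, SITE_A)
```

Because the shared layer is never edited for a single site, improvements to it reach every site, and each site's differences stay small and easy to review.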

If you are interested in breaking out of the script customization death spiral then review what the RackN team has done with Digital Rebar.

Even if you don’t use the code, the approach could save your SRE team a lot of heartburn down the road.  Of course, if you do want to use it then just contact us at sre@rackn.com.