Defending Ops without Killing Unicorns

By Rob Hirschfeld

Cloud Ops is a brutal business: operators are expected to maintain a stable and robust operating environment while also embracing waves of disruptive changes using unproven technologies. While we want to promote these promising new technologies, the unicorns, operators still have to keep the lights on; consequently, most companies turn to outside experts or internal strike teams to get this new stuff working.

Our experience is that an on-site deployment by professional services (PS) is often much harder than expected. Why? Because of inherent mission conflict. The PS paratrooper team sent to accomplish the “install Foo!” mission is at odds with the operators’ maintain-and-defend mission. Where the short-term team is willing to blast open a wall for access, the long-term team is highly averse to collateral damage. Both teams are faced with an impossible situation.

I’ve been promoting Open Ops around a common platform (obviously, OpenCrowbar in my opinion) as a way to address cross-site standardization.

Why would a physical automation standard help? Generally, the pros expect to arrive with everything “ready state” including OS installed and all the networking ready. Unfortunately, there’s a significant gap between an OS installed and … installs are always a journey of discovery as the teams figure out the real requirements.

Here are some questions that we’ve put together to gauge whether installs are really going the way you think:

  • How often is the customer site ready for deployment?  If not, how long does that take to correct?
  • How far into a deployment do you get before an error in deployment is detected?  How often is that error repeated across all the systems?
  • How often is an “error” actually an operational requirement at the site that cannot be changed without executive approval and weeks of work?
  • How often are issues found after deployment has started that force an install restart?
  • Can the deployment be recreated on another site?  Can the install be recreated in a test environment for troubleshooting?
  • How often are systems updated by hand or customized as part of a routine installation?
  • How often are developers needed to troubleshoot issues that end up being configuration related?
  • How often are back-doors left in place to help with troubleshooting?
  • What is the upgrade process like?  Is the state of the “as left” system sufficiently documented to plan an upgrade?
  • What happens if there’s a major OS upgrade or patch required that impacts the installed application?
  • Can changes to the site be rolled out in stages?
  • Can the upgrade be automated and rehearsed?
