Platform Engineering and DevOps

08 Aug, 2023

A Mental Model for Platform Engineering and DevOps

A co-worker recently asked me when the right moment is to build a platform in a software organization. That made me think about the exact meaning of the term and about when to actually start nurturing a platform. This post is an attempt to bring together aspects I have known and read about for years and to provide a holistic view on DevOps and Platform Engineering.

The DevOps Hype and Delusion, and Science

According to several sources, the DevOps movement really ignited in 2009 at the Velocity Conference, when John Allspaw and Paul Hammond demonstrated how Flickr deployed to production ten times a day, a number that seemed almost like science fiction to the audience¹.

After that, the usual hype cycle set in while the unicorns and FAANG companies made their mark on the industry and pushed the needle further on velocity metrics (Amazon reported deploying to production every 11.6 seconds as early as 2011²).

Unfortunately, those results were hard to replicate for the broader population, mainly because there was no recipe for success based on empirical evidence or research on how to achieve the promises of DevOps as an organization.

In 2012, Puppet started conducting the State of DevOps survey to draw a conclusive picture of the adoption of DevOps principles and outcomes³. That effort was later taken over by DevOps Research and Assessment (DORA), a research organization funded mainly by Google.

Based on this data, in 2018 Dr. Nicole Forsgren, Jez Humble and Gene Kim published the book Accelerate, which provided scientific evidence for the positive effects that implementing DevOps principles has on organizations and individuals alike. Additionally, it established four metrics to describe Software Delivery Performance and to score System Performance in general⁴, now commonly referred to as the DORA metrics.

Lead Time
Deployment Frequency
Mean Time to Restore
Change Fail Percentage
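
To make the four metrics more tangible, here is a minimal sketch of how they could be derived from a list of deployment records. The `Deployment` record, its field names and the use of a plain mean are assumptions made for illustration; they are not taken from Accelerate, and real measurements usually come from CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict[str, float]:
    """Compute the four metrics over a non-empty list of deployments."""
    lead_times = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deployments
    ]
    failures = [d for d in deployments if d.failed]
    restore_times = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "lead_time_hours": mean(lead_times),
        "deployments_per_day": len(deployments) / period_days,
        "mean_time_to_restore_hours": mean(restore_times) if restore_times else 0.0,
        "change_fail_percentage": 100 * len(failures) / len(deployments),
    }


# Made-up sample data covering a two-day period
t0 = datetime(2023, 8, 1, 9, 0)
history = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0, t0 + timedelta(hours=4), failed=True,
               restored_at=t0 + timedelta(hours=5)),
    Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=6)),
    Deployment(t0 + timedelta(days=2), t0 + timedelta(days=2, hours=4)),
]
metrics = dora_metrics(history, period_days=2)
```

With this sample data the sketch yields a lead time of 4 hours, 2 deployments per day, a mean time to restore of 1 hour and a change fail percentage of 25.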

While additional analysis also found evidence that good Software Delivery Performance correlates with good organizational culture and individual well-being⁵, the relationship between correlation and causation had not been fully understood at that point.

In addition, Goodhart's law started diminishing the effectiveness of the metrics almost immediately after publication: practitioners and managers alike pushed for higher rankings on those metrics without improving either technological capabilities or organizational culture, which led to mixed results.

Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure"⁶.

Turning Correlation into Causation

As research went on at the Microsoft Developer Experience Lab, DORA and other places, the terms Developer Experience and Developer Productivity emerged to describe a set of technical capabilities and organizational measures that enable good Software Delivery Performance.

Most prominent is the SPACE Framework, published in 2021⁷, which focuses on five dimensions to better understand Developer Experience. As Software Engineering is a socio-technical exercise, the metrics are both qualitative (survey data) and quantitative (system metrics).

Satisfaction and well-being
Performance
Activity
Communication and collaboration
Efficiency and flow

To improve an organization's DORA metrics, it is necessary to invest in metrics from different dimensions, as these are causally interconnected according to the paper. Additional research was recently published⁸ as well, deepening the focus on the individual developer's experience, whereas DORA solely indicates system performance.

[Figure: Developer Experience and DORA]

Technical Capabilities and Platform

Recently, DevOps practitioners have moved in two diverging directions:

Site Reliability Engineering
Platform Engineering

While the first is already well understood and established both in the industry⁹ and in the literature¹⁰, the latter is still in its hype phase and not yet well defined.

Starting with the Platform term, let's use the following working definition:

The platform is the sum of technological capabilities and systems used to develop and deliver software.

This includes, but is not limited to, the following elements:

CI/CD
GitOps
Source Control
Automated Testing
Infrastructure as Code
Containers and Container Orchestration
Security and Compliance Measures

In contrast to SRE, whose scope is the lifecycle of a single service, Platform Engineering aims to provide a common platform for all teams involved in developing and delivering software across the organization.

Note: Deploying Kubernetes and integrating all CNCF projects is not necessarily Platform Engineering. Deploying Backstage.io is not, either.

Designing, operating, and managing the platform itself is where the engineering part of the term comes into play.

The "Golden Path" and Product Market Fit

The ultimate goal of platform teams is 100% adoption of the platform by the software teams across the organization. As this is illusory for many reasons¹¹, platform teams try to get as close to it as possible by "paving a golden path".

This golden path is meant to make it easy and worthwhile for application teams to integrate into the common platform and to abandon duplicates of functions and capabilities that the platform team already provides (like competing CI/CD systems, orchestrators, ...).

Note: Platform Engineering is not about forcing seniors into a particular IDE, but about providing juniors and new joiners with a workspace that allows them to contribute quickly, confidently and safely.

Providing the technical capabilities ("the product") required by the application teams ("the market") is what platform engineering is all about.

So finding "market fit" for the platform is actually an exercise in Product Engineering¹².

What to work on next?

In essence, there is never a green-field deployment for a "platform" because even in a fresh start-up the first software engineer will walk in and open her favorite IDE and create the first pipeline in her CI/CD system of choice, which is then the platform that will be iterated upon. Therefore, it is unreasonable to think of platform engineering with a fixed starting point or to ask "When to start building a platform?"

Instead, platform engineers should look at the weakest SPACE metrics and provide technical capabilities that can potentially raise those values. As this is based on hypotheses and experiments, the appropriate way of working is to follow a Deming Cycle of Plan - Do - Check - Act.

One example would be the following: survey data from software engineers indicates discontent with inconsistencies when executing pipelines. Analysis shows that software versions vary greatly between different CI runners. In this scenario, the next best thing to work on would be to improve the CI system by using build containers with a fixed tag on all runners to harmonize the environment. After applying the change, both the quantitative metric (the number of successful pipeline runs) and the qualitative measure (the flow perceived by the engineers) should increase.
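
The analysis and the check in this scenario could be sketched roughly as follows. The runner names, tool versions and pipeline results are made up, and both helper functions are hypothetical; in practice the inventory would come from the CI system itself.

```python
from collections import defaultdict


def version_drift(runners: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """Return every tool that is installed in more than one version across runners."""
    versions: dict[str, set[str]] = defaultdict(set)
    for tools in runners.values():
        for tool, version in tools.items():
            versions[tool].add(version)
    return {tool: found for tool, found in versions.items() if len(found) > 1}


def success_rate(pipeline_results: list[bool]) -> float:
    """Share of successful pipeline runs, as a percentage."""
    return 100 * sum(pipeline_results) / len(pipeline_results)


# Hypothetical tool inventory of three CI runners before the change
runners = {
    "runner-a": {"node": "18.16.0", "python": "3.11.4"},
    "runner-b": {"node": "16.20.1", "python": "3.11.4"},
    "runner-c": {"node": "18.16.0", "python": "3.11.4"},
}
drift = version_drift(runners)  # only "node" differs between runners

# Made-up pipeline outcomes before and after pinning the build container tag
before = success_rate([True, False, True, False, True])
after = success_rate([True, True, True, False, True])
improved = after > before
```

The quantitative check only covers half of the hypothesis; the perceived flow still has to be confirmed by repeating the survey.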

Conclusion

By integrating DORA metrics, the SPACE Framework as well as all technical capabilities constituting the platform, the following Bow-Tie model can be used to illustrate a platform engineering team's mission.

[Figure: Bow-tie model connecting platform engineering (PE) and DORA]

As described, SPACE metrics predict software delivery performance, which is operationalized using the DORA metrics and results in better business outcomes and higher-quality software. Technical capabilities developed by the platform engineering team affect both the SPACE metrics and the DORA metrics.

To decide how to improve your platform, identify lacking metrics on the right side, form a hypothesis about which capability could provide an uptick in that particular metric, implement a solution and test the hypothesis.
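
As a minimal sketch, this loop can be written down in a few lines. It assumes that all metrics have been normalized to a 0-100 scale where higher is better; the `Hypothesis` record, the scores and the threshold are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """Hypothetical record of one platform experiment."""
    metric: str           # the metric expected to improve
    capability: str       # the platform capability to build or change
    baseline: float       # metric value before the change
    expected_gain: float  # minimum improvement that counts as success


def weakest_metric(scores: dict[str, float]) -> str:
    """Plan: pick the metric with the lowest normalized score."""
    return min(scores, key=scores.get)


def check(hypothesis: Hypothesis, measured: float) -> bool:
    """Check: did the metric improve by at least the expected gain?"""
    return measured - hypothesis.baseline >= hypothesis.expected_gain


# Made-up normalized scores per SPACE dimension
scores = {
    "satisfaction": 72.0,
    "performance": 65.0,
    "activity": 70.0,
    "communication": 68.0,
    "efficiency_and_flow": 41.0,
}
target = weakest_metric(scores)
hypothesis = Hypothesis(
    metric=target,
    capability="pinned build containers on all CI runners",
    baseline=scores[target],
    expected_gain=10.0,
)
confirmed = check(hypothesis, measured=55.0)  # value measured after the change
```

If the check fails, the hypothesis is discarded or refined and the cycle starts again with a new plan.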

By applying this mental model, platform teams can focus on improving developer experience in a measurable fashion without getting trapped too deeply in the usual hype surrounding new terms and practices within our industry.

Sources

Kim, G., Humble, J., Debois, P., Willis, J., & Forsgren, N. (2016). The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations (1st ed.). IT Revolution. (p. 5)
Amazon Deployment statistics
History of the State of DevOps Report - Puppet
Forsgren, N., Humble, J., & Kim, G. (2018). Accelerate: The science of lean software and DevOps: Building and scaling high performing technology organizations. IT Revolution Press. (p. 17)
Forsgren, N., Humble, J., & Kim, G. (2018). Accelerate: The science of lean software and DevOps: Building and scaling high performing technology organizations. IT Revolution Press. (p. 88)
Goodhart's Law - Wikipedia
Forsgren, N., Storey, M.-A., Maddila, C., Zimmermann, T., Houck, B., & Butler, J. (2021). The SPACE of Developer Productivity: There's more to it than you think. ACM Queue, 19(1), 20–48. https://doi.org/10.1145/3454122.3454124
Noda, A., Storey, M.-A., Forsgren, N., & Greiler, M. (2023). DevEx: What Actually Drives Productivity. ACM Queue, 21(2), 35–53. https://doi.org/10.1145/3595878
Reliability and SRE in the 2022 State of DevOps Report
Limoncelli, T. A., Chalup, S. R., & Hogan, C. J. (2014). The practice of cloud system administration: DevOps and SRE practices for web services. Pearson Education. (p. 401)
Kim, G., Humble, J., Debois, P., Willis, J., & Forsgren, N. (2016). The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations (1st ed.). IT Revolution. (p. 297)
Platform engineering is just DevOps with a product mindset - StackOverflow Blog

#devops #platformengineering #research