Sustainability in Site Reliability Engineering (SRE)

This presentation and transcript were recorded for the USENIX SRECon Americas 2020 conference in December 2020. SRECon is a yearly conference focusing on the discipline of Site Reliability Engineering and acts as a gathering point for engineers who care deeply about site reliability, systems engineering, and working with complex distributed systems at scale.

Presentation Transcript

Introduction (0:17)

Usually, a talk about the environment or climate is filled with dramatic images of polar bears or raging wildfires. There is also a dramatic headline or two thrown in about glaciers melting and oceans rising. There is certainly an extreme hockey stick graph showing how much carbon we are emitting as a society, and you are almost guaranteed to see an image of thick black smoke billowing out of a smokestack. These talks tend to conclude that the rising temperature of the planet is causing mass extinctions of species, relocating populations, and wiping out entire countries. It’s all very heavy on the doom and gloom. While this is all very important, the overwhelming nature of that doom and gloom messaging makes it hard to process and prioritize what could, and really should, be done. I’m going to frame this conversation in a different way.

Planetary SLO (1:23)

Our planet has an SLO: the average global temperature rise needs to stay under 1.5 degrees Celsius as measured from pre-Industrial Revolution times. For anyone like me who hasn’t memorized timelines, that is the late 1700s. It’s important to note that this is an average across the globe, so some places are already past this point. If the planet breaches its SLO there will be severe impact to its users: the humans, animals, and plants. As of this recording our SLO value is 1.15 degrees and we are about 12 years away from breaching the 1.5 degree line. You can see these numbers for yourself at https://climateclock.net and the website even plays a sad song for you while you watch the numbers change because, you know, doom and gloom… But we’re SREs! We have an SLO defined and we have active monitoring of it, so we should be able to do something about this! And not JUST do something, but be critical pieces of the solution and of the maintenance of our planet’s SLO.
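The figures quoted above can be read the way an SRE reads an error budget. The following is a back-of-the-envelope sketch only: it assumes warming continues at a constant linear rate (a simplification, not a climate model), and the function names are hypothetical.

```python
# Figures quoted in the talk at recording time.
SLO_LIMIT_C = 1.5      # maximum allowed warming, degrees Celsius
CURRENT_C = 1.15       # warming observed so far
YEARS_TO_BREACH = 12   # approximate time until the limit is reached


def remaining_budget(limit: float, current: float) -> float:
    """Warming still available before the SLO is breached."""
    return limit - current


def implied_burn_rate(limit: float, current: float, years: float) -> float:
    """Average warming per year implied by the time-to-breach estimate.

    ASSUMPTION: the budget is consumed at a constant rate.
    """
    return remaining_budget(limit, current) / years


budget = remaining_budget(SLO_LIMIT_C, CURRENT_C)
rate = implied_burn_rate(SLO_LIMIT_C, CURRENT_C, YEARS_TO_BREACH)
print(f"Remaining budget: {budget:.2f} degrees C")
print(f"Implied burn rate: {rate:.3f} degrees C per year")
```

Under that linear assumption, the remaining budget is 0.35 degrees, consumed at roughly 0.03 degrees per year.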

SREs Balance Technical & Operational Aspects (2:36)

In our day-to-day work, SREs are constantly playing the balancing game of technical and operational aspects of a system. I’m defining technical here as “The hardware and software choices you make for the engineering system”. Things like the programming language, architecture, frameworks, tech stack, etc. The Operational side captures all of “The human toil needed to maintain a technical system”, which includes things like On-Call, Problem Management, Observability, and your team processes. These two areas are typically separate roles, with Software Engineers focusing primarily on the technical piece and Ops Engineers focusing primarily on the Operational piece. But SRE has brought these two together in a balance in order to drive reliability through the entire system.

Sustainability = Reliability Over Time (3:33)

Reliability takes many forms. It isn’t just the code or test cases we write, it’s the architecture of the system, it’s the relationships and trust we form with our colleagues, it’s the blameless and learning cultures that we cultivate every day, and the infinite curiosity that leads to consistent reliability. Our goal is not just Reliability but maintaining that reliability over time. And reliability over time is a really great way to define Sustainability. Which is why I believe that there is a missing piece in what we do in our roles as SREs.

Environmental Sustainability (4:11)

Environmental Sustainability, or “The impact on the planet of our collective Technical and Operational choices”. This speaks to that reliability over time for a system, and more specifically has to do with things like your carbon emissions, the power sources for your hardware, resource utilization, and the waste your technical choices generate. We live in a world of ephemeral drives and virtual machines and things that can just be thrown away in an essentially infinitely resourced cloud. And that’s great, and super useful for our jobs, but we have done that at the neglect of the long term and we have to bring it back into the equation.

SREs Should Focus On All 3 Areas of Sustainability (4:55)

In fact, we could go as far as to say we really are focusing on sustainability in all 3 of these areas and not just the reliability of our systems. In the same way that we have balanced the Technical and Operational pieces, we need to also balance the Environmental and long-term effects. The key is that it is a balance though. The most sustainable system is one that is never built. You can’t swing too far in one direction and overcompensate. Borrowing from Game Theory and Economics, you want to find the Pareto Optimal point between these 3 areas where they are maximizing each of their own benefits without disproportionately negatively impacting the others. This point will shift around over time as new technologies emerge, reliability changes, power grids evolve, and business rules and priorities change. Find that point and maintain it. We will always have to build things, it’s part of the job, but we can do it in an intentional and principled way to balance all 3 sustainability areas.
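One way to make the Pareto idea concrete is to score candidate designs on the three areas and keep only those that no other design beats on every axis. This is a minimal sketch: the design names and scores are invented for illustration, and it assumes each area can be reduced to a single "higher is better" number, which real systems rarely allow so neatly.

```python
from typing import Dict, List, Tuple

# (technical, operational, environmental) scores; higher is better.
Scores = Tuple[float, float, float]


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(options: Dict[str, Scores]) -> List[str]:
    """Names of the options that are not dominated by any other option."""
    return [
        name
        for name, scores in options.items()
        if not any(
            dominates(other, scores)
            for other_name, other in options.items()
            if other_name != name
        )
    ]


# HYPOTHETICAL designs and scores, for illustration only.
candidates = {
    "design-a": (0.9, 0.6, 0.3),
    "design-b": (0.7, 0.7, 0.7),
    "design-c": (0.6, 0.6, 0.6),  # worse than design-b on every axis
    "design-d": (0.4, 0.9, 0.8),
}
front = pareto_front(candidates)
print(front)
```

Everything left on the front is a defensible trade-off; choosing among those designs is still a judgment call that depends on current priorities.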

The Green Principles of Engineering (5:57)

Similarly to how Security is part of what we do, Sustainable Software Engineering should also be part of what we do. Luckily, principles.green has defined some principles for you: 1 – Build applications that are carbon efficient, which means minimizing the amount of carbon emitted per unit of work. 2 – Build applications that are energy efficient – if you create anything for mobile devices you are probably well aware of this already and how your code affects battery life. Sustainable Software Engineering takes responsibility for the electricity it consumes and is built in a way that minimizes consumption of that energy. 3 – Run your servers and machines at a high rate of utilization. Take full advantage of what you have already and minimize wasted cycles and resources. 4 – Understand the carbon intensity behind your system. A carbon intensity value is how many grams of carbon are emitted to produce a kilowatt-hour of electricity. 5 – Minimize the carbon embodied in your hardware and extend the lifetime of your machines. 6 – Reduce the amount of data and the distance it must travel across the network. 7 – Shape demand to your supply rather than shaping your supply to your demand by making your systems Carbon-Aware. This is similar to the practice of load shedding that drops traffic during times of peak load. 8 – Look at the whole system rather than just your specific piece to understand where you can increase carbon efficiency. Deep-diving all of these principles is an entire talk on its own, and you can read more for yourself at principles.green (which is on GitHub so feel free to contribute!), so instead I will focus on a few examples of applying these principles to some common scenarios.
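Principles 1 and 4 reduce to simple arithmetic: emissions are energy used multiplied by the grid's carbon intensity, and carbon efficiency divides that by the work done. The sketch below illustrates this with hypothetical function names and made-up input values; it is not a measurement method defined by principles.green.

```python
def emissions_grams(energy_kwh: float, intensity_g_per_kwh: float) -> float:
    """Carbon emitted, in grams, for a given energy use and grid carbon intensity."""
    return energy_kwh * intensity_g_per_kwh


def grams_per_unit_of_work(
    energy_kwh: float, intensity_g_per_kwh: float, units_of_work: int
) -> float:
    """Carbon emitted per unit of work (for example, per request served)."""
    if units_of_work <= 0:
        raise ValueError("units_of_work must be positive")
    return emissions_grams(energy_kwh, intensity_g_per_kwh) / units_of_work


# ILLUSTRATIVE values only: 10 kWh consumed on a 400 g/kWh grid
# while serving one million requests.
total = emissions_grams(10, 400)
per_request = grams_per_unit_of_work(10, 400, 1_000_000)
print(f"Total: {total:.0f} g, per request: {per_request:.4f} g")
```

The same service becomes more carbon efficient by using less energy, by running on a cleaner grid, or by doing more work with the same energy.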

Applying Green Principles: Microservices architecture (7:57)

What would this look like for Microservices? (1) The first step is to focus on your compute utilization. Limit the number of cores you need and use the cores that you have more effectively through more efficient code or improved scaling parameters. (2) Similarly, for storage: Ensure proper indexing and sharding rules to minimize power and processing and maximize your energy proportionality. Having a more efficient database means more efficient queries, which means less wait time, which makes your compute and application faster. (3) Reduce your payloads and overall volume of data as well as the distance it travels across the network. (4) And look to reduce your overall number of microservices, as each one brings an overhead with it. I’m not advocating for a full monolith here since you probably have very good business reasons for all your microservices, but you can move the ones that have tight dependencies on each other to the same nodes or clusters. Try co-locating them with their storage if they do a lot of reads. You probably are doing some of this already, but viewing these through the lens of Environmental Sustainability changes the priority and value of the work.

Applying Green Principles: Demand Shifting in Time (9:18)

Let’s look at a different example that has to do with Carbon Intensity. Renewable energy sources like Solar or Wind take very little carbon to produce energy. However, when the sun isn’t shining and the wind isn’t blowing, they are supplemented by other sources like coal or gas that take much more carbon to produce the same amount of energy. That means that the carbon intensity for a power grid has natural fluctuations like this chart from one of California’s power grids. Your system can take advantage of these fluctuations to reduce your overall carbon footprint. One way is to intentionally delay workloads and run them during a lower carbon intensity time period. A 1-hour job could be 40% less carbon intense simply by running it at a different time of day! Anything that doesn’t require an immediate result, like Batch processing, is a great candidate for this approach.
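Delaying a job to a cleaner window can be sketched as a search over an hourly carbon intensity forecast. This is an illustration under stated assumptions: the forecast values are invented, the function name is hypothetical, and a real system would fetch a forecast from a grid data provider and respect the job's deadline.

```python
from typing import Sequence


def best_start_hour(forecast: Sequence[float], job_hours: int) -> int:
    """Index of the start hour whose window has the lowest total intensity.

    forecast holds one carbon intensity value (g/kWh) per hour.
    """
    if job_hours <= 0 or job_hours > len(forecast):
        raise ValueError("job_hours must be between 1 and len(forecast)")
    window = sum(forecast[:job_hours])
    best_start, best_total = 0, window
    # Slide the window one hour at a time, updating the running total.
    for start in range(1, len(forecast) - job_hours + 1):
        window += forecast[start + job_hours - 1] - forecast[start - 1]
        if window < best_total:
            best_start, best_total = start, window
    return best_start


# HYPOTHETICAL hourly forecast in g/kWh, starting from "now" at index 0.
hourly_forecast = [300, 280, 250, 200, 180, 210, 260, 320]

start = best_start_hour(hourly_forecast, job_hours=1)
saving = 1 - hourly_forecast[start] / hourly_forecast[0]
print(f"Run the 1-hour job at hour {start}; intensity is {saving:.0%} lower than now")
```

With these made-up numbers, waiting four hours gives a saving of the same order as the 40% mentioned above, without changing the job itself.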

Applying Green Principles: Demand Shifting in Space (10:21)

In addition to shifting your workloads in time, you can also move them to locations that have more renewable energy and lower carbon intensity. This is a map from a service called WattTime that shows grid emissions for power grids across the world. You can see which regions emit less carbon overall and use that to reduce your baseline emissions. Some grids are cleaner than others but they do fluctuate. If you use Kubernetes, you can customize the scheduler to take advantage of these fluctuations by setting the carbon intensity value as a preference in the scheduling algorithm. The scheduler can automatically evaluate and select regions with low carbon emissions for you.
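The region selection logic can be sketched independently of any particular scheduler. The example below is not Kubernetes or WattTime code: the region names, intensity values, and the latency limit are all assumptions added for illustration, with the latency limit standing in for the reliability constraints a real placement decision has to respect.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Region:
    name: str
    intensity_g_per_kwh: float  # current grid carbon intensity
    latency_ms: float           # latency from the region to your users


def pick_region(regions: Iterable[Region], max_latency_ms: float) -> Optional[Region]:
    """Lowest-carbon region that still meets the latency limit, or None."""
    eligible = [r for r in regions if r.latency_ms <= max_latency_ms]
    if not eligible:
        return None
    return min(eligible, key=lambda r: r.intensity_g_per_kwh)


# HYPOTHETICAL regions and values, for illustration only.
regions = [
    Region("region-a", intensity_g_per_kwh=450, latency_ms=20),
    Region("region-b", intensity_g_per_kwh=120, latency_ms=80),
    Region("region-c", intensity_g_per_kwh=60, latency_ms=250),
]

choice = pick_region(regions, max_latency_ms=100)
print(choice.name if choice else "no eligible region")
```

The cleanest region overall is excluded here because it is too far from users, which is the kind of balance between environmental and technical concerns the talk describes.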

Emissions Impact from Artificial Intelligence (11:10)

And one more example here. Let’s say there is an industry that mines a resource, refines it, and sells it for a lot of money, but has a very large negative impact on the environment. You probably think I’m talking about the Oil Industry. While that is true, I’m actually talking about the AI industry and the amount of data it consumes and processes. Computation costs of AI models have been doubling every few months and increased 300,000x in just 6 years! All that compute produces a lot of carbon emissions. Let’s try and put that into perspective. A roundtrip flight between NYC and SF has a carbon footprint of about 1,984 lbs of CO2. The average carbon footprint of a person over 1 year? 11,023 lbs of CO2, or about 5 times more. For Americans that number goes up to 36,156 lbs of CO2 in a single year, about 3 times the world average. Over its entire lifetime, and this includes manufacturing and fuel, a car in the US will produce about 126,000 lbs of CO2. And now we come to the 213-million-parameter NLP model: it has a carbon footprint of 626,155 lbs of equivalent CO2. That’s one roundtrip flight from NY to SF every single day for almost a year. An entire AI sub-discipline has spun up around GreenAI efforts, with some good projects like ML CO2 Impact that try to be more transparent about the costs. Just being aware of the numbers and impact is a fantastic start, but it is still early days for any improvements in the energy consumption and compute required in AI.
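The comparisons above are straightforward ratios, and a few lines of arithmetic reproduce them from the figures quoted in the talk. The constant and variable names are chosen here for illustration.

```python
# Carbon footprints quoted in the talk, in lbs of CO2 (or CO2 equivalent).
FLIGHT_NYC_SF_ROUNDTRIP = 1_984
PERSON_WORLD_AVG_PER_YEAR = 11_023
PERSON_US_AVG_PER_YEAR = 36_156
CAR_LIFETIME_US = 126_000
NLP_MODEL_TRAINING = 626_155

# How many of each reference item the model's footprint is equivalent to.
flights_equivalent = NLP_MODEL_TRAINING / FLIGHT_NYC_SF_ROUNDTRIP
cars_equivalent = NLP_MODEL_TRAINING / CAR_LIFETIME_US
us_vs_world = PERSON_US_AVG_PER_YEAR / PERSON_WORLD_AVG_PER_YEAR
person_vs_flight = PERSON_WORLD_AVG_PER_YEAR / FLIGHT_NYC_SF_ROUNDTRIP

print(f"Roundtrip flights: {flights_equivalent:.0f}")
print(f"Car lifetimes: {cars_equivalent:.1f}")
print(f"US person vs world average: {us_vs_world:.1f}x")
print(f"World-average person-year vs one flight: {person_vs_flight:.1f}x")
```

The model comes out at roughly 316 roundtrip flights, which is where "every single day for almost a year" comes from, and at about five car lifetimes.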

A lot of these examples point to being really good at optimizations and efficiency.

Carbon-Efficient is Faster, Cheaper, More Resilient (13:40)

Well, that’s kind of the secret here and ultimately what Environmental Sustainability is about: being more efficient with the resources you have so you don’t need to produce or consume more. The 3 R’s of the environment are Reduce, Reuse, Recycle, and there is even a fourth one becoming more popular called Refuse. These also align with good software engineering practices: Refuse new features or machines unless they’re necessary, and even then, reuse existing code and resources to reduce your overall complexity and effort. If you focus on efficiency at all levels of your system, not just the software and hardware but what it takes to RUN them over time, then you can minimize the impact your system has on the planet. Because of the focus on high utilization and efficiency, Carbon-Efficient systems are typically faster and cheaper. They are also typically more resilient because of the priority on overall simplification. The focus on carbon emissions further incentivizes the system to be faster, cheaper, and more reliable over time.

COVID-19 Impact to Emissions (14:47)

But wait, haven’t we fixed all our emissions problems because of COVID and everyone working from home now? Emissions have gone down, at least for a few months, and it’s what I call the COVID Caveat. The pandemic will eventually be over and people will revert back to their carbon-emitting ways. The bigger problem is that all these emissions are putting carbon into the atmosphere. That carbon stays in the atmosphere for years before dissolving into the ocean and causing ocean acidification, which can last for centuries. There is a ton, well, dozens of gigatons actually, of carbon in the atmosphere now, and we need to both cut emissions AND capture the carbon that is there before it dissolves.

Companies Taking Sustainability Seriously (15:46)

But don’t worry, we have some very powerful allies here. The 3 biggest public clouds have set some ambitious goals and have already made great progress on renewable energy. Both Microsoft and Google have committed to being Carbon Negative (which is removing more carbon from the atmosphere than they generate) and to removing their companies’ entire historical emissions. For a company like Microsoft, by 2050, that’s 75 years’ worth of carbon! Stripe just released a way for anyone using their payments to automatically direct 1% of their revenue to fund carbon removal projects. Apple is making a carbon-neutral iPhone and has committed to being Carbon Neutral by 2030, Amazon has committed to 50% of their shipments being Carbon Neutral by 2030, and Starbucks will cut their emissions, water use, and waste by 50%, also by 2030. And Walmart will be carbon-neutral by 2040. That’s some of the biggest companies in the world as allies in this effort! There are efficiency gains everywhere, and so much opportunity to reduce emissions and become more resilient that we are not taking advantage of as SREs.

Takeaways (17:02)

So what do we do now? Most important is to Ask The Question. What are your sustainability goals or SLOs? SREs are in the perfect position in the stack to have the biggest impact on making software engineering sustainable and driving that reliability over time. Ask The Question at metrics reviews, in design reviews, at your company all-hands, your team’s stand-up, the General Slack channel, wherever you feel comfortable. Design for sustainability in your software upfront, not just after the fact. Second, Apply The Principles. Even if your company or team or you don’t value sustainability, you do value efficiency and can still apply the Sustainable Software Engineering principles to your own work and advocate for them in others. And lastly, Continue The Conversation. You aren’t on this journey alone; keep talking about it and share your progress and wins. Write blog posts or tweets about your work, help refine the Sustainable Software Engineering Principles, or join communities like climateaction.tech. And if nothing else, tell me about it. I would love to hear any thoughts, or big wins, and even the huge mistakes. A habitable planet is the ultimate reliability and, as an SRE myself, I want to make sure we have one.

Thanks!