Digital Library, Books and Resources Hub

The SRE resource library

SRE is a mindset, a set of engineering practices, and a job function. Here you will find articles, videos, and guides to help you implement SRE principles and run reliable production systems.

Machine Learning in Production

Start your journey by exploring

Machine Learning in Production

Continue your journey by reading

Efficient Machine Learning Inference

Extend your journey by watching

Machine Learning at Scale

  • Building Secure & Reliable Systems

    Can a system be considered truly reliable if it isn't fundamentally secure? Or can it be considered secure if it's unreliable? Security is crucial to the design and operation of scalable systems in production, as it plays an important part in product quality, performance, and availability. In this book, experts from Google share best practices to help your organization design scalable and reliable systems that are fundamentally secure.

    Edited by: Heather Adkins, Betsy Beyer, Paul Blankinship, Ana Oprea, Piotr Lewandowski, Adam Stubblefield

  • The Site Reliability Workbook

    The Site Reliability Workbook is the hands-on companion to the bestselling Site Reliability Engineering book and uses concrete examples to show how to put SRE principles and practices to work. This book contains practical examples from Google’s experiences and case studies from Google’s Cloud Platform customers. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn’t.

    Edited by: Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara and Stephen Thorne

  • Site Reliability Engineering

    Members of the SRE team explain how their engagement with the entire software lifecycle has enabled Google to build, deploy, monitor, and maintain some of the largest software systems in the world.

    Edited by: Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy

Implementing SLOs

Begin by reading

Implementing SLOs

Dig deeper by exploring

Alerting on SLOs

Build your skills with

Art of SLOs

Non-Abstract Large System Design

Learn the basics by reading

Introducing Non-Abstract Large System Design

Develop fundamentals by exploring

SRE Classroom: Distributed ImageServer

Build advanced skills with this video workshop

How to Design a Distributed System
