Course

Site Reliability Engineering: Measuring and Managing Reliability

Google Cloud

Explore the fundamental tools for measuring and managing reliability in the "Site Reliability Engineering: Measuring and Managing Reliability" course by Google Cloud. This comprehensive learning experience delves into the significance of service level indicators (SLIs) and service level objectives (SLOs) in making systems reliable. Through a deep dive into approaches for devising appropriate SLIs and SLOs, students gain insights into quantifying risks and consequences of SLOs, as well as managing reliability through the use of an error budget.

Focused on SLIs, SLOs, and SLAs, the course equips learners to make informed decisions and to balance operational and project work. Its modules cover targeting reliability, operating for reliability, choosing good SLIs, developing SLOs, and quantifying risks to SLOs, giving participants practical skills for improving the reliability of systems and services. The course emphasizes the iterative nature of reliability improvement and the importance of error budgets, and it covers managing complexity and setting both achievable and aspirational SLOs.

  • Gain insights into the significance of service level indicators (SLIs) and service level objectives (SLOs)
  • Learn to quantify risks and consequences of SLOs
  • Understand the iterative nature of reliability improvement and the importance of error budgets
  • Acquire practical skills to enhance the reliability of systems and services

Certificate Available ✔

Course Modules

The course comprises seven modules covering the difference between DevOps and SRE, targeting reliability, operating for reliability, choosing good SLIs, developing SLOs, quantifying risks to SLOs, and understanding the consequences of SLO misses.

Introduction to SRE

Module 1 provides a foundational understanding of the difference between DevOps and SRE and of the principles of reliability in the cloud. Learners examine how SLOs help businesses make informed decisions and balance operational and project work.

Targeting Reliability

Module 2 focuses on targeting reliability, emphasizing the iterative nature of reliability improvement, and the importance of error budgets. Participants gain insights into measuring reliability and assessing reliability targets effectively.
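As an illustration of the error budget idea mentioned above, the following is a minimal sketch and not material taken from the course itself. It assumes a time-based availability SLO measured over a rolling window; the function names `error_budget_minutes` and `budget_remaining`, the 30-day window, and the sample figures are hypothetical choices made for this example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates over the window.

    Assumption: the SLO is a fraction such as 0.999 and the budget is the
    complement (1 - SLO) applied to every minute in the window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, window_days: int, bad_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - bad_minutes) / budget


if __name__ == "__main__":
    # Hypothetical service: 99.9% target over 30 days, 10.8 bad minutes so far.
    print(round(error_budget_minutes(0.999, 30), 2))
    print(round(budget_remaining(0.999, 30, 10.8), 2))
```

Under these assumptions, a 99.9% target over 30 days allows roughly 43 minutes of downtime, which is the amount a team could spend on releases, experiments, or incidents before the target is missed.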

Operating for Reliability

Module 3 delves into operational approaches for increasing reliability, including error budgets, trade-offs, and axes of improvement. Learners gain practical skills in managing complexity and in setting both achievable and aspirational SLOs.

Choosing a Good SLI

Module 4 covers the process of choosing good SLIs, refining SLI specifications, and measuring happiness in metric form. Participants explore different measurement strategies and learn to define freshness and correctness effectively.
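One common way to express an SLI in metric form is as the ratio of good events to valid events. The sketch below assumes that definition and is not taken from the course materials; `availability_sli`, `meets_slo`, and the request counts are hypothetical names and numbers used only for illustration.

```python
def availability_sli(good_events: int, valid_events: int) -> float:
    """SLI as the fraction of valid events that were good (0.0 to 1.0)."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return good_events / valid_events


def meets_slo(sli: float, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target


if __name__ == "__main__":
    # Hypothetical counts: 99,950 successful requests out of 100,000 valid ones.
    sli = availability_sli(99_950, 100_000)
    print(sli, meets_slo(sli, 0.999))
```

Defining which events count as valid (for example, excluding requests the service was never expected to handle) is part of the SLI specification, so the two counts passed in here are where most of the design work sits.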

Developing SLOs and SLIs

Module 5 focuses on developing SLOs and SLIs, guiding learners through a 4-step process and refining SLI specifications for complex user journeys. The module emphasizes the identification of observability gaps and failure modes.

Quantifying Risks to SLOs

Module 6 covers quantifying risks to SLOs, including how to model and analyze risk. Participants brainstorm SLO risks for example services and propose fixes or mitigations to meet the desired availability target.

Consequences of SLO Misses

Module 7 explores the consequences of SLO misses, highlighting the importance of error budget policies and the fundamentals of drafting them. Learners gain insights into production outages and the use of metadata for SLOs.
