Explore the fundamental tools for measuring and managing reliability in the "Site Reliability Engineering: Measuring and Managing Reliability" course by Google Cloud. The course explains why service level indicators (SLIs) and service level objectives (SLOs) matter for making systems reliable. Students learn approaches for devising appropriate SLIs and SLOs, quantifying risks to SLOs and the consequences of missing them, and managing reliability through an error budget.
With a focus on understanding SLIs, SLOs, and SLAs, this course equips learners to make informed decisions and balance operational and project work effectively. Its modules cover targeting reliability, operating for reliability, choosing good SLIs, developing SLOs, and quantifying risks to SLOs, giving participants practical skills for improving the reliability of systems and services. The course emphasizes the iterative nature of reliability improvement and the importance of error budgets, and offers guidance on managing complexity and setting both achievable and aspirational SLOs.
Certificate Available ✔
This course comprises modules that cover essential topics such as the difference between DevOps and SRE, targeting reliability, operating for reliability, choosing good SLIs, developing SLOs, quantifying risks to SLOs, and understanding the consequences of SLO misses.
Module 1 provides a foundational understanding of the difference between DevOps and SRE, and of the principles of reliability in the cloud. Learners see how SLOs help businesses make informed decisions and balance operational and project work effectively.
Module 2 focuses on targeting reliability, emphasizing the iterative nature of reliability improvement, and the importance of error budgets. Participants gain insights into measuring reliability and assessing reliability targets effectively.
Module 3 delves into operational approaches for increasing reliability, including error budgets, trade-offs, and axes of improvement. Learners gain practical skills in managing complexity and in setting both achievable and aspirational SLOs.
Module 4 covers the process of choosing good SLIs, refining SLI specifications, and measuring happiness in metric form. Participants explore different measurement strategies and learn to define freshness and correctness effectively.
Module 5 focuses on developing SLOs and SLIs, guiding learners through a 4-step process and refining SLI specifications for complex user journeys. The module emphasizes the identification of observability gaps and failure modes.
Module 6 covers quantifying risks to SLOs, including how to model and analyze those risks. Participants brainstorm SLO risks for example services and propose fixes or mitigations to meet the desired availability target.
Module 7 explores the consequences of SLO misses, highlighting the importance of error budget policies and the fundamentals of drafting them effectively. Learners gain insights into production outages and the use of metadata for SLOs.
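The error budget that recurs across the modules above comes down to simple arithmetic: the budget is the share of events an SLO allows to fail. The sketch below is a minimal illustration of that idea, not material taken from the course; the function names, the 99.9% target, and the event counts are all assumptions chosen for the example.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Return how many bad events the SLO tolerates over the window.

    slo_target is a fraction strictly between 0 and 1 (0.999 means 99.9%).
    Hypothetical helper for illustration only.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Return the fraction of the error budget still unspent.

    A negative result means the budget is exhausted and the SLO was missed.
    """
    budget = error_budget(slo_target, total_events)
    return (budget - bad_events) / budget


if __name__ == "__main__":
    # Assumed numbers: a 99.9% availability SLO over one million requests,
    # of which 250 failed.
    allowed = error_budget(0.999, 1_000_000)
    left = budget_remaining(0.999, 1_000_000, 250)
    print(f"Allowed bad requests: {allowed:.0f}")
    print(f"Budget remaining: {left:.0%}")
```

A team could compare the remaining fraction against thresholds in its error budget policy, for example pausing feature launches once the value drops below zero. The thresholds themselves are a policy decision and are not implied by the calculation.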