Reliability Engineering Coaching
The Reliability Engineering Coaching workshop introduces ways to scale services economically and reliably across an organization. It explores strategies to improve agility, cross-functional collaboration, and transparency into service health, with the goal of building resiliency through design, automation, and closed-loop remediation.
Features
24 hours of instructor-led training classes
Relevant industry insights
Real-world experience
Workshop Objective
This workshop uses real-life scenarios and case stories to equip participants with the practices, methods, and tools needed to engage everyone in the organization who is involved in reliability. Upon completion, participants will have tangible takeaways to apply back in the office, such as implementing reliability models that fit their organizational context, building advanced observability in distributed systems, building resiliency by design, and responding to incidents effectively.
The workshop was developed by drawing on key sources, engaging with thought leaders in the DevOps space, and working with organizations embracing Reliability Engineering to extract real-life best practices. It is designed to coach the key principles and practices.
Learning Objectives
At the end of the workshop, the following learning objectives are expected to be achieved:
- A practical view of how to implement a flourishing reliability culture in your organization.
- The underlying principles of reliability, what reliability is not (its anti-patterns), and how to recognize and avoid those anti-patterns.
- The organizational impact of introducing reliability.
- Mastering the art of SLIs and SLOs in a distributed ecosystem, and extending the use of error budgets beyond the usual to innovate and avoid risks.
- Building security and resilience by design in a distributed, zero-trust environment.
- How to implement full-stack observability and distributed tracing, and bring about an observability-driven development culture.
- Curating data using AI to move from reactive to proactive and predictive incident management, and using DataOps to build clean data lineage.
- Why Platform Engineering is so important in building consistency and predictability for reliability.
- Implementing practical Chaos Engineering.
- Major incident response responsibilities in reliability, based on an incident command framework, with examples of the anatomy of unmanaged incidents.
- A perspective on why Reliability Engineering can be considered the purest implementation of DevOps.
- The Reliability Engineering execution model.
- Understanding why reliability is everyone’s problem.
- Learnings from success stories, from a reliability perspective.
Course Agenda
Module 1: Anti-patterns
● Rebranding Ops as Reliability Engineering
● Users notice an issue before you do
● Measuring until my Edge
● False positives are worse than no alerts
● Configuration management trap for snowflakes
● The Dogpile: Mob incident response
● Point fixing
● Production Readiness Gatekeeper
● Fail-Safe, really?
● Use Case Discussion
Module 2: SLO is a Proxy for Customer Happiness
● Define SLIs that meaningfully measure the reliability of a service from a user’s perspective
● Choose appropriate SLO targets, including how to perform statistical and probabilistic analysis
● Use error budgets to help your team have better discussions and make better data-driven decisions
● Use Case Discussion
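As a rough illustration of the arithmetic behind the SLO and error budget topics in this module, the following is a minimal sketch rather than workshop material. The function names (`error_budget`, `allowed_downtime_minutes`) and the request-based SLI are assumptions made for the example.

```python
def error_budget(slo_target: float, total_events: int, bad_events: int) -> dict:
    """Summarize error budget usage for a request-based SLO.

    slo_target is a fraction such as 0.999 (i.e. 99.9%).
    All names here are hypothetical, chosen for illustration only.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0 or not 0 <= bad_events <= total_events:
        raise ValueError("event counts are inconsistent")

    # The budget is the share of events that may fail without breaching the SLO.
    allowed_bad = (1.0 - slo_target) * total_events
    # The SLI is the observed fraction of good events.
    sli = (total_events - bad_events) / total_events
    return {
        "sli": sli,
        "allowed_bad": allowed_bad,
        "budget_consumed": bad_events / allowed_bad,  # 1.0 means fully spent
        "slo_met": sli >= slo_target,
    }


def allowed_downtime_minutes(slo_target: float, window_days: int) -> float:
    """Convert an availability SLO into a downtime allowance for a window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes
```

For example, a 99.9% target over a 30-day window allows roughly 43.2 minutes of downtime, and 400 failed requests out of 1,000,000 would consume about 40% of that month's request budget.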
Module 3: Building Secure, Scalable and Reliable Systems
● Reliability Engineering and its role in Building Secure and Reliable systems
● Design for Changing Architecture
● Fault-tolerant Design
● Design for Security
● Design for Resiliency
● Design for Reliability
● Use Case Discussion
Module 4: Full-Stack Observability
● Modern Apps are Complex & Unpredictable
● Slow is the new down
● Pillars of Observability
● Using OpenTelemetry
● Use Case Discussion
Module 5: Platform Engineering and AIOps
● Taking a Platform Centric View
● AIOps: a big-data view for moving from reactive to proactive to predictive management
● Technology becomes more human through ML, allowing ubiquitous self-service
● Use Case Discussion
Module 6: Incident Response Management
● Key Responsibilities towards incident response
● DevOps & ITIL
● OODA and Reliability Incident Response
● Closed Loop Remediation and the Advantages
● Swarming – Food for Thought
● Use Case Discussion
Module 7: DiRT and Chaos Engineering
● Disaster Recovery Testing
● Fault Injection
● Chaos Engineering
● Tools that can be instrumented for Chaos Engineering
● Use Case Discussion
Module 8: Reliability is the Purest form of DevOps
● Key Principles of Reliability Engineering
● How to increase Reliability across the spectrum
● Metrics for Success
● Possible Implementation Model
● Culture and Behavioral Skills are key
● Case Study & Use Case Discussion
Pricing
2000 SGD per participant for the 3-day workshop.
1500 USD per participant for the 3-day workshop.
Note: Prices may vary depending on your country of residence. Please contact us at info@taubsolutions.com for further details.
Workshop Schedule
- Sep 9th to 11th
- Oct 14th to 16th
- Nov 4th to 6th
Are there any prerequisites for this course?
It is highly recommended that learners attend the SRE Foundation course with an accredited DevOps Institute Education Partner and earn the SRE Foundation certification before attending the Reliability Engineering Coaching workshop and exam. Knowledge of common SRE terminology, concepts, and principles, along with related work experience, is also recommended.