Site Reliability Engineering Tidbits [Leanpub PDF/iPad/Kindle]

Site Reliability Engineering is a relatively young discipline focused on treating operations as a software problem. Because it is so young, the SRE knowledge base is still growing. This book aims to be short, light, and fun, but most importantly relevant.

Each chapter describes a Site Reliability Engineering concept in a short, easily digestible way, and aims to give every software engineer information they can use to increase the reliability of the systems they work on.

Topics include: observability, monitoring, Service Level Objectives (SLOs), alerting, resilience and debugging.

These concepts have been at the core of my personal SRE journey, and my hope is that you will find them valuable too!

Copyright

About the Author

Introduction

15 months of 24x7 Primary On-Call — Here’s How I Survived

Background
Surface Actionable Metrics
Alert on Symptoms Not Causes
Ratios Rule - But Be Careful
Emulate The Customer Experience: Probes Probes Probes
Give Yourself Room to Fail - SLO Based Alerts
Conclusion

Debugging Memory Leaks Using Go

What is a Memory Leak?
Debug Process
Identification
Root Cause Analysis / Source Analysis

How Probes Partition the Debug Space

Probes
Output
Debugging Using Probes

Observability Metric Namespaces and Structures

Metric Spaces
Metric Trees
Defining a Metric in Terms of its Children
Increasingly Specific — Subsets of Data
Ratios Rule
It’s All in the Questions
Generic Metrics Enriched With Tags
Conclusion

Debugging: Getting To Impact Through SLOs

Phrasing the Impact in Terms of Client Impact
Guiding With SLOs

Deploying SLOs Across An Organization

What is an SLO?
Principles
Representative of the Client Experience
Actionable
Minimal Investment / Low Technical Overhead
Low Number of False Positives
Rollout Strategy

No Friction Application Observability Using Envoy

Problem
Envoy
Example
Conclusion

Alerting on SLOs

Terminology Refresher
Client Experience
Objective Quantities
Call to Action
Generic Tooling
Conclusion

Debugging Fundamentals: Profiling

What is Profiling?
Why Profile? - Risks of Not Profiling
How to Profile
Profile Profiles “drilling-down”
Conclusion

Performance Analysis: Tuning Methodology Using a Simple HTTP Webserver

Strategy
Simple HTTP Server Architecture
Determine Goals (Dimensions)
Set Up the Test Harness
Observe
Execute/Observe/Analyze
Profile
Analysis - Hypothesis
Tune the Application - Experiment
Execute/Observe/Analyze
2000 Requests / Second
3000 Requests / Second
Analysis - Hypothesis
Tune the Application - Experiment
Execute/Observe/Analyze
Conclusion

Distributed Tracing: Impact on Engineering Organizations

Onboarding
Development
Operations
Conclusion

Dashboard Patterns: Aggregate View

Why Views?
So What’s an Aggregate View?
Throughput
Availability
Latency
Conclusion

Dashboard Patterns: Component Views

Purpose
Feedback Loops
In Practice
Approach
Conclusion

Why Capacity Planning Needs Queueing Theory (Without the Hard Math)

Problem
Capacity Planning Organizational Systems
Conclusion

Debugging Lambda File Descriptor Exhaustion

Background
A Strange Occurrence
AWS Support
Ensuring the Rollup Script Worked
Moving Forward
Starting to Debug
Back to Basics
Verifying Hypothesis
Bounding Resource Usage
Error Free!

Debugging Heuristics: Drivers of Increased Latency

Increase in the Amount of Work Being Done
Increase in the Type of Work Being Done
Change in the Amount of Work Performed in Each Transaction
Conclusion

Knowledge Graphs: Increased Context in Human Involved Incident Response

An Example
So What is an Incident Response (IR) Knowledge Graph?
Components
IR Knowledge Graphs In Practice
The Incident
Conclusion

Bolt on Rate Limiting

Protecting Resources
What is Envoy?
Solving Rate Limiting Using Envoy
Conclusion

Debugging Strategies: Triangulation

What is Triangulation?
Example Scenario
Heuristics
Conclusion

Debugging SQL Performance Using the “EXPLAIN” Statement

Methodology
Determine the Table Schema
Determine the Table Index
EXPLAIN the Query
Leveraging the Index
Predicate Query Missing Sortkey
Results

Stay on Top of Your ETL Pipelines With Table Freshness Checks

Detecting Resource Leaks With Baseline Tests

Data Operational Maturity

Maturity Model
Level 1 - Mechanism
Level 2 - Consistency
Level 3 - Accuracy
Conclusion

Bulkheads in Action — Partitioning to Minimize Failure Impact

What are Bulkheads?
Why Use Bulkheads
How?
When to Use?

Retries in Action: Availability in Exchange for Latency

What are Retries?
Why Use Retries?
How?
When to Use?
Caveats

Probing 101

Uptime Probes
What Probes Don’t Do
Purpose of Probes
How to Start Probing
Uses
Conclusion

Using Views for Backwards Compatible Data Migrations

Common Database Clients
Leveraging Views
Example of a View Based Migration
Conclusion

