Site Reliability Engineering Tidbits

Learn SRE Principles & Techniques for Observability, Monitoring, SLOs, Resilience and Debugging.

This book is a collection of 28 chapters on SRE concepts such as observability, monitoring, Service Level Objectives (SLOs), alerting, resilience and debugging.

Minimum price: $7.99

Formats: PDF, EPUB, WEB

About the Book

Site Reliability Engineering is a relatively young discipline focused on treating operations as a software problem. Because it is so young, the SRE knowledge base is still growing. The goal is to make this book short, light and fun, but most importantly relevant.

Each chapter describes a Site Reliability Engineering concept in a short, easily digestible way, and aims to give every software engineer information they can use to increase the reliability of the systems they work on.

Topics include: observability, monitoring, Service Level Objectives (SLOs), alerting, resilience and debugging.

These concepts have been at the core of my personal SRE journey, and my hope is that you will find them valuable too!

About the Author

Danny Mican

Hello! I’m Danny.

I have over 11 years of experience working with software. I’m a top-ranked contributor on Stack Overflow in Python, Django, JavaScript, Go and unit testing. During this time I've developed, and led the development of, dozens of successful software projects.

A couple of years ago I discovered technical writing. I regularly write tech blog posts on Medium and on my personal blog. I’ve also ghost-written content for some of the largest vendors in tech.

Thank you.

Table of Contents

Copyright

About the Author

Introduction

15 months of 24x7 Primary On-Call — Here’s How I Survived

  1. Background
  2. Surface Actionable Metrics
  3. Alert on Symptoms Not Causes
  4. Ratios Rule - But Be Careful
  5. Emulate The Customer Experience: Probes Probes Probes
  6. Give Yourself Room to Fail - SLO Based Alerts
  7. Conclusion

Debugging Memory Leaks Using Go

  1. What is a Memory Leak?
  2. Debug Process
  3. Identification
  4. Root Cause Analysis / Source Analysis

How Probes Partition the Debug Space

  1. Probes
  2. Output
  3. Debugging Using Probes

Observability Metric Namespaces and Structures

  1. Metric Spaces
  2. Metric Trees
  3. Defining a Metric in Terms of its Children
  4. Increasingly Specific — Subsets of Data
  5. Ratios Rule
  6. It’s All in the Questions
  7. Generic Metrics Enriched With Tags
  8. Conclusion

Debugging: Getting To Impact Through SLOs

  1. Phrasing the Impact in Terms of Client Impact
  2. Guiding With SLOs

Deploying SLOs Across An Organization

  1. What is an SLO?
  2. Principles
  3. Representative of the Client Experience
  4. Actionable
  5. Minimal Investment / Low Technical Overhead
  6. Low Number of False Positives
  7. Rollout Strategy

No Friction Application Observability Using Envoy

  1. Problem
  2. Envoy
  3. Example
  4. Conclusion

Alerting on SLOs

  1. Terminology Refresher
  2. Client Experience
  3. Objective Quantities
  4. Call to Action
  5. Generic Tooling
  6. Conclusion

Debugging Fundamentals: Profiling

  1. What is Profiling?
  2. Why Profile? - Risks of Not Profiling
  3. How to Profile
  4. Profile Profiles “drilling-down”
  5. Conclusion

Performance Analysis: Tuning Methodology Using a Simple HTTP Webserver

  1. Strategy
  2. Simple HTTP Server Architecture
  3. Determine Goals (Dimensions)
  4. Setup the Test Harness
  5. Observe
  6. Execute/Observe/Analyze
  7. Profile
  8. Analysis - Hypothesis
  9. Tune the Application - Experiment
  10. Execute/Observe/Analyze
  11. 2000 Requests / Second
  12. 3000 Requests / second
  13. Analysis - Hypothesis
  14. Tune the application - Experiment
  15. Execute/Observe/Analyze
  16. Conclusion

Distributed Tracing: Impact on Engineering Organizations

  1. Onboarding
  2. Development
  3. Operations
  4. Conclusion

Dashboard Patterns: Aggregate View

  1. Why Views?
  2. So What’s an Aggregate View?
  3. Throughput
  4. Availability
  5. Latency
  6. Conclusion

Dashboard Patterns: Component Views

  1. Purpose
  2. Feedback Loops
  3. In Practice
  4. Approach
  5. Conclusion

Why Capacity Planning Needs Queueing Theory (Without the Hard Math)

  1. Problem
  2. Capacity Planning Organizational Systems
  3. Conclusion

Debugging Lambda File Descriptor Exhaustion

  1. Background
  2. A Strange Occurrence
  3. AWS Support
  4. Ensuring the Rollup Script Worked
  5. Moving Forward
  6. Starting to Debug
  7. Back to Basics
  8. Verifying Hypothesis
  9. Bounding Resource Usage
  10. Error Free!

Debugging Heuristics: Drivers of Increased Latency

  1. Increase in the Amount of Work Being Done
  2. Increase in the Type of Work Being Done
  3. Change in the Amount of Work Performed in Each Transaction
  4. Conclusion

Knowledge Graphs: Increased Context in Human Involved Incident Response

  1. An Example
  2. So What is an Incident Response (IR) Knowledge Graph?
  3. Components
  4. IR Knowledge Graphs In Practice
  5. The Incident
  6. Conclusion

Bolt on Rate Limiting

  1. Protecting Resources
  2. What is Envoy??
  3. Solving Rate Limiting Using Envoy
  4. Conclusion

Debugging Strategies: Triangulation

  1. What is Triangulation?
  2. Example Scenario
  3. Heuristics
  4. Conclusion

Debugging SQL Performance Using the “EXPLAIN” Statement

  1. Methodology
  2. Determine the Table Schema
  3. Determine the Table Index
  4. EXPLAIN the Query
  5. Leveraging the Index
  6. Predicate Query Missing Sortkey
  7. Results

Stay on Top of Your ETL Pipelines With Table Freshness Checks

Detecting Resource Leaks With Baseline Tests

Data Operational Maturity

  1. Maturity Model
  2. Level 1 - Mechanism
  3. Level 2 - Consistency
  4. Level 3 – Accuracy
  5. Conclusion

Bulkheads in Action — Partitioning to Minimize Failure Impact

  1. What are Bulkheads?
  2. Why Use Bulkheads
  3. How?
  4. When to Use?

Retries in Action: Availability in Exchange for Latency

  1. What are Retries?
  2. Why Use Retries?
  3. How?
  4. When to Use?
  5. Caveats

Probing 101

  1. Uptime Probes
  2. What Probes Don’t Do
  3. Purpose of Probes
  4. How to start probing
  5. Uses
  6. Conclusion

Using Views for Backwards Compatible Data Migrations

  1. Common Database Clients
  2. Leveraging Views
  3. Example of a View Based Migration
  4. Conclusion
