Copyright
About the Author
Introduction
15 months of 24x7 Primary On-Call — Here’s How I Survived
- Background
- Surface Actionable Metrics
- Alert on Symptoms Not Causes
- Ratios Rule - But Be Careful
- Emulate The Customer Experience: Probes Probes Probes
- Give Yourself Room to Fail - SLO Based Alerts
- Conclusion
Debugging Memory Leaks Using Go
- What is a Memory Leak?
- Debug Process
- Identification
- Root Cause Analysis / Source Analysis
How Probes Partition the Debug Space
- Probes
- Output
- Debugging Using Probes
Observability Metric Namespaces and Structures
- Metric Spaces
- Metric Trees
- Defining a Metric in Terms of its Children
- Increasingly Specific — Subsets of Data
- Ratios Rule
- It’s All in the Questions
- Generic Metrics Enriched With Tags
- Conclusion
Debugging: Getting To Impact Through SLOs
- Phrasing the Impact in Terms of Client Impact
- Guiding With SLOs
Deploying SLOs Across An Organization
- What is an SLO?
- Principles
- Representative of the Client Experience
- Actionable
- Minimal Investment / Low Technical Overhead
- Low Number of False Positives
- Rollout Strategy
No Friction Application Observability Using Envoy
- Problem
- Envoy
- Example
- Conclusion
Alerting on SLOs
- Terminology Refresher
- Client Experience
- Objective Quantities
- Call to Action
- Generic Tooling
- Conclusion
Debugging Fundamentals: Profiling
- What is Profiling?
- Why Profile? - Risks of Not Profiling
- How to Profile
- Profile Profiles “drilling-down”
- Conclusion
Performance Analysis: Tuning Methodology Using a Simple HTTP Webserver
- Strategy
- Simple HTTP Server Architecture
- Determine Goals (Dimensions)
- Set Up the Test Harness
- Observe
- Execute/Observe/Analyze
- Profile
- Analysis - Hypothesis
- Tune the Application - Experiment
- Execute/Observe/Analyze
- 2000 Requests / Second
- 3000 Requests / Second
- Analysis - Hypothesis
- Tune the Application - Experiment
- Execute/Observe/Analyze
- Conclusion
Distributed Tracing: Impact on Engineering Organizations
- Onboarding
- Development
- Operations
- Conclusion
Dashboard Patterns: Aggregate View
- Why Views?
- So What’s an Aggregate View?
- Throughput
- Availability
- Latency
- Conclusion
Dashboard Patterns: Component Views
- Purpose
- Feedback Loops
- In Practice
- Approach
- Conclusion
Why Capacity Planning Needs Queueing Theory (Without the Hard Math)
- Problem
- Capacity Planning Organizational Systems
- Conclusion
Debugging Lambda File Descriptor Exhaustion
- Background
- A Strange Occurrence
- AWS Support
- Ensuring the Rollup Script Worked
- Moving Forward
- Starting to Debug
- Back to Basics
- Verifying Hypothesis
- Bounding Resource Usage
- Error Free!
Debugging Heuristics: Drivers of Increased Latency
- Increase in the Amount of Work Being Done
- Increase in the Type of Work Being Done
- Change in the Amount of Work Performed in Each Transaction
- Conclusion
Knowledge Graphs: Increased Context in Human Involved Incident Response
- An Example
- So What is an Incident Response (IR) Knowledge Graph?
- Components
- IR Knowledge Graphs In Practice
- The Incident
- Conclusion
Bolt on Rate Limiting
- Protecting Resources
- What is Envoy?
- Solving Rate Limiting Using Envoy
- Conclusion
Debugging Strategies: Triangulation
- What is Triangulation?
- Example Scenario
- Heuristics
- Conclusion
Debugging SQL Performance Using the “EXPLAIN” Statement
- Methodology
- Determine the Table Schema
- Determine the Table Index
- EXPLAIN the Query
- Leveraging the Index
- Predicate Query Missing Sortkey
- Results
Stay on Top of Your ETL Pipelines With Table Freshness Checks
Detecting Resource Leaks With Baseline Tests
Data Operational Maturity
- Maturity Model
- Level 1 - Mechanism
- Level 2 - Consistency
- Level 3 - Accuracy
- Conclusion
Bulkheads in Action — Partitioning to Minimize Failure Impact
- What are Bulkheads?
- Why Use Bulkheads
- How?
- When to Use?
Retries in Action: Availability in Exchange for Latency
- What are Retries?
- Why Use Retries?
- How?
- When to Use?
- Caveats
Probing 101
- Uptime Probes
- What Probes Don’t Do
- Purpose of Probes
- How to Start Probing
- Uses
- Conclusion
Using Views for Backwards Compatible Data Migrations
- Common Database Clients
- Leveraging Views
- Example of a View Based Migration
- Conclusion