Team Guides for Software
Foreword
Introduction
- What is software operability and why should we care?
- Where can operability techniques be used?
- How to use this book
- What is covered in this book
- Why we wrote this book
- Feedback and suggestions
1. What does good operability look like?
- Key points
- 1.1 Use operability checklists to assess core operability
- 1.2 Assess operability with real people regularly
- 1.3 Provide a good User Experience for all agents and all users, external and internal
- 1.4 Treat operational aspects as product features: viable, configurable, deployable, diagnosable, reliable, well-performing, securable, observable, recoverable
- 1.5 Good operability does not necessarily mean good safety or ethics
- 1.6 Summary
2. Core practices for good software operability
- Key points
- 2.1 Logging and metrics are the first features to implement
- 2.2 Enumerate all likely failure modes of the software
- 2.3 Use well-defined, meaningful event identifiers
- 2.4 Define and track at least one KPI or SLI per service
- 2.5 Include operational hooks as first-class features
- 2.6 ‘DONE’ means working correctly in Production
- 2.7 Treat Operations as a high-skill activity
- 2.8 The software development team writes a draft Run Book
- 2.9 Avoid a separate ‘Production-ization’ or ‘Hardening’ phase
- 2.10 Avoid Production-specific tools
- 2.11 Talk about ‘operational features’, not ‘non-functional requirements’
- 2.12 Developers and Product Owners should be on-call
- 2.13 Make operational problems visible
- 2.14 Test for operability in a deployment pipeline
- 2.15 Summary
3. Use Run Book collaboration to increase operability and prevent operational issues
- Key points
- 3.1 Operational aspects are very similar across many software systems
- 3.2 Use a Run Book template as a common baseline for operational aspects
- 3.3 Use a Run Book Dialogue Sheet to facilitate discovery and avoid ‘documentation fallacy’
- 3.4 Assess operability on a regular basis: every sprint, iteration, or week
- 3.5 Summary
4. Use modern log aggregation and metrics for deep operational insights
- Key points
- 4.1 Use logging to help design and understand distributed systems
- 4.2 Collect and aggregate logs and metrics centrally using standard tools & software
- 4.3 Focus on collaboration, design decisions, and team experience
- The power of log aggregation
- 4.4 Identify 2 or 3 key application metrics and test these early on
- 4.5 Run log aggregation and metrics locally on development workstations
- 4.6 Hide sensitive information at the point of logging
- 4.7 Use Structured Logging for greater meaning
- 4.8 Use Event IDs for visibility of application behaviour
- 4.9 Collaborate on Event IDs to enhance operability
- 4.10 Test your logging and metrics
- 4.11 Trace operations across system boundaries with correlation IDs
- 4.12 Adapt your logging and metrics techniques to the technology characteristics
- 4.13 Summary
5. Use well-defined readiness checks to increase operational confidence
- 5.1 Introduction
- 5.2 Define readiness checks so we know when a service is ‘ready’
- 5.3 Use Deployment Verification Tests (DVTs) to increase confidence in infrastructure
- 5.4 Expose Endpoint Healthchecks for persistent services to detect problems early
- 5.5 Provide custom diagnostic hooks to expose additional operational information
- 5.6 Run operational checks within a deployment pipeline to gain rapid feedback
- Key points
- 5.7 Define a set of Service Readiness criteria to establish operational viability
- 5.8 Summary
6. Use information radiators and dashboards to drive effective behaviour and good psychological responses
- 6.1 Introduction
- 6.2 Invest time and effort in good dashboard design and information radiators
- 6.3 Avoid information overload - be selective about information on screen
- 6.4 Consider common psychological responses
- 6.5 Use dashboards to promote inter-team collaboration
- 6.6 Example: Using Dashboard visualisation in Formula 1
- 6.7 Be aware of typical mistakes with dashboard design
- 6.8 Summary
7. Make operability part of the software product
- Key points
- 7.1 Overview: why we need a focus on operability
- 7.2 Use rich, time-series logging and metrics to drive product decisions
- The power of metrics on dashboards
- 7.3 Go beyond the Agile “User Story” to address operability as a first-class concern
- 7.4 Use secondary User Personas to address operability aspects
- 7.5 Use a single backlog for visible features and operational features
- 7.6 Make operational aspects part of the team’s regular work
- 7.7 Address operational aspects from the very start and then throughout the delivery phase
- 7.8 Raise an alert if a team is spending less than ~30% of their time / effort / budget on operational aspects
- 7.9 Product Owners should be responsible for the operational success of the software
- 7.10 Developers, Testers, and Product Owners should be “on call” for operational problems
- 7.11 Understand the business case for operability
- 7.12 Encourage a culture of operability
- 7.13 Summary (?)
Appendix
- Adapt your logging techniques to the technology characteristics
- Understand how the complexity of modern distributed systems drives a need for a focus on operability
Terminology
References and further reading
- Introduction
- Chapter 1 - What does good operability look like?
- Chapter 2 - Core Operability Practices
- Chapter 3 - Use Run Book collaboration to increase operability and prevent operational issues
- Chapter 4 - Use modern log aggregation for deep operational insights
- Chapter 5 - Use Deployment Verification Tests and Endpoint Healthchecks for rapid feedback on environments
- Chapter 6 - Run operational checks within a deployment pipeline to gain rapid feedback and increased collaboration
- Chapter 7 - Use information radiators and dashboards to drive effective behaviour and good psychological responses
- Chapter 8 - Use operability as a differentiating aspect of your software
- Appendix
Run Book template
- Service or system overview
- System characteristics
- Required resources
- Security and access control
- System configuration
- System backup and restore
- Monitoring and alerting
- Operational tasks
- Maintenance tasks
- Failover and Recovery procedures
Index
About the authors
- Matthew Skelton
- Alex Moore
- Rob Thatcher

