Site Reliability Engineering

Hope is not a strategy

A site reliability engineer (SRE) is a software engineer who focuses on solving operational problems with software: defining service-level objectives (SLOs), minimizing the toil of manual tasks, reducing the cost of failure, and sharing ownership with developers.
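
To make the idea of a service-level objective concrete, here is a minimal Python sketch of an error-budget calculation. It assumes an availability SLO measured as the fraction of successful requests over some window. The function names `error_budget` and `budget_remaining` and the 99.9% target are illustrative assumptions, not part of any particular tool.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Return how many requests may fail while still meeting the SLO.

    slo_target is a fraction such as 0.999 (hypothetical example target).
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    A negative result means the budget is overspent.
    """
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target leaves no room for failure at all.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    # Assumed example: 99.9% target, one million requests, 250 failures.
    print(round(error_budget(0.999, 1_000_000)))
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

A team might use the remaining fraction to decide whether to keep shipping changes or to pause and spend time on reliability work; the threshold for that decision is a policy choice, not something this sketch prescribes.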

Tenets

  • Availability
  • Latency
  • Performance
  • Efficiency
  • [[change-management]]
  • [[monitoring]]
  • [[capacity-planning]]

Principles

  • Embracing risk
  • Eliminating toil
  • Monitoring distributed systems
  • Automation
  • Release engineering
  • Simplicity

References