DevOps Intermediate

What is SRE (Site Reliability Engineering)?

An engineering discipline that applies software engineering principles to infrastructure and operations to create reliable systems.

Site Reliability Engineering (SRE), pioneered at Google, treats operations as a software problem. SRE teams set Service Level Objectives (SLOs), which are reliability targets such as 99.9% of requests succeeding, and track them with Service Level Indicators (SLIs), the measurements of actual service behavior. The gap between the SLO and 100% is the error budget: the amount of unreliability a service may accrue over a given window. Teams spend that budget on feature velocity, and when it is exhausted they prioritize reliability work over new features. Key practices include automating toil (repetitive manual work), blameless postmortems, capacity planning, and progressive rollouts. In this way SRE balances development speed against operational stability.
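The relationship between an SLO, an SLI, and an error budget can be sketched as simple arithmetic. The Python below is a minimal illustration, assuming a request-based availability SLI; the class name `ErrorBudget`, its fields, and the sample numbers are hypothetical and do not come from any particular SRE tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests observed in the window
    failed_requests: int  # requests that counted against the SLI

    @property
    def sli(self) -> float:
        """Measured success ratio (the SLI) for the window."""
        if self.total_requests == 0:
            return 1.0
        return 1 - self.failed_requests / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """Failures the SLO tolerates: (1 - SLO) * total requests."""
        return (1 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent; negative once overspent."""
        allowed = self.allowed_failures
        if allowed == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1 - self.failed_requests / allowed

    @property
    def exhausted(self) -> bool:
        """True when failures have used up the whole budget."""
        return self.failed_requests >= self.allowed_failures


if __name__ == "__main__":
    # Assumed sample window: 1,000,000 requests against a 99.9% SLO.
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(f"SLI: {budget.sli:.4%}")
    print(f"Budget remaining: {budget.remaining_fraction:.1%}")
    print("Prioritize reliability work" if budget.exhausted else "Budget available for feature work")
```

With these assumed numbers the SLO tolerates about 1,000 failed requests, so 400 failures leave roughly 60% of the budget. Real deployments usually compute this over a rolling window from monitoring data rather than from fixed counts.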

Learn More About This Topic

Kubernetes for Production: Scaling & Monitoring

n8n CLI for Beginners

Webhook Automation in Practice

Related Terms

Incident Management

The process of detecting, responding to, and resolving service disruptions to minimize impact and restore normal operations.

Vault

A tool by HashiCorp for securely managing secrets, encryption keys, and certificates with dynamic secret generation.

Microservices

An architectural style where an application is composed of small, independent services that communicate over APIs.

Prometheus

An open-source monitoring and alerting toolkit that collects time-series metrics using a pull-based model.

GitHub Actions

A CI/CD platform integrated into GitHub that automates build, test, and deployment workflows using YAML configuration.

Artifact Repository

A centralized storage system for build artifacts like compiled binaries, packages, and container images used in CI/CD pipelines.