🎁 New User? Get 20% off your first purchase with code NEWUSER20 Register Now →
Menu

Categories

DevOps Intermediate

What is SRE (Site Reliability Engineering)?

An engineering discipline that applies software engineering principles to infrastructure and operations to create reliable systems.

Site Reliability Engineering, pioneered by Google, treats operations as a software problem. SRE teams define Service Level Objectives (SLOs), measure them through Service Level Indicators (SLIs), and use error budgets to balance reliability with feature velocity. When the error budget is exhausted, teams prioritize reliability work over new features. Key practices include toil automation (eliminating repetitive manual work), blameless postmortems, capacity planning, and progressive rollouts. SRE bridges the gap between development speed and operational stability.

Related Terms

Container Registry
A storage and distribution service for container images, similar to a package repository but for Docker images.
Monitoring
The practice of collecting, analyzing, and alerting on system metrics and logs to ensure application health and performance.
Immutable Deployment
A deployment strategy where new versions replace existing instances entirely rather than updating them in place.
Trunk-Based Development
A source control strategy where developers integrate small changes directly into the main branch frequently, often multiple times per day.
YAML
A human-readable data serialization language commonly used for configuration files in DevOps tools and applications.
Helm
A package manager for Kubernetes that simplifies deploying and managing applications using reusable, configurable charts.
View All DevOps Terms →