🎁 New User? Get 20% off your first purchase with code NEWUSER20 Register Now →
Menu

Categories

DevOps Intermediate

What is SRE (Site Reliability Engineering)?

An engineering discipline that applies software engineering principles to infrastructure and operations to create reliable systems.

Site Reliability Engineering, pioneered by Google, treats operations as a software problem. SRE teams define Service Level Objectives (SLOs), measure them through Service Level Indicators (SLIs), and use error budgets to balance reliability with feature velocity. When the error budget is exhausted, teams prioritize reliability work over new features. Key practices include toil automation (eliminating repetitive manual work), blameless postmortems, capacity planning, and progressive rollouts. SRE bridges the gap between development speed and operational stability.

Related Terms

Ansible
An agentless automation tool for configuration management, application deployment, and task automation using YAML playbooks.
Grafana
An open-source analytics and visualization platform for creating dashboards from various data sources.
Nginx
A high-performance web server and reverse proxy known for its stability, low resource usage, and ability to handle many concurrent connections.
Incident Management
The process of detecting, responding to, and resolving service disruptions to minimize impact and restore normal operations.
Containerization
A lightweight virtualization method that packages applications with their dependencies into isolated, portable containers.
ArgoCD
A declarative GitOps continuous delivery tool for Kubernetes that automatically syncs cluster state with Git repositories.
View All DevOps Terms →