SRE Security Reliability and Compliance
A building with great fire suppression but no locks is not a safe building. Reliability and security are not separate disciplines — they are two dimensions of the same problem: keeping systems working correctly for the right people, and only for the right people. SREs who ignore security create brittle systems. Security teams who ignore reliability create unusable defenses.
Where Reliability and Security Intersect
Most security incidents are also reliability incidents. A distributed denial-of-service (DDoS) attack is a capacity problem. A credential theft leads to unauthorized changes that cause outages. A supply chain compromise introduces code that degrades or destroys service availability. Treating security as a reliability domain leads to better outcomes for both.
| Security Incident Type | Reliability Impact |
|---|---|
| DDoS attack | Service unavailable for all users |
| Credential compromise | Unauthorized configuration changes; outage |
| Ransomware infection | Complete data and service unavailability |
| Software supply chain attack | Corrupted deployments degrade service |
| Privilege escalation | Unauthorized operations delete resources |
Least Privilege Principle
Every person, process, and service should have exactly the permissions needed to do their job — and nothing more. This limits the blast radius when a credential is compromised or a bug introduces unintended behavior.
Without Least Privilege:
------------------------
All microservices use the same database account with full read/write/delete access.
Payment service gets compromised.
Attacker uses it to delete all user records.
Blast radius: entire database ❌

With Least Privilege:
---------------------
Payment service account: read from payments table, write to transactions table.
Auth service account: read/write to users table only.
Analytics service account: read-only from all tables.
Payment service gets compromised.
Attacker can only read and write payment data.
Blast radius: payments scope only ✅
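The scoped accounts in the comparison above can be expressed as a deny-by-default permission table. The following is a minimal sketch, not a real database ACL: the account names, the `GRANTS` mapping, and the `is_allowed` helper are hypothetical and only mirror the example grants.

```python
# Hypothetical grants mirroring the example: each service account maps to
# the exact (table, action) pairs it needs, and nothing else.
GRANTS = {
    "payment-service": {("payments", "read"), ("transactions", "write")},
    "auth-service": {("users", "read"), ("users", "write")},
    "analytics-service": {
        ("payments", "read"),
        ("transactions", "read"),
        ("users", "read"),
    },
}


def is_allowed(account: str, table: str, action: str) -> bool:
    """Deny by default: allow only an explicitly granted (table, action) pair."""
    return (table, action) in GRANTS.get(account, set())


def blast_radius(account: str) -> list[tuple[str, str]]:
    """Everything an attacker could do with this account's credentials."""
    return sorted(GRANTS.get(account, set()))


if __name__ == "__main__":
    # A compromised payment service cannot touch the users table.
    print(is_allowed("payment-service", "users", "delete"))
    print(blast_radius("payment-service"))
```

Listing the grants per account makes the blast radius something you can read directly from configuration rather than discover during an incident.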
Secrets Management
Secrets are credentials that grant access to protected resources: API keys, database passwords, TLS certificates, and service account tokens. Storing secrets in code, environment variable files, or configuration repositories is a common and serious mistake. Secrets belong in dedicated secrets management systems.
The Right and Wrong Way
| Practice | Risk |
|---|---|
| Secret hardcoded in source code | Anyone with code access has the secret; exposed in version history |
| Secret in a config file in the repository | Same as above; also exposed in code reviews and backups |
| Secret in an environment variable without rotation | Persists indefinitely; hard to audit who accessed it |
| Secret in a dedicated vault with rotation and audit logs | Minimal — access controlled, audited, and secrets expire automatically |
Tools like HashiCorp Vault, AWS Secrets Manager, and Google Secret Manager store secrets centrally and enforce access controls. Depending on the tool and how it is configured, they can also rotate credentials on a schedule and record access in audit logs.
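To make the three properties of a vault concrete (access control, expiry, and an audit trail), here is a toy in-memory model. It is a sketch for illustration only: `SecretStore` and its methods are hypothetical and do not correspond to the API of any of the tools named above.

```python
import time
from dataclasses import dataclass, field


@dataclass
class SecretStore:
    """Toy in-memory stand-in for a secrets manager (not a real client)."""

    ttl_seconds: float
    _secrets: dict = field(default_factory=dict)   # name -> (value, expires_at)
    _acl: dict = field(default_factory=dict)       # name -> allowed identities
    audit_log: list = field(default_factory=list)  # (time, identity, name, outcome)

    def put(self, name, value, allowed, now=None):
        """Store a secret that expires after ttl_seconds."""
        now = time.time() if now is None else now
        self._secrets[name] = (value, now + self.ttl_seconds)
        self._acl[name] = set(allowed)

    def get(self, name, identity, now=None):
        """Return the secret if permitted and unexpired; log every attempt."""
        now = time.time() if now is None else now
        if identity not in self._acl.get(name, set()):
            self.audit_log.append((now, identity, name, "denied"))
            raise PermissionError(f"{identity} may not read {name}")
        value, expires_at = self._secrets[name]
        if now >= expires_at:
            self.audit_log.append((now, identity, name, "expired"))
            raise LookupError(f"{name} has expired and must be rotated")
        self.audit_log.append((now, identity, name, "granted"))
        return value
```

The `now` parameter exists only so the expiry behaviour can be exercised deterministically; a real service would rely on its own clock and persistent storage.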
Immutable Infrastructure
Immutable infrastructure means servers are never modified after deployment. When a change is needed, a new server image is built and deployed — the old one is discarded. This eliminates configuration drift (where servers gradually become inconsistent from manual changes) and ensures every deployment is from a known, verified starting point.
Mutable Infrastructure (risky):
Server deployed → SSH in → apply patches → change config → SSH in again
After 6 months: nobody knows the exact state of this server

Immutable Infrastructure (safe):
Server deployed from image v1.4 → needs change →
Build new image v1.5 → deploy v1.5 → terminate v1.4
Every server in production matches a known, tested image exactly ✅
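One way to check that a running server still matches its image is to compare file hashes against a manifest recorded at build time. The sketch below assumes files are supplied as in-memory bytes to stay self-contained; `fingerprint` and `detect_drift` are hypothetical helpers, not part of any particular tool.

```python
import hashlib


def fingerprint(files: dict[str, bytes]) -> dict[str, str]:
    """Map each file path to the SHA-256 hex digest of its content."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def detect_drift(
    image_manifest: dict[str, str], server_files: dict[str, bytes]
) -> dict[str, list[str]]:
    """Compare a server's files against the manifest of the image it came from."""
    actual = fingerprint(server_files)
    return {
        # Present in both, but the content differs from the image.
        "modified": sorted(
            p for p in image_manifest if p in actual and actual[p] != image_manifest[p]
        ),
        # In the image, but gone from the server.
        "missing": sorted(p for p in image_manifest if p not in actual),
        # On the server, but never part of the image.
        "unexpected": sorted(p for p in actual if p not in image_manifest),
    }
```

Under an immutable model, any non-empty result is a signal to replace the server from a fresh image rather than to repair it in place.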
Compliance and Audit Trails
Many industries — healthcare, finance, payments, government — require systems to meet regulatory standards. Common frameworks include SOC 2, ISO 27001, PCI-DSS for payment card data, and HIPAA for healthcare data.
SREs support compliance by ensuring the infrastructure produces the audit evidence these frameworks require: logs of every action, access control records, change management history, and evidence of continuous monitoring.
What Compliance Requires From SRE
- Access logs: Every login, every API call, every privileged operation is logged with timestamp and user identity.
- Change management: Every infrastructure change goes through an approved pipeline — no manual changes that bypass the audit trail.
- Encryption at rest and in transit: Data is encrypted in storage and in all network communications.
- Vulnerability management: Systems are patched regularly; vulnerabilities above a severity threshold are addressed within defined timelines.
- Backup and recovery testing: Backups exist and are tested by actually restoring them on a schedule.
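The access-log requirement in the list above can be sketched as a decorator that records who performed a privileged operation, when, and with what outcome. The names here (`audited`, `AUDIT_LOG`, `restart_service`) are hypothetical, and the in-memory list stands in for whatever append-only log store a real system would use.

```python
import functools
import json
from datetime import datetime, timezone

# Assumption: an in-memory list stands in for an append-only audit log store.
AUDIT_LOG: list[str] = []


def audited(operation: str):
    """Record timestamp, user identity, and outcome for every call."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(user: str, *args, **kwargs):
            record = {
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "user": user,
                "operation": operation,
                "outcome": "success",
            }
            try:
                return func(user, *args, **kwargs)
            except Exception as exc:
                record["outcome"] = f"error: {type(exc).__name__}"
                raise
            finally:
                # Runs on success and on failure, so no attempt goes unlogged.
                AUDIT_LOG.append(json.dumps(record, sort_keys=True))

        return wrapper

    return decorator


@audited("restart_service")
def restart_service(user: str, service: str) -> str:
    """Hypothetical privileged operation."""
    return f"{service} restarted by {user}"
```

Writing the record in a `finally` block means failed attempts are logged as well, which is usually what an auditor wants to see.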
Supply Chain Security
Modern software depends on hundreds of third-party libraries and packages. A compromised dependency can introduce malicious code into production. SREs address this through dependency scanning, software bill of materials (SBOM) tracking, and signed artifact verification.
Software Supply Chain Risk:
----------------------------
Your service
  → uses library A
    → library A uses library B
      → library B has a known vulnerability (CVE-2024-XXXX)
        → attacker exploits it to gain access
Defense:
- Automated dependency scanning in CI pipeline
- Alerts when new CVEs affect any dependency in use
- Pinned dependency versions with verified checksums
- Container image scanning before deployment
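The "pinned dependency versions with verified checksums" defense comes down to comparing a downloaded artifact's digest with a value recorded in advance. This is a minimal sketch using only the standard library; the `pinned` mapping and the `verify_artifact` helper are hypothetical, and real package managers implement this check through their own lock files.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hex digest of an artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(name: str, data: bytes, pinned: dict[str, str]) -> bool:
    """Accept an artifact only if it is pinned and its digest matches."""
    expected = pinned.get(name)
    if expected is None:
        return False  # unpinned dependency: reject rather than trust
    # compare_digest performs a constant-time comparison of the two digests.
    return hmac.compare_digest(sha256_hex(data), expected)
```

Rejecting unpinned artifacts outright keeps a newly introduced transitive dependency from slipping in without review.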
Key Points
- Most security incidents are also reliability incidents: treat them with the same urgency and discipline.
- Least privilege limits the blast radius when any credential or service is compromised.
- Secrets belong in dedicated vault systems — never in code, config files, or repositories.
- Immutable infrastructure eliminates configuration drift and ensures every deployment starts from a known state.
- Compliance frameworks require the same logging, access control, and change management that good SRE practice produces anyway.
