SRE Security, Reliability, and Compliance

A building with great fire suppression but no locks is not a safe building. Reliability and security are not separate disciplines — they are two dimensions of the same problem: keeping systems working correctly for the right people, and only for the right people. SREs who ignore security create brittle systems. Security teams who ignore reliability create unusable defenses.

Where Reliability and Security Intersect

Most security incidents are also reliability incidents. A distributed denial-of-service (DDoS) attack is a capacity problem. Credential theft can lead to unauthorized changes that cause outages. A supply chain compromise can introduce code that degrades or destroys service availability. Treating security as a reliability domain leads to better outcomes for both.

Security Incident Type        Reliability Impact
-------------------------------------------------
DDoS attack                   Service unavailable for all users
Credential compromise         Unauthorized configuration changes; outage
Ransomware infection          Complete data and service unavailability
Software supply chain attack  Corrupted deployments degrade service
Privilege escalation          Unauthorized operations delete resources

Least Privilege Principle

Every person, process, and service should have exactly the permissions needed to do their job — and nothing more. This limits the blast radius when a credential is compromised or a bug introduces unintended behavior.

Without Least Privilege:
------------------------
All microservices use the same database account with full read/write/delete access.
Payment service gets compromised.
Attacker uses it to delete all user records.
Blast radius: entire database ❌

With Least Privilege:
---------------------
Payment service account: read from payments table, write to transactions table.
Auth service account: read/write to users table only.
Analytics service account: read-only from all tables.
Payment service gets compromised.
Attacker can only read the payments table and write to the transactions table.
Blast radius: payments scope only ✅
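
The comparison above can be sketched as a deny-by-default permission check. This is a minimal illustration, not a real database permission system: the account names, the GRANTS table, and both helper functions are hypothetical and simply mirror the example accounts listed above.

```python
# Hypothetical grants mirroring the example above: each service account
# maps to the exact (table, action) pairs it needs and nothing else.
GRANTS = {
    "payment-service": {("payments", "read"), ("transactions", "write")},
    "auth-service": {("users", "read"), ("users", "write")},
    "analytics-service": {
        ("payments", "read"),
        ("transactions", "read"),
        ("users", "read"),
    },
}


def is_allowed(account: str, table: str, action: str) -> bool:
    """Deny by default: permit only what is explicitly granted."""
    return (table, action) in GRANTS.get(account, set())


def blast_radius(account: str) -> list:
    """List everything an attacker could do with this account's credentials."""
    return sorted(GRANTS.get(account, set()))


if __name__ == "__main__":
    # A compromised payment service cannot touch user records.
    print(is_allowed("payment-service", "users", "delete"))
    print(blast_radius("payment-service"))
```

In a real system the same idea is usually expressed as database roles or IAM policies rather than application code; the point of the sketch is that the blast radius is just the set of grants attached to the compromised identity.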

Secrets Management

Secrets are credentials that grant access to protected resources: API keys, database passwords, TLS certificates, and service account tokens. Storing secrets in code, environment variable files, or configuration repositories is a common and serious mistake. Secrets belong in dedicated secrets management systems.

The Right and Wrong Way

Practice                                                  Risk
----------------------------------------------------------------------------------------------------------------------------
Secret hardcoded in source code                           Anyone with code access has the secret; exposed in version history
Secret in a config file in the repository                 Same as above; also exposed in code reviews and backups
Secret in an environment variable without rotation        Persists indefinitely; hard to audit who accessed it
Secret in a dedicated vault with rotation and audit logs  Minimal: access controlled, audited, and secrets expire automatically

Tools like HashiCorp Vault, AWS Secrets Manager, and Google Secret Manager store secrets centrally, enforce access controls, rotate credentials automatically, and log every access.
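
To make those three properties concrete (access control, expiry, and an access log), here is a toy in-memory model. It is only a sketch: ToySecretStore and its methods are invented for illustration and do not correspond to the API of Vault, AWS Secrets Manager, or Google Secret Manager. The secret name, value, and caller names are also assumptions.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ToySecretStore:
    """In-memory stand-in for a secrets manager; not a real client library."""

    ttl_seconds: float
    _secrets: dict = field(default_factory=dict)   # name -> (value, expires_at)
    _acl: dict = field(default_factory=dict)       # name -> set of allowed callers
    audit_log: list = field(default_factory=list)  # (timestamp, caller, name, outcome)

    def put(self, name, value, allowed_callers, now=None):
        """Store a secret that expires ttl_seconds after it is written."""
        now = time.time() if now is None else now
        self._secrets[name] = (value, now + self.ttl_seconds)
        self._acl[name] = set(allowed_callers)

    def get(self, name, caller, now=None):
        """Return the secret if the caller is allowed and it has not expired.

        Every attempt is appended to the audit log, including failures.
        """
        now = time.time() if now is None else now
        outcome = "granted"
        if caller not in self._acl.get(name, set()):
            outcome = "denied"
        elif now >= self._secrets[name][1]:
            outcome = "expired"
        self.audit_log.append((now, caller, name, outcome))
        if outcome != "granted":
            raise PermissionError(f"{caller} -> {name}: {outcome}")
        return self._secrets[name][0]


if __name__ == "__main__":
    store = ToySecretStore(ttl_seconds=3600)
    store.put("db-password", "example-value", {"payment-service"})
    print(store.get("db-password", "payment-service"))
    print(store.audit_log)
```

The optional `now` argument exists only so the expiry behaviour can be exercised without waiting; a production secrets manager handles rotation and time on the server side.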

Immutable Infrastructure

Immutable infrastructure means servers are never modified after deployment. When a change is needed, a new server image is built and deployed — the old one is discarded. This eliminates configuration drift (where servers gradually become inconsistent from manual changes) and ensures every deployment is from a known, verified starting point.

Mutable Infrastructure (risky):
Server deployed → SSH in → apply patches → change config → SSH in again
After 6 months: nobody knows the exact state of this server

Immutable Infrastructure (safe):
Server deployed from image v1.4 → needs change
→ Build new image v1.5 → deploy v1.5 → terminate v1.4
Every server in production matches a known, tested image exactly ✅
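
One way to check the "matches a known image" property is to compare a fingerprint of the files on a running server against the fingerprint of the image it was deployed from. The sketch below assumes the relevant files are available as a mapping of path to contents; the function names and the sample config file are hypothetical, and real tooling would typically compare image digests or package manifests instead.

```python
import hashlib


def fingerprint(files: dict) -> str:
    """Hash paths and contents in a stable order to get one digest per file set."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator so path and content cannot blur together
        digest.update(hashlib.sha256(files[path]).digest())
    return digest.hexdigest()


def has_drifted(image_files: dict, server_files: dict) -> bool:
    """True if the server no longer matches the image it was built from."""
    return fingerprint(image_files) != fingerprint(server_files)


if __name__ == "__main__":
    image_v1_4 = {"/etc/app.conf": b"workers=4\n"}
    patched_by_hand = {"/etc/app.conf": b"workers=8\n"}
    print(has_drifted(image_v1_4, dict(image_v1_4)))   # identical copy
    print(has_drifted(image_v1_4, patched_by_hand))    # manual edit
```

With immutable infrastructure a detected difference is not repaired in place: the server is replaced with a fresh instance of the known image.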

Compliance and Audit Trails

Many industries — healthcare, finance, payments, government — require systems to meet regulatory standards. Common frameworks include SOC 2, ISO 27001, PCI-DSS for payment card data, and HIPAA for healthcare data.

SREs support compliance by ensuring the infrastructure produces the audit evidence these frameworks require: logs of every action, access control records, change management history, and evidence of continuous monitoring.

What Compliance Requires From SRE

  • Access logs: Every login, every API call, every privileged operation is logged with timestamp and user identity.
  • Change management: Every infrastructure change goes through an approved pipeline — no manual changes that bypass the audit trail.
  • Encryption at rest and in transit: Data is encrypted in storage and in all network communications.
  • Vulnerability management: Systems are patched regularly; vulnerabilities above a severity threshold are addressed within defined timelines.
  • Backup and recovery testing: Backups exist and are tested by actually restoring them on a schedule.
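
As an illustration of the access log requirement, the sketch below builds one structured audit record carrying a UTC timestamp and the acting user's identity. The field names and the audit_record helper are assumptions made for this example; none of the frameworks named above prescribes this exact schema.

```python
import json
from datetime import datetime, timezone


def audit_record(user, action, resource, privileged=False, when=None):
    """Return one audit log line as JSON.

    `when` must be a timezone-aware datetime; it defaults to the current UTC time.
    """
    when = when or datetime.now(timezone.utc)
    return json.dumps(
        {
            "timestamp": when.astimezone(timezone.utc).isoformat(),
            "user": user,
            "action": action,
            "resource": resource,
            "privileged": privileged,
        },
        sort_keys=True,
    )


if __name__ == "__main__":
    # Hypothetical user and resource names.
    print(audit_record("alice", "delete", "prod/db/users", privileged=True))
```

Emitting one self-describing line per event keeps the log easy to ship to a central store and easy to query when an auditor asks who performed a given privileged operation.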

Supply Chain Security

Modern software depends on hundreds of third-party libraries and packages. A compromised dependency can introduce malicious code into production. SREs address this through dependency scanning, software bill of materials (SBOM) tracking, and signed artifact verification.

Software Supply Chain Risk:
----------------------------
Your service
  → uses library A
    → library A uses library B
      → library B has a known vulnerability (CVE-2024-XXXX)
        → attacker exploits it to gain access

Defense:
  - Automated dependency scanning in CI pipeline
  - Alerts when new CVEs affect any dependency in use
  - Pinned dependency versions with verified checksums
  - Container image scanning before deployment
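
The "pinned dependency versions with verified checksums" defense can be sketched with the standard library alone. In this minimal example the pin table maps an artifact name to its expected SHA-256 digest; the artifact name, the pin table format, and the function names are hypothetical. Package managers implement the same check through their own lock files.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the artifact bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(name: str, data: bytes, pinned: dict) -> bool:
    """Accept an artifact only if it is pinned and its digest matches the pin."""
    expected = pinned.get(name)
    if expected is None:
        return False  # unpinned dependency: reject rather than trust
    # compare_digest avoids leaking where the two strings differ via timing.
    return hmac.compare_digest(sha256_hex(data), expected)


if __name__ == "__main__":
    # Hypothetical artifact; the pin would normally be recorded at review time.
    pinned = {"library-b-1.2.3.tar.gz": sha256_hex(b"reviewed release contents")}
    print(verify_artifact("library-b-1.2.3.tar.gz", b"reviewed release contents", pinned))
    print(verify_artifact("library-b-1.2.3.tar.gz", b"tampered contents", pinned))
```

A checksum pin only proves that the artifact is the same one that was reviewed. It does not prove the reviewed version is free of vulnerabilities, which is why it sits alongside dependency scanning and CVE alerts in the list above.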

Key Points

  • Security incidents are reliability incidents — treat them with the same urgency and disciplines.
  • Least privilege limits the blast radius when any credential or service is compromised.
  • Secrets belong in dedicated vault systems — never in code, config files, or repositories.
  • Immutable infrastructure eliminates configuration drift and ensures every deployment starts from a known state.
  • Compliance frameworks require the same logging, access control, and change management that good SRE practice produces anyway.
