SRE at Scale: Organizational Structures and Embedded SRE
A single SRE managing five small services has a very different job from an SRE organization supporting five hundred services across twenty product teams. As companies grow, the way SRE teams are structured (how they relate to product teams, who owns what, how work is prioritized) becomes as important as any technical practice.
Three Models for Organizing SRE Teams
Model 1: Centralized SRE Team
One SRE team serves the entire company. Product teams submit requests, the SRE team evaluates services for onboarding, and SREs work on the services they accept responsibility for. This model works well in smaller organizations and ensures consistent SRE practices across all services.
Centralized Model:
------------------
         [Central SRE Team]
          /       |       \
[Team A]      [Team B]      [Team C]
Product       Product       Product
Services      Services      Services
SREs serve all teams from one pool.
Tradeoffs
- Consistent standards and tooling across all services.
- Easier to balance SRE capacity across teams.
- SREs may not deeply understand every team's product domain.
- Can become a bottleneck as the company grows.
Model 2: Embedded SRE
SREs sit directly within product teams. An embedded SRE attends the team's standups, reviews their architecture, joins their on-call rotation, and builds tooling specific to that team's services. The SRE is a full member of the product team, not a visitor from a central pool.
Embedded Model:
---------------
[Payment Team]         [Search Team]         [Auth Team]
Developers + SRE       Developers + SRE      Developers + SRE
(owns payment SLOs)    (owns search SLOs)    (owns auth SLOs)
Tradeoffs
- Deep product knowledge; SREs understand what they are protecting.
- Faster response to reliability problems within the team's domain.
- Risk of diverging practices — each team invents its own tooling.
- Hard to rebalance SRE capacity across teams when priorities shift.
Model 3: SRE as a Service (Consulting Model)
A small SRE team defines standards, builds shared tooling, and advises product teams. Product teams run their own operations day-to-day, but consult the SRE team for complex reliability problems, architecture reviews, and incident escalations. This model scales broadly but requires mature engineering culture throughout the company.
SRE-as-a-Service Model:
-----------------------
[SRE Platform Team]
- Defines SLO standards
- Builds shared observability platform
- Reviews architectures
- Consults on complex incidents
↕ advice + tooling ↕
[Product Teams run their own operations,
own their own SLOs and on-call rotations]
The Engagement Model: Who Decides What SRE Supports
Not every service gets SRE support. SRE teams at scale must be selective. The engagement model defines the criteria a service must meet to receive SRE support, and what happens if a service fails to meet reliability standards.
Service Acceptance Criteria
Before an SRE team accepts operational responsibility for a service, they review:
- Does the service have defined SLOs?
- Does the service have runbooks for known failure modes?
- Does the service have appropriate monitoring, alerting, and dashboards?
- Is the deployment pipeline automated with rollback capability?
- Does the team have a process for addressing postmortem action items?
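The checklist above can be expressed as a simple readiness check. The following is a minimal sketch, not a prescribed implementation: the `ServiceReadiness` class, its field names, and the rule that every criterion must pass are assumptions made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ServiceReadiness:
    """One flag per acceptance criterion (hypothetical field names)."""
    has_defined_slos: bool
    has_runbooks: bool
    has_monitoring_alerting_dashboards: bool
    has_automated_deploy_with_rollback: bool
    has_postmortem_action_process: bool


def unmet_criteria(readiness: ServiceReadiness) -> list[str]:
    """Return the names of the criteria the service does not yet meet."""
    return [f.name for f in fields(readiness) if not getattr(readiness, f.name)]


def ready_for_onboarding(readiness: ServiceReadiness) -> bool:
    """Assumed policy: a service qualifies only when every criterion is met."""
    return not unmet_criteria(readiness)


# Example: a service that still lacks automated rollback.
checkout = ServiceReadiness(True, True, True, False, True)
print(unmet_criteria(checkout))
print(ready_for_onboarding(checkout))
```

Returning the list of unmet criteria, rather than only a yes/no answer, gives the development team a concrete to-do list before they reapply for SRE support.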
Handing Services Back
One of the most powerful mechanisms in the Google SRE model is the ability for an SRE team to hand a service back to the development team when it fails to maintain reliability standards. If a service consistently burns error budget without the development team investing in fixes, the SRE team can withdraw operational support temporarily until the reliability issues are addressed. This creates real accountability.
Hand-Back Trigger:
------------------
Service burns 100% of monthly error budget three months in a row
AND development team has not addressed reliability action items.
↓
SRE team withdraws on-call coverage.
Development team runs their own on-call.
↓
Development team now feels the pain of their own reliability gaps.
Reliability work gets prioritized ✅
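The trigger condition can be sketched as a small function. This is an illustration under stated assumptions: `MonthlyReport` and `should_hand_back` are hypothetical names, budget consumption is stored as a fraction (1.0 means 100%), and "has not addressed reliability action items" is read as "no action items were addressed in any of the months considered".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonthlyReport:
    budget_consumed: float        # fraction of the month's error budget used; 1.0 == 100%
    action_items_addressed: bool  # did the dev team address reliability action items?


def should_hand_back(history: list[MonthlyReport], months: int = 3) -> bool:
    """Evaluate the hand-back trigger over the most recent `months` reports.

    `history` is ordered oldest to newest, one entry per month.
    """
    if len(history) < months:
        return False  # not enough data to show a sustained pattern
    recent = history[-months:]
    budget_exhausted = all(m.budget_consumed >= 1.0 for m in recent)
    # Assumption: any addressed action item in the window counts as investment.
    items_neglected = not any(m.action_items_addressed for m in recent)
    return budget_exhausted and items_neglected


# Example: three consecutive months over budget with no follow-up work.
history = [MonthlyReport(1.2, False), MonthlyReport(1.0, False), MonthlyReport(1.5, False)]
print(should_hand_back(history))
```

In practice the decision is usually made by people reviewing this kind of data, not by an automated rule; the function only makes the two conditions and their AND relationship explicit.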
Scaling SRE: The Platform Approach
At large scale, SRE teams build shared platforms that multiply their impact. Instead of each team manually configuring monitoring, every team uses a common observability platform. Instead of each team writing their own deployment pipelines, every team uses a shared CI/CD platform. The SRE team maintains and improves these platforms; all product teams benefit.
Platform Leverage:
------------------
Without Platform:
50 teams × (build own tooling) = 50 teams doing duplicate work
With Platform:
1 SRE Platform Team builds observability platform
50 teams use it out of the box
Each team gets consistent dashboards, alerts, and SLO tracking
Platform team's effort multiplies across all teams
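One way to picture the platform approach is a single function, owned by the platform team, that derives the same alert set for every service from its SLO target. The sketch below is purely illustrative: `AlertRule`, `standard_alerts`, the rule names, and the threshold multipliers are all hypothetical placeholders, not values from any real platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    service: str
    name: str
    threshold: float  # error rate (fraction of requests) that fires the alert


def standard_alerts(service: str, slo_target: float) -> list[AlertRule]:
    """Derive an identical alert set for any service from its SLO target."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    error_budget = 1.0 - slo_target  # allowed failure fraction
    return [
        # Multipliers are placeholder values chosen for illustration only.
        AlertRule(service, "error_rate_page", error_budget * 10),
        AlertRule(service, "error_rate_ticket", error_budget * 2),
    ]


# Every team calls the same function instead of hand-writing alert rules.
for team_service in ("payments", "search", "auth"):
    for rule in standard_alerts(team_service, slo_target=0.999):
        print(rule)
```

Because the logic lives in one place, an improvement made by the platform team (for example, a better threshold formula) reaches every service without each product team changing anything.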
On-Call Coverage at Scale
Large organizations run complex on-call structures across time zones. With a single on-call pool in one location, someone must be reachable at all hours, which means overnight pages. A "follow-the-sun" model instead assigns on-call coverage to teams based in regions where it is currently business hours.
Follow-the-Sun On-Call:
-----------------------
00:00 - 08:00 UTC: APAC team on-call (business hours in Asia-Pacific)
08:00 - 16:00 UTC: EMEA team on-call (business hours in Europe/Middle East)
16:00 - 00:00 UTC: AMER team on-call (business hours in Americas)
Each shift ends with a handoff call covering active incidents and known risks.
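The shift schedule above maps directly to a lookup on the current UTC hour. This is a minimal sketch: the `SHIFTS` table mirrors the schedule shown, while the function name `on_call_region` and the requirement for timezone-aware input are assumptions made for the example.

```python
from datetime import datetime, timezone

# (start hour inclusive, end hour exclusive, region), all in UTC.
SHIFTS = [
    (0, 8, "APAC"),
    (8, 16, "EMEA"),
    (16, 24, "AMER"),
]


def on_call_region(now: datetime) -> str:
    """Return the region that holds the pager at the given moment."""
    if now.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    hour = now.astimezone(timezone.utc).hour
    for start, end, region in SHIFTS:
        if start <= hour < end:
            return region
    raise AssertionError("shift table does not cover all 24 hours")


# Example: who is on call right now?
print(on_call_region(datetime.now(timezone.utc)))
```

Converting to UTC before the lookup means callers can pass a local time from any region and still get the correct answer. The handoff call at each shift boundary is a human process and is not modeled here.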
Key Points
- SRE organizations use centralized, embedded, or consulting models — each with distinct tradeoffs.
- Engagement criteria ensure SRE support goes to services ready to meet reliability standards.
- Handing services back creates accountability and aligns incentives between SRE and product teams.
- Shared platforms multiply SRE team impact across many product teams without linear headcount growth.
- Follow-the-sun on-call models distribute operational burden without burning out engineers with overnight shifts.
