Default: Tue-Thu 09:00–15:00 local. Freeze on major events. Emergency changes require IC + approver.
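The window rule above can be sketched as a small check. This is a minimal illustration, not the actual policy engine: the function name `change_allowed`, its flags, the exclusive 15:00 end, and the choice to let emergency changes bypass both the window and a freeze are all assumptions.

```python
from datetime import datetime, time

# Assumed encoding of the default window: Tue-Thu, 09:00-15:00 local time.
ALLOWED_WEEKDAYS = {1, 2, 3}  # Monday is 0, so Tue=1, Wed=2, Thu=3
WINDOW_START = time(9, 0)
WINDOW_END = time(15, 0)      # treated as exclusive (assumption)


def change_allowed(now_local, freeze=False, emergency=False,
                   has_ic=False, has_approver=False):
    """Return True if a change may start at `now_local` (naive local time)."""
    if emergency:
        # Emergency changes skip the window but need both IC and approver.
        return has_ic and has_approver
    if freeze:
        return False
    in_days = now_local.weekday() in ALLOWED_WEEKDAYS
    in_hours = WINDOW_START <= now_local.time() < WINDOW_END
    return in_days and in_hours
```

For example, `change_allowed(datetime(2024, 1, 3, 10, 0))` falls on a Wednesday morning and passes, while the same time on a Monday does not.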
Approvals
CAB for high-risk; auto-approve for low-risk with policy attestation (tests + rollout plan attached).
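As a rough sketch of that routing, assuming risk arrives as a plain string label and the attestation is two booleans. The function name, the labels, and the `manual-review` fallback for anything not covered by the policy text are hypothetical.

```python
def approval_route(risk, tests_attached, rollout_plan_attached):
    """Pick an approval path for a change request."""
    if risk == "high":
        return "cab"
    if risk == "low" and tests_attached and rollout_plan_attached:
        # Policy attestation is complete: tests and rollout plan are attached.
        return "auto-approve"
    # Assumption: anything else (missing attestation, unknown risk label)
    # goes to a human reviewer rather than being approved automatically.
    return "manual-review"
```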
Automatic rollback
Triggered on error-budget burn, or when the 5xx rate or latency stays above SLO for 10 minutes.
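One way to express the sustained-breach part of this trigger is a sliding window over per-minute samples. The sketch below covers only the 5xx and latency conditions, not error-budget burn; the class name, the one-sample-per-minute cadence, and the threshold values passed in are assumptions.

```python
from collections import deque


class RollbackTrigger:
    """Fire when every sample in the last `sustain_minutes` breached a limit."""

    def __init__(self, max_error_rate, max_p95_ms, sustain_minutes=10):
        self.max_error_rate = max_error_rate
        self.max_p95_ms = max_p95_ms
        # Holds one boolean per minute; old entries fall off automatically.
        self.window = deque(maxlen=sustain_minutes)

    def observe(self, error_rate, p95_ms):
        """Record one per-minute sample; return True if rollback should fire."""
        breached = (error_rate > self.max_error_rate
                    or p95_ms > self.max_p95_ms)
        self.window.append(breached)
        window_full = len(self.window) == self.window.maxlen
        return window_full and all(self.window)
```

A single healthy minute resets the condition, which avoids rolling back on brief spikes at the cost of reacting no sooner than the full window.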
Observability
instrumentation • sampling • alerts
Instrumentation
OTLP export (traces + metrics + logs)
Resource attrs: service/environment/region
Exemplars on key metrics
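The resource attributes can be passed to most OpenTelemetry SDKs through the `OTEL_RESOURCE_ATTRIBUTES` environment variable as comma-separated `key=value` pairs. The helper below is a sketch that builds that string. The attribute keys are assumed from the OpenTelemetry semantic conventions (older versions use `deployment.environment`), and the helper name is hypothetical.

```python
def otel_resource_attributes(service, environment, region):
    """Build a value for the OTEL_RESOURCE_ATTRIBUTES environment variable."""
    attrs = {
        "service.name": service,
        "deployment.environment.name": environment,  # assumed key
        "cloud.region": region,
    }
    # Values containing commas or equals signs would need percent-encoding;
    # this sketch assumes simple identifiers.
    return ",".join(f"{key}={value}" for key, value in attrs.items())
```

Calling `otel_resource_attributes("api", "prod", "eu-west")` gives a string suitable for exporting in the service's environment before startup.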
Sampling
Tail-based, target 2–5% steady-state; raise to 20% during incidents on affected services.
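In a tail-based setup the keep/drop decision is made after a trace completes, typically in a collector. The sketch below covers only two pieces of that: picking the rate per service and making a deterministic decision from the trace ID. The constants use the upper end of the steady-state target, and all names are assumptions.

```python
import hashlib

STEADY_RATE = 0.05     # upper end of the 2-5% steady-state target
INCIDENT_RATE = 0.20   # raised rate for services affected by an incident


def sample_rate(service, incident_services):
    """Return the sampling rate for a service given the affected set."""
    return INCIDENT_RATE if service in incident_services else STEADY_RATE


def keep_trace(trace_id, rate):
    """Deterministically keep roughly `rate` of traces, keyed on trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Hashing the trace ID means every collector instance reaches the same decision for the same trace without coordination.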
Alerts
Multi-window burn-rate alerts (5m/1h; 30m/6h), deduplication in the paging tool, and a runbook link on every alert.
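A multi-window burn alert pages only when both the short and the long window of a pair exceed the same burn-rate threshold. The sketch below assumes a 99.95% SLO and borrows the 14.4 and 6 thresholds commonly paired with these windows in the Google SRE Workbook for a 30-day SLO; the page does not state its own thresholds, so treat them as placeholders.

```python
def burn_rate(error_ratio, slo=0.9995):
    """How many times faster than budgeted the error budget is burning."""
    return error_ratio / (1.0 - slo)


# (short window, long window, burn-rate threshold); thresholds are assumed.
POLICIES = [("5m", "1h", 14.4), ("30m", "6h", 6.0)]


def should_page(error_ratios, slo=0.9995):
    """`error_ratios` maps a window label to the observed error ratio."""
    for short, long_, threshold in POLICIES:
        short_hot = burn_rate(error_ratios[short], slo) >= threshold
        long_hot = burn_rate(error_ratios[long_], slo) >= threshold
        if short_hot and long_hot:
            return True
    return False
```

Requiring the short window as well makes the alert clear quickly once the problem stops, while the long window filters out brief blips.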
SLIs & SLOs
Availability
99.95% per month per region for user-visible endpoints.
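To make the target concrete, the allowed downtime per region follows directly from the SLO. The sketch assumes a 30-day month; the function name is illustrative.

```python
def allowed_downtime_minutes(slo, days=30):
    """Minutes of full unavailability permitted by an availability SLO."""
    return (1.0 - slo) * days * 24 * 60


budget = allowed_downtime_minutes(0.9995)  # about 21.6 minutes per 30 days
```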
Latency
P95 < 250ms (read), < 500ms (write) at steady load.
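A check against these limits might look like the following. It is a sketch that uses the nearest-rank percentile definition; production systems usually compute P95 from histograms instead, and the names here are assumptions.

```python
import math

LIMITS_MS = {"read": 250, "write": 500}


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_latency_slo(kind, samples_ms):
    """True if P95 of the samples is strictly below the limit for `kind`."""
    return percentile(samples_ms, 95) < LIMITS_MS[kind]
```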
Error budget
Burn policies gate releases; budgets reset monthly with carry-over caps.
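The monthly reset with a carry-over cap, and a simple release gate on top of it, can be sketched as follows. The page does not spell out the gate itself, so gating on remaining budget, the `min_remaining_fraction` parameter, and all function names are assumptions.

```python
def next_month_budget(base_minutes, unused_minutes, carry_over_cap):
    """Budget for the new month: base plus capped carry-over of unused budget."""
    carried = min(max(unused_minutes, 0.0), carry_over_cap)
    return base_minutes + carried


def release_allowed(budget_minutes, consumed_minutes,
                    min_remaining_fraction=0.0):
    """Allow a release only while enough error budget remains."""
    remaining = budget_minutes - consumed_minutes
    return remaining > budget_minutes * min_remaining_fraction
```

With a 21.6-minute base budget, 10 unused minutes, and a 5-minute cap, only 5 minutes carry over into the next month.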
Ops Console
# Declare an incident (IC + channel)
ops incident create --sev=2 --title "Elevated 5xx in EU-West" --assign @oncall
# Brownout non-critical features via flags
ops feature disable feed.recommendations --region eu-west
# Roll back a service (here: api) to its previous release
ops release rollback --service api --region eu-west --to previous
# Capture forensic bundle on a hot node
ops capture bundle --node ip-10-0-12-34 --pcap --pprof --heap --out /tmp/bundle.tgz
Runbooks
OPS-101 — SEV triage & escalation
OPS-201 — Planned change checklist
OPS-301 — Region failover drill
OPS-412 — Brownout switches
OPS-502 — Forensic capture & handoff
Engage Operations
Need a drill, change review, or incident dry-run? We can set up a pilot or harden an existing workflow.