Operations — BLACKMESA.SYSTEMS

Operations

SRE

Incident Engineering

Runbooks, on-call rotations, and post-incident review. Objectives: MTTR ↓, change failure rate ↓, uptime ↑.

SEV ladder & paging matrices (primary/secondary/manager)
Brownout switches, traffic shed, feature flags
Post-incident review with owners, actions, and deadlines

Change

Change Management

Blue/green & canary rollouts, progressive delivery, and blast-radius containment with automatic rollback.

Change windows & freeze periods, CAB approvals
Auto-rollback tied to SLO burn & error budgets
Release notes + diff links from CI

Observability

Telemetry & Tracing

Unified logs/metrics/traces, SLOs with burn-rate alerts, and forensic capture for post-mortems.

OpenTelemetry pipelines, tail-based sampling
Multi-window burn alerts (5m/1h & 30m/6h)
On-trigger pcap/pprof/heap dumps (quarantined)

Security & Compliance

Attestation, artifact signing, least-privilege access, and audit-ready pipelines.

Sigstore/Cosign signing + provenance (SLSA)
RBAC/ABAC policies, just-in-time elevation
Evidence collection for audits (immutable)

Capacity Planning

Utilization models and cost guardrails with autoscaling strategies for peak events.

P50/P95/P99 demand forecasting
Right-sizing & bin-packing optimizations
Emergency headroom + pre-warm pools

Edge Operations

Remote rebuilds, OTA updates, and offline-first modes for constrained environments.

Delta updates with signed bundles
Store-and-forward telemetry, conflict merge
Remote console with dead-man’s switch

Playbooks

Incident lifecycle: detect • triage • mitigate • recover • review

Detect & Triage

Pager fires on burn-rate or error spike
Declare SEV, name incident, open comms room
Roles: IC, Ops, Comms, Scribe

Mitigate

Brownout non-critical features
Shed traffic or roll back the last change
Enable elevated logging / captures

Recover & Review

Confirm SLO recovery, close comms
Post-incident review (PIR) within 72h: timeline, cause, actions
Action owners & due dates tracked

Change policy: windows • approvals • rollback

Windows

Default: Tue–Thu 09:00–15:00 local. Freeze on major events. Emergency changes require sign-off from the IC and an approver.
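The window rule can be sketched as a small check. This is an illustration only: the function name, the exclusive 15:00 cutoff, and the choice to let an approved emergency change bypass both the window and a freeze are assumptions, not stated policy.

```python
from datetime import datetime, time

# Assumed encoding of the default window: Tue-Thu, 09:00-15:00 local time.
WINDOW_DAYS = {1, 2, 3}          # datetime.weekday(): Monday is 0
WINDOW_START = time(9, 0)
WINDOW_END = time(15, 0)         # assumption: end of window is exclusive


def change_allowed(now, freeze=False, emergency_approved=False):
    """Return True if a change may start at local time `now`.

    `emergency_approved` stands for "IC and approver have both signed off";
    treating it as an override of window and freeze is an assumption.
    """
    if emergency_approved:
        return True
    if freeze:
        return False
    in_day = now.weekday() in WINDOW_DAYS
    in_hours = WINDOW_START <= now.time() < WINDOW_END
    return in_day and in_hours
```

For example, change_allowed(datetime(2024, 1, 3, 10, 0)) is True (a Wednesday morning), while the same call for Friday 2024-01-05 is False unless emergency_approved is set.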

Approvals

CAB for high-risk; auto-approve for low-risk with policy attestation (tests + rollout plan attached).

Automatic rollback

Triggered on error budget burn, or when the 5xx rate or latency stays above its SLO threshold for 10 minutes.
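A minimal sketch of the "sustained for 10 minutes" part of that trigger, assuming one health sample per minute. The Sample type, the 1% error-rate threshold, and the 250 ms latency threshold are placeholders chosen for illustration; real thresholds would come from each service's SLO.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    minute: int        # minute index of the sample
    error_rate: float  # fraction of requests answered with 5xx
    p95_ms: float      # observed P95 latency in milliseconds


def should_roll_back(samples, max_error_rate=0.01, max_p95_ms=250.0,
                     sustain_minutes=10):
    """True if each of the last `sustain_minutes` samples breaches a threshold.

    Assumes `samples` is ordered oldest to newest, one entry per minute.
    """
    if len(samples) < sustain_minutes:
        return False
    recent = samples[-sustain_minutes:]
    return all(
        s.error_rate > max_error_rate or s.p95_ms > max_p95_ms
        for s in recent
    )
```

Requiring every sample in the window to breach means a single healthy minute resets the trigger, which trades slower rollback for fewer false positives.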

Observability instrumentation • sampling • alerts

Instrumentation

OTLP export (traces + metrics + logs)
Resource attrs: service/environment/region
Exemplars on key metrics

Sampling

Tail-based, target 2–5% steady-state; raise to 20% during incidents on affected services.
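As a rough sketch of that policy, the decision below runs on a completed trace, which is what makes it tail-based. The CompletedTrace type and the rule that error traces are always kept are assumptions; the 5% and 20% rates apply to the remaining healthy traces, so the total kept share can exceed the target when errors are frequent.

```python
import hashlib
from dataclasses import dataclass

STEADY_RATE = 0.05    # upper end of the 2-5% steady-state target
INCIDENT_RATE = 0.20  # raised rate for services affected by an incident


@dataclass
class CompletedTrace:
    trace_id: str
    service: str
    had_error: bool


def stable_fraction(trace_id):
    """Map a trace ID to a repeatable value between 0 and 1."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def keep_trace(trace, incident_services=frozenset()):
    """Decide, after the trace has finished, whether to retain it."""
    if trace.had_error:  # assumption: error traces are always kept
        return True
    rate = INCIDENT_RATE if trace.service in incident_services else STEADY_RATE
    return stable_fraction(trace.trace_id) < rate
```

Hashing the trace ID rather than drawing a random number keeps the decision identical on every collector that sees the same trace, and any trace kept at the steady rate is also kept at the incident rate.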

Alerts

Multi-window burn alerts (5m/1h; 30m/6h), dedup in paging tool, runbook link on every alert.
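To illustrate how the paired windows work: an alert fires only when both the short and the long window burn faster than a threshold, so brief spikes and stale incidents are both filtered out. The thresholds below (14.4 and 6) are values commonly cited for a 30-day SLO and are an assumption here, as is the 99.95% target borrowed from the availability SLO.

```python
ERROR_BUDGET = 0.0005  # 1 - 0.9995, the allowed error ratio (assumed SLO)

# (short window, long window, burn-rate threshold); thresholds are assumptions.
WINDOWS = [("5m", "1h", 14.4), ("30m", "6h", 6.0)]


def burn_rate(error_ratio, budget=ERROR_BUDGET):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / budget


def firing_alerts(error_ratios):
    """Return the window pairs whose short AND long windows both exceed the threshold.

    `error_ratios` maps a window name such as "5m" to its observed error ratio.
    """
    fired = []
    for short, long_, threshold in WINDOWS:
        if (burn_rate(error_ratios[short]) >= threshold
                and burn_rate(error_ratios[long_]) >= threshold):
            fired.append(f"{short}/{long_}")
    return fired
```

With a 1% error ratio in the 5m, 1h, and 30m windows but only 0.1% over 6h, the burn rates are 20, 20, 20, and 2, so only the 5m/1h pair fires.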

SLIs & SLOs

Availability

99.95% per month per region for user-visible endpoints.
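As a quick worked example of what that target permits, the helper below converts an availability SLO into allowed downtime. The function name is illustrative, and the calculation assumes downtime is measured as wall-clock minutes of full unavailability.

```python
def allowed_downtime_minutes(slo, days):
    """Minutes of full unavailability permitted by `slo` over `days` days."""
    return (1 - slo) * days * 24 * 60
```

For a 30-day month at 99.95% this comes to about 21.6 minutes per region, and about 22.3 minutes for a 31-day month.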

Latency

P95 < 250ms (read), < 500ms (write) at steady load.
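One way to check these targets against raw samples is a nearest-rank percentile, sketched below. The function names are hypothetical, and production systems usually derive percentiles from histograms rather than sorting raw latencies.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of `values` for an integer `pct` in 1..100."""
    if not values:
        raise ValueError("percentile of empty sequence")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_latency_slo(read_ms, write_ms, read_limit=250.0, write_limit=500.0):
    """True if P95 read and write latencies are both under their limits."""
    return (percentile(read_ms, 95) < read_limit
            and percentile(write_ms, 95) < write_limit)
```

Because P95 ignores the slowest 5% of requests, a handful of very slow outliers does not fail the check on its own.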

Error budget

Burn policies gate releases; budgets reset monthly with carry-over caps.
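A minimal sketch of a monthly reset with a carry-over cap and a release gate. The 25% cap, the function names, and the rule that releases stop once the full budget is consumed are assumptions made for illustration; the actual policy values are not given here.

```python
def next_month_budget(base_minutes, unused_minutes, carry_cap_fraction=0.25):
    """Budget for the new month: the base plus capped carry-over.

    Carry-over is limited to `carry_cap_fraction` of the base budget
    (hypothetical value), and overspend is not carried as debt.
    """
    carry = min(max(unused_minutes, 0.0), base_minutes * carry_cap_fraction)
    return base_minutes + carry


def releases_allowed(consumed_minutes, budget_minutes):
    """Gate: allow releases only while some budget remains (assumed rule)."""
    return consumed_minutes < budget_minutes
```

With a 20-minute base budget and 10 minutes unused, the cap limits carry-over to 5 minutes, so the next month starts with 25 minutes.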

Ops Console

# Declare an incident (IC + channel)
ops incident create --sev=2 --title "Elevated 5xx in EU-West" --assign @oncall

# Brownout non-critical features via flags
ops feature disable feed.recommendations --region eu-west

# Roll back the api service to its previous release
ops release rollback --service api --region eu-west --to previous

# Capture forensic bundle on a hot node
ops capture bundle --node ip-10-0-12-34 --pcap --pprof --heap --out /tmp/bundle.tgz

Runbooks

OPS-101 — SEV triage & escalation
OPS-201 — Planned change checklist
OPS-301 — Region failover drill
OPS-412 — Brownout switches
OPS-502 — Forensic capture & handoff

Engage Operations

Need a drill, change review, or incident dry-run? We can set up a pilot or harden an existing workflow.
