BLACKMESA.SYSTEMS
RESEARCH • OPERATIONS • SAFETY • λ
Operations
SRE
Incident Engineering

Runbooks, on-call rotations, and post-incident review. Objectives: MTTR ↓, change failure rate ↓, uptime ↑.

  • SEV ladder & paging matrices (primary/secondary/manager)
  • Brownout switches, traffic shed, feature flags
  • Post-incident review with owners, actions, and deadlines
Change
Change Management

Blue/green & canary rollouts, progressive delivery, and blast-radius containment with automatic rollback.

  • Change windows & freeze periods, CAB approvals
  • Auto-rollback tied to SLO burn & error budgets
  • Release notes + diff links from CI
Observability
Telemetry & Tracing

Unified logs/metrics/traces, SLOs with burn-rate alerts, and forensic capture for post-mortems.

  • OpenTelemetry pipelines, tail-based sampling
  • Multi-window burn alerts (5m/1h & 30m/6h)
  • On-trigger pcap/pprof/heap dumps (quarantined)
Security & Compliance

Attestation, artifact signing, least-privilege access, and audit-ready pipelines.

  • Sigstore/Cosign signing + provenance (SLSA)
  • RBAC/ABAC policies, just-in-time elevation
  • Evidence collection for audits (immutable)
Capacity Planning

Utilization models and cost guardrails with autoscaling strategies for peak events.

  • P50/P95/P99 demand forecasting
  • Right-sizing & bin-packing optimizations
  • Emergency headroom + pre-warm pools
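
One way to turn a P99 demand forecast plus emergency headroom into an instance count is sketched below. The function name, the per-instance throughput figure, and the headroom fraction are illustrative assumptions, not values from this page.

```python
import math


def instances_needed(p99_demand_rps, per_instance_rps, headroom_fraction):
    """Instances required to serve forecast P99 demand plus emergency headroom.

    All three inputs are hypothetical: a forecast P99 in requests/second,
    the sustained throughput of one instance, and headroom as a fraction
    (0.2 means 20% above the forecast).
    """
    target_rps = p99_demand_rps * (1.0 + headroom_fraction)
    # Round up: a fractional instance still needs a whole machine.
    return math.ceil(target_rps / per_instance_rps)


if __name__ == "__main__":
    print(instances_needed(9000, 500, 0.2))
```

A pre-warm pool could then be sized as the gap between this count and what is currently running.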
Edge Operations

Remote rebuilds, OTA updates, and offline-first modes for constrained environments.

  • Delta updates with signed bundles
  • Store-and-forward telemetry, conflict merge
  • Remote console with dead-man’s switch
Playbooks
Incident lifecycle: detect • triage • mitigate • recover • review
Detect & Triage
  • Pager fires on burn-rate or error spike
  • Declare SEV, name incident, open comms room
  • Roles: IC, Ops, Comms, Scribe
Mitigate
  • Brownout non-critical features
  • Shed traffic or roll back the last change
  • Enable elevated logging / captures
Recover & Review
  • Confirm SLO recovery, close comms
  • PIR within 72h — timeline, cause, actions
  • Action owners & due dates tracked
Change policy: windows • approvals • rollback
Windows

Default: Tue-Thu 09:00–15:00 local. Freeze on major events. Emergency changes require IC + approver.
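
A minimal sketch of the window rule as a gate a deploy pipeline could call. The function and parameter names are hypothetical. Two points are assumed here because the policy text leaves them open: the window end is treated as exclusive, and an emergency change with both sign-offs bypasses the window and the freeze.

```python
from datetime import date, datetime, time

# Default window from the policy text: Tuesday to Thursday, 09:00-15:00 local.
WINDOW_DAYS = {1, 2, 3}  # datetime.weekday(): Monday is 0, so 1..3 is Tue..Thu
WINDOW_START = time(9, 0)
WINDOW_END = time(15, 0)


def change_allowed(now_local, freeze_dates, ic_signed=False, approver_signed=False):
    """Return True if a change may start at now_local (a naive local datetime)."""
    # Assumption: an emergency change needs BOTH sign-offs and then bypasses
    # the window and any freeze.
    if ic_signed and approver_signed:
        return True
    if now_local.date() in freeze_dates:
        return False
    in_day = now_local.weekday() in WINDOW_DAYS
    # Assumption: the end of the window is exclusive, so 15:00 sharp is outside.
    in_hours = WINDOW_START <= now_local.time() < WINDOW_END
    return in_day and in_hours


if __name__ == "__main__":
    # Wednesday 2024-06-05 at 10:00, no freeze in effect.
    print(change_allowed(datetime(2024, 6, 5, 10, 0), set()))
```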

Approvals

CAB for high-risk; auto-approve for low-risk with policy attestation (tests + rollout plan attached).
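
The routing rule can be sketched as a small function. The `ChangeRequest` fields and the returned labels are hypothetical. Sending a low-risk change without a complete attestation to the CAB is an assumption; the policy text does not say what happens in that case.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    risk: str                    # "low" or "high" (hypothetical labels)
    tests_passed: bool
    rollout_plan_attached: bool


def route_approval(change):
    """Return "auto-approve" or "cab-review" for a change request."""
    # Policy attestation, per the text: tests plus an attached rollout plan.
    attested = change.tests_passed and change.rollout_plan_attached
    if change.risk == "low" and attested:
        return "auto-approve"
    # High-risk changes go to the CAB. Assumption: so does anything low-risk
    # that is missing part of its attestation.
    return "cab-review"


if __name__ == "__main__":
    print(route_approval(ChangeRequest("low", True, True)))
```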

Automatic rollback

Triggered on error-budget burn, or when the 5xx rate or latency stays above its SLO threshold for 10 minutes.
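
A sketch of the sustained-breach part of this trigger, assuming one health sample per minute. The `Sample` type and the threshold parameters are hypothetical, and the error-budget-burn condition is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    error_rate_5xx: float   # fraction of requests answered with a 5xx
    p95_latency_ms: float


def should_roll_back(samples, max_5xx, max_p95_ms, sustain_minutes=10):
    """True if every one of the last `sustain_minutes` samples breaches a limit.

    Assumes `samples` is ordered oldest to newest with one entry per minute.
    """
    if len(samples) < sustain_minutes:
        return False  # not enough history to call the breach sustained
    recent = samples[-sustain_minutes:]
    return all(
        s.error_rate_5xx > max_5xx or s.p95_latency_ms > max_p95_ms
        for s in recent
    )


if __name__ == "__main__":
    breaching = [Sample(0.05, 900.0)] * 10
    print(should_roll_back(breaching, max_5xx=0.01, max_p95_ms=500.0))
```

Requiring every sample in the window to breach means a single healthy minute resets the trigger. A real policy may prefer a ratio, such as 8 of 10 minutes.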

Observability instrumentation: sampling • alerts
Instrumentation
  • OTLP export (traces + metrics + logs)
  • Resource attrs: service/environment/region
  • Exemplars on key metrics
Sampling

Tail-based, target 2–5% steady-state; raise to 20% during incidents on affected services.
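
A sketch of the keep/drop decision made once a trace has completed. It is written as a plain function, not as configuration for any particular collector. The names are hypothetical, always keeping error traces is an assumption, and the steady-state rate is pinned to the top of the 2–5% range.

```python
import hashlib

STEADY_RATE = 0.05     # top of the 2-5% steady-state target
INCIDENT_RATE = 0.20   # raised rate for services under an active incident


def keep_trace(trace_id, service, has_error, incident_services):
    """Decide, after the trace is complete, whether to retain it."""
    if has_error:
        # Assumption: tail-based sampling retains every trace with an error.
        return True
    rate = INCIDENT_RATE if service in incident_services else STEADY_RATE
    # Hash the trace ID into [0, 1) so the decision is deterministic per trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


if __name__ == "__main__":
    ids = [f"trace-{i}" for i in range(10000)]
    kept = sum(keep_trace(t, "api", False, set()) for t in ids)
    print(f"kept {kept} of {len(ids)} healthy traces at steady state")
```

Because the bucket depends only on the trace ID, every trace kept at the steady rate is also kept at the incident rate, so raising the rate only adds data.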

Alerts

Multi-window burn alerts (5m/1h; 30m/6h), dedup in paging tool, runbook link on every alert.
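
The multi-window check can be sketched as follows. The window pairs come from the text. The thresholds of 14.4 and 6 are assumptions borrowed from common practice for a 30-day SLO, not values stated on this page, and the function names are illustrative.

```python
# (short window, long window, burn-rate threshold)
# Window pairs are from the text; the thresholds are assumed values.
WINDOW_PAIRS = [
    ("5m", "1h", 14.4),
    ("30m", "6h", 6.0),
]


def burn_rate(error_ratio, slo_target):
    """How many times faster than "exactly on budget" errors are accruing."""
    return error_ratio / (1.0 - slo_target)


def firing_alerts(error_ratios, slo_target):
    """Return the window pairs where BOTH windows exceed the threshold.

    error_ratios maps a window label such as "5m" to the observed fraction
    of failed requests over that window.
    """
    firing = []
    for short, long_, threshold in WINDOW_PAIRS:
        short_hot = burn_rate(error_ratios[short], slo_target) >= threshold
        long_hot = burn_rate(error_ratios[long_], slo_target) >= threshold
        if short_hot and long_hot:
            firing.append((short, long_))
    return firing


if __name__ == "__main__":
    observed = {"5m": 0.01, "1h": 0.01, "30m": 0.01, "6h": 0.001}
    print(firing_alerts(observed, slo_target=0.9995))
```

The long window shows the burn is significant; the short window shows it is still happening, so the alert clears soon after mitigation.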

SLIs & SLOs
Availability

99.95% per month per region for user-visible endpoints.
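
As a worked figure, the availability target converts to a downtime allowance as sketched below. The 30-day month is an assumption; the function name is illustrative.

```python
def allowed_downtime_minutes(slo_target, days_in_month=30):
    """Minutes of full outage per month that the availability target permits."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1.0 - slo_target)


if __name__ == "__main__":
    # 43,200 minutes in a 30-day month x 0.05% is about 21.6 minutes.
    print(round(allowed_downtime_minutes(0.9995), 2))
```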

Latency

P95 < 250ms (read), < 500ms (write) at steady load.
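
A sketch of checking these latency targets against a batch of measurements. The nearest-rank percentile method is an assumption; monitoring systems often estimate percentiles from histograms instead.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer like 95."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_latency_slo(read_ms, write_ms):
    """True if P95 read latency is under 250 ms and P95 write is under 500 ms."""
    return percentile(read_ms, 95) < 250 and percentile(write_ms, 95) < 500


if __name__ == "__main__":
    reads = [100] * 95 + [400] * 5
    writes = [300] * 100
    print(meets_latency_slo(reads, writes))
```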

Error budget

Burn policies gate releases; budgets reset monthly with carry-over caps.
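
One possible reading of the reset and gating rules is sketched below. Both functions are assumptions about how such a policy might work: unused budget carries forward up to a cap, and releases are blocked once the current budget is spent.

```python
def next_month_budget(base_minutes, unused_minutes, carry_over_cap_minutes):
    """Budget for the next month: the base allowance plus capped carry-over."""
    # An overspent month (negative unused budget) carries nothing forward.
    carried = min(max(unused_minutes, 0.0), carry_over_cap_minutes)
    return base_minutes + carried


def release_allowed(consumed_minutes, budget_minutes):
    """Assumed burn policy: block releases once the budget is exhausted."""
    return consumed_minutes < budget_minutes


if __name__ == "__main__":
    budget = next_month_budget(20, unused_minutes=10, carry_over_cap_minutes=5)
    print(budget, release_allowed(12, budget))
```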

Ops Console
# Declare an incident (IC + channel)
ops incident create --sev=2 --title "Elevated 5xx in EU-West" --assign @oncall

# Brownout non-critical features via flags
ops feature disable feed.recommendations --region eu-west

# Roll back the last change in the api service
ops release rollback --service api --region eu-west --to previous

# Capture forensic bundle on a hot node
ops capture bundle --node ip-10-0-12-34 --pcap --pprof --heap --out /tmp/bundle.tgz

Runbooks

  • OPS-101 — SEV triage & escalation
  • OPS-201 — Planned change checklist
  • OPS-301 — Region failover drill
  • OPS-412 — Brownout switches
  • OPS-502 — Forensic capture & handoff
Engage Operations

Need a drill, change review, or incident dry-run? We can set up a pilot or harden an existing workflow.