Infrastructure & Reliability

Principal Site Reliability Engineer

SREP
Full remote · Full-time · 8+ years

About the Role

The Principal Site Reliability Engineer owns the reliability posture of Mail Tactic’s entire production platform. This is a senior individual-contributor role with broad organisational influence — you’ll define how we build, operate, and scale infrastructure across the company.

You’ll work directly with engineering leadership to set the technical direction for observability, incident management, and capacity planning. You’re the person who sees the system as a whole and keeps it healthy as we grow.

What You’ll Do

  • Own the SLO/SLI framework across all production services and drive adoption with product engineering teams
  • Design and operate multi-region, high-availability infrastructure for email delivery at scale
  • Lead the incident response programme: on-call rotations, blameless postmortems, runbook culture
  • Drive infrastructure-as-code practices with Terraform across AWS and GCP
  • Define the observability stack — metrics, logs, traces — and ensure every service is instrumented correctly
  • Mentor and grow a team of SREs; set technical standards and review architecture decisions
  • Partner with security on hardening, compliance, and threat modelling

What We’re Looking For

  • 8+ years of infrastructure or SRE experience, with at least 3 in a senior or staff-level role
  • Deep expertise with Kubernetes in production (multi-cluster, multi-region)
  • Strong IaC background — Terraform required, Pulumi a bonus
  • Hands-on experience with a major cloud provider (AWS preferred, GCP experience valued)
  • Solid observability experience: Prometheus, Grafana, OpenTelemetry, distributed tracing
  • Strong scripting and automation skills in Python and/or Go
  • Proven incident response leadership — you’ve owned major incidents end-to-end
  • Network fundamentals: DNS, TLS, BGP, CDN/edge routing

Nice to Have

  • Experience with email infrastructure (SMTP, deliverability, SPF/DKIM/DMARC)
  • Familiarity with eBPF-based networking or observability tooling
  • Security certifications or formal training in threat modelling

What We Offer

  • Fully remote, async-first — work from anywhere
  • Competitive base salary + equity
  • Annual learning & conference budget
  • Home office stipend
  • Flexible working hours, no core hours required

Department

Infrastructure & Reliability

Location

Full remote

Employment

Full-time

Experience

8+ years

Skills

Kubernetes, Terraform, AWS / GCP, Observability, SLO / SLI, Python, Incident Management, Distributed Systems

Ready to apply?

Send us your application via our contact page. Include your LinkedIn profile or a link to relevant work.