Site Reliability Engineer

Remote / San Francisco | Full-time

Ensure system reliability and performance

Who We Are

nirvahatech is a leading DevOps and cloud infrastructure consulting firm. Our SRE team ensures that client systems maintain 99.99% uptime while continuously improving automation, monitoring, and incident response capabilities.

Tech Stack

Kubernetes, Prometheus, Grafana, ELK Stack, Datadog, PagerDuty, Python, Go, Terraform, Ansible, Jenkins, GitLab CI

What You'll Do

✓ Build and maintain highly available production systems
✓ Implement comprehensive monitoring, alerting, and observability solutions
✓ Design and execute disaster recovery and business continuity plans
✓ Automate operational tasks and eliminate toil
✓ Participate in on-call rotation and incident response
✓ Conduct post-incident reviews and implement preventive measures
✓ Define and track SLIs, SLOs, and error budgets

What You Bring

✓ 4+ years in SRE, DevOps, or infrastructure engineering
✓ Strong programming skills in Python, Go, or similar languages
✓ Deep understanding of Linux systems administration
✓ Experience with container orchestration (Kubernetes preferred)
✓ Expertise in monitoring tools such as Prometheus, Grafana, and Datadog
✓ Proven ability to troubleshoot complex distributed systems
✓ On-call experience, including incident management

Why Join Us