Site Reliability Engineer

Remote / San Francisco | Full-time

Ensure system reliability and performance

Who We Are

Nirvahatech is a leading DevOps and cloud infrastructure consulting firm. Our SRE team keeps client systems at 99.99% uptime while continuously improving automation, monitoring, and incident response capabilities.

Tech Stack

Kubernetes, Prometheus, Grafana, ELK Stack, Datadog, PagerDuty, Python, Go, Terraform, Ansible, Jenkins, GitLab CI

What You'll Do

  • Build and maintain highly available production systems
  • Implement comprehensive monitoring, alerting, and observability solutions
  • Design and execute disaster recovery and business continuity plans
  • Automate operational tasks and eliminate toil
  • Participate in the on-call rotation and incident response
  • Conduct post-incident reviews and implement preventive measures
  • Define and track SLIs, SLOs, and error budgets

What You Bring

  • 4+ years of experience in SRE, DevOps, or infrastructure engineering
  • Strong programming skills in Python, Go, or similar languages
  • Deep understanding of Linux systems administration
  • Experience with container orchestration (Kubernetes preferred)
  • Expertise in monitoring tools such as Prometheus, Grafana, and Datadog
  • Proven ability to troubleshoot complex distributed systems
  • On-call experience with incident management

Why Join Us

  • High Impact: Keep critical systems running for major clients
  • Automation First: Eliminate repetitive tasks through smart automation
  • Learning Culture: Share knowledge through runbooks and documentation
  • Work-Life Balance: Fair on-call rotation with compensation
  • Modern Stack: Work with the latest SRE tools and practices
  • Problem Solving: Complex technical challenges every day
  • Team Collaboration: Supportive team that values reliability over speed

Nirvahatech | Expert DevOps & Cloud Infrastructure Solutions