Job Title: Site Reliability Engineer (SRE)
Location: Chennai, Bengaluru, Hyderabad, Pune, Mumbai, Noida, NCR
Experience: 5–8 years
Job Type: Full-Time
Department: Infrastructure / DevOps / Engineering
Job Summary:
We are looking for a skilled Site Reliability Engineer (SRE) to ensure the reliability, availability, scalability, and performance of our production systems. As an SRE, you will bridge the gap between software development and IT operations by applying software engineering practices to infrastructure and operations.
Key Responsibilities:
- Design, build, and maintain highly available and scalable systems.
- Monitor system performance, availability, and health using tools such as Prometheus, Grafana, Datadog, or New Relic.
- Automate infrastructure provisioning and deployments using Infrastructure as Code (IaC) tools (Terraform, CloudFormation).
- Implement and maintain CI/CD pipelines to ensure safe and efficient application delivery.
- Collaborate with development and operations teams to troubleshoot production issues and improve incident response processes.
- Define and track SLAs, SLOs, and error budgets to balance reliability and development velocity.
- Build self-healing systems with automated recovery from failures.
- Ensure proper logging, alerting, and monitoring are in place across all environments.
Required Skills & Qualifications:
- Bachelor’s degree in Computer Science, Engineering, or a related field.
- 5–8 years of experience as an SRE, DevOps Engineer, or Systems Engineer.
- Proficiency in at least one programming language (Python, Go, Java, etc.) and shell scripting.
- Hands-on experience with Linux systems, cloud platforms (AWS, Azure, or GCP), containers (Docker), and container orchestration (Kubernetes).
- Deep understanding of networking, DNS, load balancing, and distributed systems.
- Experience with monitoring, logging, and alerting systems.
- Familiarity with CI/CD tools (Jenkins, GitHub Actions, GitLab CI, etc.).
Preferred Qualifications:
- Experience with a service mesh (e.g., Istio, Linkerd).
- Knowledge of chaos engineering, incident management, and postmortem practices.
- Exposure to security best practices and compliance frameworks (SOC 2, ISO, etc.).
- Cloud certification (AWS, GCP, or Azure) or Kubernetes certification (CKA or CKAD).
What We Offer:
- A key role in maintaining mission-critical infrastructure and applications.
- Opportunities to shape reliability practices and system architecture.
- A collaborative team focused on automation and innovation.
- Flexible work culture and competitive compensation.