Job Description

Architect - Site Reliability Engineer/DevOps

18-11-2025 12:53:19

Job_303282

15 - 18 years

  • Chennai, Tamil Nadu, India (CHN)
  • Pune, Maharashtra, India (PUN)

We look forward to the possibility of welcoming you to Neurealm, where human ingenuity meets technology to shape what’s next.

Position: Senior Site Reliability Engineer (SRE)
Experience: 10+ Years
Location: Pune / Chennai
Mode: Hybrid

About the Role:
We are seeking a Senior Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of large-scale production systems.
This role demands strong technical expertise, ownership, and strategic thinking, with a focus on automation, monitoring, and operational excellence.
You will design and implement SRE practices for our customers, drive improvements in availability and performance, and work closely with development teams to build resilient systems that scale.

Key Responsibilities

  • Design, implement, and maintain highly available and scalable systems on AWS and Azure.
  • Define, configure, and report on SLIs (Service Level Indicators), SLOs (Service Level Objectives), and SLAs (Service Level Agreements).
  • Conduct performance analysis, capacity planning, and load testing to identify and resolve bottlenecks.
  • Drive system and tooling improvements by implementing monitoring stacks, tracing tools, and CI/CD pipelines.
  • Develop and maintain automation frameworks and tools (Python, Go, Java) to eliminate manual tasks (toil).
  • Manage infrastructure using Infrastructure as Code (IaC) tools such as Terraform or Ansible.
  • Enhance and maintain CI/CD pipelines for reliable, secure deployments.
  • Lead incident response efforts, reducing MTTD and MTTR, and conduct blameless postmortems and RCAs.
  • Participate in on-call rotations to resolve production issues quickly.
  • Collaborate with development teams to influence system design for improved reliability, operability, and security.
  • Configure and manage monitoring and observability stacks (Prometheus, Grafana, ELK/Loki).
  • Develop error budgets and build dashboards for reliability reporting.
  • Write scripts to automate repetitive operational tasks and improve overall efficiency.

Required Skills & Experience

  • Deep knowledge of Linux/Unix administration, troubleshooting, and performance tuning.
  • Strong understanding of networking fundamentals (TCP/IP, DNS, load balancing).
  • Hands-on experience with AWS and/or Azure cloud platforms.
  • Expertise in Infrastructure as Code (IaC) tools such as Terraform, CloudFormation, or Ansible.
  • Proficiency in Docker and Kubernetes for containerization and orchestration.
  • Experience with monitoring tools (Prometheus, Grafana) and logging solutions (ELK Stack: Elasticsearch, Logstash, Kibana).
  • Familiarity with CI/CD pipelines (Azure DevOps, Jenkins, GitLab CI) and Git version control.
  • Strong scripting skills in Python, Go, or Java.
  • Excellent problem-solving, analytical, and collaboration skills.

What We Offer

  • Opportunity to work with cutting-edge infrastructure and cloud technologies.
  • Ownership of critical reliability and automation initiatives.
  • Collaborative work environment focused on learning and innovation.