Location: Anywhere in India – Remote
Shift timing: CST hours (7:30 PM to 4:30 AM IST).
Site Reliability Engineer:
- The Site Reliability Engineering team plays a critical role in ensuring the stability, performance, and reliability of internal platforms.
- We are responsible for troubleshooting and maintaining applications built on a complex, distributed, and cloud-native infrastructure, heavily leveraging technologies like Apache Spark and Apache Airflow.
- Our mission is to support cross-functional teams by ensuring their Spark jobs run smoothly and efficiently, providing essential operational support and expertise.
- We are a team of curious problem-solvers, always striving to understand the “why” behind the “what,” and digging deep into the internals of systems to truly understand how they work.
Role Summary:
- As a Site Reliability Engineer, you will be at the forefront of operational excellence, directly impacting the productivity of numerous teams.
- You will be responsible for diagnosing and resolving complex issues related to Spark applications and workflows running on our internal platform.
- This role requires a strong problem-solver with a deep, intrinsic curiosity to understand how applications function at a detailed level, and the ability to troubleshoot effectively when things aren’t working right.
- You are someone who isn’t satisfied with just knowing how to do something; you dig in to truly understand the underlying mechanisms and internals of the system.
Key Responsibilities:
- Troubleshoot and resolve complex application issues, providing detailed root cause analysis and preventative measures.
- Provide direct support to individual users and cross-functional teams, diagnosing and resolving their Spark job-related problems with a focus on understanding the core issue.
- Maintain the reliability and performance of the critical infrastructure that powers Spark and Airflow.
- Automate operational tasks and improve system efficiency through scripting, always looking for opportunities to enhance stability and reduce manual intervention.
- Collaborate with development and other SRE teams to identify root causes of issues and implement robust, long-term solutions.
- Participate in on-call rotations to ensure continuous availability and rapid response to incidents.
Required Qualifications:
- Strong understanding of, and hands-on experience troubleshooting, applications deployed on Kubernetes.
- Basic proficiency in Python for scripting and automation tasks.
- Deep and practical knowledge of networking principles, with the ability to diagnose network issues using standard command-line tools.
- Experience with containerization technologies (e.g., Docker) and their orchestration.
- Proficiency in Linux operating systems, including:
  - Advanced command-line tools for system diagnostics and troubleshooting (e.g., for inspecting network routes, open files, process information).
  - Scripting and system administration.
  - A strong desire to understand the internals of the OS.
Preferred Qualifications:
- Familiarity with Apache Spark and Apache Airflow, given their central role in day-to-day troubleshooting.
- Basic understanding of Java, which may occasionally be required for specific jobs, though deeper Java expertise is often handled by cross-functional teams.
- Tech Stack: Python scripting, Linux, networking, Kubernetes, Docker; previous SRE experience.