As a Site Reliability Engineer (SRE), you will be responsible for improving the overall reliability of applications by ensuring their availability, performance, and scalability. You should be able to gather technical requirements from the DevOps team and operational requirements from the Application Support team. Because the Site Reliability Engineer role is at the heart of solving production problems, you should be able to take a holistic approach to troubleshooting while delving deeply into technical details. You must also acquire the domain knowledge needed to troubleshoot and recover from outages effectively, monitor applications in production, and build alerts as required.
Working Hours: 5:30 AM to 1:30 PM IST (GMT+5:30)
- Work closely with the application support team.
- Monitor critical applications and services to minimize downtime and ensure their availability.
- Collaborate with DevOps teams to maintain and monitor CI/CD pipelines.
- Deploy new versions to production environments.
- Work with project teams to ensure the reliability and maintainability of new and modified releases.
- Provide input to risk management practices to anticipate reliability-related incidents that could adversely affect operations.
- Document processes and monitor application performance metrics.
- Continuously improve proactive monitoring, alert configuration, and incident response processes to increase reliability and reduce Mean Time to Recovery (MTTR).
- Optimize performance and cost efficiency through continuous monitoring, trend analysis, and fine-tuning.
- Monitor for abnormal usage that could affect cost or performance, and take corrective action.
- Proactively implement preventive measures to improve system reliability.
- Maintain runbooks, Standard Operating Procedures (SOPs), diagrams, and documentation for swift incident response.
- Conduct post-incident reviews to improve reliability and contribute to the development of resilience strategies.
- Achieve the Service Level Indicator (SLI) targets set to meet reliability objectives.
- Azure Solutions Architect Expert (Microsoft)
- AWS Certified Solutions Architect (AWS)
- Open Group Certified Enterprise Architect (TOGAF)
- PMP or PRINCE2 certification in project management