Job Description
Are you passionate about ensuring maximum system uptime and optimizing infrastructure? Do you want to make a significant impact on cloud infrastructure? Join our team as a Site Reliability Engineer and drive innovation in system reliability.
About the Role
As a Site Reliability Engineer, you’ll bridge the gap between development and operations. You’ll ensure our systems run smoothly and efficiently, and you’ll implement automation solutions that enhance system reliability.
Key Objectives
Our Site Reliability Engineer will focus on these essential goals:
- Monitor production environments and maintain system health
- Build automated solutions for infrastructure management
- Optimize system performance and improve uptime
- Implement DevOps best practices and troubleshooting procedures
- Collaborate with development teams on reliability improvements
Core Responsibilities
You’ll handle these key responsibilities:
- Analyze system metrics and performance data for optimization
- Partner with teams to implement rigorous testing procedures
- Design scalable systems using cloud technologies
- Create automation scripts using Python and Linux tools
- Monitor application performance and troubleshoot issues
- Participate in capacity planning and system design consulting
Required Skills & Qualifications
We’re looking for candidates with:
- Bachelor’s degree in Computer Science or related field
- Strong programming skills in Python and shell scripting
- Experience with cloud platforms and monitoring tools
- Knowledge of Linux system administration
- Understanding of DevOps practices and automation
- Proven troubleshooting and system engineering abilities
Application Process
Ready to become our next Site Reliability Engineer? Learn more about us and apply today. For additional insights, check out Google’s SRE Handbook to understand industry best practices.
Objectives of this role
- Run the production/UAT environment by monitoring availability and taking a holistic view of system health
- Build software and systems to manage platform infrastructure and applications
- Improve reliability, quality, and time-to-market of our suite of software solutions
- Measure and optimize system performance to push our capabilities forward, get ahead of customer needs, and innovate for continual improvement
- Provide primary operational support and engineering for multiple large-scale distributed software applications
- Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement
- Collaborate with stakeholders to set service level objectives (SLOs) and maintain service level indicators (SLIs) that are representative of our customer experience and/or committed SLAs
Responsibilities
- Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding
- Partner with development teams to improve services through rigorous testing and release procedures
- Participate in system design consulting, platform management, and capacity planning
- Create sustainable systems and services through automation and uplifts
- Balance feature development speed and reliability with well-defined service-level objectives
Required skills and qualifications
- Master’s/Bachelor’s degree (or equivalent) in computer science or related discipline
- Ability to program in one or more high-level languages, such as Python, and in shell scripting
- A proactive approach to identifying problems, performance bottlenecks, and areas for improvement