Site Reliability Engineer

12 January 2024

Job Description

Objectives of this role

  • Run the production/UAT environment by monitoring availability and taking a holistic view of system health
  • Build software and systems to manage platform infrastructure and applications
  • Improve reliability, quality, and time-to-market of our suite of software solutions
  • Measure and optimize system performance to push our capabilities forward, get ahead of customer needs, and innovate for continual improvement
  • Provide primary operational support and engineering for multiple large-scale distributed software applications
  • Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement
  • Collaborate with stakeholders to set service-level objectives (SLOs) and maintain service-level indicators (SLIs) that are representative of our customer experience and/or committed SLAs

Responsibilities

  • Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding
  • Partner with development teams to improve services through rigorous testing and release procedures
  • Participate in system design consulting, platform management, and capacity planning
  • Create sustainable systems and services through automation and uplifts
  • Balance feature development speed and reliability with well-defined service-level objectives

Required skills and qualifications

  • Master’s/Bachelor’s degree (or equivalent) in computer science or a related discipline
  • Ability to program in one or more high-level languages, such as Python, and in shell scripting
  • A proactive approach to identifying problems, performance bottlenecks, and areas for improvement