Site Reliability Engineer

Location: Bangsar South
Job ID: P2C3


Job description:

Strong knowledge of and experience in the following are required:

  • Experience negotiating SLOs/SLIs with product owners
  • Experience in building highly available & observable systems at scale
  • Proven track record working as a Site Reliability Engineer at a managerial level


Job Description

  • Implement and improve SRE principles by working with Infra/DevOps members and engineers across the wider organization to spread SRE knowledge and best practices
  • Act as a multi-hat team member with a software and systems engineering mindset and a passion for system reliability and observability
  • Build reliability as a feature into our core infrastructure and applications


  • Knowledge of scalable production architectures (config management, monitoring, infrastructure as code, load balancing, CDNs, distributed systems)
  • Experience with cloud infrastructure (e.g. AWS, Alibaba Cloud), Kubernetes, and most of the following technologies: Helm, Docker, Terraform, Graylog, Prometheus, Jaeger, Kafka/RabbitMQ
  • Good understanding of SLI, SLO, and SLA concepts
  • Experience in using data/metrics/logs to diagnose and troubleshoot complex systems
  • Experience as a software developer, preferably polyglot (C#, Python, or Go)
  • Ability to work anywhere in the stack
  • Knowledge of operating system internals
  • Familiarity with operations: metrics/statistics, incident management, postmortems, etc.
  • Good understanding of MTTD, MTTR, and MTBF metrics
  • Have "automate things, remove toil" in your DNA
  • Strong passion for observability and knowledge sharing

