Asia Recruit - Job Details

Site Reliability Engineer

Location: Bangsar South

Job ID: P2C3

Specialization: IT OR COMPUTER SOFTWARE

Job description:

Strong knowledge and experience in the following items are required:

Implement and improve SRE principles by working with Infra/DevOps team members and engineers across the wider organization to spread SRE knowledge and best practices
Act as a multi-hat team member with a software and systems engineering mindset and a passion for system reliability and observability
Build reliability as a feature into our core infrastructure and applications

Knowledge of scalable production architectures (config management, monitoring, infrastructure as code, load balancing, CDNs, distributed systems)
Experience with cloud infrastructure (e.g. AWS, Alibaba Cloud), Kubernetes, and most of the following technologies: Helm, Docker, Terraform, Graylog, Prometheus, Jaeger, Kafka/RabbitMQ
Good understanding of SLI, SLO, and SLA concepts
Experience in using data/metrics/logs to diagnose and troubleshoot complex systems
Experience as a software developer, preferably polyglot (C#, Python, or Go)
Ability to work anywhere in the stack
Knowledge of operating system internals
Familiarity with operations: metrics/statistics, incident management, postmortems, etc.
Good understanding of MTTD, MTTR, and MTBF metrics
Have "automate things, remove toil" in your DNA
Strong passion for observability and knowledge sharing