Specialization: IT OR COMPUTER SOFTWARE
Job description:
Strong knowledge of and experience in the following are required:
- Experience negotiating SLOs/SLIs with product owners
- Experience in building highly available & observable systems at scale
- Proven track record working as a Site Reliability Engineer at a managerial level
Job Description:
- Implement and improve SRE principles by working with Infra/DevOps members and engineers across the wider organization to spread SRE knowledge and best practices
- Act as a multi-hat team member with a software and systems engineering mindset and a passion for system reliability and observability
- Build reliability as a feature into our core infrastructure and applications
Qualifications:
- Knowledge of scalable production architectures (config management, monitoring, infrastructure as code, load balancing, CDNs, distributed systems)
- Experience with cloud infrastructure (e.g. AWS, Alibaba Cloud), Kubernetes, and most of the following technologies: Helm, Docker, Terraform, Graylog, Prometheus, Jaeger, Kafka/RabbitMQ
- Good understanding of SLI, SLO, and SLA concepts
- Experience in using data/metrics/logs to diagnose and troubleshoot complex systems
- Experience as a software developer, preferably polyglot (C#, Python, or Go)
- Ability to work anywhere in the stack
- Knowledge of operating system internals
- Familiarity with operations: metrics/statistics, incident management, postmortems, etc.
- Good understanding of MTTD, MTTR, and MTBF metrics
- Have "automate things, remove toil" in your DNA
- Strong passion for observability and knowledge sharing