DevOps Monitoring

Posted 2 months ago



• Improving all aspects of monitoring of our client’s Cloud Platform and all supporting services;
• Defining best practices for engineering teams and guiding them to get deep insights into their applications in production;
• Ensuring that dashboards and information radiators provide the right level of information to the right people in the organization;
• Making events traceable and introducing improvements to help on-call engineers analyze our client’s distributed system;
• Operating the infrastructure and tools required to work with metrics from our client’s core banking services;
• Improving standards of gathering and processing metrics;
• Ensuring that development teams can produce custom metrics;
• Providing reports and aggregations based on engineering or business needs;
• Monitoring our client’s SLA performance;
• Operating the infrastructure and tools required to work with logs produced by our client’s core banking services;
• Implementing ways to process these logs and providing insights to development teams;
• Improving log retention and processing strategies;
• Providing developers with tooling and guidance to define alerts based on their needs;
• Monitoring, reporting and alerting on SLOs;
• Improving anomaly detection based on the past performance of applications;
• Predicting capacity problems;
• Reducing alert fatigue;
• Ensuring that our client’s monitoring systems don’t hold any personally identifiable information;
• Together with security and compliance, conducting regular reviews of the systems.

You need to have:

• Solid knowledge of public cloud services (at least 3 years of experience as a DevOps/SRE/Ops engineer with a focus on services monitoring);
• Understanding of cloud-native applications and distributed systems;
• Software development and testing skills (Go, Java, Python, etc.);
• Experience with monitoring applications on Kubernetes;
• Good understanding of distributed tracing;
• Experience with application performance monitoring tools;
• Experience with on-call rotation and incident handling;
• Experience monitoring applications at a worldwide scale;
• Strong communication, organizational and problem-solving skills.

Nice to have:

• Solid experience monitoring Java applications;
• Application security knowledge;
• Knowledge of statistics-based monitoring and modeling.