DevOps Monitoring

Posted 2 months ago

Your responsibilities:

  • Improving all aspects of monitoring of our client's cloud platform and all supporting services;
  • Defining best practices for engineering teams and guiding them to get deep insights into their applications in production;
  • Ensuring that dashboards and information radiators provide the right level of information to the right people in the organization;
  • Making events traceable and introducing improvements that help on-call engineers analyze our client's distributed system;
  • Operating the infrastructure and tools required to work with metrics from our client's core banking services;
  • Improving standards of gathering and processing metrics;
  • Ensuring that development teams can produce custom metrics;
  • Providing reports and aggregations based on engineering or business needs;
  • Monitoring our client's SLA performance;
  • Operating the infrastructure and tools required to work with logs produced by our client's core banking services;
  • Implementing ways to process these logs and providing insights to development teams;
  • Improving log retention and processing strategies;
  • Providing developers with tooling and guidance to define alerts based on their needs;
  • Monitoring, reporting and alerting on SLOs;
  • Improving anomaly detection based on the past performance of applications;
  • Predicting capacity problems;
  • Reducing alert fatigue;
  • Ensuring that our client's monitoring systems don't hold any personally identifiable information;
  • Together with security and compliance, conducting regular reviews of the systems.

You need to have:

  • Solid knowledge of public cloud services (at least 3 years of experience as a DevOps/SRE/Ops engineer with a focus on services monitoring);
  • Understanding of cloud-native applications and distributed systems;
  • Software development and testing skills (Go, Java, Python, etc.);
  • Experience with monitoring applications on Kubernetes;
  • Good understanding of distributed tracing;
  • Experience with application performance monitoring tools;
  • Experience with on-call rotation and incident handling;
  • Experience monitoring applications at a worldwide scale;
  • Strong communication, organizational and problem-solving skills.

Nice to have:

  • Solid experience monitoring Java applications;
  • Application security knowledge;
  • Knowledge of statistics-based monitoring and modeling.