Job Location: New York, NY, USA
We are seeking an experienced Senior Observability Engineer to manage and maintain the operational health of our observability stack, including the on-premises Elastic Stack and other real-time monitoring and alerting tools. This individual will be responsible for ensuring that our observability tools are up to date, secure, and running efficiently. The Senior Observability Engineer will troubleshoot system issues, perform upgrades, and configure monitoring systems such as Uptime Robot and alerting tools like PagerDuty to support our engineering teams. This role requires collaboration with SaaS, PaaS, and IaaS teams to ensure the observability systems provide comprehensive insight into system performance and availability. You will also help implement and improve monitoring strategies to maximize uptime and operational efficiency.

Requirements

Observability Platform Management:
Manage and maintain the operational health of the Elastic Stack, including Elasticsearch, Kibana, Logstash, and Beats.
Configure, troubleshoot, and upgrade the observability stack to ensure high availability and scalability.

Monitoring and Alerting:
Set up and manage real-time monitoring tools such as Uptime Robot to track system uptime and performance.
Implement and configure alerting tools like PagerDuty to notify the appropriate teams of system issues.

System Upgrades and Maintenance:
Perform regular updates and upgrades to observability tools, ensuring they are on the latest stable versions.
Troubleshoot and resolve operational issues related to the observability tools in a timely manner.

Security and Best Practices:
Ensure the observability stack adheres to security best practices, including data encryption and access control.
Monitor for vulnerabilities and apply security patches when necessary.

Collaboration and Documentation:
Collaborate with the platform, DevOps, and infrastructure teams to ensure observability solutions align with business and operational requirements.
Maintain comprehensive documentation of configurations, procedures, and monitoring strategies.

Maintenance and Troubleshooting:
Perform routine maintenance, including upgrading Kafka brokers, applying patches, and handling schema updates.
Troubleshoot Kafka-related issues such as performance degradation, replication lag, and consumer/producer errors.

Qualifications

Bachelor's degree in Computer Science, Information Technology, or a related field.
5 years of software engineering experience.
3 years of experience managing and administering observability platforms, including Elastic Stack, in a production environment.
Must be willing to work Eastern time zone hours and able to work past 6 p.m.
Must be willing to be on call 24/7 for one week every 4-6 weeks.
Must have an AZ-900 certification or be willing to obtain it within 90 days of accepting the role.
Must be willing to obtain a new certification relevant to the role every quarter.
Strong knowledge of real-time monitoring tools and alerting platforms (e.g., Uptime Robot, PagerDuty).
Experience with system administration tasks such as upgrades, troubleshooting, and configuration management.
Familiarity with security best practices, including encryption and access control for observability tools.
Excellent problem-solving skills and the ability to work effectively in a collaborative team environment.

Benefits

Have an opportunity to add value to a diverse team of innovative professionals.
Rise to new challenges every day.
Receive a competitive salary and career growth opportunities.

Softheon is an equal opportunities employer. We strongly believe that employing a diverse workforce is key to our success. We make recruitment and hiring decisions based on each candidate's experience and skills. We value your passion to make healthcare affordable, accessible, and plentiful.