Role/Title: SRE
Onsite/Remote: Remote in FL
Position Type: Temp - Perm
Duration: 6 months - Perm
Interview Process: 2 Virtual Interviews
Site Reliability Engineer
As an SRE, you will monitor applications and respond to events, incidents, and changes originating from internal or vendor applications. You will investigate incidents and problems to determine root cause, and analyze existing IT processes and use IaC to automate them. The role reports to the Director of IT and works to establish operational metrics for our AWS and Azure environments. The SRE will participate in our on-call rotation.
Job Responsibilities
- Work on platform services to design, develop, and improve services, platforms, and processes that improve end-to-end reliability and maintainability across all our services.
- Create and drive adoption of tools that help deliver insights and automation to simplify the complex world of large scale services.
- Create the infrastructure to support the deployment of the Supply Chain in AWS.
- Leverage new technology paradigms (e.g., serverless, containers, microservices).
- Influence infrastructure architecture by sharing your application development expertise.
- Mentor others through design reviews, code reviews, and test cases.
- Quickly adapt, apply and train on new technologies, tools, methods, and processes from both internal and external sources.
- Apply ITSM methodology to incident response.
Basic Qualifications
- Bachelor's degree in Computer Science or 5+ years of professional experience in software development (MS, BE, Computer Science, site reliability, etc.).
- 5+ years of large-scale software development or application engineering with recent coding experience in one or more of the following languages: Java, JavaScript, C/C++, C#, Node.js, Python, or Rust.
- Experience in designing and building infrastructure to support applications using container and serverless technologies.
- Experience in designing and building infrastructure to support traditional 3-tier applications.
- Experience with network technologies such as static routing, BGP, firewalls, WAFs, and DDoS services.
- Proficiency in scripting languages such as Bash, Python, and PowerShell.
- Experience working with operating systems (Linux, Windows).
- Experience supporting infrastructure for large multi-service applications.
- Experience working with CI/CD in microservices architectures.
- Experience with observability/monitoring tools: Datadog, New Relic, Istio.
- Experience working with container orchestration and configuration management tools such as Kubernetes.
- Experience developing environment documentation and support procedures.
Preferred Qualifications
- Understanding of enterprise IT operational capabilities such as Change, Release, and Incident Management, infrastructure management, or applications management.
- Experience architecting highly available systems that use load balancing and horizontal scalability.
- Experience with Agile software development and DevOps practices such as Infrastructure as Code (IaC), Continuous Integration, and automated deployment.
- Experience in adopting chaos engineering techniques to validate system resiliency.
- Experience with distributed services, asynchronous messaging architecture, eventual consistency, and telemetry, plus experience writing and managing high-scale services in cloud environments such as Azure, AWS, or GCP.