SR. Infrastructure Engineer - Site Reliability Engineering - NET - Insight Global : Job Details

SR. Infrastructure Engineer - Site Reliability Engineering - NET

Insight Global

Job Location : New York, NY, USA

Posted on : 2024-11-14T07:21:02Z

Job Description :

You Lead the Way. We've Got Your Back.

With the right backing, people and businesses have the power to progress in incredible ways. When you join Team Amex, you become part of a global and diverse community of colleagues with an unwavering commitment to back our customers, communities and each other. Here, you'll learn and grow as we help you create a career journey that's unique and meaningful to you with benefits, programs, and flexibility that support you personally and professionally.

At American Express, you'll be recognized for your contributions, leadership, and impact; every colleague has the opportunity to share in the company's success. Together, we'll win as a team, striving to uphold our company values and powerful backing promise to provide the world's best customer experience every day. And we'll do it with the utmost integrity, and in an environment where everyone is seen, heard and feels like they belong. Join Team Amex and let's lead the way together.

As part of our diverse tech team, you can architect, code and ship software that makes us an essential part of our customers' digital lives. Here, you can work alongside talented engineers in an open, supportive, inclusive environment where your voice is valued, and you make your own decisions on what tech to use to solve challenging problems. American Express offers a range of opportunities to work with the latest technologies and encourages you to back the broader engineering community through open source. And because we understand the importance of keeping your skills fresh and relevant, we give you dedicated time to invest in your professional development. Find your place in technology on #TeamAmex.

Overview:

We are seeking a versatile and highly skilled Full Stack Infrastructure Engineer with expertise in Compute, Storage, Network and Cloud technologies. The ideal candidate will design, implement, and manage robust infrastructure solutions, ensuring reliability, scalability, and performance.
Key Responsibilities:

- Ensure the reliability, availability, and performance of the entire infrastructure stack, including compute, storage, network and cloud components.
- Lead incident response efforts across the infrastructure stack, coordinating with Application Support, SRE, and Engineering teams to minimize MTTD and MTTR.
- Perform root cause analysis for infrastructure-related incidents and implement corrective actions.
- Develop and maintain automation tools for managing infrastructure resources.
- Collaborate with Engineering teams to plan and execute system upgrades and maintenance.
- Conduct capacity planning and resource management for all infrastructure components.
- Participate in on-call rotations to provide 24x7 support for all critical infrastructure issues.
- Design and implement disaster recovery plans and business continuity strategies.
- Implement best practices for monitoring, logging, and alerting across the infrastructure.
- Foster a culture of continuous improvement and operational excellence.
- Analyze complex infrastructure problems, design scalable and resilient solutions, and lead the implementation of these solutions.
- Collaborate with architects and other engineers to design and enhance the architecture of infrastructure systems, ensuring alignment with business needs and technology standards.

We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances.
If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy.

- 10+ years of experience using cloud-native monitoring tools like AWS CloudWatch, Azure Monitor, and Google Cloud Operations Suite.
- Experience with packet capture tools like Wireshark for troubleshooting network issues.
- Experience using traceroute utilities and performance analysis tools like perf for identifying and resolving bottlenecks.
- Familiarity with tools such as ipconfig/ifconfig for viewing network configurations, flushing DNS, and diagnosing network issues.
- Experience with SNMP-based tools for network device monitoring and performance management.
- Experience using NetFlow for network traffic analysis.
- Experience with tools like iostat, vmstat, and dstat for monitoring storage and system performance.
- Experience with tools like df, du, lsblk, and fdisk for managing and troubleshooting file systems and disk partitions.
- Familiarity with tools like Prometheus and Grafana for monitoring and observability.
- Proven experience managing and optimizing a diverse infrastructure stack.
- Extensive knowledge of cloud platforms (AWS, Azure, GCP) and infrastructure as code (Terraform, CloudFormation).
- Familiarity with service mesh technologies (Istio, Linkerd).
- Solid understanding of virtualization (VMware, Hyper-V), containerization (Docker, Kubernetes), and orchestration.
- Understanding of storage solutions (SAN, NAS, cloud storage) and backup systems.
- Strong understanding of network protocols, routing, switching, and firewalls (Palo Alto).
- Experience with load balancers (F5, HAProxy, Nginx) and network monitoring tools.
- Experience in DNS management and troubleshooting.
- Experience in network security best practices.
- Proficiency in monitoring and observability tools (Prometheus, Grafana, Splunk).
- Proficiency in at least one scripting language (Python, Bash) for automation.
- Experience with CI/CD pipeline management and DevOps practices.
- Strong understanding of disaster recovery and business continuity planning.
- Experience with performance tuning and capacity planning.
- Understanding of chaos engineering principles and practices.
- Skills in cost optimization for cloud infrastructure.
- Familiarity with Akamai.
