Wal-Mart • posted 16 days ago
$90,000 - $180,000/Yr
Full-time • Mid Level
Hybrid • Bentonville, AR
General Merchandise Retailers

About the position

Walmart's Transactional System provides core transactional systems that enable segment and technology partners to create wonderful omni experiences with speed and leverage. We are a highly motivated group of engineers working in an agile team to solve sophisticated, high-impact problems. This role is part of the Cloud Powered Checkout (CPC) team, which is building the next-generation multi-tenant, client-agnostic, highly scalable omnichannel checkout solution to enable a frictionless customer checkout experience across all sales channels globally. We process millions of orders daily through our high-performance checkout services running in Edge and Cloud.

As a Site Reliability Engineer on the CPC team, you will work with L2, other dependent applications, the platform team, DevOps and engineering practitioners to proactively maintain the mission-critical infrastructure, cloud platforms, microservices, tools, and processes that ensure the highest levels of availability and reliability for CPC applications.

Responsibilities

  • Incident triage, Escalation and Resolution: Triage site-impacting production issues by quantifying impact, severity and urgency, analyzing systems for quick remediation, engaging the right teams for recovery and focusing on immediate restoration of large-scale enterprise systems.
  • Alerting, Monitoring and Log Analysis: Detect and analyze monitoring graphs and alerts to identify systems causing production impact, using tools such as Grafana, Prometheus, MMS, ServiceNow, Jira, Dynatrace, and Splunk.
  • Enhance Alerting Solutions: Design and implement JavaScript to integrate the alerting tool with service API endpoints.
  • Disaster Recovery Planning: Work with business partners to identify and document critical applications and execute established procedures necessary to continue operations in an emergency.
  • Performance and Optimization: Monitor site reliability conditions and new reliability requirements, and assist in the design and development of a reliability program plan.
  • Work on Product Enrichment & Content Services projects at Walmart: Develop enterprise monitoring and use tooling software solutions to improve visibility, proactively detect issues and restore system availability.
  • Develop Tools and support: Design and develop solutions for widespread internal communications for cloud applications support or workflows for infrastructure availability issues.
  • Handle Deployments: Streamline the deployment process and own deployments as a single team.
  • Coordinate with platform teams for non-app releases like VM upgrades, DB Maintenance, and other component environment related tasks.
  • Participate in rotating on-call duties and work across different time zones with a multi-national team.
  • Perform timely root cause analysis of production issues.
  • Develop reusable tooling and processes to improve customer experience and lower operational costs.
  • Help teams build highly observable and resilient systems.
  • Collaborate with developers to capture requirements and understand pain points.
  • Build reusable tools, libraries, and dashboards that can be used across DevOps/SRE teams.

Requirements

  • Bachelor's degree in Computer Science, Engineering or related discipline.
  • 3+ years of hands-on experience in SRE, operations and development, with JavaScript, Java, RESTful services, Git, Maven, Jenkins, DevOps, containerization, Docker, Kubernetes, Azure, Google Cloud, Kafka, Azure Cosmos, Azure SQL, Mega cache, CI/CD, Prometheus, Grafana, Splunk, etc.
  • Demonstrated knowledge of scripting and software development for automation and self-healing of multi-cloud environments.
  • Excellent end-to-end technical understanding of core infrastructure, cloud services, platforms, and microservices.
  • Ability to triage effectively: detect issues and distinguish symptoms from causes.
  • Identify and drive continuous improvement efforts to reduce waste.

Nice-to-haves

  • Master's degree in Computer Science, Computer Engineering, Computer Information Systems, Software Engineering, or related area and 1 year's experience in software engineering or related area.
  • Knowledge of implementing Web Content Accessibility Guidelines (WCAG) 2.2 AA standards, working with assistive technologies, and integrating digital accessibility.

Benefits

  • 401(k) match
  • stock purchase plan
  • paid maternity and parental leave
  • PTO
  • multiple health plans
  • incentive awards for performance
  • short-term and long-term disability
  • company discounts
  • adoption and surrogacy expense reimbursement