Software Engineer, Data Center Infrastructure Management Lifecycle

Google
Sunnyvale, CA
Posted 1d ago
$141,000 - $202,000

About The Position

The Data Center Infrastructure Management (DCIM) Lifecycle team operates one of the largest-scale monitoring systems at Google, reading telemetry from millions of devices in every Google data center. Our challenges include managing the rapid growth and diversification of the Google fleet and hardware, supporting new use cases for critical monitoring of third-party facilities, and retiring technical debt.

Google is bringing tape libraries back to its data centers to support several critical requirements: a new cold storage tier, better total cost of ownership (TCO), and a contingency for HDD/SSD shortages caused by unprecedented AI/ML capacity demand. The purpose of this role is to design and deliver Tape Health, reliably and at Google scale.

In this role, you will work with your teammates to design, code, and put into production very large-scale distributed monitoring systems, and work with your team and partner teams to enable new use cases for large-scale telemetry gathering. You will also create system monitoring dashboards, define service level objectives (SLOs), and write documentation and playbooks. You will have the opportunity to take onsite trips to one or more of Google's data centers each year to work in person with new systems and data center technical staff.

Requirements

  • Bachelor's degree or equivalent practical experience.
  • 2 years of experience with coding in C++.
  • 1 year of experience with distributed computing.
  • 1 year of experience with debugging, troubleshooting and monitoring systems.

Nice To Haves

  • Master's degree or PhD in Computer Science, or a related technical field.
  • 2 years of experience in unit testing, integration testing, and continuous deployment.
  • 2 years of experience in SQL.

Responsibilities

  • Design, develop, and maintain software services for collecting and analyzing telemetry data from tape libraries, drives, and robotic components.
  • Implement algorithms and rules to detect, diagnose, and predict hardware failures.
  • Integrate tape health systems with Google's data center health monitoring infrastructure (e.g., system health, network doctor) and automated repair workflows (e.g., surgeon, silk roads).
  • Collaborate with hardware engineers and vendors to understand failure modes and improve diagnostic capabilities.
  • Develop dashboards and tools to provide visibility into the health and status of the tape hardware fleet.
  • Participate in the full software development lifecycle, including requirements gathering, design, coding, testing, deployment, and operation.