Bloomberg • Posted 3 days ago
$160,000 - $240,000/Yr
Full-time • Mid Level
New York, NY
Web Search Portals, Libraries, Archives, and Other Information Services

We are seeking an engineer to join our hardware management team. This team is responsible for provisioning, monitoring, and supporting thousands of servers used by dozens of teams within Bloomberg, including the entire AI stack! The ideal candidate will have experience designing, implementing, and maintaining system software that enables communication between GPUs, CPUs, and storage in scale-out AI and HPC systems. This role is also responsible for overseeing the ongoing monitoring, support, and maintenance of our HPC/AI clusters, ensuring peak performance and reliability.

  • Design, build, and maintain highly reliable, scalable, and efficient infrastructure platforms that support our engineering teams and business needs.
  • Participate in system design discussions and contribute to architectural decisions.
  • Ensure code quality through standard methodologies, code reviews, and alignment to clean code principles.
  • Produce clear and consumable documentation for a wide audience.
  • Communicate effectively across diverse teams.
  • Participate in on-call rotations as arranged.
  • Manage priorities and work independently.
  • Stay up-to-date with the latest infrastructure technologies and evaluate their potential impact on existing and future solutions.
  • 4+ years of experience working in Kubernetes environments (deployments, storage, services, jobs, ingress, egress, etc.).
  • BA, BS, MS, or PhD in Computer Science, Electrical Engineering, or a related field.
  • Hands-on management of GPU-based systems, including kernel and driver management, and developing software tooling to automate provisioning and maintenance of these systems.
  • Design, implement, and maintain system software that enables communication between GPUs, CPUs, and storage in scale-out AI and HPC systems.
  • Oversee the ongoing monitoring, support, and maintenance of our HPC/AI clusters, ensuring peak performance and reliability.
  • Drive system upgrades, customization, and seamless integration with software developers, network operations, and data center teams.
  • Manage and maintain a diverse range of computer systems and application software, ensuring they meet the highest standards of functionality and efficiency.
  • Develop and maintain expertise in low-latency/high-bandwidth, interconnected infrastructure (including InfiniBand, Ethernet, RDMA/RoCE, and others).
  • Monitor and evaluate the efficiency and effectiveness of infrastructure service delivery methods and procedures.
  • Partner with internal teams to develop prioritization, metrics, and processes around capacity planning and infrastructure availability.
  • Expertise with Kubernetes design patterns (operators, Helm charts, Kustomize, etc.).
  • Experience with data center planning, including rack elevations, cabling plans, and cables/transceivers.
  • Experience with data center operations and management.
  • Paid holidays
  • Paid time off
  • Medical insurance
  • Dental insurance
  • Vision insurance
  • Short- and long-term disability benefits
  • 401(k) with match
  • Life insurance
  • Various wellness programs
  • Merit increases
  • Incentive compensation