Amazon.com • Posted 4 days ago
$151,300 - $261,500/Yr
Full-time • Senior
Sunnyvale, CA
General Merchandise Retailers

About the position

Join our team building the scale-out networking backbone that powers the world's largest AI training clusters. We're developing high-performance RDMA and RoCE solutions that enable distributed training of trillion-parameter models across thousands of compute nodes on AWS infrastructure. Our team is responsible for creating the networking software that connects massive AI accelerator clusters, focusing on SmartNIC integration, collective communication optimization, and ultra-high-bandwidth inter-rack connectivity. As a senior engineer, you'll drive technical architecture decisions and lead the development of next-generation distributed AI training infrastructure.

Responsibilities

  • Lead the design and development of high-performance networking software solutions utilizing RDMA and RoCE technologies for large-scale AI clusters
  • Architect SmartNIC integration strategies with EC2 control plane systems and define API specifications
  • Drive optimization of collective communication patterns and multi-rack networking protocols for distributed AI training
  • Lead development of comprehensive performance monitoring, metrics collection, and benchmarking infrastructure
  • Design automated testing frameworks and stress testing methodologies for large-scale distributed systems
  • Lead complex system-level debugging efforts across hardware acceleration, kernel networking, and distributed applications
  • Define technical architecture and strategy for next-generation scale-out AI cluster networking
  • Provide technical leadership and mentoring to engineering teams
  • Drive cross-functional collaboration with hardware, cloud infrastructure, and AI platform teams
  • Lead technical design reviews and establish engineering best practices

Requirements

  • Experience as a mentor, tech lead or leading an engineering team
  • 5+ years of non-internship professional software development experience
  • 5+ years of programming experience with at least one programming language
  • 5+ years of experience leading design or architecture (design patterns, reliability, and scaling) of new and existing systems
  • 5+ years of programming experience in C/C++ with focus on high-performance distributed systems
  • 5+ years of leading design or architecture of large-scale networked systems
  • Deep expertise in RDMA technologies, RoCE implementations, and high-performance networking
  • Extensive experience with collective communication libraries (NCCL, RCCL, OneCCL, MPI)
  • Experience as a technical lead or leading engineering teams on complex infrastructure projects

Nice-to-haves

  • 5+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Expert-level experience with SmartNIC programming and network acceleration hardware APIs
  • Deep knowledge of AI training infrastructure, cluster networking, and scale-out communication patterns
  • Proven track record of performance optimization and system-level debugging in distributed environments
  • Experience with cloud infrastructure integration, virtualization, and large-scale system deployment
  • Understanding of modern AI accelerator architectures and multi-rack cluster design
  • Experience building and optimizing systems for trillion-parameter model training workloads
  • Track record of delivering complex technical projects in high-performance computing environments
  • Strong communication and technical leadership skills
  • Master's degree in Computer Science, Computer Engineering, or related field
  • Experience with AWS cloud infrastructure and large-scale distributed system operations

Benefits

  • Flexible working culture
  • Mentorship and career growth opportunities
  • Inclusive team culture
  • Work-life harmony