Member of Technical Staff, Kernels

Inception · San Francisco, CA

About The Position

We're looking for engineers and scientists to design, optimize, and maintain the compute foundations that power large-scale language model training and inference. You will develop high-performance ML kernels, enable efficient low-precision arithmetic, and improve the distributed compute stack used to train and serve these models at scale.

Requirements

  • BS/MS/PhD in Computer Science, Engineering, or a related field (or equivalent experience).
  • Proficiency in CUDA, CuTe, Triton, or other GPU programming frameworks.
  • Understanding of ML frameworks (PyTorch, TensorFlow) from a systems perspective.
  • Background in performance optimization and profiling of ML systems.
  • Experience implementing low-precision formats (FP8, INT8, block floating point) or contributing to related compiler stacks (XLA, TVM); an illustrative quantization sketch follows this list.
  • Familiarity with distributed training techniques (data parallel, model parallel, pipeline parallel).
  • Proficiency in Python and at least one systems programming language (C++/Rust/Go).
  • Experience with containerization (Docker), orchestration (Kubernetes), and CI/CD pipelines.
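
As a rough illustration of the low-precision bullet above, here is a minimal sketch of symmetric per-tensor INT8 quantization in PyTorch. The function names (quantize_int8, dequantize_int8) and the per-tensor scaling choice are assumptions made for the example, not part of this role's actual stack.

```python
import torch

def quantize_int8(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Symmetric per-tensor INT8 quantization: map x to int8 with a single scale."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0  # guard against all-zero input
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float32 tensor from the int8 values and the scale."""
    return q.to(torch.float32) * scale

# Usage: round-trip a random activation tensor and measure the quantization error.
x = torch.randn(1024, 1024)
q, s = quantize_int8(x)
err = (dequantize_int8(q, s) - x).abs().max().item()
print(f"max abs quantization error: {err:.4f}")
```

Per-tensor scaling keeps the example short; FP8 and block floating point follow the same quantize/dequantize pattern with different element formats and finer-grained scales.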

Nice To Haves

  • Experience building and maintaining large-scale language models with tens of billions of parameters or more.
  • Experience with distributed systems and cloud computing platforms (AWS/GCP/Azure).
  • Familiarity with distributed frameworks such as PyTorch/XLA, DeepSpeed, Megatron-LM.
  • Prior contributions to open-source deep learning infrastructure such as PyTorch, DeepSpeed, or XLA.

Responsibilities

  • Design and implement custom ML kernels (CUDA, CuTe, Triton) for core dLLM operations such as attention, matrix multiplication, gating, and normalization, optimized for modern GPU architectures (see the illustrative gating-kernel sketch after this list).
  • Design compute primitives to reduce memory bandwidth bottlenecks and improve kernel efficiency.
  • Contribute to infrastructure stability and scalability, ensuring reproducibility, consistency across precision formats, and high utilization of compute resources.
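
To give a concrete flavor of the gating work named in the first responsibility, below is a minimal Triton sketch of a fused SwiGLU-style gating kernel. The SwiGLU formulation, the function names, and the block size are illustrative assumptions, not a description of Inception's actual kernels.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def swiglu_kernel(gate_ptr, up_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of BLOCK_SIZE elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    gate = tl.load(gate_ptr + offsets, mask=mask)
    up = tl.load(up_ptr + offsets, mask=mask)
    # Fused SwiGLU gating: silu(gate) * up in a single pass over memory,
    # instead of materializing silu(gate) as a separate intermediate tensor.
    silu = gate / (1.0 + tl.exp(-gate))
    tl.store(out_ptr + offsets, silu * up, mask=mask)

def swiglu(gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
    """Launch one program per block of elements over the flattened tensors."""
    out = torch.empty_like(gate)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    swiglu_kernel[grid](gate, up, out, n, BLOCK_SIZE=1024)
    return out

# Usage (requires a CUDA device): gate/up projections of an MLP block.
gate = torch.randn(4096, 4096, device="cuda", dtype=torch.float32)
up = torch.randn_like(gate)
ref = torch.nn.functional.silu(gate) * up
assert torch.allclose(swiglu(gate, up), ref, atol=1e-4)
```

Fusing the activation and the elementwise product into one kernel avoids an intermediate tensor, which is the kind of memory-bandwidth saving the second responsibility targets.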