AI Infrastructure Engineer

Bright Vision Technologies
Bridgewater Township, NJ
Remote

About The Position

Bright Vision Technologies is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. We leverage cutting-edge cloud data platform technologies to design scalable, secure, and high-performance analytics environments. As we continue to grow, we’re looking for a skilled AI Infrastructure Engineer to join our dynamic team and contribute to our mission of transforming business processes through technology. This is a fantastic opportunity to join an established and well-respected organization offering tremendous career growth potential.

Requirements

  • Design and manage AI/ML Infrastructure optimized for GPU Computing using NVIDIA CUDA, enabling high-throughput training and inference workloads.
  • Develop and automate scalable environments with Python scripting on Linux, leveraging Docker for containerization and Kubernetes for orchestration.
  • Deploy and optimize AI workloads across Cloud Platforms (AWS, Azure, GCP), configuring GPU clusters for cost-effective scaling.
  • Implement AI Workload Orchestration tools to schedule, manage, and monitor distributed training jobs across multi-node setups.
  • Build High-Performance Computing (HPC) systems with Distributed Systems expertise, focusing on low-latency Storage & Networking for AI (e.g., NVMe, InfiniBand).
  • Provision infrastructure using Infrastructure as Code (Terraform), ensuring reproducible and version-controlled deployments.
  • Establish CI/CD pipelines with Git integration for automated building, testing, and rollout of AI infrastructure components.
  • Set up Monitoring & Observability stacks (e.g., Prometheus, Grafana) to track GPU utilization, cluster health, and performance bottlenecks.
  • Work within Agile methodologies, delivering iterative improvements to AI infrastructure through sprints and cross-functional teamwork.
  • Optimize resource allocation for AI pipelines, reducing costs while maximizing throughput for large-scale model training and serving.
