AI Infrastructure Engineer

Bright Vision Technologies • Bridgewater Township, NJ

Remote

About The Position

Bright Vision Technologies is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. We leverage cutting-edge cloud data platform technologies to design scalable, secure, and high-performance analytics environments. As we continue to grow, we’re looking for a skilled AI Infrastructure Engineer to join our dynamic team and contribute to our mission of transforming business processes through technology. This is a fantastic opportunity to join an established and well-respected organization offering tremendous career growth potential.

Requirements

Design and manage AI/ML Infrastructure optimized for GPU Computing using NVIDIA CUDA, enabling high-throughput training and inference workloads.
Develop and automate scalable environments with Python scripting on Linux, leveraging Docker for containerization and Kubernetes for orchestration.
Deploy and optimize AI workloads across Cloud Platforms (AWS, Azure, GCP), configuring GPU clusters for cost-effective scaling.
Implement AI Workload Orchestration tools to schedule, manage, and monitor distributed training jobs across multi-node setups.
Build High-Performance Computing (HPC) and distributed systems, focusing on low-latency storage and networking for AI (e.g., NVMe, InfiniBand).
Provision infrastructure using Infrastructure as Code (Terraform), ensuring reproducible and version-controlled deployments.
Establish CI/CD pipelines with Git integration for automated building, testing, and rollout of AI infrastructure components.
Set up Monitoring & Observability stacks (e.g., Prometheus, Grafana) to track GPU utilization, cluster health, and performance bottlenecks.
Collaborate within Agile teams, delivering iterative improvements to AI infrastructure through sprints and cross-functional teamwork.
Optimize resource allocation for AI pipelines, reducing costs while maximizing throughput for large-scale model training and serving.