DevOps System Administrator

General Dynamics Mission Systems, Inc.
Scottsdale, AZ

About The Position

Put your engineering talent to the ultimate test. At General Dynamics Mission Systems, we create the technologies, products and services that help our service members, intelligence analysts and first responders keep our nation safe. The work we do is so advanced our teams often redefine what’s possible for the world. If you want to be a leader with the company that delivers smart solutions for our nation’s challenges, this is your opportunity. We apply advanced technologies such as Artificial Intelligence, Blockchain, AR/VR, Cloud Native and Quantum Physics to solve our customers’ missions in cyber, RF, undersea, interstellar and everything in between.

We are seeking a skilled AI DevOps System Administrator to build, manage, and optimize the infrastructure supporting our Artificial Intelligence and Machine Learning initiatives in a classified environment. The ideal candidate will be responsible for maintaining the CI/CD pipeline for ML models, managing GPU resources, and ensuring the stability, scalability, and security of the AI development and deployment environment. This role requires close collaboration with data scientists and ML engineers to streamline workflows from model development to production.

As a seasoned leader, you’ll be involved in our client’s decision-making process by serving as a front-line interface to users with technical issues and by conducting systems analysis and development to keep systems current with changing technologies. Your duties may include installing new software, troubleshooting, granting permissions to applications, and training users. You’ll also be responsible for the day-to-day support of server services: performing server administration for physical and virtual server operating systems, and configuring, maintaining, and troubleshooting physical and virtual hardware and network-related interfaces on servers.

We’ll rely on you to maintain and troubleshoot servers and conduct a complete analysis of alerts; create scripts to automate repetitive processes; and work with customers to identify, isolate, and resolve server problems that are affecting other services.

Requirements

  • A Bachelor’s degree in Computer Science or a related field, or equivalent experience, plus a minimum of 8 years of relevant experience; or a Master’s degree plus 6 years of relevant experience
  • Advanced understanding of server-based operating systems
  • Strong Linux, container, and AI skills
  • Subject matter expert (SME) with the ability to mentor others on administering the server environment
  • Advanced troubleshooting skills across the server OS, networking, and storage technologies
  • Hands-on experience developing, deploying and supporting large-scale enterprise server solutions
  • Department of Defense TS/SCI with Polygraph security clearance is required at time of hire.
  • Applicants selected will be subject to a U.S. Government security investigation and must meet eligibility requirements for access to classified information.
  • Due to the nature of work performed within our facilities, U.S. citizenship is required.

Nice To Haves

  • Team player who thrives in collaborative environments and revels in team success
  • Broad understanding of the interrelationships within the IT environment, with a focus on servers and services
  • Senior-level knowledge of physical and virtual server support
  • Senior-level knowledge of access, permissions, and security, so that clients have access to the data they need to perform their daily activities

Responsibilities

  • Design, implement, and maintain scalable and robust infrastructure for AI/ML model training and inference.
  • Develop and manage CI/CD pipelines for automated building, testing, and deployment of AI applications and machine learning models.
  • Administer and optimize Linux-based systems and virtualized environments.
  • Manage containerization and orchestration platforms (e.g., Docker, Kubernetes) to deploy and scale ML services.
  • Automate infrastructure provisioning, configuration management, and deployment processes using Infrastructure as Code (IaC) tools like Ansible or Terraform.
  • Manage and allocate GPU resources efficiently for model training and other high-performance computing tasks.
  • Implement and maintain monitoring, logging, and alerting systems to ensure platform health and performance.
  • Collaborate with development teams to support their infrastructure needs and troubleshoot issues.