Sr. HPC System Administrator

University of Chicago, Hyde Park, IL
Hybrid

About The Position

The University of Chicago Research Computing Center (RCC), a unit in the Office of Research, provides high-end research computing resources to researchers at the University of Chicago. It is dedicated to enabling research by providing access to centrally managed High-Performance Computing (HPC), storage, and visualization resources. These resources include hardware, software, high-level scientific and technical user support, and the education and training required to help researchers make full use of modern HPC technology and local and national supercomputing resources. The Office of Research oversees the conduct of sponsored research, research program development, and contract management functions.

The job uses specialized knowledge and breadth of expertise to design automated, scalable, and rapidly deployable solutions for infrastructure development and server configuration. Leads installation, configuration, and maintenance of operating systems. Uses best practices and systems knowledge to maintain monitoring and alerting systems, utility software, and firewalls. Guides maintenance for production servers as well as Windows and Linux servers.

The University of Chicago is seeking a highly qualified Senior HPC System Administrator to join the systems and operations team that builds and manages RCC HPC systems and facility operations. The individual in this position will be involved in the procurement and management of HPC hardware and software. This is a hybrid position requiring 3 days onsite.

Requirements

  • Minimum requirements include a college or university degree in a related field.
  • Minimum requirements include knowledge and skills developed through 5-7 years of work experience in a related job discipline.

Nice To Haves

  • Master’s degree in Computer Science or closely related field.
  • Full-time Linux system administration experience in a large distributed computing environment.
  • Previous experience providing support for a Linux HPC cluster used for scientific research.
  • Experience with installing, configuring, and maintaining job management tools (such as SLURM, Moab, TORQUE, PBS, etc.).
  • Experience configuring, installing and troubleshooting MPI and OpenMP.
  • Experience with operating system deployment tools (e.g. xCAT, Rocks).
  • Experience configuring, administering, and supporting network storage subsystems (e.g. IBM, NetApp, DataDirect Networks, LSI, etc.).
  • Hands-on experience with at least one distributed file system (Spectrum Scale-GPFS, Lustre, BeeGFS, Gluster, IBRIX, PVFS, etc.).
  • Direct experience working with InfiniBand (must at least be able to demonstrate a working knowledge of InfiniBand concepts, OFED layers, subnet managers).
  • Experience configuring, installing, tuning and maintaining scientific application software on large-scale systems.
  • Experience supporting HPC compilers and libraries.
  • Experience with systems automation tools such as Ansible or Puppet.
  • Experience configuring, installing, maintaining and/or using performance monitoring and optimization tools.
  • Ability to work well with faculty and researchers.
  • Ability to identify and gain expertise in appropriate new technologies and/or software tools.
  • Ability to function as part of an interactive team while demonstrating self-initiative to achieve project goals and the Research Computing Center's mission.
  • Strong analytical skills and problem-solving ability.

Responsibilities

  • Installing, configuring, and maintaining large computer clusters/servers and software.
  • Day-to-day operations of the systems including systems administration, monitoring and storage performance up to and including network components.
  • Management of the system’s network switch, parallel file system and HPC software stack and tools.
  • Configuration of the scheduling and queuing system.
  • Diagnosing and resolving system operational problems quickly and effectively.
  • Coordinating with vendors to resolve hardware and software problems.
  • Assisting users with access and other help desk ticket requests or issues.
  • Using scripting/programming skills to enable system-level automation, problem detection, security maintenance and patch management.
  • Building and deploying open-source software and software from vendors/partners.
  • Providing reliable and efficient backups/restores for all managed systems.
  • Documenting system administration procedures for routine and complex tasks.
  • Maintaining and monitoring the security of the HPC systems and servers.
  • Plans and installs necessary patches and upgrades for servers and their associated storage, network, communications, and peripheral sub-systems.
  • Installs and maintains an appropriate level of intrusion detection, monitoring, and auditing software as required.
  • Tracks compliance and maintains documentation for hardware, software, and service inventories for management reports.
  • Performs other related work as needed.

Benefits

  • The University of Chicago offers a wide range of benefits programs and resources for eligible employees, including health, retirement, and paid time off.

What This Job Offers

Job Type

Full-time

Career Level

Mid Level

Number of Employees

5,001-10,000 employees
