The High Performance Computing (HPC) System Administrator will be responsible for day-to-day operational support of the Roush CAE HPC and VDI hardware and software infrastructure. Day-to-day operations include supporting end users with issues, driving root cause analysis, and designing task automations. This role will work cross-functionally on various project teams and operations under the direction of the HPC System Lead Engineer. This role will also develop tools and scripts for simulation tests and for optimizing simulation jobs, and will document all changes. This position is located in Troy, MI.
Responsibilities:
- Provide day-to-day operational support of the Roush CAE HPC clusters, VDI, and backup servers: manage and resolve any hardware and software issues that may arise. (Systems Administration)
- Assist in hardware and software upgrade programs to implement new technologies. These include developing cluster tools or solutions, automation of deployments, HPC job optimization, pre/post-processing workflows, alerts, and usage and performance metrics.
- Write help documents for users, and develop functional and technical designs for automated tools that assist users with HPC job optimization, following the Roush CAE HPC change management guidelines.
- Identify bottlenecks and assist in maximizing performance of our HPC applications.
- Provide advice and support to Roush HPC users.
- Interact confidently and professionally with various audiences and stakeholders at all levels.
- Keep abreast of the latest HPC and industry developments and investigate the suitability of newly available technologies, including but not limited to: new CPU/GPU technologies, HBM, memory and high-speed interconnects, web-based software technologies, and parallel high-performance computing application tuning and optimization.
Minimum Qualifications:
- Bachelor of Science in Mechanical Engineering with minor in Computer Science or vice versa.
- Experience with Red Hat Enterprise Linux or similar Linux distributions (Fedora, CentOS Stream, AlmaLinux, and/or Rocky Linux).
- Experience in Bash, Python, and/or similar scripting languages.
- Experience in Microsoft Office products (Excel, PowerPoint, SharePoint, Teams, etc.).
- U.S. Citizen allowing for International Traffic in Arms Regulations (ITAR) compliance.
- Self-starter, able to identify requirements independently and propose solutions, with the flexibility to handle changing priorities and work on several projects simultaneously.
- Excellent documentation skills and the ability to communicate well with people of diverse backgrounds and computer knowledge.
- High level of personal commitment; occasional availability on weekends and outside normal working hours will be required to ensure system uptime and support system maintenance schedules.
- Aptitude to learn from others, share knowledge with others, and promote continuous improvement of our processes.
- Ability to work with engineering staff and users, helping and instructing them on how to use HPC resources optimally.
Preferred Qualifications:
- Minimum of 2 years' experience in HPC system administration and supporting CAE users.
- Experience in installation, configuration, administration, and use of CAE software (LS-DYNA, Nastran, STAR-CCM+, Abaqus, Fluent, etc.).
- Experience in installation, configuration and administration of queue systems such as SLURM/LSF/PBS.
- Experience in installation, configuration and administration of Virtual Desktop Infrastructure (VDI) applications.
- Willingness to try new tools and technologies and to improve processes and cost-effectiveness.
- Knowledge of HPC interconnect technologies (InfiniBand, Omni-Path, MPI, etc.).
- Knowledge and understanding of network technologies such as TCP/IP and networked file systems such as NFS, GFS, Lustre, and GPFS.
- Basic project management skills.
Our full-time benefits include medical, dental, vision, life insurance, earned sick time, STD, LTD, 401(k), tuition reimbursement, paid vacation, and paid holidays.
Roush is an EO employer – Veterans/Disabled and other protected categories. If you need reasonable accommodation for our employment application process due to disability, please contact Roush Talent Acquisition at (734) 779-7087.