Job Description
Senior Storage & HPC Administrator - R&D HPC team - Austin, US
This is what you will do as an HPC DevOps engineer at NXP
You are expected to work very closely with your global colleagues within R&D IT and help deliver the Storage & HPC services (High Performance Computing and Virtual Desktop Infrastructure) to our engineering and R&D customers. Your AMEC team has operational responsibility for all designated services within the region and helps provide backup support to our other two regions (EMEA and APAC) when necessary, to ensure good coverage and fast response for any issues requiring attention. Your work will be a combination of development and operations.
What’s in it for me?
The team keeps a healthy mix of 50% development and 50% operational activities. This requires a constant eye on innovation. You can contribute in every development phase, i.e., from problem analysis and design to implementation, testing, and bug fixing.
- You will receive support from your direct colleagues, a Scrum team of data engineers and data scientists
- You will get training and personal development opportunities from NXP in alignment with your personal and professional needs
- The Product Owner, representatives from the R&D business, and other stakeholders will give you guidance on the solution direction
- You will have the opportunity to work with cutting-edge technologies in an AWS cloud-based serverless data platform architecture
- You will enjoy the greatest workplace in Austin and a safe working environment that lives up to its company values
Primary responsibility:
- In this role, you will play a pivotal part in supporting the architecture and design of large-scale data storage services, both on-premises and in the cloud
- You will be responsible for resolving highly technical issues, providing Level 3 support, and serving as a senior resource for IT initiatives and operations
- You will also participate in design reviews and drive specific goals around performance, scalability, and availability
- You will fill a key technical role in our team with equally dedicated engineers supporting our local client’s data storage infrastructure and services
- Perform Level 3 support functions, including installation, configuration, upgrades, and support of AWS FSxN, NetApp ONTAP, and StorageGRID environments
- Identify and determine root causes of performance issues in an EDA environment from end users to storage subsystems
- Create dashboards for end users to enhance visibility and usability
- Find opportunities for performance improvement while providing multi-level technical support in defining, developing, and evaluating storage products
- Provide recommendations to improve the storage infrastructure, and lead the analysis and resolution of critical issues
- Work with application and infrastructure teams to ensure the appropriate level of storage provisioning is maintained. Monitor server/storage infrastructure and any processes related to these systems
- Use automation to deliver highly optimized infrastructure
- Strong automation skills, preferably using Ansible and Python
- Experience as a DevOps engineer with infrastructure as code (IaC), configuration management tools, and CI/CD
- Keep yourself updated on different storage technologies, protocols, and file systems (e.g., NAS, SAN, NFS, CIFS, object storage)
Secondary responsibility:
- As a Linux Systems Administrator working with Red Hat, you will ensure the stability, integrity, and efficient operation of our infrastructure by monitoring, maintaining, supporting, and optimizing our development and production environments
- Provide installation/configuration, operation, and maintenance of systems software and related infrastructure
- Perform ongoing performance tuning, hardware upgrades, and resource optimization as required
- Perform daily system monitoring, verifying the integrity and availability of all hardware, server resources, systems and key processes
- Review system and application logs, and verify completion of scheduled jobs such as backups, both internally and externally
- Implement automated approaches for system administration tasks
- As a Senior Linux Systems Administrator, you will provide technical support for the software, hardware, wired network infrastructure, and system security of a Linux High Performance Computing (HPC) environment. Knowledge of and hands-on experience with grid computing applications such as LSF, Univa Grid, or Slurm would be a plus.
- Test operating system and software upgrades, patches, and hot fixes before they are applied to the operational systems
- Create and test backups of data, provide data cleansing services, verify data integrity, and implement access controls
Typically, you…
You can describe yourself as follows:
- Minimum qualifications: Bachelor’s degree or equivalent practical experience in Software Engineering or Computer Science
- 10-12 years of hands-on experience with NetApp storage technologies
- NetApp certified System Administrator certification
- AWS system architect certification
- Five years of experience working on large DevOps implementations and operations
- Solid DevOps mindset and ability to evaluate and optimize DevOps processes and practices using Lean or Agile methods
- Proficiency in maintaining automation and configuration management using tools such as Jenkins, GitLab, and AWS CodeDeploy
- Deep understanding of Linux operating systems, networking, and server management, and experience using tools such as Bash, Ansible, and PowerShell
- Strong problem-solving and analytical skills to troubleshoot and resolve complex issues, and experience using tools such as Splunk, Grafana, or Prometheus to monitor and analyze the performance and health of the systems you support
- Understanding of identity and access management relating to provisioning, access control, and certifications