Site Reliability Engineer

ConsultUSA • Phoenix, AZ, US • 2w ago

Our client has an immediate need for a Site Reliability Engineer, working Friday/Monday 3pm-11:30pm EST, who will specialize in improving all aspects of reliability, act as a conduit between infrastructure and application teams on support issues, and improve tools, automation, processes, and software.

Requirements:

Bachelor’s degree in Engineering, Computer Science, or a related field
Possess a breadth and depth of technical and management knowledge
Continuous improvement mindset, always looking for opportunities to streamline, routinize, or automate
Working knowledge across technology in the following support areas:
- Server: Administration and troubleshooting in Linux and Windows as well as patching and basic scripting skills (PowerShell, Bash)
- Converged Solutions: Experience in VCE/UCP (including VMware versions 6 and above), platform and network connectivity, and patching – understanding of current threat analysis and remediation trends, alongside PowerShell and Linux scripting skills
- Storage: CIFS/NFS, Linux and Windows scripting, DPA reporting, Avamar and Data Domain administration, and solid understanding of Windows and Linux environments
- Middleware: Linux, Windows, WebSphere, Apache, IIS, WebLogic and Tomcat
- Mainframes: JCL, CICS, Sysplex
- Networking: Strong understanding of network protocols and the OSI model, as well as Network+ certification
- Workflow and Knowledge Management: ServiceNow
- Collaboration Tools: TrueSight, Jira, and Confluence
- Process: Skilled and knowledgeable in ITSM; proficiency in operations analytics methodologies to drive performance improvement (e.g., Lean)
Strong troubleshooting and problem-solving skills, with the ability to analyze and resolve complex technical issues
Experience with ITIL fundamentals
Familiarity with Problem Management, Change Management, Release Management, Event Management, and Incident Management
Ability to prioritize incoming incidents by criticality in a high-volume environment
Capable of balancing multiple projects
Ability to quickly learn and adapt to testing and support requirements for non-production work, including creating documentation for new processes and procedures
Excellent communication and interpersonal skills, with the ability to collaborate effectively with stakeholders at all levels
Insatiable curiosity about how technologies work and how they interface in complex, large-scale environments

Responsibilities:

Monitoring systems and infrastructure to maintain operational and performance levels
Rotational on-call responsibilities
Working closely with other SRC professionals/engineers when issues arise, collaborating on troubleshooting, and providing consultation/resolution for events/incidents
Anticipating potential problems before they have an impact and collaborating to determine solutions
Gathering and analyzing metrics from tools and system/application logs to assist in performance tuning, fault finding, and resolution
Creating sustainable systems and services through automation, process enhancement, tools, and noise reduction
Building automation to manage the SRC operations and eliminate/minimize manual functions and toil
Collaborating with Application/Infrastructure support engineers and operations teams
Engaging in post-incident reviews to identify improvements and determine the cause of incidents to prevent recurrence