Our client has an immediate need for a Site Reliability Engineer, working Friday/Monday 3pm-11:30pm EST, who will specialize in improving all aspects of reliability, act as a conduit between infrastructure and application teams on support issues, and improve tools, automation, processes, and software.
Requirements:
- Bachelor’s degree in Engineering, Computer Science, or a related field
- Breadth and depth of technical and management knowledge
- Continuous improvement mindset, always looking for opportunities to streamline, routinize, or automate
- Working knowledge across technology in the following support areas:
  - Server: Linux and Windows administration and troubleshooting, plus patching and basic scripting (PowerShell, Bash)
  - Converged Solutions: Experience with VCE/UCP (including VMware versions 6 and above), platform and network connectivity, and patching; understanding of current threat analysis and remediation trends; PowerShell and Linux scripting skills
  - Storage: CIFS/NFS, Linux and Windows scripting, DPA reporting, Avamar and Data Domain administration, and a solid understanding of Windows and Linux environments
  - Middleware: Linux, Windows, WebSphere, Apache, IIS, WebLogic, and Tomcat
  - Mainframes: JCL, CICS SYSPLEX
  - Networking: Strong understanding of network protocols and the OSI model, as well as Network+ certification
  - Workflow and Knowledge Management: ServiceNow
  - Collaboration Tools: TrueSight, Jira, and Confluence
  - Process: Skilled and knowledgeable in ITSM; proficient in operations analytics methodologies that drive performance improvement (e.g., Lean)
- Strong troubleshooting and problem-solving skills, with the ability to analyze and resolve complex technical issues
- Experience with ITIL fundamentals
- Familiarity with Problem Management, Change Management, Release Management, Event Management, and Incident Management
- Ability to prioritize incoming incidents by criticality in a high-volume environment
- Capable of balancing multiple projects
- Ability to quickly learn and adapt to testing and support requirements for non-production work, including creating documentation for new processes and procedures
- Excellent communication and interpersonal skills, with the ability to collaborate effectively with stakeholders at all levels
- Insatiable curiosity about how technologies work and how they interface in complex, large-scale environments
Responsibilities:
- Monitoring systems and infrastructure to maintain operational and performance levels
- Rotational on-call responsibilities
- Working closely with other SRC professionals/engineers when issues arise, collaborating on troubleshooting, and providing consultation/resolution for events/incidents
- Anticipating potential problems before they have an impact and collaborating to determine solutions
- Gathering and analyzing metrics from tools and system/application logs to assist in performance tuning, fault finding, and resolution
- Creating sustainable systems and services through automation, process enhancement, tools, and noise reduction
- Building automation to manage the SRC operations and eliminate/minimize manual functions and toil
- Collaborating with Application/Infrastructure support engineers and operations teams
- Engaging in post-incident reviews to identify improvements and determine the cause in order to prevent recurrence