SRE Engineer / Dallas, TX / FTE / Hybrid Role
Job Description:
The Site Reliability Engineer is a fundamental piece of the Site Reliability Engineering team. Site Reliability Engineering is accountable for the availability, reliability, and performance of the services and platforms in a highly transactional 24x7 environment.
The role
- Monitor application performance, take steps to improve overall application performance and stability, and follow through with implementation.
- Automate tasks that are performed manually, and apply automation and software to any other parts of the system that would benefit from it.
- Troubleshoot OS, networking, and database issues in cloud-based and on-premises environments; handle live production incidents; debug application and infrastructure issues; and follow and implement SRE best practices.
- Coordinate with product owners and business representatives to define Service Level Objectives and error budgets for key functionalities of the projects.
- Participate in design reviews of software/components with build teams to ensure that they are built right.
- Review products prior to production deployments to validate compliance with Service Level Objectives.
- Conduct system analysis and configuration management, and develop improvements for system software performance, availability, and reliability.
- Work closely with software engineers and QA to ensure the system meets non-functional requirements such as performance, security, and availability.
- Document system knowledge as acquired over time, create runbooks and ensure critical system information is readily available to those who need it.
- Maintain and monitor deployment of the servers, Docker containers, databases, and general backend infrastructure.
- Participate in production feedback sessions and problem management calls to identify opportunities for product improvement.
What you’ll bring
- Bachelor’s Degree in Computer Science or related; or equivalent combination of education and experience
- 5+ years of experience in a full-stack application support/SRE role
- Experience in JavaScript, TypeScript, and web development technologies
- Proficient in scripting languages such as PowerShell and/or Python
- Experience troubleshooting complex incidents in applications built on the AWS stack
- Experience in conducting design reviews of software components and leading performance, capacity and chaos experiments.
- Extensive experience with observability platforms (Datadog) is required. Experience with built-in browser-side diagnostic tools is expected.
- Knowledge of DevOps methodologies and the tools involved, such as CI/CD concepts, CI/CD tools (Jenkins, CodePipeline, etc.), and automation and configuration tools (Puppet, Ansible, etc.), is a plus.
- Hands-on experience with AWS public cloud is a must; project implementation experience on public cloud is a plus.
- Ability and willingness to adapt to new application stacks and new technology concepts as the business evolves over time
- Excellent communication skills, both verbal and written
- Ability to collaborate with local and remote teams in different time zones
- Ability to present/lead technical discussions with product, cloud COE, security and other support teams
Regards,
Purnima Pobbathy
Technical Recruiter
purnima@themesoft.com
Themesoft INC