Tandym Group is seeking a Site Reliability Engineer to support a financial client based in Charlotte.
Responsibilities:
- Run the production environment by monitoring availability and taking a holistic view of system health
- Support the applications through an on-call rotation
- Provide stability to our applications and facilitate rapid feature development by proactively taking an active role in the direction of the service
- Automate or eliminate manual work and look for further opportunities for automation
- Implement and maintain SLOs, and drive their adoption and automation
- Track production readiness, health scoring, and error budgets
- Define runbook standards and keep runbooks maintained and up to date
Qualifications:
- Experience using DevOps tools and technologies such as GitLab, and Infrastructure as Code tools such as Terraform
- Strong troubleshooting skills and experience building and enhancing observability using monitoring tools
- Proactive approach to observability maturity: identifying problems, performance bottlenecks, and areas for improvement
- Experience leading incident response and supporting application teams
- Blameless postmortems
- Developer feedback for enhanced logging, runbooks, and addressing technical debt
- Promoting observability best practices
- Experience with the monitoring tools Dynatrace and Splunk
- Experience with public cloud platforms (preferably AWS) and API gateways
- Experience developing APIs, microservices, or frontends is a plus
- Experience using a version control system (VCS) such as Git
Desired Skills: