Site Reliability Engineer
Contract
Cincinnati, Ohio
Remote
Our retail client is looking for a Site Reliability Engineer to ensure the stability, scalability, and performance of its technology solutions by managing infrastructure, automating processes, and resolving system issues to minimize downtime and optimize the customer experience.
Job Description:
- As a Site Reliability Engineer/DevOps Engineer, you will be responsible for ensuring the availability, performance, and reliability of Fulfillment Technology solutions for our retail partner in support of its omni-channel strategy.
- You will work closely with the development, testing, and operations teams to design, implement, and maintain scalable, reliable, and efficient solutions for the production environment.
- You will also troubleshoot and resolve any issues that may arise in the production systems, using various tools and techniques such as monitoring, logging, alerting, automation, and incident management.
- You will also contribute to the continuous improvement of the DevOps practices and processes, such as CI/CD, configuration management, infrastructure as code, and cloud computing.
- You will have a strong background in software engineering, system administration, networking, and cloud technologies.
- You will also have excellent communication and collaboration skills, as well as a passion for learning new technologies and solving complex problems.
Minimum Position Qualifications:
- Bachelor’s Degree in Computer Science/Engineering or related field.
- 4+ years of experience in cloud SRE, DevOps, infrastructure, or a related field.
- 4+ years of experience working with databases, web applications, microservices, event-driven applications, messaging systems, REST APIs and integrations, cloud, support tools, observability, and containerization technologies.
- Knowledge of Java, Spring Boot, microservices, Kafka, Cassandra, and SQL Server.
- Proficiency in scripting languages such as Python or shell.
- 1 year of experience managing system observability tools (Dynatrace, ELK, PagerDuty, Datadog, Azure Monitor, Grafana, etc.).
- Hands-on experience with GitHub Actions for CI/CD automation.
- Knowledge of Linux architecture, security, administration, performance monitoring/tuning, troubleshooting, and production operations.
- Demonstrated skill in working in an Agile environment.
- Demonstrated skill in working with multi-location global teams.
- Proven ability to think and contribute at the strategic level.
- Demonstrated knowledge of eCommerce, Fulfillment, or Retail Technology solutions.
- Demonstrated written, oral, and presentation/public-speaking skills.
Desired Previous Experience/Education:
- Master’s Degree or PhD in computer science, information systems, or related field.
- 4+ years of experience designing or working on high-volume eCommerce applications.
- 2+ years of experience configuring and managing cloud infrastructure (Azure, AWS, GCP).
- 1 year of experience with technologies such as Apache Kafka, Azure Cosmos DB, Apache Cassandra, Ansible, Terraform, Docker, and Kubernetes.
- Experience with Nginx, HAProxy, Squid.
- Experience with CI/CD pipelines using tools such as Jenkins, Spinnaker, Azure DevOps, TeamCity, etc.
- Proficiency in implementing and managing Royal TS or similar cross-platform remote management solutions to provide secure, efficient remote access and system administration across diverse environments.
Key Responsibilities:
- Partner with application engineering, observability, and other support teams within our retail client's ecosystem, as well as business operations partners and third parties as appropriate, to prioritize, address, and drive the resolution of issues and incidents that impact the customer pickup or delivery domains.
- Drive root-cause analysis of critical business and production issues to prevent recurrence, and review and approve proposed solutions.
- Lead Major Incident calls impacting the Pickup Fulfillment domain and provide clear, timely updates on the status of service restoration to key stakeholders.
- Work with the engineering teams to continuously improve the reliability and speed of build environments.
- Increase automation to improve efficiency and quality.
- Ensure traceability, observability, and retrievability of system behavior.
- Build logging, monitoring, and alerting systems to identify bottlenecks and assist with debugging, analysis, and optimization in cloud, on-premises, and store environments.
- Craft solid and clearly explained designs, playbooks, and documentation.
- Participate in an off-hours on-call rotation, and perform periodic off-hours work during maintenance windows.