As a Senior Site Reliability Engineer (SRE) at Striveworks, you will take ownership of specific product deployments by maintaining, optimizing, and enhancing our on-premises and cloud computing environments. You will play a crucial role in deploying our software solutions to clients: you will execute the technical aspects of implementation projects and ensure the seamless integration, customization, and configuration of our software. Your expertise will be essential as we deploy new instances of Striveworks’ machine learning operations (MLOps) capabilities to customer infrastructure.
Your day-to-day will include:
- Automating infrastructure as code (IaC) to stand up virtual machines and deploy containers, services, and other infrastructure, drawing on your expertise to deploy custom Kubernetes clusters in AWS, Azure, GCP, on-premises, or hybrid cloud environments
- Working with platform developers and DevOps to define requirements and build solutions for customer use cases of the platform
- Deploying software to unclassified, CUI, Secret, and Top Secret DoD networks
- Responding to incidents and performing initial triage of critical system faults
The Senior SRE works on the DevOps team and acts as a liaison between DevOps, platform developers, and professional services teams, taking on operational tasks to ensure the efficient functioning of Striveworks’ customer solutions. The Senior SRE monitors, automates, and improves software reliability, performance, and availability in support of the IT needs of various projects. They work alongside a team of software engineers and data scientists, helping them deploy and operate their work as functional products and learning from them so that building effective AI solutions becomes second nature. They may also provide guidance and leadership to junior SRE team members.
You will directly contribute to the success of mission-critical systems for national security and commercial clients. You will be expected to wear multiple hats and to step in wherever improvements are needed, and you will be given the latitude to explore new technologies and solutions.
This position offers a hybrid work environment but requires proximity to a DoD Sensitive Compartmented Information Facility (SCIF) in Pinehurst, NC, or Tampa, FL.
Here’s what we’re looking for:
- 6+ years of direct, hands-on experience in:
  - Microservice deployment in Kubernetes
  - Diagnosing and resolving issues within containerized environments
  - Helm chart and Kustomization development and deployment
  - Python and Bash programming
  - Automation and IaC (e.g., Terraform, Ansible)
  - Cloud infrastructure (e.g., AWS, Azure, GCP, or OpenStack)
  - Managing and troubleshooting Linux systems (e.g., RHEL, Ubuntu, CentOS)
  - Software deployments to on-premises and cloud-based unclassified, CUI, Secret, and Top Secret networks within the DoD
- The ability to work cross-functionally with platform developers to define requirements and build solutions for customer use cases of the platform
- The ability to respond professionally and competently to incident reports and to triage critical system faults
- Active Top Secret security clearance and in-depth familiarity with DoD networking, tools, infrastructure, security requirements, and policies
Full job description: https://job-boards.greenhouse.io/striveworks/jobs/6040037003