Software Engineer - Supercomputing Platform & Infrastructure
Introduction: Are you a software engineer with a passion for building resilient and optimized solutions for AI workloads? We are seeking a Software Engineer for our Supercomputing Platform & Infrastructure team to work on massive computing clusters. This role can be based in San Francisco or performed remotely.
About the Company: We are a forward-thinking organization committed to advancing humanity’s progress by developing safe AGI. Our mission focuses on automating research and code generation, leveraging frontier-scale pre-training, domain-specific RL, ultra-long context, and test-time compute. We aim to enhance model reliability and alignment beyond human capabilities.
About the Role: As a Software Engineer on our Supercomputing Platform & Infrastructure team, you will be integral to designing and building highly available and secure AI training and inference infrastructure. You will ensure the reliability and optimization of GPU workloads, troubleshoot complex issues, and improve the efficiency of our engineering processes.
What We Can Offer You:
- Significant equity as part of total compensation
- 401(k) plan with 6% salary matching
- Comprehensive health, dental, and vision insurance for you and your dependents
- Unlimited paid time off
- Flexible work options: in-person in San Francisco or remote
- Visa sponsorship and relocation stipend
Key Responsibilities:
- Build and maintain a software stack for large-scale (thousands of GPUs) AI training and inference infrastructure
- Troubleshoot and resolve issues across GPU resources, networking, OS, drivers, and cloud environments
- Automate detection and recovery processes to ensure high availability and security
- Investigate and resolve incidents affecting security and availability
- Develop solutions to enhance engineering efficiency and speed
- Proactively support the research and engineering teams
Keywords: In this role as a Software Engineer, you will work with networking technologies and cloud platforms such as GCP, AWS, and Azure, and apply your infrastructure-as-code (IaC) knowledge with tools such as Terraform or Pulumi. Your expertise will ensure the reliability and optimization of our AI workloads and GPU deployments.