Distributed Systems Engineer
Are you a seasoned Distributed Systems Engineer looking to take on challenging projects in a cutting-edge environment? Join us to build the data and coordination systems that push the boundaries of AI.
About the Company
We are a forward-thinking company dedicated to advancing artificial general intelligence (AGI) to solve some of the world's most critical problems. Our approach combines frontier-scale pre-training, domain-specific reinforcement learning, ultra-long context, and test-time compute to drive progress in the field of AGI.
About the Role
As a Distributed Systems Engineer, you will develop high-performance storage and caching systems that support long-context inference and training on our GPU clusters. You will automate fault detection and recovery to keep training highly available, and you will troubleshoot complex issues across GPUs, networks, storage, OS, and cloud environments, so strong problem-solving skills are central to this role.
What We Can Offer You
- Significant equity as part of total compensation
- 401(k) plan with 6% salary matching
- Comprehensive health, dental, and vision insurance for you and your dependents
- Unlimited paid time off
- Option to work in-person in San Francisco or remotely
- Visa sponsorship and relocation stipend
Key Responsibilities
- Develop high-performance storage and caching systems for long-context inference and training
- Work on the internals of deep learning frameworks in a distributed setting
- Automate fault detection and recovery
- Troubleshoot complex issues across GPUs, networks, storage, OS, and cloud environments
- Design and operate highly available, high-throughput data systems
Qualifications
- Deep knowledge of distributed systems design and public cloud platforms
- Experience with distributed DBMS, batch and stream processing systems, and/or distributed file systems
- Exceptional problem-solving skills up and down the stack
Our team values integrity, hands-on work, teamwork, focus, and quality. If you are a Distributed Systems Engineer with a passion for innovation and solving complex problems, we encourage you to apply and become part of our dynamic team.
Keywords: distributed systems, GPU clusters, fault detection, deep learning frameworks, high-performance storage.