Member of Technical Staff: Machine Learning Infrastructure Engineer
Overview: Join a team dedicated to deepening the collaboration between humans and computers, aiming for breakthroughs in AI that redefine the user experience from the ground up. We believe in the power of a small, focused team to drive significant advancements in technology.
About the Company: We are a well-capitalized, multi-disciplinary team backed by leading venture partners and major technology companies, which gives us the resources to push the boundaries of AI. Our mission is to solve real-world AI problems by innovating at every layer of the tech stack.
About the Role: As a Machine Learning Infrastructure Engineer, you will architect and build the compute infrastructure that powers our model training and serving. This role demands a deep understanding of the entire backend stack, from frameworks and compilers to runtimes and kernels. Familiarity with cloud-based infrastructure tools like Kubernetes and Docker is essential.
What We Can Offer You:
- Competitive salary and benefits
- Relocation assistance to San Francisco
- Opportunity to work on cutting-edge AI technology
- Collaborative, innovative team environment
- Professional growth and development
Key Responsibilities:
- Design, build, and maintain scalable machine learning infrastructure for model training and inference
- Implement distributed training and serving systems that enable large language models (LLMs) to run at scale
- Develop tools and frameworks to automate and streamline ML experimentation and management
- Collaborate with researchers and product engineers to enhance product experiences using LLMs
- Optimize performance and efficiency across various AI accelerators
- Research and write custom kernels to improve training and serving infrastructure
What We Are Looking For:
- Strong understanding of emerging AI accelerators such as TPUs, IPUs, and HPUs, and the tradeoffs between them
- Knowledge of parallel computing concepts and distributed systems
- Experience performance-tuning training and/or inference workloads, whether MLPerf benchmarks or internal production workloads
- 6+ years of industry experience designing large-scale ML infrastructure systems
- Familiarity with training frameworks such as Megatron and DeepSpeed, and deployment tools such as vLLM, TGI, and TensorRT-LLM
- Proficiency in kernel programming languages such as OpenAI Triton and Pallas, and compilers such as XLA
- Experience with INT8/FP8 training and inference, quantization, and distillation
- Knowledge of container technologies like Docker and Kubernetes, and cloud platforms such as AWS, GCP
- Intermediate fluency with networking fundamentals such as VPCs, subnets, routing tables, and firewalls