Member of Technical Staff: Machine Learning Infrastructure Engineer
Overview: Join a team dedicated to deepening the collaboration between humans and computers, aiming for breakthroughs in AI that redefine the user experience from the ground up. We believe in the power of a small, focused team to drive significant advancements in technology.
About the Company: We are a well-capitalized, multi-disciplinary team backed by leading venture partners and major technology companies, which gives us the resources to push the boundaries of AI. Our mission is to solve real-world AI problems by innovating at every layer of the tech stack.
About the Role: As a Machine Learning Infrastructure Engineer, you will architect and build the compute infrastructure that powers our model training and serving. This role demands a deep understanding of the entire backend stack, from frameworks and compilers to runtimes and kernels. Familiarity with cloud-based infrastructure tools like Kubernetes and Docker is essential.
What We Can Offer You:
- Competitive salary and benefits
- Relocation assistance to San Francisco
- Opportunity to work on cutting-edge AI technology
- Collaborative, innovative team environment
- Professional growth and development
Key Responsibilities:
- Design, build, and maintain scalable machine learning infrastructure for model training and inference
- Implement distributed training and serving systems that enable large language models (LLMs) to run at scale
- Develop tools and frameworks to automate and streamline ML experimentation and management
- Collaborate with researchers and product engineers to enhance product experiences using LLMs
- Optimize performance and efficiency across various AI accelerators
- Research and write custom kernels to improve training and serving infrastructure
What We Are Looking For:
- Strong understanding of emerging AI accelerators such as TPUs, IPUs, and HPUs, and the tradeoffs between them
- Knowledge of parallel computing concepts and distributed systems
- Experience performance-tuning training and/or inference workloads, whether MLPerf benchmarks or internal production workloads
- 6+ years of industry experience designing large-scale ML infrastructure systems
- Familiarity with training frameworks such as Megatron and DeepSpeed, and deployment tools such as vLLM, TGI, and TensorRT-LLM
- Proficiency in kernel programming languages such as OpenAI Triton and Pallas, and compilers such as XLA
- Experience with INT8/FP8 training and inference, quantization, and distillation
- Knowledge of container technologies like Docker and Kubernetes, and cloud platforms such as AWS, GCP
- Intermediate fluency with networking fundamentals such as VPCs, subnets, routing tables, and firewalls