About the Role: Machine Learning Infrastructure Engineer
As a Machine Learning Infrastructure Engineer, you will spearhead the architecture and development of the compute infrastructure essential for model training and deployment. This role demands a deep understanding of backend technologies, from ML frameworks and compilers to container orchestration and cloud infrastructure such as Docker and Kubernetes.
Key Responsibilities:
- Designing, building, and maintaining scalable machine learning infrastructure for model training and inference.
- Implementing scalable distributed systems to enhance training of large language models (LLMs).
- Developing tools and frameworks to automate and streamline ML experimentation and management.
- Collaborating closely with researchers and product engineers to integrate advanced AI capabilities into impactful products.
- Optimizing performance and efficiency across various accelerators and infrastructure layers.
- Exploring new techniques and developing custom solutions, including kernel optimizations, to improve system performance.
What We Are Looking For:
- Strong understanding of AI accelerator architectures (e.g., GPU, TPU, IPU, HPU) and their tradeoffs.
- Knowledge of parallel computing concepts and distributed systems.
- Experience in performance tuning of LLM workloads, ideally with training frameworks such as Megatron-LM and inference frameworks such as vLLM.
- Proficiency in kernel programming languages such as OpenAI Triton and compilers such as XLA.
- Familiarity with INT8/FP8 training and inference, quantization, and distillation techniques.
- Expertise in container technologies (e.g., Docker, Kubernetes) and cloud platforms (e.g., AWS, GCP).
- Working knowledge of networking fundamentals (VPCs, subnets, routing tables, firewalls).
What We Can Offer You:
- Opportunity to work with cutting-edge AI technologies in a well-funded environment.
- Competitive compensation package with benefits.
- Relocation assistance for new hires.
- Dynamic work environment with a focus on collaboration and innovation.
Keywords: LLM, Large Language Model, Machine Learning, GPU, Graphics Processing Unit, ML Infrastructure, Cloud Computing, Kubernetes, K8s, Docker, Containerization, Hardware, TPU, Tensor Processing Unit, AWS, GCP, Azure, Compiler, Kernel, CUDA, Triton, GPU Programming