Compiler Engineer – Distributed ML Training
Are you passionate about working at the intersection of machine learning and compiler engineering? As a Compiler Engineer specializing in distributed ML training, you will have the opportunity to shape cutting-edge deep learning models and compiler technologies that scale efficiently across GPU architectures.
This role is with a forward-thinking company revolutionizing the way distributed machine learning is approached. The company is known for pushing boundaries in AI/ML infrastructure by enabling high-performance compute solutions that bridge the gap between software and hardware.
As a Compiler Engineer, you will work on lowering deep learning graphs from popular frameworks like PyTorch and TensorFlow to an intermediate representation (IR) for training. You will take ownership of key areas within the compiler stack, focusing on ensuring reproducibility and optimizing transformations for distributed systems. Your work will directly impact the efficiency of distributed ML workloads by improving graph traversals, code generation, and machine-specific optimizations.
What we can offer you:
- Opportunity to work on challenging problems at the intersection of AI and distributed computing.
- A highly collaborative environment with deep technical expertise.
- Flexible working conditions with high autonomy.
- Competitive compensation, reflective of your expertise and contributions.
Key responsibilities:
- Lower deep learning graphs from frameworks like PyTorch, TensorFlow, or Keras into an IR for distributed training.
- Write and optimize algorithms that transform compute graphs between different operator representations.
- Own compiler development in three key areas:
  - Front-end: handling integration with deep learning frameworks and writing transformation passes in ONNX.
  - Middle-end: writing compiler passes for training compute graphs, integrating kernels, and debugging transformations.
  - Back-end: lowering IR into machine code optimized for GPUs.
This Compiler Engineer role focuses on applying modern compiler technologies like LLVM to distributed ML training systems, with a strong emphasis on reproducibility.