GPU Software Engineer – Distributed ML Training
Are you driven by the challenges of optimizing GPU compute for distributed machine learning? As a GPU Software Engineer focused on distributed ML training, you'll be responsible for developing high-performance compute kernels and contributing to a robust multi-GPU infrastructure for modern machine learning applications.
This position is with a company working at the cutting edge of AI infrastructure. They specialize in scalable solutions for efficient multi-GPU compute, supporting the growth of AI/ML technologies through innovative hardware and software integration.
Your primary role will be to develop performant GPU kernels and contribute to the compute infrastructure for training deep learning models. You'll focus on numerical stability and compute flows, ensuring reproducibility in distributed environments. Working across GPU-specific optimizations, you'll drive real-world performance improvements in training systems that operate at scale.
What we can offer you:
- A dynamic environment with deep technical challenges at the intersection of ML and GPU compute.
- A strong culture of autonomy, where your expertise drives meaningful contributions.
- Competitive compensation aligned with your impact and experience.
- Flexible working conditions with a focus on collaboration and innovation.
Key responsibilities:
- Develop and optimize GPU kernels and infrastructure for distributed training across the stack, from deep learning frameworks (e.g., PyTorch) down to intermediate representations (IR).
- Design novel algorithms with a focus on numerical stability and efficient multi-GPU training flows.
- Implement low-level GPU optimizations to enhance performance and ensure numerical accuracy.
- Contribute to reproducibility in distributed machine learning systems.
This GPU Software Engineer role emphasizes hands-on work with CUDA, PTX, and deep learning frameworks, with the goal of delivering high-performance compute for distributed ML training.