You're an engineer who lives and breathes high-performance machine learning. You have a deep understanding of how to make AI models run faster and more efficiently, and you're excited about pushing the boundaries of what's possible with current hardware.
At Replicate, we're building the fastest way to deploy machine learning models. Your role will be crucial in optimizing the performance of the diverse range of models we host, ensuring they run as efficiently as possible on our infrastructure.
We're looking for the right person, not just someone who checks boxes, so you don't need to have all of these qualities. But you might have some of them:
- Strong applied engineering skills. You've deployed machine learning models in large-scale production environments and know the challenges that come with doing so.
- Deep expertise in CUDA programming and GPU acceleration techniques. You can write custom kernels in your sleep (there's a small sketch of this kind of work after this list).
- Proficiency in C++ and Python. You're comfortable diving deep into low-level optimizations and high-level model architectures alike.
- Extensive experience with deep learning frameworks like PyTorch or JAX. You know their strengths, weaknesses, and how to squeeze every ounce of performance out of them.
- A solid grasp of machine learning algorithms, especially diffusion models, large language models, and other generative AI techniques.
- Familiarity with techniques like model quantization, distillation, and pruning. You understand the tradeoffs and know when to apply which.
- You stay up-to-date with the latest developments in ML performance optimization. When a new technique drops, you're already thinking about how to implement it.
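As a taste of the kind of kernel work this involves, here's a minimal, hypothetical sketch (not Replicate code; all names and shapes are made up) of fusing a bias add and ReLU into a single CUDA kernel, so each element takes one trip through global memory instead of two:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Fused bias add + ReLU: one read and one write per element instead of
// materializing an intermediate tensor between two separate kernels.
__global__ void fused_bias_relu(const float* __restrict__ x,
                                const float* __restrict__ bias,
                                float* __restrict__ out,
                                int rows, int cols) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < rows * cols) {
        float v = x[idx] + bias[idx % cols];  // bias is broadcast per column
        out[idx] = v > 0.0f ? v : 0.0f;       // ReLU fused into the same pass
    }
}

int main() {
    const int rows = 1024, cols = 4096;
    const int n = rows * cols;
    float *x, *bias, *out;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&bias, cols * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = (i % 7) - 3.0f;  // toy data
    for (int i = 0; i < cols; ++i) bias[i] = 0.5f;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    fused_bias_relu<<<blocks, threads>>>(x, bias, out, rows, cols);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);
    cudaFree(x); cudaFree(bias); cudaFree(out);
    return 0;
}
```

Fusing elementwise ops like this is often the first latency win on memory-bound models: the arithmetic is trivial, so the kernel's cost is almost entirely bandwidth.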
You might be particularly good for this job if:
- You've written custom CUDA kernels to significantly improve model latency and can share war stories about the process.
- You can discuss the tradeoffs between fp8 and int8 quantization in depth, and have applied either (or both) to whatever hot new model dropped last week (a concrete sketch of the int8 side follows this list).
- You get excited about diving into academic papers on ML optimization techniques and turning them into practical, production-ready code.
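To make that tradeoff concrete: int8 spends all of its bits on uniform steps, so a per-tensor symmetric scheme like the hypothetical sketch below works well when values have modest dynamic range, while fp8 (e.g. e4m3) gives up uniform precision for exponent range and handles outliers better. Everything here is illustrative, not production code:

```cuda
#include <cstdio>
#include <cstdint>
#include <cmath>
#include <cuda_runtime.h>

// Per-tensor symmetric int8 quantization: q = round(x / scale), clamped
// to [-127, 127], with scale derived from the tensor's absolute maximum.
__global__ void quantize_int8(const float* __restrict__ x,
                              int8_t* __restrict__ q,
                              float inv_scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = rintf(x[i] * inv_scale);          // round to nearest
        v = fminf(fmaxf(v, -127.0f), 127.0f);       // clamp to int8 range
        q[i] = static_cast<int8_t>(v);
    }
}

int main() {
    const int n = 1 << 20;
    float* x;
    int8_t* q;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&q, n * sizeof(int8_t));

    // Toy tensor; a real pipeline would calibrate from activation statistics.
    float absmax = 0.0f;
    for (int i = 0; i < n; ++i) {
        x[i] = std::sin(i * 0.001f) * 4.0f;
        float a = std::fabs(x[i]);
        if (a > absmax) absmax = a;
    }
    float scale = absmax / 127.0f;  // per-tensor symmetric scale

    int threads = 256;
    quantize_int8<<<(n + threads - 1) / threads, threads>>>(x, q, 1.0f / scale, n);
    cudaDeviceSynchronize();

    // Dequantize on the host and report the mean absolute quantization error.
    double err = 0.0;
    for (int i = 0; i < n; ++i) err += std::fabs(x[i] - q[i] * scale);
    printf("scale = %g, mean abs error = %g\n", scale, err / n);

    cudaFree(x); cudaFree(q);
    return 0;
}
```

A production pipeline would typically use per-channel scales and proper calibration rather than a single toy absmax, but the core round-and-clamp step is the same.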