Role Overview
We are looking for a GPU Research Engineer to optimize inference performance for large language models (LLMs) by developing and tuning custom GPU kernels. This role involves low-level performance work, CUDA/Triton programming, and debugging deep learning workloads to maximize throughput and efficiency.
You will collaborate with ML engineers, systems researchers, and hardware teams to push the limits of GPU acceleration for AI workloads.
Responsibilities
- Develop, optimize, and debug custom GPU kernels using CUDA, Triton, and other low-level performance libraries.
- Profile and analyze deep learning inference workloads to identify bottlenecks and implement optimizations.
- Improve memory bandwidth utilization and apply kernel fusion, tiling strategies, and tensor parallelism for efficient LLM execution (a representative kernel sketch follows this list).
- Work closely with ML and infrastructure teams to enhance model execution across different GPU architectures (e.g., NVIDIA H100 and A100, AMD MI300).
- Research and implement state-of-the-art techniques for reducing latency, improving throughput, and minimizing memory overhead.
- Contribute to open-source deep learning frameworks or internal acceleration toolkits as needed.
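To make the kind of kernel work concrete, here is a minimal, purely illustrative CUDA sketch of a fused bias-add + ReLU kernel; the function name, shapes, and launch configuration are hypothetical and not drawn from any existing codebase. Fusing the two elementwise operations into a single kernel avoids a second pass over global memory, which is the basic idea behind the kernel-fusion work described above.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative fused bias-add + ReLU kernel: one pass over global memory
// instead of two separate elementwise kernels (bias add, then ReLU).
__global__ void fused_bias_relu(const float* __restrict__ x,
                                const float* __restrict__ bias,
                                float* __restrict__ y,
                                int rows, int cols) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total = rows * cols;
    if (idx < total) {
        float v = x[idx] + bias[idx % cols];  // broadcast bias across rows
        y[idx] = v > 0.0f ? v : 0.0f;         // ReLU applied in the same pass
    }
}

int main() {
    const int rows = 1024, cols = 4096;
    const int total = rows * cols;
    float *x, *bias, *y;
    cudaMallocManaged(&x, total * sizeof(float));
    cudaMallocManaged(&bias, cols * sizeof(float));
    cudaMallocManaged(&y, total * sizeof(float));
    for (int i = 0; i < total; ++i) x[i] = (i % 7) - 3.0f;
    for (int j = 0; j < cols; ++j) bias[j] = 0.5f;

    int threads = 256;
    int blocks = (total + threads - 1) / threads;
    fused_bias_relu<<<blocks, threads>>>(x, bias, y, rows, cols);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);

    cudaFree(x); cudaFree(bias); cudaFree(y);
    return 0;
}
```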
Requirements
- Strong experience in CUDA, Triton, or OpenCL for GPU programming.
- Deep understanding of GPU architectures, memory hierarchy, and parallel computing.
- Experience profiling and debugging GPU workloads with tools such as NVIDIA Nsight Systems/Compute, and familiarity with libraries and runtimes such as cuDNN, TensorRT, or PyTorch/XLA.
- Solid knowledge of ML frameworks such as PyTorch, JAX, or TensorFlow and their GPU execution models.
- Familiarity with numerical precision trade-offs (FP16, BF16, INT8 quantization) and mixed-precision computation (see the mixed-precision sketch after this list).
- Proficiency in C++ and Python.
- Prior experience working on inference optimizations for large-scale ML models is a plus.
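As a small, hypothetical illustration of the mixed-precision trade-offs listed above, the sketch below reads FP16 operands and accumulates in FP32, a common pattern that preserves accuracy while reducing memory traffic; the kernel name, sizes, and launch configuration are illustrative only.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative mixed-precision kernel: FP16 storage, FP32 accumulation.
// Each block reduces one row of a [rows x cols] half-precision matrix.
__global__ void row_sum_fp16_fp32(const __half* __restrict__ x,
                                  float* __restrict__ out,
                                  int cols) {
    extern __shared__ float partial[];
    int row = blockIdx.x;
    float acc = 0.0f;
    // Strided loop: each thread accumulates part of the row in FP32.
    for (int j = threadIdx.x; j < cols; j += blockDim.x) {
        acc += __half2float(x[row * cols + j]);
    }
    partial[threadIdx.x] = acc;
    __syncthreads();
    // Tree reduction in shared memory (blockDim.x must be a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[row] = partial[0];
}

int main() {
    const int rows = 8, cols = 4096, threads = 256;
    __half* x; float* out;
    cudaMallocManaged(&x, rows * cols * sizeof(__half));
    cudaMallocManaged(&out, rows * sizeof(float));
    for (int i = 0; i < rows * cols; ++i) x[i] = __float2half(0.25f);
    row_sum_fp16_fp32<<<rows, threads, threads * sizeof(float)>>>(x, out, cols);
    cudaDeviceSynchronize();
    printf("row 0 sum = %f (expected %f)\n", out[0], 0.25f * cols);
    cudaFree(x); cudaFree(out);
    return 0;
}
```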
Nice to Have
- Experience with compiler optimizations, MLIR, or TVM.
- Contributions to open-source deep learning libraries related to GPU acceleration.
- Hands-on experience with distributed inference techniques (tensor/model parallelism).
- Knowledge of hardware-specific optimizations for TPUs, NPUs, or FPGAs.
Why Join Us?
- Work on cutting-edge AI infrastructure and shape the future of large-scale LLM inference.
- Collaborate with world-class researchers and engineers optimizing AI workloads at scale.
- Access to state-of-the-art hardware, including the latest GPUs and AI accelerators.
- Competitive compensation, equity, and benefits package.