Infrastructure Engineer (GPU Cluster)

Role
The GPU cluster infrastructure engineer builds and operates the computational foundation that makes large-scale ML model training fast, maximizing the speed at which ML engineers can iterate on experiments.
The mandate is clear: justify the significant investment in GPU cluster infrastructure, in terms of both business ROI and technical value, and optimize the infrastructure with that purpose in mind.
In a hybrid environment combining on-premises and multiple cloud providers, you solve technically deep problems in a domain close to High-Performance Computing (HPC): data transfer optimization, resource utilization, and network performance.
What You’ll Do
- Physical design and device selection for compute environments
- Network design for high-performance computing environments
- Storage system optimization and acceleration
- Development of clustering technologies
- Build and operate compute environments on cloud platforms (AWS, Azure, GCP)
- Optimize cluster utilization and performance across multiple workload types
- Manage compute resource allocation and scheduling
- Troubleshoot infrastructure issues and performance bottlenecks
- Plan and execute next-generation compute environment builds on a roughly 2–3 year cycle
- Communicate with business and finance stakeholders on infrastructure investment planning and ROI justification
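Compute resource allocation is one of the more concrete day-to-day problems in this list. As a rough illustration only (the node names, capacities, and best-fit policy below are hypothetical examples, not Turing's actual scheduler), a minimal GPU allocator might look like:

```python
# Minimal sketch of greedy best-fit GPU allocation across cluster nodes.
# Node names and GPU counts are hypothetical examples.

def allocate(nodes, request):
    """Pick the node with the fewest free GPUs that still fits the
    request (best fit), leaving larger nodes free for large jobs."""
    candidates = [(name, free) for name, free in nodes.items() if free >= request]
    if not candidates:
        return None  # no node fits; the job waits in the queue
    name, free = min(candidates, key=lambda nf: nf[1])
    nodes[name] = free - request
    return name

cluster = {"node-a": 8, "node-b": 4, "node-c": 6}
print(allocate(cluster, 4))  # best fit: node-b
print(allocate(cluster, 8))  # only node-a has 8 free
print(allocate(cluster, 8))  # None: nothing with 8 free remains
```

In practice this kind of policy lives inside a scheduler such as Slurm or Kubernetes rather than hand-rolled code, but the fragmentation trade-off it sketches is the same one you tune there.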
What We’re Looking For
You have deep expertise in cloud infrastructure, GPU computing, and distributed systems. You understand distributed systems fundamentals and can design systems that scale reliably. You’re comfortable with infrastructure as code, automation, and monitoring tools.
Beyond technical depth, you have strong problem-solving skills and communicate clearly with both infrastructure and ML engineering teams. You stay current with evolving infrastructure technologies and can evaluate when to adopt new approaches.
Tech Stack
- Kubernetes (container orchestration)
- GPU cluster management and collective communication (NVIDIA NCCL, distributed training frameworks)
- Infrastructure automation and provisioning
- Monitoring and observability tools
- Cloud infrastructure platforms (AWS, Azure, GCP)
- Python and shell scripting
- High-speed networking (InfiniBand, RoCE)
- Storage systems (NFS, object storage, parallel file systems)
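When benchmarking the interconnect side of this stack with NCCL, one recurring calculation is converting a measured allreduce time into bus bandwidth. The sketch below uses the standard formula from NVIDIA's nccl-tests (busbw = algbw × 2(n−1)/n for allreduce); the message size and timing are made-up example numbers, not measurements:

```python
# Convert an allreduce measurement into algorithm and bus bandwidth,
# using the nccl-tests convention: busbw = algbw * 2*(n-1)/n.
# The 1 GiB size and 25 ms timing below are made-up example numbers.

def allreduce_busbw(size_bytes, time_s, n_ranks):
    algbw = size_bytes / time_s                   # bytes/s as seen by each rank
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks   # traffic actually on the wire
    return algbw, busbw

algbw, busbw = allreduce_busbw(1 << 30, 0.025, 8)
print(f"algbw {algbw / 1e9:.1f} GB/s, busbw {busbw / 1e9:.1f} GB/s")
```

Comparing busbw against the link's rated speed (e.g. InfiniBand NDR) tells you how close a ring allreduce is running to the hardware limit.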
What Makes This Role Special
You can work on infrastructure optimized for a specific, well-defined application: large-scale ML model training. Unlike general infrastructure work that must respond to diverse requirements, at Turing you optimize for a single, extremely concrete demand.
The purpose is clear: justify the significant investment in GPU cluster infrastructure, both in terms of business ROI and technical value. This clarity sharpens the technical problems you need to solve and enables deep technical exploration with full context.
Building and operating infrastructure for large-scale ML training in a domain close to HPC demands exceptional technical depth, and the payoff is concrete: you accelerate ML engineers' experiment iteration cycles and contribute directly to business outcomes.
Key Qualifications
- Experience building and operating large-scale GPU clusters
- Experience in High-Performance Computing (HPC) environments
- Motivated by optimizing infrastructure to accelerate model training
- Experience with job scheduling and resource management optimization
- Experience with hybrid infrastructure combining on-premises and multiple cloud environments
- Strong problem-solving skills in the deep layers of the infrastructure stack, such as networking and storage
Cross-Functional Collaboration
With ML Engineers
You’ll collaborate primarily on training job execution: verifying that jobs run correctly, jointly identifying and resolving errors, and frequently discussing infrastructure approaches to accelerating model training based on ML engineers’ requirements and code characteristics.
This is an opportunity to tackle the technically demanding challenge of accelerating large-scale ML training in a domain close to HPC. ML engineers directly benefit from infrastructure acceleration as faster experiment iteration—and you’ll feel that contribution concretely. Developing deep expertise in high-performance, resource-efficient infrastructure operation is a major technical advantage.
With Software Engineers (MLOps / Cloud Engineers)
You’ll define requirements for the cloud and infrastructure execution environment running large-scale data pipelines—especially reducing data transfer time and ensuring data integrity when moving data from cloud-based pipelines to on-premises or alternative cloud training environments.
Tackling the challenge of large-scale unstructured data transfer across multiple cloud environments is both technically difficult and business-critical—directly accelerating the velocity of the entire autonomous driving development cycle. Building specialized infrastructure in coordination with MLOps requirements is a rare experience available at very few companies.
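Back-of-the-envelope sizing is a constant part of that data-transfer work. As a hedged sketch (the dataset size, link speed, and efficiency factor below are illustrative assumptions, not measured values), estimating how long a cross-environment bulk transfer takes:

```python
# Rough estimate of bulk data-transfer time between environments.
# Dataset size, link speed, and efficiency are illustrative assumptions.

def transfer_hours(dataset_tb, link_gbps, efficiency=0.7):
    """Hours to move dataset_tb terabytes (decimal TB) over a link_gbps
    link, derated by an efficiency factor for protocol/storage overhead."""
    bits = dataset_tb * 1e12 * 8                   # TB -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600

# e.g. 500 TB over a 100 Gbps link at 70% efficiency
print(f"{transfer_hours(500, 100):.1f} h")
```

Estimates like this drive the real design decisions the paragraph above describes: whether to parallelize transfers, stage data, or provision a faster interconnect between clouds and the on-premises cluster.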
Join us:
Take on the challenge of fully autonomous driving with a diverse team of talented members from various backgrounds.