Infrastructure Engineer (GPU Cluster)

Role
The GPU cluster infrastructure engineer builds and operates the computational foundation that makes large-scale ML model training fast, maximizing the speed at which ML engineers can iterate on experiments.
The mandate is clear: justify the significant investment in GPU cluster infrastructure, in terms of both business ROI and technical value, and optimize the infrastructure with that purpose in mind.
In a hybrid environment combining on-premises and multiple cloud providers, you solve technically deep problems in a domain close to High-Performance Computing (HPC): data transfer optimization, resource utilization, and network performance.
What You’ll Do
- Physical design and device selection for compute environments
- Network design for high-performance computing environments
- Storage system optimization and acceleration
- Development of clustering technologies
- Build and operate compute environments on cloud platforms (AWS, Azure, GCP)
- Optimize cluster utilization and performance across multiple workload types
- Manage compute resource allocation and scheduling
- Troubleshoot infrastructure issues and performance bottlenecks
- Plan and execute next-generation compute environment builds on a roughly 2–3 year cycle
- Communicate with business and finance stakeholders on infrastructure investment planning and ROI justification
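Compute resource allocation is one of the more concrete day-to-day problems in this list. As a rough illustration only (the node names, capacities, and best-fit policy below are hypothetical examples, not Turing's actual scheduler), a minimal GPU allocator might look like:

```python
# Minimal sketch of greedy best-fit GPU allocation across cluster nodes.
# Node names and GPU counts are hypothetical examples.

def allocate(nodes, request):
    """Pick the node with the fewest free GPUs that still fits the
    request (best fit), leaving larger nodes free for large jobs."""
    candidates = [(name, free) for name, free in nodes.items() if free >= request]
    if not candidates:
        return None  # no node fits; the job waits in the queue
    name, free = min(candidates, key=lambda nf: nf[1])
    nodes[name] = free - request
    return name

cluster = {"node-a": 8, "node-b": 4, "node-c": 6}
print(allocate(cluster, 4))  # best fit: node-b
print(allocate(cluster, 8))  # only node-a has 8 free
print(allocate(cluster, 8))  # None: nothing with 8 free remains
```

In practice this kind of policy lives inside a scheduler such as Slurm or Kubernetes rather than hand-rolled code, but the fragmentation trade-off it sketches is the same one you tune there.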
What We’re Looking For
You have deep expertise in cloud infrastructure, GPU computing, and distributed systems. You understand distributed systems fundamentals and can design systems that scale reliably. You’re comfortable with infrastructure as code, automation, and monitoring tools.
Beyond technical depth, you have strong problem-solving skills and communicate clearly with both infrastructure and ML engineering teams. You stay current with evolving infrastructure technologies and can evaluate when to adopt new approaches.
Tech Stack
- Kubernetes (container orchestration)
- GPU cluster management and collective communication (NVIDIA NCCL, distributed training frameworks)
- Infrastructure automation and provisioning
- Monitoring and observability tools
- Cloud infrastructure platforms (AWS, Azure, GCP)
- Python and shell scripting
- High-speed networking (InfiniBand, RoCE)
- Storage systems (NFS, object storage, parallel file systems)
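When benchmarking the interconnect side of this stack with NCCL, one recurring calculation is converting a measured allreduce time into bus bandwidth. The sketch below uses the standard formula from NVIDIA's nccl-tests (busbw = algbw × 2(n−1)/n for allreduce); the message size and timing are made-up example numbers, not measurements:

```python
# Convert an allreduce measurement into algorithm and bus bandwidth,
# using the nccl-tests convention: busbw = algbw * 2*(n-1)/n.
# The 1 GiB size and 25 ms timing below are made-up example numbers.

def allreduce_busbw(size_bytes, time_s, n_ranks):
    algbw = size_bytes / time_s                   # bytes/s as seen by each rank
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks   # traffic actually on the wire
    return algbw, busbw

algbw, busbw = allreduce_busbw(1 << 30, 0.025, 8)
print(f"algbw {algbw / 1e9:.1f} GB/s, busbw {busbw / 1e9:.1f} GB/s")
```

Comparing busbw against the link's rated speed (e.g. InfiniBand NDR) tells you how close a ring allreduce is running to the hardware limit.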
What Makes This Role Special
You can work on infrastructure optimized for a specific, well-defined application: large-scale ML model training. Unlike general infrastructure work that must respond to diverse requirements, at Turing you optimize for a single, extremely concrete demand.
The purpose is clear: justify the significant investment in GPU cluster infrastructure, both in terms of business ROI and technical value. This clarity sharpens the technical problems you need to solve and enables deep technical exploration with full context.
Building and operating infrastructure for large-scale ML training in a domain close to HPC demands exceptional technical depth, and the payoff is concrete: you accelerate ML engineers' experiment iteration cycles and contribute directly to business outcomes.
Key Qualifications
- Experience building and operating large-scale GPU clusters
- Experience in High-Performance Computing (HPC) environments
- Motivated by optimizing infrastructure to accelerate model training
- Experience with job scheduling and resource management optimization
- Experience with hybrid infrastructure combining on-premises and multiple cloud environments
- Strong problem-solving skills in the deep layers of the infrastructure stack, such as networking and storage
Cross-Functional Collaboration
With ML Engineers
You’ll collaborate primarily on training job execution: verifying that jobs run correctly, jointly identifying and resolving errors, and frequently discussing infrastructure approaches to accelerating model training based on ML engineers’ requirements and code characteristics.
This is an opportunity to tackle the technically demanding challenge of accelerating large-scale ML training in a domain close to HPC. ML engineers directly benefit from infrastructure acceleration as faster experiment iteration—and you’ll feel that contribution concretely. Developing deep expertise in high-performance, resource-efficient infrastructure operation is a major technical advantage.
With Software Engineers (MLOps / Cloud Engineers)
You’ll define requirements for the cloud and infrastructure execution environment running large-scale data pipelines—especially reducing data transfer time and ensuring data integrity when moving data from cloud-based pipelines to on-premises or alternative cloud training environments.
Tackling the challenge of large-scale unstructured data transfer across multiple cloud environments is both technically difficult and business-critical—directly accelerating the velocity of the entire autonomous driving development cycle. Building specialized infrastructure in coordination with MLOps requirements is a rare experience available at very few companies.
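Back-of-the-envelope sizing is a constant part of that data-transfer work. As a hedged sketch (the dataset size, link speed, and efficiency factor below are illustrative assumptions, not measured values), estimating how long a cross-environment bulk transfer takes:

```python
# Rough estimate of bulk data-transfer time between environments.
# Dataset size, link speed, and efficiency are illustrative assumptions.

def transfer_hours(dataset_tb, link_gbps, efficiency=0.7):
    """Hours to move dataset_tb terabytes (decimal TB) over a link_gbps
    link, derated by an efficiency factor for protocol/storage overhead."""
    bits = dataset_tb * 1e12 * 8                   # TB -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600

# e.g. 500 TB over a 100 Gbps link at 70% efficiency
print(f"{transfer_hours(500, 100):.1f} h")
```

Estimates like this drive the real design decisions the paragraph above describes: whether to parallelize transfers, stage data, or provision a faster interconnect between clouds and the on-premises cluster.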
Join us:
Take on the challenge of fully autonomous driving with a diverse team of talented members from various backgrounds.