Lead GPU Infrastructure Engineer (HPC / AI Infrastructure)
800299
Posted: 11/05/2026
Salary: Competitive
Location: North America - Remote
Job type: Permanent
We’re partnering with a rapidly scaling technology business building advanced compute infrastructure for next-generation AI systems. This is an opportunity for a senior infrastructure engineer to play a key role in designing and operating large-scale GPU environments supporting highly demanding, enterprise-grade workloads across modern high-performance compute platforms.
The Company
Our client is building next-generation infrastructure at the intersection of AI, high-performance computing, and distributed systems. They’re scaling advanced GPU environments powering demanding workloads for globally recognised technology platforms and emerging digital ecosystems.
With major growth underway, the team is investing heavily in next-generation GPU infrastructure and high-performance compute environments. Infrastructure engineering sits at the core of the company’s long-term direction.
The Role
We’re looking for an experienced Infrastructure Engineer with expertise across large-scale compute, GPU, or high-performance infrastructure environments. This role offers the opportunity to own advanced infrastructure platforms spanning automation, scalability, observability, and operational resilience in a highly technical environment.
You’ll likely come from teams operating at significant scale, where reliability and performance are mission critical.
Responsibilities:
- Own the lifecycle management of large-scale GPU infrastructure, from provisioning and firmware validation through to operational reliability.
- Lead operations across high-density, liquid-cooled compute environments supporting next-generation AI workloads.
- Build automated observability and remediation systems using Prometheus, Grafana, NVIDIA DCGM, and infrastructure automation tooling.
- Drive NetBox DCIM integration, asset management, IPAM, and infrastructure compliance across complex compute environments.
- Act as a senior technical lead for infrastructure operations, incident response, vendor management, and enterprise-level infrastructure support.
Requirements:
- Strong experience managing large-scale GPU, HPC, or high-performance compute infrastructure.
- Deep hands-on expertise with NVIDIA GPU systems, including H200, B200, or B300 environments.
- Advanced knowledge of InfiniBand, NVLink, NVSwitch, and high-throughput networking architectures.
- Strong Linux systems engineering background with infrastructure automation using Python or Go.
- Experience with observability and monitoring tooling including Prometheus, Grafana, NVIDIA DCGM, and SNMP.
- Proven experience across bare-metal provisioning, infrastructure lifecycle management, and automated/self-healing systems.
Nice to Have:
- Experience with liquid-cooled or high-density compute environments.
- Familiarity with NVIDIA Mission Control and GPU cluster management.
- Exposure to confidential compute technologies and attestation workflows.
- Experience building infrastructure standards in fast-scaling environments.
Benefits:
- Competitive salary and benefits package.
- Opportunity to build next-generation AI infrastructure.
- Exposure to cutting-edge GPU and HPC environments.
- Strong ownership across infrastructure and automation.
- Engineering-led culture working on mission-critical systems.
To apply, please submit your application via the advert or contact Andrew directly at andrew@axiomrecruit.com.
Andrew Phillips
Founder