
Prior Labs

2mo ago

Senior ML Infrastructure Engineer (m/f/x)

Freiburg im Breisgau, Berlin

Full-time, On-site, Senior

AI/ML

Data Science

Nejo AI Summary

Multi-cluster GPU infrastructure management for tabular foundation models. Deep Slurm and cluster optimization experience required. Commitment to diversity and inclusion.

Requirements

5+ years building/operating production GPU infrastructure or distributed training systems at scale
Deep hands-on Slurm and cluster management experience
  Debugging scheduling failures
  Optimizing multi-tenant GPU workload utilization
  Operating infrastructure with real cost of downtime
Expert-level systems thinking: memory bandwidth, GPU profiling
  Reasoning about hardware, not configs
Strong Python and genuine fluency with PyTorch internals
  Profiling training runs to identify bottlenecks
Track record of infrastructure decisions improving training throughput or cost efficiency
Strong AI tooling skills
  Fluent use of Claude Code, Cursor, or similar
Experience operating at tens-of-millions-scale GPU spend
Multi-cloud or hybrid HPC/cloud infrastructure experience
Triton, CUDA, or custom kernel experience
Experience scaling from single cluster to multi-cluster orchestration
Background building experiment tracking, model registry, or ML pipeline tooling

Tasks

Own and evolve multi-cluster GPU infrastructure
  Manage Slurm on GCP and multi-provider/new hardware deployments
  Optimize cluster architecture, scheduling, and reliability
Drive GPU utilization and training throughput
  Profile and optimize memory for distributed training
  Identify and resolve communication bottlenecks in distributed training
  Debug distributed training systems for large runs
Architect next-generation infrastructure
  Orchestrate multi-cluster environments
  Integrate new GPU generations
  Diversify cloud providers
  Plan capacity for growing compute demands
Build the developer productivity layer
  Develop CI pipelines
  Implement experiment tracking
  Manage model registry
  Oversee data processing infrastructure
  Create internal tooling for research iteration
Own the compute budget
  Analyze cost per FLOP across providers and hardware
  Minimize wasted compute resources

Work Experience

5+ years

Education

Bachelor's degree or Master's degree

Languages

English – Business Fluent

Tools & Technologies

GPU
Slurm
Python
PyTorch
Claude Code
Cursor
Triton
CUDA

Benefits

Informal Culture

Commitment to diversity and inclusion

Job Security

Safe and inclusive environment

Other Benefits

Equal opportunities

Find the original job posting in its most current version here. Nejo automatically captured this job from the website of Prior Labs and processed the information on Nejo with the help of AI for you. Despite careful analysis, some information may be incomplete or inaccurate. Please always verify all details in the original posting! Content and copyrights of the original posting belong to the advertising company.
