Senior ML Infrastructure Engineer (m/f/x)
Multi-cluster GPU infrastructure management for tabular foundation models. Deep Slurm and cluster optimization experience required. Commitment to diversity and inclusion.
Requirements
- 5+ years building/operating production GPU infrastructure or distributed training systems at scale
- Deep hands-on Slurm and cluster management experience
  - Debugging scheduling failures
  - Optimizing multi-tenant GPU workload utilization
  - Operating infrastructure with real cost of downtime
- Expert-level systems thinking: memory bandwidth, GPU profiling
  - Reasoning about hardware, not configs
- Strong Python and genuine fluency with PyTorch internals
  - Profiling training runs to identify bottlenecks
- Track record of infrastructure decisions improving training throughput or cost efficiency
- Strong AI tooling skills
  - Fluent use of Claude Code, Cursor, or similar
- Experience operating at tens-of-millions-scale GPU spend
- Multi-cloud or hybrid HPC/cloud infrastructure experience
- Triton, CUDA, or custom kernel experience
- Experience scaling from single cluster to multi-cluster orchestration
- Background building experiment tracking, model registry, or ML pipeline tooling
Tasks
- Own and evolve multi-cluster GPU infrastructure
  - Manage Slurm on GCP and multi-provider/new hardware deployments
  - Optimize cluster architecture, scheduling, and reliability
- Drive GPU utilization and training throughput
  - Profile and optimize memory for distributed training
  - Identify and resolve communication bottlenecks in distributed training
  - Debug distributed training systems for large runs
- Architect next-generation infrastructure
  - Orchestrate multi-cluster environments
  - Integrate new GPU generations
  - Diversify cloud providers
  - Plan capacity for growing compute demands
- Build the developer productivity layer
  - Develop CI pipelines
  - Implement experiment tracking
  - Manage model registry
  - Oversee data processing infrastructure
  - Create internal tooling for research iteration
- Own the compute budget
  - Analyze cost per FLOP across providers and hardware
  - Minimize wasted compute resources
Work Experience
- 5 years
Education
- Bachelor's degree or Master's degree
Languages
- English – Business Fluent
Tools & Technologies
- GPU
- Slurm
- Python
- PyTorch
- Claude Code
- Cursor
- Triton
- CUDA
Benefits
Informal Culture
- Commitment to diversity and inclusion
Job Security
- Safe and inclusive environment
Other Benefits
- Equal opportunities
About the Company
Prior Labs
Industry
IT
Description
The company is building breakthrough foundation models that understand structured data, aiming to revolutionize scientific discovery and business intelligence.
Not a perfect match?
- Black Forest Labs: Member of Technical Staff - Training Cluster Engineer (m/f/x). Full-time, on-site, experienced. Freiburg im Breisgau.
- Prior Labs: ML Engineer, Cloud Platform (m/f/x). Full-time, on-site, experienced. Berlin, Freiburg im Breisgau. From 140,000 / year.
- Prior Labs: ML Engineer, Foundation Model (m/f/x). Full-time, on-site, experienced. Berlin, Freiburg im Breisgau. From 120,000 / year.
- Black Forest Labs: Member of Technical Staff - Large scale data infrastructure (m/f/x). Full-time, on-site, senior. Freiburg im Breisgau.
- Prior Labs: Research Scientist Intern (PhD) (m/f/x). Full-time, internship, on-site. Berlin, Freiburg im Breisgau.