Senior ML Infrastructure Engineer (m/f/x)
Multi-cluster GPU infrastructure management for tabular foundation models. Deep Slurm and cluster optimization experience required. Commitment to diversity and inclusion.
Requirements
- 5+ years building/operating production GPU infrastructure or distributed training systems at scale
- Deep hands-on Slurm and cluster management experience
- Debugging scheduling failures
- Optimizing multi-tenant GPU workload utilization
- Operating infrastructure with real cost of downtime
- Expert-level systems thinking: memory bandwidth, GPU profiling
- Reasoning about hardware, not configs
- Strong Python and genuine fluency with PyTorch internals
- Profiling training runs to identify bottlenecks
- Track record of infrastructure decisions improving training throughput or cost efficiency
- Strong AI tooling skills
- Fluent use of Claude Code, Cursor, or similar
- Experience operating at tens-of-millions-scale GPU spend
- Multi-cloud or hybrid HPC/cloud infrastructure experience
- Triton, CUDA, or custom kernel experience
- Experience scaling from single cluster to multi-cluster orchestration
- Background building experiment tracking, model registry, or ML pipeline tooling
Responsibilities
- Own and evolve multi-cluster GPU infrastructure
  - Manage Slurm on GCP and multi-provider/new hardware deployments
  - Optimize cluster architecture, scheduling, and reliability
- Drive GPU utilization and training throughput
  - Profile and optimize memory for distributed training
  - Identify and resolve communication bottlenecks in distributed training
  - Debug distributed training systems for large runs
- Architect next-generation infrastructure
  - Orchestrate multi-cluster environments
  - Integrate new GPU generations
  - Diversify cloud providers
  - Plan capacity for growing compute demands
- Build the developer productivity layer
  - Develop CI pipelines
  - Implement experiment tracking
  - Manage model registry
  - Oversee data processing infrastructure
  - Create internal tooling for research iteration
- Own the compute budget
  - Analyze cost per FLOP across providers and hardware
  - Minimize wasted compute resources
Work experience
- 5 years
Education
- Bachelor's degree, or
- Master's degree
Languages
- English: business fluent
Tools & technologies
- GPU
- Slurm
- Python
- PyTorch
- Claude Code
- Cursor
- Triton
- CUDA
Benefits
Relaxed company culture
- Commitment to diversity and inclusion
Safe workplace
- Safe and inclusive environment
Other benefits
- Equal opportunities
About the company
Prior Labs
Industry
IT
Description
The company is building breakthrough foundation models that understand structured data, aiming to revolutionize scientific discovery and business intelligence.