Member of Technical Staff - Training Cluster Engineer (m/f/x)
Design and maintain large-scale ML training clusters for generative image and video models, deploying SLURM for workload orchestration. Production experience managing SLURM and GPU clusters is required, and hands-on experience with Docker or Kubernetes is essential. The role focuses on automating critical ML infrastructure and ensuring cluster availability with cloud providers.
Requirements
- Production experience managing SLURM clusters
- Hands-on experience with Docker or similar container runtimes
- Proven track record managing GPU clusters
- Understanding of distributed training patterns
- Experience with Kubernetes for containerized workloads
- Experience with high-performance interconnects
- Track record of managing 1000+ GPU training runs
- Familiarity with high-performance storage solutions
- Experience running hybrid training/inference infrastructure
- Strong scripting skills in Python and Bash
Tasks
- Design and maintain large-scale ML training clusters
- Deploy SLURM for distributed workload orchestration
- Implement node health monitoring systems
- Automate failure detection and recovery workflows
- Ensure cluster availability with cloud providers
- Monitor performance with colocation partners
- Establish security best practices for ML infrastructure
- Build developer-facing tools and APIs for ML workflows
- Collaborate with ML research teams on infrastructure needs
Work Experience
- approx. 1 - 4 years
Education
- Vocational certification, or
- Bachelor's degree, or
- Master's degree
Languages
- English – Business Fluent
Tools & Technologies
- SLURM
- Docker
- Kubernetes
- InfiniBand
- RoCE
- NCCL
- Python
- Bash
About the Company
Black Forest Labs
Industry
IT
Description
The company advances generative deep learning for media, creating models that transform ideas into images and videos.