Member of Technical Staff - Training Cluster Engineer (m/f/x)
This role designs and maintains large-scale ML training clusters for generative image and video models and deploys SLURM for workload orchestration. Production experience managing SLURM and GPU clusters is required, and hands-on experience with Docker or Kubernetes is essential. The focus is on automating critical ML infrastructure and ensuring cluster availability with cloud providers.
Requirements
- Production experience managing SLURM clusters
- Hands-on experience with Docker or similar container runtimes
- Proven track record managing GPU clusters
- Understanding of distributed training patterns
- Experience with Kubernetes for containerized workloads
- Experience with high-performance interconnects
- Track record of managing 1000+ GPU training runs
- Familiarity with high-performance storage solutions
- Experience running hybrid training/inference infrastructure
- Strong scripting skills in Python and Bash
Tasks
- Design and maintain large-scale ML training clusters
- Deploy SLURM for distributed workload orchestration
- Implement node health monitoring systems
- Automate failure detection and recovery workflows
- Ensure cluster availability with cloud providers
- Monitor performance with colocation partners
- Establish security best practices for ML infrastructure
- Build developer-facing tools and APIs for ML workflows
- Collaborate with ML research teams on infrastructure needs
Work Experience
- approx. 1 - 4 years
Education
- Vocational certification OR
- Bachelor's degree OR
- Master's degree
Languages
- English – Business Fluent
Tools & Technologies
- SLURM
- Docker
- Kubernetes
- InfiniBand
- RoCE
- NCCL
- Python
- Bash
About the Company
Black Forest Labs
Industry
IT
Description
Black Forest Labs is a cutting-edge startup pioneering generative image and video models. The company focuses on innovation and developing advanced ML infrastructure.
Not a perfect match?
- Black Forest Labs
  Member of Technical Staff - Large scale data infrastructure (m/f/x)
  Full-time | On-site | Senior | Freiburg im Breisgau
- Prior Labs
  MLOps / ML Systems Engineer (m/f/x)
  Full-time | On-site | Senior | Berlin, Freiburg im Breisgau
- Prior Labs
  ML Engineer, Cloud Platform (m/f/x)
  Full-time | On-site | Experienced | Berlin, Freiburg im Breisgau | from 140,000 / year
- Prior Labs
  ML Engineer, Foundation Model (m/f/x)
  Full-time | On-site | Experienced | Berlin, Freiburg im Breisgau | from 120,000 / year
- Black Forest Labs
  Developer Relations Engineer (m/f/x)
  Full-time | On-site | Experienced | Freiburg im Breisgau