The AI Job Search Engine
Member of Technical Staff - Training Cluster Engineer (m/f/x)
Description
You design and maintain ML training clusters, ensuring their performance and security. By collaborating with research teams, you translate their computational needs into effective infrastructure solutions.
Requirements
- Production experience managing SLURM clusters
- Hands-on experience with Docker or similar container runtimes
- Proven track record managing GPU clusters
- Understanding of distributed training patterns
- Experience with Kubernetes for containerized workloads
- Experience with high-performance interconnects
- Track record of managing 1000+ GPU training runs
- Familiarity with high-performance storage solutions
- Experience running hybrid training/inference infrastructure
- Strong scripting skills in Python and Bash
Work Experience
approx. 1 - 4 years
Tasks
- Design and maintain large-scale ML training clusters
- Deploy SLURM for distributed workload orchestration
- Implement node health monitoring systems
- Automate failure detection and recovery workflows
- Ensure cluster availability with cloud providers
- Monitor performance with colocation partners
- Establish security best practices for ML infrastructure
- Build developer-facing tools and APIs for ML workflows
- Collaborate with ML research teams on infrastructure needs
Languages
English – Business Fluent
About the Company
Black Forest Labs
Industry
IT
Description
Black Forest Labs is a cutting-edge startup pioneering generative image and video models. The company focuses on innovation and developing advanced ML infrastructure.
- Black Forest Labs
  Member of Technical Staff - Large scale data infrastructure (m/f/x)
  Full-time · On-site · Senior · Freiburg im Breisgau
- Prior Labs
  ML Engineer, Cloud Platform (m/f/x)
  Full-time · On-site · Experienced · from 140,000 / year · Berlin, Freiburg im Breisgau
- Prior Labs
  MLOps / ML Systems Engineer (m/f/x)
  Full-time · On-site · Senior · Berlin, Freiburg im Breisgau
- Black Forest Labs
  Member of Technical Staff - Data Engineering (m/f/x)
  Full-time · On-site · Experienced · Freiburg im Breisgau
- Black Forest Labs
  Member of Technical Staff - Image / Video Applications (m/f/x)
  Full-time · On-site · Not specified · Freiburg im Breisgau