Member of Technical Staff - Training Cluster Engineer (m/f/x)
Design and maintain large-scale ML training clusters for generative image and video models, and deploy SLURM for workload orchestration. Production experience managing SLURM and GPU clusters is required, and hands-on experience with Docker or Kubernetes is essential. The role focuses on automating critical ML infrastructure and ensuring cluster availability with cloud providers.
Requirements
- Production experience managing SLURM clusters
- Hands-on experience with Docker or similar container runtimes
- Proven track record managing GPU clusters
- Understanding of distributed training patterns
- Experience with Kubernetes for containerized workloads
- Experience with high-performance interconnects
- Track record of managing 1000+ GPU training runs
- Familiarity with high-performance storage solutions
- Experience running hybrid training/inference infrastructure
- Strong scripting skills in Python and Bash
Responsibilities
- Design and maintain large-scale ML training clusters
- Deploy SLURM for distributed workload orchestration
- Implement node health monitoring systems
- Automate failure detection and recovery workflows
- Ensure cluster availability with cloud providers
- Monitor performance with colocation partners
- Establish security best practices for ML infrastructure
- Build developer-facing tools and APIs for ML workflows
- Collaborate with ML research teams on infrastructure needs
Work experience
- approx. 1-4 years
Education
- Completed vocational training, OR
- Bachelor's degree, OR
- Master's degree
Languages
- English – business fluent
Tools & Technologies
- SLURM
- Docker
- Kubernetes
- InfiniBand
- RoCE
- NCCL
- Python
- Bash
About the company
Black Forest Labs
Industry
IT
Description
The company advances generative deep learning for media, creating models that transform ideas into images and videos.