Member of Technical Staff - Training Cluster Engineer (m/f/x)
This role designs and maintains large-scale ML training clusters for generative image and video models and deploys SLURM for workload orchestration. Production experience managing SLURM and GPU clusters is required, and hands-on Docker or Kubernetes experience is essential. The focus is on automating critical ML infrastructure and ensuring cluster availability with cloud providers.
Requirements
- Production experience managing SLURM clusters
- Hands-on experience with Docker or similar container runtimes
- Proven track record managing GPU clusters
- Understanding of distributed training patterns
- Experience with Kubernetes for containerized workloads
- Experience with high-performance interconnects
- Track record of managing 1000+ GPU training runs
- Familiarity with high-performance storage solutions
- Experience running hybrid training/inference infrastructure
- Strong scripting skills in Python and Bash
Responsibilities
- Design and maintain large-scale ML training clusters
- Deploy SLURM for distributed workload orchestration
- Implement node health monitoring systems
- Automate failure detection and recovery workflows
- Ensure cluster availability with cloud providers
- Monitor performance with colocation partners
- Establish security best practices for ML infrastructure
- Build developer-facing tools and APIs for ML workflows
- Collaborate with ML research teams on infrastructure needs
Work experience
- approx. 1-4 years
Education
- Completed vocational training, OR
- Bachelor's degree, OR
- Master's degree
Languages
- English: business fluent
Tools & Technologies
- SLURM
- Docker
- Kubernetes
- InfiniBand
- RoCE
- NCCL
- Python
- Bash
About the company
Black Forest Labs
Industry
IT
Description
The company advances generative deep learning for media, creating models that transform ideas into images and videos.