Member of Technical Staff - Training Cluster Engineer (m/f/x)
The role covers designing and maintaining large-scale ML training clusters for generative image and video models and deploying SLURM for workload orchestration. Production experience managing SLURM and GPU clusters is required, and hands-on Docker or Kubernetes experience is essential. The focus is on automating critical ML infrastructure and ensuring cluster availability with cloud providers.
Requirements
- Production experience managing SLURM clusters
- Hands-on experience with Docker or similar container runtimes
- Proven track record managing GPU clusters
- Understanding of distributed training patterns
- Experience with Kubernetes for containerized workloads
- Experience with high-performance interconnects
- Track record of managing 1000+ GPU training runs
- Familiarity with high-performance storage solutions
- Experience running hybrid training/inference infrastructure
- Strong scripting skills in Python and Bash
Responsibilities
- Design and maintain large-scale ML training clusters
- Deploy SLURM for distributed workload orchestration
- Implement node health monitoring systems
- Automate failure detection and recovery workflows
- Ensure cluster availability with cloud providers
- Monitor performance with colocation partners
- Establish security best practices for ML infrastructure
- Build developer-facing tools and APIs for ML workflows
- Collaborate with ML research teams on infrastructure needs
Work experience
- approx. 1-4 years
Education
- Completed vocational training, or
- Bachelor's degree, or
- Master's degree
Languages
- English: business fluent
Tools & Technologies
- SLURM
- Docker
- Kubernetes
- InfiniBand
- RoCE
- NCCL
- Python
- Bash
About the company
Black Forest Labs
Industry
IT
Description
Black Forest Labs is a cutting-edge startup pioneering generative image and video models. The company focuses on innovation and developing advanced ML infrastructure.