Black Forest Labs

Member of Technical Staff - Training Cluster Engineer (m/f/x)

Freiburg im Breisgau
Full-time, On-site, Experienced
AI/ML

Design and maintain large-scale ML training clusters for generative image and video models, and deploy SLURM for workload orchestration. Production experience managing SLURM and GPU clusters is required; hands-on experience with Docker or Kubernetes is essential. The role focuses on automating critical ML infrastructure and ensuring cluster availability with cloud providers.

Requirements

  • Production experience managing SLURM clusters
  • Hands-on experience with Docker or similar container runtimes
  • Proven track record managing GPU clusters
  • Understanding of distributed training patterns
  • Experience with Kubernetes for containerized workloads
  • Experience with high-performance interconnects
  • Track record of managing training runs on 1000+ GPUs
  • Familiarity with high-performance storage solutions
  • Experience running hybrid training/inference infrastructure
  • Strong scripting skills in Python and Bash

Tasks

  • Design and maintain large-scale ML training clusters
  • Deploy SLURM for distributed workload orchestration
  • Implement node health monitoring systems
  • Automate failure detection and recovery workflows
  • Ensure cluster availability with cloud providers
  • Monitor performance with colocation partners
  • Establish security best practices for ML infrastructure
  • Build developer-facing tools and APIs for ML workflows
  • Collaborate with ML research teams on infrastructure needs

Work Experience

  • approx. 1 - 4 years

Education

  • Vocational certification, OR
  • Bachelor's degree, OR
  • Master's degree

Languages

  • English: Business Fluent

Tools & Technologies

  • SLURM
  • Docker
  • Kubernetes
  • InfiniBand
  • RoCE
  • NCCL
  • Python
  • Bash

Nejo automatically captured this job from the website of Black Forest Labs and processed the information with the help of AI. Despite careful analysis, some information may be incomplete or inaccurate, so always verify all details in the original posting. Content and copyrights of the original posting belong to the advertising company.
