Black Forest Labs
4mo ago

Member of Technical Staff - Training Cluster Engineer (m/f/x)

Freiburg im Breisgau
Full-time, On-site, Experienced
AI/ML

Description

You design and maintain ML training clusters, ensuring their performance and security. By collaborating with research teams, you translate their computational needs into effective infrastructure solutions.

Requirements

  • Production experience managing SLURM clusters
  • Hands-on experience with Docker or similar container runtimes
  • Proven track record managing GPU clusters
  • Understanding of distributed training patterns
  • Experience with Kubernetes for containerized workloads
  • Experience with high-performance interconnects
  • Track record of managing 1000+ GPU training runs
  • Familiarity with high-performance storage solutions
  • Experience running hybrid training/inference infrastructure
  • Strong scripting skills in Python and Bash

Work Experience

approx. 1 - 4 years

Tasks

  • Design and maintain large-scale ML training clusters
  • Deploy SLURM for distributed workload orchestration
  • Implement node health monitoring systems
  • Automate failure detection and recovery workflows
  • Ensure cluster availability with cloud providers
  • Monitor performance with colocation partners
  • Establish security best practices for ML infrastructure
  • Build developer-facing tools and APIs for ML workflows
  • Collaborate with ML research teams on infrastructure needs

Tools & Technologies

SLURM, Docker, Kubernetes, InfiniBand, RoCE, NCCL, Python, Bash

Languages

English: Business Fluent

Find the original job posting in its most current version here. Nejo automatically captured this job from the website of Black Forest Labs and processed the information on Nejo with the help of AI for you. Despite careful analysis, some information may be incomplete or inaccurate. Please always verify all details in the original posting! Content and copyrights of the original posting belong to the advertising company.