Prior Labs

Senior ML Infrastructure Engineer (m/w/x)

Freiburg im Breisgau, Berlin
Full-time, On-site, Senior
AI/ML
Data Science

Manage multi-cluster GPU infrastructure for tabular foundation models. Deep Slurm and cluster optimization experience is required. The company is committed to diversity and inclusion.

Requirements

  • 5+ years building/operating production GPU infrastructure or distributed training systems at scale
  • Deep hands-on Slurm and cluster management experience
  • Debugging scheduling failures
  • Optimizing multi-tenant GPU workload utilization
  • Operating infrastructure with real cost of downtime
  • Expert-level systems thinking: memory bandwidth, GPU profiling
  • Reasoning about hardware, not configs
  • Strong Python and genuine fluency with PyTorch internals
  • Profiling training runs to identify bottlenecks
  • Track record of infrastructure decisions improving training throughput or cost efficiency
  • Strong AI tooling skills
  • Fluent use of Claude Code, Cursor, or similar
  • Experience operating at tens-of-millions-scale GPU spend
  • Multi-cloud or hybrid HPC/cloud infrastructure experience
  • Triton, CUDA, or custom kernel experience
  • Experience scaling from single cluster to multi-cluster orchestration
  • Background building experiment tracking, model registry, or ML pipeline tooling

Tasks

  • Own and evolve multi-cluster GPU infrastructure
  • Manage Slurm on GCP and multi-provider/new hardware deployments
  • Optimize cluster architecture, scheduling, and reliability
  • Drive GPU utilization and training throughput
  • Profile and optimize memory for distributed training
  • Identify and resolve communication bottlenecks in distributed training
  • Debug distributed training systems for large runs
  • Architect next-generation infrastructure
  • Orchestrate multi-cluster environments
  • Integrate new GPU generations
  • Diversify cloud providers
  • Plan capacity for growing compute demands
  • Build the developer productivity layer
  • Develop CI pipelines
  • Implement experiment tracking
  • Manage model registry
  • Oversee data processing infrastructure
  • Create internal tooling for research iteration
  • Own the compute budget
  • Analyze cost per FLOP across providers and hardware
  • Minimize wasted compute resources

Work Experience

  • 5 years

Education

  • Bachelor's degree OR
  • Master's degree

Languages

  • English: Business Fluent

Tools & Technologies

  • GPU
  • Slurm
  • Python
  • PyTorch
  • Claude Code
  • Cursor
  • Triton
  • CUDA

Benefits

Informal Culture

  • Commitment to diversity and inclusion

Job Security

  • Safe and inclusive environment

Other Benefits

  • Equal opportunities

Nejo automatically captured this job from the website of Prior Labs and processed the information with the help of AI. Despite careful analysis, some information may be incomplete or inaccurate, so always verify all details in the original posting. Content and copyrights of the original posting belong to the advertising company.
