Prior Labs

Posted 2 months ago

Senior ML Infrastructure Engineer (m/f/x)

Freiburg im Breisgau, Berlin

Full-time, on-site, senior

AI/ML

Data Science

Nejo AI summary

Multi-cluster GPU infrastructure management for tabular foundation models. Deep Slurm and cluster optimization experience required. Commitment to diversity and inclusion.

Requirements

5+ years building/operating production GPU infrastructure or distributed training systems at scale
Deep hands-on Slurm and cluster management experience
Debugging scheduling failures
Optimizing multi-tenant GPU workload utilization
Operating infrastructure with real cost of downtime
Expert-level systems thinking: memory bandwidth, GPU profiling
Reasoning about hardware, not configs
Strong Python and genuine fluency with PyTorch internals
Profiling training runs to identify bottlenecks
Track record of infrastructure decisions improving training throughput or cost efficiency
Strong AI tooling skills
Fluent use of Claude Code, Cursor, or similar
Experience operating at tens-of-millions-scale GPU spend
Multi-cloud or hybrid HPC/cloud infrastructure experience
Triton, CUDA, or custom kernel experience
Experience scaling from single cluster to multi-cluster orchestration
Background building experiment tracking, model registry, or ML pipeline tooling

Responsibilities

Own and evolve multi-cluster GPU infrastructure
Manage Slurm on GCP and multi-provider/new hardware deployments
Optimize cluster architecture, scheduling, and reliability
Drive GPU utilization and training throughput
Profile and optimize memory for distributed training
Identify and resolve communication bottlenecks in distributed training
Debug distributed training systems for large runs
Architect next-generation infrastructure
Orchestrate multi-cluster environments
Integrate new GPU generations
Diversify cloud providers
Plan capacity for growing compute demands
Build the developer productivity layer
Develop CI pipelines
Implement experiment tracking
Manage model registry
Oversee data processing infrastructure
Create internal tooling for research iteration
Own the compute budget
Analyze cost per FLOP across providers and hardware
Minimize wasted compute resources

Work experience

5 years

Education

Bachelor's degree or Master's degree

Languages

English (business fluent)

Tools & Technologies

GPU
Slurm
Python
PyTorch
Claude Code
Cursor
Triton
CUDA

Benefits

Relaxed company culture

Commitment to diversity and inclusion

Safe workplace

Safe and inclusive environment

Other benefits

Equal opportunities

The latest version of the original job posting is available on the Prior Labs website. Nejo collected this job automatically from the Prior Labs website and prepared the information shown here with the help of AI. Despite careful analysis, individual details may be incomplete or inaccurate, so always check every detail against the original posting. The content and copyright of the original posting belong to the advertising company.
