SpAItial

Machine Learning & Cloud Infra Engineer (m/f/x)

München
Full-time, On-site, Experienced
AI/ML
Data Science

SpAItial builds 3D world models with generative AI on GPU clusters. The role covers multi-node, multi-GPU training operations. ML infrastructure experience is preferred.

Requirements

  • 3+ years infrastructure, platform, or cloud engineering experience
  • ML infrastructure experience strongly preferred
  • Hands-on GPU compute experience
  • GPU performance debugging experience
  • CUDA/NCCL concepts knowledge
  • GPU utilization understanding
  • Networking bottlenecks understanding
  • Profiling experience
  • Strong cloud environment operation experience
  • AWS, GCP, or Azure experience
  • Cloud networking experience
  • Cloud IAM experience
  • Cloud cost management experience
  • Containers and orchestration proficiency
  • Docker proficiency
  • Kubernetes proficiency
  • Infrastructure-as-code proficiency
  • Terraform proficiency
  • Strong scripting skills
  • Strong automation skills
  • Python scripting skills
  • Bash/PowerShell scripting skills
  • Distributed training familiarity
  • Modern ML stacks familiarity
  • PyTorch familiarity
  • DDP/FSDP familiarity
  • Monitoring tooling experience
  • Observability tooling experience
  • Prometheus/Grafana experience
  • OpenTelemetry experience
  • ELK experience
  • CI/CD experience for infra
  • CI/CD experience for ML workflows
  • CircleCI experience
  • GitHub Actions experience

Tasks

  • Design and implement scalable training systems
  • Operate GPU clusters for multi-node, multi-GPU training
  • Provision and maintain training environments
  • Support high-throughput training stacks
  • Ensure performance and stability in large runs
  • Build and optimize storage systems for petabyte-scale datasets
  • Enhance data throughput with caching and data locality
  • Package and deploy workloads using Docker and Kubernetes
  • Maintain infrastructure-as-code with Terraform
  • Implement monitoring and logging for cluster health
  • Define SLOs and on-call/incident response practices
  • Manage secrets and IAM for secure systems
  • Ensure secure network boundaries
  • Collaborate with ML researchers and engineers
  • Unblock training and improve developer experience
  • Support model evaluation and serving infrastructure
  • Facilitate smooth transitions from research to production

Work Experience

  • 3 years

Education

  • Vocational certification, OR
  • Bachelor's degree, OR
  • Master's degree

Languages

  • English: Business Fluent

Tools & Technologies

  • AWS
  • GCP
  • Azure
  • Docker
  • Kubernetes
  • Terraform
  • Python
  • Bash
  • PowerShell
  • PyTorch
  • DDP
  • FSDP
  • Prometheus
  • Grafana
  • OpenTelemetry
  • ELK
  • CircleCI
  • GitHub Actions

The original job posting, in its most current version, is on the SpAItial website. Nejo automatically captured this job from that website and processed the information with the help of AI. Despite careful analysis, some information may be incomplete or inaccurate, so always verify all details in the original posting. Content and copyrights of the original posting belong to the advertising company.

