Nebius

Senior Site Reliability Engineer — AI Studio (Inference Platform) (m/f/x)

Berlin
Full-time · On-site · Senior
AI/ML

Own the reliability of the AI inference stack and tune Kubernetes autoscalers for GPU efficiency in an AI cloud computing environment. Requires deep Kubernetes fluency plus Prometheus, Grafana, and Terraform skills. Flexible working arrangements are offered.

Requirements

  • Deep fluency with Kubernetes
  • Fluency with Prometheus
  • Fluency with Grafana
  • Fluency with Terraform
  • Scripting in Python or Bash
  • Understanding of alert design and SLOs
  • Experience with GPU-heavy workloads
  • Background in MLOps or model-hosting platforms
  • Interest in building self-healing systems
  • Enjoyment of debugging performance
  • Collaboration with software engineers

Tasks

  • Own the reliability of the inference stack
  • Design and refine telemetry pipelines
  • Tune Kubernetes autoscalers for GPU efficiency
  • Craft Terraform modules for resilient clusters
  • Harden request-routing and retry logic
  • Detect, isolate, and remediate incidents quickly
  • Drive post-mortem culture to prevent recurrence
  • Scale the platform while meeting cost and reliability targets

Work Experience

  • Approx. 4 to 6 years

Education

  • Bachelor's degree or Master's degree

Languages

  • English: business fluent

Tools & Technologies

  • Kubernetes
  • Prometheus
  • Grafana
  • Terraform
  • Python
  • Bash
  • vLLM
  • Triton
  • Ray

Benefits

Flexible Working

  • Flexible working arrangements

Other Benefits

  • Comprehensive benefits package

Career Advancement

  • Opportunities for professional growth

Informal Culture

  • Dynamic and collaborative work environment
