Nebius

Senior Site Reliability Engineer - AI Studio (Inference Platform) (m/w/x)

Berlin
Full-time, On-site, Senior
AI/ML

This role owns the reliability of the AI inference stack and tunes Kubernetes autoscalers for GPU efficiency on a cloud platform built for AI. Deep Kubernetes fluency and skills in Prometheus, Grafana, and Terraform are required. Flexible working arrangements are offered.

Requirements

  • Deep fluency with Kubernetes
  • Fluency with Prometheus
  • Fluency with Grafana
  • Fluency with Terraform
  • Scripting in Python or Bash
  • Understanding of alert design and SLOs
  • Experience with GPU-heavy workloads
  • Background in MLOps or model-hosting platforms
  • Interest in building self-healing systems
  • Enjoyment of debugging performance
  • Collaboration with software engineers

Tasks

  • Own the reliability of the inference stack
  • Design and refine telemetry pipelines
  • Tune Kubernetes autoscalers for GPU efficiency
  • Craft Terraform modules for resilient clusters
  • Harden request-routing and retry logic
  • Detect, isolate, and remediate incidents quickly
  • Drive post-mortem culture to prevent recurrence
  • Scale the platform while meeting cost and reliability targets

Work Experience

  • Approx. 4 to 6 years

Education

  • Bachelor's degree or Master's degree

Languages

  • English: Business Fluent

Tools & Technologies

  • Kubernetes
  • Prometheus
  • Grafana
  • Terraform
  • Python
  • Bash
  • vLLM
  • Triton
  • Ray

Benefits

Flexible Working

  • Flexible working arrangements

Other Benefits

  • Comprehensive benefits package

Career Advancement

  • Opportunities for professional growth

Informal Culture

  • Dynamic and collaborative work environment