Nebius
Posted last month

Senior Site Reliability Engineer - AI Studio (Inference Platform) (m/f/x)

Berlin
Full-time, On-site, Senior
AI/ML

Description

In this role, you will ensure the reliability and performance of the inference platform by designing telemetry pipelines, tuning Kubernetes, and crafting resilient infrastructure. The focus will be on maintaining smooth operations and quickly resolving incidents to support the demands of AI workloads.

Requirements

  • Deep fluency with Kubernetes
  • Fluency with Prometheus
  • Fluency with Grafana
  • Fluency with Terraform
  • Scripting in Python or Bash
  • Understanding of alert design and SLOs
  • Experience with GPU-heavy workloads
  • Background in MLOps or model-hosting platforms
  • Interest in building self-healing systems
  • Enjoyment of debugging performance
  • Collaboration with software engineers

Work Experience

Approx. 4-6 years

Tasks

  • Own the reliability of the inference stack
  • Design and refine telemetry pipelines
  • Tune Kubernetes autoscalers for GPU efficiency
  • Craft Terraform modules for resilient clusters
  • Harden request-routing and retry logic
  • Detect, isolate, and remediate incidents quickly
  • Drive post-mortem culture to prevent recurrence
  • Scale the platform while meeting cost and reliability targets

Tools & Technologies

Kubernetes, Prometheus, Grafana, Terraform, Python, Bash, vLLM, Triton, Ray

Languages

English: Business Fluent

Benefits

Flexible Working

  • Flexible working arrangements

Other Benefits

  • Comprehensive benefits package

Career Advancement

  • Opportunities for professional growth

Informal Culture

  • Dynamic and collaborative work environment

Nejo automatically captured this job from the Nebius website and processed the information with the help of AI. Despite careful analysis, some information may be incomplete or inaccurate; always verify all details against the original posting. Content and copyrights of the original posting belong to the advertising company.