Senior Site Reliability Engineer - AI Studio (Inference Platform) (m/f/x)
Description
In this role, you will ensure the reliability and performance of the inference platform by designing telemetry pipelines, tuning Kubernetes, and crafting resilient infrastructure. The focus will be on maintaining smooth operations and quickly resolving incidents to support the demands of AI workloads.
Requirements
- Deep fluency with Kubernetes
- Fluency with Prometheus
- Fluency with Grafana
- Fluency with Terraform
- Scripting in Python or Bash
- Understanding of alert design and SLOs
- Experience with GPU-heavy workloads
- Background in MLOps or model-hosting platforms
- Interest in building self-healing systems
- Enjoyment of debugging performance
- Collaboration with software engineers
Education
Work experience
approx. 4-6 years
Responsibilities
- Own the reliability of the inference stack
- Design and refine telemetry pipelines
- Tune Kubernetes autoscalers for GPU efficiency
- Craft Terraform modules for resilient clusters
- Harden request-routing and retry logic
- Detect, isolate, and remediate incidents quickly
- Drive post-mortem culture to prevent recurrence
- Scale the platform while meeting cost and reliability targets
Tools & Technologies
Languages
English: business fluent
Benefits
Flexible working
- Flexible working arrangements
Other benefits
- Comprehensive benefits package
Career and professional development
- Opportunities for professional growth
Relaxed company culture
- Dynamic and collaborative work environment
About the company
Nebius
Industry
IT
Description
The company is leading a new era in cloud computing to serve the global AI economy by creating tools and resources for real-world challenges.
- PlayStation Global: Site Reliability Engineer (m/f/x). Full-time, on-site only, Senior, Berlin
- SysEleven GmbH: Senior Site Reliability Engineer Managed Kubernetes (m/f/x). Full-time, on-site only, Senior, Berlin
- PlayStation Global: Senior Service Reliability Engineer (m/f/x). Full-time, on-site only, Senior, Berlin
- Prior Labs: MLOps / ML Systems Engineer (m/f/x). Full-time, on-site only, Senior, Berlin, Freiburg im Breisgau
- Prior Labs: ML Engineer, Cloud Platform (m/f/x). Full-time, on-site only, Experienced, from 140,000 per year, Berlin, Freiburg im Breisgau