Senior Site Reliability Engineer - AI Studio (Inference Platform) (m/f/x)
Description
In this role, you will ensure the reliability and performance of the inference platform by designing telemetry pipelines, tuning Kubernetes, and crafting resilient infrastructure. The focus will be on maintaining smooth operations and quickly resolving incidents to support the demands of AI workloads.
Requirements
- Deep fluency with Kubernetes
- Fluency with Prometheus
- Fluency with Grafana
- Fluency with Terraform
- Scripting in Python or Bash
- Understanding of alert design and SLOs
- Experience with GPU-heavy workloads
- Background in MLOps or model-hosting platforms
- Interest in building self-healing systems
- Enjoyment of debugging performance
- Collaboration with software engineers
Professional experience
approx. 4-6 years
Responsibilities
- Own the reliability of the inference stack
- Design and refine telemetry pipelines
- Tune Kubernetes autoscalers for GPU efficiency
- Craft Terraform modules for resilient clusters
- Harden request-routing and retry logic
- Detect, isolate, and remediate incidents quickly
- Drive post-mortem culture to prevent recurrence
- Scale the platform while meeting cost and reliability targets
Languages
English: business fluent
Benefits
Flexible working
- Flexible working arrangements
Other benefits
- Comprehensive benefits package
Career and professional development
- Opportunities for professional growth
Relaxed company culture
- Dynamic and collaborative work environment
About the company
Nebius
Industry
IT
Description
The company is leading a new era in cloud computing to serve the global AI economy by creating tools and resources for real-world challenges.