Senior Site Reliability Engineer — AI Studio (Inference Platform) (m/f/x)
This role optimizes the reliability of an AI inference stack and tunes Kubernetes autoscalers for GPU efficiency, within a company focused on cloud computing for AI. It requires deep Kubernetes fluency along with Prometheus, Grafana, and Terraform skills. Flexible working arrangements are offered.
Requirements
- Deep fluency with Kubernetes
- Fluency with Prometheus
- Fluency with Grafana
- Fluency with Terraform
- Scripting in Python or Bash
- Understanding of alert design and SLOs
- Experience with GPU-heavy workloads
- Background in MLOps or model-hosting platforms
- Interest in building self-healing systems
- Enjoyment of debugging performance issues
- Collaboration with software engineers
Responsibilities
- Own the reliability of the inference stack
- Design and refine telemetry pipelines
- Tune Kubernetes autoscalers for GPU efficiency
- Craft Terraform modules for resilient clusters
- Harden request-routing and retry logic
- Detect, isolate, and remediate incidents quickly
- Drive post-mortem culture to prevent recurrence
- Scale the platform while meeting cost and reliability targets
Work experience
- approx. 4 to 6 years
Education
- Bachelor's degree OR
- Master's degree
Languages
- English: business fluent
Tools & Technologies
- Kubernetes
- Prometheus
- Grafana
- Terraform
- Python
- Bash
- vLLM
- Triton
- Ray
Benefits
Flexible working
- Flexible working arrangements
Other benefits
- Comprehensive benefits package
Career and professional development
- Opportunities for professional growth
Relaxed company culture
- Dynamic and collaborative work environment
About the company
Nebius
Industry
IT
Description
The company is leading a new era in cloud computing to serve the global AI economy by creating tools and resources for real-world challenges.