The AI Job Search Engine
Senior Site Reliability Engineer — AI Studio (Inference Platform) (m/w/x)
Description
In this role, you will ensure the reliability and performance of the inference platform by designing telemetry pipelines, tuning Kubernetes, and crafting resilient infrastructure. The focus will be on maintaining smooth operations and quickly resolving incidents to support the demands of AI workloads.
Requirements
- Deep fluency with Kubernetes
- Fluency with Prometheus
- Fluency with Grafana
- Fluency with Terraform
- Scripting in Python or Bash
- Understanding of alert design and SLOs
- Experience with GPU-heavy workloads
- Background in MLOps or model-hosting platforms
- Interest in building self-healing systems
- Enjoyment of debugging performance issues
- Collaboration with software engineers
Education
Work Experience
approx. 4 - 6 years
Tasks
- Own the reliability of the inference stack
- Design and refine telemetry pipelines
- Tune Kubernetes autoscalers for GPU efficiency
- Craft Terraform modules for resilient clusters
- Harden request-routing and retry logic
- Detect, isolate, and remediate incidents quickly
- Drive post-mortem culture to prevent recurrence
- Scale the platform while meeting cost and reliability targets
Tools & Technologies
Languages
English – Business Fluent
Benefits
Flexible Working
- Flexible working arrangements
Other Benefits
- Comprehensive benefits package
Career Advancement
- Opportunities for professional growth
Informal Culture
- Dynamic and collaborative work environment
About the Company
Nebius
Industry
IT
Description
The company is leading a new era in cloud computing to serve the global AI economy by creating tools and resources for real-world challenges.
- PlayStation Global
  Site Reliability Engineer (m/w/x)
  Full-time, On-site, Senior, Berlin
- SysEleven GmbH
  Senior Site Reliability Engineer Managed Kubernetes (m/w/x)
  Full-time, On-site, Senior, Berlin
- PlayStation Global
  Senior Service Reliability Engineer (m/w/x)
  Full-time, On-site, Senior, Berlin
- Prior Labs
  MLOps / ML Systems Engineer (m/w/x)
  Full-time, On-site, Senior, Berlin, Freiburg im Breisgau
- Prior Labs
  ML Engineer, Cloud Platform (m/w/x)
  Full-time, On-site, Experienced, from 140,000 / year, Berlin, Freiburg im Breisgau