The AI Job Search Engine
Senior Site Reliability Engineer — AI Studio (Inference Platform)(m/w/x)
Description
In this role, you will ensure the reliability and performance of the inference platform by designing telemetry pipelines, tuning Kubernetes, and crafting resilient infrastructure. The focus will be on maintaining smooth operations and quickly resolving incidents to support the demands of AI workloads.
Requirements
- Deep fluency with Kubernetes
- Fluency with Prometheus
- Fluency with Grafana
- Fluency with Terraform
- Scripting in Python or Bash
- Understanding of alert design and SLOs
- Experience with GPU-heavy workloads
- Background in MLOps or model-hosting platforms
- Interest in building self-healing systems
- Enjoyment of debugging performance
- Collaboration with software engineers
Work Experience
approx. 4 - 6 years
Tasks
- Own the reliability of the inference stack
- Design and refine telemetry pipelines
- Tune Kubernetes autoscalers for GPU efficiency
- Craft Terraform modules for resilient clusters
- Harden request-routing and retry logic
- Detect, isolate, and remediate incidents quickly
- Drive post-mortem culture to prevent recurrence
- Scale the platform while meeting cost and reliability targets
Languages
English – Business Fluent
Benefits
Flexible Working
- Flexible working arrangements
Other Benefits
- Comprehensive benefits package
Career Advancement
- Opportunities for professional growth
Informal Culture
- Dynamic and collaborative work environment
About the Company
Nebius
Industry
IT
Description
The company is leading a new era in cloud computing to serve the global AI economy by creating tools and resources for real-world challenges.
- Trade Republic: Senior Site Reliability Engineer – Data and ML Platform (m/w/x). Full-time, On-site, Senior, Berlin
- SysEleven GmbH: Senior Site Reliability Engineer Managed Kubernetes (m/w/x). Full-time, On-site, Senior, Berlin
- Prior Labs: ML Engineer, Cloud Platform (m/w/x). Full-time, On-site, Experienced, from 140,000 / year, Berlin, Freiburg im Breisgau
- Prior Labs: MLOps / ML Systems Engineer (m/w/x). Full-time, On-site, Senior, Berlin, Freiburg im Breisgau
- Langdock: Platform Engineer (m/w/x). Full-time, On-site, Experienced, from 120,000 / year, Berlin