Senior Site Reliability Engineer — AI Studio (Inference Platform) (m/f/x)
Own the reliability of an AI inference stack and tune Kubernetes autoscalers for GPU efficiency at a cloud computing provider for AI. Deep Kubernetes fluency plus Prometheus, Grafana, and Terraform skills are required. Flexible working arrangements are offered.
Requirements
- Deep fluency with Kubernetes
- Fluency with Prometheus
- Fluency with Grafana
- Fluency with Terraform
- Scripting in Python or Bash
- Understanding of alert design and SLOs
- Experience with GPU-heavy workloads
- Background in MLOps or model-hosting platforms
- Interest in building self-healing systems
- Enjoyment of debugging performance
- Collaboration with software engineers
Tasks
- Own the reliability of the inference stack
- Design and refine telemetry pipelines
- Tune Kubernetes autoscalers for GPU efficiency
- Craft Terraform modules for resilient clusters
- Harden request-routing and retry logic
- Detect, isolate, and remediate incidents quickly
- Drive post-mortem culture to prevent recurrence
- Scale the platform while meeting cost and reliability targets
Work Experience
- Approx. 4 to 6 years
Education
- Bachelor's degree or Master's degree
Languages
- English – Business Fluent
Tools & Technologies
- Kubernetes
- Prometheus
- Grafana
- Terraform
- Python
- Bash
- vLLM
- Triton
- Ray
Benefits
Flexible Working
- Flexible working arrangements
Other Benefits
- Comprehensive benefits package
Career Advancement
- Opportunities for professional growth
Informal Culture
- Dynamic and collaborative work environment
About the Company
Nebius
Industry
IT
Description
The company is leading a new era in cloud computing to serve the global AI economy by creating tools and resources for real-world challenges.
Not a perfect match?
- Workato: Senior Infrastructure Engineer - Observability (m/f/x). Full-time, on-site, senior. Berlin, Frankfurt am Main, München.
- SysEleven GmbH: Senior Site Reliability Engineer Managed Kubernetes (m/f/x). Full-time, on-site, senior. Berlin.
- Prior Labs: MLOps / ML Systems Engineer (m/f/x). Full-time, on-site, senior. Berlin, Freiburg im Breisgau.
- Prior Labs: ML Engineer, Cloud Platform (m/f/x). Full-time, on-site, experienced. Berlin, Freiburg im Breisgau. From 140,000 / year.
- Trade Republic: Staff Engineer – Cloud Platform (m/f/x). Full-time, on-site, senior. Berlin.