The AI Job Search Engine
Senior Site Reliability Engineer, Managed Kubernetes (m/w/x)
Description
In this role, you will manage Kubernetes clusters, ensuring their reliability and performance. Your daily responsibilities will include troubleshooting issues, automating processes, and collaborating with teams to enhance the cloud infrastructure.
Requirements
- 6+ years of experience in an SRE, operations engineering, or similar role
- Strong programming skills in Go and Python
- Proven experience operating Kubernetes clusters in production environments
- Ability to work independently or as part of a team
- Ability to work with customers during incidents
- Familiarity with observability tools such as Prometheus, Grafana, and Fluent Bit
- Proven experience provisioning Kubernetes with tools such as kubeadm or Cluster API
- Deep Kubernetes expertise: CRDs, CSI, CNI, and experience writing Kubernetes Operators
- Exposure to HPC clusters, AI/ML workloads, or large-scale GPU clusters
- Experience with hybrid or multi-cloud Kubernetes environments
- Contributions to CNCF projects or Kubernetes SIGs
- A diversity of backgrounds, experiences, and skills is welcomed
Work Experience
6 years
Tasks
- Operate and maintain bare-metal Kubernetes clusters
- Handle cluster degradation, recovery, resizing, and incident response
- Participate in a well-managed on-call rotation for critical incidents
- Assist customers with Kubernetes questions and workload integration
- Collaborate with HPC Ops and Datacenter Ops teams on cross-functional issues
- Use Python and Go to create tooling and automate platform validation
- Design, build, and maintain scalable control plane services and custom controllers
- Develop automation for cluster lifecycle management, including provisioning and upgrades
- Define and implement SLOs and SLIs for Kubernetes services and platform reliability
Languages
English – Business Fluent
Benefits
Healthcare & Fitness
- Health, dental, and vision coverage
- Wellness stipend
Public Transport Subsidies
- Commuter stipend
Retirement Plans
- 401(k) plan with 2% company match
More Vacation Days
- Flexible paid time off plan
About the Company
Lambda
Industry
IT
Description
The company builds Gigawatt-scale AI Factories for Training and Inference and aims to make compute as ubiquitous as electricity.
- GetYourGuide
  Senior Site Reliability Engineer (m/w/x)
  Full-time · With home office · Senior · Berlin
- Redcare Pharmacy
  Senior Site Reliability Engineer (m/w/x)
  Full-time · With home office · Senior · Berlin
- Nebius
  Senior Site Reliability Engineer (m/w/x)
  Full-time · With home office · Senior · Berlin
- fiskaly
  Site Reliability Engineer (m/w/x)
  Full-time · With home office · Not specified · from 80,000 / year · Berlin, Vienna
- Wire Germany GmbH
  Site Reliability Engineer / Systems Engineer (m/w/x)
  Full-time · With home office · Senior · Berlin