Senior Site Reliability Engineer (SRE) (m/f/x)
Global infrastructure stability and reliability for a regulated telecom provider across 40 countries. 5+ years SRE experience required. Exposure to major transactions, dynamic work environment.
Requirements
- 5+ years Site Reliability, Systems, or Infrastructure Engineering (2+ years dedicated SRE)
- Strong expertise in Linux, distributed systems, networking
- Proven experience building/running high-availability production systems
- Hands-on experience with redundancy, failover testing, DR, HA validation
- Deep understanding of monitoring, observability, incident management
- Experience with Prometheus, Grafana, Loki, Thanos, OpenTelemetry or similar
- Proficiency in Python, Go, Bash for automation
- Strong knowledge of Kubernetes, container orchestration, service mesh
- Experience with AWS (EKS, EC2, VPC) and on-premises integration
- Proficiency in Infrastructure as Code tools like Terraform
- Understanding of networking fundamentals (routing, load balancing, BGP, DNS, VXLAN)
- Excellent analytical and problem-solving skills under pressure
- Strong communication and collaboration skills across teams
- Experience in telecom, carrier-grade, or large-scale distributed systems
- Hands-on experience with chaos engineering and automated failure validation
- Strong understanding of high-availability networking concepts
- Background in capacity planning, traffic engineering, multi-region failover
- Experience building reliability dashboards and integrating SRE metrics
- Familiarity with security and resilience standards (ISO 27001, NIST SP 800-53)
Responsibilities
- Strengthen global infrastructure stability, scalability, and reliability
- Proactively identify system weaknesses
- Improve reliability through redundancy testing, automation, and observability
- Mentor peers and set technical standards for reliability engineering
- Define, measure, and maintain SLIs and SLOs
- Plan and execute redundancy and resilience testing
- Validate failover, HA configurations, and disaster recovery readiness
- Design and implement automated recovery mechanisms
- Create self-healing workflows and intelligent alerting systems
- Drive incident response and root-cause analysis
- Conduct blameless post-mortems
- Implement and track corrective and preventive actions
- Develop and enhance observability using Prometheus, Grafana, Loki, and OpenTelemetry
- Ensure deployment safety, rollback policies, and configuration consistency
- Identify weaknesses through fault-injection, load, and chaos testing
- Reduce operational toil through automation and reliability tooling
- Contribute to on-call practices
- Improve alert quality, runbooks, and escalation procedures
- Manage incident response processes
- Perform capacity planning and performance benchmarking
- Conduct resilience audits across systems
- Ensure compliance with security, reliability, and availability standards
- Create and maintain internal documentation and playbooks
- Contribute to cloud cost-optimization initiatives
- Plan reserved capacity and autoscaling design
- Implement storage tiering and workload right-sizing
- Detect and address continuous anomalies
Work experience
- 5 years
Education
- Bachelor's degree, or
- Master's degree
Languages
- English – business fluent
Tools & Technologies
- Linux
- Prometheus
- Grafana
- Loki
- Thanos
- OpenTelemetry
- Python
- Go
- Bash
- Kubernetes
- AWS
- EKS
- EC2
- VPC
- Terraform
- BGP
- DNS
- VXLAN
- ISO 27001
- NIST SP 800-53
Benefits
Career and development
- Growth opportunities
Other benefits
- Major transaction exposure
Startup atmosphere
- Work with a talented team
- Dynamic work environment
- Get-things-done attitude
Training opportunities
- Professional development
Relaxed company culture
- International experience
- Open communication culture
About the company
1GLOBAL
Industry
Telecommunications
Description
1GLOBAL is a technology-driven global mobile communications provider delivering connectivity solutions to enterprises and consumers, operating as a regulated telecommunications provider across 40 countries.