Senior Site Reliability Engineer (SRE) (m/w/x)
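Own global infrastructure stability and reliability for a regulated telecom provider operating across 40 countries. Requires 5+ years of reliability, systems, or infrastructure engineering experience, including 2+ years in a dedicated SRE role. Offers exposure to major transactions in a dynamic work environment.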
Requirements
- 5+ years Site Reliability, Systems, or Infrastructure Engineering (2+ years dedicated SRE)
- Strong expertise in Linux, distributed systems, networking
- Proven experience building/running high-availability production systems
- Hands-on experience with redundancy, failover testing, DR, HA validation
- Deep understanding of monitoring, observability, incident management
- Experience with Prometheus, Grafana, Loki, Thanos, OpenTelemetry or similar
- Proficiency in Python, Go, Bash for automation
- Strong knowledge of Kubernetes, container orchestration, service mesh
- Experience with AWS (EKS, EC2, VPC) and on-premises integration
- Proficiency in Infrastructure as Code tools like Terraform
- Understanding of networking fundamentals (routing, load balancing, BGP, DNS, VXLAN)
- Excellent analytical and problem-solving skills under pressure
- Strong communication and collaboration skills across teams
- Experience in telecom, carrier-grade, or large-scale distributed systems
- Hands-on experience with chaos engineering and automated failure validation
- Strong understanding of high-availability networking concepts
- Background in capacity planning, traffic engineering, multi-region failover
- Experience building reliability dashboards and integrating SRE metrics
- Familiarity with security and resilience standards (ISO 27001, NIST SP 800-53)
Tasks
- Strengthen global infrastructure stability, scalability, and reliability
- Proactively identify system weaknesses
- Improve reliability through redundancy testing, automation, and observability
- Mentor peers and set technical standards for reliability engineering
- Define, measure, and maintain SLIs and SLOs
- Plan and execute redundancy and resilience testing
- Validate failover, HA configurations, and disaster recovery readiness
- Design and implement automated recovery mechanisms
- Create self-healing workflows and intelligent alerting systems
- Drive incident response and root-cause analysis
- Conduct blameless post-mortems
- Implement and track corrective and preventive actions
- Develop and enhance observability using Prometheus, Grafana, Loki, and OpenTelemetry
- Ensure deployment safety, rollback policies, and configuration consistency
- Identify weaknesses through fault-injection, load, and chaos testing
- Reduce operational toil through automation and reliability tooling
- Contribute to on-call practices
- Improve alert quality, runbooks, and escalation procedures
- Manage incident response processes
- Perform capacity planning and performance benchmarking
- Conduct resilience audits across systems
- Ensure compliance with security, reliability, and availability standards
- Create and maintain internal documentation and playbooks
- Contribute to cloud cost-optimization initiatives
- Plan reserved capacity and autoscaling design
- Implement storage tiering and workload right-sizing
- Detect and address continuous anomalies
Work Experience
- 5 years
Education
- Bachelor's degree OR
- Master's degree
Languages
- English – Business Fluent
Tools & Technologies
- Linux
- Prometheus
- Grafana
- Loki
- Thanos
- OpenTelemetry
- Python
- Go
- Bash
- Kubernetes
- AWS
- EKS
- EC2
- VPC
- Terraform
- BGP
- DNS
- VXLAN
- ISO 27001
- NIST SP 800-53
Benefits
Career Advancement
- Growth opportunities
Other Benefits
- Major transaction exposure
Startup Environment
- Work with talented team
- Dynamic work environment
- Get-things-done attitude
Learning & Development
- Professional development
Informal Culture
- International experience
- Open communication culture
About the Company
1GLOBAL
Industry
Telecommunications
Description
1GLOBAL is a technology-driven global mobile communications provider delivering connectivity solutions to enterprises and consumers, operating as a regulated telecommunications provider across 40 countries.
Not a perfect match?
- Forto
Senior Site Reliability Engineer (m/w/x)
Full-time, On-site, Senior, Berlin
- Air Apps
Site Reliability Engineer (SRE) (m/w/x)
Full-time, On-site, Experienced, Berlin
- Nebius
Senior Site Reliability Engineer - AI Studio (Inference Platform) (m/w/x)
Full-time, On-site, Senior, Berlin
- emnify
Staff/Senior AWS Cloud Platform Engineer (m/w/x)
Full-time, On-site, Senior, Berlin
- SysEleven GmbH
Senior Site Reliability Engineer Managed Kubernetes (m/w/x)
Full-time, On-site, Senior, Berlin