1GLOBAL

last mo.

Senior Site Reliability Engineer (SRE) (m/f/x)

Berlin

Full-time, On-site, Senior

Nejo AI Summary

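Responsible for global infrastructure stability and reliability at a regulated telecom provider operating across 40 countries. Requires 5+ years of site reliability, systems, or infrastructure engineering experience, with 2+ years in a dedicated SRE role. Offers exposure to major transactions in a dynamic work environment.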

Requirements

5+ years in Site Reliability, Systems, or Infrastructure Engineering (2+ years in a dedicated SRE role)
Strong expertise in Linux, distributed systems, networking
Proven experience building/running high-availability production systems
Hands-on experience with redundancy, failover testing, DR, HA validation
Deep understanding of monitoring, observability, incident management
Experience with Prometheus, Grafana, Loki, Thanos, OpenTelemetry or similar
Proficiency in Python, Go, Bash for automation
Strong knowledge of Kubernetes, container orchestration, service mesh
Experience with AWS (EKS, EC2, VPC) and on-premises integration
Proficiency in Infrastructure as Code tools like Terraform
Understanding of networking fundamentals (routing, load balancing, BGP, DNS, VXLAN)
Excellent analytical and problem-solving skills under pressure
Strong communication and collaboration skills across teams
Experience in telecom, carrier-grade, or large-scale distributed systems
Hands-on experience with chaos engineering and automated failure validation
Strong understanding of high-availability networking concepts
Background in capacity planning, traffic engineering, multi-region failover
Experience building reliability dashboards and integrating SRE metrics
Familiarity with security and resilience standards (ISO 27001, NIST SP 800-53)

Tasks

Strengthen global infrastructure stability, scalability, and reliability
Proactively identify system weaknesses
Improve reliability through redundancy testing, automation, and observability
Mentor peers and set technical standards for reliability engineering
Define, measure, and maintain SLIs and SLOs
Plan and execute redundancy and resilience testing
Validate failover, HA configurations, and disaster recovery readiness
Design and implement automated recovery mechanisms
Create self-healing workflows and intelligent alerting systems
Drive incident response and root-cause analysis
Conduct blameless post-mortems
Implement and track corrective and preventive actions
Develop and enhance observability using Prometheus, Grafana, Loki, and OpenTelemetry
Ensure deployment safety, rollback policies, and configuration consistency
Identify weaknesses through fault-injection, load, and chaos testing
Reduce operational toil through automation and reliability tooling
Contribute to on-call practices
Improve alert quality, runbooks, and escalation procedures
Manage incident response processes
Perform capacity planning and performance benchmarking
Conduct resilience audits across systems
Ensure compliance with security, reliability, and availability standards
Create and maintain internal documentation and playbooks
Contribute to cloud cost-optimization initiatives
Plan reserved capacity and autoscaling design
Implement storage tiering and workload right-sizing
Continuously detect and address anomalies

Work Experience

5 years

Education

Bachelor's degree or Master's degree

Languages

English – Business Fluent

Tools & Technologies

Linux
Prometheus
Grafana
Loki
Thanos
OpenTelemetry
Python
Go
Bash
Kubernetes
AWS
EKS
EC2
VPC
Terraform
BGP
DNS
VXLAN
ISO 27001
NIST SP 800-53

Benefits

Career Advancement

Growth opportunities

Other Benefits

Major transaction exposure

Startup Environment

Work with talented team
Dynamic work environment
Get things done attitude

Learning & Development

Professional development

Informal Culture

International experience
Open communication culture

Nejo automatically captured this job from the website of 1GLOBAL and processed the information with the help of AI. Despite careful analysis, some information may be incomplete or inaccurate, so always verify all details in the original posting. Content and copyrights of the original posting belong to the advertising company.
