1GLOBAL

Posted last month

Senior Site Reliability Engineer (SRE) (m/f/x)

Berlin

Full-time · On-site · Senior

Nejo AI summary

Global infrastructure stability and reliability for a regulated telecom provider across 40 countries. 5+ years SRE experience required. Exposure to major transactions, dynamic work environment.

Requirements

5+ years Site Reliability, Systems, or Infrastructure Engineering (2+ years dedicated SRE)
Strong expertise in Linux, distributed systems, networking
Proven experience building/running high-availability production systems
Hands-on experience with redundancy, failover testing, DR, HA validation
Deep understanding of monitoring, observability, incident management
Experience with Prometheus, Grafana, Loki, Thanos, OpenTelemetry or similar
Proficiency in Python, Go, Bash for automation
Strong knowledge of Kubernetes, container orchestration, service mesh
Experience with AWS (EKS, EC2, VPC) and on-premises integration
Proficiency in Infrastructure as Code tools like Terraform
Understanding of networking fundamentals (routing, load balancing, BGP, DNS, VXLAN)
Excellent analytical and problem-solving skills under pressure
Strong communication and collaboration skills across teams
Experience in telecom, carrier-grade, or large-scale distributed systems
Hands-on experience with chaos engineering and automated failure validation
Strong understanding of high-availability networking concepts
Background in capacity planning, traffic engineering, multi-region failover
Experience building reliability dashboards and integrating SRE metrics
Familiarity with security and resilience standards (ISO 27001, NIST SP 800-53)

Responsibilities

Strengthen global infrastructure stability, scalability, and reliability
Proactively identify system weaknesses
Improve reliability through redundancy testing, automation, and observability
Mentor peers and set technical standards for reliability engineering
Define, measure, and maintain SLIs and SLOs
Plan and execute redundancy and resilience testing
Validate failover, HA configurations, and disaster recovery readiness
Design and implement automated recovery mechanisms
Create self-healing workflows and intelligent alerting systems
Drive incident response and root-cause analysis
Conduct blameless post-mortems
Implement and track corrective and preventive actions
Develop and enhance observability using Prometheus, Grafana, Loki, and OpenTelemetry
Ensure deployment safety, rollback policies, and configuration consistency
Identify weaknesses through fault-injection, load, and chaos testing
Reduce operational toil through automation and reliability tooling
Contribute to on-call practices
Improve alert quality, runbooks, and escalation procedures
Manage incident response processes
Perform capacity planning and performance benchmarking
Conduct resilience audits across systems
Ensure compliance with security, reliability, and availability standards
Create and maintain internal documentation and playbooks
Contribute to cloud cost-optimization initiatives
Plan reserved capacity and autoscaling design
Implement storage tiering and workload right-sizing
Continuously detect and address anomalies

Work experience

5 years

Education

Bachelor's degree or Master's degree

Languages

English: business fluent

Tools & Technologies

Linux
Prometheus
Grafana
Loki
Thanos
OpenTelemetry
Python
Go
Bash
Kubernetes
AWS
EKS
EC2
VPC
Terraform
BGP
DNS
VXLAN
ISO 27001
NIST SP 800-53

Benefits

Career and professional development

Growth opportunities

Other benefits

Major transaction exposure

Startup atmosphere

Work with a talented team
Dynamic work environment
Get-things-done attitude

Training opportunities

Professional development

Relaxed company culture

International experience
Open communication culture

Nejo automatically collected this job from the 1GLOBAL company website and prepared the information with the help of AI. Despite careful analysis, individual details may be incomplete or inaccurate, so always check all details against the original posting. The content and copyright of the original posting belong to the hiring company.


