1GLOBAL

Senior Site Reliability Engineer (SRE) (m/f/x)

Berlin
Full-time, on-site, senior

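Strengthen global infrastructure stability and reliability for a regulated telecom provider operating across 40 countries. The role requires 5+ years of experience in site reliability, systems, or infrastructure engineering, including 2+ years in a dedicated SRE role, and offers exposure to major transactions in a dynamic work environment.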

Requirements

  • 5+ years in site reliability, systems, or infrastructure engineering (2+ years in a dedicated SRE role)
  • Strong expertise in Linux, distributed systems, networking
  • Proven experience building/running high-availability production systems
  • Hands-on experience with redundancy, failover testing, DR, HA validation
  • Deep understanding of monitoring, observability, incident management
  • Experience with Prometheus, Grafana, Loki, Thanos, OpenTelemetry or similar
  • Proficiency in Python, Go, Bash for automation
  • Strong knowledge of Kubernetes, container orchestration, service mesh
  • Experience with AWS (EKS, EC2, VPC) and on-premises integration
  • Proficiency in Infrastructure as Code tools like Terraform
  • Understanding of networking fundamentals (routing, load balancing, BGP, DNS, VXLAN)
  • Excellent analytical and problem-solving skills under pressure
  • Strong communication and collaboration skills across teams
  • Experience in telecom, carrier-grade, or large-scale distributed systems
  • Hands-on experience with chaos engineering and automated failure validation
  • Strong understanding of high-availability networking concepts
  • Background in capacity planning, traffic engineering, multi-region failover
  • Experience building reliability dashboards and integrating SRE metrics
  • Familiarity with security and resilience standards (ISO 27001, NIST SP 800-53)

Responsibilities

  • Strengthen global infrastructure stability, scalability, and reliability
  • Proactively identify system weaknesses
  • Improve reliability through redundancy testing, automation, and observability
  • Mentor peers and set technical standards for reliability engineering
  • Define, measure, and maintain SLIs and SLOs
  • Plan and execute redundancy and resilience testing
  • Validate failover, HA configurations, and disaster recovery readiness
  • Design and implement automated recovery mechanisms
  • Create self-healing workflows and intelligent alerting systems
  • Drive incident response and root-cause analysis
  • Conduct blameless post-mortems
  • Implement and track corrective and preventive actions
  • Develop and enhance observability using Prometheus, Grafana, Loki, and OpenTelemetry
  • Ensure deployment safety, rollback policies, and configuration consistency
  • Identify weaknesses through fault-injection, load, and chaos testing
  • Reduce operational toil through automation and reliability tooling
  • Contribute to on-call practices
  • Improve alert quality, runbooks, and escalation procedures
  • Manage incident response processes
  • Perform capacity planning and performance benchmarking
  • Conduct resilience audits across systems
  • Ensure compliance with security, reliability, and availability standards
  • Create and maintain internal documentation and playbooks
  • Contribute to cloud cost-optimization initiatives
  • Plan reserved capacity and design autoscaling
  • Implement storage tiering and workload right-sizing
  • Continuously detect and address anomalies

Work experience

  • 5 years

Education

  • Bachelor's degree OR
  • Master's degree

Languages

  • English: business fluent

Tools & Technologies

  • Linux
  • Prometheus
  • Grafana
  • Loki
  • Thanos
  • OpenTelemetry
  • Python
  • Go
  • Bash
  • Kubernetes
  • AWS
  • EKS
  • EC2
  • VPC
  • Terraform
  • BGP
  • DNS
  • VXLAN
  • ISO 27001
  • NIST SP 800-53

Benefits

Career and professional growth

  • Growth opportunities

Other benefits

  • Major transaction exposure

Startup atmosphere

  • Work with a talented team
  • Dynamic work environment
  • Get-things-done attitude

Training opportunities

  • Professional development

Relaxed company culture

  • International experience
  • Open communication culture
Nejo automatically collected this job from the website of the company 1GLOBAL and prepared the information with the help of AI. Despite careful analysis, individual details may be incomplete or inaccurate. Always check all details against the latest version of the original posting. Content and copyright of the original posting belong to the advertising company.