1GLOBAL

Senior Site Reliability Engineer (SRE) (m/f/x)

Berlin
Full-time, On-site, Senior

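Own global infrastructure stability and reliability for a regulated telecom provider operating across 40 countries. Requires 5+ years of site reliability, systems, or infrastructure engineering experience, including 2+ years in a dedicated SRE role. Offers exposure to major transactions and a dynamic work environment.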

Requirements

  • 5+ years in Site Reliability, Systems, or Infrastructure Engineering (2+ years in a dedicated SRE role)
  • Strong expertise in Linux, distributed systems, and networking
  • Proven experience building and running high-availability production systems
  • Hands-on experience with redundancy and failover testing, disaster recovery (DR), and high-availability (HA) validation
  • Deep understanding of monitoring, observability, incident management
  • Experience with Prometheus, Grafana, Loki, Thanos, OpenTelemetry or similar
  • Proficiency in Python, Go, and Bash for automation
  • Strong knowledge of Kubernetes, container orchestration, service mesh
  • Experience with AWS (EKS, EC2, VPC) and on-premises integration
  • Proficiency in Infrastructure as Code tools like Terraform
  • Understanding of networking fundamentals (routing, load balancing, BGP, DNS, VXLAN)
  • Excellent analytical and problem-solving skills under pressure
  • Strong communication and collaboration skills across teams
  • Experience in telecom, carrier-grade, or large-scale distributed systems
  • Hands-on experience with chaos engineering and automated failure validation
  • Strong understanding of high-availability networking concepts
  • Background in capacity planning, traffic engineering, and multi-region failover
  • Experience building reliability dashboards and integrating SRE metrics
  • Familiarity with security and resilience standards (ISO 27001, NIST SP 800-53)

Tasks

  • Strengthen global infrastructure stability, scalability, and reliability
  • Proactively identify system weaknesses
  • Improve reliability through redundancy testing, automation, and observability
  • Mentor peers and set technical standards for reliability engineering
  • Define, measure, and maintain SLIs and SLOs
  • Plan and execute redundancy and resilience testing
  • Validate failover, HA configurations, and disaster recovery readiness
  • Design and implement automated recovery mechanisms
  • Create self-healing workflows and intelligent alerting systems
  • Drive incident response and root-cause analysis
  • Conduct blameless post-mortems
  • Implement and track corrective and preventive actions
  • Develop and enhance observability using Prometheus, Grafana, Loki, and OpenTelemetry
  • Ensure deployment safety, rollback policies, and configuration consistency
  • Identify weaknesses through fault-injection, load, and chaos testing
  • Reduce operational toil through automation and reliability tooling
  • Contribute to on-call practices
  • Improve alert quality, runbooks, and escalation procedures
  • Manage incident response processes
  • Perform capacity planning and performance benchmarking
  • Conduct resilience audits across systems
  • Ensure compliance with security, reliability, and availability standards
  • Create and maintain internal documentation and playbooks
  • Contribute to cloud cost-optimization initiatives
  • Plan reserved capacity and autoscaling design
  • Implement storage tiering and workload right-sizing
  • Continuously detect and address anomalies

Work Experience

  • 5 years

Education

  • Bachelor's degree, or
  • Master's degree

Languages

  • English: Business Fluent

Tools & Technologies

  • Linux
  • Prometheus
  • Grafana
  • Loki
  • Thanos
  • OpenTelemetry
  • Python
  • Go
  • Bash
  • Kubernetes
  • AWS
  • EKS
  • EC2
  • VPC
  • Terraform
  • BGP
  • DNS
  • VXLAN
  • ISO 27001
  • NIST SP 800-53

Benefits

Career Advancement

  • Growth opportunities

Other Benefits

  • Major transaction exposure

Startup Environment

  • Work with talented team
  • Dynamic work environment
  • Get-things-done attitude

Learning & Development

  • Professional development

Informal Culture

  • International experience
  • Open communication culture

Nejo automatically captured this job from the website of 1GLOBAL and processed the information with the help of AI. Despite careful analysis, some information may be incomplete or inaccurate; always verify all details in the original posting. Content and copyrights of the original posting belong to the advertising company.