Aleph Alpha

Senior AI Engineer – Pre-training Data (m/w/x)

Heidelberg
Full-time, With Home Office, Senior
AI/ML
Data Science

Define and build systems for foundation model pre-training data at a European AI leader. High engineering competence and strong Python skills are required. Benefits include 30 days of vacation, hybrid work, and fitness offerings.

Requirements

  • Significant research experience (industry or academia)
  • High engineering competence
  • Track record of shipping impactful technical work
  • Strong Python skills
  • Comfort with data engineering and ML infrastructure
  • Experience with deep learning frameworks
  • Experience with workflow orchestration
  • Experience with object storage
  • Experience with columnar data formats
  • Experience with distributed processing
  • Ability to reason about dataset contribution to model training
  • Understanding of dataset relevance for model training
  • Ownership mentality
  • Willingness to relocate to Heidelberg
  • Willingness to travel at least fortnightly
  • Experience with large-scale data processing for ML
  • Experience with corpus sourcing
  • Experience with corpus curation
  • Experience with corpus cleaning
  • Experience with corpus deduplication
  • Experience with corpus filtering
  • Familiarity with data quality methods
  • Understanding of foundation model training
  • Understanding of data composition effects on capabilities
  • Understanding of scale effects on capabilities
  • Understanding of mixing-ratio effects on capabilities
  • Experience with web-scale data sourcing
  • Experience with crawl processing
  • Rust proficiency
  • Infrastructure knowledge
  • Experience with Kubernetes
  • Experience with container orchestration
  • Experience with cloud-native ML infrastructure
  • PhD in machine learning, NLP, data engineering, or related field (valued but not required)
  • German language proficiency (bonus, not required)

Tasks

  • Define data for model training
  • Build systems for data sourcing and preparation
  • Ensure high-quality data for the training team
  • Work on the full stack of data preparation
  • Analyze data quality and corpus value
  • Optimize large-scale data processing pipelines
  • Build tools for data visibility
  • Stay updated on pre-training data research
  • Design and run data experiments
  • Co-own data pipelines end-to-end
  • Design and maintain data infrastructure
  • Curate and compose data mixtures
  • Balance data domains, languages, and quality
  • Build data quality tooling
  • Develop classifiers and heuristics
  • Monitor pipeline health and data quality
  • Close data gaps
  • Identify and address model weaknesses
  • Collaborate with the post-training team
  • Support downstream fine-tuning and deployment
  • Ensure high-quality German-language data
  • Establish data-to-performance signal
  • Maintain data lineage and provenance

Work Experience

  • approx. 4 - 6 years

Education

  • Doctoral / PhD (valued but not required)

Languages

  • German: Basic

Tools & Technologies

  • Python
  • Deep learning frameworks
  • Workflow orchestration
  • Object storage
  • Columnar data formats
  • Distributed processing
  • Kubernetes
  • Container orchestration
  • Cloud-native ML infrastructure
  • Rust
  • Common Crawl
  • WARC pipelines

Benefits

Flexible Working

  • Flexible working hours
  • Hybrid working model

More Vacation Days

  • 30 days of paid vacation

Healthcare & Fitness

  • Fitness & wellness offerings

Mental Health Support

  • Mental health support

Retirement Plans

  • Subsidized company pension plan

Public Transport Subsidies

  • Subsidized Germany-wide transportation ticket

Additional Allowances

  • Budget for additional technical equipment

Competitive Pay

  • Virtual Stock Option Plan

Company Bike

  • Bike Lease
