Senior AI Engineer – Pre-training Data (m/f/x)
Defining and building systems for foundation model pre-training data at a European AI leader. High engineering competence and strong Python skills required. 30 days vacation, hybrid work, fitness offerings.
Requirements
- Significant research experience (industry or academia)
- High engineering competence
- Track record of shipping impactful technical work
- Strong Python skills
- Comfort with data engineering and ML infrastructure
- Experience with deep learning frameworks
- Experience with workflow orchestration
- Experience with object storage
- Experience with columnar data formats
- Experience with distributed processing
- Ability to reason about dataset contribution to model training
- Understanding of dataset relevance for model training
- Ownership mentality
- Willingness to relocate to Heidelberg
- Travel at least fortnightly
- Experience with large-scale data processing for ML
- Experience with corpus sourcing
- Experience with corpus curation
- Experience with corpus cleaning
- Experience with corpus deduplication
- Experience with corpus filtering
- Familiarity with data quality methods
- Understanding of foundation model training
- Understanding of data composition effects on capabilities
- Understanding of scale effects on capabilities
- Understanding of mixing-ratio effects on capabilities
- Experience with web-scale data sourcing
- Experience with crawl processing
- Rust proficiency
- Infrastructure knowledge
- Experience with Kubernetes
- Experience with container orchestration
- Experience with cloud-native ML infrastructure
- PhD in machine learning, NLP, data engineering, or related field (valued but not required)
- German language proficiency (bonus, not required)
Tasks
- Define data for model training
- Build systems for data sourcing and preparation
- Ensure high-quality data for the training team
- Work on the full stack of data preparation
- Analyze data quality and corpus value
- Optimize large-scale data processing pipelines
- Build tools for data visibility
- Stay updated on pre-training data research
- Design and run data experiments
- Co-own data pipelines end-to-end
- Design and maintain data infrastructure
- Curate and compose data mixtures
- Balance data domains, languages, and quality
- Build data quality tooling
- Develop classifiers and heuristics
- Monitor pipeline health and data quality
- Close data gaps
- Identify and address model weaknesses
- Collaborate with the post-training team
- Support downstream fine-tuning and deployment
- Ensure high-quality German-language data
- Establish a data-to-performance signal
- Maintain data lineage and provenance
Work Experience
- approx. 4–6 years
Education
- Doctoral / PhD (valued but not required)
Languages
- German – Basic (bonus, not required)
Tools & Technologies
- Python
- Deep learning frameworks
- Workflow orchestration
- Object storage
- Columnar data formats
- Distributed processing
- Kubernetes
- Container orchestration
- Cloud-native ML infrastructure
- Rust
- Common Crawl
- WARC pipelines
Benefits
Flexible Working
- Flexible working hours
- Hybrid working model
More Vacation Days
- 30 days of paid vacation
Healthcare & Fitness
- Fitness & wellness offerings
Mental Health Support
- Mental health support
Retirement Plans
- Subsidized company pension plan
Public Transport Subsidies
- Subsidized Germany-wide transportation ticket
Additional Allowances
- Budget for additional technical equipment
Competitive Pay
- Virtual Stock Option Plan
Company Bike
- Bike Lease
About the Company
Aleph Alpha
Industry
IT
Description
The company develops cutting-edge generative AI solutions with a strong emphasis on sovereignty, ethical development, and societal benefit.