Senior AI Engineer – Pre-training Data (m/w/x)
Defining and preparing large-scale data for foundation model pre-training in finance, manufacturing, and public administration. Strong Python, data engineering, and ML infrastructure experience required. 30 days vacation, hybrid work, flexible hours.
Requirements
- Track record of shipping impactful technical work
- Strong Python skills
- Comfort with data engineering and ML infrastructure
- Experience with deep learning frameworks
- Experience with workflow orchestration
- Experience with object storage
- Experience with columnar data formats
- Experience with distributed processing
- Ability to reason about dataset contribution to model training
- Ownership mentality
- Willingness to relocate to Heidelberg
- Willingness to travel at least once every two weeks
- Experience with large-scale data processing for ML
- Experience with corpus sourcing, curation, cleaning, deduplication, and filtering
- Familiarity with data quality methods
- Understanding of foundation model training
- Experience with web-scale data sourcing
- Experience with crawl processing
- Rust proficiency
- Infrastructure knowledge
- Experience with Kubernetes
- Experience with container orchestration
- Experience with cloud-native ML infrastructure
- PhD in machine learning, NLP, data engineering, or related field (valued but not required)
- German language proficiency (helpful but not required)
Tasks
- Define data for model inputs
- Build data sourcing and preparation systems
- Ensure high-quality data for training
- Analyze data quality and corpus value
- Optimize large-scale data processing pipelines
- Develop tools for data visibility
- Stay updated on pre-training data research
- Design and run data experiments
- Co-own end-to-end data pipelines
- Design and maintain data infrastructure
- Curate and iterate on data mixtures
- Balance data domains, languages, and quality
- Build data quality classifiers and heuristics
- Monitor pipeline health and data metrics
- Identify and address data coverage gaps
- Collaborate with post-training teams
- Ensure German-language data coverage
- Establish data-to-performance signals
- Maintain data lineage and provenance
Work Experience
- approx. 4–6 years
Education
- Doctoral / PhD (valued but not required)
Languages
- German – Basic
Tools & Technologies
- Python
- deep learning frameworks
- workflow orchestration
- object storage
- columnar data formats
- distributed processing
- Kubernetes
- container orchestration
- cloud-native ML infrastructure
- Common Crawl
- WARC pipelines
- Rust
Benefits
Flexible Working
- Flexible working hours
- Hybrid working model
More Vacation Days
- 30 days of paid vacation
Healthcare & Fitness
- Fitness & wellness offerings
Mental Health Support
- Mental health support
Retirement Plans
- Subsidized company pension plan
Public Transport Subsidies
- Subsidized Germany-wide transportation ticket
Additional Allowances
- Budget for additional technical equipment
Competitive Pay
- Virtual Stock Option Plan
Company Bike
- Bike Lease
About the Company
Aleph Alpha
Industry
IT
Description
The company develops cutting-edge generative AI solutions with a strong emphasis on sovereignty, ethical development, and societal benefit.