Member of Technical Staff - ML Infrastructure Engineer (m/f/x)
Designing and deploying ML infrastructure for generative AI models, supporting multi-week training runs and production inference. Experience building and managing ML infrastructure at scale required. Travel costs covered, with on-call for failed training runs.
Requirements
- Built and managed ML infrastructure at scale
- Understanding of supporting AI research infrastructure
- Experience being paged for failed training runs
- Debugging storage bottlenecks
- Infrastructure for long-term experiments
- Strong proficiency in cloud platforms (AWS, Azure, GCP)
- Focus on ML/AI services
- Extensive Kubernetes experience
- Extensive Slurm cluster management experience
- Expertise in Infrastructure as Code tools
- Discipline to use IaC tools
- Managing network-based cloud file systems for ML
- Optimizing network-based cloud file systems for ML
- Managing object storage for ML workloads
- Optimizing object storage for ML workloads
- Experience with CI/CD tools in ML contexts
- Experience with CI/CD practices in ML contexts
- Strong understanding of cloud security principles
- Strong understanding of cloud security best practices
- Experience with monitoring tools
- Experience with observability tools
- Familiarity with ML workflows
- Familiarity with GPU infrastructure management
- Understanding researcher needs for ML infrastructure
- Handling complex migrations in production
- Handling breaking changes in production
- Experience building custom autoscaling solutions for ML
- Knowledge of cost optimization strategies for ML infrastructure
- Familiarity with MLOps practices
- Familiarity with MLOps tools
- Experience with high-performance computing (HPC)
- Understanding data versioning for ML
- Understanding experiment tracking for ML
- Knowledge of network optimization for distributed ML
- Experience with multi-cloud architectures
- Experience with hybrid cloud architectures
- Familiarity with container security
- Familiarity with vulnerability scanning tools
Tasks
- Design, deploy, and maintain ML infrastructure
- Support multi-week training runs and production inference
- Implement cloud-based ML training and inference clusters
- Manage network-based cloud file systems and blob/S3 storage
- Develop and maintain Infrastructure as Code (IaC)
- Optimize CI/CD pipelines for ML workflows
- Design custom autoscaling solutions for ML workloads
- Ensure security best practices in ML infrastructure
- Provide developer-friendly ML operations tools
Work Experience
- approx. 4-6 years
Education
- Bachelor's degree or Master's degree
Languages
- English – Business Fluent
Tools & Technologies
- AWS
- Azure
- GCP
- Kubernetes
- Slurm
- Terraform
- Ansible
- CircleCI
- GitHub Actions
- ArgoCD
- Prometheus
- Grafana
- Loki
Benefits
Additional Allowances
- Reasonable travel costs covered
About the Company
Black Forest Labs
Industry
IT
Description
The company advances generative deep learning for media, creating models that transform ideas into images and videos.