AI Research Engineer - Kernel & Inference Optimization (m/f/x)
Optimizing AI model serving pipelines for low-latency, high-throughput transactions. PhD in NLP/ML and custom compute shader experience required. Focus on Metal Shading Language (MSL) implementation.
Requirements
- Degree in Computer Science or related field
- PhD in NLP, Machine Learning, or related field
- Solid track record in AI R&D with publications
- Knowledge of Metal Shading Language (MSL)
- Comfortable writing custom compute shaders
- Proven experience in low-level kernel optimizations
- Proven experience in inference optimization on mobile devices
- Contributions leading to measurable improvements in inference latency, throughput, and memory footprint
- Deep understanding of modern model serving architectures
- Deep understanding of inference optimization techniques
- Strong expertise in writing GPU kernels for mobile devices
- Deep understanding of model serving frameworks and engines
- Practical experience in developing end-to-end inference pipelines
- Ability to apply empirical research to model serving challenges
- Proficient in designing robust evaluation frameworks
- Experience designing and optimizing high-performance inference engines
- Experience with Tensor Parallelism, Pipeline Parallelism, and Expert Parallelism
- Deep understanding of the math and structure of Diffusion Models
- Deep understanding of the math and structure of Vision Transformers
- Understanding of Pruning
- Understanding of Quantization
- Understanding of FlashAttention
- Understanding of KV Cache
- Understanding of Speculative Decoding (Eagle)
Tasks
- Drive innovation in model serving and inference architectures
- Optimize model deployment and inference strategies
- Design and deploy state-of-the-art model serving pipelines
- Ensure high throughput and low latency in model serving
- Optimize memory usage in model serving pipelines
- Establish clear performance targets for latency and memory
- Build and run controlled inference tests
- Monitor key performance indicators in production
- Document and validate performance across platforms
- Identify and prepare high-quality test datasets
- Set criteria for evaluating model performance
- Analyze computational efficiency and diagnose bottlenecks
- Address suboptimal batch processing and network delays
- Optimize serving infrastructure for scalability and reliability
- Integrate optimized serving frameworks into production
- Define success metrics for real-world performance
- Ensure continuous monitoring and iterative refinements
Work Experience
- approx. 4 - 6 years
Education
- Bachelor's degree
Languages
- English – Business Fluent
Tools & Technologies
- Metal Shading Language (MSL)
- Compute shaders
- GPU kernels
- Diffusion Models
- Vision Transformers
- Pruning
- Quantization
- FlashAttention
- KV Cache
- Speculative Decoding (Eagle)
- Tensor Parallelism
- Pipeline Parallelism
- Expert Parallelism
About the Company
Tether Operations Limited
Industry
Financial Services
Description
The company pioneers a global financial revolution with blockchain solutions, enabling secure and instant digital token transactions.