Training Infrastructure Engineer at Fireworks AI building distributed systems for large-scale LLM and multimodal model training. Focus on infrastructure design, pipeline optimization, and performance across GPUs and data centers.
As a Training Infrastructure Engineer, you'll design, build, and optimize the infrastructure that powers our large-scale model training operations. Your work will be essential to developing high-performance AI training infrastructure. You'll collaborate with AI researchers and engineers to create robust training pipelines, optimize distributed training workloads, and ensure reliable model development.
Design and implement scalable infrastructure for large-scale model training workloads
Develop and maintain distributed training pipelines for LLMs and multimodal models
Optimize training performance across multiple GPUs, nodes, and data centers
Implement monitoring, logging, and debugging tools for training operations
Architect and maintain data storage solutions for large-scale training datasets
Automate infrastructure provisioning, scaling, and orchestration for model training
Collaborate with researchers to implement and optimize training methodologies
Analyze and improve efficiency, scalability, and cost-effectiveness of training systems
Troubleshoot complex performance issues in distributed training environments
Bachelor's degree in Computer Science, Computer Engineering, or related field, or equivalent practical experience
3+ years of experience with distributed systems and ML infrastructure
Experience with PyTorch
Proficiency in cloud platforms (AWS, GCP, Azure)
Experience with containerization, orchestration (Kubernetes, Docker)
Knowledge of distributed training techniques (data parallelism, model parallelism, FSDP)
Master's or PhD in Computer Science or related field
Experience training large language models or multimodal AI systems
Experience with ML workflow orchestration tools
Background in optimizing high-performance distributed computing systems
Familiarity with ML DevOps practices
Contributions to open-source ML infrastructure or related projects