Training trillion-parameter language models presents site reliability challenges that dwarf the complexity of traditional distributed systems. With training costs running into the millions of dollars and runs spanning months across thousands of GPUs, even minor infrastructure failures can translate into severe business impact and wasted compute. This presentation examines the SRE principles and practices essential for maintaining high uptime in distributed AI training environments. Drawing on real-world experience managing production AI infrastructure, it explores how frameworks such as NVIDIA's Megatron-LM, Microsoft's DeepSpeed, and UC Berkeley's Alpa introduce reliability challenges that traditional monitoring and alerting systems cannot adequately address. Key topics include implementing robust fault-tolerance mechanisms for multi-week training jobs, designing checkpointing strategies that balance recovery speed against storage cost, and building comprehensive observability pipelines for distributed training workloads. The session also examines how communication overhead across thousands of nodes creates cascading failure scenarios, and demonstrates monitoring techniques for detecting performance degradation before it affects model convergence. Practical SRE implementations are covered in depth, including automated failure recovery, capacity planning for massive memory requirements, energy-aware resource allocation, the operational complexity of managing heterogeneous GPU clusters, and the monitoring needed to track model performance alongside traditional infrastructure metrics. Attendees will leave with actionable SRE practices for large-scale AI infrastructure: incident response procedures for distributed training failures, reliability testing methodologies for AI workloads, and capacity forecasting techniques for rapidly evolving model architectures. These are the reliability challenges that determine whether organizations can successfully operate next-generation AI systems in production.
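To make the checkpointing trade-off mentioned above concrete, the sketch below uses the standard Young/Daly first-order approximation for the optimal checkpoint interval. It is not taken from the talk; the per-GPU MTBF, cluster size, and checkpoint write time are hypothetical numbers chosen only to illustrate how recovery cost and checkpoint overhead are balanced at scale.

```python
import math

def cluster_mtbf_hours(per_gpu_mtbf_hours: float, num_gpus: int) -> float:
    """Assuming independent, exponentially distributed GPU failures,
    the whole-job MTBF shrinks roughly linearly with cluster size."""
    return per_gpu_mtbf_hours / num_gpus

def optimal_checkpoint_interval_hours(ckpt_write_hours: float, mtbf_hours: float) -> float:
    """Young/Daly approximation: tau ~ sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * ckpt_write_hours * mtbf_hours)

# Illustrative assumptions only, not measured values:
per_gpu_mtbf = 50_000.0   # hours of MTBF for a single GPU
num_gpus = 4_096          # hypothetical training cluster size
ckpt_write = 0.1          # hours (~6 min) to write one full checkpoint

mtbf = cluster_mtbf_hours(per_gpu_mtbf, num_gpus)          # ~12.2 h between failures
tau = optimal_checkpoint_interval_hours(ckpt_write, mtbf)  # ~1.6 h between checkpoints

# Expected overhead: time spent writing checkpoints plus average lost work per failure.
overhead = ckpt_write / tau + tau / (2 * mtbf)
print(f"cluster MTBF ~ {mtbf:.1f} h, checkpoint every ~ {tau:.2f} h, "
      f"overhead ~ {overhead:.1%} of wall-clock time")
```

Under these assumed numbers the cluster fails roughly every 12 hours, so checkpointing about every 1.6 hours keeps the combined cost of checkpoint writes and recomputed work near its minimum; shorter intervals pay more in storage and write stalls, longer intervals pay more in lost progress after each failure.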
Anjan Dash is an accomplished Tech Lead at Meta Platforms Inc., where he specializes in AI/ML infrastructure and privacy-preserving machine learning systems for ad recommendations. With over 15 years of experience in software engineering, he has driven significant business impact, including a 1%+ uplift in ads revenue, approximately $1 billion incremental, through innovative model architecture and infrastructure improvements. Before joining Meta, Anjan served as Lead Software Engineer at Amazon Web Services, where he launched the AWS SageMaker Ground Truth service, reducing AI training data labeling costs by up to 70%. His background spans financial services, healthcare technology, and enterprise software, with notable achievements including leading a legacy mainframe migration that generated $3 million in annual savings. Anjan holds a postgraduate degree in Software Enterprise Management from the Indian Institute of Management, Bangalore, and a Master's in Computer Applications from the National Institute of Technology, Bhopal. He received the Scott Cook Innovation Award at Intuit and secured the All-India 2nd rank in the MCA entrance examination. Based in Dublin, California, he continues to drive innovation in distributed systems, cloud computing, and AI/ML platforms.