
Microsoft Details How It’s Building AI Cloud Infrastructure

How is Microsoft building out its own cloud infrastructure to support the growing demands of AI? That’s a question that Microsoft Azure CTO Mark Russinovich answered in a session at this week’s Microsoft Build 2024 conference.

AI demand has required Microsoft to build out a massive AI infrastructure to power the next generation of large language models (LLMs). Russinovich said Microsoft has scaled its AI infrastructure by 30 times since November 2023. That scale isn’t just about more GPUs and servers — it’s also about more cables and back-end infrastructure to support that massive scale.

“Efficiency is the key to the game,” Russinovich said. “If you take a look at what’s happening, to challenge the ability for us to be efficient, you can see that the sizes of these frontier models have continued to grow, basically, exponentially.”

Where Microsoft Is Innovating Cloud AI Infrastructure

The rapid growth of model sizes is driving exponential increases in GPU performance, memory capacity, and power consumption. Russinovich said the latest Nvidia GPU with all its high-bandwidth memory and all its transistors consumes 1,200 watts, just for one GPU.

“We cannot push enough air through our data centers to cool these kinds of systems,” he admitted.

The solution is liquid cooling, with Microsoft’s custom Maia racks being the first deployments of liquid cooling in Azure data centers.

“Maia is our first step toward the new design of data centers in the cloud,” he said. “Maia is a liquid-cooled system.”

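Beyond cooling, Microsoft is also innovating in several other infrastructure areas to support larger scale. One is power oversubscription: provisioning more hardware than a facility’s power budget could feed if everything ran at peak at the same time, which lets Microsoft make fuller use of the power it has.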

Another area is high-speed networking, where Microsoft is running the InfiniBand interconnect, instead of Ethernet, to get better scale and performance. Additionally, Microsoft has built out its own custom storage accelerators to help get data in and out of models quickly.

Project Forge Is Microsoft’s AI Workload Platform

At the software layer, Microsoft has built Project Forge — an internal AI workload platform that treats all GPUs globally as an elastic resource pool.

Microsoft's Project Forge (Image: Microsoft)

“The idea with one pool is everybody gets a virtual GPU, not physical GPUs,” Russinovich explained. “If a premium job comes in and there’s a low priority job running on the GPUs it needs, low priority gets evicted.”
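The eviction behavior Russinovich describes can be illustrated with a toy scheduler. The sketch below is not Microsoft's implementation; the `Job` and `GpuPool` names, the numeric priority scale, and the "evict lowest priority first" rule are all assumptions made for illustration. It shows only the basic idea: jobs draw from one shared pool, and a higher-priority job can reclaim GPUs from lower-priority ones.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    priority: int      # assumed convention: higher number = more important
    gpus_needed: int


@dataclass
class GpuPool:
    """Hypothetical single pool of interchangeable (virtual) GPUs."""
    total_gpus: int
    running: list = field(default_factory=list)

    def free_gpus(self) -> int:
        return self.total_gpus - sum(j.gpus_needed for j in self.running)

    def submit(self, job: Job):
        """Try to start `job`. Returns (started, evicted_jobs)."""
        # Only strictly lower-priority jobs may be evicted, lowest first.
        victims = sorted(
            (j for j in self.running if j.priority < job.priority),
            key=lambda j: j.priority,
        )
        reclaimable = self.free_gpus() + sum(v.gpus_needed for v in victims)
        if reclaimable < job.gpus_needed:
            return False, []  # cannot fit even after evicting everything eligible

        evicted = []
        for victim in victims:
            if self.free_gpus() >= job.gpus_needed:
                break
            self.running.remove(victim)
            evicted.append(victim)

        self.running.append(job)
        return True, evicted


if __name__ == "__main__":
    pool = GpuPool(total_gpus=8)
    pool.submit(Job("batch-eval", priority=1, gpus_needed=6))
    started, evicted = pool.submit(Job("premium-train", priority=10, gpus_needed=4))
    print(started, [j.name for j in evicted])
```

In this toy version an evicted job is simply dropped from the running list. A production system would need to save and later resume that work, which the sketch does not attempt to model.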

Project Forge keeps utilization high by reassigning resources between jobs, Russinovich said, and the results so far have been positive.

“If you take a look in aggregate across all of Microsoft, for all of our first-party training, we’re getting over 95% utilization,” Russinovich said.

About the author

Sean Michael Kerner is an IT consultant, technology enthusiast and tinkerer. He consults to industry and media organizations on technology issues.

 
