Inside today’s Azure AI cloud data centers

Published on:

Azure CTO Mark Russinovich’s annual Azure infrastructure displays at Construct are at all times fascinating as he explores the previous, current, and way forward for the {hardware} that underpins the cloud. This 12 months’s discuss was no completely different, specializing in the identical AI platform touted in the remainder of the occasion.

Through the years it’s been clear that Azure’s {hardware} has grown more and more complicated. Firstly, it was a primary instance of utility computing, utilizing a single normal server design. Now it’s many alternative server sorts, in a position to assist all lessons of workloads. GPUs have been added and now AI accelerators.

That final innovation, launched in 2023, reveals how a lot Azure’s infrastructure has advanced together with the workloads it hosts. Russinovich’s first slide confirmed how rapidly fashionable AI fashions have been rising, from 110 million parameters with GPT in 2018, to over a trillion in right this moment’s GPT-4o. That development has led to the event of large distributed supercomputers to coach these fashions, together with {hardware} and software program to make them environment friendly and dependable.

- Advertisement -

Constructing the AI supercomputer

The size of the programs wanted to run these AI platforms is big. Microsoft’s first massive AI-training supercomputer was detailed in Could 2020. It had 10,000 Nvidia V100 GPUs and clocked in at quantity 5 within the world supercomputer rankings. Solely three years later, in November 2023, the most recent iteration had 14,400 H100 GPUs and ranked third.

In June 2024, Microsoft has greater than 30 related supercomputers in information facilities all over the world. Russinovich talked concerning the open supply Llama-3-70B mannequin, which takes 6.4 million GPU hours to coach. On one GPU that might take 730 years, however with one among Microsoft’s AI supercomputers, a coaching run takes roughly 27 days.

Coaching is just half the issue. As soon as a mannequin has been constructed, it must be used, and though inference doesn’t want supercomputer-levels of compute for coaching, it nonetheless wants plenty of energy. As Russinovich notes, a single floating-point parameter wants two bytes of reminiscence, a one-billion-parameter mannequin wants 2GB of RAM, and a 175-billion-parameter mannequin requires 350GB. That’s earlier than you add in any essential overhead, comparable to caches, which might add greater than 40% to already-hefty reminiscence necessities.

See also  The M4 iPad Pro's true potential will be realized at WWDC, and AI will have a lot to do with it

All which means that Azure wants plenty of GPUS with very particular traits to push by way of plenty of information as rapidly as attainable. Fashions like GPT-4 require important quantities of high-bandwidth reminiscence. Compute and reminiscence all want substantial quantities of energy. An Nvidia H100 GPU requires 700 watts, and with hundreds in operation at any time, Azure information facilities have to dump plenty of warmth.

- Advertisement -

Past coaching, design for inference

Microsoft has developed its personal inference accelerator within the form of its Maia {hardware}, which is pioneering a brand new directed-liquid cooling system, sheathing the Maia accelerators in a closed-loop cooling system that has required a complete new rack design with a secondary cupboard that accommodates the cooling gear’s warmth exchangers.

Designing information facilities for coaching has proven Microsoft tips on how to provision for inference. Coaching quickly ramps as much as 100% and holds there in the course of a run. Utilizing the identical energy monitoring on an inferencing rack, it’s attainable to see how energy draw varies at completely different factors throughout an inferencing operation.

Azure’s Challenge POLCA goals to make use of this data to extend efficiencies. It permits a number of inferencing operations to run on the similar time by provisioning for peak energy draw, giving round 20% overhead. That lets Microsoft put 30% extra servers in an information heart by throttling each server frequency and energy. The result’s a extra environment friendly and extra sustainable method to the compute, energy, and thermal calls for of an AI information heart.

Managing the information for coaching fashions brings its personal set of issues; there’s plenty of information, and it must be distributed throughout the nodes of these Azure supercomputers. Microsoft has been engaged on what it calls Storage Accelerator to handle this information, distributing it throughout clusters with a cache that determines if required information is offered domestically or whether or not it must be fetched, utilizing accessible bandwidth to keep away from interfering with present operations. Utilizing parallel reads to load information permits massive quantities of coaching information to be loaded virtually twice as quick as conventional file hundreds.

See also  AI is being used to identify and soothe stressed call center workers with photo montages

AI wants high-bandwidth networks

Compute and storage are necessary, however networking stays crucial, particularly with large data-parallel workloads working throughout many tons of of GPUs. Right here, Microsoft has invested considerably in high-bandwidth InfiniBand connections, utilizing 1.2TBps of inner connectivity in its servers, linking 8 GPUs, and on the similar time 400Gbps between particular person GPUs in separate servers.

Microsoft has invested so much in InfiniBand, each for its Open AI coaching supercomputers and for its customer support. Curiously Russinovich famous that “actually, the one distinction between the supercomputers we construct for OpenAI and what we make accessible publicly, is the size of the InfiniBand area. Within the case of OpenAI, the InfiniBand area covers your complete supercomputer, which is tens of hundreds of servers.” For different clients who don’t have the identical coaching calls for, the domains are smaller, however nonetheless at supercomputer scale, “1,000 to 2,000 servers in measurement, connecting 10,000 to twenty,000 GPUs.”

All that networking infrastructure requires some surprisingly low-tech options, comparable to 3D-printed sleds to effectively pull massive quantities of cables. They’re positioned within the cable cabinets above the server racks and pulled alongside. It’s a easy strategy to lower cabling occasions considerably, a necessity once you’re constructing 30 supercomputers each six months.

- Advertisement -

Making AI dependable: Challenge Forge and One Pool

{Hardware} is just a part of the Azure supercomputer story. The software program stack offers the underlying platform orchestration and assist instruments. That is the place Challenge Forge is available in. You possibly can consider it as an equal to one thing like Kubernetes, a means of scheduling operations throughout a distributed infrastructure whereas offering important useful resource administration and spreading hundreds throughout various kinds of AI compute.

The Challenge Forge scheduler treats all of the accessible AI accelerators in Azure as a single pool of digital GPU capability, one thing Microsoft calls One Pool. Masses have precedence ranges that management entry to those digital GPUs. A better-priority load can evict a lower-priority one, shifting it to a distinct class of accelerator or to a different area altogether. The intention is to offer a constant stage of utilization throughout your complete Azure AI platform so Microsoft can higher plan and handle its energy and networking finances.

See also  Safety off: Programming in Rust with `unsafe`

Like Kubernetes, Challenge Forge is designed to assist run a extra resilient service, detecting failures, restarting jobs, and repairing the host platform. By automating these processes, Azure can keep away from having to restart costly and sophisticated jobs, treating them as an alternative as a set of batches that may run individually and orchestrate inputs and outputs as wanted.

Consistency and safety: prepared for AI purposes

As soon as an AI mannequin has been constructed it must be used. Once more, Azure wants a means of balancing utilization throughout various kinds of fashions and completely different prompts inside these fashions. If there’s no orchestration (or lazy orchestration), it’s simple to get right into a place the place one immediate finally ends up blocking different operations. By benefiting from its digital, fractional GPUs, Azure’s Challenge Flywheel can assure efficiency, interleaving operations from a number of prompts throughout digital GPUs, permitting constant operations on the host bodily GPU whereas nonetheless offering a continuing throughput.

One other low-level optimization is confidential computing capabilities when coaching customized fashions. You possibly can run code and host information in trusted execution environments. Azure is now in a position to have full confidential VMs, together with GPUs, with encrypted messages between CPU and GPU trusted environments. You need to use this for coaching or securing your non-public information used for retrieval-augmented era.

From Russinovich’s presentation, it’s clear that Microsoft is investing closely in making its AI infrastructure environment friendly and responsive for coaching and inference. The Azure infrastructure and platform groups have put plenty of work into constructing out {hardware} and software program that may assist coaching the biggest fashions, whereas offering a safe and dependable place to make use of AI in your purposes.

Operating Open AI on Azure has given these groups plenty of expertise, and it’s good to see that have paying off in offering the identical instruments and strategies for the remainder of us—even when we don’t want our personal TOP500 supercomputers.

- Advertisment -


- Advertisment -

Leave a Reply

Please enter your comment!
Please enter your name here