Computex Nvidia unveiled its latest party trick at Computex in Taipei: stitching together 256 Grace-Hopper superchips into an "AI supercomputer" using nothing but NVLink.
The kit, dubbed the DGX GH200, is being offered as a single system tuned for memory-intensive AI models for natural language processing (NLP), recommender systems, and graph neural networks.
In a press briefing ahead of CEO Jensen Huang's keynote, executives compared the GH200 to the biz's recently launched DGX H100 server, claiming up to 500x higher performance. However, the two are nothing alike. The DGX H100 is an 8U system with dual Intel Xeons, eight H100 GPUs, and about as many NICs. The DGX GH200 is a 24-rack cluster built on an all-Nvidia architecture — so not exactly comparable.
At the heart of this super-system is Nvidia's Grace-Hopper chip. Unveiled at its GTC event in March 2022, the hardware blends a 72-core Arm-compatible Grace CPU and 512GB of LPDDR5X memory with an 80GB GH100 Hopper GPU die, using the company's 900GB/s NVLink-C2C interconnect. If you want a refresher on this next-gen compute architecture, check out our sister site The Next Platform for a deeper dive into that silicon.
According to Nvidia VP of Accelerated Computing Ian Buck, the DGX GH200 features 16 compute racks, each with 16 nodes equipped with a single superchip; the balance of the 24 racks is presumably given over to the NVLink switch fabric and networking. In total, the DGX GH200 platform boasts 18,432 CPU cores, 256 GPUs, and a claimed 144TB of "unified" memory.
At first blush, that's great news for those looking to run very large models that have to be held in memory. As we've previously reported, LLMs need a lot of memory, but in this case that 144TB figure may be stretching the truth a bit. Only about 20TB of it is the super-speedy HBM3 that's typically used to store model parameters. The other 124TB is DRAM.
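As a quick back-of-envelope check, here's the per-node math in Python, using the per-superchip figures cited above. The raw sums don't land exactly on Nvidia's 144TB, so treat the totals as ballpark rather than gospel:

```python
# Back-of-envelope check on the DGX GH200's memory claims, using the
# per-superchip figures cited above (80GB HBM3 + 512GB LPDDR5X). The raw
# sums slightly overshoot Nvidia's 144TB "unified" figure, so presumably
# not every byte is exposed to the pool.
NODES = 256
HBM3_PER_NODE_GB = 80       # on-package HBM3 attached to the GH100 GPU die
LPDDR5X_PER_NODE_GB = 512   # DRAM hanging off the Grace CPU

hbm3_tb = NODES * HBM3_PER_NODE_GB / 1000        # ~20.5 TB of fast HBM3
lpddr5x_tb = NODES * LPDDR5X_PER_NODE_GB / 1000  # ~131 TB of slower LPDDR5X

print(f"HBM3 total:    {hbm3_tb:.1f} TB")
print(f"LPDDR5X total: {lpddr5x_tb:.1f} TB")
print(f"Combined:      {hbm3_tb + lpddr5x_tb:.1f} TB")
```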
In situations where a workload can't fit within the GPU's vRAM, it typically ends up spilling over to the much slower DRAM, which is further bottlenecked by the need to copy data over a PCIe interface. This, obviously, isn't great for performance. But it appears Nvidia is getting around this limitation by using a combination of very fast LPDDR5X memory, good for half a terabyte per second of bandwidth, and NVLink rather than PCIe.
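To put numbers on why that matters, here's a minimal sketch comparing the spillover paths. The LPDDR5X and NVLink-C2C figures come from the specs above; the PCIe 5.0 x16 value is the usual ~64GB/s per-direction ballpark, which is our assumption rather than an Nvidia figure:

```python
# Rough comparison of the routes spilled data can take between GPU and CPU
# memory. LPDDR5X and NVLink-C2C figures are from the article; the PCIe 5.0
# x16 value is the usual ~64GB/s per-direction ballpark, our assumption.
paths_gbps = {
    "PCIe 5.0 x16 (conventional spillover)": 64,
    "LPDDR5X (Grace's local DRAM)": 500,
    "NVLink-C2C (Grace to Hopper)": 900,
}
baseline = paths_gbps["PCIe 5.0 x16 (conventional spillover)"]
for path, bw in paths_gbps.items():
    print(f"{path}: {bw} GB/s ({bw / baseline:.0f}x PCIe)")
```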
Speaking at the COMPUTEX 2023 conference in Taiwan today, Nvidia boss Jensen Huang compared Grace-Hopper to his company's H100 mega-GPU. He conceded that the H100 has more raw power than Grace-Hopper, but pointed out that Grace-Hopper has more memory than the H100, so is more efficient and therefore more applicable to many datacenters.
"Plug this into your DC and you can scale out AI," he said.
Gluing it all together
On that matter, Nvidia isn't just using NVLink for GPU-to-GPU communications, it's also using it to lash together the system's 256 nodes. According to Nvidia, this will allow very large language models (LLMs) to spread across all 256 nodes while avoiding network bottlenecks.
The downside to using NVLink is that, at least for now, it can't scale beyond 256 nodes. That means for bigger clusters you're still going to be using something like InfiniBand or Ethernet — more on that later.
Despite this limitation, Nvidia is still claiming quite substantial speedups for a variety of workloads, including natural language processing, recommender systems, and graph neural networks, compared to a cluster of more conventional DGX H100s connected by InfiniBand.
In total, Nvidia says a single DGX GH200 cluster is capable of delivering peak performance of about an exaflop, albeit at AI-friendly low precision. In a pure HPC workload, performance is going to be far less: Nvidia's head of accelerated computing estimates peak FP64 performance at about 17.15 petaflops when taking advantage of the GPU's tensor cores.
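Divide those cluster-level claims by the 256 GPUs and they land close to Hopper's published per-GPU specs. This is our arithmetic, not a figure from Nvidia's slides:

```python
# Sanity check: divide the cluster-level claims by 256 GPUs. The results
# sit close to Hopper's published per-GPU numbers (67 TFLOPS FP64 on the
# tensor cores, ~3.96 PFLOPS sparse FP8) -- our comparison, not Nvidia's.
GPUS = 256
fp8_cluster_pflops = 1000     # "about an exaflop" of AI compute
fp64_cluster_pflops = 17.15   # Nvidia's FP64 tensor-core estimate

print(f"Implied FP8 per GPU:  {fp8_cluster_pflops / GPUS * 1000:,.0f} TFLOPS")  # ~3,906
print(f"Implied FP64 per GPU: {fp64_cluster_pflops / GPUS * 1000:.0f} TFLOPS")  # ~67
```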
If the company can achieve a reasonable fraction of that in the LINPACK benchmark, a single DGX GH200 cluster would place among the 50 fastest supercomputers.
Thermals dictate design
Nvidia didn't address our questions on thermal management or power consumption, but given the cluster's compute density and intended audience, we're almost certainly looking at an air-cooled system.
Even without going to liquid or immersion cooling, something the company is looking into, Nvidia could have made the cluster far more compact.
At Computex last year, the company showed off a 2U HGX reference design with dual Grace-Hopper superchip blades. Using those chassis, Nvidia could have packed all 256 chips into just eight racks, as the quick sum below shows.
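Here's that density math, assuming standard 42U racks (the rack height is our assumption; the reference design didn't say):

```python
# Rack-space check on the eight-rack claim: two superchips per 2U HGX
# chassis, packed into standard 42U racks (rack height is our assumption).
CHIPS = 256
CHIPS_PER_CHASSIS = 2
CHASSIS_U = 2
RACKS = 8

chassis = CHIPS // CHIPS_PER_CHASSIS   # 128 chassis
total_u = chassis * CHASSIS_U          # 256U of rack space overall
print(f"{total_u}U total, or {total_u // RACKS}U per rack")  # 32U: fits a 42U rack
```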
We suspect Nvidia shied away from this due to datacenter power and cooling limitations. Remember, Nvidia's customers still have to deploy the cluster in their own datacenters, and if they need to make major infrastructure changes to do so, it's going to be a tough sell.
Nvidia's Grace-Hopper chip alone requires about a kilowatt of power, so without factoring in motherboard and networking consumption, you're looking at cooling roughly 16 kilowatts per rack just for the compute. That's already going to be a lot for many datacenter operators used to cooling 6-10 kilowatt racks, but at least it's within the realm of reason.
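A minimal sketch of that power math, with networking, motherboards, and cooling overhead left out per the caveat above:

```python
# Power density sketch from the figures above: ~1kW per Grace-Hopper chip,
# 16 chips per rack, 16 compute racks. Networking, motherboards, and fans
# are excluded, per the article's caveat.
KW_PER_CHIP = 1.0
CHIPS_PER_RACK = 16
COMPUTE_RACKS = 16

rack_kw = KW_PER_CHIP * CHIPS_PER_RACK   # ~16kW per rack, compute only
cluster_kw = rack_kw * COMPUTE_RACKS     # ~256kW across the cluster
print(f"Per rack: ~{rack_kw:.0f}kW vs the 6-10kW many DCs are built for")
print(f"Cluster compute draw: ~{cluster_kw:.0f}kW before networking overhead")
```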
Considering the cluster is being sold as a single unit, we suspect the sorts of customers eyeing the DGX GH200 will have taken thermal management and power consumption into account. According to Nvidia, Meta, Microsoft, and Google are already fielding the clusters, with general availability slated for before the end of 2023.
Jensen Huang launching the DGX GH200 on stage at Computex 2023. Yes, we should have brought a better camera
Scaling out with Helios
We mentioned earlier that in order to scale the DGX GH200 beyond 256 nodes, customers would need to resort to more conventional networking tech, and that's exactly what Nvidia aims to demonstrate with its upcoming Helios "AI supercomputer."
While details are quite thin at this point, it looks like Helios is essentially just four DGX GH200 clusters glued together using the company's 400Gbps Quantum-2 InfiniBand switches.
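Assuming linear scaling across the InfiniBand links (optimistic, but it's Nvidia's framing), the FP64 estimate quoted earlier suggests where the headline figure below comes from:

```python
# Where Helios's headline FP64 number appears to come from: four DGX GH200
# clusters at Nvidia's 17.15 PFLOPS FP64 tensor estimate apiece, assuming
# no scaling losses over the InfiniBand fabric between them.
GH200_CLUSTERS = 4
FP64_PFLOPS_EACH = 17.15

print(f"Helios peak FP64: ~{GH200_CLUSTERS * FP64_PFLOPS_EACH:.1f} PFLOPS")  # ~68.6
```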
While we're talking switches, at COMPUTEX Huang announced the Spectrum-4, a colossal switch that marries Ethernet and InfiniBand, alongside a 400Gbps BlueField-3 SmartNIC. Huang said the switch and the new SmartNIC will together allow AI traffic to flow through datacenters and bypass the CPU, avoiding bottlenecks along the way. The Register will seek more details as they become available.
Helios is expected to come online by the end of the year. And while Nvidia is emphasizing its AI performance in FP8, the system should be able to deliver peak FP64 performance somewhere in the neighborhood of 68 petaflops. That would put it roughly on par with France's Adastra system, which, as of last week, holds the No. 12 spot on the Top500 ranking. ®
– With Simon Sharwood.