From www.pcmag.com
Nvidia’s upcoming Blackwell GPUs for AI computing may face further delays because they’re prone to overheating when connected to each other on server racks, The Information reports.
The issue has reportedly been traced to the server rack Nvidia designed for Blackwell, which can connect up to 72 GPUs at a time. Nvidia has repeatedly redesigned the racks, which could delay GPU server shipments and the opening of new data centers for Google, Microsoft, and Meta.
In August, a previous report suggested that a “design flaw” had caused the Blackwell GPUs’ launch to be delayed by months. It’s unclear whether this flaw is the server rack design issue. Nvidia announced Blackwell in March and initially said the GPUs could ship as soon as Q2 2024 before it encountered challenges.
Nvidia indirectly addressed the server rack problem in a statement to Reuters. “Nvidia is working with leading cloud service providers as an integral part of our engineering team and process. The engineering iterations are normal and expected,” a company spokesperson said, suggesting a new server design could be on the horizon.
Overheating is a common cause of performance issues for GPUs, which require a lot of energy to operate. The crypto mining industry, like AI, also uses a ton of energy, produces a lot of heat, and relies on large numbers of interconnected GPUs or mining rigs. Crypto miners sometimes use immersion cooling, where the rigs are submerged in liquid, to prevent overheating.
And the more powerful a GPU, the more heat it can produce. While tech advancements can sometimes bring greater energy efficiency, this typically isn’t enough to offset the increased energy needs overall. The Blackwell AI chips can be up to 30 times faster than previous GPUs, according to Nvidia.
Training and running generative AI models at scale requires a lot of energy, too, as well as water to cool these systems. This has led some experts to predict that AI data centers may face power shortages as soon as next year. AI firms aren’t able to add new power sources to the grid as quickly as they can add data centers, and they aren’t necessarily willing to wait, either.
Meta, Microsoft, and Google have recently turned to nuclear power to meet their rising energy needs. However, “power purchase agreements” don’t necessarily solve AI’s energy problems.
Nvidia has seen its stock soar over 180% in the past year due to the AI surge and resulting spike in chip demand, while rival AMD recently began mass layoffs.
About Kate Irwin
Reporter
The post Nvidia’s Delayed Blackwell AI Chips Are Overheating in Servers first appeared on www.pcmag.com