A couple of interesting factors driving liquid cooling in the commercial market (it has always been prominent in the Top500), and why I think this time it's different.
First, we are designing smaller transistors, which leads to more leakage and thus higher chip thermal design power (TDP). We are also using bigger chips and chiplets, which further increases TDP.
TDP on a GPU went from 200 W to 1 kW in the past 5 years, and even CPUs are now close to 500 W.
At $25k a pop, avoiding thermal throttling is a must, and that is becoming incredibly hard with air cooling, since air is an excellent insulator.
Second, networking is extremely expensive, with top-end InfiniBand cable costing close to $1,000 a foot. This means you want to cram your CPUs/GPUs as close together as possible to keep connections electrical where you can and minimize optical cable lengths. This also decreases latency and increases cluster performance. On a 50,000-cable deployment, the savings can be … significant.
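To put rough numbers on that last point, here is a quick back-of-envelope sketch in Python. Only the ~$1,000/ft and 50,000-cable figures come from above; the average cable lengths are made-up assumptions for illustration.

    # Back-of-envelope: cabling savings from packing racks tighter (liquid-cooled density).
    COST_PER_FOOT = 1_000      # USD per foot, top-end InfiniBand cable (figure from above)
    NUM_CABLES = 50_000        # deployment size (figure from above)

    avg_len_air_ft = 10.0      # assumed: racks spread out for airflow
    avg_len_dense_ft = 3.0     # assumed: racks packed tightly thanks to liquid cooling

    savings = (avg_len_air_ft - avg_len_dense_ft) * COST_PER_FOOT * NUM_CABLES
    print(f"Rough cabling savings: ${savings:,.0f}")  # ~$350,000,000 with these assumptions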
Third, we are now seeing the emergence of foundation models that require government-lab levels of interconnected compute. This is a new paradigm in the private sector, and these clusters consume power in the 20-30 MW range. An efficiency gain of roughly 10% (going from a PUE of 1.2 to 1.1) brings huge savings on the power bill plus important environmental benefits. It also becomes much easier to recover the heat.
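A similarly rough sketch of what that PUE improvement is worth, using the figures above plus an assumed electricity price:

    # What a 1.2 -> 1.1 PUE improvement is worth on a ~25 MW IT load.
    it_load_mw = 25.0                    # middle of the 20-30 MW range above
    pue_air, pue_liquid = 1.2, 1.1
    price_per_mwh = 60.0                 # USD/MWh, assumed industrial rate

    saved_mw = it_load_mw * (pue_air - pue_liquid)   # 2.5 MW of overhead avoided, continuously
    annual_savings = saved_mw * 24 * 365 * price_per_mwh
    print(f"Avoided overhead: {saved_mw:.1f} MW, ~${annual_savings:,.0f}/year")  # ~$1,314,000/year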
That's why we're seeing major announcements at OCP, e.g. Meta going the liquid-cooling route.
I personally believe that's where things are going, and we are building a facility optimized for the points above: https://www.qscale.com/
QScale (www.qscale.com) recovers heat for 100+ football fields of greenhouses in North America.
And FYI, a properly designed liquid-cooling facility is cheaper and more efficient than an air-cooled one. Hopefully everything will switch to liquid cooling, just like car engines are all liquid cooled!
Liquid cooling also enables heat recovery and free cooling all year long.
Here's a project using liquid cooling to recover energy from data centers to heat greenhouses.
https://www.qscale.com/
My point was more around addressing global warming from the perspective of mitigation rather than prevention. There seem to be a lot of reports indicating that our opportunity for preventing global warming has passed or will soon pass [0][1][2]. So if that's the case, it seems pretty important to work on solutions to survive the effects of global warming instead.
I've been seeing studies claiming that the point of no return has been reached ever since the release of An Inconvenient Truth.
I don't doubt climate change, but I'm very skeptical of studies positing a slippery slope of catastrophic proportions. It's true we don't know the cascading effects of increased CO2 and methane emissions, but that doesn't mean the unknown is apocalyptic.
That may well be a factor, but there are many examples of countries without firewalls, censorship, etc. that are equally 'unamericanised', so it's unlikely to be the only factor.
I took some of the material on HN and put it into a Blogger blog. The blog is quite rough at the moment, as I have not spent much time editing the content.
If you are interested in reading it and possibly sharing it with others, here is the url: http://doanhdo.blogspot.com/
I will write more posts in the near future on the topics that I mentioned earlier.
I think this is a good case of a Black Swan (it would be highly improbable for the same problem to hit all the companies in a diversified portfolio, but it could happen, à la Heartbleed).
There's an extremely insightful book written on the subject (The Black Swan by Nassim Taleb), which I do recommend.
Discounted cash flow = bullshit used to rationalize the sale price (it can have some use in super long-term, predictable businesses like real estate with 10-year leases and well-diversified tenants).
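For anyone unfamiliar with the formula being dismissed, DCF is just summing future cash flows discounted back to today; a minimal sketch with made-up numbers:

    # Minimal discounted cash flow (net present value) sketch; cash flows and rate are illustrative.
    def discounted_cash_flow(cash_flows, rate):
        """Sum of each year's cash flow discounted back to present value."""
        return sum(cf / (1 + rate) ** year for year, cf in enumerate(cash_flows, start=1))

    # e.g. a 10-year lease paying $100k/year, discounted at 8%
    print(round(discounted_cash_flow([100_000] * 10, 0.08)))  # ≈ 671,008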