Lessons to be learned from Google and Oracle's datacenter heatstroke

It's getting hot in here so take down all your nodes


Comment This year's summer heatwaves haven't just made your average Brit's life a bit miserable; they have also caused problems for some cloud providers and server admins trying to keep their gear running.

Last month, east London datacenters operated by Google and Oracle broke down amid the region's strongest heatwave on record. Parts of the country even edged above 40C.

Both IT giants cited failures of their cooling systems. This allowed temperatures within the facilities to reach undesirable levels, and forced the shutdown of customer systems and workloads to prevent damage to the hardware and limit data loss.

With climate scientists predicting more extreme weather to come, one wonders what can be learned from these outages to mitigate future disaster.

Omdia analyst Moises Levy, who has spent the better part of his career designing and consulting on datacenters, said these events underscore the importance of risk management and planning when designing and maintaining these facilities.

And while he said these kinds of outages aren’t that common and can be difficult to predict, learning from these incidents is an opportunity that shouldn’t be missed by site operators and executives.

Maintaining equilibrium

As described by Levy, datacenters operate in a finely tuned equilibrium in which workloads consume power and generate heat, and that heat has to be extracted by equipment that also typically requires power. Power equals cost, workloads equal revenue, and cooling is needed to keep that workload revenue flowing without too much cost. Not enough cooling equals damage and loss of revenue; too much cooling also has its problems. And cooling costs money to install. It's an interesting equation to figure out.

When balancing power usage, cooling, and compute density, datacenter operators usually account for worst-case scenarios to avoid potential downtime. This is the strategy employed by Equinix, which operates colocation datacenters all around the world.

“We design for local climatic conditions, optimizing plant selection for reliability and efficiency, both for current maximum observed, and forecast worst-case temperatures anticipated in the future,” Greg Metcalf, senior director of global design at Equinix told The Register.

This can be as simple as spec'ing out and deploying redundant cooling plants or provisioning additional backup power. For example, in generally hot climates, such as Dallas, Texas, Equinix employs a complex and heavily redundant temperature control system to protect its facilities.

“The cooling plants are designed for worst-case conditions, and are factory tested as such,” Metcalf said. “Implementing hardware redundancy means in the event of a heat peak, backup machines can be called upon to reduce the overall effort of a particular site's cold production.”

In a postmortem report following the London outage, Google blamed a "simultaneous failure of multiple, redundant cooling systems combined with the extraordinarily high outside temperatures" for the failure.

It's very interesting to see Google use the words "simultaneous" and "redundant" in the same sentence in this way, as it suggests there may have been a single point of failure that caused its temperature-regulation systems to break down, or that the facility was designed in such a way that multiple systems could fail all at once in the same way.

A datacenter or cloud outage typically occurs after a long or even short sequence of faults. One thing starts acting up or is misconfigured, and that causes another thing to fail, and that puts pressure on something else, and eventually it all collapses. Preventing an outage involves ensuring these individual screw-ups do not snowball into actual downtime.

In a heatwave, for instance, the mechanisms for starting up a facility's temperature-control equipment and regulating it on demand have to be present and operational, too, and if they aren't, well, it won't matter how much extra cooling capacity you have – it won't get used in time, or at all.

“It’s so important to look at the datacenter in a comprehensive way and not in silos," Levy said. "Anything can affect the other and we can have a cascade effect."

For example, a disruption to the datacenter’s supply of electric power or a breakdown in the cooling control system, or a failure to respond to or detect rising temperatures, can set you down the path to an outage.

This appears to have been what happened to Google and Oracle, with cooling system failures amid an overwhelming, historic heatwave. Google did not say (or did not want to say) its cooling simply couldn't mitigate the heat; it said its equipment failed to work when it was needed most.

Levy also pointed out that not every component within a datacenter is equally susceptible to extreme temperatures. The various boxes found throughout the datacenter, whether they be compute, networking, or storage-oriented, work within a range of operating temperatures. That can be as high as 90C to 100C for CPUs, or 55C to 65C for hard drives.
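
As a rough illustration of those ranges, the Python sketch below flags components running past a limit. It assumes the low end of each quoted range as a conservative ceiling; real limits vary by part and vendor, and the component names and readings are hypothetical.

```python
# Conservative limits in degrees Celsius, taken from the low end of the
# ranges quoted above. Real limits vary by part and vendor (assumption).
MAX_TEMP_C = {"cpu": 90.0, "hard_drive": 55.0}


def over_limit(readings):
    """Return the sorted names of components running hotter than their limit.

    `readings` maps a component name to a (kind, temperature_c) pair.
    """
    return sorted(
        name
        for name, (kind, temp_c) in readings.items()
        if temp_c > MAX_TEMP_C[kind]
    )


# Hypothetical readings during a cooling failure.
readings = {
    "cpu0": ("cpu", 78.0),
    "disk0": ("hard_drive", 58.5),
    "disk1": ("hard_drive", 49.0),
}
print(over_limit(readings))  # ['disk0']
```

In this made-up case a drive trips its limit while the CPU still has headroom, which is the kind of asymmetry Levy describes.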

The age of the equipment can also be a factor. “Older equipment may be more sensitive to higher temperatures. Newer equipment may be less sensitive and they will accept higher ranges,” Levy said. We noted earlier this year that Google extended the lifespan of its cloud systems by an extra year to save money.

Another point to bear in mind: in the event of a cooling crisis, it’s not always as simple as shutting down systems that are particularly vulnerable to excess heat, since networking, storage, and compute resources are largely dependent on each other.

For example, a virtual machine may be running on a compute node, but its resources may live on a separate storage node connected over the network. If any one of the three (compute, storage, or networking/orchestration) goes down due to hardware failure or to prevent damage, so does the virtual machine.
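
That dependency can be sketched in a few lines of Python. This is a minimal model, not how any particular cloud tracks health; the class, the function, and the node names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VirtualMachine:
    name: str
    compute_node: str
    storage_node: str
    network: str


def vm_is_up(vm, healthy):
    """A VM is up only if every layer it depends on is in the healthy set."""
    return {vm.compute_node, vm.storage_node, vm.network} <= healthy


vm = VirtualMachine("vm-1", "compute-a", "storage-b", "fabric-1")

print(vm_is_up(vm, {"compute-a", "storage-b", "fabric-1"}))  # True
# Shutting down only the heat-sensitive storage node still takes the VM down.
print(vm_is_up(vm, {"compute-a", "fabric-1"}))  # False
```

The subset check makes the point: powering off just the most heat-sensitive layer protects that hardware but does nothing to keep the workload alive.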

Complicating matters is the fact compute resources are growing more power hungry and by extension hotter. Many accelerators are now pushing 700W TDPs, with some box builders cramming multiple kilowatts of compute into a 2U chassis.

If datacenter operators don’t account for this with improvements to their power and cooling infrastructure, it can result in problems down the line, Levy explained.

This is standard procedure for Equinix, which, in addition to taking into account its often varied compute load, also considers external factors. “Sites are analyzed for climatic effects beyond weather, such as nearby heat sources, to capture the multiple elements affecting required heating and cooling,” Metcalf said.
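
To put the power-density point above into numbers, here is a back-of-the-envelope Python sketch. Only the 700W TDP and the 2U chassis come from the figures quoted earlier; the accelerator count per chassis and the fully populated 42U rack are assumptions for illustration. It counts accelerators only, so CPUs, memory, fans, and power supplies would push the real figure higher. Practically all the electrical power a server draws ends up as heat the cooling plant has to remove.

```python
ACCELERATOR_TDP_W = 700        # figure quoted in the article
ACCELERATORS_PER_CHASSIS = 4   # hypothetical
CHASSIS_HEIGHT_U = 2           # 2U chassis, as in the article
RACK_HEIGHT_U = 42             # hypothetical full-height rack, fully populated
BTU_PER_HOUR_PER_WATT = 3.412  # standard conversion factor


def chassis_heat_w():
    """Heat from the accelerators in one chassis, in watts."""
    return ACCELERATOR_TDP_W * ACCELERATORS_PER_CHASSIS


def rack_heat_w():
    """Heat from a rack filled with such chassis, in watts."""
    chassis_per_rack = RACK_HEIGHT_U // CHASSIS_HEIGHT_U
    return chassis_heat_w() * chassis_per_rack


print(f"Per chassis: {chassis_heat_w() / 1000:.1f} kW")  # Per chassis: 2.8 kW
print(f"Per rack:    {rack_heat_w() / 1000:.1f} kW")     # Per rack:    58.8 kW
print(f"Per rack:    {rack_heat_w() * BTU_PER_HOUR_PER_WATT:,.0f} BTU/h")
```

Under these assumptions a single rack sheds nearly 60kW from its accelerators alone, which is why cooling and power have to be sized alongside the compute.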

Lessons to learn

While it’s easy to point to Britain's unprecedented heatwaves and blame them for the outages, operating datacenters in hot climates is hardly a new concept. Though to be fair to Google, no one expects to see London experience the sort of summer weather that, say, Texas and Arizona in the US do; when building a server warehouse in the UK capital, the long scorching days of Austin and Phoenix probably don't come to mind. Yet.

When power, cooling, compute, and external factors are taken into account, disruptions resulting from extreme weather events and the like can be mitigated. From what we can tell, it's just a case of whether the cost is worth it, given the risk. On the other hand, Google is not exactly strapped for cash, and aside from Oracle, its rivals didn't seem to suffer during the UK heatwave.

“The datacenter industry is well prepared for all of these events. That being said, it’s not like the datacenter industry is immune to any event,” Levy added.

When these outages do happen, analyzing their cause, identifying where the failure happened, and making that information public can help others avoid a similar fate.

It’s important to understand exactly what went wrong and which components were impacted first, Levy said. “Hopefully the lessons learned can be made publicly available. For me, that will be a huge gain for all of the industry so everybody will learn from those, and we can avoid these type of events.”

This is mostly what Google has pledged to do in the wake of the outage. The American tech giant said it will investigate and develop advanced methods for decreasing the thermal load within its datacenters; examine procedures, tooling, and automated recovery systems to improve recovery times in the future; and audit cooling system equipment and standards across all of its datacenters.

Finally, Levy emphasized that steps need to be taken to mitigate the impact of these outages. Hyperscalers and cloud providers can, for example, migrate workloads to other datacenters or run those workloads across multiple zones or regions to avoid interruptions to their services.

However, as Uptime Institute analyst Owen Rogers told The Register in an earlier interview, implementing redundancy in cloud deployments isn’t automatic and often requires manual configuration on the customer’s part. ®


Biting the hand that feeds IT © 1998–2022