
How the Rocky Linux Server HPC-AI Operating System Was Born from the Ashes of CentOS

[SPONSORED CONTENT] CentOS disappeared in the dead of winter. On December 8, 2020, the day of the year's earliest sunset in northern latitudes, Red Hat announced that it would no longer support the Linux server operating system, and for many CentOS users, “What instruments we have agree the day of (its) death was a dark cold day.”**

If you were a CentOS Linux advocate, you knew all about it. You knew its traits, its mannerisms, its bugs, its quirks. You knew its personality. You knew how to make the most of it and how to avoid the things it did that drove you crazy. You had developed a CentOS skill set that had become second nature. With CentOS, great things were done, systems and careers were built, successes were achieved. So when CentOS reached end-of-life two Decembers ago, something central was removed from the working lives of its users.

At that time, CentOS (Community Enterprise Operating System) was an open-source, production-ready downstream version of Red Hat Enterprise Linux (RHEL) that had built up a huge and dedicated following over its 18 years of existence. It was the No. 1 Linux distribution in the enterprise, and its users included Toyota, GoDaddy, Disney, RackSpace, and Verizon, organizations that build large, complex HPC-class AI clusters.

Then, all of a sudden, Red Hat announced that it was ditching — that’s the word CentOS fans would use — the operating system for a new “distro,” CentOS Stream. Red Hat stated from the outset that CentOS Stream was not a replacement for CentOS, “rather, it is a natural and inevitable next step intended to achieve the project’s goal of advancing enterprise Linux innovation.” But it was a replacement and not a “next step” that many CentOS users wanted. According to a report by Ars Technica, “comments on the announcement from the community are legion and overwhelmingly negative.”

But while the news left discouraged CentOS users in its wake, they moved through the stages of grief quickly and resiliently. Just two hours after the Red Hat news, CentOS founder and Linux open source guru Gregory Kurtzer chimed in and announced, via a comment on the CentOS website, that he would launch a new open source, community-owned effort: a distribution that would be bug-for-bug compatible with RHEL and continue the mission of CentOS. Its name: Rocky Linux, a tribute to the co-founder of CentOS, Rocky McGaugh.

“Rocky was a big response to the end of CentOS, which is very important to the HPC and AI community,” Brock Taylor, VP of High Performance Computing and Strategic Partners at CIQ, told us. “CentOS was the backbone of so many systems, especially when you think about multi-node environments, HPC clusters and AI running in multi-node environments. CentOS was the operating system of choice in this space, and when support ended, a whole community wondered how they were going to move forward. It was a huge shock to the system.”

Kurtzer not only started the Rocky Enterprise Software Foundation (RESF), he also founded and became CEO of CIQ, a technology company providing support, services and added value to Rocky Linux, and a driving force behind the nascent operating system. CIQ is also a provider of not only traditional HPC solutions and support, but also a computing paradigm that paves the way for federated, hybrid, and cloud-native (HPC-2.0) computing.

Gregory Kurtzer, CIQ

Kurtzer, colleagues and members of the rapidly forming Rocky Linux community, some of whom hail from the CentOS community, quickly built momentum behind the upstart project. Taylor said that before long, thousands of developers were rallying to champion Rocky as a CentOS replacement. On December 12, just four days after Red Hat’s announcement, the Rocky Linux code repository had become the top trending repository on GitHub. Another project aspiring to fill the market void, AlmaLinux, was released on March 30, 2021, and beat Rocky Linux to market thanks to shared infrastructure, secure boot, and engineering from its parent company, CloudLinux. In July last year, RESF released Rocky Linux 8.4.

Rocky’s momentum continued. RESF reports that in a typical month, there are at least 250,000 OS image downloads, with some months reaching 750,000. The OS has been widely accepted across the enterprise, in academic institutions and in the cloud industry, including Amazon Web Services, Microsoft Azure, Google Cloud Platform and Oracle Cloud Infrastructure. All this before the one-year commemoration of the demise of CentOS. This is resilience.

Taylor attributes Rocky’s adoption to its unwavering dedication to RHEL compatibility, a feature of CentOS, and to improving community and project capabilities. In support of this goal, Rocky Linux version 9, announced last July, includes Peridot, which allows development groups to replicate and extend any version of Rocky Linux. (Incidentally, version 9 does not mean there were eight previous versions of Rocky Linux; it indicates that the new version is binary compatible with, yes, RHEL 9.)

Taylor said Peridot works as a cloud-native stack for building Rocky Linux with tools designed to simplify working with source code.

Brock Taylor, CIQ

“A key function of Peridot is to ensure that Rocky Linux is truly bug-for-bug compatible with Red Hat Enterprise Linux, and that’s great value to this community,” he said. “It’s very similar to how CentOS has stayed in sync with Red Hat, and it provides enormous value to the open source community and especially a large part of the HPC infrastructure community, to ensure that the operating system is very solid. It tracks all the different things, the thousands of software components that need to be tracked, for compatibility.”

He cited Rocky Linux’s ability to track the operations of the kernel, which manages and connects resources across the operating system, and how that ability can come into play with a chip like AMD’s EPYC processor.

“The EPYC architecture has eight cores on a chiplet and eight chiplets on a processor, which gives you 64 cores,” Taylor said. “These cores share cache, so if you have eight threads scheduled on the CPU, you want one thread per chiplet. But you may have cases where over time the threads end up migrating to only one or two of the chiplets, which means they are competing for resources while other chiplets are idle, so you get inefficient performance.”

The latest Rocky Linux kernel has updates that mitigate issues like this, Taylor said, ensuring equal distribution of processing resources, which is especially critical in demanding HPC and AI workloads. And Peridot ensures that such kernel enhancements make it into operating system distributions.
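The placement Taylor describes, one thread per chiplet, can be illustrated with a small Python sketch. This is not how the kernel scheduler or Rocky Linux implements it; it is a manual pinning example under stated assumptions. The function names are hypothetical, and the sketch assumes that core IDs are numbered contiguously within each chiplet (cores 0 to 7 on the first chiplet, 8 to 15 on the second, and so on), which real systems do not guarantee. The eight-by-eight layout comes from Taylor's description above.

```python
import os

CORES_PER_CHIPLET = 8  # from the layout described in the article
CHIPLETS = 8           # 8 x 8 = 64 cores


def one_core_per_chiplet(n_threads,
                         cores_per_chiplet=CORES_PER_CHIPLET,
                         chiplets=CHIPLETS):
    """Return one core ID per thread, spreading threads round-robin
    across chiplets so that no chiplet gets a second thread until
    every chiplet already has one.

    ASSUMPTION: core IDs are contiguous within a chiplet. Check the
    real topology on an actual machine before relying on this.
    """
    if n_threads > cores_per_chiplet * chiplets:
        raise ValueError("more threads than cores")
    placement = []
    for t in range(n_threads):
        chiplet = t % chiplets   # which chiplet this thread lands on
        slot = t // chiplets     # which core within that chiplet
        placement.append(chiplet * cores_per_chiplet + slot)
    return placement


def pin_current_process(core_id):
    """Pin the calling process to a single core (Linux only).

    Returns True if the pin was applied, False if the platform lacks
    os.sched_setaffinity or the core is not available to this process.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    if core_id not in os.sched_getaffinity(0):
        return False
    os.sched_setaffinity(0, {core_id})
    return True


if __name__ == "__main__":
    # Eight threads land on eight different chiplets.
    print(one_core_per_chiplet(8))
```

Explicit pinning like this stops threads from drifting onto the same chiplet, but it puts the burden on the user. The kernel updates Taylor refers to aim to get sensible placement without that manual step.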

Taylor explained how CIQ supports Rocky Linux users struggling with complex and heterogeneous HPC-AI clusters. These environments are typically designed and maintained by HPC cluster administrators, those proverbial IT masters of all trades who are, understandably enough, so hard to find and hire. In fact, it is common for these administrators to be researchers or data scientists themselves, or post-docs who have lost a bet with their peers. The role of CIQ consultants is to tap into their Rocky Linux knowledge and reduce the IT expertise that would otherwise be required of researchers or data scientists.

Taylor himself spent 22 years at Intel and AMD doing the same thing these users did, he said, “looking at the solution space” – that is, trying to understand how multi-architecture clusters come together.

“The data scientist wants to do data science, not cluster administration or Linux administration,” Taylor said. “They need tools that allow them to focus on this type of work. It starts with a solid foundation in the operating system. What CIQ does, with our strong connection to the operating system and Rocky Linux, is layer in technologies specific to high performance computing and AI.”

“We face an ever-expanding world of silicon solutions,” Taylor continued, “and the developers building the applications and frameworks might be running on a general-purpose processor, or on a GPGPU accelerator, or an FPGA. And they have to figure out when and where they support all the different form factors and operations. They don’t want to discuss how they configure the software and middleware elements, or how their fabric drivers are integrated into the operating system.”

All of this requires an enormous amount of coordination across the software stack and that’s where, Taylor said, CIQ can help.

** W. H. Auden, “In Memory of W. B. Yeats”
