Fix AMD Ryzen Freeze/Hang/Crash issues – CentOS 7

One of our server providers was running a deal on servers equipped with the newly released AMD Ryzen CPUs. We got a couple to test them for future placements.

Right off the bat we started experiencing issues with them: shell sessions would randomly hang or freeze. At first we thought this was a network issue and ignored it. It wasn’t until one of the servers crashed and wouldn’t respond to reboot commands that we became suspicious. We had to ask our data center to physically send someone to the machine to investigate why it wasn’t even responding to hard reset requests. They reported back that the server had no display output and was completely unresponsive to keyboard input, so they had to power it off manually and then turn it back on.

We suspected right from the start that this might be a kernel issue, as the CPU architecture was practically brand new. It turns out we were right: a number of people had experienced similar issues, and the solution was to install kernel version 4.12 or later.

Note: Make sure you back up everything you have on this server before proceeding with the kernel update.

Start off by verifying that you are indeed on an old kernel version:

uname -r

On a stock CentOS 7 install, the result should be a version string starting with 3.10.0.
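If you have several machines to check, the comparison against 4.12 can be scripted. The following is a minimal sketch, assuming GNU sort with the -V (version sort) option is available, as it is on CentOS 7. The function name kernel_is_older_than is made up for this example.

```shell
#!/bin/sh
# Succeeds (exit status 0) when kernel version $1 is older than minimum $2.
# Assumes GNU sort, whose -V option orders version strings numerically.
kernel_is_older_than() {
    current=${1%%-*}   # drop the release suffix, e.g. "-693.el7.x86_64"
    minimum=$2
    [ "$current" != "$minimum" ] &&
        [ "$(printf '%s\n%s\n' "$current" "$minimum" | sort -V | head -n 1)" = "$current" ]
}

if kernel_is_older_than "$(uname -r)" 4.12; then
    echo "Kernel $(uname -r) is older than 4.12, upgrade recommended."
else
    echo "Kernel $(uname -r) is 4.12 or newer."
fi
```

Version sort is used here because a plain string comparison would rank 4.9 above 4.12.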

Add the ELRepo repository to your server:

rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm

Install the latest kernel by issuing the following:

yum --enablerepo=elrepo-kernel install kernel-ml

You now need to edit the GRUB defaults file at /etc/default/grub, which should look similar to this:

GRUB_TIMEOUT=5
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="rd.lvm.lv=centos/root rd.lvm.lv=centos/swap crashkernel=auto rhgb quiet"
GRUB_DISABLE_RECOVERY="true"

Change the GRUB_DEFAULT=saved line to GRUB_DEFAULT=0 so that the first menu entry, normally the newly installed kernel, is booted by default.
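If you prefer not to open an editor, the same change can be made from the shell. This is only a sketch: set_grub_default is a made-up helper name, and it assumes the file contains a single line starting with GRUB_DEFAULT=. It keeps a .bak copy of the original next to the file.

```shell
#!/bin/sh
# Rewrites the GRUB_DEFAULT line of the given file to GRUB_DEFAULT=0.
# The untouched original is preserved as <file>.bak.
# set_grub_default is a hypothetical helper, not a CentOS tool.
set_grub_default() {
    file=$1
    cp -p "$file" "$file.bak" &&
        sed 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT=0/' "$file.bak" > "$file"
}

# Usage (as root):
#   set_grub_default /etc/default/grub
```

Writing the result from the backup copy, rather than editing in place, means the original can be restored with a single cp if the new kernel fails to boot.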

Save the file, then regenerate the GRUB configuration:

grub2-mkconfig -o /boot/grub2/grub.cfg
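Since GRUB_DEFAULT=0 simply selects the first menu entry, it is worth confirming that the new kernel really is entry 0 before rebooting. The sketch below lists the entries with their index. It assumes the top-level menuentry lines in grub.cfg start at the beginning of the line and quote their titles with single quotes, which is how grub2-mkconfig writes them on CentOS 7. The function name list_grub_entries is made up for this example.

```shell
#!/bin/sh
# Prints "<index>: <title>" for every top-level menuentry in a grub.cfg.
# Defaults to the path used by the grub2-mkconfig command above.
# list_grub_entries is a hypothetical helper, not a GRUB tool.
list_grub_entries() {
    awk -F"'" '/^menuentry / { print n++ ": " $2 }' "${1:-/boot/grub2/grub.cfg}"
}

# Usage (as root):
#   list_grub_entries
```

If the kernel you just installed is not listed at index 0, set GRUB_DEFAULT to the index it does appear at and run grub2-mkconfig again.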

Reboot the server:

reboot

Check the kernel version once your server comes back up:

uname -r

You should see something like the following:

new kernel-version