bit-tech.net

AMD's Ryzen hit by FMA3 hard-crash erratum

An apparent hard-crash flaw in AMD's Zen architecture has been discovered courtesy of a little-known benchmarking application, but a microcode fix is reportedly in the works.

A little-known benchmark tool has winkled out a bug in the microcode of AMD Ryzen processors which allows programs using a 128-bit Fused Multiply Add 3 (FMA3) instruction to hard-crash the host system - but a fix is reportedly on the way.

The flaw in AMD's Ryzen parts was first spotted last week by HWBOT user Mysticial, who posted a forum thread detailing a strange experience with the open-source Flops benchmarking tool. Using a version of the tool designed for Intel's Haswell processors, Mysticial was able to hard-freeze his system every time Flops ran a single-precision 128-bit Fused Multiply Add 3 (FMA3) instruction - requiring a complete power off and on in order to restore functionality. As others tested the tool, it became clear that the crash wasn't restricted to Mysticial's build or his Asus motherboard but was entirely reproducible on any Ryzen system.
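
For readers unfamiliar with the operation itself: a fused multiply-add computes a*b + c and rounds the result only once, where a separate multiply followed by an add would round twice. The following is a minimal Python sketch of that difference, not code from the Flops benchmark. It works on a single double-precision value rather than the four single-precision lanes a 128-bit instruction would handle, and the function names and operands are illustrative assumptions.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Compute a*b + c exactly, then round once to the nearest double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # the only rounding step


def separate_multiply_add(a: float, b: float, c: float) -> float:
    """Multiply and round, then add and round again, as two instructions would."""
    product = a * b      # first rounding
    return product + c   # second rounding


# Illustrative operands chosen so that the intermediate rounding is visible.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# The exact product is 1 - 2**-60, which rounds to 1.0 as a double,
# so the two-step version loses the small term entirely.
print(separate_multiply_add(a, b, c))
# The fused version keeps it, because nothing is rounded before the add.
print(fused_multiply_add(a, b, c))
```

The exact-arithmetic approach via Fraction is used here only to make the single rounding explicit; hardware performs the same step in one instruction.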

Following community confirmations of the bug, HWBOT's chief operating officer Pieter-Jan Plaisier posted that he had received confirmation that the crash was the result of a flaw in AMD's Ryzen microcode - meaning it affects all currently-available Ryzen chips, and likely also the company's upcoming Naples server parts which share the same Zen microarchitecture. Thankfully, a fix is on the way: 'Was told this issue will be fixed in a new AGESA [AMD Generic Encapsulated Software Architecture microcode] code,' Plaisier claimed. 'In other words: it was an AMD [CPU] issue, not C6H [motherboard] issue.'

The particular instruction which crashes the Flops benchmark is not in common usage in commercial software, but the presence of a hard-freeze bug is nevertheless a problem: a malicious application could crash any Ryzen- and likely any Zen-based system with a single line of code. For now, disabling Simultaneous Multi-Threading (SMT) - which allows Ryzen to run two threads on each processor core - appears to reduce the likelihood of a freeze.

AMD has not publicly commented on the issue, nor offered a timescale for the release of an AGESA update to address the problem.

6 Comments

Corky42 16th March 2017, 11:20
From what I was reading it seems related to a power issue, as OC'ed Ryzens don't seem to be affected.
dstarr3 16th March 2017, 20:44
So is this a driver/firmware kind of fix, or would this require replacing the CPU with an updated version?
RedFlames 16th March 2017, 20:58
Quote:
Originally Posted by dstarr3
So is this a driver/firmware kind of fix, or would this require replacing the CPU with an updated version?

iirc it's typically done via bios update. Basically a patch that says 'this function is broken, don't use it'.

Gareth can probably provide a better explanation than that, but this kinda thing is more common than you'd think. For example Skylake had its own... quirks...
Gareth Halfacree 16th March 2017, 21:16
Quote:
Originally Posted by RedFlames
iirc it's typically done via bios update. Basically a patch that says 'this function is broken, don't use it'.
Wot 'e said. Basically, there's a thing called 'microcode' which is, effectively, the thing you're actually talking to when you're sending the CPU instructions. The user says "what's two plus two," the program says "ADD REG1 REG2; MOV REG3;," the microcode says "right, turn these particular parts of the processor on and have them physically do the following."

When a processor erratum is found, you update the microcode to fix it. In some cases, like Red says, the 'fix' might involve completely disabling a broken instruction (like when Intel disabled TSX 'cos it turned out you could crash the whole system with it). In other cases, there might be a way to change how the instruction operates and bypass the problem. In still more cases, you might use it to 'fix' something that was never really a bug - Intel took away the ability to overclock supposedly-locked chips via a microcode update just recently.
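
As a rough illustration of the first two kinds of fix, here is a toy Python model in which an update either swaps in a replacement routine for an instruction or disables it outright. The ToyCpu class, its lookup table, and the routine names are all hypothetical; real microcode is proprietary and looks nothing like this.

```python
from typing import Callable, Dict, Optional

# A "routine" here is just a Python function standing in for whatever the
# processor does internally to carry out one instruction.
Routine = Callable[[float, float, float], float]


def original_fma(a: float, b: float, c: float) -> float:
    return a * b + c


def workaround_fma(a: float, b: float, c: float) -> float:
    # Hypothetical replacement: same result, reached by a different path.
    product = a * b
    return product + c


class ToyCpu:
    def __init__(self) -> None:
        # Hypothetical table mapping an instruction name to its routine.
        self.table: Dict[str, Optional[Routine]] = {"FMA": original_fma}

    def apply_update(self, name: str, routine: Optional[Routine]) -> None:
        """Replace an entry; None marks the instruction as disabled."""
        self.table[name] = routine

    def execute(self, name: str, a: float, b: float, c: float) -> float:
        routine = self.table.get(name)
        if routine is None:
            raise RuntimeError(f"{name} is not available on this CPU")
        return routine(a, b, c)


cpu = ToyCpu()
print(cpu.execute("FMA", 2.0, 3.0, 4.0))  # behaviour as shipped

cpu.apply_update("FMA", workaround_fma)   # fix by changing how it operates
print(cpu.execute("FMA", 2.0, 3.0, 4.0))

cpu.apply_update("FMA", None)             # fix by disabling the instruction
try:
    cpu.execute("FMA", 2.0, 3.0, 4.0)
except RuntimeError as err:
    print(err)
```

Software that relied on a disabled instruction would then need another code path, which is why the choice between the two approaches matters to users.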

HWBOT says that somebody - either AMD itself or a motherboard maker - has said that it's going to be fixed in a microcode update, though it hasn't been specified whether the 'fix' will make FMA3 work properly or just disable it altogether.

While I'm here, a fun little fact for you: AMD was originally working on implementing FMA3 (fused multiply-add with three operands) while Intel was working on an equivalent instruction called FMA4 (fused multiply-add with four operands). AMD then switched to FMA4 while Intel switched to FMA3. So, the reason it took some dude running a random benchmark to find the flaw is that you wouldn't normally use FMA3 on an AMD chip - you'd use FMA4. The benchmark the guy was using hadn't yet been compiled for Ryzen, though, so he was using a version compiled for Haswell which used the FMA3 instruction and thus triggered the bug. If he'd used a version compiled for Piledriver or Bulldozer, it wouldn't have used FMA3 and we'd have no idea about the flaw.
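
To make the three-versus-four operand distinction concrete, here is a small Python sketch using a dictionary as a stand-in register file. In the three-operand form the destination doubles as one of the inputs and is overwritten; in the four-operand form the destination is named separately. The function names and register contents are illustrative assumptions, and each register holds a single number rather than a vector.

```python
from typing import Dict

Registers = Dict[str, float]


def fma3(regs: Registers, dst: str, src1: str, src2: str) -> None:
    """Three operands: dst = src1 * src2 + dst, so dst's old value is consumed."""
    regs[dst] = regs[src1] * regs[src2] + regs[dst]


def fma4(regs: Registers, dst: str, src1: str, src2: str, src3: str) -> None:
    """Four operands: dst = src1 * src2 + src3, with dst named separately."""
    regs[dst] = regs[src1] * regs[src2] + regs[src3]


# Same starting values for both forms: compute 2 * 3 + 4.
regs_a = {"r0": 4.0, "r1": 2.0, "r2": 3.0, "r3": 0.0}
fma3(regs_a, "r0", "r1", "r2")
print(regs_a)  # r0 now holds the result; the original addend is gone

regs_b = {"r0": 4.0, "r1": 2.0, "r2": 3.0, "r3": 0.0}
fma4(regs_b, "r3", "r1", "r2", "r0")
print(regs_b)  # r3 holds the result; r0 still holds the addend
```

Both forms produce the same arithmetic result; they differ only in how the operands are encoded, which is why a binary built for one will not run on a chip that implements only the other.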

Also, just in case it seems like AMD's really dropped the ball on this one, errata are incredibly common in systems as complex as processors. Here's the Skylake spec update from January this year, with a list of unfixed errata which spans 35 pages. For a real eye-opening experience, have a look at how many have "None identified" printed under "Workarounds."
Flexible_Lorry 17th March 2017, 00:42
Thanks Gareth for the follow-up information.
Gareth Halfacree 17th March 2017, 09:56
Quote:
Originally Posted by Flexible_Lorry
Thanks Gareth for the follow-up information.
I live to serve.