A programmer on the DragonFly BSD project has received confirmation that a bug he'd been chasing for more than a year was the result of a previously unnoticed erratum in AMD's instruction set architecture.
Matthew Dillon had been chasing the bug, which caused a segmentation fault that would crash the system while compiling code, for over a year when he concluded that it resided in hardware. Much of that time was lost to the rarity of the crash: before he found a specific test case for the bug in December last year, Dillon would need to leave a 48-core system running in a loop for up to two days to reproduce the failure.
Dillon's suspicions that the bug came from a hardware issue stemmed from the specificity of the problem: the compiler would reliably crash on a system with an AMD processor, but the exact same compiler working on the exact same code would run forever without issue on an Intel chip.
Having ruled out a software error in the affected section of code - the 'fill_sons_in_loop' segment - Dillon got in touch with AMD with his findings.
'We exchanged a few emails to try to come up with a good test case,' Dillon explains in a posting to the DragonFly project mailing list. 'Owing to the difficulty of reproducing the bug I constructed a fully bootable DFly operating system & test case USB image and verified that the bug was present on my test box using that image. AMD was then able to reproduce the bug using that image on their own machines.'
To Dillon's surprise, AMD confirmed that the flaw stemmed from its processors rather than user error. 'AMD has taken your example and also analysed the segmentation fault and the fill_sons_in_loop code. We confirm that you have found an erratum with some AMD processor families,' the company told Dillon. 'The specific compiled version of the fill_sons_in_loop code, through a very specific sequence of consecutive back-to-back pops and (near) return instructions, can create a condition where the processor incorrectly updates the stack pointer.'
While it's too late to fix the flaw in AMD's current generation of processors, Dillon's work means that the bug will be entered into AMD's official list of errata. As a result, coders will be forewarned of the issue and can develop software workarounds for the problem.
'I'm pretty stoked,' Dillon admits in his post. 'It isn't every day that a guy like me gets to find an honest-to-god hardware bug in a major CPU!'
AMD has yet to release an updated errata sheet for the affected processors, but it has been in touch to confirm which chips are affected. According to an AMD spokesperson, the flaw has been confirmed in the previous four generations of AMD Opteron server-oriented chips, including the Opteron 2300 and 8300 'Barcelona' and 'Shanghai' chips, the Opteron 2400 and 8400 'Istanbul' chips, and the Opteron 4100 and 6100 'Lisbon' and 'Magny-Cours' chips.
According to the spokesperson, the flaw has not been found in its latest Opteron 4200 and 6200 'Valencia' and 'Interlagos' processors as it is not present in the new Bulldozer microarchitecture. Desktop and laptop processors are not affected by the bug.