bit-tech.net

BSD coder finds AMD processor bug

AMD has admitted that a bug found by BSD coder Matthew Dillon represents a previously unknown erratum in its chips.

A programmer on the DragonFly BSD project has received confirmation that a bug he'd been chasing for more than a year was the result of a previously unnoticed erratum in some of AMD's processors.

Matthew Dillon spent more than a year chasing the bug, which caused a segmentation fault that would crash the system while compiling code, before concluding that it resided in hardware. The rarity of the crash accounted for much of that time: until he found a specific test case for the bug in December last year, Dillon had to leave a 48-core system running in a loop for up to two days to reproduce the failure.
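
The kind of soak test described above, re-running a workload until it finally crashes, can be sketched in a few lines of Python. This is only an illustration, not Dillon's actual test setup: the function name `run_until_segfault` and the demo command are hypothetical, and the sketch assumes a POSIX system where the command being run is itself the process that dies from the segmentation fault.

```python
import signal
import subprocess
import sys


def run_until_segfault(cmd, max_runs):
    """Run cmd repeatedly; return the 1-based number of the first run
    that was killed by SIGSEGV, or None if none was within max_runs."""
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        # On POSIX, a return code of -N means the child process was
        # terminated by signal N.
        if result.returncode == -signal.SIGSEGV:
            return attempt
    return None


# Hypothetical stand-in for the real workload: a child process that
# sends itself SIGSEGV, so the harness has something to detect.
demo_cmd = [sys.executable, "-c",
            "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
print(run_until_segfault(demo_cmd, max_runs=3))
```

In practice the command would be the compile job under suspicion and `max_runs` would be large enough to cover hours or days of repetition. A wrapper that catches the crash and exits normally would hide the signal from this check.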

Dillon's suspicions that the bug came from a hardware issue stemmed from the specificity of the problem: the compiler would reliably crash on a system with an AMD processor, but the exact same compiler working on the exact same code would run forever without issue on an Intel chip.

Having ruled out a software error in the affected section of code - the 'fill_sons_in_loop' segment - Dillon got in touch with AMD with his findings.

'We exchanged a few emails to try to come up with a good test case,' Dillon explains in a posting to the DragonFly project mailing list. 'Owing to the difficulty of reproducing the bug I constructed a fully bootable DFly operating system & test case USB image and verified that the bug was present on my test box using that image. AMD was then able to reproduce the bug using that image on their own machines.'

To Dillon's surprise, AMD confirmed that the flaw stemmed from its processors rather than user error. 'AMD has taken your example and also analysed the segmentation fault and the fill_sons_in_loop code. We confirm that you have found an erratum with some AMD processor families,' the company told Dillon. 'The specific compiled version of the fill_sons_in_loop code, through a very specific sequence of consecutive back-to-back pops and (near) return instructions, can create a condition where the processor incorrectly updates the stack pointer.'

While it's too late to fix the flaw in AMD's current generation of processors, Dillon's work means that the bug will be entered into AMD's official list of errata. As a result, coders will be forewarned of the issue and can develop software workarounds.

'I'm pretty stoked,' Dillon admits in his post. 'It isn't every day that a guy like me gets to find an honest-to-god hardware bug in a major CPU!'

AMD has yet to release an updated errata sheet for the affected processors.

UPDATE
While AMD has still to release updated errata information, it has been in touch to confirm the affected processors. According to an AMD spokesperson, the flaw has been confirmed in the previous four generations of AMD Opteron server-oriented chips, including the Opteron 2300 and 8300 'Barcelona' and 'Shanghai' chips, the Opteron 2400 and 8400 'Istanbul' chips, and the Opteron 4100 and 6100 'Lisbon' and 'Magny-Cours' chips.

According to the spokesperson, the flaw has not been found in its latest Opteron 4200 and 6200 'Valencia' and 'Interlagos' processors as it is not present in the new Bulldozer microarchitecture. Desktop and laptop processors are not affected by the bug.

17 Comments

B1GBUD 6th March 2012, 13:03
I doubt he'll be rewarded for his efforts!!
Quote:
Originally Posted by Article
can create a condition where the process or incorrectly updates the stack pointer.

Should that read "can create a condition where the processor incorrectly updates the stack pointer."
Gareth Halfacree 6th March 2012, 13:35
Quote:
Originally Posted by B1GBUD
Should that read "can create a condition where the processor incorrectly updates the stack pointer."
Yes. Yes, it should. Fixed, ta!
Amsalpedalb 6th March 2012, 14:08
So exactly which CPUs does this affect?
Gareth Halfacree 6th March 2012, 14:09
Quote:
Originally Posted by Amsalpedalb
So exactly which CPUs does this affect?
AMD hasn't said yet; we'll have to wait for the updated errata document to be sure.
Harlequin 6th March 2012, 14:13
OMG, it's not like Intel have errata documents, is it.....

http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/2nd-gen-core-desktop-specification-update.pdf

From Jan this year for the SB i5....
k4p84 6th March 2012, 14:29
We all make mistakes. It seems a rather specific set of instructions needs to occur to cause the bug, so it would seem unlikely to happen much IRL. At least they said cheers for finding a bug!
Paradigm Shifter 6th March 2012, 15:28
All CPUs have errata lists that seem to go on forever. I'm impressed with the professionalism of both Dillon and AMD in this case; I'd have half expected the developer to go running to the internet media and bring down a firestorm. Instead, he works for a year to isolate it, then works with the manufacturer to validate the issue.

It's not exactly a scenario that is going to happen very often, though - at least consumers won't have anything to worry about if it only happens on >=48 core systems...
NethLyn 6th March 2012, 16:06
That's all right then. As long as it's not going to affect my gaming, it's hard to be worried about this issue; both chip firms give you more or less lifetime guarantees anyway if you're not overclocking.

They should offer him a job :)
thehippoz 6th March 2012, 17:45
good find
SpAceman 6th March 2012, 20:40
I'm impressed at how well he kept his cool.. Segmentation faults can be.. Stressful.
Must have been somewhat of a relief when he finally figured out it wasn't his code stuffing up.
Adnoctum 7th March 2012, 00:35
Quote:
Originally Posted by Gareth Halfacree
Yes. Yes, it should. Fixed, ta!

Where's his reward for finding your erratum?
Gareth Halfacree 7th March 2012, 08:24
Quote:
Originally Posted by Adnoctum
Where's his reward for finding your erratum?
It's that warm glow of self-satisfaction you get. You know the one. That's the reward.*

* Reward has no cash value. Not redeemable for cash. No alternative will be offered. Limit one per household. Limited time offer. No purchase necessary. See website for details. May contain nuts. Keep out of reach of children. Offer void where prohibited.
Houndofhell 7th March 2012, 09:23
Quote:
Originally Posted by Paradigm Shifter
All CPUs have errata lists that seem to go on forever. I'm impressed with the professionalism of both Dillon and AMD in this case; I'd have half expected the developer to go running to the internet media and bring down a firestorm. Instead, he works for a year to isolate it, then works with the manufacturer to validate the issue.

It's not exactly a scenario that is going to happen very often, though - at least consumers won't have anything to worry about if it only happens on >=48 core systems...

I'd best be looking out for it then.

Any word on which CPUs in particular are affected?
DbD 7th March 2012, 15:28
Quote:
Originally Posted by k4p84
We all make mistakes. It seems a rather specific set of instructions needs to occur to cause the bug, so it would seem unlikely to happen much IRL. At least they said cheers for finding a bug!

Who knows - it doesn't announce itself, it just makes your machine crash. Perhaps it's affected plenty of people with random crashes?
Gareth Halfacree 8th March 2012, 09:03
Article updated with a response from AMD - seems it affects the previous four generations of Opteron chips, but not the new Bulldozer-based parts. No desktop or laptop parts affected, either.
Christopher N. Lew 8th March 2012, 09:25
So my new-build 'Mangy-Cours' machine is broken before I've even finished putting it together?!
Gareth Halfacree 8th March 2012, 11:31
Quote:
Originally Posted by Christopher N. Lew
So my new-build 'Mangy-Cours' machine is broken before I've even finished putting it together?!
Only if you're doing something so rare and specific it took an experienced coder a year just to find a reliable test case...