bit-tech.net

New GPGPU approach promises 20 per cent performance boost

AMD's HSA technology may have just been validated by research that promises a 20 per cent performance boost without overclocking.

Researchers at North Carolina State University have provided some serious vindication for AMD's plan to unite GPU and CPU silicon using the Heterogeneous Systems Architecture: a 20 per cent performance boost without overclocking.

Before we get into the paper, entitled CPU-Assisted GPGPU on Fused CPU-GPU Architecture, there are a couple of things to get out of the way: while the researchers are independent, the research itself was part-funded by AMD, and the company's senior fellow architect Mike Mantor is named as a co-author. The team also didn't have real silicon to work with: instead, their results are based on a simulated future AMD Accelerated Processing Unit (APU) featuring shared L3 cache.

With that out of the way, the team's results are still worthy of note. Using the aforementioned simulated silicon, the team were able to convince their code to run 20 per cent faster on average without overclocking the 'chip.'

'Our approach is to allow the GPU cores to execute computational functions, and have CPU cores pre-fetch the data the GPUs will need from off-chip main memory,' paper co-author and associate professor of electrical and computer engineering Huiyang Zhou explains. 'This is more efficient because it allows CPUs and GPUs to do what they are good at. GPUs are good at performing computations. CPUs are good at making decisions and flexible data retrieval.'

This approach, in which the CPU and GPU combine their efforts to boost overall performance, has previously been nigh-on impossible thanks to the separation between GPU and CPU in silicon. With AMD forging ahead with the architecture formerly known as Fusion, which bonds the two into a single cohesive whole, however, it becomes far simpler.
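To make the division of labour concrete, the sketch below mimics the idea on a conventional multi-core CPU: one thread runs a streaming computation (standing in for the GPU kernel), while a helper thread runs ahead of it issuing prefetches so the data is already warm in the shared cache when it is needed. This is only a minimal illustration of the helper-thread concept, not the paper's implementation - on the simulated APU the consumer would be the GPU cores sharing the L3 with the CPU - and all names (prefetch_helper, consumer, AHEAD and so on) are invented for the example.

```cpp
// Minimal sketch of helper-thread prefetching into a shared cache (assumed
// names throughout). In the paper the consumer would be a GPU kernel on a
// fused APU sharing the L3 with the CPU; here a second CPU thread stands in
// for it so the idea can run anywhere.
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

constexpr std::size_t N     = 1 << 24;  // elements to process
constexpr std::size_t LINE  = 64;       // assumed cache-line size in bytes
constexpr std::size_t AHEAD = 4096;     // how far the helper runs ahead

std::atomic<std::size_t> progress{0};   // consumer's current index

// "Pre-execution" helper: touch one element per cache line, staying a fixed
// distance ahead of the consumer so the data is warm when it arrives.
void prefetch_helper(const float* data) {
    const std::size_t step = LINE / sizeof(float);
    for (std::size_t i = 0; i < N; i += step) {
        while (i > progress.load(std::memory_order_relaxed) + AHEAD)
            std::this_thread::yield();  // don't run too far ahead
        // GCC/Clang builtin: read prefetch, low temporal locality
        __builtin_prefetch(data + i, 0, 1);
    }
}

// Stand-in for the GPU kernel: a simple streaming reduction over the array.
float consumer(const float* data) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < N; ++i) {
        sum += data[i];
        progress.store(i, std::memory_order_relaxed);
    }
    return sum;
}

int main() {
    std::vector<float> data(N, 1.0f);
    std::thread helper(prefetch_helper, data.data());
    float result = consumer(data.data());
    helper.join();
    std::printf("sum = %f\n", result);  // expect 16777216.000000
}
```

The same trade-off the researchers have to manage shows up even in this toy version: if the helper runs too far ahead, prefetched lines are evicted before they are used; too close, and it adds nothing.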

Using synthetic benchmarks, Zhou's team was able to show significant performance gains using the CPU-assisted GPU model. On average, benchmarks ran 21.4 per cent faster while some tasks were boosted by 113 per cent.

'Chip manufacturers are now creating processors that have a "fused architecture," meaning that they include CPUs and GPUs on a single chip. This approach decreases manufacturing costs and makes computers more energy efficient. However, the CPU cores and GPU cores still work almost exclusively on separate functions. They rarely collaborate to execute any given program, so they aren't as efficient as they could be,' explains Zhou. 'That's the issue we’re trying to resolve.'

While the research may have been helped along by AMD's input, it applies equally to Intel's latest-generation Sandy Bridge architecture. Where Intel seems happy to keep its current level of integration, however, AMD is forging ahead with a full fusion of GPU and CPU. Should the paper's experiments prove themselves in the real world, that could give AMD the boost it needs to finally compete at the high-end with Intel.

To take advantage of the model, compilers will need extensions that automatically generate a pre-execution program containing the memory access instructions of each GPU kernel. As a result, only future software built with such compilers will be able to take advantage of the technique.
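As a rough picture of what such a compiler extension might produce, consider a kernel that gathers data through an index array. A pre-execution slice keeps only the loads needed to form the kernel's addresses and replaces the actual work with prefetches. The hypothetical C++ sketch below (kernel_preexec is an invented name; real output would target the GPU kernel itself) shows the general shape, not the paper's actual compiler output.

```cpp
// Hypothetical before/after sketch of the pre-execution idea; plain C++
// loops stand in for the GPU kernel and the compiler-generated helper.
#include <cstddef>

// Original kernel body: gathers through an index array and does the work.
void kernel(const int* idx, const float* src, float* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[idx[i]] * 2.0f;       // address generation + compute
}

// Generated pre-execution program: keeps the address-generating loads,
// drops the computation, and turns the kernel's data accesses into
// prefetches the CPU can issue ahead of the GPU (GCC/Clang builtin).
void kernel_preexec(const int* idx, const float* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const int j = idx[i];              // load kept: needed for the address
        __builtin_prefetch(src + j, 0, 1); // warm the line the kernel will read
    }
}
```

Run ahead of the kernel on a fused part with a shared last-level cache, such a slice costs a CPU core some work but hides the off-chip memory latency that would otherwise stall the GPU.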

Zhou's paper is due to be presented at the International Symposium on High Performance Computer Architecture in New Orleans later this month.

14 Comments

Hustler 8th February 2012, 13:50 Quote
What's this?... more promises of jam tomorrow from AMD hardware?

...sigh
debs3759 8th February 2012, 16:10 Quote
I like that AMD are looking into newer and better ways of doing things. Fully integrating the CPU and GPU will, IMO, be a very good thing. We might even see AMD take back the lead in high-end computing in a few short years (Remember AMD64? You should, it's the basis for the instruction set in every 64-bit x86 CPU we see today!)
borandi 8th February 2012, 17:00 Quote
You act very sceptical in this news story. This is understandable. But any university that wants to work on stuff like this has to have a good partnership with some aspect of the manufacturer - e.g. NVIDIA have a lot of academic people around the world who act as reps for their institution, promoting GPGPU and exposing the members of their institutions to as much stuff as possible. As a result, those who collaborate with the academics (in this case, Mike Mantor) on a significant level are put in as co-authors even if they didn't directly do any research themselves; they contributed a lot to the outcome.
tonyd223 8th February 2012, 17:12 Quote
but will it run Cry...
DbD 8th February 2012, 17:12 Quote
That 20% is a picked-out-of-the-air figure, and seems very small if they actually want to make use of it.

There's a load of overhead required to set up some bit of code to run on the gpu rather than the cpu. Equally, only some bits of code are faster - if the lump of code is too small then the overhead of setting it up for the gpu outweighs the benefit it gives. Hence that 20% figure is based on a guess for overhead, and a guess for code that would work well on a gpu, all of it done without real silicon - just a simulation of some *future* silicon. Given that the research is set up to show this is possible (this is the result AMD will have wanted), 20% is a very low number to have come out with.

Then there are the other things - e.g. power usage. Doing it on a traditional cpu might have been 20% slower, but it only required the cpu, whereas your 20% performance gain now requires all the extra complexity and power requirements of a gpu. Would it have been more efficient power-wise to just have a more power-hungry cpu that ran 20% faster?

I think this sort of compute tends to work really well when you can allocate big lumps of code, or full jobs, to some specialised hardware (e.g. video decode), but for just working on standard code in combination with the cpu it doesn't look so effective.
azazel1024 8th February 2012, 18:03 Quote
On what kind of a workload? We already know that there are plenty of workloads that are single threaded only, or at most a bare 2-6 threads. The only time a GPU MAY help is in a scenario where the calculations/work you are doing has more threads than there are CPU cores (or, with hyperthreading, virtual cores too) to handle it. Even then, with the rather low frequency and low IPC of each individual "stream processor" in a GPU, you have to get to fairly highly threaded calculations to see a performance benefit of GPU over CPU calculations. I have no idea of the numbers, but I'd suspect we are talking at least double the threads of the CPU core count, or a lot more.

Heck, look at H.264 encoding, which can be pretty massively threaded. High-end discrete GPUs only manage something like 50-150% faster encoding than a high-end Intel CPU...and they have to take shortcuts which compromise image quality somewhat. That is with like 800+ stream processors versus 4-6 real cores.

Sure, there are things that make sense to process on a GPU with all its many cores, but most things in general computing are still best left to the CPU, or at most might gain a little assistance from having the GPU step in to handle some of the processing (for example, rendering a webpage).

However, how much actual performance gain is there in having the CPU and GPU on die, sharing L3 cache and workloads, compared to, say, a discrete GPU sharing some of the processing with the CPU? I'd assume at least a small percentage gain due to significantly lower latency and shared cache...but is it as much of a gain as the processing advantage of a discrete GPU over an integrated GPU? After all, the top-of-the-line GPU in AMD's current Llano is not much better than a discrete 6450, basically the bottom of the barrel. Even with shared cache and more main memory bandwidth I doubt the iGPU performance improves that much.

Now compare that with the higher clock rates, much greater memory bandwidth and significantly greater number of stream processors of something like a 550 or even a 6670, despite the much, much higher latency.

I see shared CPU/GPU processing on die as important, but I really suspect that the numbers the researchers came up with are specious at best. I doubt they compare shared CPU/GPU computing with a discrete GPU (even a discrete GPU of the exact same performance, just with the higher lag resulting from the discreteness).
FelixTech 8th February 2012, 19:01 Quote
I think it will be interesting if they can reduce or remove the time spent transferring data between CPU and GPU memories. There are many things that can be done well with GPUs, but where latency is prioritised over throughput the time for a return trip to the GPU is simply too big at the moment.
schmidtbag 8th February 2012, 20:01 Quote
amd and their partners are absolutely right in the sense that in order to get the best performance, the cpu and gpu need to work together as a single unit, but as long as x86 is the dominant architecture, amd can't expect programs to follow in their desired steps. intel is "happy" with their current cpu+gpu setup because they have plans to just keep improving performance without the need to increase clock speed or slap on more cores. both sandy bridge and ivy bridge are perfect examples of this.

if intel followed amd's decision, then i think we'd get a whole other world of computing, in a good way. but as long as intel doesn't want to follow amd, the idea won't take off. i feel like if any company wants to do the fused cpu and gpu idea, they might as well create an entirely new architecture from scratch, and we all know that isn't going to happen. what amd wants to do is effectively what SPARC and PPC currently do.
velo 8th February 2012, 20:03 Quote
Quote:
Originally Posted by DbD
That 20% is a picked-out-of-the-air figure, and seems very small if they actually want to make use of it.

"Using synthetic benchmarks, Zhou's team was able to show significant performance gains using the CPU-assisted GPU model. On average, benchmarks ran 21.4 per cent faster..."

Doesn't seem particularly air-like to me...
fluxtatic 9th February 2012, 07:25 Quote
Quote:
Originally Posted by azazel1024

Heck, look at H.264 encoding, which can be pretty massively threaded. High-end discrete GPUs only manage something like 50-150% faster encoding than a high-end Intel CPU...and they have to take shortcuts which compromise image quality somewhat. That is with like 800+ stream processors versus 4-6 real cores.

If you mean SB, you're a bit off - SB is using dedicated hardware for Quick Sync. Not that I mean your results are wrong, just the reasoning behind it. QS is ridiculously fast, yes, but it isn't directly the result of the magic of the SB arch itself, just that Intel saw fit to cram dedicated hardware to handle that job, and only that job, onto the die. Compare it to the previous Core arch, and the picture is a bit different.

I'm always a bit suspicious of this type of research, in that it isn't actual silicon they're working on. Rather, it's a model of some future arch that may never come to be. Or they tuned it, not even necessarily intentionally, to crank out results, not taking into account that what they ended up designing won't be practical as an actual CPU.
Quote:
Originally Posted by DbD
That 20% is a picked-out-of-the-air figure, and seems very small if they actually want to make use of it.

Have you read the paper? I'm actually asking, as I haven't. If they were going to start pulling figures out of...some place, why go so low? Or, shall we put on our tinfoil hats and realize it's a cunning scheme - make it sound good, but not suspiciously good. Otherwise people won't believe it. What have they got to lose, with so many people shitting on them now?

Even here, it's starting to feel as if people want AMD to fail. You think Intel won't get lazy (and even more expensive) with zero competition? Intel, the company so dedicated to the enthusiast community they'll sell you insurance on your processor in case you blow it up. Never mind the fine print that essentially says they can wriggle out of the obligation to replace the hardware, leaving you with no recourse. If they're so dedicated to us, give me the crack dealer model: first one is always free. Kill it, they'll replace it no questions asked. But that's it. Blow up the replacement and you're back on Newegg or Scan like every other sucker. That's dedication to the community...and it isn't like they can't afford it. This market segment is such a tiny portion of their revenue, they could start a "send us a picture of the PC you're building and we'll send you the processor for free" program and you wouldn't even see a dent in their quarterly revenues.

On-topic, though, this is exciting. Rather than piss on their shoes, cheer them on and hope they're right. Last time they had something that was a great leap forward (one that succeeded, that is), the result is all of us using 64-bit processors. They were also the first in x86 using true, native multicore processors, as well.

Between this and Intel's Haswell announcement (http://arstechnica.com/business/news/2012/02/transactional-memory-going-mainstream-with-intel-haswell.ars), this is a big day in hardware news - be happy!
Snips 9th February 2012, 11:03 Quote
How long ago was AMD64 again? Why is it always mentioned every time AMD release a disappointing processor?

The word for today is "Simulated"

I'm sure every processor performs like a demon "on paper"; it's manufacturing the idea where AMD fall down.
Nexxo 9th February 2012, 11:39 Quote
Interesting concept. Wouldn't dismiss it just because AMD is the one experimenting with it. I remember a time when AMD beat the pants off Intel and I've been around long enough to know that whatever happened can happen again.
Guinevere 9th February 2012, 13:11 Quote
Quote:
Originally Posted by DbD
That 20% is a picked-out-of-the-air figure

Ahhh, peer review is always at its very, very best when undertaken by someone unqualified and unwilling to read the paper.

Well done sir. Well done I say.
azazel1024 9th February 2012, 14:28 Quote
Quote:
Originally Posted by fluxtatic
Quote:
Originally Posted by azazel1024

Heck, look at H.264 encoding, which can be pretty massively threaded. High-end discrete GPUs only manage something like 50-150% faster encoding than a high-end Intel CPU...and they have to take shortcuts which compromise image quality somewhat. That is with like 800+ stream processors versus 4-6 real cores.

If you mean SB, you're a bit off - SB is using dedicated hardware for Quick Sync. Not that I mean your results are wrong, just the reasoning behind it. QS is ridiculously fast, yes, but it isn't directly the result of the magic of the SB arch itself, just that Intel saw fit to cram dedicated hardware to handle that job, and only that job, onto the die. Compare it to the previous Core arch, and the picture is a bit different.

I'm always a bit suspicious of this type of research, in that it isn't actual silicon they're working on. Rather, it's a model of some future arch that may never come to be. Or they tuned it, not even necessarily intentionally, to crank out results, not taking into account that what they ended up designing won't be practical as an actual CPU.
Quote:
Originally Posted by DbD
That 20% is a picked-out-of-the-air figure, and seems very small if they actually want to make use of it.

Have you read the paper? I'm actually asking, as I haven't. If they were going to start pulling figures out of...some place, why go so low? Or, shall we put on our tinfoil hats and realize it's a cunning scheme - make it sound good, but not suspiciously good. Otherwise people won't believe it. What have they got to lose, with so many people shitting on them now?

Even here, it's starting to feel as if people want AMD to fail. You think Intel won't get lazy (and even more expensive) with zero competition? Intel, the company so dedicated to the enthusiast community they'll sell you insurance on your processor in case you blow it up. Never mind the fine print that essentially says they can wriggle out of the obligation to replace the hardware, leaving you with no recourse. If they're so dedicated to us, give me the crack dealer model: first one is always free. Kill it, they'll replace it no questions asked. But that's it. Blow up the replacement and you're back on Newegg or Scan like every other sucker. That's dedication to the community...and it isn't like they can't afford it. This market segment is such a tiny portion of their revenue, they could start a "send us a picture of the PC you're building and we'll send you the processor for free" program and you wouldn't even see a dent in their quarterly revenues.

On-topic, though, this is exciting. Rather than piss on their shoes, cheer them on and hope they're right. Last time they had something that was a great leap forward (one that succeeded, that is), the result is all of us using 64-bit processors. They were also the first in x86 using true, native multicore processors, as well.

Between this and Intel's Haswell announcement (http://arstechnica.com/business/news/2012/02/transactional-memory-going-mainstream-with-intel-haswell.ars), this is a big day in hardware news - be happy!

I was referring to x86 encoding of H.264 compared to GPU encoding on a 580 or 5870. Quick Sync is faster than GPU H.264 encoding, and it appears to actually deliver on-par or maybe better quality than GPU encoding. x86 CPU encoding delivers by far the best quality, though at speeds that are roughly half or so of the faster GPU cards. However, if you look at power use...the overall energy used for encoding might actually be better on a Sandy Bridge processor than a GPU. If the high-end GPU can do it twice as fast, but uses 3 times the power...

Anyway, my point is that as things stand this second, GPU encoding is nice, but it isn't a panacea. As regards the APUs that AMD and Intel seem to be putting together, with both moving further and further along the path of SoCs (Haswell pretty much will be an SoC with just about everything moved on die), a GPU is going to be "critical", but not nearly as good as discrete cards. Heck, just look at the die area of Nvidia and AMD high-end discrete cards right now. Vaguely 300mm^2. That is about 50% bigger than Sandy Bridge, which is already using a big chunk for the GPU (roughly half? A third?). As process size shrinks, I think we'll see the GPU portion of the APU/CPU get bigger and bigger; however, it is likely to still be smaller and less powerful than what you'll find in discrete GPUs.

I do think at some point in the next 1-3 CPU generations (maybe by Haswell?) we'll see a complete disappearance of the low and maybe even the mid-low GPU markets. Ivy Bridge looks like it may be on par with a 6550, and AMD's GPU in Llano is just about on par with that as well. Haswell sounds like it is probably going to improve on Ivy Bridge by anywhere from 25-100%, and Trinity is likely to be better than Llano. Integrated GPUs are certainly improving faster than discrete graphics are.

Two keys to integrated GPUs, though, are going to be closer integration with the CPU (Intel's shared L3 cache, AMD deciding to implement a real L3/shared cache) as well as a larger main memory pipeline and/or dedicated VRAM slots. However, Intel at least, and to a lesser degree AMD it seems, are moving toward lower-power CPUs/APUs - in part because of portable computing, but also because of the server space - and desktop CPUs are mirroring this as well. So the discrete GPU is always going to be much more powerful, so long as you don't mind coughing up the money. Your average mid-range card with a TDP in the 100-150W range is going to be much more powerful than a combined CPU/GPU that in total might have a TDP of 65-130W.

So integrated GPUs can accelerate some things and be significantly lower latency than a discrete GPU. However, for raw processing power, a discrete GPU is still going to be head and shoulders more powerful. It is really just going to be in situations where low latency is required that the iGPU is going to be better than a discrete card, or in situations where there is no discrete card present, which the market is quickly moving toward as integrated GPUs start becoming "good enough" for basic users, corporate computing and casual gamers. Heck, at the rate of improvement they are going to be good enough even for heavy gamers who are on a budget or have lower-resolution displays (I'd say give it 3-4 years and iGPUs are going to be able to handle 1080p with medium/high settings at >30FPS in basically all games, though hopefully by then the >20" monitor group is going to be standardising on something more like >1500p).