The world of GPGPU:
While Intel’s 80-core Terascale processor is impressive, it’s still a pipe dream for anything other than a research project in a lab run by specialists. If you want actual TeraFLOP computing power in your home, you’ll ironically have to look not to CPUs but to the world of graphics cards to find your processing powerhouses.
In the past, GPUs have been very specialised with clusters of units designed to process fragments (pixel shaders) and vertices (vertex shaders). ATI's Radeon X1000-series was a departure from this in some ways, and the cards have been great at crunching Folding@Home work units. However, the advent of DirectX 10's unified pipeline has meant that GPUs have stepped right away from the traditional model.
DX10-based GPUs are a mass of unified floating point processors fed by a massive amount of memory bandwidth. Take Nvidia's G80 graphics chip, for example, which features 128 floating point processors clocked at 1.35GHz and 86.4GB/sec of memory bandwidth.
ATI's R580 GPU flow diagram
The latest graphics cards from AMD and Nvidia have close to 500 GigaFLOPS of observable computation power. Thus, with a pair of cards in either CrossFire or SLI, you can get very close to that golden TeraFLOP barrier.
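To see roughly where a figure like that comes from, here is a back-of-envelope sketch using the G80 numbers quoted above. The processor count and clock are from the article; the number of floating point operations issued per clock is an assumption (two for a single multiply-add, three if a co-issued multiply is also counted), and the function name `peak_gflops` is purely illustrative. These are theoretical peaks, not measured throughput.

```python
# Back-of-envelope peak throughput for a G80-class GPU.
# Processor count and clock come from the article; FLOPs per clock is an assumption.
STREAM_PROCESSORS = 128
SHADER_CLOCK_GHZ = 1.35


def peak_gflops(flops_per_clock):
    """Theoretical peak in GigaFLOPS: processors x clock (GHz) x FLOPs per clock."""
    return STREAM_PROCESSORS * SHADER_CLOCK_GHZ * flops_per_clock


mad_only = peak_gflops(2)      # one multiply-add counted as 2 FLOPs per clock (assumed)
mad_plus_mul = peak_gflops(3)  # multiply-add plus a co-issued multiply (assumed)
two_cards = 2 * mad_plus_mul   # ideal two-card scaling, which real workloads rarely reach

print(f"{mad_only:.1f} {mad_plus_mul:.1f} {two_cards:.1f}")
```

Under these assumptions a single chip lands somewhere between roughly 350 and 520 GigaFLOPS, and a perfectly scaling pair just clears the TeraFLOP mark on paper.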
The graphics market is a multi-billion dollar pressure cooker that exploits market mechanisms for the benefit of the industry and consumer, pioneering innovation at a huge rate of turnover. However, GPUs are designed for graphics processing. The interaction between user and graphics card goes through the rest of the computer, which makes the CPU primary and add-in boards secondary.
Nvidia's G80 GPU flow diagram
GPUs are also massively parallel processing machines. While it’s easy to cut up a display into portions for separate render targets, general purpose calculations need their instructions arranged so that they all execute in the correct order. Certain programs can be broken up into sections that the stream processors calculate independently, before clever kernel algorithms in the program organise the data coming through.
This is still harder to code for than a CPU, though, which benefits from a wider base of languages and trained programmers. However, if you think in terms of graphics (where a stream array = texture and memory read = texture sample request), you can get round some of these problems.
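The graphics-style way of thinking can be sketched in ordinary Python. This runs on the CPU and only illustrates the programming model, not how any real GPU or graphics API works; the names `sample`, `blur_kernel` and `run_kernel` are hypothetical. The "texture" is a 2D array, every memory read is a "sample", and the kernel is run once per output element with no dependence on any other element's result.

```python
# CPU-only illustration of the stream model: a "texture" is a 2D array,
# a memory read is a "sample", and a kernel runs once per output element.


def sample(texture, x, y):
    """Read one texel, clamping out-of-range coordinates to the nearest edge."""
    height, width = len(texture), len(texture[0])
    x = min(max(x, 0), width - 1)
    y = min(max(y, 0), height - 1)
    return texture[y][x]


def blur_kernel(texture, x, y):
    """Average a texel with its left and right neighbours; reads only, never writes the input."""
    return (sample(texture, x - 1, y)
            + sample(texture, x, y)
            + sample(texture, x + 1, y)) / 3.0


def run_kernel(kernel, texture):
    """Apply the kernel at every output position; each call is independent of the others."""
    height, width = len(texture), len(texture[0])
    return [[kernel(texture, x, y) for x in range(width)] for y in range(height)]


source = [[0.0, 3.0, 6.0],
          [9.0, 9.0, 9.0]]
result = run_kernel(blur_kernel, source)

# A final combine step that gathers the independently computed results.
total = sum(sum(row) for row in result)
```

Because each kernel call only reads from the input and writes its own output element, the calls could in principle run in any order or all at once, which is the property that lets the work be spread across many stream processors. The final sum stands in for the step where the program gathers those independent results.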