Specifically, the applications written for 45nm Penryn processors each make use of a separate optimisation from the new additions: SSE4, the Radix-16 Divider or the Super Shuffle Engine.
For example, DivX Pro 6.6 Alpha is SSE4-optimised: it takes the commonly used sum of absolute differences operation and collapses the long run of code that repeatedly performs it into a single instruction. Despite the fact that the code can be pretty massive, the Intel engineers boasted that it could be condensed and put through in a single clock cycle, increasing the processor's IPC.
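To give a rough idea of the arithmetic involved, here is a minimal Python sketch of the "sum absolute" operation (usually called a sum of absolute differences) over two small pixel blocks. The function name `sad` and the sample pixel values are ours, purely for illustration. This is not DivX's code and it does not use the SSE4 instruction itself; it only shows the per-pixel work that the instruction is said to collapse.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized pixel blocks.

    Each block is a list of rows, each row a list of integer pixel values.
    """
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same number of rows")
    total = 0
    for row_a, row_b in zip(block_a, block_b):
        if len(row_a) != len(row_b):
            raise ValueError("rows must have the same length")
        for pixel_a, pixel_b in zip(row_a, row_b):
            # One subtract, one absolute value and one add per pixel.
            total += abs(pixel_a - pixel_b)
    return total


# Hypothetical 2x2 blocks, chosen only to make the sum easy to follow.
current = [[10, 12], [14, 16]]
reference = [[11, 12], [10, 20]]
print(sad(current, reference))  # 1 + 0 + 4 + 4 = 9
```

A video encoder repeats this comparison for a great many candidate blocks per frame, which is why folding the inner loop into one instruction matters.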
Another example is the H.264 encoder used in Mainconcept, which uses the Super Shuffle Engine when searching a large area for motion. The alternative to the inferior scan, spiral or hexagonal 16-block searches is to search whole square chunks of the scene around those same 16 blocks, which used to be extremely taxing on the CPU.
The larger L2 cache lets the CPU hold more scene detail (despite possibly competing with the other core on the die for the space), negating the need to address the main system memory.
The Super Shuffle Engine helps this process too, as finding the local minimum across all of the searched areas and ordering the results relies on it heavily.
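The exhaustive square search described above can be sketched in a few lines of Python. This is an illustration under our own assumptions, not Mainconcept's encoder: the helper names (`block_at`, `sad`, `full_search`, `make_frame`), the 6x6 test frames and the use of a sum of absolute differences as the matching cost are all hypothetical choices. It simply tries every offset in a square window and keeps the one with the lowest cost, which is the "find the minimum over all the areas" step.

```python
def block_at(frame, top, left, size):
    """Cut a size x size block out of a frame (a list of pixel rows)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def sad(block_a, block_b):
    """Matching cost: sum of absolute differences between two blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def full_search(current, reference, top, left, size, radius):
    """Try every offset in a square window and return (dy, dx, cost)
    for the best match, or None if no candidate fits inside the frame."""
    target = block_at(current, top, left, size)
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue
            cost = sad(target, block_at(reference, y, x, size))
            if best is None or cost < best[2]:
                best = (dy, dx, cost)
    return best


def make_frame(top, left):
    """Hypothetical 6x6 black frame with one bright 2x2 patch."""
    frame = [[0] * 6 for _ in range(6)]
    for y in range(top, top + 2):
        for x in range(left, left + 2):
            frame[y][x] = 9
    return frame


reference = make_frame(1, 1)  # patch position in the previous frame
current = make_frame(2, 3)    # the patch has moved down 1 and right 2
print(full_search(current, reference, top=2, left=3, size=2, radius=2))
# (-1, -2, 0): the block came from one row up and two columns left
```

The number of candidates grows with the square of the search radius, which is why a large square search is so much heavier than the sparser scan, spiral or hexagonal patterns.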
Finally, games like Half-Life 2 benefit from the Radix-16 Divider, as division has typically been more taxing on the CPU than other FP calculations. Although geometry calculations will soon be done on the GPU, general gaming calculations associated with distance and geometry that aren’t specifically to do with scene rendering still depend on it.
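As a rough illustration of why a higher radix helps, the sketch below performs schoolbook long division that retires a fixed number of quotient bits per step: 2 bits per step corresponds to radix-4 and 4 bits per step to radix-16. The function `divide_by_digits` and its parameters are hypothetical names of ours, it works on plain integers rather than floating point, and it is not a model of Intel's circuit, which will use a more sophisticated way of choosing each quotient digit. It only shows that doubling the bits per step halves the number of steps.

```python
def divide_by_digits(dividend, divisor, bits_per_step, width=32):
    """Long division that produces bits_per_step quotient bits per step.

    Returns (quotient, remainder, steps). A simplified integer sketch,
    not a description of any real hardware divider.
    """
    if divisor <= 0 or dividend < 0:
        raise ValueError("expects dividend >= 0 and divisor > 0")
    if width % bits_per_step != 0:
        raise ValueError("width must be a multiple of bits_per_step")
    if dividend >= 1 << width:
        raise ValueError("dividend does not fit in the given width")

    radix = 1 << bits_per_step  # 4 for 2 bits per step, 16 for 4 bits
    quotient = 0
    remainder = 0
    steps = 0
    for shift in range(width - bits_per_step, -1, -bits_per_step):
        # Bring down the next digit of the dividend.
        digit_in = (dividend >> shift) & (radix - 1)
        remainder = remainder * radix + digit_in
        # Choose the quotient digit: how often the divisor fits (0..radix-1).
        digit_out = 0
        while remainder >= divisor:
            remainder -= divisor
            digit_out += 1
        quotient = quotient * radix + digit_out
        steps += 1
    return quotient, remainder, steps


print(divide_by_digits(1000, 7, bits_per_step=2))  # (142, 6, 16)
print(divide_by_digits(1000, 7, bits_per_step=4))  # (142, 6, 8)
```

Both calls give the same quotient and remainder, but the radix-16 version gets there in half as many steps for the same operand width.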
Overall, the results are interesting and go some way towards confirming what we already expected from the theory behind the improvements.
There was a Santa Rosa laptop also sporting the 45nm Penryn CPU, but we weren’t allowed to run numbers on it as the hardware was just a proof-of-concept engineering sample.
As you can (just about) see on the left, CPU-Z can't quite recognise the new 45nm Penryn processor, thinking it has four lots of 32MB L1 data and instruction cache (that would make 256MB in total!), but it does read the two lots of 6MB L2 cache correctly.
Since both machines were entirely identical, even down to the chassis, keyboard, mouse and monitor, they needed labels detailing what was inside. Unfortunately we weren't allowed to take pictures of the insides while they were running.
Three different methods of motion estimation in H.264; in the benchmark, a far larger square block search was used instead. Also shown is the DivX 6.6 Alpha performance graph, detailing which features of the new Penryn CPU contribute what to its execution.