Intel Processor Roadmap: Penryn, Nehalem & the Future.
It wasn’t too long ago, back at the end of January, that Intel announced its next-generation processors based on the 45nm process. Now, just two months later, we’ve managed to secure an update on how this new process will feature in Intel’s future Penryn and Nehalem CPUs, while Intel also let slip some future developments in its architecture that surprised even us.
Penryn: More than a die shrink.
If you missed Tim’s article about how Intel has conquered current leakage with its new hafnium-based high-k dielectric and metal gate technology, the main benefits are 20% faster transistor switching at the same power usage and an overall 30% power reduction at the transistor level. What Intel told us today builds on that base technology.
Due in Q2 2007, Penryn is going to carry on the Core microarchitecture, continuing the rigorous two-year cycle of new processes Intel has stuck to for the last ten years. 65nm was first introduced with Presler, Yonah and Dempsey, then later used with the Core 2 family, which was launched in the middle of last year. Later this year Intel will launch its 45nm processors, starting with Penryn.
It’s easier to apply a new manufacturing process to an existing architecture than to a new one, as you know where to look for defects. Rather than try to achieve too much at once, it’s simpler to make one change at a time.
That said, Intel has used the move to 45nm to implement some tweaks and improvements, including higher core frequencies and increased IPC (instructions per clock cycle). We will be back in the 3GHz+ region, although there was no mention of where the frequency ceiling would be on the new platform.
In addition there’s a new Fast Radix-16 Divider, which is essentially an improved division engine that can process 4 bits per step instead of 2. It is used for both floating point and integer divides, as well as square root calculations, providing an average 2x increase in performance for these operations. This may seem unimportant, but geometric and physics calculations in games, 3D rendering and many scientific applications inherently rely on these math functions.
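To see why handling more bits per step matters, here is a minimal sketch of digit-by-digit integer division in Python. It is a simplified model for illustration only, not Intel’s actual divider circuit: the function name, the 32-bit width and the way each quotient digit is chosen are all our own assumptions. The point is simply that retiring 4 quotient bits per step (radix-16) takes half as many steps as retiring 2 bits per step (radix-4).

```python
def digit_recurrence_divide(dividend, divisor, width=32, bits_per_step=4):
    """Divide two unsigned integers, retiring `bits_per_step` quotient bits per step.

    Hypothetical illustrative model: real hardware selects each quotient
    digit with lookup tables and comparators, not with a full division.
    Returns (quotient, remainder, steps).
    """
    if divisor <= 0:
        raise ZeroDivisionError("divisor must be a positive integer")
    if not 0 <= dividend < (1 << width):
        raise ValueError("dividend does not fit in the given width")
    if width % bits_per_step:
        raise ValueError("width must be a multiple of bits_per_step")

    mask = (1 << bits_per_step) - 1
    quotient = 0
    remainder = 0
    steps = 0
    # Walk the dividend from the most significant chunk to the least.
    for shift in range(width - bits_per_step, -1, -bits_per_step):
        chunk = (dividend >> shift) & mask
        remainder = (remainder << bits_per_step) | chunk
        digit = remainder // divisor      # one quotient digit, 0 .. 2**bits_per_step - 1
        remainder -= digit * divisor
        quotient = (quotient << bits_per_step) | digit
        steps += 1
    return quotient, remainder, steps


# Same answer either way, but radix-16 needs half the steps of radix-4.
print(digit_recurrence_divide(1_000_000, 7, bits_per_step=2))
print(digit_recurrence_divide(1_000_000, 7, bits_per_step=4))
```

In this model a 32-bit divide takes 16 steps at 2 bits per step and 8 steps at 4 bits per step, which lines up with the roughly 2x figure Intel quotes.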
As Tim mentioned in his previous article, there will be 50 new SSE4 instructions aimed specifically at further optimising media, RAID, gaming and graphics. In addition there will be better power management from a greater number of sleep states, and cache sizes will increase by 50% - this means up to 6MB and 12MB for dual and quad core chips respectively. The front side bus (FSB) will increase in speed first on 65nm Core 2 to 1333MHz, then later to 1600MHz after the move to 45nm, before being lost almost entirely with Nehalem.
The performance figures Intel quoted us were impressive, but we will naturally reserve comment until we’ve tested the new hardware in the lab and compared it to existing technology. We were told to expect a 20% increase in gaming performance and 40% increase in media performance with the faster bus, clock speed, larger cache and SSE4.
In the mobile arena the 45nm Merom replacement will remain on the same socket initially; however, news through the grapevine hints that this may be set to change in Q4 2007 or Q1 2008. The deeper power down technology includes five power states:
- C0 is the active state in which everything is running at full capacity.
- C1 has the core clock turned off, as well as a slightly reduced core voltage. However the motherboard power lines are kept alive and the data cache is kept intact. This means that the performance isn’t compromised and the wakeup time is extremely fast, but only a little power is saved.
- C3 has the same core voltage drop as the C1 state, but also turns off the PLLs and flushes the L1 cache, switching it off and losing its data. The L2 cache remains intact, as it holds a larger and more general set of program data and so takes longer to refill than L1. The consequence of turning more off is a longer wakeup time, yet Intel’s chart shows no appreciable drop in idle power, and therefore no gain in battery life. However, we were only provided with a simple description rather than accurate working numbers, so we will have to reserve a true judgement until they arrive.
- C4 drops the core voltage again, as well as partially flushing the L2 cache on top of everything previously applied in the C3 state.
- C6, the final power down state, is an almost complete shutdown of the CPU. There is a significant drop in core voltage and everything is now switched off to maximise battery life. The obvious downside to this state is that the resume time will be greater.
Intel implies the sleep states are seamless, automatic transitions rather than something you select like the hibernate or sleep functions within the OS.
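The five states above can be summarised as a small lookup table. The sketch below is our own reading of Intel’s description, not an official specification: the `CState` class, its field names and the `deepest_state` helper are hypothetical, and the voltage column is qualitative because we were given no numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CState:
    """One power state as described in the article (hypothetical model)."""
    name: str
    core_clock_on: bool
    plls_on: bool
    l1_intact: bool
    l2_intact: bool   # False also covers "partially flushed"
    core_voltage: str  # qualitative only; no figures were provided


# Ordered from shallowest to deepest sleep.
STATES = [
    CState("C0", True,  True,  True,  True,  "full"),
    CState("C1", False, True,  True,  True,  "slightly reduced"),
    CState("C3", False, False, False, True,  "slightly reduced"),
    CState("C4", False, False, False, False, "reduced again"),
    CState("C6", False, False, False, False, "significantly reduced"),
]


def deepest_state(keep_l1=False, keep_l2=False):
    """Return the deepest state that still preserves the requested caches."""
    for state in reversed(STATES):
        if keep_l1 and not state.l1_intact:
            continue
        if keep_l2 and not state.l2_intact:
            continue
        return state


print(deepest_state().name)              # nothing needs preserving
print(deepest_state(keep_l2=True).name)  # L2 contents must survive
print(deepest_state(keep_l1=True).name)  # L1 contents must survive
```

Read this way, the trade-off is clear: the more of the chip you are willing to lose, the deeper the state the CPU can drop into, and the longer it takes to wake.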
Enhanced Dynamic Acceleration Technology is Intel’s new performance booster, which works without exceeding the chip’s thermal design power (TDP, in watts). On a multi core chip, if the running threads only occupy some of the cores, the CPU will actually overclock the busy cores in order to crunch through the work faster. It then puts the other cores into a deeper, lower power sleep state in order to compensate for the thermal increase.
While details are still scarce on the degree of auto overclocking employed and how much performance increase it offers, it’s certainly an interesting technique. This could provide for more overclockable processors - the only downside we can see is that a deeper sleep state means that other cores take longer to wake up, leaving the system unresponsive and dedicated to its single task.
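Since Intel gave no figures, the best we can do is sketch the principle. In the hypothetical Python model below, every wattage and both function names are invented for illustration: a sleeping core frees up part of the power budget, and a boost is only allowed if the busy cores’ extra draw fits inside what is left under the TDP.

```python
def power_headroom(tdp_watts, core_busy, active_watts, sleep_watts):
    """Watts left under the TDP when idle cores are asleep (hypothetical model)."""
    used = sum(active_watts if busy else sleep_watts for busy in core_busy)
    return tdp_watts - used


def can_boost(tdp_watts, core_busy, active_watts, sleep_watts, boost_extra_watts):
    """Allow a boost only if some cores are idle and the extra draw fits the budget."""
    busy_cores = sum(core_busy)
    if busy_cores == 0 or busy_cores == len(core_busy):
        return False  # nothing to boost, or no idle core to borrow headroom from
    headroom = power_headroom(tdp_watts, core_busy, active_watts, sleep_watts)
    return headroom >= busy_cores * boost_extra_watts


# Made-up numbers: 35W TDP, 17W per active core, 2W per sleeping core,
# and 5W of extra draw for each boosted core.
print(can_boost(35, [True, False], 17, 2, 5))  # one core busy, one asleep
print(can_boost(35, [True, True], 17, 2, 5))   # both cores busy
```

With these made-up numbers the single busy core has 16W of headroom and the boost is allowed, while with both cores busy there is no idle core to borrow from.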
While the Core architecture widened SSE instruction handling from 64-bit to 128-bit, allowing an SSE operation to execute in a single clock for a 2x performance increase, Penryn improves on this further by adding a Super Shuffle Engine alongside SSE4. Some SSE operations, such as unpacking, packing, aligning, wide shifting, insertion, extraction and 'setup for horizontal arithmetic functions', have shuffle operations in them. The new engine allows the processor to execute full width shuffles in a single pass, improving performance by reducing latencies.
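To show what a shuffle is and why operations like unpacking contain one, here is a small Python model working on 16-byte vectors represented as lists. It is only loosely modelled on SSE byte shuffles and is not the exact instruction semantics; `shuffle_bytes` and `unpack_low_bytes` are hypothetical helper names of our own.

```python
def shuffle_bytes(src, control):
    """Build a new vector by picking src[index] for each control entry.

    A control entry of None produces a zero byte. Illustrative model only.
    """
    return [0 if index is None else src[index] for index in control]


def unpack_low_bytes(a, b):
    """Interleave the low 8 bytes of a and b, expressed as a single shuffle."""
    combined = a + b  # bytes 0-15 come from a, bytes 16-31 from b
    # Even output slots take a[0], a[1], ...; odd slots take b[0], b[1], ...
    control = [i // 2 + (16 if i % 2 else 0) for i in range(16)]
    return shuffle_bytes(combined, control)


a = list(range(16))        # 0, 1, 2, ... 15
b = list(range(100, 116))  # 100, 101, ... 115

print(shuffle_bytes(a, list(range(15, -1, -1))))  # byte-reversed copy of a
print(unpack_low_bytes(a, b))                     # a0, b0, a1, b1, ... a7, b7
```

The unpack is nothing more than a shuffle with a fixed control pattern, which is why a faster full width shuffle unit speeds up the whole family of operations Intel lists.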
Intel has also enhanced its virtualisation technology in Penryn, speeding up the virtual machine transition times (entry & exit) by between 25 and 75 percent. This is achieved through microarchitectural improvements and doesn't require any software change to the virtual machine. Future Apple Mac users are likely to be the most obvious beneficiaries, quickly flicking between Mac OS X and Windows XP/Vista without having to reboot between each change.
Intel stipulated that current motherboards would need to meet certain requirements before being able to run 45nm CPUs, despite being socket compatible. A simple BIOS update is part of it, but the board would also need to support the new power requirements and the elevated bus frequencies.