Nehalem is based on a scalable and dynamic architecture that has been designed in a modular way. We’ve heard of this before – back when AMD was discussing the “Falcon” Fusion design that should arrive next year.
It seems to be a popular approach, since a modular design gives the flexibility to change and to “quickly” enter potential new markets without spending several years back at the drawing board.
Nehalem will launch as Intel’s first native quad-core part but will be scalable from two to eight cores. Every core includes Simultaneous Multi-Threading (SMT) with two threads per core, meaning there will be four to sixteen threads available for processing workloads.
The cache hierarchy will also mirror AMD’s current parts to some degree: the L2 cache drops to just 256KB per core, although it’s apparently “super low latency”, while the L1 cache remains unchanged from the current Core microarchitecture design. Nehalem also adds an inclusive L3 cache that acts as a snoop filter.
Being inclusive means that part of the L3 is automatically reserved to hold copies of the L1 and L2 data, so that no core has to waste cycles snooping another core’s cache for information; it simply has to look in the shared L3 cache. Intel says that the first Nehalem-based product will be quad-core and will feature 8MB of L3 cache, which means that around 1,280KB of L3 will be reserved for copies of the L1 and L2 caches across all four cores, or 320KB for each core.
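To illustrate why an inclusive L3 can double as a snoop filter, here is a toy Python model. It is only a sketch of the general idea and not a description of Intel's hardware: the class name `InclusiveL3`, its methods, and the per-line set of holding cores are all assumptions made for the example.

```python
class InclusiveL3:
    """Toy model of a shared, inclusive last-level cache (illustrative only)."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        # Hypothetical bookkeeping: line address -> set of core ids whose
        # private L1/L2 caches may hold a copy of that line.
        self.lines = {}

    def fill(self, core, addr):
        """A core loads a line; inclusion means the L3 must hold a copy too."""
        self.lines.setdefault(addr, set()).add(core)

    def evict(self, addr):
        """Evicting a line from L3 returns the cores that must drop it as well."""
        return self.lines.pop(addr, set())

    def cores_to_snoop(self, requester, addr):
        """Return the other cores that need probing when `requester` wants addr."""
        holders = self.lines.get(addr)
        if holders is None:
            # L3 miss: inclusion guarantees no private cache holds the line,
            # so no core has to be snooped at all.
            return set()
        return holders - {requester}


if __name__ == "__main__":
    l3 = InclusiveL3(num_cores=4)
    l3.fill(core=0, addr=0x1000)
    print(l3.cores_to_snoop(requester=1, addr=0x1000))  # only core 0 is probed
    print(l3.cores_to_snoop(requester=1, addr=0x2000))  # nobody is probed
```

The point of the sketch is the early return: because every line in a private cache also exists in the L3, a miss in the L3 is enough to rule out every other core without asking them.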
That cuts the space available for everything else to just under 7MB, which is quite a drop from the 2x6MB shared cache design on current quad-core 45nm parts. However, the integrated memory controller means that main memory access is far quicker, and the cores can all talk to one another directly. Previously, one pair of cores had to navigate the front side bus to the memory controller hub and then back again to talk to the other pair of cores sitting right next to it.
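The cache figures above can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the function name `l3_budget` is made up for the example, and it assumes a 64KB L1 per core (32KB instruction plus 32KB data, as on the current Core microarchitecture), which the figures quoted here imply but do not state outright.

```python
KB_PER_MB = 1024


def l3_budget(cores, l1_kb_per_core, l2_kb_per_core, l3_mb):
    """Return (reserved per core, reserved total, remaining) in KB."""
    reserved_per_core = l1_kb_per_core + l2_kb_per_core
    reserved_total = cores * reserved_per_core
    remaining = l3_mb * KB_PER_MB - reserved_total
    return reserved_per_core, reserved_total, remaining


if __name__ == "__main__":
    # Assumption: 32KB instruction + 32KB data L1 per core.
    per_core, total, remaining = l3_budget(
        cores=4, l1_kb_per_core=32 + 32, l2_kb_per_core=256, l3_mb=8
    )
    print(per_core)               # 320
    print(total)                  # 1280
    print(remaining)              # 6912
    print(remaining / KB_PER_MB)  # 6.75
```

So the 1,280KB figure is the total across all four cores, and the "about 7MB" left over is more precisely 6.75MB.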
Other improvements include seven new SSE 4.2 instructions, faster “unaligned” cache access (media applications were tipped to see the most benefit), faster thread synchronisation hardware to help not only the SMT engine but multi-core operation in general, a Renamed Return Stack Buffer and a new second-level branch predictor that can handle very large code footprints, such as databases.
Intel was keen to point out that its SMT should add to Nehalem's energy efficiency, while the direct memory access and lower latency design should greatly improve performance compared to the MCH designs it's currently using. Both server and high-end desktop parts (both with triple-channel DDR3) should be available in Q4 2008, with mainstream parts arriving in 2009.