Nehalem uArch Improvements - 256KB L2, 8MB L3 Confirmed
Performance Improvement Features:
With the next generation microarchitecture, Intel made significant core enhancements to further improve
the performance of the individual processor cores. Below we describe some of these enhancements.
Instructions per cycle improvements. The more instructions that can be run per clock cycle, the greater the performance. In addition, in many cases, running more instructions in a given clock cycle lets the work task complete sooner, enabling the processor to get back into a lower power state more quickly. To run more instructions per cycle, Intel made several key innovations.
• Greater parallelism. One way to extract more parallelism out of software code is to increase the
number of instructions that can be run “out of order.” This enables more simultaneous processing and
lets latencies overlap. To be able to identify more independent operations that can be run in parallel, Intel
increased the size of the out-of-order window and scheduler, giving the core a wider window from
which to look for these operations. Intel also increased the size of the other buffers in the core to
ensure they wouldn’t become a limiting factor.
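To illustrate why a wider out-of-order window helps, here is a toy model, not a description of Nehalem's actual scheduler. It assumes every instruction takes one cycle, issue width is unlimited, and the scheduler can only see the oldest `window` pending instructions. The function name and the example dependency lists are hypothetical.

```python
def cycles_to_run(deps, window):
    """Count cycles needed to issue every instruction.

    deps[i] is the set of earlier instruction indices that instruction i
    depends on. Each cycle the scheduler inspects only the oldest `window`
    pending instructions and issues those whose inputs are already done.
    """
    done = set()
    pending = list(range(len(deps)))
    cycles = 0
    while pending:
        visible = pending[:window]
        # Readiness is judged against results from previous cycles only.
        ready = [i for i in visible if deps[i] <= done]
        done.update(ready)
        pending = [i for i in pending if i not in done]
        cycles += 1
    return cycles


# Two independent 4-instruction dependency chains, laid out back to back.
chains = [set(), {0}, {1}, {2}, set(), {4}, {5}, {6}]

for size in (1, 2, 8):
    print(size, cycles_to_run(chains, size))
```

With a window of 1 or 2 the scheduler mostly sees one chain at a time; with a window of 8 it sees both chains and can overlap them.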
• More efficient algorithms. With each new microarchitecture, Intel has included improved algorithms in places where previous processor generations saw lost performance due to stalls (dead cycles). Next generation Intel microarchitecture (Nehalem) brings many such improved algorithms to increase performance. These include:
• Faster Synchronization Primitives: As multi-threaded software becomes more prevalent, the
need to synchronize threads is also becoming more common. Next generation Intel
microarchitecture (Nehalem) speeds up the common legacy synchronization primitives (such
as instructions with a LOCK prefix or the XCHG instruction) so that existing threaded
software will see a performance boost.
• Faster Handling of Branch Mispredictions: A common way to increase performance is
through the prediction of branches. Next generation Intel microarchitecture (Nehalem)
optimizes the cases where the predictions are wrong, so that the effective penalty of
branch mispredictions overall is lower than on prior processors.
• Improved hardware prefetch and better load-store scheduling: Next generation Intel
microarchitecture (Nehalem) continues the many advances Intel made with the 45nm next
generation Intel Core microarchitecture (Penryn) family of processors in reducing memory
access latencies through prefetch and load-store scheduling improvements.
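As an illustration of the kind of code that depends on the synchronization primitives mentioned above, here is a sketch of a spinlock built on an atomic exchange. Python has no direct XCHG equivalent, so the `AtomicFlag` class below is a hypothetical stand-in that uses a `threading.Lock` purely to model the atomicity the hardware instruction provides. It says nothing about how fast the operation is on any processor.

```python
import threading
import time


class AtomicFlag:
    """Hypothetical stand-in for a memory word updated by an atomic exchange."""

    def __init__(self):
        self._value = False
        self._guard = threading.Lock()  # models hardware atomicity only

    def exchange(self, new):
        """Store `new` and return the previous value as one indivisible step."""
        with self._guard:
            old, self._value = self._value, new
            return old


class SpinLock:
    """Test-and-set spinlock: spin until the exchange returns False."""

    def __init__(self):
        self._flag = AtomicFlag()

    def acquire(self):
        while self._flag.exchange(True):
            time.sleep(0)  # yield so the current holder can make progress

    def release(self):
        self._flag.exchange(False)


def run_counter(threads=4, increments=500):
    """Increment a shared counter from several threads under the spinlock."""
    lock = SpinLock()
    total = [0]

    def worker():
        for _ in range(increments):
            lock.acquire()
            total[0] += 1
            lock.release()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return total[0]
```

Every `acquire` and `release` performs one exchange, so the cost of that single operation is paid on each pass through the critical section.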
Enhanced branch prediction. Branch prediction attempts to guess whether a conditional branch will be taken or not. Branch predictors are crucial in today's processors for achieving high performance. They allow processors to fetch and execute instructions without waiting for a branch to be resolved. Processors also use branch target prediction to guess the target of a branch or unconditional jump before it is computed by parsing the instruction itself. Beyond greater performance, increased branch prediction accuracy can also enable the processor to consume less energy by spending less time executing mispredicted branch paths.
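As a concrete picture of what "guessing whether a branch will be taken" can look like, here is a sketch of the classic textbook two-bit saturating counter. This is an assumption chosen for illustration; the text does not say which prediction scheme Nehalem uses, and real predictors are far more elaborate.

```python
def two_bit_predictor(outcomes, state=2):
    """Count mispredictions for a sequence of branch outcomes.

    outcomes: iterable of booleans (True = taken, False = not taken).
    state: counter in 0..3; values 2 and 3 predict "taken".
    """
    mispredicts = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            mispredicts += 1
        # Move the counter toward the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredicts


# A loop branch: taken nine times, then falls through once, run three times.
loop_branch = ([True] * 9 + [False]) * 3
print(two_bit_predictor(loop_branch))
```

In this model the only mispredictions are the three loop exits, because a single not-taken outcome does not flip the prediction for the next iteration.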
Next generation Intel microarchitecture (Nehalem) uses several innovations to reduce branch mispredicts
that can hinder performance and to improve the handling of branch mispredicts.
• New second-level branch target buffer (BTB). To improve branch predictions in applications that have large code footprints, such as database applications, Intel added a second-level branch target buffer (BTB). BTBs reduce the performance penalty of branches in pipelined processors by predicting the
path of the branch and caching information used by the branch.
• New renamed return stack buffer (RSB). RSBs store forward and return pointers associated with call and return instructions. Next generation microarchitecture’s renamed RSB helps avoid many common
return instruction mispredictions.
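The basic idea of a return stack buffer can be sketched as a small fixed-depth stack. The model below is a simplification: the class name and depth are hypothetical, and it does not model the renaming that the text credits with avoiding mispredictions.

```python
class ReturnStackBuffer:
    """Toy fixed-depth stack of predicted return addresses."""

    def __init__(self, depth=16):  # depth is a hypothetical value
        self.depth = depth
        self.stack = []

    def on_call(self, return_address):
        """Record where the matching return instruction should go."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # buffer full: the oldest entry is lost
        self.stack.append(return_address)

    def predict_return(self):
        """Predict the target of a return, or None if nothing is recorded."""
        return self.stack.pop() if self.stack else None


rsb = ReturnStackBuffer(depth=2)
for address in (0x100, 0x200, 0x300):  # three nested calls, room for two
    rsb.on_call(address)
predictions = [rsb.predict_return() for _ in range(3)]
```

Here the two innermost returns are predicted correctly, while the outermost one has been overwritten and gets no prediction. That is one way a plain return stack can mispredict.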
Intel Smart Cache Enhancements:
The new three-level cache hierarchy for next generation Intel microarchitecture (Nehalem) consists of:
• Same L1 cache as Intel Core microarchitecture (32 KB Instruction Cache, 32 KB Data Cache)
• New L2 cache per core for very low latency (256 KB per core for handling data and instructions)
• New fully inclusive, fully shared 8 MB L3 cache (all applications can use entire cache)
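One common way to reason about a multi-level hierarchy like this is the average memory access time formula, in which each level adds its latency plus its miss rate times the cost of going one level further out. The sketch below applies it to three levels. All latencies and miss rates are hypothetical placeholders, not Nehalem measurements.

```python
def average_access_time(levels, memory_latency):
    """Average access time in cycles for a cache hierarchy.

    levels: list of (hit_latency_cycles, miss_rate) from L1 outward.
    memory_latency: cycles to reach main memory after the last level misses.
    """
    cost = memory_latency
    # Work from the outermost level back toward L1.
    for latency, miss_rate in reversed(levels):
        cost = latency + miss_rate * cost
    return cost


# Hypothetical numbers chosen only to make the arithmetic easy to follow.
hierarchy = [
    (4, 0.1),   # L1: 4 cycles, 10% of accesses miss
    (10, 0.5),  # L2: 10 cycles, 50% of L1 misses also miss here
    (40, 0.2),  # L3: 40 cycles, 20% of L2 misses also miss here
]
print(average_access_time(hierarchy, memory_latency=200))
```

With these placeholder values the result is 4 + 0.1 × (10 + 0.5 × (40 + 0.2 × 200)) = 9 cycles.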
A new two-level Translation Lookaside Buffer (TLB) hierarchy is also included in next generation Intel
microarchitecture (Nehalem). A TLB is a processor cache that is used by memory management hardware to improve the speed of virtual address translation. The TLB references physical memory addresses in its table.
All current desktop and server processors use a TLB, but next generation Intel microarchitecture (Nehalem)
adds a new second-level, 512-entry TLB to further improve performance.
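A two-level TLB lookup can be sketched as two small caches checked in order before falling back to a page-table walk. In the model below, the 512-entry second level comes from the text. The first-level size, the LRU replacement policy, and the dictionary page table are assumptions made for illustration.

```python
from collections import OrderedDict


def _insert(cache, capacity, key, value):
    """Insert into an LRU cache, evicting the least recently used entry."""
    cache[key] = value
    cache.move_to_end(key)
    if len(cache) > capacity:
        cache.popitem(last=False)


class TwoLevelTLB:
    def __init__(self, l1_entries=64, l2_entries=512):  # L1 size is hypothetical
        self.l1_entries = l1_entries
        self.l2_entries = l2_entries
        self.l1 = OrderedDict()
        self.l2 = OrderedDict()

    def translate(self, vpn, page_table):
        """Return (frame, source) for a virtual page number."""
        if vpn in self.l1:
            self.l1.move_to_end(vpn)
            return self.l1[vpn], "L1"
        if vpn in self.l2:
            self.l2.move_to_end(vpn)
            _insert(self.l1, self.l1_entries, vpn, self.l2[vpn])
            return self.l2[vpn], "L2"
        frame = page_table[vpn]  # stands in for the slow page-table walk
        _insert(self.l2, self.l2_entries, vpn, frame)
        _insert(self.l1, self.l1_entries, vpn, frame)
        return frame, "walk"


page_table = {n: n + 100 for n in range(8)}
tlb = TwoLevelTLB(l1_entries=2, l2_entries=4)
trace = [tlb.translate(vpn, page_table) for vpn in (0, 1, 2, 0, 0)]
```

In the trace, page 0 is evicted from the tiny first level by pages 1 and 2, but the second level still holds it, so the fourth access avoids a full walk.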
Improved virtualization performance. Next generation Intel microarchitecture (Nehalem) adds new features that enable software to further improve its performance in virtualized environments. For example, the next generation microarchitecture includes an Extended Page Table (EPT) for reconciling memory type specification in a guest operating system with memory type specification in the host operating system, in virtualization systems that support memory type specification.
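EPT is generally described as a second layer of address translation that maps guest-physical addresses to host-physical addresses. A minimal sketch of that two-stage lookup follows. It uses flat dictionaries in place of the real multi-level tables, and the function name, table contents, and 4 KB page size are illustrative assumptions.

```python
PAGE_SIZE = 4096  # assumed 4 KB pages


def translate(guest_virtual, guest_page_table, ept):
    """Translate a guest-virtual address to a host-physical address.

    guest_page_table: guest-virtual page -> guest-physical page (guest OS owned)
    ept: guest-physical page -> host-physical page (hypervisor owned)
    """
    page, offset = divmod(guest_virtual, PAGE_SIZE)
    guest_physical_page = guest_page_table[page]    # stage 1: guest tables
    host_physical_page = ept[guest_physical_page]   # stage 2: extended tables
    return host_physical_page * PAGE_SIZE + offset


guest_page_table = {5: 2}  # hypothetical mappings
ept = {2: 9}
host_address = translate(5 * PAGE_SIZE + 0x10, guest_page_table, ept)
```

In this model the guest operating system edits only its own table and the hypervisor edits only the second one, which is why hardware support for the second stage can reduce the work the hypervisor must do in software.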
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=3264&p=2