Saturday, March 08, 2008

45nm, An Interesting Convergence In Design

AMD's K8 has been trailing C2D for the past 18 months. Making things worse, AMD has stumbled a bit with its K10 launch while Intel's Penryn seems to be on schedule. It is somewhat surprising then to discover that Intel's Nahelem and AMD's Shanghai are so similar.


Both are quad core, both use L3 cache, and both use a point to point interface (QuickPath for Intel and HyperTransport for AMD). And, according to Hans De Vries:

Nehalem
731 Million Transistors
246mm Die Size
7.1mm/MB L2 Cache Density

Shanghai
700 Million Transistors
243mm Die Size
7.5mm/MB L2 Cache Density

We don't really see much difference until we look at core size and L3 density:

Nehalem
24.4mm Core Size
5.7mm/MB L3 Cache Density

Shanghai
15.3mm Core Size
7.5mm/MB L3 Cache Density

AMD uses 2MB's of L2 + 6MB's of L3 for 8MB's total.
Intel uses 1MB of L2 + 8MB's of L3 for 9MB's total.

Along with similar total cache size the area devoted to cache is similar as well (nearly identical area for L3). However, the area devoted to core logic is quite different:

Nehalem
Core Area - 97.6mm
L2 Area - 7.1mm
L3 Area - 45.6mm

Shanghai
Core Area - 61.2mm
L2 Area - 15mm
L3 Area - 45mm

We see right away that Nehalem devotes 85% more die area to core logic than to cache whereas Shanghai devotes about the same die area to core logic and cache. It is almost a certainty that with Nehalem's greater amount of core transistors that it will run faster than Shanghai. On the other hand it should also consume more power. If we assume that Intel gets a reduction in power draw due to having better 45nm transistors then this should offset some of Nehalem's higher power draw. However, with 60% more transistors devoted to core logic I don't believe that all of this could be offset. My guess is that at full load Nehalem will draw more than Shanghai but Nehalem should be closer at idle power. Actually, Nehalem's core ratio at 40% is almost the same as Penryn's at 41% which is only slightly less than Merom's at 44%. In contrast, Shanghai's core ratio has dropped to a tiny 25% much smaller than Barcelona's 36%.

Penryn has a density of 6.0mm/MB for L2. Therefore, I would expect Nehalem's L2 at a density of 7.1mm/MB to be faster than Penryn's L2. However, I would also expect Nehalem's L3 at 5.7mm/MB to be slightly slower than Penryn's L2. This is interesting because we know that Barcelona's L3 is a bit slow. However, with an L2 twice the size of Nehalem's this should be a closer match than Barcelona is to current Penryn with its massive 12MB L2. Essentially, Shanghai is Barcelona with 3X the L3 cache but Nehalem's cache structure is much more like Shanghai's than Penryn's. Shanghai is unlikely to have any significant core changes from Barcelona but there may be some minor changes in the decoding section. This is pretty much what I would expect. Since Barcelona only improved on the decoding speed of FP instructions compared to K8 I would expect AMD to tweak a few more Integer instructions in Shanghai to increase speed. Decreasing decoding time for commonly used instructions would also make a lot of sense given Penryn's advantage in decoders. AMD is not likely to get a boost of 10-15% doing this but a boost of, say, 3% is possible. AMD might see another 3% boost in speed due to larger L3 although this could possibly be higher if AMD could make L3 faster. A lot of people are wondering if AMD will bump the master L3 clock on Shanghai for better performance at speeds above 2.3Ghz. Just bumping the clock could be worth another 1%. So, I'm guessing a total speedup for Shanghai of perhaps 7% versus Barcelona.

There is one other obvious difference in the die layout. Shanghai has mirror image cores. This means that a master clock signal originating from the center of the Northbridge area should propagate out to each of the four cores at the same time and this is a standard way of doing timing. However, since Nehalem's cores are placed side by side it must be using a more sophisticated clocking scheme. In other words, there is not any place on the Nehalem die where a clock signal would reach all four cores at the same time. This would tend to suggest the possibility that each core actually has its own master clock allowing them to run nearly asynchronously. If Nehalem does indeed have this ability then this could be a big benefit in offsetting the higher power draw since cores that were less busy could simply be clock scaled downward. AMD could therefore be in for a tough fight in 2009 in spite of the similarities between Shanghai and Nehalem. The one big benefit to AMD from the similarity is that it makes it nearly impossible to cache tune the benchmarks in Intel's favor. This means that Intel is probably going to lose some artificial advantage that it has now with Penryn. However, given the difference in core transistors and Nehalem's greater memory bandwidth I can't see any reason why Nehalem couldn't make up for this loss and still have an overall performance lead at the same clock as Shanghai.