Saturday, March 08, 2008

45nm, An Interesting Convergence In Design

AMD's K8 has been trailing C2D for the past 18 months. Making things worse, AMD has stumbled a bit with its K10 launch while Intel's Penryn seems to be on schedule. It is somewhat surprising then to discover that Intel's Nahelem and AMD's Shanghai are so similar.


Both are quad core, both use L3 cache, and both use a point to point interface (QuickPath for Intel and HyperTransport for AMD). And, according to Hans De Vries:

Nehalem
731 Million Transistors
246mm Die Size
7.1mm/MB L2 Cache Density

Shanghai
700 Million Transistors
243mm Die Size
7.5mm/MB L2 Cache Density

We don't really see much difference until we look at core size and L3 density:

Nehalem
24.4mm Core Size
5.7mm/MB L3 Cache Density

Shanghai
15.3mm Core Size
7.5mm/MB L3 Cache Density

AMD uses 2MB's of L2 + 6MB's of L3 for 8MB's total.
Intel uses 1MB of L2 + 8MB's of L3 for 9MB's total.

Along with similar total cache size the area devoted to cache is similar as well (nearly identical area for L3). However, the area devoted to core logic is quite different:

Nehalem
Core Area - 97.6mm
L2 Area - 7.1mm
L3 Area - 45.6mm

Shanghai
Core Area - 61.2mm
L2 Area - 15mm
L3 Area - 45mm

We see right away that Nehalem devotes 85% more die area to core logic than to cache whereas Shanghai devotes about the same die area to core logic and cache. It is almost a certainty that with Nehalem's greater amount of core transistors that it will run faster than Shanghai. On the other hand it should also consume more power. If we assume that Intel gets a reduction in power draw due to having better 45nm transistors then this should offset some of Nehalem's higher power draw. However, with 60% more transistors devoted to core logic I don't believe that all of this could be offset. My guess is that at full load Nehalem will draw more than Shanghai but Nehalem should be closer at idle power. Actually, Nehalem's core ratio at 40% is almost the same as Penryn's at 41% which is only slightly less than Merom's at 44%. In contrast, Shanghai's core ratio has dropped to a tiny 25% much smaller than Barcelona's 36%.

Penryn has a density of 6.0mm/MB for L2. Therefore, I would expect Nehalem's L2 at a density of 7.1mm/MB to be faster than Penryn's L2. However, I would also expect Nehalem's L3 at 5.7mm/MB to be slightly slower than Penryn's L2. This is interesting because we know that Barcelona's L3 is a bit slow. However, with an L2 twice the size of Nehalem's this should be a closer match than Barcelona is to current Penryn with its massive 12MB L2. Essentially, Shanghai is Barcelona with 3X the L3 cache but Nehalem's cache structure is much more like Shanghai's than Penryn's. Shanghai is unlikely to have any significant core changes from Barcelona but there may be some minor changes in the decoding section. This is pretty much what I would expect. Since Barcelona only improved on the decoding speed of FP instructions compared to K8 I would expect AMD to tweak a few more Integer instructions in Shanghai to increase speed. Decreasing decoding time for commonly used instructions would also make a lot of sense given Penryn's advantage in decoders. AMD is not likely to get a boost of 10-15% doing this but a boost of, say, 3% is possible. AMD might see another 3% boost in speed due to larger L3 although this could possibly be higher if AMD could make L3 faster. A lot of people are wondering if AMD will bump the master L3 clock on Shanghai for better performance at speeds above 2.3Ghz. Just bumping the clock could be worth another 1%. So, I'm guessing a total speedup for Shanghai of perhaps 7% versus Barcelona.

There is one other obvious difference in the die layout. Shanghai has mirror image cores. This means that a master clock signal originating from the center of the Northbridge area should propagate out to each of the four cores at the same time and this is a standard way of doing timing. However, since Nehalem's cores are placed side by side it must be using a more sophisticated clocking scheme. In other words, there is not any place on the Nehalem die where a clock signal would reach all four cores at the same time. This would tend to suggest the possibility that each core actually has its own master clock allowing them to run nearly asynchronously. If Nehalem does indeed have this ability then this could be a big benefit in offsetting the higher power draw since cores that were less busy could simply be clock scaled downward. AMD could therefore be in for a tough fight in 2009 in spite of the similarities between Shanghai and Nehalem. The one big benefit to AMD from the similarity is that it makes it nearly impossible to cache tune the benchmarks in Intel's favor. This means that Intel is probably going to lose some artificial advantage that it has now with Penryn. However, given the difference in core transistors and Nehalem's greater memory bandwidth I can't see any reason why Nehalem couldn't make up for this loss and still have an overall performance lead at the same clock as Shanghai.

37 comments:

Unknown said...

Great analysis Scientia, but after reading that, I had a terrible stomach ache because of the idea that AMD will suck once again for the third time. :/

AMD is doing wonderfully on the graphics/chipset side, but this won't save them for ever if they keep executing the way they are. I'm starting to have the idea that their processor division simply sucks.

When I knew AMD, I was really happy to use their K7 and K8 processors because of the great performance per clock per watt it offered against intel processors, but looking at the trend nowadays, I don't know when shall we see that again. :(

To be sincere with you, I won't put my faith in Bulldozer this time (as I did with Barcelona AND Shanghai). I just don't want to be disappointed once again.

I really don't care if AMD is behind in process manufacturing, all we want (as enthusiast) is a killer uArch from them so that they once again can regain their name of "ADVANCED Micro Devices".

And about Intel, I really disgust them: they used nasty tactics all this time to exclude AMD from big deals, they coerce people and companies on buying their products, they use false information to deceive people and big companies, they have the ability to interfere in the purchasing decissions of big enterprises and even countries (One Laptop Per Child anyone?), they have the advantage that 95% of apps are coded using their nasty compiler, and the list keeps growing.

Once again, thanks for your sincere thoughts Scientia. I wonder what the stupid intel fanboys gathered at Roborat's blog have to say about this overview.

Keep with the good work. ;)

Scientia from AMDZone said...

erlindo

Please refrain from personal attacks. Some of the ideas at roborat's blog have been correct although they do seem to have forgotten about predicting that AMD would go bankrupt, that AMD would be purchased, that AMD would completely outsource, that AMD would sell FAB 30, or that AMD would end up divesting ATI.

My guess is that similar amnesia will develop if the ATI purchase turns out to be a big benefit for AMD later this year.

Shanghai is still up in the air since AMD has neither announced an actual release date nor initial clock speeds. I guess until we see Shanghai run against Nehalem we really won't know how well AMD is competing.

Unknown said...

Sorry about the personal attacks, is just that I hate how those guys like to distort reality, and the bad thing about that, is that the "ignorant Joe" will believe every bit of their crap.

I promise I won't fall again on the personal insults.

enumae said...

Scientia,

What are your thought about VR-Zones Dunnington - Nehaem Ramp, possible Nehalem in Q3?

And the news from DigiTimes about...

"AMD's 45nm processors have already entered EVT testing, according to sources at motherboard makers, who added they expect to receive samples by August or September."

--------------------------------

Also, if Nehalem can, or does get released in Q3 (in small volume and with projected performance +/-), while ramping quickly (according to the graphs), what kind of effect do you expect it will have on AMD's revenue for 2008?

Thanks

Scientia from AMDZone said...

enumae

"What are your thought about VR-Zones Dunnington - Nehaem Ramp, possible Nehalem in Q3?"

Apparently the EX version of Nehalem, Beckton won't be out until Q4 2009 so it looks like Dunnington will be the main 4-way chip until then. The Nehalem ramp introduces more products in Q1 and Q2 2009. The volume in Q4 won't be much of a factor.

"AMD's 45nm processors have already entered EVT testing, according to sources at motherboard makers"

Well, EVT or Engineering Validation Testing comes right after first silicon. Then you do Design Validation Testing and then Product Validation Testing.

"who added they expect to receive samples by August or September."

"Samples" doesn't seem to be the right word. They should be receiving samples long before then. By August or Semptember; they should be receiving actual product.

"Also, if Nehalem can, or does get released in Q3 (in small volume and with projected performance +/-), while ramping quickly (according to the graphs), what kind of effect do you expect it will have on AMD's revenue for 2008?"

Very little. Penryn and Dunnington will be much larger factors along with Centrino 2.

You didn't mention the other item in the Digitimes article about AMD's discrete graphics growing from 35% to 50% share this year. Unfortunately for nVidia this is entirely plausible. The problem for nVidia is that Intel makes a lot of motherboards but none of them can do SLI so this leaves a fair segment of the market open for AMD. Frankly I'm not sure how nVidia could stop AMD unless:

1.) nVidia releases a much better GPU preventing AMD from closing the gap.

2.) nVidia licenses SLI to Intel.

3.) Intel stops making motherboards.

AR said...

Why do you assume the master clock on Shanghai cannot be further clock gated or multiplied independently at each core ? Also asynch clocks create a problem at shared structures such as the L3, meaning a little more latency.. no ?

You also seem to ignore that Nehalem is SMT while Shanghai is not. That might explain the die area differences.

I wonder if Shanghai is still 3 wide or its been bumped up to 4 way superscalar.

Ho Ho said...

First of all, make that 700k+ image into thumbnail. It takes quite long to load even on 12Mbit


scientia
"It is almost a certainty that with Nehalem's greater amount of core transistors that it will run faster than Shanghai"

While I agree it should run faster it brought back some memories how people said intel is using cache to hide its architecture shortfalls. I sure hope AMD isn't trying to do it with Shanghai. Also as was said some of the core increase comes from SMT and assuming it is any better than HT in P4 it could give multithreaded workloads quite big boost, something you cannot make up with bigger/faster caches or slight IPC improvements.

For the record, Intel itself has said Nehalem is 10-25% faster in single-threaded and up to 100% in multithreaded workloads. Of course that 100% must be taken with a grain of salt, I've seen 50% speed increases on real-world code with P4 HT but it was certainly not the norm. Another big thing that will likely increase Nehalem IPC by quite a bit is IMC that gives much lower memory latency compared to external MC. It was hurting Core2 and all earlier CPUs quite a bit and thus the big caches and advanced prefetchers.


"On the other hand it should also consume more power"

That is questionable as you cannot compare two different CPU designs using different manufacturing principles by using only one parameter. My personal opinion is that Intel will keep to its current TDP numbers and won't increase clock speeds by much. I'd say 3.4GHz quad at 130W on launch should be easily doable. I have no ideas how high could Shanghai clock but I doubt it is much higher than Barcelona, if any.


"Actually, Nehalem's core ratio at 40% is almost the same as Penryn's at 41% which is only slightly less than Merom's at 44%. In contrast, Shanghai's core ratio has dropped to a tiny 25% much smaller than Barcelona's 36%."

Doesn't the NB and other non core/cache parts of the CPU also take quite a lot of power? AMD definitely has much more of the die dedicated for those things.


"However, I would also expect Nehalem's L3 at 5.7mm/MB to be slightly slower than Penryn's L2."

If L3 is truely only slightly slower than L2 in Penryn then that would be extremely fast for L3, especially compared to the awful speed of L3 in Barcelona. I don't think increasing cache size would matter that much, especially considering how the speed changed with Conroe->Penryn. I'm much more worried with what happens to Shanghai L3 performance than with Nehalems.


"Just bumping the clock could be worth another 1%"

I don't think bumping cache speed would give it much benefit as K10 needs buffers between cores and L3 and those will be the main bottleneck when cache isn't running in sync with core.


"Well, EVT or Engineering Validation Testing comes right after first silicon. Then you do Design Validation Testing and then Product Validation Testing."

How many respins do you think there could be during that time?


"Very little. Penryn and Dunnington will be much larger factors along with Centrino 2."

I'd add that Shangai will also matter very little this year.


"You didn't mention the other item in the Digitimes article about AMD's discrete graphics growing from 35% to 50% share this year."

In order to grow from 35% to 50% they would first make up their lost marketshare (<30% atm). Seems like HD3xxx series weren't that big competitors to NV afterall as some certain individuals predicted.


"The problem for nVidia is that Intel makes a lot of motherboards but none of them can do SLI so this leaves a fair segment of the market open for AMD."

I highly doubt that SLI is much of importance in terms of GPUs sold as only a tiny minority is using more than one GPU and you can use one with pretty much any motherboard.


"1.) nVidia releases a much better GPU preventing AMD from closing the gap."

First AMD has to release something that is really good. RV770 should arrive late Q2, I'd say general availiability some time in July. It should get them the ultimate performance crown back for a while, no ideas what happens to the big sellers in low/midrange. By the same time NV should have shrinked G92 to 55nm and given it slight speedbump and probably a bit faster RAM - higher speed in high end and lower prices in mid/low end. Later this year there will be a whole new generation of GPUs coming from NV and I'm quite certain they can deliver at least twice the performance of their then two-years old architecture.

Scientia from AMDZone said...

AR

"Why do you assume the master clock on Shanghai cannot be further clock gated or multiplied independently at each core ?"

The independent clocking on K10 works fine now; I can't see any reason to change it. My point was that if Intel uses a different clocking scheme they might get the ability to do finer clocking control.

"Also asynch clocks create a problem at shared structures such as the L3, meaning a little more latency.. no ?"

From the look of the Nehalem die I would say that each core has its own 2MB's of L3. I'm guessing that accessing the rest of the L3 is slower anyway. I'm thinking that Nehalem's L3 cache scheme is actually 2MB's local + 6MB's extended rather than 8MB's shared equally to all. In other words, I would say that each core has faster access to its own 2MB's than to the rest. This may not be the case of course; the L3 could indeed be contiguous 8MB's.

"You also seem to ignore that Nehalem is SMT while Shanghai is not. That might explain the die area differences."

That is irrelevant. Since the L2 density is similar we can roughly assume that logic density is about the same. Therefore transistors are about equal to area. More transistors devoted to core logic should make it faster regardless of the specific method.

"I wonder if Shanghai is still 3 wide or its been bumped up to 4 way superscalar."

It is still 3. It still uses 3 decoders.

Anonymous said...

I think the SMT is an important comment. A large portion of the SMT dedicated transistors will are only going to be used in situations of 5 to 8 threads. If there are 4 or fewer demanding threads, a sane OS will do the equivalent of giving each thread it's own core.

However, all of each core's transistors in shanghai are usable at a lower number of threads.

While I'm sure there are going to be a number of loads where Intel's 8 threads will easily outperform AMD's 4, I would not be too terribly surprised to see AMD equal or perhaps even surpass Intel in the 1 to 4 thread range.

enumae said...
This comment has been removed by a blog administrator.
AR said...

Hmm.. 3 wide vs 4 wide. There's another reason for area difference. There is no offhand way I can think of to estimate the difference (replicated pipe stages and slightly larger static structures like the reorder buffer) but I'm sure it contributes something.

SMT comes at an area cost, especially for a full fledged one where both threads are simultaneously active and being serviced (I dont recall reading anywhere that Nehalem is switch on event SMT, probably its prescott style).

About the L3, a variable latency cache would be very contra to the Core 2 design (shared L2 on the merom). Also its unwise to deduce cache association from layouts. Those could easily be cache banks. For example the Niagara II has a very similar banked cache that kind of looks split on the layout. The banks could sit behind a common entry point and unless cores are carefully optimizing accesses to banks (would require a clever compiler and cpuid exporting info) on the average there is no difference in latency across cores.

Pop Catalin Sever said...

"Both are quad core, both use L3 cache, and both use a point to point interface (QuickPath for Intel and HyperTransport for AMD)."

Except that Nehalem will be faster, once again.

muziqaz said...

128kb L2 cache per core? :) looks like Intel is shooting for extremely fast cache subsystem, or maybe Intel ran out of real estate on die :D
look at the size of those cores :)

It would be interesting to see die shot of pentium with HT and without :)
And the layout of the die for nehalem looks interesting ;)

Mark S. said...

pop cat - please give us a break. Everyone here knows that the "Nehalxeoperton" copycat chip's QP serial links are supposedly faster than the industry standard HT3. However, for the past five years Intel has been denying that any advances in basic architecture are needed. 64-bit extensions for x86? Not needed (until convenient for Intel). Native dual-core? Not needed (until convenient for Intel). Native quad-core? Not needed (until convenient for Intel). On-board memory controller? Not needed (until convenient for Intel). Direct chip-to-chip interconnects? Not needed (until convenient for Intel).

So don't get all cocky that Intel will have (Real Soon Now) the same features that AMD has been championing since 2003. You should be glad that AMD has been leading the way and keeping Intel competitive, otherwise you would still be running 32-bit x86 code on a s-l-o-w FSB architecture. Or Itanium if you really really had to have 64-bit arch.

Cheers!
Mark S.

Aguia said...

I think you are all missing a bigger point than the way the L3 cache is arranged, the IMC performance boost that Intel could have thanks to a better implementation or better use of prefetcher.
Intel Core 2 Duo CPU prefetcher is already very efficient without integrated IMC that I wonder what will happen with the Nehalem.
Besides if it is a triple channel system instead of dual, the performance might scale very well.

I just think its strange small 256KB L2 cache, but then a bigger L2 like 1MB for example would shrink the L3 to such a point than an L3 cache system would be useless.

I already said here that AMD would have done better if it used a smaller L2 since its exclusive and give the L3 some 3MB size or more. Core 2 Duo for example with 3MB L2, 4MB or 6MB doesn’t make a big difference, but the 2MB version however shows a “large” performance hit in some applications.
We already have seen with the K8 that the CPU only looses efficiency with less than 256kb l2.
AMD L2 size performance

amdman said...

Most code on average run 1.5 instructions a cycle, and are stalled by loads.
Core2 has 4 instruction units, but can only execute one load a cycle, and those loads are 64 bit.
K10 has 3 instruction units and support 2 loads a cycle, and those loads are 128 bits.
On integer code its a wash, with AMD behind in clock speed.
On SSE codes K10 at 2.2GHz are dominating against 3GHz Core2, because the Core2 is so data starved.
WIth higher clocks all the gamers will abandon Core2 for K10.

Scientia from AMDZone said...

ho ho

The prefetchers and large L2 on Penryn already hide a lot of that latency so I wouldn't expect Nehalem's IMC to give much benefit. The only real benefit I can see is with total memory bandwidth. It is pointless to speculate about Shanghai's speeds. This would be anything from slower than 65nm to faster. The amount AMD has mentioned for NB is 15 watts.

spam

It depends on the SMT implementation. A P4 style SMT requires very little extra die area. On the other hand, a true Alpha style SMT requires a complete overhaul of the pipeline including additional decoders (4 would not be enough). But Alpha was capable of 4 fine grained threads per core whereas Nehalem can only do 2.

ar

My assumption wasn't based on layout. My reasoning was based on the fact that Intel greatly reduced the L2 on Nehalem. I figured that they might then have wanted a faster L3 to supplement the smaller L2. But, I figured that they couldn't do that with a full 8MB's of L3. Again, I just mentioned it as a possibility.

mark

The base speed for Intel's QP links is higher. However ... you have to include the encoded checksum information that ensures data integrity. This reduces the base speed of QP.

aguia

As I've already mentioned to ho ho, the good prefetching in Penryn means that we really won't see much imrovement with latency on Nehalem.

amdman

You need to recall that decoding time is a big factor on decoder bandwidth. AMD could increase decoder bandwidth by increasing the number of decoders as Intel did but they also increase decoder bandwidth when they reduce the decoding time on instructions. The decoding time for most SSE instructions was cut in half with K10 versus K8. However, since there was no change on any Integer instructions that would be my guess for Shanghai. I expect we will see a reduction in decoding time in some Integer instructions.

Aguia said...

Scientia,

I can’t recall very well the past Intel CPUs die sizes, specially the volume ones.
With Intel going after such a big die compared to its current ones (including the double ones) do you think Intel might have an issue manufacturing it since it haven’t handle big dies in the past?

Aguia said...

Intel CPU Die Size and Manufacturing Cost

Well I found this but unfortunately I can’t understand why they say manufacturing cost if there is no values on it.

Unknown said...

Scientia:

Sorry for my ignorance, but why does AMD still remains on a 3-issue uArch?

Is it easy to add a fourth decoder or does it requires a complete overhaul of the entire core?

Scientia from AMDZone said...

aguia

You can look at some die sizes at chip architect.

Also, Intel had:

Williamette - 217mm2
Smithfield - 206mm2
Tulsa (16MB L3) - 435mm2

erlindo

To add a 4th decoder you have to widen the pipeline from the Fetch stage down to the Schedulers. Since Stage 1 (Fetch 1) and Stage 2 (Fetch 2) were widened with K10 we could skip those. However, Stage 3 (Pick) only selects 3 instructions and the pipeline is 3 wide down from there. After the final decode stage, however, you only have 3 Integer Schedulers, each occupying one lane.

Intel gets around this by using a global Reservation Station. But, AMD would only need to tweak the Instruction Control Unit a little to allow it to feed the 4 lanes to 3 schedulers. This would be similar in operation to Penryn since it too only has 3 Integer Execution Units to handle 4 lanes.

Mark S. said...

Hmm. I just noticed that there are only two QP links on a quad-core Nehalxeopteron. What kind of multiprocessor geometry is that supposed to enable? For a four-processor configuration, I can think of two options.

The first is a straight line, with processor 0 having to traverse processors 1 and 2 to get to processor 3. Like this:

0-1-2-3

The second would be a circle, where processor 0 would wrap back to processor 3, so no processor would be more than 1 hop away in a four-processor configuration. Like so:

0-1
| |
2-3


Or is Intel going to have variable number of QP links, based on single, double, or multi CPU ability?

And, since I see that Shanghai has four HT links instead of the current three, will AMD be creating a new socket to take advantage of all four? Or will they use one as a production 'spare', hoping to get at least three usable links out of four laid out on the chip?

And will HT3 be enabled on Shanghai, or will it be degraded to HT1 like Barcelona?

Cheers!
Mark S.

Aguia said...

Mark S.,

I think each AMD HT3 link can be split in two so there are actually eight HT links not four but at 8 bit instead of 16 bit.
Maybe Intel can be split in two too so they are actually four.

enumae said...

Mark S,
What kind of multiprocessor geometry is that supposed to enable?


Gainestown

For a four-processor configuration, I can think of two options.

Beckton

Mark S. said...

aguia and enumae - thanks for the feedback! It looks like Intel may be planning for some interesting confirguration possibilities with QP.

I gather that the referenced photo is a shot of a Gainestown version of Nehalem with two QP links. Any timeframe for the Beckton Nehalem-EX with four QP links?

Any scoop on a new AMD socket for Shanghai? I think the fourth HT link will come in pretty handy, especially if it is HT3 enabled.

Mark S.

Scientia from AMDZone said...

Beckton, the 4-way or MP version of Nehalem is scheduled for Q3 2009 at the earliest.

As far as I know, Intel still hasn't released all the specs on QP. Most of it was extrapolated from PCIe 3.0.

Randy Allen said...

Scientia I was curious as to your opinion on this:

http://amdzone.com/phpbb3/viewtopic.php?f=52&t=135278

I simply fail to see how having less L1 and L2 cache in Nehalem is going to be a problem.

Scientia from AMDZone said...

randy

Actually I already mentioned this. I said that I thought that the L3 on Nehalem was actually 2 + 6 rather than 8. I think that each core has 2MB's of local L3 and another 6MB's of accessible but slower L3. Whether this is true or not it is difficult to imagine that Intel would suddenly make a design with flawed cache structure after all of its years of experience with the P4 cache design. My assumption is that the Nehalem cache is workable.

amdman said...

Nehalem is an updated P4.
Intel has said that the hyperthreading is not improved from the P4, but it does have 4 integer units now so Intel is going to call it a Core chip. :)
Pretty funny seeing as Core1 was just a P3 with a huge cache to make up for its short comings.

muziqaz said...

but what a chip it was and still is :)
shit in server loads, but for single threaded and simplethreaded code it is brilliant :)
I am waiting to see how they will manage to take use of that 4-way thingy. If it is done right, I think it will be huge step up. But you never know :) perhaps all that talk from Intel about big jump in perf is because the HThreading?

Pop Catalin Sever said...

http://www.fabtech.org/content/view/6316/
http://money.cnn.com/2008/04/07/technology/amd.fortune/?postversion=2008040807

It's time for H.R. to step down or he must be asked to do so. He brought nothing good to AMD so far, only debt and troubles. Even if something good eventually comes out of his reign, it's too late, the damage has been done.

Anonymous said...

People ^^ don't really understand what Ruiz has done for AMD. All they see are AMD's latest struggles, and they then stop looking any further.

Ruiz is the man that gave AMD it's OEM deals. Before Ruiz, there were no AMD boxes from IBM, Sun, HP, or Dell. There was, effectively, no such thing.

Now, you have people spewing shit about how Hector ruined AMD. Ridiculous. He turned AMD from a tiny little low-end competitor to a full-fledged competitor with Intel across all the major OEMs in all the major market segments.

One ^^ should get a clue before one starts trying to demand that someone be fired and lose their job.

Unknown said...

As I read this "comparison" I started thinking of all sorts of jokes like "you know, the Mitsubishi Lancer sedan is really similar to the Lancer Evolution. They both have 4 doors, both have 4 tires, both have similar curb weight, length, width... They both come with the Mitsubishi logo..."

You know Scientia, it's utterly ridiculous to compare things at a macro-level like this. I get the feeling that there's not much to write about lately (and I know I've been just itching to write about Nehalem too) and you've given in to the itch to write about what's going to be interesting instead of what is.

Scientia from AMDZone said...

If nothing else the macro similarities should make the benchmarks a bit closer. Less difference in cache size for example. Obviously there would still be differences with SSE4 code and some difference in Shuffle and division.

Unknown said...

it is indeed terrible to see amd will might screw up again. the problem isnt that the phenom isn't good enough it just has a few flaws as i call them. the major one is clock speed. the other points which amd should improve on are increasing there memory channels. may need to be 2x 128bit channel or 4 times 64-bit to get competative with nehalem. also a boost of l2 cache speed or increase too 1mb could help.
i say boosting its l1 and l2 cache and l3 cache speeds could help alot in little time and would be most cost efficient. but keep in mind amd can still boost its hypertransport to utilize the full 2600mhz or 5200mts. yet they should spend time on making a counter for the nehalem in time. the shanghai is a great product but it just comes a few months to late. it would provide great competition against yorkfield and penryn but not against nehalem.

Scientia from AMDZone said...

kribblin

"the other points which amd should improve on are increasing there memory channels. may need to be 2x 128bit channel or 4 times 64-bit to get competative with nehalem."

No. With the faster DDR3 you can get plenty of bandwidth over just two channels up to 6 cores. You don't need more than this unless you go to 8 cores or you go over 3.2Ghz with 6 cores. We'll start needing 3 memory channesl in late 2009.

" also a boost of l2 cache speed or increase too 1mb could help."

You are off a bit. K10's L2 speed is fine; it is the L3 that is slow. There have been plans to increase L2 to 1MB in 2009. And, the L3 clock does need to be faster.

"i say boosting its l1 and l2 cache and l3 cache speeds could help alot in little time and would be most cost efficient."

Boosting L1 or L2 speed is not possible. L3 will probably get a little faster with Shanghai.

"but keep in mind amd can still boost its hypertransport to utilize the full 2600mhz or 5200mts."

They will but this will be most important for 4 sockets and above.

"yet they should spend time on making a counter for the nehalem in time. the shanghai is a great product but it just comes a few months to late. it would provide great competition against yorkfield and penryn but not against nehalem."

I assume you are talking about desktops. In terms of 4 sockets the Nehalem version won't be out until late 2009.

Unknown said...

i agree with you, i did some research since the time i posted my last response and actualy the design isnt bad at all compared to the nehalem design. it might even end up faster in pricing range. AMD might even get a higher clock potential due to its small cores design. not to forget the phenom 9950 isnt bad at all. its a better multi tasker then intels q6600 and also good overclocker with an sb750 board. amd shows its streangh here in multi tasking. i also read nehalem wont be much faster in games either, fudzilla who is usualy quite accurate claims it might only be a small 5-15% margin. means the 45nm shanghai might even end up faster if you manage to clock its cores higher because most games dont take advantage of 4 cores. and that might even be the main reason why people see core2quads with x86 windows end up faster in benchmarks usualy. but internal connections of the amd design are faster then those of the core 2 quad.