Monday, September 10, 2007

AMD's K10 – A Good Start

K10 can be summed up pretty simply based on the benchmarks in reviews that popped up like mushrooms just after midnight Monday, September 10th as AMD's NDA expired. A 2.0Ghz Barcelona seems to be equal to a 2.33Ghz Clovertown. And, since Penryn is only showing a 5% increase in speed, this should make a 2.0Ghz Barcelona equal to a 2.2Ghz Penryn. Since AMD has promised 2.5Ghz in Q4 this should allow AMD to match the 2.83Ghz Penryn. This means that in Q4, Intel will still have two clock speeds, 3.0 and 3.16Ghz which are faster. In Q1 08, with 3.0Ghz, AMD should be able to match up to 3.33Ghz.
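The clock equivalences above are simple ratio arithmetic, and a rough sketch makes the assumptions explicit. The two constants are taken straight from the figures quoted in this post (2.0Ghz Barcelona matching 2.33Ghz Clovertown, and Penryn about 5% faster per clock); the function name is just for illustration.

```python
# Rough clock-equivalence arithmetic, using only the ratios quoted in the post.
CLOVERTOWN_PER_BARCELONA = 2.33 / 2.0  # 2.0 GHz Barcelona ~ 2.33 GHz Clovertown
PENRYN_GAIN = 1.05                     # Penryn assumed ~5% faster than Clovertown per clock


def penryn_equivalent(barcelona_ghz):
    """Return the Penryn clock (GHz) that a given Barcelona clock would roughly match."""
    clovertown_ghz = barcelona_ghz * CLOVERTOWN_PER_BARCELONA
    return clovertown_ghz / PENRYN_GAIN


for clock in (2.0, 2.5, 3.0):
    print(f"{clock:.1f} GHz Barcelona ~ {penryn_equivalent(clock):.2f} GHz Penryn")
```

This gives roughly 2.22, 2.77, and 3.33. The 2.77 figure falls between Intel's speed grades, which is why it is matched against the 2.83Ghz part.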

I've already seen analogies drawn with 2003 when AMD first launched K8. There are some similarities but some important differences as well. One difference is that K10 is slower at launch than Opteron was in 2003. Given the same speed ratios, Barcelona would need to be at 2.2Ghz. So, AMD's K10 launch is a bit worse than Opteron in 2003. However, nothing else is really the same. When K8 launched there was exactly one chipset (AMD's) that supported it and compatible motherboards only arrived very slowly. Today, there are several chipsets and dozens of boards that can handle K10 with nothing more than a BIOS update. Another difference though is process yield. K8 launched with a painfully low yielding 130nm SOI process (only about half the yield of AMD's K7 bulk 130nm process) and this new process took a year to reach yield maturity. Today, in spite of the low clocks, AMD's 65nm process for K10 has excellent yields. Intel's yields, in contrast, on its brand new 45nm process will take a couple of quarters to reach maturity. This plus Intel's much slower ramp gives AMD some breathing room until Intel catches up in Q2 08. Some people have overestimated Intel's volume and clock ramp because they were spoiled by the launch of C2D back in 2006. These two are not comparable however as Intel's 65nm process then already had six months of maturity from production of Presler in Q4 05. On the other hand, there is no reason to be pessimistic about Intel since Penryn is doing quite well and is a long way from the painful start of Prescott (Pentium 4) in 2004. AMD should be able to reach good volume and a 3.0Ghz clock speed by Q1 08 and if we use the same six month lead then Intel should be looking good with 45nm in Q2 08.

My view of processors was pretty good from 2003 up to the release of Yonah in early 2006. However, I can readily admit that I didn't see C2D coming. It was quite a surprise when C2D was considerably faster than I had expected. The normal design cycle for AMD has been pretty clear since K5. Specifically, AMD had a small team doing upgrades on 486 and a major team that had designed the 29000 RISC processor. AMD dismantled the 29000 team and formed a new design team which created K5. As K5 was finished and the team free again a new major team started work on K7. K6 never had a major design team at AMD since it was designed by NexGen; however, it is clear that another small upgrade team was formed to see K6 through introduction and the upgraded versions K6-2 and K6-III. Having finished K7 the major team began K8 while the upgrade team created K7 upgrades like Athlon MP and Barton. Once K8 was finished, a major team began work on K10 while an upgrade team saw K8 through X2 and Revision F with its new memory controller, virtualization, and RAS features. K10 has been essentially finished since mid 2006 so the major team would be working on Bulldozer while an upgrade team handles both the launch and the later Shanghai version. Obviously, these teams aren't fixed but can be stripped or dismantled and then added to or nearly newly created from members of previous teams. The teams do change but the size and amount of work they do stays about the same. The only real change has been fairly recent. AMD has had a secondary team working on Turion. AMD also acquired some new design personnel with the Geode architecture. It looks like staff from these two groups have formed a major design team to work on Bobcat while a second upgrade team worked on the Griffin upgrade of Turion. I would have to assume that these teams are smaller than the K10/Bulldozer teams since Griffin has only a few changes from K8 and the Bobcat architecture is simpler. Nevertheless this does give a stronger combined focus to the formerly separate Geode and Turion lines. Apparently, Intel has a similar secondary group working on Silverthorne.

Intel's design teams used to be similar. Intel's major team created Pentium Pro, Willamette, and then Prescott. The upgrade team had been working on Pentium and it moved on to the PII and PIII upgrades of Pentium Pro and then the Northwood upgrade of Willamette. Intel had another team working on Itanium and a small team in Israel working on Banias. The major team was supposed to be working on Nehalem after Prescott while the upgrade team would move to Tejas as the upgrade of Prescott. Then we started hearing about a new Indian team working on Whitefield. As far as I know the Whitefield team was disbanded and very little if any of their work was used in other designs. From what I can gather, Whitefield would have been a quad core version of Sossaman with an Integrated Memory Controller and Intel's CSI point to point interface. The talk had been that Conroe was an upgrade of Yonah so the speed was quite a surprise. Once it was released though it was clear that Conroe was not an upgrade at all but a completely new design. Yet, many months later I still saw web authors calling Conroe a derivative of Yonah. Others too were still under the mistaken notion that Intel only started working on Conroe when Whitefield was canceled. However, neither of these ideas is correct.

There is no way to know exactly what went on at Intel but given the fact that we know roughly what Intel knew we can make some reasonable guesses. It must have been clear to Intel in late 2002 that Prescott had serious problems. It is also clear that ES Banias samples were available at this same time. It therefore seems reasonable that Intel decided in 2002 to work on a Prescott successor based on Banias. The Whitefield team was apparently created as both a hedge and an attempt to get CSI out the door more quickly. So, it appears to be the case that the Banias team became a small upgrade team which then worked on Dothan and Yonah. The original Tejas team seems to have produced Smithfield and the 65nm shrink, Presler, plus the Tulsa upgrade. Taken in this perspective we can see that neither of these were major teams. This means that Intel had to have shifted away from the original P4 Nehalem design and on to Core 2 Duo back in late 2002. This would make three years to mid 2005 and just enough time to finish a new core for the 2006 release. I realize now that information about Core 2 Duo was both stifled and confused. It was stifled because Intel chose not to patent the architectural elements of Banias that went into C2D. And, it was confused because work on C2D got constantly mixed up with the work on Tejas and the original P4 based Nehalem and the work on Whitefield. However, there was also the problem of the FSB which was expected to limit performance. This notion was strongly bolstered by the Whitefield design which was to have an IMC. However, in a brilliant but unexpected move, Intel managed to overcome the crippling latency of the FSB in a completely different fashion by increasing the L2 cache size and substantially beefing up the prefetch hardware. This along with the widened cache buses and SSE units is what made Conroe so fast.

Hopefully, I've now got enough understanding of how we got here to see where we are going. However, I have to mention that when others try to analyze the current situation I've seen a strange element coming into the evaluation. This is the idea of fairness. Many people suggest that it was fair that Intel got back into the lead because supposedly AMD was complacent. This notion however is incorrect. Intel got into the lead because they made the decision to change course way back in 2002 when it became apparent that Prescott wasn't going to work. Also, AMD was not complacent. As I've described their design teams I can't see where they could have gotten more staff to do more interim work. I'm sure that AMD felt that their dual core design was enough to tide them over until K10 was released and I'm certain they didn't foresee Intel's use of MCM to create a quad core. By the time AMD knew about Kentsfield it would have been pointless to work on an MCM design of their own. Also, I can't really fault their 65nm ramp even though it was a year behind Intel's. AMD ran 90nm tests at FAB 36 when it became operational in Q1 06. AMD ran 65nm tests in Q2 and then began actual 65nm production in Q3 which arrived for launch in Q4. I can't see how this would have happened sooner unless AMD had gotten FAB 36 operational sooner. Fairness has nothing to do with what happened. AMD did the best they could and Intel got the benefit of its hard work from the previous three years. It's really as simple as that.

But, these arguments persist. I find people insisting that K10 can't be a good design because it is late or that AMD can't possibly ramp clocks, volume, or yield faster than Intel because Intel has been in the lead. At their heart, these are all fairness arguments with no basis in reasoning. The quality of a design is not related to how late it is nor is the speed of a ramp related to how fast the previous generation of processor was. It has been argued that AMD would go bankrupt after Penryn's launch because AMD would not be able to handle Intel's big cost savings at 45nm. The problem is that it is precisely because AMD is using the older, more mature 65nm process that it will indeed be able to ramp yield, volume, and clocks for K10 faster than Intel can for Penryn on its new 45nm process. Intel will improve its 45nm process and this should pay off by Q2. The process will be mature when Nehalem launches in Q4 08. However, Nehalem loses the die size advantage from MCM and also has to compete with Shanghai as 45nm versus 45nm. It is also clear that AMD will be able to convert to 45nm in about half the time that it takes Intel. Intel may then enjoy a lead in Q1 09 due to its faster memory bandwidth and speed from SSE4 instructions.

However, the situation changes drastically in Q2 09 with Bulldozer. Bulldozer will not only have more memory bandwidth than Nehalem but it will be able to use ordinary ECC DIMMs in the same way as registered memory. This should give AMD a tremendous boost since it would allow Bulldozer to use the faster desktop chips in servers instead of waiting for the lagging registered memory. These desktop chips will be faster and cheaper. It is also clear that AMD intends to move aggressively and bypass JEDEC which has up to now been heavily influenced by Intel. Bypassing JEDEC means that instead of waiting for memory to become an official standard, AMD will certify the memory when it becomes available from a manufacturer such as Micron directly whether it becomes an official standard or not. Had AMD been able to do this earlier there would have been less reason to work on the memory controller change since DDR was available in unofficial speeds up to 600 Mhz which would easily have matched the current DDR2-667 speeds. DDR2 would only have been needed with quad core.

However, the biggest change with Bulldozer is, without doubt, SSE5. I've seen some people trying to compare SSE5 with SSE4 but these two are not even remotely the same. SSE4 is another SSE update similar to SSE2 or SSE3. In contrast, SSE5 is an actual extension to the x86-64 ISA. And, either by design or by accident AMD has introduced SSE5 at the worst possible time for Intel. The SSE5 announcement is too close to the release of Nehalem for Intel to extend the architecture to take advantage of it. This puts Intel in a very difficult position. Adding SSE5 is such a big improvement that it greatly weakens Itanium. However, not adding SSE5 means giving up substantial ground to AMD. Since Intel has given every indication that it will beef up its x86 line when forced to by AMD we'll have to assume that it will add SSE5. Intel however faces the second problem that by the time Bulldozer is released SSE5 will already be supported. Intel could theoretically announce its own standard in the next six months; however, pressure similar to what occurred with AMD64 is likely to prevent Intel from going its own way. It is a clear sign that AMD is serious about SSE5 since it has dropped its own 3DNow! instructions with Bulldozer.

On the other hand, support for SSE5 will be difficult for Intel since it would have to be added to the 32nm shrink of Nehalem without any major changes in architecture. This basically means that Intel will have to substantially modify its predecoders and decoders to support the extended DREX byte decoding. This is an easier task for AMD since all of its decoders are complex whereas Intel uses only one complex decoder. This could mean a major overhaul of Nehalem's decoders. It also means that Intel gets to watch the value of its macro-ops fusion get substantially weakened by the new SSE5 vector conditional move. This also tends to reduce the value of some of C2D's features since 3-way instructions reduce loads. Finally, Intel has to figure out how to do 3-way (two source + one destination) operations without adding extra register move micro-ops. It can be done but it means tweaking the control logic way down in the core. I think Intel can do it but it wouldn't be easy. This could also mean modifying the current micro-ops structure to have room for the extra destination bits but a change like this most likely could not be done with just the 32nm die shrink of Nehalem. The bottom line is that Intel will not have a clear lead over AMD on into 2009 as some have suggested. Intel has some bright spots such as Q2 08 when it gets 45nm on track and later in Q1 09 when Nehalem will have memory bandwidth advantages. However, AMD has its own bright spots such as Q1 08 when K10 gets good volume and speed on the desktop, Q3 08 with the Shanghai release, and Q2 09 with Bulldozer and SSE5.
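The point about 3-way operations and extra register moves can be illustrated with a toy lowering pass. This is only a sketch under simple assumptions: operations are abstract (dest, src1, src2) triples, the two-operand form overwrites its first source as classic SSE instructions do, and the function names are made up for illustration. It says nothing about how Intel's or AMD's actual micro-ops are encoded.

```python
# Toy model: count instructions needed for (dest, src1, src2) operations
# when the hardware form is destructive two-operand versus three-operand.

def lower_two_operand(ops):
    """Lower triples to the destructive form 'dest = dest op src'."""
    out = []
    for dest, src1, src2 in ops:
        if dest != src1:
            if dest == src2:
                # Copying src1 into dest would clobber src2; not handled here.
                raise ValueError("sketch assumes dest does not alias src2")
            out.append(("mov", dest, src1))  # extra copy to preserve src1
        out.append(("op", dest, src2))
    return out


def lower_three_operand(ops):
    """Three-operand form needs exactly one instruction per operation."""
    return [("op", dest, src1, src2) for dest, src1, src2 in ops]


program = [("t0", "a", "b"), ("t1", "a", "c"), ("a", "a", "t1")]
print(len(lower_two_operand(program)), "instructions in two-operand form")
print(len(lower_three_operand(program)), "instructions in three-operand form")
```

In this toy program the two-operand form needs two extra moves because `a` must survive the first two operations. A core that accepts three-operand instructions but internally still uses the destructive form would have to generate those moves itself, which is the cost described above.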

125 comments:

Scientia from AMDZone said...

I saw that there were a lot of incorrect comments in the last thread but I made correcting comments before posting here. So, this discussion should be able to start with the right information.

It appears from the review at Anandtech that K10 is 17% faster than Clovertown at the same clock and 11% faster than Penryn.

Anandtech also has 2.5Ghz chips in hand now and confirms that they will be available in Q4.

Axel said...

Scientia

A 2.0Ghz Barcelona seems to be equal to a 2.33Ghz Clovertown.

Only in the server space, and only in certain memory-intensive benchmarks.

Arstechnica's conclusions are dead on.

Giant said...

The TechReport review is more thorough. It shows that Barcelona is about on par with Clovertown at the same frequency.

All in all, this is most unimpressive. AMD's new architecture is not as fast as Intel's existing architecture. Even with a MCM vs. monolithic and FSB vs. HyperTransport.

Giant said...

Scientia failed to mention it, but Harpertown should scale better than Clovertown thanks to the 1600mhz FSB on the high end versions, the faster 800mhz FB-DIMM memory, the snoop filter on the northbridge and the faster 24-way associative L2 that is now 6MB per dual core die (50% larger ).

Both the HKEPC and Anandtech results were on a single dual core CPU for desktop where the FSB was never much of an issue. Lots of changes in Penryn and the accompanying Stoakley chipset for DP servers seem to be focused on improving the scaling across the eight processing cores. This will also certainly help Intel get the most out of the future 45nm clockspeed increases.

It's not CSI, but it will do until H2'08 when CSI and Nehalem arrive.

Giant said...

BARCELONA OVERVIEW BASED ON CURRENT INFORMATION AND REVIEWS:

http://www.tgdaily.com/content/view/33770/118/

Greg said...

Actually, Giant, the techreport's review shows that, depending on the application, barcelona can be assumed to average out just barely above par with clovertown at the same frequency, while being far far more efficient in power usage, which is a problem for clovertown considering the close pricing of these two chips.

While the power savings over time may not be enough to warrant worry, the density of the heat-generating surfaces in servers is a huge issue, because removal of heat becomes exponentially more complicated when you try to make a high density server (what a large portion of 2x and 4x servers are used for). Thus, costs rise considerably due to rack design and cooling design costs, as well as maintenance costs and the likelihood that down-time will be extensive if it's hardware related.

Companies like Dell and HP will love this proc, because they can stick it in whatever the heck they want and barely have to spend any time engineering airflow designs because these chips will hardly need any, and this will greatly increase these companies' margins on servers and decrease the pricing pressure they place on AMD.

Techreport also noted that AMD's lead will likely increase with increased clock speeds, being that AMD also increases their interconnect speeds and that their cache bandwidth and memory controller speeds are highly dependent on the core speed.

Try reading the article next time (along with multiple reviews, regardless of their bias) to get a full analysis.

Also, that tgdaily article is horribly written, and doesn't seem to provide very well thought out analysis (probably rushed, like many articles).

Christian M. Howell said...

Well, from what I've seen of the benchmarks, two things are abundantly clear, 1) K10 scales like crazy with clockspeed, number of sockets and added threads and 2) K10 utterly blows away the current Opterons at the same power.

I don't put as much faith in exact numbers because if you look at the same CPUs from different vendors or different models, the numbers vary wildly.

In order to really test Phenom, you either need to put a 23xx into a QFX board or with an actual final board.

The 2GHz vs 2.5GHz comparisons showed almost perfect scaling with clock.

Also, most people note that K10 will really open up above 2.5GHz as the North Bridge runs faster and can clock higher with HT3.

I would say the desktop will be interesting when the RD790 boards are released at the end of the month.

Scientia from AMDZone said...

Axel

"Only in the server space, and only in certain memory-intensive benchmarks."

No, in general.

"Arstechnica's conclusions are dead on."

Not really. I looked at the Tech Reports review. Maybe we'll have better reviews in Q4 and Q1.

InTheKnow said...

Scientia said...
Intel's yields, in contrast, on its brand new 45nm process will take a couple of quarters to reach maturity.

That is pure speculation on your part. Unless you can back this up with data, it leaves a big hole in your analysis. Generally, you seem to try and base your conclusions on some sort of data, so I'm surprised to see you pull this one out of thin air.

Some people have overestimated Intel's volume and clock ramp because they were spoiled by the launch of C2D back in 2006. These two are not comparable however as Intel's 65nm process then already had six months of maturity from production of Presler in Q4 05.

On the other hand Penryn has very few architectural changes allowing Intel to concentrate on the process change. that is the whole point of the "tick - tock" strategy is to only change one thing at a time, process or architecture.

The Penryn to Nehalem ramp next year is what should be compared to the Presler to C2D ramp. This ramp should be comparable to the ramp from 90nm to 65nm. I don't recall that being too rough.

InTheKnow said...

greg said...
Actually, Giant, the techreport's review shows that, depending on the application, barcelona can be assumed to average out just barely above par with clovertown at the same frequency, while being far far more efficient in power usage, which is a problem for clovertown considering the close pricing of these two chips.

Now drop the power usage by 10% and bump performance by 5% for Penryn due out in about 8 weeks and where does that put us?

Scientia from AMDZone said...

Giant

"Scientia failed to mention it, but Harpertown should scale better than Clovertown thanks to the 1600mhz FSB on the high end versions, the faster 800mhz FB-DIMM memory, the snoop filter on the northbridge and the faster 24-way associative L2 that is now 6MB per dual core die (50% larger )."

As far as I know none of the Q4 chips support 1600Mhz but presumably these will show up later. And, you are correct, the extra FSB speed does no good without faster FBDIMMs. The snoop filter though doesn't give any additional benefit since Intel already uses one with the current dual FSB chipset.

Now, do you also want to mention that Tigerton is nothing but a rebadged Clovertown? Has Intel made any statement that this would be changed to the Penryn core?

"This will also certainly help Intel get the most out of the future 45nm clockspeed increases."

Are we talking desktop or servers now? On the desktop, Penryn seems to be only 5% faster. But, I suppose the faster FSB speed might help for clocks of 3.33Ghz and higher.

"It's not CSI, but it will do until H2'08 when CSI and Nehalem arrive. "

Well, Shanghai is likely to arrive in Q3 but Nehalem not until Q4. Clearly though Nehalem will have the advantage with its greater memory bandwidth. I'm still thinking this will be based on FBDIMM. Has anyone heard anything to suggest something else?

"http://www.tgdaily.com/content/view/33770/118"

Indeed, the sour grapes at Anandtech are pretty thick. I noticed that the whiner never bothered to note that AMD included 2.5Ghz chips that won't be officially released until Q4. That seems like plenty of lead time to me. Also, the shrill noise about not having time to test is silly since Anandtech can take just as long as they like and publish a real review.

Scientia from AMDZone said...

InTheKnow

"That is pure speculation on your part... Generally, you seem to try and base your conclusions on some sort of data"

Okay, AMD has provided relative charts for its processes since 130nm so I'm confident about how they'll do based on what they've done. Intel on the other hand hasn't provided this same information but did admit to low yields when they started 65nm. I'm assuming that if 45nm were better then Intel would have mentioned this. So, my assumption is that it is about the same as 65nm. Again, they'll have it running well by Q2; it shouldn't take any longer than that.

"On the other hand Penryn has very few architectural changes allowing Intel to concentrate on the process change. that is the whole point of the "tick - tock" strategy is to only change one thing at a time, process or architecture."

Well, this isn't quite true. Here's the deal. First of all, the process change is independent of the architecture; either one can be good or bad. Secondly, the changes to division and shuffle were substantial but there is no indication of a problem.

"The Penryn to Nehalem ramp next year is what should be compared to the Presler to C2D ramp."

No. These aren't the same. The 45nm process will be fully mature before Nehalem is launched. In contrast, the process was good but not fully mature at the time of Woodcrest's launch.

" This ramp should be comparable to the ramp from 90nm to 65nm. I don't recall that being too rough."

As I mentioned above, Intel admitted low initial yields at 65nm.

Scientia from AMDZone said...

InTheKnow

"Now drop the power usage by 10% and bump performance by 5% for Penryn due out in about 8 weeks and where does that put us?"

Well, you need to remember two things. First, AMD gets a much bigger power drop than 10% as it moves to the next steppings. Secondly, K10 gets a much bigger boost than 5% as it moves to regular memory.

Mo said...

Sci, how do you know what the next stepping on Barcelona is going to be like? Have you seen it?

Barcelona fell short of what AMD has been saying all along.
What are your feelings on that?

Word is that it's only good for server market, what do you think it will do on desktop.


and lastly, for Christ's sake, stop deleting posts. Someone posted earlier asking you what i'm asking you now.... You felt the need to delete their post.

xwraith said...

I've been reading through forums & reviews all evening. Two points:

1) Johan @ Anandtech chose to use an Intel compiled version of Linpack in his benchmarking. I wish somebody out there had done it with GCC, VS, and the Portland compiler. Now I've read all the complaints about lack of time to do benchmarks, and I know this stuff takes time. Johan says that AMD has been able to run Intel optimized code well before, but to me it's too... how should I say it... it's not a neutral enough platform. From what I've seen of Intel's compiler over the years it's there to get the best performance out of their processor, not AMD's. I think the chip designs have diverged enough since the K7/P3/P4 "MHz rules" days that running one chip's optimized code is more likely going to give you a worst-case scenario, rather than a balanced one.

2) I hate picking on Anandtech again (and I realize this may just be due to having to generate an article so quickly) but Johan writes:
"The people of zVisuel told us that - in reality - the current Core architecture can sustain six FP operations in well optimized loops." This is an interesting piece of information about Core, but there is no corresponding information about Barcelona. In fact with a large image breaking up the flow of the text one is left with the distinct impression that the Core architecture is superior, because Johan is silent about AMD's capabilities. Can you run similar loops with Barcelona? Is Barcelona better or worse, or even just the same?

Anyway just some thoughts about all the information buzzing around today.

Scientia from AMDZone said...

sprender56

"Sci, you take too much at face value from AMD."

I don't know how to answer this. Perhaps if you give a list telling where information from AMD has been consistently inaccurate and some other source has been more accurate I can change sources.

"Now you are claiming 3.0Ghz in Q1 just because AMD showed a sample running that?"

Maybe you haven't understood what I said before. Basically, if AMD is showing a cherry picked chip now it should be available in six months which would be Q1 08. We saw the same pattern with Intel in 2006. I've analyzed ramping at both Intel and AMD and six months seems pretty consistent; this value comes up over and over again.

"You are claiming 45nm to be mid-year because AMD said so?"

No. Because mid year seems consistent with both AMD's announcements and what AMD is doing. The current milestone however is tapeout on Shanghai. If AMD doesn't announce this soon I'm going to have doubts about Q3.

"Didn't AMD also say Barcelona was on track but later changed that to 6 month delay"

No. Barcelona is 3 months late, not six. The 6 months refers to clock speed which AMD had planned to be 2.3Ghz at release. Now, since AMD is going to be at 2.5Ghz in Q4 which matches the original schedule I suppose you could claim that they will have caught up in Q4. However, there are a couple of caveats. For example, AMD originally planned to be at 2.9Ghz on dual core in Q4 and there is no indication yet what will actually be released. Secondly, Intel bumped its clock to 3.16Ghz so I guess you could reasonably say that AMD should be at 2.66Ghz in Q4. So, all in all, still slightly behind in Q4. On the other hand, AMD should be squarely ahead of schedule in Q1 08.

"barely stays up with 2.33Ghz and on only on memory intensive heave fp stuff. It's not even a across the board overall victory."

We'll see. It will take more tests to actually say this for certain.

"this sort of power is only good in HPCs"

I'm not sure what power you are referring to. HPC tends to be both connection intensive and FP intensive.

"What about desktop performance for users like you and I? "

Unfortunately, we'll need a system with non-registered memory to tell. Maybe in Q4.

"trying to hide your disappointment."

Disappointment occurred when AMD scaled the launch clocks back from 2.3 to 2.0Ghz. At the moment, though, things are looking better because it looks like AMD will still exceed their Q1 projections.

"everything you have said is crumbling like a dried piece of bread."

Maybe you could be more specific? All I've got from your comments so far is that you are sceptical of AMD.

Scientia from AMDZone said...

Mo

"Sci, how do you know what the next stepping on Barcelona is going to be like?"

There are reasonable projections. For example, I wouldn't expect Intel's 3.16Ghz Penryn to draw 150 watts even though I haven't seen it.

"Barcelona fell short of what AMD has been saying all along.
What are your feelings on that?"


I don't see that yet. I suppose it could be proven eventually if the tests all tend to show that. But, not at this point.

"Word is that it's only good for server market, what do you think it will do on desktop."

I think it will run faster with non-registered memory. Remember, we had this same problem with K8 in 2003.

"Someone posted earlier asking you what i'm asking you now.... "

sprender56's ?

Scientia from AMDZone said...

13ringinheat

"Wow do you even do research anymore"

You can't correct my statements about one article and one author by quoting from a completely different article written by a different author. The article I was referring to was "AMD's website paints Barcelona performance pictures not seen in the real world" by Rick C. Hodgin.

The real review that you quoted from does not contain Hodgin's whining. His statements are also misleading:

And the highest-end 2350 part was not even handed out to all of them. Some were forced to use 2347 parts at 1.9 GHz.

Nowhere in this statement would you find the truth which is that AMD sent Anandtech samples of the 2360, 2.5Ghz parts. In fact, AMD sent Anandtech two matched (one Intel and one AMD) computer systems. Essentially, Hodgin is claiming that AMD is misleading people while misleading people himself.

InTheKnow said...

Scientia said...
Intel on the other hand hasn't provided this same information but did admit to low yields when they started 65nm.

That is simply incorrect. Intel has given similar data as you can see below. So unless you can provide a link that shows otherwise, I'm going to have to assume that you are wrong about Intel's alleged yield problems at 65nm as well.

Looking at the plot in Figure 1 would seem to indicate that 65nm was better than 90nm and that 45nm matches 65nm.

Also of note is that the article was released around the time the 45nm process was announced. That would indicate that yields should now be much better than where the plot ends in the link. 45nm was announced in January so they are 8-9 months further down the yield curve than what the plot shows. If the ramp is on schedule they should be in the flat part of the curve, which are mature yields the way I read the graph.

Scientia from AMDZone said...

Let's look at the truth of what AMD did rather than Hodgin's tantrum. The problem is that you have to read two Anandtech articles to get the whole truth. The first is "AMD's Quad-Core Barcelona: Defending New Territory".

Johan De Gelas said:

A lot of people gave us assistance with this project, and we would of course like to thank them.

Damon Muzny, AMD US
Brett Jacobs, AMD US

Now, let's look at what Anand Lal Shimpi said in his AMD Phenom Preview: Barcelona Desktop Benchmarks.

"We got a call earlier in the week asking if we'd be able to turn around a review of AMD's Barcelona processor for Monday if we received hardware on Saturday. Naturally we didn't decline, and as we were secretly working on a Barcelona preview already, AMD's timing was impeccable."

And:

"AMD shipped us a pair of 2U servers a day early, we actually got them on Friday"

"At the last minute, AMD informed us that we'd be receiving three sets of Barcelona processors: a pair of 1.9GHz chips, 2.0GHz chips and 2.5GHz chips. "

Note that the 2.5Ghz chip is not production and has 2.5 written on with felt tip marker.

Note too that Anand is more optimistic than I am:

"With 2.5GHz in hand today, we'd expect Phenom to be at or below 2.6GHz by the end of the year"

"AMD was kind enough to send us two servers, identically configured, from Colfax. "

"Both systems worked just fine"

However, instead of using the system as AMD had intended, Anand proceeded to hack the AMD computer:

"but I had other plans in mind for the Barcelona system that AMD was sending me.

We cracked open the Barcelona server and made some modifications

The end result was, as Johan put it, us using 'such a beautiful, noble machine for such plebian activities'."

And, naturally, the hacking began to affect the system:

"During our testing we'd occasionally get a Hyper Transport error upon reboot

Then, at the very end (literally two hours before publication) of our benchmarking, the AMD server stopped POSTing. As of now the system will simply sit there and spin its fans without actually putting anything on the screen. A number of things could have happened, but thankfully the Barcelona system decided to die after we ran all of our tests."

Scientia from AMDZone said...

InTheKnow

"Intel has given similar data as you can see below."

Thank you for the graph. I hadn't seen it but it supports what I had heard.

"Looking at the plot in Figure 1 would seem to indicate that 65nm was better than 90nm and that 45nm matches 65nm. "

No. You are reading the graph wrong. What it actually shows is that 90nm had worse initial defect density than 130nm but about the same improvement rate. The chart further shows that there was no improvement in initial defect density with 65nm but the rate of improvement got worse. It further shows that 45nm is close to 65nm and worse than 90nm. Again, this matches with what I said.

"Also of note is that the article was released around the time the 45nm process was announced. That would indicate that yields should now be much better than where the plot ends in the link."

This is a log scale. Most of the range in these graphs represents unacceptable yield levels. The minimum production yield level is way down, nearly at the bottom. As far as I can tell the 45nm plot shows a similar defect density improvement to 65nm. And, again, this would be Q2 08.

Scientia from AMDZone said...

13ringinheat

"In your eyes scientia has AMD ever dropped the ball????"

Definitely. Although K5 did use a RISC core it wasn't very fast. One of the K7 steppings had trouble with heat. AMD was unprepared with a process beyond 130nm. They left Motorola then tried UMC and finally ended up with IBM. K8's launch date was pushed back ... twice. The schedule slipped 12 months altogether. K8's initial yields were pretty bad, about half of what AMD got with K7 on 130nm. It also took AMD a while to get this yield up. AMD didn't anticipate Intel's increase of the FSB to 800Mhz which meant delays before socket 939 was ready. And, I think AMD's involvement with Flash memory was a mistake. Those are mistakes just off the top of my head.

Giant said...


Now, do you also want to mention that Tigerton is nothing but a rebadged Clovertown? Has Intel made any statement that this would be changed to the Penryn core?


That's exactly what Tigerton is. It even uses the same socket as Clovertown. The 45nm shrink of Tigerton is meant to be Dunnington. I guess they'll talk more about this in IDF. Intel's DP servers seem to get all the new technology first. It wouldn't surprise me if both Penryn and then Nehalem are for DP servers first.






Are we talking desktop or servers now? On the desktop, Penryn seems to be only 5% faster. But, I suppose the faster FSB speed might help for clocks of 3.33Ghz and higher.

In servers. I haven't heard of any desktop based Penryn CPUs using a 1.6Ghz FSB. Penryn is about 5% faster on desktops without any SSE4 optimizations. With SSE4 the numbers really vary. We've seen video encoding twice as fast when SSE4 is used.

Two more facts from the Randy Allen quad core introduction video: They are planning on having a 2.5Ghz version out in December. He also says that it will be about 15% faster than the 2Ghz model.

Axel said...

Scientia

Not really. I looked at the Tech Reports review.

I'm pretty sure you didn't look at the same review that I did, since you still stubbornly hold to the mistaken opinion that K10 has a general 17% IPC lead over Clovertown. Let's look at those review results a bit closer:

The numbers below represent how much higher/lower K10's IPC is compared with Clovertown's, clock for clock, on the 2.0 GHz test systems.

SPECjbb: 2.1% higher
Valve VRAD: 13.1% lower
Cinebench sngl: 16.3% lower
Cinebench mult: 10.7% lower
POVRay chess: 3.8% higher
POVRay bench: equal
MyriMatch 1 th: 9.1% lower
MyriMatch 8 th: 1.8% higher
STARS 1 th: 29.2% lower
STARS 8 th: 13.3% lower
Folding avg: 3.0% higher
Panorama Fact: 12.9% lower
picCOLOR: 20.0% lower
WME encoding: 6.3% lower
Sandra Mult Int: 47.8% lower
Sandra Mult FP: 11.6% lower

Now how does this data even remotely support your position that K10 generally has 17% more IPC than Kentsfield/Clovertown? If I take the crude liberty of averaging those numbers above, I find that K10 is in fact generally 14.4% slower than Clovertown per clock.
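
For readers who want to check the averaging step themselves, here is a minimal Python sketch that takes the plain unweighted arithmetic mean of the sixteen figures listed above ("equal" is treated as 0.0). The dictionary name and helper function are illustrative only, and the result depends on the averaging method chosen, so it will not necessarily reproduce the figure quoted in the comment.

```python
# Per-clock difference of K10 vs. Clovertown in percent, as listed above.
# Positive = K10 higher, negative = K10 lower, 0.0 = "equal".
ipc_delta = {
    "SPECjbb": 2.1,
    "Valve VRAD": -13.1,
    "Cinebench sngl": -16.3,
    "Cinebench mult": -10.7,
    "POVRay chess": 3.8,
    "POVRay bench": 0.0,
    "MyriMatch 1 th": -9.1,
    "MyriMatch 8 th": 1.8,
    "STARS 1 th": -29.2,
    "STARS 8 th": -13.3,
    "Folding avg": 3.0,
    "Panorama Fact": -12.9,
    "picCOLOR": -20.0,
    "WME encoding": -6.3,
    "Sandra Mult Int": -47.8,
    "Sandra Mult FP": -11.6,
}


def arithmetic_mean(values):
    """Unweighted mean; every benchmark counts equally."""
    values = list(values)
    return sum(values) / len(values)


mean_delta = arithmetic_mean(ipc_delta.values())
print(f"{len(ipc_delta)} benchmarks, unweighted mean: {mean_delta:+.1f}%")
```

An unweighted mean lets a single outlier such as the Sandra integer result pull the average a long way, which is one reason different averaging choices give different answers.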

So it looks like you've made the mistake of taking Anandtech's server oriented benchmarks and extending that IPC relationship to the enterprise & desktop spaces. I'm stunned at your naivete. The enterprise trend is covered by Tech Report's benches above which clearly show that K10 has a significant per clock deficit against Clovertown in that space.

For the desktop we have Anandtech's preview using registered DDR2-667 memory, showing a 10%-15% IPC gain over K8, which is some 10-15% short of Kentsfield. Think what you want, but faster unbuffered memory is unlikely to net more than a 5-10% gain in IPC. The conclusion is clear: K10 is unlikely to even catch up with Kentsfield on the desktop, certainly not in games where Kentsfield's lead over K8 is huge. Penryn will extend this desktop lead further and add SSE4 to the equation, completely destroying the larger die K10 in the desktop space.

uf said...

Scientia

A 2.0Ghz Barcelona seems to be equal to a 2.33Ghz Clovertown.


Not at all. For int, a 2.0Ghz Barcelona is better than a 2.33Ghz Clovertown (SpecInt2006_rate).
For float, a 2.0Ghz Barcelona is much better than a 3.0Ghz Clovertown (SpecFp2006_rate) for 2P and 4P servers.

The only question left - which benchmarks are more professional ones: Spec.org or Anand and others.

Giant said...

South Korea finds that Intel has done nothing serious enough to warrant sanctions:

All they did was send Intel a "statement of objection".

http://www.smh.com.au/news/Technology/South-Korean-antitrust-regulator-completes-Intel-probe/2007/09/11/1189276694226.html

AMD's lawsuit is without any merit whatsoever and will be thrown out.

Aguia said...

The only question left - which benchmarks are more professional ones: Spec.org or Anand and others.

I was thinking the same, uf.

This list published by Axel for example:

SPECjbb: 2.1% higher
Valve VRAD: 13.1% lower
Cinebench sngl: 16.3% lower
Cinebench mult: 10.7% lower
POVRay chess: 3.8% higher
POVRay bench: equal
MyriMatch 1 th: 9.1% lower
MyriMatch 8 th: 1.8% higher
STARS 1 th: 29.2% lower
STARS 8 th: 13.3% lower
Folding avg: 3.0% higher
Panorama Fact: 12.9% lower
picCOLOR: 20.0% lower
WME encoding: 6.3% lower
Sandra Mult Int: 47.8% lower
Sandra Mult FP: 11.6% lower


Which server applications are on that list?
Sandra, Valve VRAD, picCOLOR: nope
WME encoding: nope
SPECjbb: Yes
...
Most of the others I don’t know what they do or are; I use a lot of server applications and I have never heard of some of those.

It would be nicer if they would benchmark:
IIS
Apache
Isa
SQL
Windows Media Server
Asp/PHP/Java served applications
Linux
Running virtual servers running the above applications, maybe even the applications Axel posted, if it’s that hard to do!
...
In other words Real server tests!
Applications that don’t do anything, like Sandra, SuperPi, Valve, ..., are stupid tests!!!

As Scientia already said here once, those types of applications can only be used to compare performance differences within the same processor architecture (especially clock speed scaling or cache size effects), but even there I have lots of doubts...

Ho Ho said...

From the few benchmarks I've read so far the biggest surprises are the relatively low cache bandwidth, high memory latency and not that good SIMD throughput.

cache
L1/2 seem to be a lot slower than on Core2. With all the talk about doubling bandwidth and all I expected it to at least match Core2 but it still lags behind.


memory latency
Memory latency has increased by a lot "thanks" to the added L3. I did expect it to lose some because of L3 but I also thought that the superior memory controller would at least make it even with the older K8. Unfortunately it didn't happen. All we can hope for is that L3 latency drops with higher clock speeds. Unfortunately I'm afraid it can't drop by much.

Also synchronizing data inside the socket doesn't seem to be too fast, at least not under light load. With MCM Intel can synchronize data in one die nearly 3x faster than Barcelona and from die to die about as fast as any core to any core in Barcelona. From socket to socket the time is around 12% slower on Intel.

It seems that being native will not be much of a benefit until the system comes under such a load that the FSB becomes a bottleneck and latencies start to increase. Before that Intel's MCM seems to be either on par or vastly superior, depending on which cores exchange data.


SIMD
SIMD benchmarks didn't seem to be particularly good either, as in most places Xeons were still in the lead. SiSoft Sandra should show numbers close to peak throughput in an ideal situation and even there Barcelona is losing.



For my needs Barcelona doesn't seem to be the best choice. I don't care about power usage, all I need is massive SIMD throughput, good cache bandwidth and not too low memory latency. So far Core2 with its superior SIMD throughput balanced out memory latency. I had hoped that Barcelona could catch up in SIMD throughput and maintain its lower latency, but it seems the differences aren't good enough.

Of course things will be better for Barcelona with regular DDR2 dimms but they will be much better for Xeon with DDR2 instead of FBDIMM too.

In short I expected a bit better but hoped for more. We'll see what happens with faster clock speeds and different RAM. I personally don't expect things to be too much different.

Aguia said...

I don't care about power usage,
OK

all I need is massive SIMD throughput
OK

good cache bandwidth
For what ho ho, run cache bandwidth benchmarks?

and not too low memory latency.
Too low or too high?

So far Core2 with its superior SIMD throughput balanced out memory latency.

Comparing desktop chip with server that’s really nice... did you already bother comparing the quad Xeon memory latency to Quad Opteron latency?

You also seem to forget that with AM2+ the northbridge will be clocked 2X higher so L3 cache will have lower latency and increased bandwidth. Do you think that will do nothing?

Ho Ho said...

aguia
"For what ho ho, run cache bandwidth benchmarks?"

No, to run programs that benefit a lot from cache, e.g. ray tracing. It has well over a 90% L1/2 cache hit ratio and it really does help to have higher cache bandwidth, especially when the CPU has a lot of SSE throughput.


"Too low or too high?"

Sorry, I meant too high latency.


"Comparing desktop chip with server that’s really nice"

I'm quite sure that Core2 will see bigger benefits going from FBDIMM to regular DDR2 than K10 from registered DDR2 to regular DDR2. If on server CPUs core2 is already quite competitive with FBDIMMs then with regular RAM things would get even better for it.



"did you already bother comparing the quad Xeon memory latency to Quad Opteron latency?"

I didn't but anandtech did. Remember, it is registered DDR2 vs FBDIMM. You can guess how huge a hit Xeon takes from using FBDIMM. Wasn't it somewhere around 30-50% higher latency? Especially with as many dimms as anandtech uses.


"You also seem to forget that with AM2+ the northbridge will be clocked 2X higher so L3 cache will have lower latency and increased bandwidth. Do you think that will do nothing?"

No I didn't forget it, I even said a few words about it.

Yes, it surely will help. Assuming a perfect world where latency would drop 2x, overall latency would drop by around 10ns. Down from 80-90ns that is around a 10-12% decrease. Not all that much I'd say but of course better than nothing. Current K8 with registered DDR2 has latencies of around 60ns and with regular RAM around 50ns or lower.
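
The percentage in that estimate is simple arithmetic and can be sketched in a few lines of Python. The 10ns saving and the 80-90ns starting range come from the comment above; the function name is only illustrative.

```python
def percent_decrease(before_ns, saving_ns):
    """Relative latency reduction, in percent, for a given saving."""
    return 100.0 * saving_ns / before_ns


saving_ns = 10.0  # the ~10ns best-case saving assumed in the comment
for before_ns in (80.0, 90.0):
    pct = percent_decrease(before_ns, saving_ns)
    print(f"{before_ns:.0f}ns -> {before_ns - saving_ns:.0f}ns: {pct:.1f}% lower")
```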

Axel said...

This is by far the most bafflingly disappointing launch of a new generation of CPUs ever. A grand total of about four reviews? No buzz on web sites, fierce debates in forums, etc? I was looking forward to several days of post-launch excitement, reading tons of articles on dozens of sites, and just wished for K10 to put up a huge fight despite all indications this year to the contrary:
- The covertly obtained Cinebench & POVRay benches.
- AMD's refusal to publicly exhibit K10 performance
- Three high level executive resignations
- Hector's hints that K10 wouldn't make the splash that K8 did.

It seems as if both Intel & AMD advocates have been stunned into silence by the revelation that the emperor really had no clothes after all. It truly is too little too late, even if the clocks come up. Though the memory subsystem architecture is innovative and efficient, AMD simply didn't do enough to the actual core to greatly improve IPC over K8. K10 is essentially the K8 motor with new heads & cam, when what AMD really needed was to retire the old motor and replace it with a new big block or at least outfit it with a supercharger.

AMD's only hope now is that the B2 stepping not only brings the clocks up, but has significant IPC gains as well over B1. However, as we know, core steppings rarely increase IPC at all but usually only fix errata and optimize transistors & thermals.

Aguia said...

However, as we know, core steppings rarely increase IPC at all but usually only fix errata and optimize transistors & thermals

Well Axel you are quite right if we were talking of Intel. But with AMD I don’t think so.

What you said was exactly what the new 65nm Conroe revision did.

Ho Ho said...

As for the whole SSE5 thing, Scientia has somehow totally ignored Larrabee. It will be released some time late next year as a GPU, and CPU versions will be available in 09. It is certain it will have major updates to the x86 ISA and I'm sure that it will have MADD instructions among other stuff.

Also I wonder how big an edge the SSE4 dot product instruction will have over SSE5, which doesn't have it. After all, it is quite an important instruction for all 3D calculations.

Scientia from AMDZone said...

ho ho

I'm sorry if you don't understand the significance of SSE5. SSE5 in spite of its name is not just another SSE. A better (and more accurate) name for it would have been AMD64-2. Larrabee is fine but it has nothing to do with SSE5. What you are talking about is SSE.

Ho Ho said...

scientia
"Larrabee is fine but it has nothing to do with SSE5."

Yes, so it is. That will also be kind of a problem for AMD, as if Larrabee and its instruction set, whatever that will include, becomes more accepted it will be difficult to manage. Of course the opposite could also happen, though Larrabee will still work nicely as a GPU.

Ho Ho said...

scientia
"What you are talking about is SSE."

Actually I was talking about AMD's SSE seemingly missing a dot product instruction. I brought in Larrabee just as an example of what Intel will have to put against SSE5 from AMD.

Giant said...



Well Axel you are quite right if we were talking of Intel. But with AMD I don’t think so.


Please show us a prior example of a new CPU stepping from AMD providing more than a few fixed bugs and errata with a slightly lower power consumption.

Scientia from AMDZone said...

All

I understand that many of you have been looking at the charts on Tech Report and seeing disappointing results. However, I haven't seen anyone yet give a post who actually seems to understand the results.

Specifically, the Tech Report memory and cache bandwidth results look like they've been put through a meat grinder with some numbers being accurate and others being way off. However, the part that bothers me is that the author, Scott Wasson is only sceptical of the results when they are low for Xeon. When they are low for Barcelona, he accepts them at face value. I'll see if I can go over them and explain what is going on. Page 3

The first chart is Sandra cache and memory bandwidth.

The chart looks very pretty and appears to show that C2D has much higher bandwidth. However, the chart is wrong.

Let's start at the bottom with the Opteron dual core results. The cache maximum for a single core 2220 is 22.4GB. What we see is half which is too low. The maximum for 2350 is 32GB and we are close to that so that looks fine. However, the maximum for 5335 is also 32GB yet the chart shows it reaching about 39. 37GB is max for 5345 but it shows 45. The max for 5365 would be 48GB but the chart claims about 58.

Cache bandwidth is a direct function of clock so I would fully expect the 5365 to be the highest. Also, the K8s only have half the bandwidth so they will be lower in spite of the higher clocks. If Xeon were close to max and Barcelona were a bit lower the results might be believable. But, these results show Xeon exceeding the maximum theoretical bus bandwidth and this simply is not possible. The first chart is worthless but the author has no complaint.
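
The maxima quoted above follow from clock speed times bytes moved per clock. Here is a rough Python sketch of that calculation; the bytes-per-clock values are an assumption inferred from the quoted maxima (22.4 / 2.8 = 8 for K8, 32 / 2.0 = 16 for Barcelona and Xeon), and the chart readings are the approximate values mentioned in the comment.

```python
# Clock speeds in GHz for the parts named in the comment.
CLOCK_GHZ = {
    "Opteron 2220": 2.8,
    "Opteron 2350": 2.0,
    "Xeon 5335": 2.0,
    "Xeon 5345": 2.33,
    "Xeon 5365": 3.0,
}

# ASSUMPTION: bytes per core clock, inferred from the maxima quoted above.
BYTES_PER_CLOCK = {
    "Opteron 2220": 8,
    "Opteron 2350": 16,
    "Xeon 5335": 16,
    "Xeon 5345": 16,
    "Xeon 5365": 16,
}

# Approximate chart readings in GB/s as described in the comment.
CHART_READING = {"Xeon 5335": 39.0, "Xeon 5345": 45.0, "Xeon 5365": 58.0}


def peak_gb_per_s(part):
    """Theoretical per-core peak: clock (GHz) x bytes per clock."""
    return CLOCK_GHZ[part] * BYTES_PER_CLOCK[part]


over_limit = [p for p, seen in CHART_READING.items() if seen > peak_gb_per_s(p)]

for part in CLOCK_GHZ:
    print(f"{part}: peak {peak_gb_per_s(part):.1f} GB/s")
print("Readings above the computed peak:", over_limit)
```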

The second chart is Sandra cache and memory bandwidth, 1GB test.

The maximum possible memory throughput would be 10,666 MB/sec.

2360 - 11,534
2218 - 8,298
5365 - 5,179

Clock speed has no effect on this test. The author says, "I'm a little dubious about the relatively low results for the Xeons, though." Well, the numbers for Xeon look low to me too but how in the world did he overlook that Barcelona is exceeding the maximum that the memory can theoretically deliver? This chart is useless.
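
One way to arrive at the 10,666 ceiling is sketched below in Python, reading that figure and the three listed results as MB/s. It assumes dual-channel DDR2-667 with 64-bit (8-byte) channels, which is my guess at the configuration behind the number rather than something stated in the comment. The same ceiling is applied to all three parts only because the comment does so.

```python
def peak_mb_per_s(mega_transfers_per_s, bytes_per_transfer=8, channels=2):
    """Theoretical peak: transfers/s x bytes per transfer x channels."""
    return mega_transfers_per_s * bytes_per_transfer * channels


# ASSUMPTION: "DDR2-667" is 666.67 mega-transfers per second, rounded.
DDR2_667_MT_S = 2000 / 3

peak = peak_mb_per_s(DDR2_667_MT_S)

# Results listed in the comment, read as MB/s.
measured = {"2360": 11534, "2218": 8298, "5365": 5179}
over_peak = [name for name, mb in measured.items() if mb > peak]

print(f"Theoretical peak: {peak:,.0f} MB/s")
for name, mb in measured.items():
    print(f"{name}: {mb:,} MB/s = {100 * mb / peak:.0f}% of peak")
print("Above the theoretical peak:", over_peak)
```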

The next chart is CPU-Z memory access latency.

The Opteron and Xeon numbers don't look too bad but the Barcelona numbers are strange. The latency on Barcelona should be the same as dual core Opteron. The author's opinion, "Well, that's not so good."

However, he further says, "Let's look a little closer at the results with the aid of some fancy 3D graphs, and I think we can pinpoint a reason for the Opteron 2300s' higher memory access latencies." Sure, why not?

Looking at the pretty bar graphs shows the same thing. So, what new thing does he think he has discovered?

"The Opteron 2350's L3 cache has a latency of about 23ns, and the 2360 SE's L3 latency is about 19ns. Since latency in the memory hierarchy is a cumulative thing, that's very likely the cause of our higher memory access latencies.

Adding the L3 cache in this way was undoubtedly a tradeoff for AMD, but it certainly carries a hefty latency penalty."

Well, that certainly shattered any illusions I had about the knowledge of this author. Specifically, the latency is not cumulative. Why? Because K10 does not fetch to L3 as this author assumes. The memory fetch in fact doesn't go to either L3 or L2; it goes directly to L1. Since neither L2 nor L3 are part of the memory access chain they have no effect on latency.

What can we conclude from these charts? I don't know. There are obviously things in error. The numbers that are too high may be an error in the Sandra code itself. The high latency for Barcelona could be a misconfiguration or it could be errata with the B0 stepping. Unfortunately, the quality of this test was so low that I really can't tell.

Ho Ho said...

scientia
"Since neither L2 nor L3 are part of the memory access chain they have no effect on latency."

Probably I have missed something really obvious but if memory access chain has nothing to do with L2/3 then what use are they? CPU has no idea where the data is and it has to request it from caches before it goes to RAM. I know that if data is not in caches it gets prefetched to L1 first. If L1 is "full" it gets sent to L2 and/or back to RAM. If it is in L2 then how can CPU know that?

Scientia, do you have any comments on Barcelona cache synchronization latencies that Anandtech demonstrated?


Btw, if anyone sees any other cache benchmarks I'd be glad if they would be linked here.

Scientia from AMDZone said...

Okay, now I'm looking at the Anandtech memory test.

I'm seeing the same curious numbers. The Lavalys Everest L1 bandwidth is showing 32K Read for Barcelona and 51K for Opteron. The write and Copy numbers are similar. Significantly though, the L2 numbers look normal.

I'm wondering now if Barcelona has an L1 cache bug.

Scientia from AMDZone said...

ho ho

L2 and L3 with K10 are victim caches. This is different from K8 where L2 is also the prefetch cache.

In order for latencies to be cumulative a transfer from L3 would have to go through L2 but this does not happen. L3 transfers go directly to L1. Likewise, neither L2 nor L3 have any effect on memory loads since these loads do not go to L2 or L3.

With odd numbers from two different pieces of software and two different testers I think it is more likely the processor. Since the numbers don't match the design limits it seems to be a bug. The question then is how long it might take to fix it. From my limited knowledge of processor design I don't see this as a big problem; those take a year to fix. This seems more likely a 3-6 month fix.

Axel said...

Scientia

Specifically, the Tech Report memory and cache bandwidth results look like they've been put through a meat grinder with some numbers being accurate and others being way off.

Perhaps, but how about addressing the more meaningful application benchmarks and how those results conflict greatly with your position that K10 offers 17% higher IPC than Clovertown...

abinstein said...

"Since latency in the memory hierarchy is a cumulative thing, that's very likely the cause of our higher memory access latencies."

This is wrong. Only the tag check, a small part of total cache access latency, needs to be serial. In K8, checking the tags is 1 cycle for the L1 and 2 cycles for the L2 cache. In Barcelona, it's probably ~8 cycles for the L3.

"The memory fetch in fact doesn't go to either L3 or L2; it goes directly to L1. Since neither L2 nor L3 are part of the memory access chain they have no effect on latency."

Well they do have some effects on latency, but not relevant to the prefetch mechanism. To measure the cache access latency accurately the prefetch should not be at work at all. ;)

Aguia said...

CPU has no idea where the data is and it has to request it from caches before it goes to RAM. I know that if data is not in caches it gets prefetched to L1 first. If L1 is "full" it gets sent to L2 and/or back to RAM. If it is in L2 then how can CPU know that?

Well I was thinking the same, but supposing that the prefetcher was integrated in the IMC, would the IMC be the transfer link between the three cache levels?

IMC -> L1 C1/C2/C3/C4
IMC -> L2 C1/C2/C3/C4
IMC -> L3
C1/C2/C3/C4 L1 -> IMC
C1/C2/C3/C4 L2 -> IMC
L3 -> IMC

Ho Ho said...

I have no idea what the numbers you just posted are supposed to mean, but my point was that before anything gets fetched to cache the CPU must check if the data is in some level of the cache. First it checks L1, then L2 and L3. If none of those has the needed cacheline only then is the memory accessed. When the data is coming back from RAM it will be put directly to L1.

That initial check of caches is what will slow things down a bit more than it used to thanks to the added cache level.

Scientia from AMDZone said...

ho ho

Yes, exactly, but this wouldn't be a big drop; it would be a small drop. Besides, if you look at Anandtech data, they give a much more reasonable latency of 76 versus 59. This would be consistent with the extra delay from a 2.0 versus a 2.6 Ghz clock. They also show an increase in both Read and Copy bytes/clock. This again would be reasonable.
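
The clock-ratio argument can be checked with a quick Python sketch. It makes the simplifying assumption that the whole latency scales inversely with core clock, which overstates the effect because the DRAM portion does not speed up with the core. The function name is illustrative and the latency is in whatever unit Anandtech reported.

```python
def latency_at_clock(latency, from_ghz, to_ghz):
    """Scale a latency measured at one core clock to another clock.

    ASSUMPTION: the latency is a fixed number of core cycles, so it
    scales with from_ghz / to_ghz.
    """
    return latency * from_ghz / to_ghz


# 59 measured on the 2.6 Ghz part, rescaled to a 2.0 Ghz clock.
scaled = latency_at_clock(59.0, from_ghz=2.6, to_ghz=2.0)
print(f"59 at 2.6 Ghz scales to about {scaled:.1f} at 2.0 Ghz")
```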

Giant said...

I actually find it quite funny that Scientia says "K10 is off to a good start.".

Lets look at Intel's product launches since the start of 2006 based on the core micro-architecture.

First was Woodcrest. It was easily faster than dual core Opteron.

Next was Conroe, easily faster than Athlon 64 X2.

Next was Merom. Intel had the lead in mobile the whole time but Merom extended the lead Yonah had established.

This continued with Clovertown and Kentsfield in November 2006.

The most recent was Tigerton.

(Note that I didn't include Tulsa, as it was still based on Netburst)

In each of these cases we saw Intel clearly offering the new levels of performance in each of the categories.

But then AMD announces Barcelona for both DP and MP servers. It's slower than the fastest Clovertown and Tigerton CPUs.

Yet you still state that K10 is off to a good start? Can you explain that to us?

If Conroe had been slower than the existing Athlon 64 X2 products would you say that 'Intel is off to a good start with Core'? I think not.

The fact is here again. This first occurred in GPUs with the 2900 and it's happened here again with quad core CPUs.

AMD has not raised the performance barrier at all. Not one bit. That's very disappointing.

You want to know why Nvidia still gets over $500 for a Geforce 8800 GTX? Because of AMD. How can Intel still charge nearly $1200 for a 3Ghz Clovertown CPU? Because of AMD.

Ho Ho said...

scientia
"Yes, exactly, but this wouldn't be a big drop; it would be a small drop."

It will add as much latency as it takes to check if data exists in L3. If it takes 10ns it will add 10ns to total memory latency.


"Besides, if you look at Anandtech data, they give a much more reasonable latency of 76 versus 59."

Are you sure you are not messing up memory latency and transferring data from L1 to L1 in another core in same package?

Greg said...

Relatively speaking, for a company that has, at most, held 26% of an entire market's share, and whose competitor has held the other 74% (yes, there's via, but that's not enough for anyone to really care) a product that is fairly competitive on the most profitable portion of server chips is a good start. Assuming that AMD SHOULD pull off a launch as good as anything a company as dominant and large as Intel is for that launch to be considered good is just poor reasoning and a clear sign of bias.

Again, this is just like k8, in terms of processor performance, and not seeing that connection shows you either have little to no experience with this market, or are simply ignoring data to come to your own, pre-decided, conclusions.

abinstein said...

"First it checks L1, then L2 and L3. If neither of those has the needed cacheline only then is the memory accessed. When the data is coming back from RAM it will be put directly to L1."

There's good reason and probability that L3 tag check and main memory access can go in parallel. If it turns out L3 has valid cache line then the main memory access is simply cancelled.
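
The difference between the two views can be shown with a toy Python model of the miss path. This is only a sketch: the 10ns tag check reuses the example figure from the earlier comment, the 60ns DRAM access is a hypothetical placeholder, and nothing here is measured Barcelona behaviour.

```python
def miss_latency_serial(tag_check_ns, dram_ns):
    """DRAM access starts only after the L3 tag check reports a miss."""
    return tag_check_ns + dram_ns


def miss_latency_parallel(tag_check_ns, dram_ns):
    """DRAM access is issued alongside the tag check and cancelled on a hit."""
    return max(tag_check_ns, dram_ns)


TAG_CHECK_NS = 10.0  # example figure from the thread
DRAM_NS = 60.0       # hypothetical placeholder value

serial = miss_latency_serial(TAG_CHECK_NS, DRAM_NS)
parallel = miss_latency_parallel(TAG_CHECK_NS, DRAM_NS)
print(f"serial miss: {serial:.0f}ns, parallel miss: {parallel:.0f}ns")
```

In the parallel case the tag check is hidden behind the longer DRAM access, at the cost of memory requests that are started and then cancelled whenever L3 hits.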

InTheKnow said...

I said...
"Looking at the plot in Figure 1 would seem to indicate that 65nm was better than 90nm and that 45nm matches 65nm. "

And your reply was ...
No. You are reading the graph wrong. What it actually shows is that 90nm had worse initial defect density than 130nm but about the same improvement rate. The chart further shows that there was no improvement in initial defect density with 65nm but the rate of improvement got worse. It further shows that 45nm is close to 65nm and worse than 90nm. Again, this matches with what I said.

There is no need to be insulting. The university I graduated from would have done a poor job indeed if they hadn't taught me how to read a simple graph.

If you take the time to really measure things on the graph you will see the following things.

1) Intel required ~24 months to reach the same level on 130 nm that they eventually reached on 90 nm.

2) Intel reached the flat portion of the plot in ~22 months at 90 nm. This despite several flat spots on the graph that showed significant yield hurdles had been encountered. I would expect this since the 90nm transition also overlapped with the 12 inch transition somewhat.

3) On 65 nm Intel matched 90nm yields in ~19 months. Yields continued to improve from beyond the 90 nm levels.

4) 90nm launched around the end of December '04. Intel had reached the flat part of the graph ~2 months prior to this.

5) 65nm launched in Jan '06. Intel matched the 90nm yield levels ~3-4 months prior to this.

6) 45nm is now at about the 18 month point on the plot. If they are matching 65nm yields then they should be very close to the 90nm yield level now. The launch is believed to be 2 months away and they should be well into the mature portion of the yield graph by then.

Assuming of course that you accept the idea that Intel's 90nm ever reached "mature" yields.

So your theory that Intel is improving yields more slowly with each successive node and launching product with immature yields doesn't hold up well on close inspection. In fact, it appears that Intel is ramping each node to mature yields faster with each iteration. You would expect to see this as the 12-inch processes mature.

I would be happy to provide you a copy of the marked up graph if you want to post it.

GutterRat said...

Let's end the controversy right now about what's broken in "Barcelona"

Rev 10h Errata Document.

Buyer Beware.

Scientia from AMDZone said...

Giant

"First was Woodcrest. It was easily faster than dual core Opteron.

Next was Conroe, easily faster than Athlon 64 X2."


True

"Next was Merom. Intel had the lead in mobile the whole time but Merom extended the lead Yonah had established. "

Sorry, not true. AMD's mobile sales are doing fine. Merom was in fact a step backwards because it failed to hit Intel's target power consumption.

"This continued with Clovertown and Kentsfield in November 2006."

True.

"The most recent was Tigerton."

Not really. Tigerton is just Clovertown with firmware for 4-way. If this had been introduced in 2006 I'd agree with you but being introduced right before Penryn seems like a step backwards.

"(Note that I didn't include Tulsa, as it was still based on Netburst)"

No, Tulsa was a good upgrade to Presler.

"In each of these cases we saw Intel clearly offering the new levels of performance in each of the categories."

Well, except for Merom and Tigerton.

"But then AMD announces Barcelona for both DP and MP servers. It's slower than the fastest Clovertown and Tigerton CPUs."

True indeed.

"Yet you still state that K10 is off to a good start? Can you explain that to us?"

Sure. Opteron was slower than Athlon MP when it was introduced. The same was true of Athlon 64 and Athlon. And both were slower than Intel's offerings. However, by the end of 2003 AMD was ahead of its older K7 offerings. I expect we'll see the same thing with K10.

"If Conroe had been slower than the existing Athlon 64 X2 products would you say that 'Intel is off to a good start with Core'? I think not."

You need to distinguish between something that pulls ahead soon like K8 and something that never does like Willamette. Intel had to release Northwood before P4 was any good; K10 is not like that.

Scientia from AMDZone said...

intheknow

"So your theory that Intel is improving yields more slowly with each successive node"

No. I would say that 45nm seems about the same as 65nm.

It would be interesting to compare with AMD if we had the same graph. The only AMD graph I've seen shows improvement per wafer rather than time.

" and launching product with immature yields doesn't hold up"

You need to understand the difference. Mature enough for production is not mature. In other words Intel hits a minimum yield target and then begins production but the yield continues to improve.

What I'm saying is that AMD begins production at a higher yield. But this difference would shrink over the next couple of quarters. I would guess though that D1D probably shows a similar profile to AMD's FABs. It's not difficult to understand that Intel has a bigger job getting yields up on its separate FABs.

Scientia from AMDZone said...

GutterRat

"Let's end the controversy right now about what's broken in "Barcelona"

Rev 10h Errata Document."


Those are normal errata but if there is an L1 cache bug it wouldn't be in there.

Maybe I should move to an article about that.

Ho Ho said...

abinstein
"If it turns out L3 has valid cache line then the main memory access is simply cancelled."

So if the data is indeed in L3 then memory bandwidth will still be wasted on the read?


scientia
"Tigerton is just Clovertown with firmware for 4-way."

Yes, it might have been a seemingly minor upgrade but it did increase performance substantially compared to the other offerings nonetheless.

Scientia from AMDZone said...

ho ho

You have it wrong. Tigerton didn't increase performance at all; it's the same as Clovertown. What increased performance was the quad FSB chipset. If Intel didn't need the new quad FSB chipset they would have released Tigerton when they released Clovertown.

This is much the same as it was with AMD's Athlon MP which was basically just an Athlon but it ran on the 760 MP dual FSB chipset.

Scientia from AMDZone said...

All

This is a comment from dave_graham at AMDZone:

-----------------------------------

i've been asked to pass this to my FAE @ AMD.

what people have been benching is the B1 chip stepping with a BIOS patch applied to get around errata #281 (conspicuously absent on that errata worksheet). BA is the production stepping that fixes this issue on the NB itself and will handle some of the performance "issues" people have been bitching about. B2 steppings are the "SE" or higher rated parts.

cheers,

dave

----------------------------------

As I said to GutterRat, it isn't on the errata list. Fancy that.

So anyway, he is suggesting that there is indeed a bug in the NorthBridge that is killing performance. To be honest I'm not sure how this would only show up in the L1 bandwidth. And, apparently they do have it fixed, just not out yet. I'm guessing a lot of people have something similar to my reaction.

Great, more waiting to find out how things really stack up...

Of course, I can guess the reaction of all those review sites. Something like:

We did all that testing on a crippled chip??? WTF??

Ho Ho said...

scientia
"What increased performance was the quad FSB chipset."

My point still stands: platform performance increased.


Did AMD really not have enough working chips to give to reviewers?

Giant said...

Check it out:

http://youtube.com/watch?v=F7LNUkHa7U8

September 18th, San Francisco is the start of IDF. The skull is clearly a reference to Skulltrail.

Aguia said...

Well Scientia that seems possible.

I read on some site from one AMD employee/worker that we would see one more revision (B3) and maybe even one or two more until end 2007, beginning 2008.
So maybe there are many “performance bugs” to polish yet.

AndyW35 said...

Scientia from AMDZone said...

As I said to GutterRat, it isn't on the errata list. Fancy that.

So anyway, he is suggesting that there is indeed a bug in the NorthBridge that is killing performance. To be honest I'm not sure how this would only show up in the L1 bandwidth. And, apparently they do have it fixed, just not out yet. I'm guessing a lot of people have something similar to my reaction.

Great, more waiting to find out how things really stack up...

Of course, I can guess the reaction of all those review sites. Something like:

We did all that testing on a crippled chip??? WTF??


I think what he has been told is wrong, which is why you are puzzled that it only turns up in the L1 cache bench.

Anand got a B2 chip as well and it scaled up 25% on a 25% clock increase in the Cinebench test, something you would not see if there were a bug limiting performance in the early B1s. So what Anand is testing is not bugged.

Looking at it another way, if AMD has been shipping production chips for at least the last 2 weeks, would it make any sense whatsoever for AMD to send a bugged earlier chip to Anand in the last week, knowing that it would lower performance? Er, no. On another forum Dave_Graham replied to this logical conundrum that it was because Anand was not very high in AMD's esteem when it came to server testing, so they got what they were given. Unfortunately AMD also gave Anand one of the few B2s, so that theory is also shot down.

What it boils down to is that Dave_Graham has been told something by his contact, and it is obvious that the person he is talking to, a field engineer, is not fully knowledgeable about what another part of AMD is sending to review sites, but Dave believes him implicitly.

AMD is pretty secretive and this has led to this sort of conflicting information appearing many times over the years, with one part saying one thing and another part saying something else.

This bug probably did exist but there is no evidence at all that it affected review results. There is a strange cache value discrepancy, something that is not uncommon with AMD, just like 65nm K8 I seem to recall.
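
The scaling argument above (a 25% score gain on a 25% clock increase) can be written as a simple ratio. The following is only a rough sketch: the function name and the scores are hypothetical placeholders, not figures from Anand's review, and a ratio near 1.0 only suggests that nothing outside the core clock is holding the chip back.

```python
def scaling_efficiency(clock_lo_ghz, clock_hi_ghz, score_lo, score_hi):
    """Return the fraction of a clock increase that shows up as performance.

    1.0 means the score rose exactly as much as the clock did;
    values well below 1.0 point to a bottleneck that does not scale with clock.
    """
    clock_gain = clock_hi_ghz / clock_lo_ghz - 1.0
    perf_gain = score_hi / score_lo - 1.0
    return perf_gain / clock_gain


# Hypothetical normalized scores, chosen only to illustrate the two cases.
perfect = scaling_efficiency(2.0, 2.5, 100.0, 125.0)  # 25% more score on 25% more clock
limited = scaling_efficiency(2.0, 2.5, 100.0, 115.0)  # only 15% more score

print(f"perfect scaling: {perfect:.2f}")
print(f"limited scaling: {limited:.2f}")
```

A single benchmark that scales perfectly does not rule out a bug that only affects other workloads, which is the point Scientia raises in the reply below.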

Scientia from AMDZone said...

andyw35

For anyone wanting to read Dave's comment you can look at this forum thread.

Reading through this I realized that some of his comments were similar to mine so before anyone asks, no, I am not him. I hadn't actually read any of his comments before his post on AMDZone.

Anyway, his opinion seems only partly based on inside information. He actually has Barcelona chips running in a server right now and says he is going to be doing benchmarking in a couple of weeks. He claims to work at the company that owns VMWare. He claims that the 2.0 chip is equal to 2.33 Clovertown.

Yes, I've seen that Anand had stepping 2 with his 2.5Ghz chips. And, yes, Cinebench did scale 25% but only one other application, WME, had similar scaling. I don't know. I would find the results believable if we were seeing SSE scores close to twice as high as K8.

Greg said...

Giant said:

"South Korea finds that Intel has done nothing serious enough to warrant sanctions:

All they did was send Intel a "statement of objection""

Bother reading the article, and everything else the associated press is putting out.

No decision has yet been released. The statement of objection is basically a "this is what we're about to penalize you for" sort of thing.

Citing this is actually proving that Intel is apparently doing something wrong, and puts more pressure on the US to convict Intel of violating Antitrust laws.

Sharikou, Ph. D said...

"read on some site from one AMD employee/worker that we would see one more revision (B3) and maybe even one or two more until end 2007, beginning 2008.
So maybe there are many “performance bugs” to polish yet."


Then who in their right mind would even buy Barcelona right now, if that's even possible.

It is now clear that Barcelona is still not ready, but AMD had NO choice, a delay would not have gone well with the investors so they launched a broken product.
2-3 revisions in a mere few months?

abinstein said...

Ho Ho -
"So if the data is indeed in L3 then memory bandwidth will still be wasted on the read?"

Did you read the term "cancelled"? If not then do a search on this page.

I really wonder how much of the memory bandwidth you're talking about can be wasted in the <10 cycles of an L3 tag check? The memory fetch request probably hasn't gone out of the IMC yet.

abinstein said...

"Then who in their right mind would even buy Barcelona right now, if that's even possible."

Some minds are very crippled, like yours, to believe such "performance bug FUD" can actually work.

On the other hand, if a "performance crippled" Barcelona already outperforms Clovertown, then why not buy it? It's a better product, no matter how "crippled" anyone claims it to be.

abinstein said...

"More lies from a desperate AMD fanboy."

Nope, I'm not a fanboy, you are.

Barcelona at 2.0GHz/75W outperforms Clovertown at 2.3GHz/80W in a majority of server workloads (both are server processors) while taking less power at the system level. Barcelona at 1.9GHz/55W outperforms Clovertown at 2.0GHz/50W even more, and still takes less power.

When going to a 2-socket setup, Barcelona has an even greater advantage over Clovertown. At 4-socket or higher, the only way Clovertown can compete with Barcelona in terms of performance is to include expensive and power hungry Intel 7300 chipsets.

enumae said...

Abinstein

After seeing the initial benchmarks and proposed clock speeds for December, how do you think Barcelona will do on the desktop against Intel?

InTheKnow said...

No. I would say that 45nm seems about the same as 65nm.

Which according to the data I showed you was better than the yield at launch on 90nm. So 45nm should have better yields at launch than 90nm did.

It would be interesting to compare with AMD if we had the same graph.

Yes it would, but I suspect that neither AMD nor Intel will ever give us such plots and make a direct comparison possible.

You need to understand the difference. Mature enough for production is not mature. In other words Intel hits a minimum yield target and then begins production but the yield continues to improve.

Once again, I will refer you to the 90nm yield graph. Intel reached a flat level ~2 months prior to launch and did not show measurable improvement beyond that. So you would have me believe that Intel never reached mature yields at 90nm?

You seem to think a "mature" process stops improving, but there is a thing called continuous improvement that has nothing to do with the process being "mature". Taking your definition to its logical conclusion, no process will ever be "mature" until it stops improving.

That would mean that, since no-one has ever produced a perfect 12" wafer there has never been a "mature" 12" process. 8" has knocked on the door of "maturity" with a few perfect wafers, but as a process still hasn't reached it and probably won't since it is being slowly phased out.

What I'm saying is that AMD begins production at a higher yield.

Why are you making a statement that you can't possibly support? I'm very skeptical that you actually know both Intel's and AMD's yield numbers. Without knowing these numbers, this is strictly an opinion with nothing to back it up.

It's not difficult to understand that Intel has a bigger job getting yields up on its separate FABs.

That is what Copy Exactly is for. The data would seem to say it works pretty well. I refer you to Intel's Copy Exactly data here.

I'll be interested to see if AMD's methodology works as well. APM has yet to prove it can match fabs.

Giant said...

See this hector ruiz interview

http://www.eweek.com/article2/0,1895,2181149,00.asp

"This is the first time that any company in the world has put 600 million transistors on a chip—and you can round that off and say it's a billion."

How the hell does that work? First his ridiculous claim that you can just round up that number by 400 million transistors?

Next up his bogus claim that no one has put 600 million transistors on a chip before. That's BS. G80 is 681 million transistors. Tulsa was 1.3 billion transistors. The Itanium 2 Montecito is 1.7 billion transistors. R600 is 700 million transistors. The list goes on and on.

Scientia from AMDZone said...

intheknow

Your own link shows what I said, that Intel's yield still improves after production begins. Once again, the chart is a log scale; the part that you claim is flat isn't flat.

It should be common sense that with only one FAB AMD can tweak the process before production starts. With Copy Exact Intel has to tweak the process on each FAB after production starts.

If you really are trying to claim that Intel starts with higher yields than AMD you should ask yourself how many times an Intel FAB has been ranked number one. Intel never talks about the rankings because FAB 30 was consistently number one. I guess we'll have to see if this changes with FAB 36.

InTheKnow said...

Scientia, did you read anything I said about continuous improvement? Until perfect wafers come out of the factory there will always be room for improvement. Each small improvement beyond a certain point requires increasingly greater effort, because all the easy solutions have been found. Nobody is going to hold production until the process is perfect. It is too expensive to develop and you need to see an ROI before you move to the next node.

It should be common sense that with only one FAB AMD can tweak the process before production starts.

And what do you think D1D is doing during development? Just watching the Si move through the factory?

If you really are trying to claim that Intel starts with higher yields than AMD you should ask yourself how many times an Intel FAB has been ranked number one.

I ask myself a lot of things. Like, "what goes into calculating these rankings?" "Where do the numbers come from since they aren't in the public domain?"

And no, I'm not claiming Intel's yields are higher, because like everyone else out there, including you, I don't have access to the numbers.

muziqaz said...

Me, myself, I am a bit disappointed with the initial K10 results, but I think that's just because I was hoping for more, and because some AMD workers quoted huge numbers that haven't come true.
But anyway, in my opinion K10 is like a rough diamond that needs to be polished to get the most out of it. I think we haven't seen the real K10 yet. As we can see now, K10 scales very well with higher clocks. And as I understand it, desktop parts with a higher HT frequency will show the full benefits of the L3 cache. I am a bit mystified by the very high L1 latencies. But by the time Phenom hits the market we will know what is the matter with those latencies.
An interesting thing is that with the initial test results from those first Barcelonas we can't judge how Phenom will perform. Or can we? Phenom will have HT3 with a higher frequency, DDR2 with lower latency, a higher overall frequency, and AM2+ mobos with all these fantastic energy saving features. Though at least we can see how Phenom would perform in the worst case scenario. :)
I am also very mystified why AMD did not use HT3 in servers ;/
It leaves K10 somewhat crippled. Though even without HT3 power saving features K10 shows very impressive stuff in the performance/watt category. Now imagine how that category would look with a 1207+ mobo (HT3)? I think it would kill everything Intel has in the power saving/performance charts.
So yeah, Barcelona's launch leaves me wanting more. Phenom maybe... If not, then we always have Bulldozer in a couple of years :D So please AMD do not mess that beast up :))

And has anyone noticed anything weird in Tech Report's testing, when they test power savings while running SPECjbb2005?
Well, in overall performance (page 4) the 2.5GHz K10 beats the 3GHz Xeon, and the 2.0GHz K10 beats the 2.33GHz Xeon, and in power usage under SPECjbb2005 (page 11) the 2.5GHz K10 eats less energy than the 2.33GHz Xeon, and the 2.0GHz K10 is a much better saver than even the L5335 Xeon. And then check what they say about those results:
'Well, that's interesting to see. I'm not sure exactly what to make of it just yet. I'd like to correlate power and performance here, but as I've mentioned, our testing time has been limited. Perhaps next time.' He (they) is not sure what to make of it :DDDD Perhaps next time. The conclusion here is very simple: K10 eats the Xeon platform alive in the power saving/performance chart. Yeah, maybe next time :D
I don't know, if you write such rushed, incomplete articles, why bother at all? Maybe next time he will succeed in writing a better article.

P.S. sorry for my english. :)

Giant said...

THE FINANCIAL ANALYSTS over at UBS Investment Research think that AMD's Barcelona is not enough to tip the balance in its favour over Intel.

According to a note sent to its wealthy clients that I've seen even though I'm practically destitute, Barcelona doesn't have a big lead over Intel. The credit crunch won't help AMD because financing its "deep negative" cash flow is gonna be tough.

Also, it seems to UBS that next week's Intel Development Forum (IDF) will see Intel showing off fab new prods. AMD has to execute Mr Griffin and Mr Phenom perfectly for average selling prices to rise.

It also reckons a few of AMD's ideas such as "Bobcat" and "Torrenza" should be canned. It is valuing AMD at $15. AMD is trading at $12.69 right now, INTC at $25.33, and Rambus (RMBS) at $17.50. µ

Greg said...

The fact that UBS seems to think it knows what technology AMD should and shouldn't dump (especially eg. Torrenza) shows how superficial their analysis is.

This is one of those times when it becomes obvious how easily poor press and poor ethics within the press can affect investment in ways that are, at best, pathetic.

Greg said...

Also, Giant, cite your source, as it's obvious that's not your writing style and that's not your normal post layout (with the heading in caps, like an online article would have).

Scientia from AMDZone said...

intheknow

I would imagine that D1D's initial yields are similar to AMD's. However, the other Intel FAB's would be worse initially.

Again, we need to distinguish that this has only been since 130nm SOI since AMD's initial yields there were definitely worse than what Intel was getting with 90nm.

Scientia from AMDZone said...

Greg, Giant's source is The Inquirer.

These comments are easily dismissed just like the earlier speculation that AMD was going to completely outsource. I don't know if roborat has yet admitted he was in error to believe the outsourcing rumor. And this is in spite of the fact that AMD has already installed some 300mm tooling in FAB 30 as the preliminary upgrade to FAB 38.

The comments about Bobcat and Torrenza indicate that the person is not familiar with either processor technology or the market itself. The line about perfect execution has been repeated many times and still has no truth in it.

abinstein said...

"After seeing the initial benchmarks and proposed clock speeds for December, how do you think Barcelona will do on the desktop against Intel?"

At the same clock rate, Phenom X4 will be faster than quad-core Penryn, but Phenom X2 will probably be equal to or slightly slower than dual-core Penryn.

But if a program is SSE4 optimized then obviously Penryn will take some lead.

Giant said...

Why are you so shy and deleted your previous accusation?

I did not remove that post. Scientia deleted it.

Clovertown is faster by over 10% on average across a wide variety of workloads. http://techreport.com/articles.x/13176

abinstein said...

Giant -
"http://techreport.com/articles.x/13176"

You are embarrassing. The "wide range" of benchmarks that you refer to is the useless & narrow selection of enthusiast toys.

Know that both Barcelona and Clovertown are server & workstation processors, thus their strengths should be measured by server & workstation workload performance, which their buyers depend on for their businesses.

We already know that Barcelona 2.0GHz has about the same SPECint and 40% better SPECfp than Clovertown 2.3GHz. Also look at this AnandTech comparison, especially the compression (WinRAR), 3D apps, and server apps; Barcelona is indeed superior to Clovertown.

Now do let us know how you reconcile your fanboism with reality.

Giant said...

It may be speculated that AMD is only yielding K10 at a jaw-dropping 30%. 30%!!

Based on AMD's published data, which claimed a defect density of 0.5, coupled with the yield management chart, it can be concluded that AMD's K10 is yielding at no more than 35%.

AMD's Data:
http://www.iian.ibeam.com/events/thom001/22876/browser/slides/20070726084721294707/default_large/Slide158.JPG

Yield management by ICEC
http://smithsonianchips.si.edu/ice/cd/CEICM/SECTION3.pdf

If you scroll down to Figure 3-9 in Yield Management, you can see the chart. Follow the 0.5 defect density line to 280, then....
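
For readers who want to reproduce this kind of estimate without the chart, two textbook yield models can be evaluated directly. This is a minimal sketch under stated assumptions: the 0.5 figure is taken to mean defects per square centimeter (the comment does not give a unit), the die area is taken as the 283 mm² K10 figure quoted elsewhere in this thread, and the Poisson and Murphy formulas are generic models, not necessarily what the ICEC chart or AMD's slide uses.

```python
import math


def poisson_yield(die_area_cm2, defects_per_cm2):
    """Poisson model: probability that a die has zero random defects."""
    return math.exp(-die_area_cm2 * defects_per_cm2)


def murphy_yield(die_area_cm2, defects_per_cm2):
    """Murphy model: assumes defect density varies across the wafer."""
    ad = die_area_cm2 * defects_per_cm2
    if ad == 0:
        return 1.0  # no defects or zero-area die: every die is good
    return ((1.0 - math.exp(-ad)) / ad) ** 2


# Assumptions, not confirmed figures:
DIE_AREA_CM2 = 2.83      # 283 mm^2 quad-core die, as quoted in the thread
DEFECT_DENSITY = 0.5     # assumed to be defects per cm^2

print(f"Poisson yield: {poisson_yield(DIE_AREA_CM2, DEFECT_DENSITY):.1%}")
print(f"Murphy yield:  {murphy_yield(DIE_AREA_CM2, DEFECT_DENSITY):.1%}")
```

Both models ignore cache redundancy, partially working dies sold as lower parts, and the possibility that the published defect density is not the real production figure, so they say nothing definitive about actual K10 yields.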

Aguia said...

Stumbling in the Aisles

So it seems that after all Scientia is right.

And the guys that said my post about the more revisions would be impossible in so "short" time are wrong.

Christian M. Howell said...


So it seems that after all Scientia is right.

And the guys that said my post about the more revisions would be impossible in so "short" time are wrong.


I guess everyone just continually underestimates APM. If you look at in-depth white papers you will see that APM allows AMD to fix problems before the wafer is completed.

Perhaps that's why as Scientia mentioned Fab 30 was the number one Fab in the world. If they can run Fab36 at the same 125%, they can perhaps afford to start ramping 65nm tools at Fab38.

It's also funny how no one seems to realize that it seems as though Windsor and not Brisbane is still the majority of desktop chips.

This means that their largest segment is 200mm @90nm. Going to 300mm @65nm provides >3X the amount of chips even with Opteron as 223mm² vs 283mm² is only 25% larger with 150% more die space.

How is that not a cost savings? That's what AMD needs more than ASP since if things cost less you can charge less and grow your volume to slightly offset lower ASPs.

Since Barcelona takes Opteron out back and puts a whipping on it and K8 is still a more than worthy server CPU, AMD can charge more for the 2.5GHz chip than for the 3.2GHz K8 and drop the price of the K8 slightly (10-15%) and start EOLing more of them.

Even if they keep some 90nm equipment in Fab30, they still get twice the area with 300mm wafers which is again a cost savings for the same volume.

And now we can all see that AMDs partners will support them. Dell has said they are doubling the amount of AMD servers. LucasArts has about 20,000 Opteron machines, half of which will definitely get upgraded over the next few months.

Sun has already shown off their new 32 core machine and the new TACC supercomputer.

SPECJBB

LinPack

SSE


These benchmarks show that AMD has a great chip here and if the rumors about B3 are true we can expect even more perf for the higher clocked chips.

At least now they don't have as much short-term debt. I can't say that they will go back in the black this quarter but I expect much better results as 300mm @65nm becomes more of a factor in costs.

Axel said...

Mr. Howell

Since Barcelona takes Opteron out back and puts a whipping on it and K8 is still a more than worthy server CPU, AMD can charge more for the 2.5GHz chip than for the 3.2GHz K8 and drop the price of the K8 slightly (10-15%) and start EOLing more of them.

I would hope Barcelona puts a whipping on Opteron, it has double the number of cores. But you're forgetting that it's also more than double the die size of Opteron, so it's MUCH more expensive to fab. AMD have to charge much more for Barcelona than Opteron in order to keep a decent margin on them. Intel aren't going to let this happen, because Clovertown is more than competitive with K10 and Penryn will be even more so.

Axel said...

Abinstein

Know that both Barcelona and Clovertown are server & workstation processors, thus their strengths should be measured by server & workstation workload performance, which their buyers depend on for their businesses.

Tech Report's review shows that Clovertown owns K10 per clock on a "wide variety" of workstation workloads. The server space is K10's only bastion of strength, and even then only in memory intensive situations. In other server workloads, K10 & Clovertown are roughly equal per clock, with a clear advantage going to K10 in performance per watt.

But as you should already know, the desktop and mobile spaces are the most important to AMD from a revenue standpoint. With K10 apparently putting up a poor showing in these markets, AMD are screwed due to the expensive large native die. AMD will not survive 2008 without making massive changes to their business model.

Christian M. Howell said...

I would hope Barcelona puts a whipping on Opteron, it has double the number of cores. But you're forgetting that it's also more than double the die size of Opteron, so it's MUCH more expensive to fab. AMD have to charge much more for Barcelona than Opteron in order to keep a decent margin on them. Intel aren't going to let this happen, because Clovertown is more than competitive with K10 and Penryn will be even more so.


K8 Opteron is 223mm². K10 Opteron is 283mm². That is NOWHERE near 2x. And also don't forget that 90nm Opteron is made on 200mm wafers while K10 Opteron is made on 300mm wafers, which offer 2X the surface area.
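
The wafer and die figures in this exchange can be checked with a few lines of arithmetic. The sketch below uses a common rule-of-thumb approximation for gross dies per wafer (wafer area divided by die area, minus an edge-loss term); the function names are illustrative, the approximation ignores scribe lines and edge exclusion, and it counts dies before any yield loss.

```python
import math


def wafer_area_ratio(big_diameter_mm, small_diameter_mm):
    """Ratio of the surface areas of two round wafers."""
    return (big_diameter_mm / small_diameter_mm) ** 2


def gross_dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """Rule-of-thumb gross die count: area term minus an edge-loss term."""
    area_term = math.pi * wafer_diameter_mm ** 2 / (4.0 * die_area_mm2)
    edge_term = math.pi * wafer_diameter_mm / math.sqrt(2.0 * die_area_mm2)
    return int(area_term - edge_term)


# Die sizes as quoted in the thread: 223 mm^2 (90nm K8 Opteron), 283 mm^2 (K10).
k8_dies = gross_dies_per_wafer(200, 223)
k10_dies = gross_dies_per_wafer(300, 283)

print(f"300mm vs 200mm wafer area: {wafer_area_ratio(300, 200):.2f}x")
print(f"K8 dies per 200mm wafer (approx.):  {k8_dies}")
print(f"K10 dies per 300mm wafer (approx.): {k10_dies}")
```

By pure geometry a 300mm wafer has 2.25 times the area of a 200mm wafer, slightly more than the "2X" quoted above. Whether that translates into lower cost per chip also depends on wafer cost and yield, neither of which is given in the thread.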

abinstein said...

"K8 Opteron is 223mm². K10 Opteron is 283mm². That is NOWHERE near 2x."

This is very true. This is why Barcelona is mainly for servers and workstations, where scalability and power efficiency are more important than clock rate and single-threaded performance.

There's no doubt that Phenom will face tougher challenges from Penryn on the desktop enthusiast space.

abinstein said...

"Tech Report's review shows that Clovertown owns K10 per clock on a "wide variety" of workstation workloads."

Can you point out a single "workstation" workload, other than the usual crowd of Cinebench and Povray? Do you understand how big the workstation market is, and how many more applications need to be run on a standard workstation than these two enthusiast toys?

Not to mention the other "physics simulation" (or something like that) that's completely useless in solving research/engineering problems. Just because something looks good on the screen doesn't make it useful for a workstation.

abinstein said...

gutterrat -
"You claim not to be a fanboi yet you and scientia hang out over at amdzone and your posts reflect a big AMD bias."

I can understand the Intel crowd's desire to mark everyone who speaks favorably of AMD as "fans" or "fanboys". This is a normal FUD tactic.

Yet any rational man can see that it is not favorable comments that make one a fanboy, but an attitude of refusing or distorting the truth.

Both scientia and I recognize the fact that Core 2 is quite a bit (~20% on average) faster than K8 at the same clock speed. You've never called him an Intel fanboy for that, have you? Now can you or can you not face the truth that Barcelona is simply a faster and more power efficient server/workstation chip than Clovertown at the same clock rate, or at the same power consumption?

The fact that people like you spread FUD about AMD does not make us fanboys; it makes you FUDers. AMD will expand its capacity; it will stick with SOI at 45nm; it will make Fusion happen and increase enthusiast desktop performance. Get your popcorn and condiments ready and you can watch these happen while you consume them.

enumae said...

Abinstein

AMD will expand its capacity;


I am not trying to be difficult. I am just looking for some clarification.

When you say expand capacity, do you mean more FAB's or do you mean within the existing FAB's by ramping both FAB 36 to full capacity and FAB 30 transitioned to FAB 38 (300mm wafers, 65nm)?

Axel said...

Mr. Howell

K8 Opteron is 223mm². K10 Opteron is 283mm². That is NOWHERE near 2x. And also don't forget that 90nm Opteron is made on 200mm wafers while K10 Opteron is made on 300mm wafers, which offer 2X the surface area.

You're right, Opterons are still all 90-nm. I wonder why AMD haven't switched any of them to 65-nm, it's baffling. Anyway, if you compare 65-nm K8 to 65-nm K10, there's roughly a doubling in die size. The comparison would make more sense for the desktop space: Phenom vs 65-nm K8. AMD will not be able to make their margins on Phenom X4 due to the huge dies, aggressive Yorkfield pricing, and lower clock-for-clock performance compared to Yorkfield. This will be their Achilles Heel that will lead to massive restructuring (including the new "Asset Light") in 2008. Some details will probably be revealed over the next couple months.

bk said...

Interesting comment on Inquirer:

http://www.theinquirer.net/?article=42339

"Subject: Barca late

Sorry I've never interacted with the Inquirer before so I don't know if the "Author Flame" was the best route to do this but the link was at the bottom of the article and I'm lazy (hence this run on sentence) but I have some insight on why the AMD Barces are late. Now mind you if anyone asks I'm not an official source and this information is wild, very lucky, spot on, conjecture made by one of your anomalous readers. I wasn't going to ever say anything about it and now that it's water under the bridge I don't feel too bad about it but I really thought the info would come out of another source and it is just making my skin crawl every time I read wild assumptions or just the plain lack of truth about the subject. Right, I'll get to the point now. AMD's newest fab was supposed to be a ground up build for 65nm wafer production. However due to skyrocketing demand for Athlons at the time, things changed. Several factors were taken into consideration, by AMD, such as the grossly underestimated time line and threat of Intel's (now dubbed) core architecture and the fact that they could shave several months off the production start up time at Fab 36 if they were to start at 90nm instead of 65nm. This decision was a heated and very openly contested debate among AMD execs which I'm sure everyone will agree should have been contested until it was rescinded. While the conversion of the facility to a 65nm process was fairly quick in coming it was still detrimental to the Barcelona's birth date. Full scale spinup work for the new chip could not begin until 65nm production was fully up and running. Even after the implementation of the new 65nm process was complete AMD management struck again. To increase output to satisfy demand of then current Athlon cores, mainstream partial die shrinks were fast tracked for the A64 and given priority over the Barcelona startups.
While a lot of work was done during this time frame on Barce, the partial die shrink from 90nm to 65nm of the A64s was given more resources. All told the sad end product was at least a 4 month delay in the Barcelona launch. Only bad decisions were to blame in this instance, not production/architecture problems. Well that and all the AMD fans complaining that Dell didn't sell AMD CPUs. In fact if people didn't love AMD so much back then and didn't buy so many CPUs they would have had a competitive product for almost half a year now. ;)

Cheers,

Dano"

GutterRat said...

Back to the real world, another AMD VP leaves for greener pastures.

This time, it's Sue Snyder, former VP of International Policy and Relations and Executive Legal Counsel for AMD

According to this press release, Sue has experience with international regulatory and licensing issues, establishing international public/private partnerships, handling import/export and antitrust issues in Europe, and negotiating incentives for facility construction.

How many more have left yet to be discovered and what does this mean for AMD's existing litigation?

Scientia from AMDZone said...

Giant

"Based on AMD's published data, which claimed a defect density of 0.5, coupled with the yield management chart, it can be concluded that AMD's K10 is yielding at no more than 35%."

This was talked about a long time ago. The general consensus was that AMD would not give the exact defect density since production specific information tends to be highly secret for both Intel and AMD. I can guarantee that AMD's yields are not 30%.

GutterRat said...

scientia wrote,

"I can guarantee that AMD's yields are not 30%"


How can you guarantee this? You have moles inside AMD telling you what their yields are? If not, then what is the basis by which you make your claim?

Scientia from AMDZone said...

Christian M. Howell

"you will see that APM allows AMD to fix problems before the wafer is completed."

Not exactly. If a new mask is required, you can only fix in process if you haven't done that layer yet. However, these are test chips so you can scrap them without affecting production.

"Perhaps that's why as Scientia mentioned Fab 30 was the number one Fab in the world."

Rated by Sematech.

" If they can run Fab36 at the same 125%"

FAB 30 was expanded to 50% over its original capacity by adding more clean room space. FAB 36 increased from the original estimate of 20K to 24K wspm after they built the new bump and test facility and moved some equipment there.

"they can perhaps afford to start ramping 65nm tools at Fab38."

As far as I know, FAB 30 will only get a pilot 300mm production line. This is simply to get things ready in case AMD decides that they need the capacity. The actual ramping could occur from beginning of 2008 to end of 2008. In other words, with the original ramp, FAB 38 would have been at 50% capacity (10K) by end of 2008. The way it stands now it may only be at 4 or 5K.

"Even if they keep some 90nm equipment in Fab30"

The FAB 30 200mm tooling is being removed. I would guess that less than half (maybe 40%) of the tooling will remain in operation by end of year.

Scientia from AMDZone said...

GutterRat

"How can you guarantee this? You have moles inside AMD telling you what their yields are? If not, then what is the basis by which you make your claim?"

This information is in the public slides. It doesn't have a scale but is relative to past yields so you can tell if you knew what they were before. I assume you've seen the slides.

Scientia from AMDZone said...

GutterRat

"Back to the real world, another AMD VP leaves for greener pastures."

Wow, what a spin. Henri didn't leave for greener pastures. Henri went to a smaller company so that he could potentially become CEO some day (because that won't happen at AMD). Henri almost certainly took a pay cut.

Scientia from AMDZone said...

Axel

"Opterons are still all 90-nm. I wonder why AMD haven't switched any of them to 65-nm, it's baffling."

Not worthwhile for the volume and until FAB 30 is fully taken down it has to produce something.

"AMD will not be able to make their margins on Phenom X4 due to the huge dies, aggressive Yorkfield pricing, and lower clock-for-clock performance compared to Yorkfield."

Well, AMD hasn't gotten the full benefit of 65nm yet so their base costs are still dropping. We've already seen that Barcelona is priced higher than Penryn for server chips. We'll have to see about the desktop chips in Q1.

" This will be their Achilles Heel that will lead to massive restructuring (including the new "Asset Light") in 2008. Some details will probably be revealed over the next couple months."

Well, now I know who is still sitting in the pumpkin patch with Roborat. There won't be any big announcement. You (and Roborat) should understand by now that AMD is currently installing 300mm tooling in FAB 30; it won't be sold.

Scientia from AMDZone said...

lex

"APM is simply leveraging the data from the last lot on the tool and feeding those results back to the next lot, or its feeding the results of the current lot from a previous step and adjusting a downstream step to compensate."

Yes, this is accurate.
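
The lot-to-lot feedback described in the quote resembles what is generally called run-to-run control. A minimal sketch, assuming a single tool with a linear response and an EWMA-style update; the function name, gain, weight, and numbers are all hypothetical and not taken from AMD's APM:

```python
def run_to_run(target, gain, true_offsets, weight=0.3):
    """Feedback loop: after each lot, update an estimate of the tool's offset
    and use it to choose the next lot's recipe setting.

    Assumed process model (hypothetical): measured = gain * setting + offset.
    """
    offset_estimate = 0.0
    measurements = []
    for true_offset in true_offsets:
        # Pick the setting that would hit the target if the estimate were right.
        setting = (target - offset_estimate) / gain
        # What metrology reports for this lot under the assumed model.
        measured = gain * setting + true_offset
        measurements.append(measured)
        # EWMA update: blend the offset implied by this lot into the estimate.
        observed_offset = measured - gain * setting
        offset_estimate = weight * observed_offset + (1 - weight) * offset_estimate
    return measurements

# A tool whose offset steps from 0 to 2 units partway through the run.
results = run_to_run(target=65.0, gain=1.0, true_offsets=[0.0] * 3 + [2.0] * 10)
print([round(m, 2) for m in results])
```

In this sketch each lot after the step lands closer to the target than the one before. The loop only adjusts tool settings between lots; it does not repair anything on wafers already processed.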

Scientia from AMDZone said...

GutterRat

"Perhaps it would be to your advantage to concede that perhaps there are people out there that know a bit more about the real situation than you."

There could very well be. However, I saw serious technical errors in comments by the authors of both the Tech Report and Anandtech reviews. Clearly some know less.

"AMD employees acknowledge that this was a paper launch, plain and simple."

This is simply false. Either that or by your definition Conroe was a paper launch. Barcelona has shipped whether it is available on NewEgg or not.

" You want a link for that? Sorry, no can do because I would jeopardize sources and methods :)"

That's quite all right; I have it narrowed down to either the Mad Hatter or Mad March Hare.

"OK, so what you are admitting that you have refused to accept Barcelona as a non-starter to Intel's existing CPUs."

Barcelona isn't a non-starter. It compares favorably in performance/watt. However, I haven't seen any benchmarks which suggest that it is running as fast as it should. The desktop won't be based on performance/watt.

abinstein said...

"I am not trying to be difficult. I am just looking for some clarification."

You are being pointless. What difference does it make as long as it's expanding capacity? Does AMD have to resort to more expensive ways to prove the expansion worthy of your note?

abinstein said...

gutterrat -
"No, it is a paper launch. I have called 3 major OEMs and not one of them will ship me any systems with Barcelona CPUs in them until November."

I'm glad you finally recognize that Barcelona is a superior product and are prepared to order systems based on it. Unfortunately, you are not an AMD customer.

The OEMs delay their product release for numerous reasons, including BIOS, support, documentation, even packaging. I don't know what takes them so long that they need 2 months to ship a product; maybe they have poor operating efficiency.

However, system makers in Taiwan will deliver Barcelona based systems by the end of this month. You should call someone you know there. If you don't know any, then I'm glad to take your order with some commission.

InTheKnow said...

Gutterrat said...

Quiz: how many ex-AMD employees have you interviewed recently?

Might I suggest that EX employees might have a somewhat biased take on the situation. There may be truth in what they say, but I suspect their view may be slanted towards the negative.

enumae said...

Abinstein
"You are being pointless."


And you are being rude, or is this called logical... I was not trying to attack your comment, but I was looking for clarification on your statement.

"What difference does it make as long as it's expanding capacity?"

In the context of processors, past discussions have usually described this as ramping production rather than expanding capacity.

If I misunderstood something, sorry.

"Does AMD have to resort to more expensive ways to prove the expansion worthy of your note?"

No, and again, I was not trying to attack your comment so please don't be rude.

Christian M. Howell said...

"You're right, Opterons are still all 90-nm. I wonder why AMD haven't switched any of them to 65-nm, it's baffling. Anyway, if you compare 65-nm K8 to 65-nm K10, there's roughly a doubling in die size. The comparison would make more sense for the desktop space: Phenom vs 65-nm K8. AMD will not be able to make their margins on Phenom X4 due to the huge dies, aggressive Yorkfield pricing, and lower clock-for-clock performance compared to Yorkfield. This will be their Achilles Heel that will lead to massive restructuring (including the new "Asset Light") in 2008. Some details will probably be revealed over the next couple months."


You can't really compare Brisbane to Barcelona. There are only 3 Brisbane chips and most desktops are 90nm also.

Brisbane is, I believe, about 67% the size of Windsor, which still doesn't make 2x for Phenom.

I'm not sure what the dimensions of Kuma are but that will be replacing K8 in the mainstream, not Agena.

Quad comes first because server is quad. The comparison remains the same with Opteron. At least 75% more chips per wafer.

And please stop saying huge dies.

Christian M. Howell said...


"Not exactly. If a new mask is required you can only fix in process if you haven't done that layer yet. However, these are test chips so you can scrap them without affecting production."

From what I read of the white papers, the software can make corrections as it monitors certain facets of transistors being etched. My assumption is that this means no two chips can have the same defect.



"Rated by Sematech."


That's why I mentioned it.


"FAB 30 was expanded to 50% over its original capacity by adding more clean room space. FAB 36 increases from the original estimate of 20K to 24K wspm after they built the new bump and test facility and moved some equipment there."


That's perhaps why Fabtech reported that Chartered orders are down.


"As far as I know, FAB 30 will only get a pilot 300mm production line. This is simply to get things ready in case AMD decides that they need the capacity. The actual ramping could occur from beginning of 2008 to end of 2008. In other words, with the original ramp, FAB 38 would have been at 50% capacity (10K) by end of 2008. The way it stands now it may only be at 4 or 5K."


Since AMD is supposed to convert to 65nm, it would seem reasonable that 300mm wafers would come first. That would immediately allow nearly twice the Opterons/X2s at 90nm - which seem to be their highest volume - with Turion now at 65nm at Fab 36.
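
As a rough check on the "nearly twice" figure, the area of a 300mm wafer relative to a 200mm one can be estimated from the diameters alone. A minimal sketch that ignores edge loss and die dimensions, both of which shift the real number:

```python
import math

def wafer_area_mm2(diameter_mm):
    """Area of a circular wafer, ignoring edge exclusion and the notch/flat."""
    return math.pi * (diameter_mm / 2) ** 2

ratio = wafer_area_mm2(300) / wafer_area_mm2(200)
print(f"300mm / 200mm area ratio: {ratio:.2f}")
```

By area alone the ratio is 2.25, so "nearly twice" is, if anything, on the conservative side before edge effects are counted.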


"The FAB 30 200mm tooling is being removed. I would guess that less than half (maybe 40%) of the tooling will remain in operation by end of year."


I think we're confusing what I'm saying about 300mm @ 90nm. Perhaps I'm incorrect in my assumption that the actual etching assembly can be placed into the machinery for 300mm.



At any rate, there is a recognized cost savings for switching to Barcelona in the server space.

abinstein said...

enumae -
"And you are being rude, or is this called logical... I was not trying to attack your comment, but I was looking for clarification on your statement."

Oh well, I'm sorry if my response seems rude to you. It was not my intention. Maybe I should've said it this way: the question you raised is pointless.

No one says only building new fabs is called expansion. You can expand current fab capacity by adding more tools, streamlining manufacturing, or increasing existing tool operating efficiency.

Building a new fab is expensive, and it's definitely not a good thing to do for AMD in short term. As scientia pointed out, AMD was able to expand FAB30 to ~1.5x of its original designed capacity; AMD can do the same for FAB36 and FAB38. AMD can even expand capacity by shifting low-end production to Chartered.

AMD's flex capacity chart shows the company made fewer than 60mil sellable dies in 2006, but plans to make more than 80mil in 2007 and close to 100mil in 2008, while transitioning from dual-core to quad-core. If this is not called expansion, what is?

abinstein said...

"From what I read of the white papers, the software can make corrections as it monitors certain facets of transistors being etched."

IIRC, current APM cannot monitor individual dies, only individual wafers. But that still makes it possible to "fix" defects without changing the mask, since many defects are not strictly related to the mask anyway.

The sooner any problem in the lithography process can be identified & traced, the better the production line can be adjusted, and the fewer wafers will be wasted.

"My assumption is that this means no two chips can have the same defect."

There are systematic defects and non-systematic ones. Current APM should be able to tune on the wafer level to fix the systematic ones, but it probably can't do it on the per-die level, nor can it do anything on the non-systematic problems.

Ho Ho said...

abinstein
"The OEMs delay their product release for numerous reasons, including BIOS, support, documentation, even packaging. I don't know what take them so long to ship a product in 2 months; maybe they have poor operating efficiency."

You mean poor OEMs such as IBM? Do you remember their press releases that were leaked a day before launch?

Mo said...

I cannot believe that some of you are trying to justify that it was NOT a paper launch.

This was a full blast premature paper launch.

The stepping they launched with, B1, is not even the shipping stepping, so that alone settles it.

Not a single processor is available anywhere, and OEMs don't have servers with Barcelona that are shipping.

When Core 2 came out, on the day of launch, you could buy a Core 2 Duo. It was a limited supply but there was a supply.

Stop trying to justify that it was a hard launch. It wasn't.

Scientia from AMDZone said...

13ringinheat

"Well here is another piece of AMD fud.......AMD is to release tri core CPUs........lets see how this is spun....."

I didn't find this credible. There is no logical reason for AMD to create a completely new die mask for a three core processor. Software tends to be either dual thread or n-thread so there is no advantage to having three cores versus four. It seems to me that if any 3 core chips are ever released they would just be 4 cores with one core disabled.

Scientia from AMDZone said...

abinstein

"Current APM should be able to tune on the wafer level to fix the systematic ones, but it probably can't do it on the per-die level, nor can it do anything on the non-systematic problems."

Your first guess is correct. It is strictly wafer level. And, the corrections are only general; they cannot correct defects. A better way to think of it is controlling tool variation to keep more of die yield in a favorable range. This ability has less benefit as the process matures.

Scientia from AMDZone said...

enumae & abinstein

This seems like a semantic argument. When tooling is added to an existing FAB, it is referred to as ramping. However, when a new process is started or a new core is put into production it is also called ramping.

If you want to get technical then I suppose FAB 36 is ramping production until it hits 20K wspm. The extra 4K is technically expansion because it was dependent on the new bump and test facility.

Obviously when AMD has one entire FAB (FAB 30) to change over to 300mm tooling plus another 7K left to ramp on FAB 36, the question of capacity is moot right now. The only real question is whether AMD is serious about building a 3rd FAB. FAB 38 will stop being flex capacity after mid 2009.

Scientia from AMDZone said...

mo

AMD has shipped product to OEMs. If you want to call this a paper launch that is fine but C2D was in similar short supply for at least a month after "launch".

I can remember very clearly when Intel executives had to make statements about when production would make the next handful of Conroes available. Have you forgotten?

Scientia from AMDZone said...

Christian M. Howell

"There are only 3 Brisbane chips and most desktops are 90nm also."

Not exactly. The Brisbane range covers the highest volume chips.

"Quad comes first because server is quad."

Quad core comes first because AMD has no quad core.

Scientia from AMDZone said...

Christian M. Howell

I think you may be misunderstanding the ramping on the new 300mm tooling in FAB 30. AMD will not do general production at 90nm on the new 300mm tooling. The new tooling on FAB 36 was tested at 90nm first but since 65nm is mature at this point AMD will just begin testing at 65nm. 90nm production should be gone by mid 2008 as 45nm becomes ready for launch.

Ho Ho said...

scientia
"I can remember very clearly when Intel executives had to make statements about when production would make the next handful of Conroe's available. Have you forgotten?"

I wonder why I remember I could get myself a new Core2 CPU on the same week they were released from several shops in Estonia.

abinstein said...

"I wonder why I remember I could get myself a new Core2 CPU on the same week they were released from several shops in Estonia."

There's nothing to wonder about. You can't get Barcelona because it's not a consumer processor, and you're not a server builder.

abinstein said...

"A better way to think of it is controlling tool variation to keep more of die yield in a favorable range. This ability has less benefit as the process matures."

Well, actually this ability is what makes a process mature.

abinstein said...

"This was a full blast premature paper launch."

This is a paper launch for you, but not a paper launch for big customers and server vendors.


"The stepping they launched with B1 is not even the shipping stepping so that alone."

Purely wrong and complete FUD. They are launching stepping BA, not stepping B1. No stepping B1 should fall in the hands of any Barcelona customers.


"Not a single processor is available anywhere, OEMs don't have servers with barcelona that are shipping."

For this statement to be true, you would have to possess the godly vision of everywhere, something I seriously doubt you do.

I have friends working at Acer. They already showed me last week that you're wrong. I guess they also possess some godly time-traveling ability. ;P


"When Core 2 came out, on the day of launch, you could buy a core 2 duo. It was a limited supply but there was a supply."

I wasn't able to get my Core 2 based Dell workstation by October.

Ho Ho said...

abinstein
"I wasn't able to get my Core 2 based Dell workstation by October."

I guess that Dell and IBM have equally bad operating efficiency then, as I bought my Core2 just a couple of weeks after launch and there was plenty of supply in the local retailers. I could have bought it even sooner but I was waiting for the price to drop a bit.

Scientia from AMDZone said...

Ho Ho

"I wonder why I remember I could get myself a new Core2 CPU on the same week . . ."

This is the logical fallacy of Hasty Generalization. On the other hand, the statements by Intel executives are a matter of record.