Thursday, July 19, 2007

AMD's Q2 07 Earnings

Looking at the earnings, I'm reminded of the movie "Regarding Henry" where Henry gets shot in the head and is in a coma. The doctor then tells Henry's wife that if you have to get shot in the head, the place where Henry got hit is the best place. Similarly with AMD, I suppose that if they had to lose $600 Million in one quarter, this was the best way to do it. Although the amount of the loss this quarter was the same as last quarter, in other important ways this quarter was a bit better.

Since there is not much good to say about losing over half a Billion dollars for the third quarter in a row, let's leave that for the moment and look at microprocessor revenue share. AMD's microprocessor revenue share really tanked in Q3 2002 when it hit a low of 5.5%. However, this number is not really comparable to today since at that time AMD had only one FAB capable of producing microprocessors, so AMD didn't need as much of the market. With two FABs today, AMD's costs are higher and therefore AMD needs more share. To compare today's numbers we need to look at the period since AMD started putting serious resources into a second leading edge FAB. Ground breaking on FAB 36 started November, 2003 so we will only compare the current quarter with AMD's history back to Q1 2004.

AMD's microprocessor revenue share climbed from around 9% in early 2004 to about 15% in Q3 2005. AMD's revenue share stayed above this until it fell drastically in Q1 2007 when it hit 13.3%. However, this quarter it bounced up to 15.8% which is just a little higher than AMD had in Q4 2005. AMD's processor unit volume share has probably also recovered to about what it was before Q1 which would be about 25%. In trying to evaluate these numbers most people stumble over the concept of Average Selling Price or ASP. While it might seem immediately intuitive that higher income is related to higher ASP this is not actually the case. The number we actually need is Average Selling Profit. For example, it would be better to sell a processor for $50 that costs us $20 to manufacture than to sell a processor for $100 that costs us $80 to make. Unfortunately, average profit for processors is never mentioned. Logically, the number of units sold x average profit = total profit. It means very little to us if AMD's average price increases because this merely increases the total revenue but may not in fact increase profits. However, the numbers show that the cost of producing microprocessors at AMD has dropped by 8.5%. This boosts the profit by 18%. This is seen in the increase of gross margin percent from 28.1% last quarter to 33.5% this quarter. So, AMD's actual average asking price may be lower this quarter than last but it doesn't matter because on average AMD is making more money off of each processor than it did last quarter. Increased unit volume and better profit per unit are both positive changes for AMD this quarter.
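To make the price-versus-profit point concrete, here is a minimal sketch using only the $50/$20 and $100/$80 figures from the example above. The function names and the unit volume are hypothetical, chosen purely for illustration.

```python
# Sketch of "units sold x average profit = total profit".
# The prices and costs come from the example in the text; the volume is assumed.

def unit_profit(price, cost):
    """Profit made on a single processor."""
    return price - cost

def total_profit(units, price, cost):
    """Total profit for a given number of units sold."""
    return units * unit_profit(price, cost)

cheap_chip = unit_profit(50, 20)     # lower selling price, higher profit
pricey_chip = unit_profit(100, 80)   # higher selling price, lower profit

assumed_units = 1_000_000  # hypothetical volume, not an AMD figure
print("cheap chip profit per unit: ", cheap_chip)
print("pricey chip profit per unit:", pricey_chip)
print("total at assumed volume:    ",
      total_profit(assumed_units, 50, 20),
      "vs",
      total_profit(assumed_units, 100, 80))
```

At equal volume the $50 part brings in half the revenue of the $100 part but more profit, which is why ASP alone says little about income.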

I had estimated previously that AMD could lose another $1.5 Billion this year without danger of bankruptcy. Having just lost $600 Million this margin is now down to $900 Million. AMD was also in danger of having its R&D and capital purchases budget choked by its reduced income. For example, AMD had originally planned to spend $2 Billion in 2007 for capital purchases but then had to reduce this estimate to $1.5 Billion. Recently though, the German government pledged $360 Million in aid and AMD's capital purchases estimate climbed again to $1.8 Billion. Also, AMD believes that it can break even in Q4. If this is realistic and Q3 splits the difference losing $300 Million then AMD would pull out of its nosedive with $600 Million to spare. However, trying to estimate Q3 revenue is not easy. AMD's recent price cuts will affect Q3. However, as AMD now takes down FAB 30 they will get increasing benefits of a greater 65nm to 90nm mix. AMD could also see income of $100 - $200 Million in Q3 from sales of 200mm tooling from FAB 30. And, AMD gets at least some benefit from K10 Opteron sales in Q3. To me, this seems like enough to cut the losses by $100 Million or so but not $300 Million unless AMD sees increases in unit volume. This could happen perhaps if DTX is more popular than current solutions. Q4 looks much better. The volume of K10 server chips should be reasonable in Q4 and AMD should begin delivering HT 3.0 capable Budapest versions. Assuming that AMD can bump the clocks up from a paltry 2.0Ghz to a more reasonable 2.3 - 2.5Ghz AMD should gain a little on Intel. On Intel's part, the decision to push back 45nm production at the Chandler FAB to Q1 2008 means that only D1D will be producing 45nm chips in Q4 2007. This volume wouldn't go very far on the desktop but would have some effect on servers. Assuming Intel's 45nm server chips are available in 3.2Ghz speeds, AMD will need at least 2.4Ghz quad cores.
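The cushion arithmetic above can be sketched as follows. All amounts are in millions of dollars and come from the paragraph; the function name is hypothetical, and the "cautious" scenario simply assumes the Q3 loss shrinks by only $100 Million while Q4 still breaks even.

```python
# Rough sketch of the remaining cushion under different loss scenarios.
# Amounts are in $ millions and taken from the text; the helper is illustrative.

def remaining_cushion(cushion, quarterly_losses):
    """Subtract each quarter's loss from the starting cushion."""
    for loss in quarterly_losses:
        cushion -= loss
    return cushion

start = 1500                                              # estimated tolerable loss for the year
after_q2 = remaining_cushion(start, [600])                # Q2 loss only
split_case = remaining_cushion(start, [600, 300, 0])      # Q3 splits the difference, Q4 breaks even
cautious_case = remaining_cushion(start, [600, 500, 0])   # assumed: Q3 improves by only $100M

print(after_q2, split_case, cautious_case)
```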

To some extent this is like walking up a down escalator. AMD will certainly increase clock speeds in Q4 but it appears that Intel will as well. Essentially, AMD will use its K10 chips to maximum benefit as Opterons because current Opterons are still running on the old 90nm process. So, Opteron gets higher IPC, greatly enhanced power saving features, and power savings due to 65nm. AMD's greatest advantage will be in 2-way or 4-way where Intel's use of FBDIMM makes Intel systems consume more power and run slower. However, Intel too gets power savings from 45nm and its larger cache will help the most on 2-way and up quad core systems. So, AMD's advances will get mostly countered by similar advances from Intel. AMD will come out only slightly ahead in performance/watt on 2-way and up systems due to faster HT, enhanced power saving, and inherent advantages of native versus MCM quad. On the other hand, Intel retains highest overall performance due to its much greater clock speed. In Q1 2008 the race continues as Intel ramps 45nm desktop chips while AMD ramps K10 desktop chips. Then in Q2 Intel ramps 45nm mobile while AMD delivers its most aggressive mobile offering with a new mobile CPU and all new mobile chipset. The situation continues in the second half of 2008 as Intel gears up for Nehalem with Integrated Memory Controller and CSI (similar to AMD's HyperTransport) while AMD gears up for its own 45nm chips with Direct Connect Architecture 2.0 (including a possible MCM octal core). Production in 2008 should be similar as Intel ramps two new 45nm FABs while AMD converts 200mm FAB 30 to 300mm FAB 38 and finally enjoys the benefits of having two similar 300mm FABs.

I suppose I'm somewhat reminded of the mythical hydra when thinking of Intel's current FSB based chipset solutions. With each head that was cut off, the hydra would grow two more. In a similar fashion Intel has attempted to overcome the problems of shared bus by having first two FSB's and then four FSB's later this year. I've seen some people naively describe this as two of Intel's current 5000 series northbridge chips. However, quad FSB chipset complexity is not twice that of a dual FSB chipset, but actually four times greater if the same speed is maintained or half the speed with similar complexity. If we simply linked two northbridges together we would incur a large latency penalty between the northbridges every time we transferred from memory across chips. So, we put all the functions in one chip. This sounds great but look at the numbers. For two FSB's and four FBDIMM channels we have 2x4 = 8 connections. To maintain the same bandwidth for four FSB's we would need eight FBDIMM channels for 4x8 = 32 connections. Four times the connections means four times the circuitry and four times the power draw. This is a reasonable interim solution for Intel but we can also understand why Intel will drop this architecture with Nehalem.
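The connection count can be sketched as below. This is only the simple model used in the paragraph (every FSB connects to every FBDIMM channel inside the northbridge), taking the dual FSB case as four FBDIMM channels as the 2x4 arithmetic implies. The function names are hypothetical and this is not a description of any actual Intel chipset.

```python
# Simple model from the text: connections = FSBs x FBDIMM channels.
# Function names are illustrative; the channel counts follow the 2x4 and 4x8 example.

def connections(fsbs, fbdimm_channels):
    """Number of FSB-to-memory-channel paths inside a single northbridge."""
    return fsbs * fbdimm_channels

def channels_per_fsb(fsbs, fbdimm_channels):
    """Memory channels available per FSB, a rough proxy for bandwidth per bus."""
    return fbdimm_channels / fsbs

dual_fsb = connections(2, 4)             # current dual FSB arrangement
quad_fsb_same_bw = connections(4, 8)     # quad FSB with memory bandwidth kept equal per bus
quad_fsb_same_channels = connections(4, 4)  # quad FSB with memory channels left at four

print(dual_fsb, quad_fsb_same_bw, quad_fsb_same_bw / dual_fsb)
print(channels_per_fsb(2, 4), channels_per_fsb(4, 4))
```

Keeping four channels holds the connection count to 16, but under this model each FSB then gets half the memory channels it had in the dual FSB case.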

Some have tried to compare the release of K10 to K8 in 2003. I assume this analogy comes to mind because of AMD's less competitive position with K7 and because of both K8's delays and low initial clock speeds. However, they are actually quite different. The K8 die was twice the size of the K7 die and yields on the new 130nm SOI process were terrible at about half of K7's yields. Also, the only available chipset at K8's release was AMD's own 8000 series. It took months for desktop motherboards to begin showing up. In contrast, K10's 65nm yields are good and it already has full support with existing AM2 and socket F motherboards. Another similarity is that in 2003, Xeon's FSB was woefully underpowered on 4-way systems which is why Xeon relied heavily on L3 cache. Clearly, Intel is still relying on large cache as shown by Yorkfield's 50% increase but Intel today is much more competitive in FSB speed. In 2003, 4-way Xeon used DDR-266 compared to Opteron's DDR-400. Today, AMD's memory controller is equivalent to a FSB speed of 1600 Mhz. The same ratio as 2003 would put Intel at 1066 Mhz so we can see that Intel is already ahead of this at 1333 Mhz. Further, since AMD already uses the fastest DDR2-800 memory, greater controller speeds won't help unless DDR2-1066 is released. In fact, if DDR2-1066 is not released, AMD's lead in memory speed is likely to vanish as AMD won't switch to DDR3 until late 2008. This is essentially what happened with DDR where Intel influenced JEDEC to stop at DDR-400 and switch to DDR2 while AMD could have used DDR-500. It is possible that after the lackluster acceptance of Intel's FBDIMM that JEDEC might be more open. However, it remains to be seen if JEDEC will continue to follow Intel or whether they will be more supportive of AMD.
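The FSB comparison above is a straight proportion, sketched here with the speeds quoted in the paragraph. The function name is hypothetical, and the result lands at 1064, which corresponds to the nominal 1066 Mhz bus speed.

```python
# Scale Intel's 2003 memory speed disadvantage to today's AMD-equivalent speed.
# Speeds are the ones quoted in the text; the helper name is illustrative.

def same_ratio_speed(intel_then, amd_then, amd_now):
    """Intel speed today that would preserve the 2003 Intel-to-AMD ratio."""
    return amd_now * intel_then / amd_then

scaled = same_ratio_speed(266, 400, 1600)  # DDR-266 vs DDR-400, AMD now at 1600 equivalent
intel_now = 1333                           # Intel's current FSB speed from the text

print("2003 ratio applied today:", scaled)
print("Intel ahead of that ratio:", intel_now > scaled)
print("ratio then vs now:", round(266 / 400, 3), round(intel_now / 1600, 3))
```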

The final thing that I've been curious about is the change in processor families. When AMD had K7, all of their market segments were covered with this one architecture. So, AMD had the dual socket MP version of K7, the mobile version of K7, and the low end Duron version. AMD continued this same approach with K8 with Opteron, Turion, and Sempron. Intel in contrast has often had very different architectures for different segments. For example, Intel continued to sell Pentium Pro for servers even after PII was launched. Further, Intel's Celeron version with on-die cache instead of off-die cache was also quite different in architecture. Intel continued this with P4 Xeon by using L3 cache and PAE addressing which was not used in the desktop version. We see the same thing today where Tulsa with its hybrid FSB design is also substantially modified from the desktop version. Intel changed this even further with Pentium M whose architecture was very different from P4's. Today though Intel has reversed this situation by using one architecture for everything. Intel has Conroe, Woodcrest, and Merom versions of C2D much as AMD had with K8. Apparently Intel will continue this with 45nm by having Wolfdale, Yorkfield, and Penryn versions. AMD however is going the other way and will release a new mobile architecture called Puma in 2008 based not on K10 but a development of K8. The main difference that I can see is that Puma is not designed around massive throughput SSE as the desktop and server chips are. This may be a reasonable architecture split since I suspect few people do video or audio bulk conversion or other SSE intensive tasks on their notebooks. If this really is an advantage then it is certainly possible that Intel will go back to a split architecture when it introduces Nehalem. In fact there have already been suggestions that only the multi-socket versions will use an IMC while Intel will continue to use a FSB with the single socket chips. If true this would indeed put Intel back into a split architecture strategy.

107 comments:

enumae said...

Scientia
However, this quarter it bounced up to 18.4% which is the highest percentage in the past 3 1/2 years.

I can't seem to find this, can you explain?

However, the numbers show that the cost of producing microprocessors at AMD has dropped by 8.5%. This boosts the profit by 18%. This is seen in the increase of gross margin percent from 28.1% last quarter to 33.5% this quarter.

Please explain this as well.

Thanks

Scientia from AMDZone said...

The first number was a typo in my spreadsheet; it should be 15.8% which is a little higher than it was in Q4 2005.

The second set of numbers are from the revenues and losses in the microprocessor section. Basically, if AMD's costs were the same then the losses should have increased as revenues increased. But, they actually dropped.

Ho Ho said...

I think enumae actually wanted to know where you got those numbers*. I remember you waited for third parties to confirm the numbers last quarter when AMD lost marketshare. Are those numbers confirmed by anyone other than AMD?

*) If he doesn't then I certainly do

enumae said...

Scientia
The first number was a typo in my spreadsheet; it should be 15.8% which is a little higher than it was in Q4 2005.

In the second quarter, AMD accounted for 11.4% of worldwide microprocessor sales, up 0.5 percentage points from the 10.9% it held in the first quarter, iSuppli said.

-Link-

The second set of numbers are from the revenues and losses in the microprocessor section.

Ok, I will look again.

Scientia from AMDZone said...

Q1 07 - Earnings 918, expenses 1239
expenses 135% of earnings
Q2 07 - Earnings 1098, expenses 1356
expenses 123% of earnings

Definitely an improvement. Obviously though we need expenses to be less than earnings.
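For anyone who wants to check these figures, here is a minimal sketch of the calculation. It assumes the numbers above are in millions of dollars for the microprocessor segment and treats the "Earnings" figures as revenue; the variable names are hypothetical.

```python
# Expenses as a fraction of revenue for the two quarters quoted above.
# Assumption: figures are $ millions and "Earnings" means segment revenue.

q1_revenue, q1_expenses = 918, 1239
q2_revenue, q2_expenses = 1098, 1356

q1_ratio = q1_expenses / q1_revenue
q2_ratio = q2_expenses / q2_revenue

# Relative drop in cost per dollar of revenue from Q1 to Q2.
relative_drop = (q1_ratio - q2_ratio) / q1_ratio

print(f"Q1 expenses: {q1_ratio:.0%} of revenue")
print(f"Q2 expenses: {q2_ratio:.0%} of revenue")
print(f"relative drop in cost per revenue dollar: {relative_drop:.1%}")
```

The relative drop works out to about 8.5%, which lines up with the cost reduction figure in the post.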

You can also look at the gross margin percent.

Q1 07 - 28.1%
Q2 07 - 33.5%

This would be a 16% drop from Q1 costs. This is probably due mostly to a greater ratio of 65nm to 90nm parts. This should continue to increase as FAB 30 ramps down. Maybe we could see AMD back up to 45% in Q4. I think we would also see a small boost if ATI moved R600 to 65nm production.

I don't know exactly what the volume share is. However, since AMD's prices didn't go up and their revenues did we can assume their volume increased a lot. It looks like Intel's volume didn't change much from Q1 so I would say that AMD gained back share it lost in Q1. But, I don't know the exact numbers.

Ho Ho said...

It's funny how you waited weeks for confirmation to see an independent third party to talk about AMD marketshare drop in Q1 whereas you agree to the first (wrong) numbers you see when they report gaining back some of it :)


"However, as AMD now takes down FAB 30 they will get increasing benefits of a greater 65nm to 90nm mix"

Isn't it somewhat being offset by ramping much more expensive K10?


"Q4 looks much better. The volume of K10 server chips should be reasonable in Q4 and AMD should begin delivering HT 3.0 capable Budapest versions"

Wasn't AMD's original plan to truly start ramping K10 next year? You yourself have stated several times that K10 will have negligible effect on AMD this year. What made you change your mind?


"For two FSB's and two FBDIMM channels we have 2x4 = 8 connections. To maintain the same bandwidth for four FSB's we would need eight FBDIMM channels for 4x8 = 32 connections"

Where did you get the idea Intel will connect twice as much memory channels to that chipset? I'll give you one article to read and especially to look the pictures. Intel will not double the number of memory channels with Tigerton. Next time check the facts before building up a theory based on them :)

Ho Ho said...

An interesting link from our friend Roborat's blog: The Problems With SOI.

Any comments from anyone more educated in the finer details of the manufacturing process? The analysis seems to be based on AMD public datasheets and sounds solid to me. It also makes me believe that the rumours of AMD ditching SoI for 32nm and below might actually be true.

Azmount Aryl said...

Yes, thats absolutely true. AMD's 65nm SOI is much worse than TSMC's bulk 65nm. Thats why they are selling all their fabs and will begin manufacturing everything at TSMC in 2008, as pointed out by me in one of my posts at amdzone.com

Duhh... told ya soo!

Scientia from AMDZone said...

ho ho

"It's funny how you waited weeks for confirmation to see an independent third party to talk about AMD marketshare drop in Q1"

Well, this shouldn't be difficult for most people to understand. The talk in Q1 was that AMD's revenue fell sharply because they slashed their processor prices. None of the early articles gave a breakdown of volume share so I had no idea what the volume was.

"whereas you agree to the first (wrong) numbers"

AMD's Q2 Earnings report numbers are wrong? Maybe you should contact the SEC.

Again, I think most people can understand that if your revenues increase a lot (while your competitor's drop) and your chip prices are either equal or down then you must have sold at least that many more chips. This should be true unless: 1.) AMD has increased chip prices. 2.) Intel has reduced its prices in total in the last two quarters more than AMD did total in the last two quarters.
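The reasoning here is just revenue = units x average price, rearranged. A rough sketch, using the Q1 and Q2 revenue figures quoted earlier in the comments (assumed to be in millions of dollars) and hypothetical price changes:

```python
# If revenue = units x average price, then
# unit growth = revenue growth / price change.
# Revenue figures are from the earlier comment; the price ratios are assumed.

def unit_growth(revenue_before, revenue_after, price_ratio):
    """Ratio of units sold after vs before, given the average price ratio."""
    return (revenue_after / revenue_before) / price_ratio

flat_prices = unit_growth(918, 1098, 1.00)     # average price unchanged
lower_prices = unit_growth(918, 1098, 0.95)    # hypothetical 5% lower average price

print(f"units up {flat_prices - 1:.1%} if prices were flat")
print(f"units up {lower_prices - 1:.1%} if prices fell 5% (assumed)")
```

With flat prices the unit volume has to rise by the same roughly 20% that revenue did, and any price decline pushes the required unit growth higher.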

"Isn't it somewhat being offset by ramping much more expensive K10?"

Not in Q3; the K10 volume is too low. With FAB 30 running at capacity in Q1 and Q2, the only change in mix was due to ramping at FAB 36. In the second half of the year the mix changes faster because not only is FAB 36 ramping but FAB 30 is decreasing at the same time. K10 won't have that much effect unless you expect it to be 25% or more of FAB 36's capacity in Q4.

"Wasn't AMD original plan to truely start ramping K10 next year?"

No. Server chips are only a fraction of desktop chips. AMD will have good volume of server chips in Q4. However, this volume is small so it won't change revenues by much. When people ask about K10 volumes they are asking when it will increase revenues significantly and that will be when the desktop volumes come up in Q1 08.

"Where did you get the idea Intel will connect twice as much memory channels to that chipset?"

I didn't say that Intel would. I was showing what was required. If you leave the number of FBDIMM channels the same you have to double the channel speed to have everything equal. And, that is not going to happen. Basically, with only 4 FBDIMM channels Intel is going to take a big hit in memory bandwidth on 4-way.

Scientia from AMDZone said...

ho ho

"An interesting link from our friend Roborat's blog"

roborat cherry picks items that favor Intel no matter how silly they are. I assume he is still claiming that AMD will sell its FABs in 2008.

"The analysis seems to be based on AMD public datasheets and sounds solid to me."

I've read Ed's comments (which were a waste of time). I read wbmw's post on Investor's Hub and then looked at the original white paper data. I'm not seeing anything that supports either of his claims.

His first claim is that the data shows that AMD's 65nm process is worse than the 90nm process. The data does not show that. The wattage profile for ADO5000IAA5DD which is 65nm is the best of all processors in the chart. The 65nm chips show similar wattage profiles to the older 90nm 65 watt chips.

He implies that the new 3600+ is an indication that 65nm is worse. However, if you compare this profile with the older 65 watt 90nm parts you can see that its profile would have been the same. Presumably the reason AMD did not release a 3600+ part on the older process was that similar wattage profile 90nm chips were simply sold as 89 watt 3800+.

Secondly, he claims that 65nm is stuck at 2.6Ghz which is again not indicated by the data. There is nothing that would for example exclude a 2.8Ghz 89 watt 65nm chip. It appears that AMD simply has not seen any reason to replace the lowest volume bins (2.8 & 3.0 Ghz) with 65nm chips at the same TDP (89 watts). I would agree that the savings to AMD for doing this are not currently compelling.

"It also makes me believe that the rumours of AMD ditching SoI for 32nm and below might actually be true."

It's possible. Like other potential technologies (FinFET and Tri-Gate for example); this would be dependent on what the current leading research is showing. We should be hearing something about 32nm gates before too long.

Scientia from AMDZone said...

ho ho

If you want to get a better idea about technology then you should probably look at something like The Road to 32 nm.

Presumably, this is the type of thing you are talking about.

Perhaps the big news of the evening was that when asked if SoI would still be advantageous at 32 nm, AMD’s Krivokapic declined to comment.

Shahidi from IBM, another present-day SoI champion, stepped in to say that SoI would still offer shallower junctions and fewer isolation wells, but admitted that it would continue to present barriers to the very circuit designers everyone was counting on to bail them out on performance.


So, AMD may indeed drop SOI. However, other comments are interesting too but you can ignore those because they contradict Intel instead of AMD.

Asked if high-k dielectrics and metal gates were the answer, the panel was cautious, saying that the combination doesn’t buy much until 32 nm

Scientia from AMDZone said...

IBM’s Shahidi described 32 nm as, at least initially, a shrink of the industry 45 nm process, saying there would be no multi-gate transistors

Semiconductor International

Ghavam Shahidi, IBM fellow, has been outspoken on the topic of finFETs, stating that if 3-D transistors were going to be used, they would have been used by now and, reiterating his stance at the Applied Materials' event at this year's IEDM, said that finFETs will not be used at the 32 nm node.

That would seem to rule out FinFETs.

Roborat, Ph. D. said...

scientia said: "roborat cherry picks items that favor Intel no matter how silly they are. I assume he is still claiming that AMD will sell its FABs in 2008."

can you please make a link to when i said AMD will sell any of its Fab in 2008?

...cherry picks items that favor Intel

yes like when i said Toshiba's adoption is good for AMD... like when I said in Q2 i expect AMD to gain back marketshare because of channel stuffing... etc, etc.
I pick on what i think has more significance.

Scientia from AMDZone said...

roborat

"can you please make a link to when i said AMD will sell any of its Fab in 2008?"

Intel Breaks AMD's Business Model

You quote Covello who did say that AMD would completely outsource in 2008. I don't see anywhere in your article that you disagree with his conclusions. However, your comments do seem supportive of his conclusions:

"An increasingly larger chunk of AMD's volume is no longer generating enough margins to warrant in-house production.

In fact the strategy worked so well that Intel even managed to deny AMD the financial mans to run 2 Fabs on a day-to-day basis.

AMD's earning projection shows red until the end of 2008. In order for AMD to survive, it needs to outsource.can mark this down as the beginning of the end for AMD. No semiconductor company has yet to return from outsourcing."


Again, your position seems to be strongly in favor of Covello's point of view. Covello was engaging in a completely artificial speculation to pump Intel stock. However, your article doesn't seem to recognize this.

BTW, your above statements contain several errors. However, I realize that you may have changed your mind since then as this was before the Q2 results were known.

"yes like when i said Toshiba's adoption is good for AMD... like when I said in Q2 i expect AMD to gain back marketshare because of channel stuffing... etc, etc.
I pick on what i think has more significance."


And, I've also pointed out that once strong AMD users Tiger Direct and eMachines have dropped AMD processors from their offerings. Yet, you constantly try to portray me as some kind of AMD fanatic.

I get tired of people who have strong biases but can't make any arguments for their position. Typically these people then resort to ad hominem attacks. I'm called a fanboy and then compared to Sharikou. What is so insane about this is that when I realized that Sharikou was linking to my thermal article to suggest that Intel won't do well in the future I had to go there and point out that my article covers 2006 and says nothing about clock speeds toward the end of this year.

I understand that people don't like having their posts deleted or edited but if you've seen Sharikou's blog lately then you know what a flame mentality can do to the comments. I'll take exception if anyone is called a fanboy whether they like AMD or Intel yet that is nothing compared to the insults at Sharikou's blog.

If you want to discuss AMD's current position, that would be fine. And, BTW, I was estimating a loss of $550 Million this quarter so AMD did a bit worse. However, I do see improvement in other areas.

Scientia from AMDZone said...

roborat

I am working on a critique of your Q2 article but I'm already at 600 words and I haven't gotten past your first paragraph. Overall though, your article is very, very negative towards AMD. You write as though you either have a personal grudge against AMD or you think Ruiz is lying about everything he says. I don't understand this attitude and I especially don't understand it if you claim to be fair. Your article seems very churlish.

abinstein said...

scientia -
"I've read Ed's comments (which were a waste of time)."

Yup, it has always been so for me. Now I rarely bother to read his articles any more, and every once in a while when I decide to read one I regret the waste.

Ho Ho -
"It also makes me believe that the rumours of AMD ditching SoI for 32nm and below might actually be true."
How does an apparent rumor make you believe another rumor?

azmount -
"AMD's 65nm SOI is much worse than TSMC's bulk 65nm."

Is this a joke or simply dumb FUD? Which single chip out of TSMC today could even touch the complexity and performance of a 2.6GHz dual-core K8? Or a 2.0GHz quad-core K10?

Scientia from AMDZone said...

abinstein

I assumed azmount was joking. His post sounded way over the top.

Also, even though the link that ho ho gave means nothing, I posted a link to edn.com which has comments from all of the top FAB players. And, it does give serious suggestion that SOI may be dropped at 32nm. Not only that but AMD repeated this at the Earnings conference. SOI is not definite for 32nm at IBM or AMD.

AndyW35 said...

According to Digitimes there will be a 6400+ being released at some point which is 3.2GHz supposedly. It will be interesting to see whether this is still on the 90nm process or the 65nm.

abinstein said...

"I posted link to edn.com which has comments from all of the top FAB players. And, it does give serious suggestion that SOI may be dropped at 32nm."

Yes, it's quite obvious from the link and AMD's Q2 conference call that, for 32nm, the company is still evaluating the choice between SOI and bulk. Saying "SOI will be dropped at 32nm" is too far-fetched, while at this moment it's not even certain whether 32nm will be advantageous performance- and power-wise.

yomamafor2 said...

Its quite laughable for someone who has absolutely no idea what he's talking about, but tries to explain (or spin?) everything professionally.

Scientia, although I do respect some of your works, but what you're posting here is balant lies, based on...nothing. Some, if not most of your statements are completely groundless. How do you know a 2.4Ghz K10 cannot match a Penryn at 3.2Ghz? Do you have the benchmark to prove it? How do you know the K10 65nm yield is good now? Do you just intrinstically know that?

Also, your lack of computer architecture backgroun is apparent when you tried to argue Intel's FSB approach with AMD's HTT. There is a reason why Intel uses larger amount of L2 cache, while still keeping the FSB approach. Clearly, you did not spend much time understanding Intel's platform at all.

Although your posts are slightly better than brainless AMD fanboys' (like sharikou), but afterall, you still do not possess the knowledge even be considered as an enthusiast.

Ho Ho said...

scientia
"Basically, with only 4 FBDIMM channels Intel is going to take a big hit in memory bandwidth on 4-way."

I don't think it is that bad. If I read the charts correctly it will have huge cache in northbridge to ease the load. Also AMD has a little disadvantage thanks to NUMA if for some reason some application (database) stores its data in one memory pool and other CPUs have to access it. Intel has no such problems.


"His first claim is that the data shows that AMD's 65nm process is worse than the 90nm process. The data does not show that"

What does the data show then?


"The wattage profile for ADO5000IAA5DD which is 65nm is the best of all processors in the chart"

Well, he did say that the lower leakage parts are used for the high end and that one is the lower leakage one. What can you say about ADO3600IAA5DD and ADO4000IAA5DD?


"However, if you compare this profile with the older 65 watt 90nm parts you can see that it's profile would have been the same"

I must be doing something wrong since this is not what I see. Can you explain how exactly you got the result? Some calculations would be nice to see.


"There is nothing that would for example exclude a 2.8Ghz 89 watt 65nm chip"

This is pure speculation but perhaps it simply can't (easily) produce higher wattage 65nm K8's? I guess we'll see soon enough once we know if the 3.2GHz 6400+ with double the L2 cache is still 90nm or 65nm.


abinstein
"Which single chip out of TSMC today could even touch the complexity and performance of a 2.6GHz dual-core K8? Or a 2.0GHz quad-core K10?"

Aren't R600 and G80 produced at TSMC? Sure, they are not 65nm but 80nm and 90nm, but dies of nearly or over 700M transistors are still huge compared to even K10. Not to mention they have much more die space allocated to computational units than CPUs.

Décío Luiz Gazzoni Filho said...

The main difference that I can see is that Puma is not designed around massive throughput SSE as the desktop and server chips are.

Massive throughput? You mean the 2 64-bit SSE2 units of K8, which apparently carried over to K10 according to the recently published K10 optimization guide?

Now Core 2 Duo with its 3 128-bit SSE2 units, that might be called massive indeed. Theoretical peak throughput 50% higher than K8's, and in practice much better due to half the latency. To put things in perspective, I recently wrote some SSE2 code which does equally well on Core Duo and K8, despite the fact that Core Duo has a single SSE unit (a 128-bit one at that though), and of course it's twice as fast on Core 2 Duo.

Aguia said...

ho ho,
It is obvious its 90nm.

F3 revision

It’s easy to achieve 3.3Ghz on that revision.

I have friends with X2 3800+ F3 doing OC to 3.0 GHz with zero efforts.

Scientia from AMDZone said...

yomamafor2

"but what you're posting here is balant lies, based on...nothing."

I really have to laugh at that. How many web writers picked up on the idea that AMD would be fully outsourced in 2008 based on nothing but a memo from Covello? How many web writers are still writing stories based on this absurd idea?

"your statements are completely groundless."

Perhaps I should explain the concept of redundancy to you. In your post you have:

"laughable", "absolutely no idea", "spin", "balant lies", "based on...nothing", "completely groundless"

Generally when people have to repeat things this many times their writing contains very few actual points.

" How do you know a 2.4Ghz K10 cannot match a Penryn at 3.2Ghz?"

I would have no way of knowing that for certain. What I say is based on different pieces of information and some projection or estimation on my part. Is it possible that a 2.4Ghz K10 could match a 3.2Ghz Penryn? Yes. Is it likely? No.

To hit this level of performance K10 would need a 50% increase in IPC which would be a miracle if AMD could do it. K8 increased IPC by 20% compared to K7 and C2D seems to have increased IPC by 20% compared to Yonah. So, how likely is 50%?
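The clock arithmetic behind this can be sketched as follows, using the simple model performance = IPC x clock. The function name and the per-clock advantage parameter are hypothetical. The clock ratio alone calls for about a 33% IPC gain; the 50% figure is what you get if you additionally assume the faster chip already has roughly a 12.5% per-clock edge, which is an illustrative assumption and not a measured number.

```python
# Simple model: performance = IPC x clock speed.
# required gain = (fast clock / slow clock) x rival per-clock edge - 1
# The per-clock edge values below are assumptions for illustration only.

def required_ipc_gain(fast_clock, slow_clock, rival_ipc_edge=1.0):
    """IPC increase the slower-clocked chip needs to match the faster one."""
    return (fast_clock / slow_clock) * rival_ipc_edge - 1

clock_only = required_ipc_gain(3.2, 2.4)            # equal IPC to start with
with_edge = required_ipc_gain(3.2, 2.4, 1.125)      # assumed 12.5% per-clock edge for the rival

print(f"clock gap alone: {clock_only:.1%}")
print(f"with assumed per-clock edge: {with_edge:.1%}")
```

Either way the required gain is well beyond the roughly 20% generational improvements mentioned above.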

"How do you know the K10 65nm yield is good now? Do you just intrinstically know that?"

Krishna Shankar - JMP Securities: just following up on Barcelona, you said would be up to 2 gigs, is it the design or is it the yield?

Dirk R. Meyer: First of all, we’re very happy with our 65-nanometer yields across all products, including Barcelona, so no issue there. The fact of the matter, Barcelona, while being an absolutely great product, is complicated and it’s taking a little bit more design work than we anticipated getting the final revision in place.

"Also, your lack of computer architecture backgroun is apparent"

Looking at my bookcase, my oldest book on computer architecture is "From Chips to Systems" by Rodnay Zaks, published 1981, 550 pages.

" when you tried to argue Intel's FSB approach with AMD's HTT."

Could you be a little more specific?

"There is a reason why Intel uses larger amount of L2 cache, while still keeping the FSB approach."

Yes, there are several reasons:

1.) The cancellation of Whitefield delayed the introduction of CSI.

2.) Intel makes money off of chipsets. Dual and quad FSB chipsets help stave off competition because they are hard to make. For example, AMD's dual FSB 760 MP chipset was the only one produced for Athlon MP.

3.) Shared bus architectures create bottlenecks which are helped by additional cache. This was true for Pentium Pro and is still true today.

4.) Cache also helps with memory latency. Again, this goes all the way back to when L1 cache was off-die but was still faster than main memory.

5.) There are also differences in optimal L2 cache size because of Intel's inclusive versus AMD's exclusive cache design. And, there are differences due to shared L2 versus independent L2.

"Clearly, you did not spend much time understanding Intel's platform at all."

You said that there are several reasons but didn't actually mention any. I said there were several reasons and then listed them. Notice the difference?

"you still do not possess the knowledge even be considered as an enthusiast."

I'm still waiting for the substance in your post.

Scientia from AMDZone said...

ho ho

"it will have huge cache in northbridge to ease the load."

Yes, essentially off-die L3.

" Also AMD has a little disadvantage thanks to NUMA if for some reason some application (database) stores its data in one memory pool and other CPUs have to access it. Intel has no such problems."

This is indeed true for blocks of data that are much larger than the cache, like a database, for example. A properly designed NUMA system would allow each processor to access its own piece of the database while a non-NUMA system could have cores chasing all over. This shouldn't be an issue on a single socket quad core K10 but obviously could be for 2-way and up.

"What does tha data show then?"

65 watt 3600+ is not something special for 65nm. AMD sold these same profile chips on 90nm as 89 watt 3800+. However, with changes in prices and greater demand for 3600+ it makes sense to add it with 65nm.

AMD currently has no incentive to replace a low volume 2.8 or 3.0Ghz chip with a 65nm chip with the same TDP. Remember that FAB30 will still produce 90nm chips until 2008.

I have to admit that I'm uncertain what wattage chips AMD is currently pushing for. Some of what they said suggests that they consider 65 watts to be their primary market. So, this brings up the question of whether the lack of 89 watt chips is due to a problem with the 65nm process or because AMD is concentrating on 65 watt parts.

"I guess we'll see soon enough once we know if the 3.2GHz 6400+ with double the L2 cache is still 90nm or 65nm."

No. There are currently no 2.8 or 3.0Ghz 65nm chips so I can't see a 3.2Ghz 65nm. This will be 90nm. Presumably, this should also allow 3.0Ghz to step down to 89 watts and 2.8Ghz to 65 watts.

Again, 3.2Ghz is low volume so the savings to AMD with 65nm would be tiny even if 65nm were up to it. What we need to watch for is if 65nm moves up to 2.8Ghz at 65 watts.

"Aren't R600 and G80 produced at TSMC? Sure, they are not 65 but 80 and 90nm but still over/nearly 700M transistor dies are a huge compared to even K10. Not to mention they have much more die space allocated to computational units than CPUs."

GPU's are brute force computation; this is far less cutting edge than processors from AMD or Intel.

Scientia from AMDZone said...

decio

"Massive throughput? You mean the 2 64-bit SSE2 units of K8, which apparently carried over to K10 according to the recently published K10 optimization guide?"

I don't know what guide you are reading. In the current Software Optimization Guide for AMD Family 10h Processors

Page 216, SSE/FP units: FADD, FMUL, FSTORE

Page 222: execution units that are each capable of computing and delivering results of up to 128 bits per cycle.

Ho Ho said...

scientia
"A properly designed NUMA system would allow each processor to access its own piece of the database while a non-NUMA system could have cores chasing all over."

That way you can have other problems. For example, when for some reason a huge number of requests come in that all need data from a single memory pool, you are screwed. No such problems on UMA as all data is equally accessible to all CPUs.

Btw, could you describe in a couple of sentences how an UMA system has "cores chasing all over" and NUMA doesn't?


I'm not saying that UMA is always better than NUMA, just that they both have their strengths and weaknesses. Though I think you agree when I say that with similar total bandwidth UMA is considerably better than NUMA. Of course it is much harder to design such an UMA system that can compete with 2P+ NUMA systems in terms of total system bandwidth, though FBDIMM helps quite a bit in that area.



"This shouldn't be an issue on a single socket quad core K10 but obviously could be for 2-way and up."

FSB throughput isn't too big of a problem on 1P either.


"65 watt 3600+ is not something special for 65nm."

You didn't answer my question. I was asking what kind of power usage/leakage you see there. Was the author correct or not when he claimed that lower clocked 65nm K8's take a whole lot of power in order to work?


"GPU's are brute force computation; this is far less cutting edge than processors from AMD or Intel."

What exactly is the difference and how does it make them so much simpler to produce?

CPUs have lots of cache and you can map some defective parts of it to working ones, much like HDDs do with bad sectors. GPUs have very little cache compared to their FP power. Sure, they do have lots of similar parts (quads) in one die but so what? K10 also has four identical parts. K10 has over 4MiB of caches and a die size of a bit less than 300mm^2. G80 has considerably less cache (0.5MiB?) and a nearly 500mm^2 die size at 90nm with a considerable number of transistors working at 1.5GHz, not much slower than K10. G80 die size is not that much smaller than the biggest CPU dies ever made (Power, Itanium) and those had most of the die made up of caches.

If you call 1024bit ring bus or G80 thread scheduler that feeds a total of 128 individual execution cores simple then please tell me what part of an x86 is considerably more complicated than the things found in GPUs? 128bit SSE? Integrated 128bit memory controller? Crossbar switch? Something else?

Yes, you can disable a few quads and sell them as lower end parts but I'm not talking about those. There are lots of high-end non-crippled GPUs out there so it is very much possible to produce such complicated things. NV seems to be making excellent profits with its huge die.

Another interesting thing is that NV was capable of putting a huge number of FP units working at extremely high frequencies without using too much power. Of course those FP units are not exactly the same as in CPUs but there are far more of them.



Perhaps this is only my bad memory, sleepiness and illness combined, but when trying to remember the chips that have had production problems during the last few years it seems to me that SOI is more common than bulk silicon. Power5 and Cell are the first that came to my mind, though Cell seems to be doing better now.


Btw, has anyone seen any power usage measurements done on overclocked AMD CPUs? I'd be mostly interested in >3GHz clocks and lower end 65nm ones.

lex said...

Scientia has a nice long article but no prediction, or did I miss it?

Here is my prediction; it's pretty simple: AMD is finished. AMD will never rise again to the glory years of 2003-2006. Some say never say never, and they have but one chance.

Let's first talk about where AMD is and why they have no hope.
1) 11.4% plus or minus a bit on market share. That simply isn't enough market share to fund competitive R&D.
2) Debt burden of 3.67 BILLION dollars; combined with 1), that is two strikes.
3) Cash of only 1.6 billion vs INTEL at 8.9 BILLION means no reserve to mount a counter offensive: third strike.
4) R&D of 1.2 billion compared to INTEL R&D of 8.9 BILLION: you're OUT, game over for AMD.

Now if this was like the car business AMD might be able to profit and grow as a small fry. But this is the x86 business and it's different, and here is how.

In all segments it's always about performance/dollar. If you aren't competitive you are just kicked to the next lowest level that sells for cheaper. There is nothing special about the high end or server segment that protects you somehow. There is no such thing as installed base, services or other elements that lock your customers in. Look what happened to INTEL in the high end when AMD got a credible part. Look what happened to AMD in the gamer arena when Core2 came out. The only thing that matters is performance. If you don't have it you sell down, and boy is it painful. Look at the Opteron as an example. AMD had to price what was really the same silicon in 2007 at 30% - 50% of what they sold it for in 2006, as the Core2 took over the market. You go from making high margins to nothing pretty quickly if you don't have the performance crown.

Thus AMD can't hide ANYWHERE within the x86 market. They can't target high end, performance/enthusiast or even the low end. In the end they must compete with leading edge technology. Even ramping 65nm won't give them enough. INTEL has 4 300mm 65nm factories at the end of ramp with what I can only assume are world class yields and cost. AMD is just starting. There is no substitute for volume learning. Now to stay competitive you need to be on the leading edge and use that leading edge to feed down the stack.

Let's examine what it takes to have a leading edge design.

1) You need a leading edge design team. Both companies have that. The difference is INTEL has about 3x more designers and thus are now doing this Tick-Tock that AMD simply can't match. AMD must now bet the farm again taking huge risks while INTEL can take the incremental low risk Tick and the larger Tock every two years and always be close to way ahead. There is no secret weapon here. Architecture options in CPU design are well studied and documented, and the choice you take is all about tradeoffs for the platform and constraints of your silicon. No matter what, the same design approach will always be better with more advanced silicon.

Assessment: Advantage is INTEL with multiple teams permitting a rapid evolution/revolution cadence that AMD simply can't match.

2) To produce competitive parts you need great silicon. These days in the age of nano-technology you really need top to bottom integration of the whole process; everything from suppliers, masks, to core R&D, and up to and including manufacturing. If you look around, this type of stuff takes billions a year in R&D. The likes of TI, Freescale and others have pretty much bowed out. IBM can't even justify the money it spends and amortizes its development through its consortiums. Thus INTEL holds a huge advantage in tightly controlling and optimizing everything from the supplier all the way to the mega billion factory to be optimized for one thing and one thing only, to produce the fastest CPUs.
AMD with 1/6 the R&D budget is now relegated to partnering, and as with all partnerships it's a compromise. Sure they draft off of IBM and Chartered but they don't get that customized process. They have to pay, as Chartered and IBM are both in the business of making money back from these alliances. Thus any profit AMD has to share. Already they are behind and Chartered is a 2nd class foundry. Start behind, working with compromise, and with no more money to spend on catching up. I don't care about the AMD hype about being on time for 65nm, they are late to 65nm by a good year, and will be late to 45nm too. You simply can't pull a rabbit out of your hat or buy the technology. It takes years and billions to get ready. AMD will be late to 45nm and even later to HighK/Metal Gate, making their products even more uncompetitive in 2nd half 2008.

Assessment: Huge advantage to INTEL with AMD having no chance here.

Summary: Competing with a smaller design team at a slower cadence, running designs on an inferior process, means AMD is relegated for probably 18 out of the 24 month cycle to be uncompetitive across most product lines.


If you look at the situation between 2006-2007, it represents a good snapshot of how things will pan out in the future. We saw Tick Tock for Core and Core 2. AMD had a tail window of competitiveness using older process technology and more advanced design. Then INTEL answered with new technology and new architecture and crushed them. AMD is up to the plate and is showing the challenge/failure of being too small (not enough money/resources). Barcelona is slipping and by the time they get it INTEL's next Tick will be more than a match.

AMD's future is going to be 4-6 quarters of huge losses and a couple quarters of slight profits. That is in no way sustainable.

If you look back at the last 12 months you have seen AMD drift from 100M profitability to 3 quarters losing in excess of 500 million. They have 3.9 Billion debt, are still behind in technology, haven't ramped their next generation product yet, and have clearly shown the first few months with Barcelona will not offer performance leadership. Simply put AMD is totally screwed. Got no cash reserves, got no technology silver bullet, got no design advantage. Again to look back is to see the future: an unsustainable business situation.

The only thing AMD has got going for it is that OEMs will always buy a little AMD to keep INTEL honest. The problem for them is that the little AMD they buy must still be price / performance competitive. Thus until AMD gets a product that can run with Penryn/Nehalem at 45nm (remember they need to do this on a 65nm technology, making it with something that is inherently slower and bigger) it will have no pricing leverage.

Bottom line is AMD is going to be relegated to be in the red and on life support. There simply is NO business model where they can return to leadership again or even sustained profitability. Of course there is the small chance that the 45nm ramp crashes at INTEL, that Nehalem is a disaster, or that there are surprises in Penryn that haven't yet surfaced. Thus AMD's only hope for a window to profitability is that their competitors screw up? That truly is a bad business situation!


What is to become of AMD then?

Scientia from AMDZone said...

ho ho

"Btw, could you describe in a couple of sentences how an UMA system has "cores chasing all over" and NUMA doesn't?"

A distributed memory Opteron running on a non-NUMA OS could easily have cores chasing all over.

The problem a UMA system could have is when the searches could be split up into pieces for each socket but UMA would bog down because of one bus. However, this is less of a factor now with multi-FSB chipsets using multiple memory channels. Basically, once AMD has a quad core chip and Intel has a quad FSB the differences will be less pronounced.

"Was the author correct or not when he claimed that lower clocked 65nm K8's take a whole lot of power in order to work?"

Yes, just as they did with 90nm. This is not a difference with 65nm.

"Sure, they do have lots of similar parts (quads) in one die but so what? K10 also has four identical parts."

Would you like to compare the differences in:

Branch Prediction
Instruction Pre-decoding
Instruction Decoding
Stack Maintenance
OoO Operation
Pipeline Length

These are not even remotely similar. True CPU's are general purpose processors which have very sophisticated logic and instruction flow capabilities. GPU's on the other hand have massive parallel data processing capability but very limited logic and instruction flow.

Décío Luiz Gazzoni Filho said...

I don't know what guide you are reading. In the current Software Optimization Guide for AMD Family 10h Processors

Page 216, SSE/FP units: FADD, FMUL, FSTORE

Page 222: execution units that are each capable of computing and delivering results of up to 128 bits per cycle.


I'm reading Table 15, instruction latencies for SSE2 instructions. Not a single instruction outputs its result in a single cycle, even trivially implementable ones like PAND/POR/PXOR which still require 2 cycles. That smells like either 64-bit units or pipelined 128-bit units.

And while K10 may technically have three SSE units, Core 2 Duo's SSE units are more general purpose. In K10, most instructions can be executed in only one or two of the available execution units. Very few can be executed on all three. On the other hand, Core 2 Duo can execute many important instructions (integer addition/subtraction/logic operations/loads/stores) on all three.

Scientia from AMDZone said...

lex

"11.4% plus or minus a bit on market share."

AMD's current processor market share by revenue is 15.8%. And, this can increase.

"R&D of 1.2 billion compared to INTEL R&D of 8.9 BILLION"

Good point. But, your numbers aren't even close. Let's use the average between Q1 and Q2.

Intel $1,425 Million/Quarter, $5.7 Billion/Year
AMD $454 Million/Quarter, $1.8 Billion/Year

So, Intel spends about 3.2X as much as AMD. However, Intel does have to pay for R&D for Itanium. It also has to pay for motherboard design, and its dual FSB and quad FSB chipsets are much more complex to design. And, AMD shares FAB process development costs with IBM.

"Look at the Opteron as an example. AMD had to price what was really the same silicon in 2007 at 30% - 50% of what they sold it for in 2006, as the Core2 took over the market. You go from making high margins to nothing pretty quickly."
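For anyone who wants to redo the R&D comparison, here is a small sketch using only the two quarterly averages quoted above. The variable names are mine. The unrounded ratio comes out a little over 3x, and rounding the annual totals to one decimal first gives a slightly higher figure.

```python
# Quarterly R&D averages quoted in the comment, in millions of dollars.
intel_rd_quarter = 1425
amd_rd_quarter = 454

# Annualize by assuming four quarters at the same average rate.
intel_rd_year = intel_rd_quarter * 4
amd_rd_year = amd_rd_quarter * 4

# Ratio of Intel's spending to AMD's.
rd_ratio = intel_rd_quarter / amd_rd_quarter

print(f"Intel: ${intel_rd_year / 1000:.1f} Billion/Year")
print(f"AMD:   ${amd_rd_year / 1000:.1f} Billion/Year")
print(f"Intel spends about {rd_ratio:.2f}X as much as AMD")
```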

AMD's average gross margin for 2006 was 50.7% while Intel's was 51.5%.

AMD's gross margin for Q2 2007 is 33.5% while Intel's is 46.9%.

For Q2: AMD's average profit per chip is 66% of 2006 while Intel's is 91%.

So, AMD is down 28% compared to Intel. That is a long way from down 50-70%. Also, AMD's costs should drop at an increased rate now that FAB30 is going down. AMD's gross margin could be 38% next quarter.
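The margin comparison can be reproduced with a few lines. This is only a sketch: it follows the comment in treating gross margin percentage as a stand-in for average profit per chip, which is an assumption, and the variable names are mine.

```python
# Gross margin percentages quoted above.
amd_gm_2006, amd_gm_q2_2007 = 50.7, 33.5
intel_gm_2006, intel_gm_q2_2007 = 51.5, 46.9

# Fraction of the 2006 margin each company still has in Q2 2007.
amd_retained = amd_gm_q2_2007 / amd_gm_2006
intel_retained = intel_gm_q2_2007 / intel_gm_2006

# How far AMD has fallen relative to Intel over the same period.
amd_vs_intel_shortfall = 1 - amd_retained / intel_retained

print(f"AMD retains   {amd_retained:.0%} of its 2006 margin")
print(f"Intel retains {intel_retained:.0%} of its 2006 margin")
print(f"AMD is down about {amd_vs_intel_shortfall:.1%} relative to Intel")
```

Depending on where you round, the relative drop lands at roughly 27% to 28%, still well short of 50-70%.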

"INTEL has 4 300mm 65nm factories at the end of ramp with what I can only assume are world class yields and cost."

Okay. Intel will be ramping four 300mm 45nm FABs in 2008 while AMD will ramp two. If these were equal this would give AMD 33% of the total. Even if AMD's two FABs were only equivalent to 1.5 of Intel's this would still mean 27% for AMD.
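The FAB share arithmetic is simple enough to sketch. The function name and the idea of expressing AMD's FABs in "Intel-equivalent" units are mine, used only to mirror the two cases in the paragraph above; it says nothing about actual wafer output.

```python
def capacity_share(amd_fab_equivalents, intel_fabs):
    """AMD's share of combined capacity, with AMD's FABs expressed
    in units of one Intel FAB (a simplifying assumption)."""
    return amd_fab_equivalents / (amd_fab_equivalents + intel_fabs)

# Two AMD FABs counted as equal to Intel's four.
equal_case = capacity_share(2, 4)

# Two AMD FABs counted as only 1.5 Intel FABs.
discounted_case = capacity_share(1.5, 4)

print(f"equal FABs:      {equal_case:.0%}")
print(f"discounted FABs: {discounted_case:.0%}")
```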

"There is no substitute for volume learning."

AMD does get some advantage by having two 300mm FABs in the same place. This gives AMD more flexibility in production.

"thus are now doing this Tick-Tock that AMD simply can't match."

Wrong. AMD will have a modular core in 2008.

"AMD must now bet the farm again taking huge risks while INTEL can take the incremental low risk Tick and the larger Tock every two years and always be close to way ahead."

I don't think you understand. The maximum theoretical update rate for AMD is once every three months. If AMD can make updates every eight months, for example, they might be able to keep up with Intel.

"If you look at the situation between 2006-2007, it represents a good snapshot of how things will pan out in the future. We saw Tick Tock for Core and Core 2."

No. Core 2 Duo is a four year design. The very first example of the tick tock process is Penryn.

Scientia from AMDZone said...

decio

Well, at least you are trying to find the right information. I've seen lots of people look at the SuperPi results and proclaim that K10 is no faster than K8. Yet the Guide clearly states:

Page 7: The new AMD Family 10h processors add support for 128-bit floating-point execution units. As a result, the throughput of both single-precision and double-precision floating-point SSEx vector operations has improved by 2X over the previous generation of AMD processors.

"I'm reading Table 15, instruction latencies for SSE2 instructions."

Okay, now I understand your mistake. You should be looking at Throughput instead of Latency. The definition of Throughput on page 232:

Throughput: This value indicates the maximum theoretical rate of execution of that instruction. For example, a value of 1/2 means that one such instruction executes every two clocks, or two such instructions in four clocks and so on. A value of 3/1 indicates that three such instructions can be executed every clock, but fewer than three such instructions would still take one clock.

"Not a single instruction outputs its result in a single cycle, even trivially implementable ones like PAND/POR/PXOR which still require 2 cycles."

No. You can execute 2 PAND, POR, or PXOR per clock. This is because these instructions can execute on either FADD or FMUL. Most instructions are 1 per clock.

"On the other hand, Core 2 Duo can execute many important instructions (integer addition/subtraction/logic operations/loads/stores) on all three."

The problem is though that you only have enough bus bandwidth on C2D to load 2 instructions per clock. So, even if you can execute 3 there is no way to load 3.

abinstein said...

Ho Ho -
"Aren't R600 and G80 produced at TSMC? Sure, they are not 65 but 80 and 90nm but still over/nearly 700M transistor dies are a huge compared to even K10."

Wrong. A larger die doesn't mean higher complexity. GPUs are highly regular circuits, and they are much easier to design and fabricate than any modern CPU.

Aguia said...

Décío Luiz Gazzoni Filho,

AMD K10 SSE units

abinstein said...

decio -

As a start, I'll just say that your comments on Core 2 and K10 instruction latency and throughput are very wrong.

"Massive throughput? You mean the 2 64-bit SSE2 units of K8, which apparently carried over to K10 according to the recently published K10 optimization guide?"

No, K10 has 128-bit SSE ALUs, whereas K8 has 64-bit ones. This gives K10 lower latency and twice the throughput on most 128-bit SSE instructions. For example:

INST     K8_LAT   K8_THU   K10_LAT   K10_THU
ADDPD    5        1/2      4         1
ANDNPD   3        1/2      2         2
DIVPD    37       1/34     20        1/17

You may cry foul that K10 does not reduce latency much from K8 for ADDPD. This is because there are "tricks" (which are apparently used in K8) to make two 64-bit ALUs do a 128-bit calculation in less than twice the 64-bit latency. Without such tricks, K8 would have had an ADDPD latency of 7 clock cycles.


"Now Core 2 Duo with its 3 128-bit SSE2 units, that might be called massive indeed."

Out of 6 ports of Core 2's execution core, 3 can execute SIMD uops, but only one can do SIMD MUL, two SIMD ADD, one FP MUL and one FP ADD. Compared to K10 the max advantage is one SIMD ADD per cycle (assume 1-cycle latency).


"Theoretical peak throughput 50% higher than K8's, and in practice much better due to half the latency."

This statement is not correct in general. It is probably true for (integer) SIMD instructions, but not for floating-point, where Core 2 has just marginally better throughput than K8 even with wider ALUs.

abinstein said...

Aguia -
"AMD K10 SSE units"

??? Am I missing something? That page doesn't say anything about K10 SSE!

Ho Ho said...

abinstein
"This statement is not correct in general. It is probably true for (integer) SIMD instructions, but not for floating-point, where Core 2 has just marginally better throughput than K8 even with wider ALUs."

Can you explain why I've seen 2x faster execution of an FP32 SSE2 ray tracer on Core2 compared to K8 at the same clock speed? Or perhaps you are only thinking of 64bit SSE performance?

Aguia said...

Really abinstein?!

However, stating that the data cache bandwidth is twice as high as Intel's Core is ignoring a few things. Eric Bron, probably one of the most knowledgeable developers when it comes to SSE, stated: "Intel Core can sustain one 128-bit load and one 128-bit store per cycle (I've measured actual timings very near this theoretical peak), so Core can copy 128 bits per cycle. Barcelona (K10) can only copy 64 bits per cycle from the above store bandwidth limitation." So the twice as much "load bandwidth" is only a small part of the story:

* Intel Core can do a 128-bit Load and 128-bit Store in one cycle if possible
* AMD's K10 can either do two 128-bit loads, or two 64-bit stores, or one 128-bit Load and one 64-bit Store

Depending on the situation, AMD's K10 can do twice as much, about equal or about 33% less work in one cycle. So you cannot conclude that the AMD K10 has twice as much SSE bandwidth as the Intel quad core Xeon. It will only be faster if loads happen twice as often (or more) as Stores. In most "harder to vectorize" FP code, this is the case, so here the K10 chip will probably win by a small margin (as the percentage of SSE code is low). An example of this is the SpecFP benchmark. In some "easy to vectorize" SSE code this is not the case, and in that case the K10 will probably not be beaten per clock cycle, but the clock speed disadvantage might give the Xeon the edge.
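The three cases in the quoted passage (twice as much, about equal, about 33% less) fall out of a very simple port model. The sketch below is my own simplification of the two bullet points: Core gets one 128-bit load and one 128-bit store per cycle, while K10 gets two slots per cycle, each holding either a 128-bit load or a 64-bit store, so a 128-bit store costs two slots. It ignores caches, alignment, and everything else.

```python
def core_cycles(loads, stores):
    """Cycles on Core: one 128-bit load and one 128-bit store per cycle."""
    return max(loads, stores)

def k10_cycles(loads, stores):
    """Cycles on K10: two slots per cycle; a 128-bit load uses one slot,
    a 128-bit store uses two (it is split into two 64-bit stores)."""
    return (loads + 2 * stores) / 2

def k10_work_per_cycle_vs_core(loads, stores):
    """K10's work per cycle relative to Core for a given 128-bit op mix."""
    return core_cycles(loads, stores) / k10_cycles(loads, stores)

# (128-bit loads, 128-bit stores) for three illustrative mixes.
for mix in [(2, 0), (2, 1), (1, 1)]:
    ratio = k10_work_per_cycle_vs_core(*mix)
    print(f"loads={mix[0]} stores={mix[1]}: K10 does {ratio:.2f}x Core's work per cycle")
```

Under this model the break-even point is exactly two loads per store, which matches the "loads happen twice as often (or more) as Stores" condition in the quote.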

Décío Luiz Gazzoni Filho said...

Okay, now I understand your mistake. You should be looking at Throughput instead of Latency.

First of all, I do understand that K10 can execute two SSE instructions per cycle. Look at the second sentence of my first post.

As for what I should be looking at, you're wrong -- you have to look at both throughput and latency, they're equally important. Throughput may determine peak execution rate, but when you have high latency, you won't ever reach the peak rate because you just can't fill the execution units fast enough, you're just waiting for dependent instructions to complete. If your code has enough parallelism, you can do tricks like loop unrolling and software pipelining, but that increases register pressure and register spills waste instruction slots with useless loads and stores, further reducing peak execution rate. It's great that the new 64-bit instruction set has twice as many registers, except that you have to waste an extra byte on the REX prefix to access these registers. SSE instructions are long enough without an extra prefix, and the extra prefix will waste cache space and increase decode pressure.

What's your qualification to speak on this anyway? Have you written any optimized code I might have heard of? I wrote some cores for a distributed computing project you may have heard of, RC5-72 by distributed.net, now let's see your cards.

"Not a single instruction outputs its result in a single cycle, even trivially implementable ones like PAND/POR/PXOR which still require 2 cycles."

No. You can execute 2 PAND, POR, or PXOR per clock. This is because these instructions can execute on either FADD or FMUL. Most instructions are 1 per clock.


Yes you can execute two instructions per clock (as I said above). The thing is that the result of an instruction which began execution at the i-th cycle won't be available in the next cycle, but at best only in the (i+2)-th cycle. Worst case, fully serial code with dependencies between every instruction, not only will you execute a single instruction per cycle, but the execution unit will go idle every other cycle while it waits for the result of the previous instruction. Hence a worst case IPC of 0.5. Even Core Duo does better than that since it executes quite a few instructions with single-cycle latency.
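To make the worst case concrete, here is a toy model of the two extremes. It is a sketch with hypothetical function names, it assumes an ideal scheduler, and it ignores pipeline fill and drain, decode limits, and memory.

```python
def serial_chain_cycles(n_instructions, latency):
    """Every instruction needs the previous result, so each one waits
    the full latency before it can start."""
    return n_instructions * latency

def independent_cycles(n_instructions, per_clock):
    """No dependencies at all, so the units run at peak throughput."""
    return n_instructions / per_clock

n = 1000

# Latency 2, throughput 2 per clock (the PAND/POR/PXOR case discussed above).
worst_ipc = n / serial_chain_cycles(n, latency=2)
peak_ipc = n / independent_cycles(n, per_clock=2)

# The same serial chain with single-cycle latency.
single_cycle_ipc = n / serial_chain_cycles(n, latency=1)

print(f"fully serial, 2-cycle latency: IPC {worst_ipc}")
print(f"fully independent, 2 per clock: IPC {peak_ipc}")
print(f"fully serial, 1-cycle latency: IPC {single_cycle_ipc}")
```

Real code sits somewhere between the two extremes, which is why both numbers matter.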

You can argue all you want -- doesn't change the fact that I can write code which is twice as fast on Core 2 Duo as it is on K8, and I presume, K10 once it's out.

Azmount Aryl said...

Sci said...
The problem is though that you only have enough bus bandwidth on C2D to load 2 instructions per clock. So, even if you can execute 3 there is no way to load 3.



...unless of course we do more than one operation upon a register on average. Which is why, I believe, we have more than two registers?

InTheKnow said...

On Intel's part, the decision to push back 45nm production at the Chandler FAB to Q1 2008

Can you please provide a link to support this statement? I have been unable to find any supporting evidence on my searches of the web.

Scientia from AMDZone said...

decio

Okay, you don't believe AMD; maybe you'll believe Intel.

"you have to look at both throughput and latency"

Intel: Comparisons of latency and throughput data between different microarchitectures can be misleading.

"you're just waiting for dependent instructions to complete."

Intel: Instruction latency data is useful when tuning a dependency chain. However, dependency chains limit the out-of-order core’s ability to execute micro-ops in parallel.

Coding techniques that rely on instruction latency alone to influence the scheduling of instructions are likely to be sub-optimal

"It's great that the new 64-bit instruction set has twice as many registers, except that you have to waste an extra byte on the REX prefix to access these registers."

Well, I have to agree with you. If you write assembler code with lots of dependencies and avoid using the extra registers then C2D architecture will work better for you.

"I wrote some cores for a distributed computing project you may have heard of, RC5-72 by distributed.net"

You hand tuned the assembler code for RC5-72. Yes, that would explain it.

"You can argue all you want -- doesn't change the fact that I can write code which is twice as fast on Core 2 Duo as it is on K8"

Why would I argue with that? C2D is twice as fast in SSE as K8. Any good compiler can produce code that is twice as fast on C2D as either Core Duo or K8.

"and I presume, K10 once it's out. "

With the way you code, that may be true. However, any good compiler will produce code that is twice as fast on K10 as K8 as well.

Scientia from AMDZone said...

ho ho

I believe abinstein is distinguishing between x87 FP operations and SSE. In other words, x87 instructions are no faster on C2D while SSE instructions are. This becomes less of a factor as the use of x87 instructions is now discouraged.

aguia

Quite correct. C2D and K10 are not 1:1 comparisons. Sometimes the architecture of one is an advantage and sometimes the other is. What this means in practice is that the techniques that allow you to speed up code on C2D won't work for K10 and vice versa. This is why PGA will produce different code for each.

I would in general describe the SSE bandwidth of Core Duo and K8 as the same; and the SSE bandwidth of C2D and K10 as the same.

intheknow

That came from a forum from someone who said he worked at Nikon and had just been reassigned to his second Intel FAB location. I assume he is some kind of Nikon rep.

The comment about the Chandler FAB being pushed back to Q1 seemed believable because he was very pro-Intel. He stated that he thought the Twinscan 1700i units that AMD was getting from ASML were junk and that the Nikon units were much better (and better than Canon naturally). He believed that AMD would have much worse yields than Intel. It seemed that someone who was just making up stuff would more likely claim he worked for Intel rather than someplace like Nikon. His comments about canceled ASML orders and new Nikon orders also seemed consistent. I don't suppose it is a huge difference because I still expect 45nm units from D1D in Q4.

So, take it for what it's worth.

abinstein said...

Aguia -
"Really abinstein?!"

Yeah.. AnandTech managed to spend a whole paragraph without saying the right or important thing. I read the paragraph the first time but really, it doesn't describe K10's SSE, right? It describes (somewhat incorrectly) K10's SSE load/store. :)

abinstein said...

decio -
"First of all, I do understand that K10 can execute two SSE instructions per cycle. Look at the second sentence of my first post."

Potentially, K10 performs (2 SIMD logic/cycle) or (1 SIMD ADD/cycle + 1 SIMD logic/cycle), whereas Core 2 performs (3 SIMD logic/cycle) or (2 SIMD ADD/cycle + 1 SIMD logic/cycle). Thus Core 2 will have some advantage in programs that perform a lot of (integer) SIMD logic or ADD.

However, Core 2's SIMD share the same dispatch and retire ports as its general-purpose instructions. Unless the code is heavily SIMD the advantage is probably little.


"Throughput may determine peak execution rate, but when you have high latency, you won't ever reach the peak rate because you just can't fill the execution units fast enough, you're just waiting for dependent instructions to complete. "

This is true, but you are reaching the wrong conclusion. The latency specified in the K10 optimization manual is the number of clock cycles "to execute the serially dependent micro-ops." This can be higher than the macro-op dependence chain requires because most macro-ops can overlap their micro-op executions. For example, the 2-cycle latency of PAND in K10 consists of first fetching the register, then the actual logic operation. When two dependent PANDs are issued back-to-back, the result from the first instruction can be forwarded to the second, essentially overlapping the first arithmetic execution and the second operand fetch.

In the case of Intel's Core 2, the latency seems to be counted after the micro-op is sent to a given port. This means if two dependent micro-ops are sent to the same port, they can execute back-to-back with the specified latency delay. However, if two dependent micro-ops are sent to different ports, the latency will be higher. This is why Intel specifically says it is misleading to compare latency numbers across microarchitectures.

In terms of (SSE) floating-point instructions, Core 2 requires one extra cycle of bypass delay, so the latency should be one cycle higher than the number in the table. This is not a big problem since FP codes usually have lots of parallelism and can hide latency well.


"If your code has enough parallelism, you can do tricks like loop unrolling and software pipelining, but that increases register pressure and register spills waste instruction slots with useless loads and stores, further reducing peak execution rate."

First, loop unrolling does NOT waste instruction slots (in ROB or RS). It takes some instruction cache space, yes, since the number of instructions is apparently increased. However, the same number of instructions (or fewer) will be in flight at run time.

Second, loop unrolling does NOT increase register pressure, because it does not add true (RAW) dependency. If there were true dependency in unrolled loops, there would have been in the original loop. Fake dependencies (WAR and WAW) are handled by register renaming in hardware.
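To illustrate the point about false dependencies, here is a toy register renamer in Python. It is only a sketch under stated assumptions: the instruction tuples, the `rename` helper and the `p0, p1, ...` physical register names are hypothetical, and the model ignores the fact that real hardware has a finite physical register file.

```python
def rename(instrs):
    """Toy renamer. instrs is a list of (dest, src, src, ...) names.
    Every write gets a fresh physical register; reads use the latest mapping."""
    mapping = {}      # architectural name -> current physical name
    next_phys = 0
    renamed = []
    for dest, *srcs in instrs:
        phys_srcs = []
        for s in srcs:
            if s not in mapping:            # live-in value: give it a physical home
                mapping[s] = f"p{next_phys}"
                next_phys += 1
            phys_srcs.append(mapping[s])
        mapping[dest] = f"p{next_phys}"     # fresh register removes WAR/WAW hazards
        next_phys += 1
        renamed.append((mapping[dest], *phys_srcs))
    return renamed

# Two unrolled iterations that reuse the same architectural register xmm0.
unrolled = [
    ("xmm0", "in0"),         # iteration 0: xmm0 = in0
    ("out0", "xmm0", "k"),   # iteration 0: out0 = xmm0 op k   (true RAW on xmm0)
    ("xmm0", "in1"),         # iteration 1: xmm0 = in1         (WAW/WAR on xmm0)
    ("out1", "xmm0", "k"),   # iteration 1: out1 = xmm0 op k   (true RAW on xmm0)
]

for before, after in zip(unrolled, rename(unrolled)):
    print(before, "->", after)
```

After renaming, the two iterations write different physical registers, so only the true (RAW) dependencies remain. The sketch does not give the programmer any extra architectural registers; it only removes the false ordering constraints.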

Take a look at Intel's optimization reference page 3-15 about loop unrolling. You won't see register pressure or waste of instruction slots as cons.


"SSE instructions are long enough without an extra prefix, and the extra prefix will waste cache space and increase decode pressure."

It is precisely because "SSE instructions are long enough" already, that the extra REX prefix does not matter that much there.


"The thing is that the result of an instruction which began execution at the i-th cycle won't be available in the next cycle, but at best only in the (i+2)-th cycle."

As I said above, the 2 cycles are not all spent on the logic evaluation, and can be overlapped. Ever heard of register forwarding? :)


"You can argue all you want -- doesn't change the fact that I can write code which is twice as fast on Core 2 Duo as it is on K8, and I presume, K10 once it's out."

Maybe you can share with us, quantitatively, how you optimize RC5-72 for Core 2, and how it runs better than K8? That will be more interesting than the wrong arguments you made above. :)

Scientia from AMDZone said...

abinstein

The public Source Code for RC5-72 is here in tar gzip archive format.

However, you should be aware that one of the things that Decio didn't mention about RC5-72 is:

RC5 involves a large number of integer additions, rotates and XORs. It doesn't require floating point calculations and won't, in general, benefit from them.

And here:

Integral to the mathematics of the RC5 algorithm are 32-bit rotate operations.

The Client software is only 32 bit for windows but there are AMD64 versions for Linux and BSD.

I wonder which client Decio worked on.

abinstein said...

scientia -
"However, you should be aware that one of the things that Decio didn't mention about RC5-72 is"

Thanks for the info. Yes, like all encryption/decryption engines RC5 does not require a bit of floating-point, and like most symmetric crypto of its time, it's designed for 32-bit software.

I didn't look into the source code but the serial nature of crypto algorithms means wider ALU is not helpful. My guess is RC5-72 can take advantage of 128-bit SSE only because it is doing brute-force to crack the secret key (i.e., trying 4 keys at a time). It's just my guess, though.

Décío Luiz Gazzoni Filho said...

Scientia

"you have to look at both throughput and latency"

Intel: Comparisons of latency and throughput data between different microarchitectures can be misleading.


I didn't claim otherwise. However, it was you who said I had to look at throughput not latency. I'm claiming you have to look at both. How does Intel's statement contradict me?

"you're just waiting for dependent instructions to complete."

Intel: Instruction latency data is useful when tuning a dependency chain. However, dependency chains limit the out-of-order core’s ability to execute micro-ops in parallel.

Coding techniques that rely on instruction latency alone to influence the scheduling of instructions are likely to be sub-optimal


Consider a loop whose body has a long dependency chain, but each loop iteration is completely independent. You can SIMDize and then 2-way unroll the loop, so that you have 2 independent instruction flows running at the same time and can feed 2 distinct execution units (I do that all the time in Cell, but for masking latencies, not achieving parallelism). However, each instruction flow is still serial and is limited by instruction latency. You could of course further unroll the loop, but that increases your working set, which requires more registers, and as we know registers are a precious commodity in x86 -- this isn't Cell where I can load four 26-entry arrays in registers and still have registers to spare. Sure register renaming solves part of the problem, but it doesn't change the fact that you have to juggle data in and out of registers, and each load/store you do is throwing away an instruction slot you could otherwise use for arithmetic.

"It's great that the new 64-bit instruction set has twice as many registers, except that you have to waste an extra byte on the REX prefix to access these registers."

Well, I have to agree with you. If you write assembler code with lots of dependencies and avoid using the extra registers then C2D architecture will work better for you.


First, the 8-register case is still the most important case, as most CPUs out there are still running 32-bit OSes. Maybe the tables will turn in 3 or 5 years, but this is 2007 and we can't assume a 64-bit environment is generally available.

Also, I didn't argue that one should avoid the extra registers -- just that if you don't pay attention, you'll be bottlenecked in decoding because you're already spending 4-8 (even 9 if you use the scaled addressing mode) bytes per instruction, and a REX prefix would add an extra byte to that. I believe the instruction fetch unit on C2D fetches 16 bytes per cycle, so you can do two register-register SSE instructions and one register-memory instruction per cycle for a total of 16 bytes. Add a REX prefix to a single one of them and you've overflowed the fetch buffer.
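The fetch-budget arithmetic in the paragraph above can be sketched in a few lines of Python. The 16-byte window is the figure quoted above and is treated as an assumption here; the individual instruction lengths (4 bytes for register-register, 8 bytes for register-memory) are hypothetical values chosen to add up to the 16-byte total, not measured encodings.

```python
FETCH_BYTES_PER_CYCLE = 16  # figure quoted above for C2D; an assumption, not a measurement

def fits_in_fetch_window(instr_lengths, window=FETCH_BYTES_PER_CYCLE):
    """Return True if all the instructions fit in one fetch window."""
    return sum(instr_lengths) <= window

# Hypothetical lengths: two register-register SSE ops and one register-memory op.
without_rex = [4, 4, 8]
with_one_rex = [4 + 1, 4, 8]   # a REX prefix adds one byte to one instruction

print(sum(without_rex), fits_in_fetch_window(without_rex))
print(sum(with_one_rex), fits_in_fetch_window(with_one_rex))
```

With these lengths the first group exactly fills the window, and a single REX byte pushes the second group one byte over it.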

"and I presume, K10 once it's out. "

With the way you code, that may be true. However, any good compiler will produce code that is twice as fast on K10 as K8 as well.


Really? I'm delighted to learn about this. I'm all for faster code even if the user chooses an inferior architecture. Could you point to the specific changes between K8 and K10 that enable this two-fold speedup?

Décío Luiz Gazzoni Filho said...

abinstein

I will reply to your last comment for now, when I have time I'll address your other points.

I didn't look into the source code but the serial nature of crypto algorithms means wider ALU is not helpful.

Not true. Our code generally checks 2 or 3 keys in parallel (that's what the 2-pipe/3-pipe terminology means). Other than execution unit bottlenecks, we have to deal with the fact that rotates (the ROL instruction) require the rotate amount to be stored in ECX, so there's a lot of juggling involved to put the proper values in ECX at the right time, and of course lots of loads and stores, either for array accesses or due to register spilling. That's why I like Cell -- I can load all data in registers and have absolutely no loads or stores in the main loop.

My guess is RC5-72 can take advantage of 128-bit SSE only because it is doing brute-force to crack the secret key (i.e., trying 4 keys at a time). It's just my guess, though.

That'd be correct, except we don't do SSE2 at all. The way SSE2 shift instructions are defined, they're useless for RC5 -- we need to perform shifts by different amounts in each of the four 32-bit partitions of a 128-bit SSE register, and that's impossible, you have to shift by the same amount in all partitions. In the end all of our code is integer only. PowerPC (with Altivec) or Cell can do independent shift amounts in different register partitions (plus, it has rotate instructions rather than shifts only) so it's quite suitable for RC5 and these processors are the undisputed speed kings in that contest. Even old G4s handily beat modern x86 processors in RC5.
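The mismatch between a uniform-count shift and what RC5 needs can be modelled in Python. This is a sketch, not the distributed.net code: `uniform_shift_left` is a hypothetical model of a shift that applies one count to every 32-bit lane, and `per_lane_rotl` models the per-lane, data-dependent rotate that the text says RC5 requires. The lane values and counts are made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, r):
    """Rotate a 32-bit value left by r bits (r taken modulo 32)."""
    r &= 31
    return ((x << r) | (x >> (32 - r))) & MASK32

def uniform_shift_left(lanes, count):
    """Model of a SIMD shift: the same count is applied to every 32-bit lane."""
    return [((x << count) & MASK32) if count < 32 else 0 for x in lanes]

def per_lane_rotl(lanes, counts):
    """What RC5 needs: each lane rotated by its own data-dependent amount."""
    return [rotl32(x, r) for x, r in zip(lanes, counts)]

lanes = [0x80000001, 0x00000001, 0x12345678, 0xFFFFFFFF]
counts = [1, 4, 8, 31]          # a different rotate amount for each key in flight

print([hex(v) for v in per_lane_rotl(lanes, counts)])
print([hex(v) for v in uniform_shift_left(lanes, 4)])
```

A rotate can be built from two uniform shifts and an OR only when every lane shares the same count, which is not the case when each lane holds a different candidate key.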

The code I'm working on, that I mentioned before, is a different cipher design, and implemented in so-called `bitslicing' mode. It has nothing to do with RC5.

abinstein said...

"Not true. Our code generally checks 2 or 3 keys in parallel (that's what the 2-pipe/3-pipe terminology means). "

My original statement is actually true. SIMD does not help crypto such as DES, RC5, SHA, etc., unless you are checking several keys in parallel (as in RC5-72).

In all normal usage of crypto algorithms you simply don't process a message with several keys at the same time in the same process/thread.

It's nice to know that my guess was correct (thanks for your confirmation).

George said...

"Assuming Intel's 45nm server chips are available in 3.2Ghz speeds, AMD will need at least 2.4Ghz quad cores."

What are you smoking? A theoretical Barcelona 2.6 barely beats a Clovertown 2.66 by 1% on 8-thread SPECint_rate2006. How in the world do you equate a Barcelona 2.4 to a Penryn 3.2 in performance? They’re not even in the same league outside of HPC.

Décío Luiz Gazzoni Filho said...

My original statement is actually true. SIMD does not help crypto such as DES, RC5, SHA, etc., unless you are checking several keys in parallel (as in RC5-72).

In all normal usage of crypto algorithms you simply don't process a message with several keys at the same time in the same process/thread.


Actually, there are cipher modes of operation (particularly the counter mode, which emulates a stream cipher) where you can operate on all message blocks in parallel. Also, you can decrypt (not encrypt) messages in CBC mode in parallel. Moreover, in certain applications, one can queue messages to do batch message processing. In fact, one of my lines of research is implementing ciphers in `bitslicing' mode, where you view an n-bit CPU as n 1-bit SIMD CPUs, so in the case of SSE2 one is working with 128 different messages at a time.

But I understand you want examples of parallelism at the algorithm level. For that have a look at some ciphers designed according to the wide trail strategy, like the AES or the hash function Whirlpool (designed by my MSc advisor and Vincent Rijmen, one of the co-designers of the AES). S-boxes and round keys are all applied in parallel (so 16- or 64-way parallelism), the linear diffusion layer is applied independently to each column (so 4- or 8-way parallelism). There are lots of opportunities for parallelism there, even more so when working in bitslicing mode, where you dispense with table lookups completely and are no longer bottlenecked by cache throughput.
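For readers unfamiliar with bitslicing, here is a minimal Python sketch of the idea of treating a 128-bit word as 128 one-bit lanes. The `toy_round` function is a hypothetical 2-bit mixing step invented for illustration; it is not taken from any real cipher, and Python integers stand in for SSE2 registers.

```python
LANES = 128  # one lane per message, matching the width of an SSE2 register

def to_bitsliced(values, nbits):
    """Transpose: slices[i] holds bit i of every value, one value per lane."""
    slices = [0] * nbits
    for lane, v in enumerate(values):
        for i in range(nbits):
            slices[i] |= ((v >> i) & 1) << lane
    return slices

def from_bitsliced(slices, nlanes):
    """Inverse transpose: rebuild one integer per lane."""
    return [sum(((s >> lane) & 1) << i for i, s in enumerate(slices))
            for lane in range(nlanes)]

def toy_round(slices, key_slices):
    """Hypothetical 2-bit round: XOR in a key, then a small nonlinear mix.
    Only whole-word AND/XOR is used, so all lanes are processed at once."""
    a = slices[0] ^ key_slices[0]
    b = slices[1] ^ key_slices[1]
    return [a ^ b, a & b]

def toy_round_scalar(v, k):
    """The same round applied to a single 2-bit message, for comparison."""
    a = (v ^ k) & 1
    b = ((v ^ k) >> 1) & 1
    return (a ^ b) | ((a & b) << 1)

messages = [i % 4 for i in range(LANES)]
keys = [(3 * i + 1) % 4 for i in range(LANES)]

sliced_out = toy_round(to_bitsliced(messages, 2), to_bitsliced(keys, 2))
parallel = from_bitsliced(sliced_out, LANES)
serial = [toy_round_scalar(m, k) for m, k in zip(messages, keys)]
print(parallel == serial)
```

The bitsliced round uses two XORs, one more XOR and one AND regardless of how many lanes are packed into each word, and it needs no table lookups.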

It's nice to know that my guess was correct (thanks for your confirmation).

Please don't put words in my mouth. As you can see I said nothing of what you claim, and in fact just showed the contrary. This attitude is very rude and drives people away. Please refrain from doing so in the future or I'll take it as a sign that I should just go back to lurking.

Scientia from AMDZone said...

george

"What are you smoking? A theoretical Barcelona 2.6 barely beats a Clovertown 2.66 by 1% on 8-thread SPECint_rate2006. How in the world do you equate a Barcelona 2.4 to a Penryn 3.2 in performance? They’re not even in the same league outside of HPC."

Thank you. You've given a perfect example of how my posts are frequently misinterpreted. You obviously assume that I am claiming that a 2.4Ghz Barcelona will be as fast as a 3.2Ghz Penryn. Your interpretation however is not even close to what I was saying.

At launch of Barcelona in Aug/Sep:

Intel - 3.0Ghz Clovertown
AMD - 2.0Ghz Barcelona

From this point, AMD needs to gain on Intel rather than losing ground. If Intel bumps the speed to:

3.2Ghz Clovertown
then AMD needs:
2.13Ghz Barcelona so 2.2Ghz would gain

however I'm assuming that Penryn is faster than Clovertown so I assume that AMD will need:

2.4Ghz Barcelona

to gain on Intel. It appears that with only a 2.2Ghz model, AMD could actually lose ground. My assumption is that Penryn is 7% faster than Clovertown so it looks like:

3.33Ghz Penryn quad core
would be matched in ratio (not speed) by
2.4Ghz Barcelona

so if Intel has a 3.33Ghz model Penryn in, say, Q1 08 then AMD will need to bump their clock speed up to 2.6Ghz to gain on Intel.
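The ratio arithmetic above can be reproduced with a short Python sketch. The launch clocks and the 7% Penryn gain are the assumptions stated in the comment; the function name `amd_clock_to_hold_ratio` is hypothetical, and the result is the clock needed to keep the launch ratio, not a matching-performance estimate.

```python
INTEL_LAUNCH_GHZ = 3.0   # Clovertown at Barcelona launch (from the comment)
AMD_LAUNCH_GHZ = 2.0     # Barcelona at launch (from the comment)
PENRYN_GAIN = 1.07       # assumed 7% per-clock gain of Penryn over Clovertown

def amd_clock_to_hold_ratio(intel_ghz, per_clock_gain=1.0):
    """Barcelona clock that keeps the launch ratio against a given Intel part."""
    clovertown_equivalent = intel_ghz * per_clock_gain
    return clovertown_equivalent * AMD_LAUNCH_GHZ / INTEL_LAUNCH_GHZ

print(round(amd_clock_to_hold_ratio(3.2), 2))                # 3.2Ghz Clovertown
print(round(amd_clock_to_hold_ratio(3.2, PENRYN_GAIN), 2))   # 3.2Ghz Penryn
print(round(amd_clock_to_hold_ratio(3.33, PENRYN_GAIN), 2))  # 3.33Ghz Penryn
```

Under these assumptions a 3.2Ghz Clovertown corresponds to about 2.13Ghz, a 3.2Ghz Penryn to about 2.28Ghz, and a 3.33Ghz Penryn to about 2.38Ghz, which is why 2.2Ghz, 2.4Ghz and 2.6Ghz respectively are the next steps that would gain ground.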

Your misinterpretation was about a matching speed. I'm not actually sure what the matching speed would be. My best estimate at the moment is that for a:

3.2Ghz Penryn quad core

we would need something between:

2.9Ghz and 3.4Ghz Barcelona to match. That's about as close as I can currently get.

abinstein said...

"Actually, there are cipher modes of operation (particularly the counter mode, which emulates a stream cipher) where you can operate on all message blocks in parallel."

Yes, but first, we are not talking about stream ciphers or bitslicing; second, I was not referring to modes of operation but to the cipher algorithms themselves.


"But I understand you want examples of parallelism at the algorithm level."

If you understood it, then why the above talks on "modes of operation"? It seems to me you are objecting for the purpose of objection.


"Please don't put words in my mouth."

For the record I'll quote what I said and what you said below to show that I'm not putting words in your mouth:

I said: My guess is RC5-72 can take advantage of 128-bit SSE only because it is doing brute-force to crack the secret key (i.e., trying 4 keys at a time).

Then you said: Our code generally checks 2 or 3 keys in parallel (that's what the 2-pipe/3-pipe terminology means). ...
That'd be correct, except we don't do SSE2 at all. The way SSE2 shift instructions are defined, they're useless for RC5


So "you confirmed my guess" (except the SSE part where you are using Altivec - it'd have been correct to say SIMD in general).


"As you can see I said nothing of what you claim, and in fact just showed the contrary."

I did not say you agree with my claims, but you confirmed my guess. As for my claims, please tell me how DES, RC5, or SHA (from my claim) can be parallelized beyond 32-bit? I'm all ears.

Note I did *not* mention AES or other newer authentication algorithms purposely.

abinstein said...

decio -
"I do that all the time in Cell, but for masking latencies, not achieving parallelism"

Masking latencies is achieving parallelism.


"You could of course further unroll the loop, but that increases your working set, which requires more registers, and as we know registers are a precious commodity in x86"

Here is your misconception. Unrolling the loop itself does not increase register pressure; unrolling loop plus aggressive scheduling does.

More importantly, the pressure is on physical registers, not ISA ones, so even though x86 has limited ISA registers, both Core 2 and K8 have a lot more rename registers to take advantage of loop unrolling.

For reference see the example in p.310 of CA:AQA 3e by H&P.

Hope I'm not being rude here?

Aguia said...

So Scientia, since you are talking about clock speeds, I have some questions for you, if you'd like to reply.

Assuming that AMD's CPU will perform about the same as the current Conroe at the same frequency, and that AMD keeps pricing its products according to their performance versus Intel's, does that mean the 2.0Ghz part will cost less than the 2.4Ghz Q6600, or do you think AMD will forget this?

Because at 2.0GHz the AMD CPU would have to cost about ~200€, which is not very cost effective for AMD.

Lower speed versions would cost even less, assuming Intel doesn’t launch a new Q6500 or Q6400.

What’s your opinion?
Isn’t AMD as attached to clock speed as Intel is? Intel is stagnant in clock speed in desktop and mobile (I think you forgot those in your previous article): desktop is limited to 3.0Ghz and mobile to 2.4Ghz.

Pentium M 780 - 90 nm - 2266 MHz - 27 W - July 2005
Core Duo T2700 - 65 nm - 2333 MHz - 31 W - June 28, 2006
Core 2 Duo T7700 - 65 nm - 2400 MHz - 35 W - May 9, 2007

Core 2 Extreme X7800 - 65 nm - 2600 MHz - 44 W - July 16, 2007

It took 1 year to gain an extra 67Mhz in clock speed, at the cost of 4W more TDP.
It took another year to gain another 67Mhz in clock speed, at the cost of another 4W of TDP.
In total it took Intel two years to increase CPU speed by 133Mhz, while TDP rose by 8W.
The extra 200Mhz in the X7800 “cost” Intel 9W in TDP.

Do you see AMD follow or take advantage of this TDP increase? Is Intel opening opportunities here?

Azmount Aryl said...

AMD absolutely must stay competitive on the price/performance table when compared to intel's offerings. That is the only reason behind them maintaining their market share and having a slight increase from 2007Q1. The formula that is valid for dual cores is also valid for quad cores, although I expect AMD to try to get away with a higher priced Phenom X4 at the beginning. New tech can always be sold for a premium, but the problem here is that this won't last long. Maybe a quarter.

Décío Luiz Gazzoni Filho said...

abinstein

If you understood it, then why the above talks on "modes of operation"? It seems to me you are objecting for the purpose of objection.

No, I'm just thinking in practical terms. If you have an algorithm with little parallelism, but in your application the algorithm is applied to many datasets in parallel, do you object to achieving parallelism that way just because it's not parallelism in the algorithm itself? I'd rather see higher performance even if it's not in the most elegant way possible.

I did not say you agree with my claims, but you confirmed my guess. As for my claims, please tell me how DES, RC5, or SHA (from my claim) can be parallelized beyond 32-bit? I'm all ears.

Note I did *not* mention AES or other newer authentication algorithms purposely.


Except that the DES is insecure and being phased out, nobody uses RC5 (except for distributed.net), and I'll give you SHA-1/SHA-2, though SHA-1 is broken and a contest is under way to replace the SHA family.

On the other hand, AES is being quickly adopted everywhere due to its status as a US government endorsed standard, so I posit that it represents a very important case as the single most important cipher in new designs.

"I do that all the time in Cell, but for masking latencies, not achieving parallelism"

Masking latencies is achieving parallelism.


I was trying to distinguish the case where e.g. in Cell even the simplest instructions require 2 cycles to produce the output, so rather than sitting idle in dependency chains, I perform 2-way loop unrolling so in the i-th cycle I'm working on loop iteration j, in the (i+1)-th cycle I'm working on loop iteration j+1 and waiting for results from the i-th cycle, then when they're available in the (i+2)-th cycle I go back to working on loop iteration j, and so on. But notice that I never schedule instructions in parallel (you can't anyway, since virtually all ALU instructions, save for quadword rotates and a few isolated instructions, run in the same pipeline in Cell). In Core 2 Duo you don't have the latency problem, but you still need to keep multiple execution units fed, so you do the same trick, but as shown, with different goals.
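The interleaving described above can be checked with a toy issue model in Python. This is a sketch under simplifying assumptions: a single-issue, in-order pipeline, a fixed latency for every instruction, and round-robin selection between independent loop iterations ("chains"). The `schedule` and `issue_span` helpers are hypothetical and do not model any real Cell or Core 2 pipeline.

```python
def schedule(num_chains, ops_per_chain, latency):
    """Issue at most one instruction per cycle. An instruction may issue only
    after the previous instruction in its own chain has produced its result.
    Returns a list of (cycle, chain, op_index) in issue order."""
    ready = [0] * num_chains   # earliest cycle each chain's next op may issue
    done = [0] * num_chains    # ops already issued per chain
    issued = []
    cycle = 0
    nxt = 0                    # round-robin pointer
    while sum(done) < num_chains * ops_per_chain:
        for k in range(num_chains):
            c = (nxt + k) % num_chains
            if done[c] < ops_per_chain and ready[c] <= cycle:
                issued.append((cycle, c, done[c]))
                done[c] += 1
                ready[c] = cycle + latency
                nxt = (c + 1) % num_chains
                break
        cycle += 1             # an empty pass through the loop is a stall cycle
    return issued

def issue_span(issued):
    """Number of cycles from the first issue to the last issue, inclusive."""
    return issued[-1][0] + 1

serial = schedule(num_chains=1, ops_per_chain=8, latency=2)
interleaved = schedule(num_chains=2, ops_per_chain=4, latency=2)
print(issue_span(serial), issue_span(interleaved))
```

Both runs issue eight instructions. With one chain and a 2-cycle latency every other cycle is a stall, while two interleaved iterations fill every issue slot without ever issuing two instructions in the same cycle.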

"You could of course further unroll the loop, but that increases your working set, which requires more registers, and as we know registers are a precious commodity in x86"

Here is your misconception. Unrolling the loop itself does not increase register pressure; unrolling loop plus aggressive scheduling does.

More importantly, the pressure is on physical registers, not ISA ones, so even though x86 has limited ISA registers, both Core 2 and K8 have a lot more rename registers to take advantage of loop unrolling.


Here's what I'm trying to say. I just counted, in my Cell implementation of the cipher I'm working on, which proportion of instructions are loads/stores, and that works out to 21.3%. I did the same for my SSE2 code and the proportion of MOVDQAs to other instructions is 47.6%. Every instruction slot used by a MOVDQA is an instruction slot that's not used by a PAND or PXOR or POR, that's my whole point. Oh, and for added dramatic effect, my SSE2 code has lots of register-memory operations, which of course don't exist in Cell which is a RISC architecture -- if Cell had those then the ratio of loads/stores to ALU ops would be even lower there, probably a low single-digit percentage.

OK, so register renaming speeds up execution by working around false dependencies, I completely agree with you there, but the lack of ISA registers still requires the programmer to juggle data between registers and memory, and that juggling requires inserting loads/stores in the code and wasting instruction slots which could otherwise be used for instructions that do actual work. I hope I got my point across, because I just can't be any more clear than this.

abinstein said...

decio -
"Except that the DES is insecure and being phased out, nobody uses RC5 (except for distributed.net), and I'll give you SHA-1/SHA-2"

I agree with what you say here, and also what you said above about practicality and below about AES. Actually I've always believed the same things. My original statement was about RC5 and crypto algorithms of its time (just search "of its time" in this comment area).


"I just counted, in my Cell implementation of the cipher I'm working on, which proportion of instructions are loads/stores, and that works out to 21.3%. I did the same for my SSE2 code and the proportion of MOVDQAs to other instructions is 47.6%. Every instruction slot used by a MOVDQA is an instruction slot that's not used by a PAND or PXOR or POR, that's my whole point."

Thanks, this is informative. And yes, it is a problem with fewer ISA registers, where one must spend some instructions on register-to-memory moves. This of course happens when the active working set requires more registers than are available in the ISA.


"the lack of ISA registers still requires the programmer to juggle data between registers and memory, and that juggling requires inserting loads/stores in the code and wasting instruction slots which could otherwise be used for instructions that do actual work."

Yes, absolutely. For general purpose processing, or any program that is complex enough to have many "live" registers at the same cycle, RISC is superior to CISC by design.

But this problem (lacking ISA register) is not affected by loop unrolling, because you will need to store all pertinent states at the end of the non-unrolled loop anyway, right?

Scientia from AMDZone said...

decio

"as we know registers are a precious commodity in x86 -- this isn't Cell where I can load four 26-entry arrays in registers and still have registers to spare. Sure register renaming solves part of the problem, but it doesn't change the fact that you have to juggle data in and out of registers, and each load/store you do is throwing away an instruction slot you could otherwise use for arithmetic."

Not exactly. C2D is capable of discarding stores and loads while Cell is not. This would not help much with K8 but should be similar to C2D with K10.

"I'm delighted to learn about this. I'm all for faster code even if the user chooses an inferior architecture."

I'm not sure what architecture you consider inferior. You seem to like Cell in spite of its clear limitations.

" Could you point to the specific changes between K8 and K10 that enable this two-fold speedup?"

You don't seem very familiar with K10. First, the L1-I cache has twice the bus bandwidth, the instruction fetch has been doubled to 32 bytes, most SSE instructions now decode in one clock instead of 2, the L1-D cache also has twice the bus bandwidth, SSE can now do two 128 bit loads instead of one, and the SSE execution units have been widened to 128 bits. There are other improvements that do not give 2X speedup, like a second pre-fetch and improvement to the original pre-fetch, better branch prediction, additional stack hardware, and some additional Integer hardware like for LZCNT. There are also changes to allow loads over stores and some other changes.

Décío Luiz Gazzoni Filho said...

But this problem (lacking ISA register) is not affected by loop unrolling, because you will need to store all pertinent states at the end of the non-unrolled loop anyway, right?

Yes, you have to (at the bare minimum) load data at the beginning of the loop and save it at the end, but with enough registers you can avoid all intermediate loads and stores. Let's use a concrete example from my code: this is a 31-round cipher with a 64-bit state which consists of the application of 16 4-bit S-boxes and a bit permutation. Since I'm working in bitslicing mode, the cipher state is a 64-entry array of 128-bit variables. In Cell I have the luxury of loading the full cipher state in registers and working from there. I still need loads and stores for the key schedule (which uses an 80-bit key and hence overflows the register file), but as for the cipher state proper, I can just load it at the beginning, do all 31 rounds without accessing memory and store it at the end.

Contrast that against my SSE2 implementation. Since I'm register-starved, every round I have to load every entry of the cipher state array, process it and store it, only to load it again on the next round. Some of these loads/stores can be masked using register-memory addressing, such as the application of subkeys, but many others can't since I have to load the data, perform various operations on it and then save it -- register-memory ops are only good if you need to do a single operation, which is an important special case, but that's just it, a special case.
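The difference between the two implementations can be put in rough numbers with a short Python count of state loads and stores. This is a back-of-the-envelope sketch, not a measurement of the actual code: it counts only cipher-state traffic, ignores the key schedule, and treats the register-starved case as an upper bound because it ignores the loads that register-memory operands can absorb. The `state_memory_ops` helper is hypothetical.

```python
STATE_WORDS = 64   # 64-entry bitsliced cipher state (from the comment)
ROUNDS = 31        # number of cipher rounds (from the comment)

def state_memory_ops(state_fits_in_registers):
    """Count loads plus stores of cipher-state words for one full encryption."""
    if state_fits_in_registers:
        # Load the state once at the start and store it once at the end.
        loads = stores = STATE_WORDS
    else:
        # Upper bound: every word is reloaded and stored again in every round.
        loads = stores = STATE_WORDS * ROUNDS
    return loads + stores

resident = state_memory_ops(True)
starved = state_memory_ops(False)
print(resident, starved, starved // resident)
```

Under these assumptions the register-starved version performs up to 31 times as many state loads and stores, one full pass over the state per round instead of one per encryption.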

Scientia from AMDZone said...

Aguia

"VS Intel, does that mean that the 2.0Ghz will cost less than the Q6600 2.4Ghz, or you think AMD will forget this?"

The chips in Aug/Sept will be server chips. I would expect the 2.0Ghz X4 Opteron to be priced similar to the 2.33Ghz Clovertown which is currently $900. I believe AMD will release some desktop chips in Q4 for Christmas but I don't know what models.

AMD's current top end dual core is FX-74 which is priced $320 so I would assume AMD would go above that with X4 Phenom. Q6600 is priced $375 but the problem is that they then have a huge gap with Q6700 being $968. AMD may not release any X4 Phenoms in Q4. For example, they could limit the release to a 2.4Ghz FX version which would probably do reasonably well compared to Q6700.

The 2.4Ghz Athlon X2 is running about $100 but Phenom would have much higher SSE performance.

C2D 2.13Ghz is $195 but has a slower 1066Mhz bus. So, probably around there. And, then a bit higher for 2.2 and 2.4Ghz versions.

"Lower speed versions would even cost lower, this if Intel doesn’t launch a new Q6500 or Q6400."

That depends. Most of the desktop lineup won't come until Q1 08 and AMD is likely to have 2.6Ghz chips by then. A QX6700 is currently running $968 so that should give AMD some room even if Intel drops prices a bit.

"The extra 200Mhz in the X7800 “cost” Intel 9W in TDP.

Do you see AMD follow or take advantage of this TDP increase? Is Intel opening opportunities here?"


This is mostly dependent on how popular AMD's current mobile Turion and chipset combination are. Although it appears that this is improving now on 65nm. AMD won't really hit its stride in mobile until about Q2/Q3 08 when it releases the new mobile core and chipset.

By Q2 08 AMD will likely have halted 90nm production. This should mean that the lowest Sempron chip will be a Brisbane dual core by Q3.

Scientia from AMDZone said...

azmount

As I mentioned above, Intel's current Kentsfield prices are completely artificial. If Intel brings the prices into closer alignment then yes, AMD will have to follow.

Scientia from AMDZone said...

decio

Am I to understand that you are storing the data in an XMM register but doing the actual rotation in a GP register?

If this is the case what operation is giving you double speed with C2D?

enumae said...

Scientia
I would expect the 2.0Ghz X4 Opteron to be priced similar to the 2.33Ghz Clovertown which is currently $900.

That is prior to the price cuts. The E5345 will be $455 before the end of this month (supposedly).

Q6600 is priced $375 but the problem is that they then have a huge gap with Q6700 being $968.

The Q6600 is $266 ($287) and the Q6700 is supposed to be around $530 ($589 on line).

-------------------------------

On a side note...

Looking at the current market place Intel is about 10-20% faster in absolute performance than K8. When Barcelona comes out it seems somewhat clear it will be going against Penryn.

My question is, how is AMD going to change their current situation considering that they could/most likely will (due to clock speed), still be 10-20% slower in absolute performance, while facing further pricing pressure from Clovertown, Conroe and Woodcrest?

Thanks.

Décío Luiz Gazzoni Filho said...

scientia

Am I to understand that you are storing the data in an XMM register but doing the actual rotation in a GP register?

If this is the case what operation is giving you double speed with C2D?


This isn't RC5. As I said in a reply to abinstein, RC5 doesn't use SSE2, there's even a detailed explanation why in that post (a limitation of the SSE2 shift instructions PSLLD/PSRLD). All of the x86 RC5 code is integer only.

The code I wrote which I keep referring to is a completely different cipher design (a detail I also mentioned, in the same reply to abinstein), which unfortunately I can't name for now since it hasn't been published yet -- the wonders of working in research. But have a look at the challenges of implementing DES in bitslicing mode and they'll be quite similar to what I'm dealing with.

Azmount Aryl said...

Sci said...
C2D 2.13Ghz is $195 but has a slower 1066Mhz bus. So, probably around there


New prices are here - a 2.66GHz C2D with 1333MHz FSB for $209. In stock

Also, isn't intel doing price cuts every quarter? Should we expect one more round in Q4?

I should say that I agree the price difference between the Q6600 ($266) and Q6700 ($532) is too big. If AMD had their X4s today, they could have squeezed in between and made an extra buck.

abinstein said...

Scientia, I believe the quad-core Opterons will be priced much higher than quad-core Xeons. In other words, 2.0GHz Barcelona will probably be priced around Xeon 5355 (2.66GHz).

Barcelona will take good advantage of current motherboards, which is a big cost advantage. OTOH, Xeon 5345 or above has performance that depends heavily on FSB speed.

For desktop, however, it could be much different, and Phenom's pricing is likely much lower, especially for dual-core ones.

abinstein said...

"New prices are here - a 2.66GHz C2D with 1333MHz FSB for $209. In stock"

If you look at it, the E6750 is priced lower than E6700. It is obvious that the price is artificial.

The question is, why? Because E6750 is really the left-overs of Xeon 5355. Intel is making the money on Xeon, and those that don't work well in quads for some reason are sold as duals.

If you look at benchmarks, the E6750 has essentially the same performance as the E6700 even with a 33% faster FSB. The low price of the E6750 is thus meant to compensate for the higher price the buyer must pay for the motherboard, with 1333/1066 boards being some $50 more expensive than 1066/800 ones.

Note this "motherboard tax" and incompatibility between minor/sub generations do not happen with Athlon or Opteron.

Azmount Aryl said...

You're not looking at it the right way, Abinstein. Intel knows that AMD's quad-core CPUs are coming, and they also know what weaknesses their own CPUs have. By selling 1333MHz FSB chips at lower prices they promote the motherboards that are compatible with them, so that when desktop X4 arrives Intel will have 1333MHz as standard, possibly with 1600MHz motherboards widely available. There's nothing more to it, just simple business-oriented thinking on the part of Intel's folks.

Aguia said...

Finally Good CPU from VIA?

Via Set to Update C7 Chips with Better Performance

Giant said...

Note this "motherboard tax" and incompatibility between minor/sub generations do not happen with Athlon or Opteron.

Nonsense. The P965 chipset fully supports the 1333MHz FSB CPUs, as does the older 975X chipset. The P965/975X chipsets are also fully compatible with the 45nm Penryn CPUs thanks to Intel keeping the voltage requirements identical. The only thing that is needed is a BIOS update.

abinstein said...

Azmount -
"By selling 1333MHz FSB chips at lower prices they promote the motherboards that are compatible with them, so that when desktop X4 arrives Intel will have 1333MHz as standard, possibly with 1600MHz motherboards widely available."

Apparently not. If this were the case, Intel should have been selling 1333-capable motherboards cheaper. However, just compare the price of 1333/1066 MBs with that of 1066/800 ones. The former are, as I said, quite a bit more expensive for otherwise the same level of functionality (this is also my response to Giant: I don't care which chipset is capable of what, you simply have to pay more for a MB that supports the E6750).

The fact is that a 1333MT/s FSB does not help dual-core performance. Encouraging everyone to upgrade to 1333/1066 motherboards would cannibalize Intel's own sales of E6400 and lower chips. The only reason there are 1333MT/s Core 2 Duos around is that they are leftovers of the MCM quad-cores.

Intel and its customers are in this situation because of sticking with the old FSB architecture. A year ago there was no 1333-capable motherboard, and a year later 1066-FSB processors can't be used on 1333/1600 motherboards.

Ho Ho said...

abinstein
"a year later 1066-FSB processors can't be used on 1333/1600 motherboards."

Are you sure of it? This is the first time I hear about it and for some reason I highly doubt it.

abinstein said...

"Are you sure of it? This is the first time I hear about it and for some reason I highly doubt it."

Oops... that's obviously a typo. No, I'm not sure of it. What I meant to say is that 1600-capable processors can't be used on 1333/1066 motherboards.

Ho Ho said...

abinstein

You do know that only Xeons will be with 1.6GHz FSB? Or do you have any other information?

abinstein said...

"You do know that only Xeons will be with 1.6GHz FSB? Or do you have any other information?"

As always, you are not getting what I said, looking at the tree but not the forest. There will be 1600-FSB dual-core Core 2 Duo, just as there is 1333-FSB Core 2 Duo. There will also be 1600-FSB Core 2 Quad for desktop, unless Penryn is going to top at 3.2GHz.

enumae said...

Abinstein
There will be 1600-FSB dual-core Core 2 Duo, just as there is 1333-FSB Core 2 Duo. There will also be 1600-FSB Core 2 Quad for desktop, unless Penryn is going to top at 3.2GHz.

Please post a link to this.

Thanks

abinstein said...

"Please post a link to this."

When Intel tells you it's going to clock Penryn as high as 3.4 (or 3.6?) GHz, does it give you a "link"?

What I said is just speculation. You don't need to believe me but I'll try to sacrifice some of my noon rest and state the reasons below.

First, 1333-FSB does not help dual-core performance, yet Intel releases them. Why? Definitely not to encourage 1333/1066 motherboard sales, because the right way to do the latter is to lower the price of 1333-capable chipsets, not to lower the price of the processors (Intel depends more on processor sales than chipset sales). It's probably because there are dual-core Core 2 dies that do not work well as quad-core Xeon for some reason, and are thus sold as dual-cores only. The same thing will happen again with Penryn cores with 1600-FSB.

Second, a 1333MT/s FSB is required to feed 2.33/2.66GHz Clovertown. 1600MT/s will be required for 3.2GHz Penryn. This is assuming Penryn has no better ILP than Clovertown (especially not in SSE, which is bandwidth demanding). Do you think Intel's going to release faster-than-3.2GHz Penryn for desktop? If so, then you better prepare to buy 1600MT/s motherboard for those chips.
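
The scaling argument in that last paragraph can be sketched as simple proportionality. This is only an illustration of the reasoning, assuming the FSB rate has to grow in step with core clock; the function names are made up for the example, and the 8-byte bus width is the standard 64-bit FSB data path.

```python
def required_fsb_mts(ref_fsb_mts, ref_clock_ghz, target_clock_ghz):
    """FSB rate (MT/s) that keeps the same FSB-to-core-clock ratio.

    Assumption: bandwidth demand grows linearly with core clock.
    """
    return ref_fsb_mts * target_clock_ghz / ref_clock_ghz


def fsb_bandwidth_gbs(fsb_mts, bus_width_bytes=8):
    """Peak FSB bandwidth in GB/s for a 64-bit (8-byte) wide bus."""
    return fsb_mts * bus_width_bytes / 1000.0


# Reference point from the comment: 1333MT/s feeding a 2.66GHz Clovertown.
needed = required_fsb_mts(1333, 2.66, 3.2)
print(f"FSB needed at 3.2GHz: about {needed:.0f} MT/s")
print(f"1333MT/s peak: {fsb_bandwidth_gbs(1333):.1f} GB/s")
print(f"1600MT/s peak: {fsb_bandwidth_gbs(1600):.1f} GB/s")
```

Under that linear assumption a 3.2GHz part lands almost exactly on 1600MT/s, which is where the figure in the comment comes from. Larger caches or better prefetching would weaken the assumption, so this is not evidence of what Intel will actually ship.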

Ho Ho said...

Will those 1.6GHz FSB desktop CPUs be here before 2009? If not, then CSI will likely take over soon and there wouldn't be much of a reason to create dualcores with that high an FSB. Quadcores, perhaps yes, but most likely only in the Extreme series.

Also before 2009 we'll have Nehalem and currently I think it is unknown if it'll work on current motherboards or not.


"It's probably because there are dual-core Core 2 dies that do not work well as quad-core Xeon for some reason, and are thus sold as dual-cores only"

How many such dualcores are there for every working quadcore? Also, what exactly could there be that makes it impossible to use two of those 2.66GHz dualcores to make one working quadcore?

I'm not sure why exactly Intel released those 1333MHz FSB dualcores. I'm quite certain that they are not leftovers from quadcores.

Pop Catalin Sever said...

AMD Demoed 3.0 GHz Phenom Quadcore. This is quite a surprise and should stop all those apocalyptic predictions for AMD ...

Also more news regarding AMD's future products there: Bobcat, Falcon, Sandtiger among others were announced ...

This was the juiciest article about AMD I've read in a while ...

Pop Catalin Sever said...

Correct link

enumae said...

Abinstein
What I said is just speculation.

You stated that Intel will, and it came across as though it were fact, which is why I asked for a link.

Is there a need for the sarcasm?

First, 1333-FSB does not help dual-core performance, yet Intel releases them. Why?

The 1333FSB processors support Intel® Trusted Execution Technology, and could have an impact on Business customers.

Definitely not to encourage 1333/1066 motherboard sales

It could be Intel's TXT and the fact that these 1333FSB motherboards support Penryn.

Or maybe the channel is already aware that Intel is phasing out the 1066FSB motherboards and is trying to clear inventory.

HKEPC has an Intel road map for their chipsets.

1600MT/s will be required for 3.2GHz Penryn.

Do you think Intel's going to release faster-than-3.2GHz Penryn for desktop?

If so, then you better prepare to buy 1600MT/s motherboard for those chips.


Why does Intel supposedly plan on releasing a 3.33GHz - 1333FSB Penryn in Q4 (DigiTimes) if they need 1600FSB?

Aguia said...

Scientia have you already seen AMD presentation?

Very Impressive.

Two quick notes that seem very interesting,
AMD 2.0GHz is 25% faster than Intel 2.33GHz, with 30% better performance per watt.

AMD has already quad core CPUs at the same yield of current dual core.

Much more can be said about it, like the number of projects AMD has in hand. If AMD does deliver all of them, Intel will need much more than Penryn and Nehalem to compete.
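
As a rough sketch of what the "25% faster at 2.0GHz vs. 2.33GHz" claim implies per clock: the calculation below just divides out the clock difference. It takes the presentation figure at face value and assumes performance scales linearly with clock, which is a simplification; the function name is made up for the example.

```python
def per_clock_advantage(perf_ratio, own_clock_ghz, rival_clock_ghz):
    """Performance ratio normalized to equal clock speed.

    Assumption: performance scales linearly with clock frequency.
    """
    return perf_ratio * rival_clock_ghz / own_clock_ghz


# 1.25 = "25% faster", as claimed for 2.0GHz Barcelona vs. 2.33GHz Xeon.
ratio = per_clock_advantage(1.25, 2.0, 2.33)
print(f"{(ratio - 1) * 100:.0f}% more work per clock")  # prints 46% more work per clock
```

This is a system-level figure for whatever benchmark the slide used, so it says nothing about other workloads.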

Ho Ho said...

Scientia have you already seen AMD presentation?

Care to link it? It would be interesting to see what configurations and benchmarks were used.

Ho Ho said...

AMD has already quad core CPUs at the same yield of current dual core.

My logic says this is impossible. The die size difference alone should make it have worse yield than current CPUs do.
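
The die-size argument can be illustrated with the textbook Poisson yield model, Y = exp(-A * D). This is a minimal sketch only: the defect density and die areas below are hypothetical round numbers picked for illustration, not AMD's actual figures, and real fabs use more refined models.

```python
import math


def poisson_yield(die_area_cm2, defect_density_per_cm2):
    """Fraction of good dies under the simple Poisson model Y = exp(-A * D)."""
    return math.exp(-die_area_cm2 * defect_density_per_cm2)


# Hypothetical inputs, chosen only for illustration (not AMD's real numbers).
DEFECT_DENSITY = 0.5   # defects per cm^2
DUAL_CORE_AREA = 1.3   # cm^2
QUAD_CORE_AREA = 2.8   # cm^2

dual_yield = poisson_yield(DUAL_CORE_AREA, DEFECT_DENSITY)
quad_yield = poisson_yield(QUAD_CORE_AREA, DEFECT_DENSITY)
print(f"dual-core yield: {dual_yield:.0%}, quad-core yield: {quad_yield:.0%}")
```

At any fixed defect density the larger die comes out worse in this model, so equal yields would need something else to change, such as a lower defect density or a different definition of yield.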

Aguia said...

Slide 160 and 45.
Slide 159 also interesting.

The whole presentation is interesting, let’s hope AMD/ATI does deliver.

Just one more note, the AMD staff who were there at the presentation, like Hector and the others, didn’t "smile" too much. Well, it was my first "live" AMD Investor Day. Maybe they never smile too much. Of course anyone at a company in the red can’t be too happy ;)

Aguia said...

ho ho,

the presentation isn't available yet.

2007 Analyst Day

I have seen it "live" in the webcast.

Aguia said...

This link still works for me:

2007 Analyst Day

Ho Ho said...

Thanks, I'll probably take a look at it tomorrow, it is nearly 2 a.m here :)


If AMD says it can release a 120W 2.6GHz quadcore Opteron in Q2 2008, then who believes there will be 3GHz quads available this year? Also note that although it has 2.5GHz quads in Q4 this year, it takes them two quarters to get a 100MHz speed boost. Though they do predict they can lower the TDP of the 2.4GHz quad from 120W down to 95W over the same interval.

Aguia said...

The problem seems to be Barcelona and the HT 1.0 compatibility.

The Agena with the split power plane and the HT3.0 connectivity seems to allow greater clock speed.

It was an Agena quad-core CPU that was demonstrated, not a Barcelona. And of course single socket.

core2dude said...


Also before 2009 we'll have Nehalem and currently I think it is unknown if it'll work on current motherboards or not.

Nehalem won't work in current motherboards. No chance whatsoever! Most likely it won't even physically fit in the same socket.

abinstein:

It's probably because there are dual-core Core 2 dies that do not work well as quad-core Xeon for some reason, and are thus sold as dual-cores only

huh??

lex said...

why Hector and Dirk weren't smiling..

65nm Yields suck
Barcelona has serious speed issues
Barcelona has serious power issues
3 steppings and still bugs
just lost another 600 million
no real course to fund 45nm
Lots of great plans but no money or people to really execute them

I wouldn't be smiling either if I had to spin that situation.

Ho Ho said...

aguia
"The Agena with the split power plane and the HT3.0 connectivity seems to allow greater clock speed."

Have you got anything to support that claim? To me it seems pretty much impossible.



On slide 42 about virtualization they show that 2GHz quadcore will be 79% faster than 3GHz dualcore at same power envelope. Somehow I doubt that there will be 2GHz energy efficient (=68W) K10s available at launch if they use a 95W CPU in comparisons. Has anyone seen different information?

Also they say that 2P Barcelona has a SpecFP_rate score of 69.5. Well, I found this on spec.org. What am I missing? Why does it look like 2P Barcelona is slower than 2P K8 at the same clock speed?

Ho Ho said...

Sorry, I just realized I had looked at the wrong SPEC numbers for the K8

Ho Ho said...

Here is a 2006 result from a 4-core K8 box at 2GHz: it is 40.7 points. That would make K10 at same clock speed and with double the cores around 70% faster. I was expecting much more than that personally, especially as K10 has twice the cores and it is supposed to do much more work per clock.
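
For reference, the arithmetic behind the "around 70%" figure can be sketched as below. It uses only the two scores quoted in this thread (69.5 for 2P Barcelona, 40.7 for the 4-core K8 box) and assumes, as the comment does, that the Barcelona system has 8 cores; whether the two test systems are otherwise comparable is not known.

```python
def speedup_percent(new_score, old_score):
    """How much faster new_score is than old_score, in percent."""
    return (new_score / old_score - 1.0) * 100.0


K10_SCORE, K10_CORES = 69.5, 8   # 2P Barcelona at 2GHz (from the slide)
K8_SCORE, K8_CORES = 40.7, 4     # 4-core K8 box at 2GHz (from spec.org)

gain = speedup_percent(K10_SCORE, K8_SCORE)
print(f"{gain:.1f}% faster")  # prints 70.8% faster

# Throughput per core, K10 relative to K8, at the same clock.
per_core = (K10_SCORE / K10_CORES) / (K8_SCORE / K8_CORES)
print(f"per-core throughput ratio: {per_core:.2f}")
```

A per-core ratio below 1.0 in a rate test is what prompts the "expected much more" reaction, though rate benchmarks are also sensitive to memory bandwidth shared between cores.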

Aguia said...

ho ho
I'm a little confused...

The K8 Quad (2x2) 2.0Ghz scored 40.7
The Xeon Quad (2x2) 2.0Ghz scored 36.1
The Barcelona Quad (1x4) 2.0Ghz scored 69.5
The Xeon Quad (1x4) 2.13Ghz scored 33.0
The Xeon Octo (2x4) 2.0Ghz scored 52.0
The K8 Octo (4x2) 3.0Ghz scored 89.1

How is this a bad result?

Ho Ho said...

aguia
"The Barcelona Quad (1x4) 2.0Ghz scored 69.5"

No, on slide 45 it says "2P servers:
Barcelona (2.0GHz, 95W) vs. Xeon 5345 (2.33GHz, 1333MHz FSB 80W)."

What makes you think Barcelona had only a single CPU with four cores against two Xeons?


Also I made a little error before, virtualization performance is compared on slide 43, not 42.

Aguia said...

Sorry my mistake.

I was seeing something strange, but couldn't work out what it was.

abinstein said...

Ho Ho -
"On slide 42 about virtualization they show that 2GHz quadcore will be 79% faster than 3GHz dualcore at same power envelope. Somehow I doubt that there will be 2GHz energy efficient (=68W) K10s available at launch if they use a 95W CPU in comparisons."
Your logic is strange. The slide explicitly says at same power, so why do you want to compare a 68W quad-core with a 95W dual-core? The slide is showing quad-core K10 having much better virtualization performance than dual-core K8, even at same power.


"That would make K10 at same clock speed and with double the cores around 70% faster. I was expecting much more than that personally, especially as K10 has twice the cores and it is supposed to do much more work per clock."

First, for Core 2 Xeons, going from dual-core to quad-core at the same clock rate gives only 50-60% performance improvement, thus K10's 70% is already impressive in the face of its competition.

Second, the 70% you calculated is not meaningful anyway, because you don't know whether the scores (69.5 vs. 40.7) are generated on the same type of systems (compiler, OS, memory, etc). For example, two identical 8-core Opteron 8220 servers can score a SPECfp peak of 85.4 or 91.8 just because a different OS is used. You also don't know whether the scores will be higher than 69.5 when production systems are out.
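
To put a number on that point, here is a rough sketch of how much the OS-only difference quoted above (85.4 vs. 91.8) could move the 70% figure. It assumes, purely for illustration, that the same relative spread could apply to the K8 baseline score in either direction; nothing in the thread says that it does.

```python
def relative_spread(low, high):
    """Relative difference between two scores from the same hardware."""
    return high / low - 1.0


os_spread = relative_spread(85.4, 91.8)   # same Opteron 8220 box, different OS
print(f"OS-only spread: {os_spread:.1%}")

K10_SCORE, K8_SCORE = 69.5, 40.7

# Hypothetical: shift the K8 baseline up or down by that same spread.
gain_if_baseline_higher = K10_SCORE / (K8_SCORE * (1 + os_spread)) - 1
gain_if_baseline_lower = K10_SCORE / (K8_SCORE / (1 + os_spread)) - 1
print(f"gain range: {gain_if_baseline_higher:.0%} to {gain_if_baseline_lower:.0%}")
```

A software-only swing of this size is enough to make the computed gain a range rather than a single number, which is the caution being raised here.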

While it is premature to determine K10's scalability, one thing is for sure: it's much better than Xeon's.

Ho Ho said...

abinstein
"First, for Core 2 Xeons, going from dual-core to quad-core at the same clock rate gives only 50-60% performance improvement, thus K10's 70% is already impressive in the face of its competition."

How much do you gain with K8? How big a difference in scaling is there between using a higher clocked K8 and going from dualcore K8 to quadcore K10? My quick and dirty calculations show that K8 scaling when upgrading to faster K8s is actually pretty good. With similar scaling K10 should show a lot better result than it does according to this benchmark.


"Second, the 70% you calculated is not meaningful anyway, because you don't know whether the scores (69.5 vs. 40.7) are generated on the same type of systems (compiler, OS, memory, etc)."

Yes, but I do expect AMD to use an optimal setup for measuring Barcelona performance. I somehow doubt things would get a lot better by simply changing something besides the CPU.


"While it is premature to determine K10's scalability, one thing is for sure: it's much better than Xeon's."

You could also say that it performs rather nicely at low clocks; we know nothing about high clocks yet. When you clock it higher it might start hitting memory bandwidth limits in scenarios where Core2 doesn't thanks to its larger cache, though there probably aren't many such scenarios. Still, I'm sure people will find them and use them to show one or the other in a better light.

Also, what we don't know is how it will perform on desktop applications. Some SpecInt results would be nice for starters, though I still wouldn't call those programs a representation of average desktop usage.

Ho Ho said...

forgot something,

abinstein
"The slide explicitly says at same power, why do you want to compare 68W quad-core with 95W dual-core?"

I was just thinking out loud. Basically what I meant was that there probably won't be 2GHz quads at 68W available at launch. Low power models will have lower clock speeds than that, at least until new revisions arrive.

Also there are no 95W 3GHz dualcore Opterons. The lowest power is 119.2W. When I said "compare against 95W CPU" I meant the 95W Barcelona. If they had a low-power 2GHz Barcelona available at launch, it would have been logical to use that in comparisons.

http://www.amdcompare.com/us-en/opteron/default.aspx

Ho Ho said...

There is another possibility why the score is lower than expected for Barcelona: it lacks memory bandwidth.

I'll try to make some calculations based on K8 tomorrow, it's 2 a.m. here at the moment. If anyone else can do the calculations for me then I'd greatly appreciate it.

Aguia said...

ho ho,
what about this possibility.

Barcelona is "just" one K8 with 4 cores.
That would explain the 70% performance over the current dual cores.
No improvements were made to x87.
AMD didn’t double the FPU; the only thing they doubled was the SSE units, which they widened to 128 bit.

The L3 doesn’t help the results much because it's "integrated" on the IMC (if that’s the case, L3 cache performance will scale with the Northbridge clock speed), and those tests never scaled with cache, or Intel would already be beating the current Opteron.

That design still beat Xeon because current Opteron is still an excellent server processor.
The problem is really the lack of a quad-core part.

That would also explain the lack of HT3, the "only" three HT links, and the mixing of codenames like K10 and Barcelona.
And the avoiding of precise performance questions, concentrating the speech on technologies like split core power.

This is of course just theorizing.

PS: The calcs ho ho?

Scientia from AMDZone said...

Okay, now that I finally have the Technology Analyst Day article up we can continue discussing this there.

greenmachine said...

Scientia,

I think everyone may find this thesis of interest.

http://etd.lsu.edu/docs/available/
etd-06122007-093459/unrestricted/
Prakash_thesis.pdf

I am sorry to break up the html link, but I could not get it to fit on one line.

Perhaps you could write an article on it. It seems to verify a lot of your previous articles on the strengths and weaknesses of core 2 duo and amd architectures.

I too have been looking for valid benchmarks instead of just fog. I was surprised at how close the two are in most applications.

Thanks

Paul Crowley said...

Yes, like all encryption/decryption engines RC5 does not require a bit of floating-point

Actually floating-point can be used for crypto; see eg http://citeseer.ist.psu.edu/447761.html