Scientia's Blog: 2007

Monday, December 24, 2007

AMD K10 Tries To Grow Some Teeth

It has been a long, long wait for proper testing on K10. The previous hasty reviews generally showed K10 doing poorly with user applications and pretty well with server applications. Unfortunately, there never seemed to be enough information to determine whether K10 was working properly or contained some serious design flaw. With the updated review at Legit Reviews we now have at least some information.

It is clear that the initial teething problems with ATI have been taken care of. The new Spider platform looks to be first rate and a good win for AMD. However, this platform does need a good processor and AMD has been slow to respond. The Phenom is apparently shipping but only as 9500 (2.2Ghz) and 9600 (2.3Ghz). Things don't really get interesting until 9900 (2.6Ghz) but this is either not available until Q1 or may even be pushed back to Q2 as recently stated by Digitimes. Of course, the same source also says that Intel is delaying the Core 2 Quad Q9300, Q9450 and Q9550 until late Q1. The difference is that AMD has not commented while Intel has confirmed the delay. However, Charlie at The Inquirer says that Intel is still having teething problems of its own with its 45nm process. I have to say that this again brings up the question of how much FUD is in factory previews. We have Intel showing Penryns clocked to 3.33Ghz and AMD showing K10s clocked to 3.0Ghz with neither chip anywhere on the horizon. I guess in this smoke and mirrors contest Intel does seem to be closer to reality with a 3.2Ghz Penryn QX9770 scheduled maybe end of Q1 while AMD hasn't yet indicated when 2.6Ghz will be out. And, of course, with Intel's current lead the question of a delay in 45nm is moot unless AMD is planning to essentially skip 65nm versions and start producing 45nm in Q2 (delivery in Q3). That is about the only way I could see AMD getting back on track.

Okay, back to our original point. Assuming AMD manages to get real hardware out the door someday and make the 9900 more substantial than vaporware, what do we have to look forward to? Is 9900 a contender or a joke? The original Sandra XII scores showed very poor performance for K10 in both Integer and FP (the red bar). Such poor performance suggested a serious problem with K10. However, the new Sandra XII SP1 scores show double the previous scores, removing any question of a serious design error in the memory section. The Sandra XII SP1 score of 10306 is fully 52% greater than Intel's QX9770. This explains why K10 does so well on memory limited benchmarks. Even more important, however, is the fact that it is 13% faster than 6400+, showing that K10's hardware changes to the memory controller do make a difference.

Next we need to revisit the two benchmark rumors that dogged K10 long before its release: Pov-Ray and Cinebench. Early scores from these two benchmarks had Intel enthusiasts cackling and howling that K10 was no faster than K8. Neither was a proper benchmark but this didn't seem to stop the naysayers. We can see on the PovRay scores that 6400+ has to be clocked 20% faster to get 2% better performance than E6750. Thus C2D is about 18% faster than K8. However, in answer to the notion that Pov-Ray proves that K10 is no faster than K8 we see that there is no such proof. If we take the E6750 score and scale it to quad core at 2.4Ghz we can see that Q6600 scales 98% which is a very good score. However, K10 scales 102%. This suggests that K10 is at least 4% faster than K8. Since Q6600 shows no bandwidth limitations we know that the improvement is not due to memory. True, the change is small but for unoptimized code it is significant. The PovRay website also does not say what level of SSE is supported in the 64 bit version. Judging from the small increase from K8 to K10 and C2D it is clear that PovRay is not using 128 bit SSE operations. It does say that the 32 bit version only supports up to SSE2. It also does not say what compiler was used to create the executable. This too is significant since if Intel's compiler were used, K10 would take at least a 10% hit in speed. This could easily put K10 even or perhaps slightly ahead. Again, in terms of K10's speed relative to K8, the Pov-Ray Realtime scores are even more telling. Intel does manage perfect scaling with QX9770 versus E6750. However, K10 is fully 11% faster than 6400+ would be with perfect scaling. So, the Pov-Ray rumor is shelved.
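The scaling percentages above come from projecting a dual core score to a quad core at a different clock and then comparing the real score against that projection. Here is a minimal sketch of that arithmetic. The function names are my own and the numbers in the example are hypothetical placeholders, not scores from the review.

```python
def expected_score(base_score, base_cores, base_ghz, target_cores, target_ghz):
    """Score the target chip would post if it scaled perfectly
    with both core count and clock speed."""
    return base_score * (target_cores / base_cores) * (target_ghz / base_ghz)


def scaling_efficiency(actual_score, base_score, base_cores, base_ghz,
                       target_cores, target_ghz):
    """Actual score as a fraction of the perfectly scaled projection.
    1.00 means perfect scaling; above 1.00 means the target does more
    work per core per clock than the baseline chip."""
    projected = expected_score(base_score, base_cores, base_ghz,
                               target_cores, target_ghz)
    return actual_score / projected


# Hypothetical placeholder numbers (NOT from the review):
# a 2 core chip at 2.0Ghz scores 1000, a 4 core chip at 2.5Ghz scores 2450.
eff = scaling_efficiency(2450, 1000, 2, 2.0, 4, 2.5)
print(f"scaling: {eff:.0%}")
```

A result above 100% when the baseline is the older architecture is what points to a per-clock improvement rather than just more cores.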

Next we examine the controversial Cinebench scores. At the bottom we can see that K8 needs 20% more clock speed to run neck and neck with E6750. With this benchmark we see a problem though. K10 only scales 88% of 6400+. This isn't far off of Q6600's 90% scaling. However, when we look at the Penryns we see something different. The 3.0 and 3.2Ghz Penryns show perfect scaling from E6750. We know that this is not due to code tuning since Q6600 is slower. We know that it is not due to memory bandwidth since 9900 is slower. This pretty much only leaves cache size as the determining factor. Unfortunately, a benchmark that fits into Penryn's larger cache to run full speed is worthless for performance comparison. So, Cinebench is still up in the air. However, given the improvements that we saw with Pov-Ray I think the Cinebench scores can be ignored for now. When Shanghai is eventually released it will have a good deal more L3 cache than K10. If its Cinebench scores perk up then we will know for certain that Cinebench is useless due to cache tuning.

I could go over the rest of the benchmarks in the review but for the most part they are useless since they do not utilize all the cores. And, if we aren't using all the cores then why not just buy a dual core system? We can see that there are still questions of how benchmarks are compiled and that Intel may still be getting an edge due to the use of its compiler which is still heavily tilted in its favor. We can see that a lot of software still has not caught up to the use of 128 bit SSE codes which is the only way that C2D or K10 show their SSE strength. Older processors as far back as PIII work quite well with 64 bit SSE as does K8. And, there still remains the problem of multi-threading since a quad core is useless without quad threads. Nevertheless, this has answered a couple of questions. There is no big design defect in the K10 memory controller as earlier benchmarks suggested and K10 is clearly faster than K8 in terms of Pov-Ray. Unfortunately, we still have not turned up any real proof of whether K10's 128 bit SSE functions are on par with C2D's (and roughly twice as fast as K8's). Hopefully, this information will turn up eventually.

For now, all we can say is that assuming K10 is a significant improvement on K8, AMD needs to get it out the door at speeds of 2.6Ghz and above and sooner rather than later. For AMD's sake we also have to wonder if 45nm is truly still on track which should mean skipping some 65nm parts or whether this has slid as well. I suppose nothing but time will tell. Worst case for AMD would be Intel with 3.2Ghz and 3.33Ghz desktop Penryns and Nehalem launched to tackle the server market while AMD struggles to reach 2.8Ghz on 65nm with 45nm pushed back to 2009. Best case would probably be starting production of 45nm Shanghai parts in Q2 with delivery in Q3 with parts available in clock speeds of at least 3.0Ghz while still fitting within the normal TDP and Intel still at 3.2Ghz with Penryn. The sad part for AMD is that I could see either of these scenarios happening and neither one is particularly bad for Intel. We all know what AMD's New Year's resolution should be; we just don't know if AMD is capable of living up to it. C2D has some serious teeth and the fangs are just getting longer with Penryn; K10 with milk teeth is far less threatening.

Monday, November 26, 2007

AMD: All Dressed Up, But No Place To Go

AMD's current position is frustrating at best. Although AMD seems to have gotten its ATI offerings in order, its K10 offerings lag expectations in almost every way.

AMD's previous 65nm ATI 29xx GPU offerings looked a bit dismal compared to nVidia's. The high power draw and low performance justified more than one lukewarm review. The best ATI product, HD 2900XT, was only close to the nVidia 8800GT (at higher power draw) but nowhere near the 8800GTX. AMD's response to this seemed more than a little strange. They said that they intended to compete against GTX with Crossfire. However, it seemed difficult to imagine anyone really wanting double the power draw and heat of 2900XT. It seems though that with the new 38xx series on the 55nm process, AMD has gotten the power draw under control. And, with the reduced die size they can probably sell them at the reduced 2900 price and still make money. And, it looks like 38xx will actually make Crossfire a viable competitor against GTX.

AMD's new 55nm based 7xx chipsets look like winners of the first order. And, AMD's Overdrive utility looks very good. It is amazing to think about tweaking the overclock on a system with a factory control panel. The GPU's, the chipsets, and the Overdrive utility dress up any new AMD system fit for a ball. The problem is that these systems require an equally good processor but, so far, AMD has failed to deliver.

But, it isn't just one problem with K10; it is many. From lagging volumes to a clock ceiling of just 2.3Ghz to slow NorthBridge and L3 cache speeds to no clear indication of improvement over K8. Not only is there no clear indication of when things might be fixed there is not even a clear indication of what the problem actually is. Low volumes would suggest poor yields and ramping problems. Low clocks could suggest either process problems or architecture problems. The lack of 2.4Ghz speeds was blamed on a bug that won't be fixed until the next revision. This is in contrast with both reviews that overclocked to 2.6Ghz with no trouble and AMD's own 3.0Ghz demo. Under normal circumstances AMD should be able to deliver a demo chip in volume about 6 months later. That AMD does not appear ready to deliver 3.0Ghz K10's anytime in Q1 suggests a big problem of some kind. The poor performance also contrasts with statements from different sources who suggested that K10 had great performance at higher speeds.

The reviews aren't much help either. OC Workbench shows performance for Phenom close to that of Kentsfield at the same clock with a few exceptions. For example the Multimedia - Int x8 iSSE3 score indicates either a compiler problem or an architecture problem. Phenom's score is only about 1/3rd what it should be after turning in reasonably good scores in the other categories. The Cinebench scores are odd since Phenom appears to speed up more than 4X as all four cores are used compared to 3.53X for Kentsfield. A really good speedup would be 3.8-3.9X while 4.0 should be impossible. Something funky is definitely going on to get 4.35X. Phenom also falls off quite a bit in the TMPGenc 4.0 Express test.
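The reason 4.35X looks wrong is simple arithmetic: with four cores, ordinary multi-threaded scaling tops out at 4X, so anything higher means the single-threaded run was being held back by something. A small sketch of the check follows, using only the two speedups quoted above. The function names are hypothetical.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores


def is_plausible(speedup, cores):
    """Ordinary multi-threaded scaling should not exceed 1X per core."""
    return speedup <= cores


CORES = 4
reported = {"Phenom": 4.35, "Kentsfield": 3.53}  # speedups quoted in the text

for chip, speedup in reported.items():
    eff = parallel_efficiency(speedup, CORES)
    flag = "ok" if is_plausible(speedup, CORES) else "suspicious"
    print(f"{chip}: {speedup:.2f}X on {CORES} cores, efficiency {eff:.0%} ({flag})")
```

An efficiency above 100% does not say what the cause is. It only says the one core baseline and the four core result are not measuring the same thing.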

The Anandtech Phenom Review suggests that Intel might be having some small difficulty with clocks on Penryn. However, this difficulty would be so small compared to AMD's that it is hardly noticeable. It only means that 3.2Ghz desktop chips won't be out until Q1. Since this is also when AMD is releasing 2.4Ghz this could easily put Intel at a 33% faster clock. Other statements in this review suggest that Intel may have streamlined its internal organization considerably which would also be bad news for AMD. This review also mentions that K10's L3/NB can't clock higher than 2.0Ghz. Anandtech does however confirm how nice AMD's Overdrive Utility is. The Anandtech benchmarks show generally slower performance for Phenom at the same clock as Kentsfield with higher power draw for Phenom. So, the Anandtech scores don't really show us where a problem might be.

I was hoping that when the November 2007 Top 500 HPC list came out that I would get some new test scores that would settle the question about whether K10 was better than K8. I downloaded the latest Top 500 list as a spreadsheet. I was delighted to see a genuine Barcelona score. HPC systems typically have well tuned hardware, operating systems, and software so benchmarks from these should be accurate. I was hoping the HPC numbers would avoid any question of unfavorable hardware, OS, or compiler.

I began crunching numbers. I normalized the scores based on the number of processors and clock speed. I was happy to see very tight clustering at 2.0 for dual core Opteron. Then I ran the numbers for Barcelona and it showed 4.0. This was double the value for dual core. This seemed to be a big problem to me since with twice as many cores and double the SSE width it would seem that K10's top SSE speed should be twice per core or four times larger for Barcelona. In other words, I was expecting something around 8.0 but didn't see that. This would suggest a problem.
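For anyone who wants to repeat this with the spreadsheet, the normalization is just the listed score divided by processor count times clock speed, which gives floating point results per processor per clock. Below is a rough sketch under my own assumptions: the field names are hypothetical, and the sample rows are made-up placeholders chosen to land on round values, not real Top 500 entries.

```python
def normalized_score(score_gflops, processors, clock_ghz):
    """GFLOPS divided by (processors x Ghz), which works out to
    floating point results per processor per clock cycle."""
    return score_gflops / (processors * clock_ghz)


# Made-up placeholder rows (NOT real Top 500 entries).
systems = [
    {"name": "example dual core cluster", "gflops": 8000.0,
     "processors": 2000, "ghz": 2.0},
    {"name": "example quad core cluster", "gflops": 16000.0,
     "processors": 2000, "ghz": 2.0},
]

results = {
    s["name"]: normalized_score(s["gflops"], s["processors"], s["ghz"])
    for s in systems
}

for name, value in results.items():
    print(f"{name}: {value:.1f}")
```

Whether the list's processor column counts sockets or cores changes what the normalized value means, so that needs to be checked against the list's own column definitions before comparing chips.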

However, I then ran the Woodcrest and Clovertown numbers. Woodcrest clustered tightly at 4.0. This was not a surprise since it too has twice the SSE width of K8. Unfortunately, the Clovertown numbers continued to cluster at 4.0. This was a surprise since (with twice as many cores) I was expecting 8.0 for Clovertown as well. So, unfortunately, these scores are inconclusive. K10 is showing the same normalized score as Clovertown but neither is showing any increase in speed over Woodcrest. The questions still remain unanswered.

The bottom line is that AMD is having problems. Whether these problems are due to process, design flaws, or unfixed bugs is hard to say. AMD is also under the gun on time. For example if Phenom is only hitting 2.4Ghz in Q1 then that leaves no time to start production of Shanghai in Q2 for a Q3 release. I suppose it is possible that AMD could fix a serious design flaw with the Shanghai release but it is by no means certain. What is certain though is that AMD will have to do much better to have any chance of increasing its share up to profitable levels in 2008.

Monday, November 05, 2007

Has Intel's Process Tech Put Them Leagues Ahead?

There has been a lot of talk lately suggesting that Intel is way ahead of AMD because of superior R&D procedures. Some of the ideas involved are rather intriguing so it's probably worth taking a closer look.

You have to take things with a grain of salt. For example, there are people who insist that it wouldn't matter if AMD went bankrupt because Intel would do its very best to provide fast and inexpensive chips even without competition. Yet these same people will, in the next breath, also insist that the reason why Intel didn't release 3.2Ghz chips in 2006 (or 2007) was because, "they didn't have to". I don't think I need to go any further into what an odd contradiction of logic that is. At any rate, the theory being thrown around these days is that Intel saw the error of its ways when it ran into trouble with Prescott. Then it scrambled and made sweeping changes to its R&D department, purportedly mandating Restricted Design Rules so that the Design staff stayed within the limitations of the Process staff. The theory is that this has allowed Intel to be more consistent with design and to leap ahead of AMD which presumably has not instituted RDR. The theory also is that AMD's Continuous Transistor Improvement has changed from a benefit to a drawback. The idea is that rather than continuous changes allowing AMD to advance, these changes only produce chaos as each change spins off unexpected tool interactions that take months to fix.

The best analogy of RDR that I can think of is Group Code Recording and Run Length Limited recording. Let's look at magnetic media like tape or the surface of a floppy disk. Typically a '1' bit is recorded as a change in magnetic polarity while a '0' is no change. The problem is that this medium can only handle a certain density. If we try to pack too many transitions too closely together they will blend and the polarity change may not be strong enough to detect. Now, let's say that a given magnetic medium is able to handle 1,000 flux transitions per inch. If we record this directly then we can do 1,000 bits per inch. However, Modified Frequency Modulation (MFM) puts an encoding bit between data bits to ensure that we don't get two 1 bits in a row. This means that we can actually put 2,000 encoded bits per inch and of this 1,000 bits is actual data. We can see that although MFM expanded the bits by a factor of 2 there was no actual change in data density. However, by using more complex encoding we can actually increase density. By using (1,7) RLL we can record the same 2,000 encoded bits per inch but we get 1,333 data bits. And, by using (2,7) RLL we space out the 1 bits even further and can triple the recording density to 3,000 encoded bits per inch. Since half of these are data bits, this increases our data bits by 50% to 1,500. GCR is similar as it maps a group of bits into a larger group which allows elimination of bad bit patterns. You can see a detailed description of MFM, GCR, and RLL at Wikipedia. The important point is that although these encoding schemes initially make the data bits larger they actually allow greater recording densities.
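The density numbers can be worked out from two properties of each code. In a (d,k) run length limited code, d is the minimum number of 0 bits between two 1 bits, so the medium can hold d+1 encoded bits for every flux transition it can resolve. The code rate is the fraction of encoded bits that are real data. Here is a small sketch of that model, assuming the usual rates of 1/2 for MFM, 2/3 for (1,7) RLL and 1/2 for (2,7) RLL. For (2,7) RLL this model gives 3,000 encoded bits per inch, which is what produces the 1,500 data bit figure. The function name is my own.

```python
from fractions import Fraction


def data_bits_per_inch(flux_per_inch, d, rate):
    """Return (encoded bits per inch, data bits per inch).

    d    -- minimum number of 0 bits between 1 bits in the encoded stream
    rate -- data bits per encoded bit (the code rate)
    """
    encoded = flux_per_inch * (d + 1)
    return encoded, encoded * rate


FLUX = 1000  # flux transitions per inch, as in the example above

codes = {
    "direct":    (0, Fraction(1, 1)),
    "MFM":       (1, Fraction(1, 2)),
    "(1,7) RLL": (1, Fraction(2, 3)),
    "(2,7) RLL": (2, Fraction(1, 2)),
}

for name, (d, rate) in codes.items():
    encoded, data = data_bits_per_inch(FLUX, d, rate)
    print(f"{name}: {encoded} encoded bits/inch, {int(data)} data bits/inch")
```

The pattern is the point of the analogy: the code that looks most wasteful per bit can still come out ahead because its restrictions let the medium be driven harder.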

RDR would be similar to these encoding schemes in that while it would initially make the design larger it would eliminate problem areas which would ultimately allow the design to be made smaller. Also, RDR would theoretically greatly reduce delays. When we see that Intel's gate length and cache memory cell size are both smaller than AMD's and we see the smooth transition to C2D and now Penryn we would be inclined to give credit to RDR much as EDN editor Ron Wilson did. You'll need to know that OPC is Optical Proximity Correction and that DFM is Design For Manufacturability. One example of OPC is that you can't actually have square corners on a die mask so this is corrected by rounding the corners to a minimum radius. DFM just means that Intel tries very hard not to design something that it can't make. Now, DFM is a good idea since there are many historical examples of designs from Da Vinci's attempts to cast a large bronze horse to the Soviet N1 lunar rocket that failed because manufacturing was not up to design requirements. There are also numerous examples from the first attempts to lay the Transatlantic Telegraph Cable (nine year delay) to the Sydney Opera House (eight year delay) that floundered at high cost until manufacturing caught up to design.

I've read what both armchair and true experts have to say about IC manufacturing today and to be honest I still haven't been able to reach a conclusion about the Intel/RDR Leagues Ahead theory. The problems of comparing the manufacturing at AMD and Intel are numerous. For example, we have no idea how much is being spent on each side. We could set upper limits but there is no way to tell exactly how much and this does make a difference. For example, if the IBM/AMD process consortium is spending twice as much as Intel on process R&D then I would say that Intel is doing great. However, if Intel is spending twice as much then I'm not so sure. We also know that Intel has more design engineers and more R&D money than AMD does for the CPU design itself. It seems that this could be the reason for smaller gate size just as much as RDR. It is possible that differences between SOI and bulk silicon are factors as well. On the other hand, the fact that AMD only has one location (and currently just one FAB) to worry about surely gives them at least some advantage in process conversion and ramping. I don't really have an opinion as to whether AMD's use of SOI is a good idea or a big mistake. However, I do think that the recent creation of the SOI Consortium with 19 members means that neither IBM nor AMD is likely to stop using SOI any sooner than 16nm which is beyond any current roadmap. I suppose it is possible that they see benefits (from Fully Depleted SOI perhaps) that are not general knowledge yet.

There is at least some suggestion in Yawei Jin's doctoral dissertation that SOI could have continuing benefits. The paper is rather technical but the important points are that SOI begins having problems at smaller scale.

"we found that even after optimization, the saturation drive current planar fully depleted SOI still can’t meet 2016 ITRS requirement. It is only 2/3 of ITRS requirement. The total gate capacitance is also more than twice of ITRS requirement. The intrinsic delay is more than triple of ITRS roadmap requirement. It means that ultra-thin body planar single-gate MOSFET is not a promising candidate for sub-10nm technology."

The results for planar double gates are similar: "we don’t think ultra-thin body single-gate structure or double-gate structure a good choice for sub-10nm logic device."

However, it appears that "non-planar double gate and non-planar triple-gate . . . are very promising to be the candidates of digital devices at small gate length." But, "in spite of the advantages, when the physical gate length scales down to be 9nm, these structures still can’t meet the ITRS requirements."

So, even though AMD and IBM have been working on non-planar, double gate FinFET technology, this does not appear sufficient. Apparently this would have to be combined with novel materials such as GaN in order to meet the requirements. It then appears that it is possible for AMD and IBM to continue using SOI down to a scale smaller than 22nm. So, it isn't clear that Intel has any longterm advantage by avoiding SOI based design.

However, even if AMD is competitive in the long run that would not prevent AMD from being seriously behind today. Certainly when we see reports that AMD will not get above 2.6Ghz in Q4 that sounds like anything but competitive. When we combine these limitations with glowing reports from reviewers who proclaim that Intel could do 4.0Ghz by the end of 2008 this disparity seems insurmountable. The only problem is that the same source that says that 2.6Ghz Phenom will be out in December or January also says Fastest Intel for 2008 is 3.2GHz quad core.

"Intel struggles to keep its Thermal Design Power (TDP) to 130W and its 3.2GHz QX9770 will be just a bit off that magical number. The planned TDP for QX9770 quad core with 12MB cache and FSB 1600 is 136W, and this is already considered high. According to the current Intel roadmap it doesn’t look like Intel plans to release anything faster than 3.2GHz for the remainder of the year. This means that 3.2 GHZ, FSB 1600 Yorkfield might be the fastest Intel for almost three quarters."

But this is not definitive: "Intel is known for changing its roadmap on a monthly basis, and if AMD gives them something to worry about we are sure that Intel has enough space for a 3.4GHz part."

So, in the end we are still left guessing. AMD may or may not be able to keep up with SOI versus Intel's bulk silicon. Intel may or may not be stuck at 3.2Ghz even using 45nm. AMD may or may not be able to hit 2.6Ghz in Q4. However, one would imagine that even if AMD can hit 2.6Ghz in December that only 2.8Ghz would be likely in Q1 versus Intel's 3.2Ghz. Nor does this look any better in Q2 if AMD is only reaching 3.0Ghz while Intel manages to squeeze out 3.3 or perhaps even 3.4Ghz. If AMD truly is the victim of an unmanageable design process then they surely realized this by Q2 06. However, even assuming that AMD rushed to make changes I wouldn't expect any benefits any sooner than 45nm. The fact that AMD was able to push 90nm to 3.2Ghz is also inconclusive. The fact that AMD was able to get better speed out of 90nm than Intel was able to get out of 65nm could suggest more skill on AMD's part or it could suggest that AMD had to concentrate on 90nm because of greater difficulty with transistors at 65nm's smaller scale. AMD was delayed at 65nm because of FAB36 while Intel needs a fixed process for distributed FAB processing. Too often we end up with apples to oranges when we try to compare Intel with AMD. Also, if Intel is doing so well compared to AMD on power draw, we have to wonder why Supermicro just announced World's Densest Blade Server with Quad-Core AMD Opteron Processors.

To be honest I haven't even been able to determine yet if the K10 design is actually meeting the design parameters. There is a slim possibility that K10's could show up in the November Top 500 Supercomputer list. This would be definitive because HPC code is highly tuned for best performance and there are plenty of K8 results for comparison. Something substantially less than twice as fast per core would indicate a design problem. Time will tell.

Friday, October 19, 2007

Q3 2007 Earnings, AMD & Intel

With AMD's Q3 Earnings information added to Intel's we can see that Intel is doing quite well while AMD is still struggling. Nevertheless the announcements have dispelled many rumors that have followed the two over the course of this year.

Intel's revenues and margins are up this quarter and it is obviously doing well. In contrast AMD lost $400 Million and has much lower margins. One could be forgiven for confusing today with early 2003 when Intel was doing well and AMD was struggling to turn a profit as K8 was released. However, today is quite different from 2003 since AMD is now firmly established in servers, mobile, and graphics. In early 2003 AMD had barely touched the server market with Athlon MP, had only a single chipset for Opteron and had yet to create mobile Turion. It is interesting though to review the common rumors that have surrounded Intel and AMD this year to see where perception differs from reality.

A commonly repeated rumor was that AMD was teetering on the edge of bankruptcy and would likely have to file after another bad quarter or two. However, it is clear from the reduction in stock holder equity this quarter that AMD could survive at least another year with similar losses each quarter. If AMD improves in Q4 then bankruptcy seems unlikely. Also along these lines was the persistent rumor that AMD would sell FAB 30. AMD's announcement of pilot conversion to 300mm/45nm (as I indicated months ago) shows that AMD will not be forced back to a single FAB.

A similar myth was that AMD would need to be purchased by another company to be bailed out. The biggest problem with this idea is that there really isn't anyone to buy AMD. Companies like Samsung and IBM would be competing with their own customers with such a purchase and companies like Texas Instruments and Motorola have been out of the frontline microprocessor business for many years.

It has often been suggested that Intel could simply flood the market with bargain priced chips to deprive AMD of revenue. However it is clear from Intel's statements that they were unable to meet demand during the third quarter without drawing down inventories. Since the fourth quarter is typically the highest volume of the year it is unlikely that Intel will be able to cover all of the demand. Leaving Q4 with lower inventories would further prevent Intel from capturing additional demand in Q1. This means that Intel will likely be unable to really squeeze AMD until the second quarter of 2008 when 45nm production is up to speed and demand falls off from Q1 08.

Lately, it has been fashionable to suggest that AMD was abandoning the channel yet AMD posted record sales to the channel in Q3. It was also common to see bashing of ATI's offerings versus nVidia. Yet the increased sales from chipsets and graphics indicate that ATI is gaining ground after its loss of direct business with Intel. The ATI assets should continue to add to profits in Q4. Secondly, as AMD and nVidia strive to gain advantage there is no doubt that Intel is being left behind. With new chipset offerings from AMD in Q4 and Q1, Intel's chipsets will once again be second rate in terms of graphics. The company most likely to be hurt is VIA as the move by Intel and AMD into the small form factor mini-ITX size commoditizes the market and removes most of the previous profits. One has to wonder how much longer VIA can stay in this market as it gets squeezed both in terms of chipsets and low end cpu's. One has to wonder as well if Transmeta might soon become an AMD acquisition as it similarly struggles at the low end.

Intel was often described as moving its 45nm timetable forward one quarter while AMD was often described as falling further and further behind. However, the demand shortages would be consistent with rumors that Intel was having trouble moving 45nm production beyond D1D before Q1 08. Presumably the Penryn chips that will be available in Q4 07 will be from D1D. AMD's statement countered rumors of a slipping 45nm schedule by repeating its timetable of first half of 2008. Intel's supposed one quarter pull in for 45nm was clearly fictitious since Intel's own Tick Tock schedule would be Q4 2007 or exactly two years since 65nm in Q4 2005. So, Intel is on schedule rather than moving the schedule forward. AMD's statement was a bit of a surprise since the previous timeline had been "midyear" for 45nm. This could easily have been Q3 but AMD's wording of first half of 2008 means late Q2 is more likely. If AMD really is on this track then half of Intel's previous process lead will be gone in another two quarters. This would also mean that AMD will have moved to 45nm in just 18 months rather than 24 months as Intel has done.

A current area of confusion is whether demoed chips are at all representative of actual production. It has become apparent that Intel's initial production from D1D is of very high quality. However, quality seems to fall somewhat as production is moved to additional FABs. In the overclocking community, it has been suggested that Intel's bulk production did not catch up to the quality of the early D1D chips until the recent release of the G0 stepping. This suggests that Intel's bulk production quality lags its initial production quality by a full year. This would seem to explain having to destroy the initial 45nm chips. Clearly, Intel's demos are not indicative of actual production as seen by the lack of chips clocked above 3.0Ghz. This leaves the question of whether AMD's demo of a 3.0Ghz K10 was similarly cherry picked. A late Q1 or early Q2 08 release of a 3.0Ghz FX Phenom would be consistent with the demo. However, a late Q2 release would mean that AMD was equally guilty.

A lot of faith was put into Intel's mobile Centrino position. However, this quarter it is clear that Intel lost some mobile ground. Intel faces a much greater challenge as Griffin is released with an all new mobile chipset in Q1 08. All early indications are that mobile Penryn is unable to match Griffin's power draw. This will likely mean continued erosion of Intel's mobile position.

The server market is another area that tends to run counter to rumors. After Intel's very strong showing with Woodcrest and Clovertown it has taken about as much 2-way share from AMD as it can. The suggestion has been that Intel could hold its 2-way share while taking 4-way share from AMD with Tigerton. This scenario is almost certainly incorrect. Barcelona should provide strong competition in 2-way servers from Q4 07 forward so Intel is likely to lose some of its server share. And, Tigerton is only a slightly modified 65nm Clovertown which will have great difficulty overcoming the high power draw of its quad FSB, FBDIMM chipset. This makes it unlikely for Tigerton to do more than hold the line as a replacement for Tulsa. It appears that Intel's true 4-way champion, Nehalem, won't arrive until Q4 (the normal Tick Tock schedule) by which time AMD's 45nm Shanghai will already be out. I have no doubt that Nehalem could show very well against Shanghai since Nehalem will have more memory bandwidth. However, a big question is whether this will move down to the desktop. There still seems to be some confusion too as to whether or not Intel is really willing to discard its lucrative chipset sales and FSB licensing. The strength of Nehalem is also a two edged sword as Intel will simultaneously lose its die size (and cost) advantages since Nehalem is a true monolithic quad core die. Power draw is also unclear until it is known whether Nehalem has a direct IMC or some type of intermediate bus like AMD's G3MX. For example if Nehalem relies on FBDIMM as the current C2D generation does then Nehalem may have a tough time being competitive in performance/watt.

It now appears that AMD has indeed moved up its own timetable for delivering 2.6Ghz chips. These seem scheduled for release in Q4 whereas the previous roadmap had them appearing in Q2 08. This would likely mean 2.8Ghz chips from AMD in Q1; however, Intel could maintain a comfortable lead with 3.1Ghz or faster chips of its own. The number of partners added to the IBM/AMD research consortium also vanquished persistent rumors that AMD would toss SOI at 32nm.

However, the most surprising thing that has come up is AMD's view of the market. The popular wisdom was that AMD would be forced to back down and give up share to Intel. Yet in spite of Intel's good showing in profits, AMD is determined to increase revenues to $2 Billion a quarter. This would mean not only holding onto current share but taking more share from Intel. The $2 Billion mark is unlikely soon but it does seem that a near term loss of, say, $100-200 Million would be considerably better than the current $400 Million loss. The $2 Billion goal also seems to fit with AMD's previously stated goal of 30% share by end of 2008. I don't have AMD's volume share numbers yet for Q3 but I recall 23% from Q2. If AMD is at, say, 24% now then a 25% increase would be 30% and a similar 25% increase with the $1.6 Billion gross revenue would be $2 Billion. Assuming that AMD can hit its 45nm Shanghai target, I can't see any reason why this wouldn't be possible. It would mean ramping FAB 36 by the normal schedule and then partially ramping FAB 38 in Q1 and Q2. This would allow production to increase in Q2 and Q3 which would keep delivered chip volume increasing in Q3 and Q4 when FAB36 will already be topped out. Since Intel's 45nm ramp is slower than AMD's it is possible for AMD to blunt Intel's price advantage as long as Shanghai stays on track.
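The share and revenue arithmetic above can be checked with a short sketch. The 24% starting share is only the guess made in the paragraph, and the helper name `scale` is purely illustrative.

```python
# Back-of-the-envelope check of the share and revenue targets discussed above.
# The 24% share is a guess from the text, not a reported figure.

def scale(value, growth):
    """Return value increased by the given fractional growth."""
    return value * (1.0 + growth)

current_share = 0.24     # assumed Q3 volume share (a guess, per the text)
current_revenue = 1.6    # quarterly gross revenue in $ Billion, per the text
growth = 0.25            # the 25% increase under discussion

target_share = scale(current_share, growth)
target_revenue = scale(current_revenue, growth)

print(f"share:   {current_share:.0%} -> {target_share:.0%}")
print(f"revenue: ${current_revenue:.1f}B -> ${target_revenue:.1f}B")
```

Under those assumptions the same 25% growth lands on both the 30% share goal and the $2 Billion revenue goal, which is why the two targets look consistent with each other.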

In spite of the rumors that have been dispelled we are left with a number of unanswered questions. Is there a bug that is holding back K10's performance? What speeds will Intel and AMD have in Q1? Is Intel really willing to toss its chipset revenue with Nehalem while losing its current cost advantages? Will the additional partners that have joined IBM and AMD in SOI research allow AMD to match or even exceed Intel's process tech? Can AMD really increase its share by 25%? Is AMD's faster adoption of Immersion a technical advantage or a technical curse? Will Shanghai be able to catch Penryn or hold AMD's position against Nehalem? The sad part is that it will take another year to know all of the answers for certain.

Thursday, October 11, 2007

Waiting For The Wind To Blow

We seem to be trapped in the Doldrums lately, waiting. We wait for faster clocks and desktop versions of K10 from AMD and 45nm desktop chips from Intel. We wait to find out just how things really stack up.

I'm familiar with the changes in architecture between K8 and K10. These changes are quite good but, strangely, we haven't seen this reflected in the Barcelona reviews. We are well past the point of pretending that everything is normal at AMD. It seems to me that there are only four possibilities:

The benchmarks are not compiled properly for K10.
The reviews were not very high quality.
There is a bug in the revisions that have been tested.
There is a major flaw in the K10 architecture.

It wouldn't surprise me if a lot of the standard benchmarks we see claiming to compare Clovertown with Barcelona actually use the Intel compiler which has been known for quite some time to handicap AMD processors. There has also been some suggestion that Intel has spent time tuning their code specifically for these benchmarks.

There is also little doubt that the Barcelona reviews were both rushed and sloppy, much sloppier than the usual testing we see from places like Anandtech. However, even the Tech Report review was disappointing since the author didn't seem to have much grasp of the technical aspects of K10.

However, I've also seen benchmarks at SPEC. These are somewhat difficult to compare because the IBM sponsored scores change both operating system and compiler when moving from K8 Opteron to K10 Opteron. Taken at face value these would show an increase in core speed of 11% for FP and 14% for Integer for K10. A 14% increase in Integer is good enough that I can't say categorically that it is incorrect. However, the FP score is a problem since the FP pipeline for K10 should be nearly 100% faster. It's hard to claim that it wasn't compiled correctly because it used Portland Group's compiler which should fully support K10. In fact, AMD has been working closely with Portland Group to ensure that it does. We also can't claim test bias because the testing was performed by AMD itself.

Given the SPEC scores created by AMD with the PGI compiler, I think we have to assume that K10 either has a bug that hasn't been fixed yet or has a serious architectural flaw that is preventing full FP speed. Earlier I had wondered if K10 had a bug in the L1 cache. This still seems to be a strong possibility. The problem is that this type of bug would not appear in the list of errata. Secondly, as strapped for cash as AMD is these days they would have no reason to make this public since it would likely mean fewer K10 sales. The only real difference between a bug and a major flaw is how long it takes to fix it. In terms of architecture, something like an L1 cache bug should be fixable in six months.

A minor bug could be fixed in as little as eight weeks (if rushed) but it would take another month for the new chips to get into circulation. However, for a standard run the fix would take fifteen weeks and at least another month to ship. So, four and a half to six months would be typical. The problem however is that the 45nm version, Shanghai, runs with the same circuitry so a flaw in K10 pushes Shanghai back until the flaw is fixed. The fact that no tapeout announcement for Shanghai has been made yet supports this scenario. AMD originally planned to release Shanghai at midyear so tapeout should have occurred in July. AMD has maintained that the 45nm immersion process is on track so this would seem to only leave a design problem. If this is indeed the case then this is both good and bad. It is clearly bad because the last thing AMD needs at this point is another problem getting in the way of competitive production. However, in an odd way I suppose it would be good if the test scores we've seen do not represent the actual potential of K10. For now there is nothing to do but wait.

Wednesday, September 19, 2007

The Top Developments Of 2007

It looks like both AMD and Intel have been as forthcoming as they are likely to be for a while about their long range plans. The most significant items however have little to do with clock speeds or process size.

The two most significant developments have without doubt been SSE5 and motherboard buffered DIMM access. AMD has already announced its plan to handle motherboard buffered DIMMs with G3MX. This is significant because it means the end of registered DIMMs for AMD. With G3MX, AMD can use the fastest available desktop DIMMs with its server products. This is great for AMD and server vendors because desktop DIMMs tend to be both faster and cheaper than registered DIMMs. This is also good news for DIMM makers because it would relieve them of making registered DIMMs for a small market segment and allow them to concentrate on the desktop products. Intel may have the same thing in mind for Nehalem. There have been hints by Intel but nothing firm. I suppose Intel has reason to keep this secret since this would also mean the end of FBDIMM in Intel's long-term plans. If Intel is too open about this it could make customers think twice about buying Intel's current server products which all use FBDIMM. So, whether this is the case with Nehalem or perhaps not until later, it is clear that both FBDIMM and registered DIMMs are on their way out. This will be a fundamental boost to servers since their average DIMM speed will increase. However, this could also be a boost to desktops since adding the server volume to desktop DIMMs should make them cheaper to develop. This also avoids splitting the engineering resources at memory manufacturers so we could see better desktop memory as well.

SSE5 is also remarkable. Some have been comparing this with SSE4 but this is a mistake. SSE4 is just another SSE upgrade like SSE2 and SSE3. However, SSE5 is an actual extension to the x86 ISA. If AMD had been thinking clearer they might have called it AMD64-2. A good indication of how serious AMD is about SSE5 is that they will drop 3DNow support in Bulldozer. This clears away some bit codes that can be used for other things (like perhaps matching SSE4). Intel has already stated that they would not support it. On the other hand, Intel's statement means very little. We know that Intel executives openly lied about their intentions to support AMD64 right up until they did. And, Intel has every reason to lie about SSE5. The 3-way instructions can easily steal Itanium's thunder and Intel is still hoping (and praying) that Itanium will not get gobbled up by x86. Intel is also stuck in terms of competitiveness because it is too late to add SSE5 to Nehalem. This means that Intel would have to try to include it in the 32nm shrink which is difficult without making core changes. This could easily mean that Intel is behind in SSE5 until 2010. So, it wouldn't help Intel to announce support until it has to since supporting SSE5 now would only encourage development for an ISA extension that it will be behind in. Intel is taking the somewhat deceptive approach of working on a solution quietly while claiming not to be. Intel can hope that SSE5 won't become popular enough that it has to support it. However, if it does then Intel can always claim to be giving in to popular demand. It's dishonest but it is understandable for a company that has been painted into a corner.

AMD understands about being painted into a corner. Intel has had the advantage with MCM quad cores since separate dies mean both higher yields and higher clock speeds. For example, on a monolithic quad die you can only bin as high as the slowest core. However, Intel can pick and choose individual dies to put the highest binning ones together. Also, Intel can always pawn off a dual core die with a bad core as a lowly Conroe-L but it would be a much bigger loss for AMD to sell a quad die as a dual core. AMD's creative solution was the Triple Core announcement. This means that any quads with a bad core will be sold as X3's instead of X4's. This does make AMD's ASP look a bit better. I doubt Intel will follow suit on this but then it doesn't have to. For AMD, having an X4 knocked down to an X2 is a big loss but for Intel it just means having a Conroe knocked down to Conroe-L which is not so big. Simply put, AMD needs triple cores but Intel doesn't. On the other hand, just as AMD was forced to release a faster FX chip on the older 90nm process so too it seems Intel has been forced to deliver Tigerton not with the shiny new Penryn core but with the older Clovertown core. Tigerton is basically just Clovertown on a quad FSB chipset. This does suggest at least a bit of desperation since after working on this chipset for over a year Intel will be lucky if it breaks even on sales. To understand what a stumble Tigerton is you only have to consider the tortured upgrade path. In 2006 and most of 2007 Intel's 4-way platform meant Tulsa. Now we get Tigerton which uses the completely incompatible Caneland chipset. No upgrades from Tulsa. And, for anyone who buys a Tigerton system, oops, no upgrade to Nehalem either. In contrast, 4-way Opteron systems should be upgradable to 4-way Barcelona with just a BIOS update. And, if attractive, these should be upgradable to Shanghai as well. After Nehalem though, things become more even as AMD introduces Bulldozer on an incompatible platform. 2009 will without doubt be the year of new sockets.
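The binning point made earlier in this post (a monolithic quad bins at its slowest core, while MCM lets dies be matched after testing) can be illustrated with a toy model. The per-core clock limits below are made-up numbers chosen only to show the effect, and the function names are hypothetical.

```python
# Toy model of binning. All clock limits are hypothetical, not measured data.

def monolithic_bin(core_ghz):
    """A monolithic quad is limited by its slowest core."""
    return min(core_ghz)

def mcm_bins(dual_dies):
    """Bin each dual core die by its slower core, then pair the best dies together."""
    die_bins = sorted((min(die) for die in dual_dies), reverse=True)
    return [min(die_bins[i], die_bins[i + 1]) for i in range(0, len(die_bins) - 1, 2)]

# Hypothetical maximum stable clocks (GHz) for eight cores.
cores = [3.0, 2.9, 2.6, 2.7, 3.0, 3.0, 2.4, 2.5]

# Monolithic: cores 0-3 share one die and cores 4-7 share the other.
quads = [monolithic_bin(cores[0:4]), monolithic_bin(cores[4:8])]

# MCM: each adjacent pair of cores is one dual core die, matched after testing.
duals = [cores[i:i + 2] for i in range(0, 8, 2)]
packages = mcm_bins(duals)

print("monolithic quad bins:", quads)
print("MCM quad bins:       ", packages)
```

With these invented numbers the best monolithic quad bins at 2.6Ghz while the best MCM package bins at 2.9Ghz, even though both are built from the same eight cores.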

For the first time in quite a while we see Intel hitting its limits. Intel's 3.33Ghz demo had created the expectation of cool running 3.33Ghz desktop chips with 1600Mhz FSBs. It now appears that Intel will only release a single 45nm desktop chip in 2007 and it will only be clocked at 3.0Ghz. The chip only has a 1333Mhz FSB and draws a whopping 130 Watts. Thus we clearly see Intel's straining to deliver something faster much as AMD did recently with its 3.2Ghz FX. However, Intel is not straining because of AMD's 3.2Ghz FX chip (which clearly is no competition). Intel is straining because of AMD's server volume share. In the past year, AMD's server volume has dropped from about 25% to only 13%. Now with Barcelona, AMD stands to start taking share back. There really isn't much Intel can do to prevent this now that Barcelona is finally out. But any server chip share that is lost is a double blow because server chips are worth about three times as much as desktop chips. This means that any losses will hurt Intel's ASP and boost AMD's by much more than a similar change in desktop volume would. So, Intel is taking its best and brightest 45nm Penryn chips and allocating them all to the server market to try to hold the line against Barcelona. Of the 12% that Intel has gained it is almost certain to lose half back to AMD in the next quarter or two, but if it digs in, then it might hold onto the other half. This means that the desktop gets the short end of the stick in Q1 2008. However, by Q2 2008, Intel should be producing enough 45nm chips to pay attention to the desktop again. I have to admit that this is worse than I was expecting since I assumed Intel could do a 3.33Ghz desktop chip by Q1. But now it looks like 3.33Ghz will have to wait until Q2.
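To see why server share matters so much for ASP, here is a rough sketch. Only the 3:1 price ratio comes from the text; the unit counts and the 90/10 starting mix are hypothetical values picked to make the arithmetic easy to follow.

```python
# Rough ASP model. Only the 3:1 server/desktop price ratio comes from the text.

DESKTOP_PRICE = 1.0   # relative price unit (assumption: only the ratio matters)
SERVER_PRICE = 3.0    # "about three times as much" as a desktop chip

def revenue(desktop_units, server_units):
    """Total revenue in relative price units."""
    return desktop_units * DESKTOP_PRICE + server_units * SERVER_PRICE

def asp(desktop_units, server_units):
    """Average selling price across all chips sold."""
    return revenue(desktop_units, server_units) / (desktop_units + server_units)

base = asp(90, 10)           # hypothetical mix: 90 desktop chips, 10 server chips
more_desktop = asp(96, 10)   # gain 6 desktop units
more_server = asp(90, 16)    # gain 6 server units instead

print(f"base ASP:          {base:.3f}")
print(f"+6 desktop chips:  {more_desktop:.3f}")
print(f"+6 server chips:   {more_server:.3f}")
```

In this sketch the same six units add three times as much revenue when they are server chips, and they pull ASP up rather than down.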

AMD is still a bit of a wild card. It doesn't appear that they will have anything faster than 2.5Ghz in Q4 but 3.0Ghz might be doable by Q1. Certainly, AMD's demo would suggest a 3.0Ghz in Q1 but as we've just seen, demos are not always a good indicator. Intel's announcement that Nehalem has taped out is also a reminder that AMD has made no such announcement for Shanghai. AMD originally claimed mid 2008 for Shanghai and since chips normally appear about 12 months after tapeout we really should be seeing a tapeout announcement very soon if AMD is going to release by Q3 2008. There is little doubt that AMD needs 45nm as soon as possible to match Intel's costs as Penryn ramps up. A delay would seem odd since Shanghai seems to have fewer architecture changes than Penryn. AMD needs a tapeout announcement soon to avoid rumors of problems with its immersion scanning process.

Sunday, September 16, 2007

Untying The Gordian Knot -- Initial Barcelona Benchmarks

I can't recall the last time that there was so much controversy surrounding the launch of a new product nor can I recall the last time that testing was so poor and analysis so far off the mark. Barcelona launches amid a flurry of conflicting praise and criticism. The problem right now is that many ascribe the positive portrayals of Barcelona to simple acceptance of AMD slide shows and press releases while those who are strongly critical seem to only have the poor quality testing for backing.

I can't really criticize people for misunderstanding the benchmarks since these have included similar poor analysis by the testers themselves. It is unfortunate indeed that there is not one single review that has enough testing to be informative. It is equally unfortunate that proper analysis requires some technical understanding of Barcelona (K10) which has been lacking among typical review sites such as Tech Report and Anandtech.

K8 is able to handle two 64 bit SSE operations per clock. It takes the capacity of two of the three decoders to decode the two instructions each clock. The L1 Data cache bus is able to handle two 64 bit loads per clock. Finally, K8 has two math capable SSE execution units, FMUL and FADD and is able to handle one 64 bit SSE operation on each per clock. Core 2 Duo, however, does better but it is important to understand how. Simply doubling the number of 64 bit operations from two to four would be impossible since this would require four simple decoders and four SSE execution units. C2D only has three available simple decoders and two execution units. C2D gets around these limitations by having execution units that are 128 bits wide which allows two 64 bit operations to be run in pairs on a single execution unit. Consequently, by using 128 bit SSE instructions, C2D only needs two decoders plus two execution units to double what K8 can do. Naturally, though, C2D's L1 Data bus also has to be twice as wide to handle the doubled volume of SSE data. This 2:1 speed advantage for C2D over K8 is not really debatable as it has been demonstrated time and time again on HPC Linpack tests. Any sampling of the Linpack peak scores will show that C2D is twice as fast. Today though, Barcelona's SSE has been expanded in a very similar way to Core 2 Duo's. Whereas two of K8's decoders could process two 64 bit instructions per clock, Barcelona's decoders can process 128 bit SSE instructions at the same rate. The execution units have been widened to 128 bits and are capable of handling pairs of 64 bit operations at each clock. Likewise, the L1 Data cache bus has been widened to allow two 128 bit loads per clock.
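The peak rates described above reduce to a simple product: the number of SSE execution units times how many 64 bit operations fit in each unit. The sketch below is a simplified model using only the unit counts and widths given in the paragraph; the class and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class SseCore:
    name: str
    exec_units: int        # math capable SSE execution units
    unit_width_bits: int   # width of each execution unit

    def ops64_per_clock(self):
        """Peak 64 bit SSE operations per clock under this simplified model."""
        return self.exec_units * (self.unit_width_bits // 64)

cores = [
    SseCore("K8", exec_units=2, unit_width_bits=64),
    SseCore("Core 2 Duo", exec_units=2, unit_width_bits=128),
    SseCore("K10 (Barcelona)", exec_units=2, unit_width_bits=128),
]

for core in cores:
    print(f"{core.name:16s} {core.ops64_per_clock()} x 64 bit ops per clock")
```

This ignores decode, load bandwidth, and instruction latency, so it only describes the ceiling, not what a given benchmark will reach.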

The reason it is necessary to understand the architecture is to understand how the code needs to be changed in order to see the increase in speed. Suppose you have an application that uses 64 bit SSE operations for its calculations. This code will not run any faster on K10 or C2D. Since both K10 and C2D have only two SSE execution units, the code would be bottlenecked at the same speed as K8. The only way to make these processors run at full speed is to replace the original 64 bit operations with 128 bit operations allowing the 64 bit operations to be executed in pairs. Without this change, there is no change in SSE speed. It becomes very important therefore to fully understand whether or not a given benchmark has been compiled in a way that will make this necessary change in the code.
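A small model makes the point concrete: with two execution units, the same amount of work takes the same number of clocks unless the code packs two 64 bit operations into each 128 bit instruction. This is a deliberately crude sketch (one pass per unit per clock, no latency or load effects), and the function name `cycles` is just for illustration.

```python
import math

def cycles(n_ops64, instr_width_bits, exec_units=2, unit_width_bits=128):
    """Clocks needed for n_ops64 64 bit operations in this simplified model.

    Each execution unit performs one pass per clock. A pass carries as many
    64 bit operations as both the instruction and the unit can hold.
    """
    ops_per_pass = min(instr_width_bits, unit_width_bits) // 64
    passes = math.ceil(n_ops64 / ops_per_pass)
    return math.ceil(passes / exec_units)

n = 1000  # arbitrary amount of work, in 64 bit operations

k8_scalar = cycles(n, 64, unit_width_bits=64)     # K8 with 64 bit instructions
k8_packed = cycles(n, 128, unit_width_bits=64)    # K8 splits 128 bit work in two
wide_scalar = cycles(n, 64)                       # K10 or C2D, unchanged 64 bit code
wide_packed = cycles(n, 128)                      # K10 or C2D, 128 bit code

print(k8_scalar, k8_packed, wide_scalar, wide_packed)
```

In this model the unchanged 64 bit code takes 500 clocks everywhere, and only the 128 bit code on the wider units drops to 250.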

Let's dive right into the first fallacy dogging K10: the claim that K10's SSE is no faster than what is seen on K8. This naive idea is easily proven false even with code that is optimized for C2D, Anandtech's Linpack test:

It is clear that this code is using 128 bit SSE for Xeon because it is more than twice the speed of K8. It also seems clear that Opteron 2350 (Barcelona) is using 128 bit SSE as well because it is also more than twice as fast as the pair of Opteron 2224's (K8). However, we can see that both paired Xeon 5160 (dual core C2D) and Xeon 5345 (quad core C2D) are significantly faster than Barcelona. This is because the code order is arranged to match the latency of instructions in C2D. However, when we look for a better test, Techware Labs Linpack test:

Barcelona 2347 (1.9Ghz) 37.5 Gflop/s
Intel Xeon 5150 (2.6Ghz) 35.3 Gflop/s

The problem with this data is that the article does not make clear exactly what the test configuration is. We can infer that Barcelona is dual since the motherboard is dual. We could also infer that 5150 is dual since 5160 was dual. But this is not stated explicitly. So, we either have a comparison where the author is correct and Barcelona is faster or we have Barcelona using twice as many cores to achieve a similar ratio to what we saw at Anandtech.
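The second reading can be checked by normalizing the two scores per core and per Ghz. The sketch below assumes both systems are dual socket (which, as noted, the article does not state), with a quad core Opteron 2347 and a dual core Xeon 5150; the function name is hypothetical.

```python
# Normalize the Techware Labs Linpack scores under an ASSUMED configuration:
# two sockets each, quad core Barcelona versus dual core Xeon 5150.

def per_core_per_ghz(gflops, sockets, cores_per_socket, ghz):
    """Gflop/s divided by total cores and clock speed."""
    return gflops / (sockets * cores_per_socket * ghz)

barcelona = per_core_per_ghz(37.5, sockets=2, cores_per_socket=4, ghz=1.9)
xeon = per_core_per_ghz(35.3, sockets=2, cores_per_socket=2, ghz=2.6)

print(f"raw ratio (Barcelona/Xeon):        {37.5 / 35.3:.2f}")
print(f"Barcelona per core per Ghz:        {barcelona:.2f}")
print(f"Xeon 5150 per core per Ghz:        {xeon:.2f}")
print(f"normalized ratio (Barcelona/Xeon): {barcelona / xeon:.2f}")
```

If that configuration is right, Barcelona's small lead in the raw score comes from having twice as many cores, and the per core, per clock figure still favors the Xeon, much as in the Anandtech test.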

This Sandra benchmark shows no increase in speed for C2D so it is clear that it is not using 128 bit instructions. Since the SSE code is not optimized there is no reason to assume that the Integer code is either. Therefore, we have to discard both the Whetstone and Dhrystone scores.

Tech Report Sandra Multimedia

Normalizing the scores for both core count and clock speed shows tight clustering in three groups. Using a ratio based on K8 Opteron gives:

K8 Opteron - 1.0
K10 Opteron - 1.6
C2D Xeon - 1.9

These ratios appear normal since C2D is nearly twice K8's speed. Also, K10's ratio is about what we would expect for code that is not optimized for K10. Properly optimized code for K10 should be a bit faster. This is similar to what we saw with Linpack.

Normalizing the scores for both core count and clock speed shows tight clustering in three groups. Using a ratio based on K8 Opteron gives:

K8 Opteron - 1.0
K10 Opteron - 1.9
C2D Xeon - 3.7

Barcelona's score compared to Opteron is what we would expect at roughly twice the speed. The oddball score is C2D since it is roughly four times Opteron's speed. This type of score might lead to the erroneous conclusion that C2D is nearly four times faster than K8 and still roughly twice the speed of Barcelona. Such a notion however doesn't stand up to scrutiny. A thorough comparison of K10 and K8 instructions shows that the only instructions that run twice as fast are 128 bit SIMD instructions. There is no doubt therefore that K10 is indeed using 128 bit SIMD. It is not of any particular importance whether K8 is using 64 or 128 bit Integer instructions since these run the same speed. This readily explains K10's speed. However, C2D's is still puzzling since its throughput for 128 bit Integer SIMD is roughly the same as K10's. The only clue to this puzzle is the poor code optimization seen in the previous SSE FP operations. Given that Sandra does exhibit poor optimization for AMD processors (and sometimes for Intel as well), we'll have to conclude that the slow speed of the code on AMD processors is simply due to poor code optimization. Someone might be tempted to attribute this to actual differences in architecture such as C2D's greater stores capacity. However, this is clearly not the case as K8 has the same proportions and is still only one quarter the speed. In other words, if this were the case then K8 would be half the speed and then K10 would be bottlenecked at nearly the same speed. Given the K8/K10/C2D ratios the most likely culprit is code that is heavily optimized for C2D but which runs poorly on AMD processors.

We do have a second benchmark from the Anandtech review that is interesting.

zVisuel gives Intel an unfair advantage since it is, "a benchmark which is very SSE intensive and which is optimized for Intel CPUs." Barcelona nevertheless takes the lead. Although I don't care for using benchmarks that clearly favor Intel, I suppose I would at least agree with the author's assessment:

"The LINPACK and zVisuel benchmarks make it clear that Intel and AMD have about the same raw FP processing power (clock for clock), but that the Barcelona core has the upper hand when the application has to access the memory a lot."

Unfortunately, because these benchmarks did favor Intel we can't say for certain what Barcelona's top speed truly is. So, the initial benchmarks for Barcelona are not really indicative of its performance. Over time we should see reviews and benchmarks that show K10's true potential. If from no other source, HPC Linpack scores on supercomputers using K10 will eventually show K10's true SSE FP ability. Since SSE Integer operations have the same peak bandwidth with 128 bit width, this would also indicate maximum Integer speed as well. With K10's base architectural changes confirmed it should just be a matter of time before the reviews catch up. This should also mean that for AMD it is primarily a matter of increasing clock speed above 2.5 Ghz to be fully competitive with Intel.

Monday, September 10, 2007

AMD's K10 – A Good Start

K10 can be summed up pretty simply based on the benchmarks in reviews that popped up like mushrooms just after midnight Monday, September 10th as AMD's NDA expired. A 2.0Ghz Barcelona seems to be equal to a 2.33Ghz Clovertown. And, since Penryn is only showing a 5% increase in speed, this should make a 2.0Ghz Barcelona equal to a 2.2Ghz Penryn. Since AMD has promised 2.5Ghz in Q4 this should allow AMD to match the 2.83Ghz Penryn. This means that in Q4, Intel will still have two clock speeds, 3.0 and 3.16Ghz which are faster. In Q1 08, with 3.0Ghz, AMD should be able to match up to 3.33Ghz.
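The clock equivalences above follow from two ratios in the paragraph: 2.0Ghz Barcelona matching 2.33Ghz Clovertown, and Penryn being about 5% faster than Clovertown. The sketch below assumes performance scales linearly with clock speed, which is a simplification, and the names are illustrative.

```python
# Convert a Barcelona clock into a roughly equivalent Penryn clock.
# Assumes linear scaling with clock speed (a simplification).

CLOVERTOWN_PER_BARCELONA = 2.33 / 2.0   # 2.0Ghz Barcelona ~ 2.33Ghz Clovertown
PENRYN_GAIN = 1.05                      # Penryn ~5% faster than Clovertown

def penryn_equivalent(barcelona_ghz):
    """Penryn clock that would roughly match the given Barcelona clock."""
    return barcelona_ghz * CLOVERTOWN_PER_BARCELONA / PENRYN_GAIN

for ghz in (2.0, 2.5, 3.0):
    print(f"{ghz:.1f}Ghz Barcelona ~ {penryn_equivalent(ghz):.2f}Ghz Penryn")
```

This gives roughly 2.22, 2.77, and 3.33Ghz, so the 2.5Ghz part lands just under the 2.83Ghz Penryn and the 3.0Ghz part lines up with 3.33Ghz.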

I've already seen analogies drawn with 2003 when AMD first launched K8. There are some similarities but some important differences as well. One difference is that K10 is slower at launch than Opteron was in 2003. Given the same speed ratios, Barcelona would need to be at 2.2Ghz. So, AMD's K10 launch is a bit worse than Opteron in 2003. However, nothing else is really the same. When K8 launched there was exactly one chipset (AMD's) that supported it and compatible motherboards only arrived very slowly. Today, there are several chipsets and dozens of boards that can handle K10 with nothing more than a BIOS update. Another difference though is process yield. K8 launched with a painfully low yielding 130nm SOI process (only about half the yield of AMD's K7 bulk 130nm process) and this new process took a year to reach yield maturity. Today, in spite of the low clocks, AMD's 65nm process for K10 has excellent yields. Intel's yields, in contrast, on its brand new 45nm process will take a couple of quarters to reach maturity. This plus Intel's much slower ramp gives AMD some breathing room until Intel catches up in Q2 08. Some people have overestimated Intel's volume and clock ramp because they were spoiled by the launch of C2D back in 2006. These two are not comparable however as Intel's 65nm process then already had six months of maturity from production of Presler in Q4 05. On the other hand, there is no reason to be pessimistic about Intel since Penryn is doing quite well and is a long way from the painful start of Prescott (Pentium 4) in 2004. AMD should be able to reach good volume and a 3.0Ghz clock speed by Q1 08 and if we use the same six month lead then Intel should be looking good with 45nm in Q2 08.

My view of processors was pretty good from 2003 up to the release of Yonah in early 2006. However, I can readily admit that I didn't see C2D coming. It was quite a surprise when C2D was considerably faster than I had expected. The normal design cycle for AMD has been pretty clear since K5. Specifically, AMD had a small team doing upgrades on 486 and a major team that had designed the 29000 RISC processor. AMD dismantled the 29000 team and formed a new design team which created K5. As K5 was finished and the team free again a new major team started work on K7. K6 never had a major design team at AMD since it was designed by NexGen; however, it is clear that another small upgrade team was formed to see K6 through introduction and the upgraded versions K6-2 and K6-III. Having finished K7 the major team began K8 while the upgrade team created K7 upgrades like Athlon MP and Barton. Once K8 was finished, a major team began work on K10 while an upgrade team saw K8 through X2 and Revision F with its new memory controller, virtualization, and RAS features. K10 has been essentially finished since mid 2006 so the major team would be working on Bulldozer while an upgrade team handles both the launch and the later Shanghai version. Obviously, these teams aren't fixed but can be stripped or dismantled and then added to or nearly newly created from members of previous teams. The teams do change but the size and amount of work they do stays about the same. The only real change has been fairly recent. AMD has had a secondary team working on Turion. AMD also acquired some new design personnel with the Geode architecture. It looks like staff from these two groups have formed a major design team to work on Bobcat while a second upgrade team worked on the Griffin upgrade of Turion. I would have to assume that these teams are smaller than the K10/Bulldozer teams since Griffin has only a few changes from K8 and the Bobcat architecture is simpler. Nevertheless this does give a stronger combined focus to the formerly separate Geode and Turion lines. Apparently, Intel has a similar secondary group working on Silverthorne.

Intel's design teams used to be similar. Intel's major team created Pentium Pro, Willamette, and then Prescott. The upgrade team had been working on Pentium and it moved on to the PII and PIII upgrades of Pentium Pro and then the Northwood upgrade of Willamette. Intel had another team working on Itanium and a small team in Israel working on Banias. The major team was supposed to be working on Nehalem after Prescott while the upgrade team would move to Tejas as the upgrade of Prescott. Then we started hearing about a new Indian team working on Whitefield. As far as I know the Whitefield team was disbanded and very little if any of their work was used in other designs. From what I can gather, Whitefield would have been a quad core version of Sossaman with an Integrated Memory Controller and Intel's CSI point to point interface. The talk had been that Conroe was an upgrade of Yonah so the speed was quite a surprise. Once it was released though it was clear that Conroe was not an upgrade at all but a completely new design. Yet, many months later I still saw web authors calling Conroe a derivative of Yonah. Others too were still under the mistaken notion that Intel only started working on Conroe when Whitefield was canceled. However, neither of these ideas is correct.

There is no way to know exactly what went on at Intel but given the fact that we know roughly what Intel knew we can make some reasonable guesses. It must have been clear to Intel in late 2002 that Prescott had serious problems. It is also clear that ES Banias samples were available at this same time. It therefore seems reasonable that Intel decided in 2002 to work on a Prescott successor based on Banias. The Whitefield team was apparently created as both a hedge and an attempt to get CSI out the door more quickly. So, it appears to be the case that the Banias team became a small upgrade team which then worked on Dothan and Yonah. The original Tejas team seems to have produced Smithfield and the 65nm shrink, Presler, plus the Tulsa upgrade. Taken in this perspective we can see that neither of these were major teams. This means that Intel had to have shifted away from the original P4 Nehalem design and on to Core 2 Duo back in late 2002. This would make three years to mid 2005 and just enough time to finish a new core for the 2006 release. I realize now that information about Core 2 Duo was both stifled and confused. It was stifled because Intel chose not to patent the architectural elements of Banias that went into C2D. And, it was confused because work on C2D got constantly mixed up with the work on Tejas and the original P4 based Nehalem and the work on Whitefield. However, there was also the problem of the FSB which was expected to limit performance. This notion was strongly bolstered by the Whitefield design which was to have an IMC. However, in a brilliant but unexpected move, Intel managed to overcome the crippling latency of the FSB in a completely different fashion by increasing the L2 cache size and substantially beefing up the prefetch hardware. This along with the widened cache buses and SSE units is what made Conroe so fast.

Hopefully, I've now got enough understanding of how we got here to see where we are going. However, I have to mention that when others try to analyze the current situation I've seen a strange element coming into the evaluation. This is the idea of fairness. Many people suggest that it was fair that Intel got back into the lead because supposedly AMD was complacent. This notion however is incorrect. Intel got into the lead because they made the decision to change course way back in 2002 when it became apparent that Prescott wasn't going to work. Also, AMD was not complacent. As I've described their design teams I can't see where they could have gotten more staff to do more interim work. I'm sure that AMD felt that their dual core design was enough to tide them over until K10 was released and I'm certain they didn't foresee Intel's use of MCM to create a quad core. By the time AMD knew about Kentsfield it would have been pointless to work on an MCM design of their own. Also, I can't really fault their 65nm ramp even though it was a year behind Intel's. AMD ran 90nm tests at FAB 36 when it became operational in Q1 06. AMD ran 65nm tests in Q2 and then began actual 65nm production in Q3 which arrived for launch in Q4. I can't see how this would have happened sooner unless AMD had gotten FAB 36 operational sooner. Fairness has nothing to do with what happened. AMD did the best they could and Intel got the benefit of its hard work from the previous three years. It's really as simple as that.

But, these arguments persist. I find people insisting that K10 can't be a good design because it is late or that AMD can't possibly ramp clocks, volume, or yield faster than Intel because Intel has been in the lead. At their heart, these are all fairness arguments with no basis in reasoning. The quality of a design is not related to how late it is nor is the speed of a ramp related to how fast the previous generation of processor was. It has been argued that AMD would go bankrupt after Penryn's launch because AMD would not be able to handle Intel's big cost savings at 45nm. The problem with this argument is that it is precisely because AMD is using the older, more mature 65nm process that it will indeed be able to ramp yield, volume, and clocks for K10 faster than Intel can for Penryn on its new 45nm process. Intel will improve its 45nm process and this should pay off by Q2. The process will be mature when Nehalem launches in Q4 08. However, Nehalem both loses the die size advantage from MCM and has to compete with Shanghai as 45nm versus 45nm. It is also clear that AMD will be able to convert to 45nm in about half the time that it takes Intel. Intel may then enjoy a lead in Q1 09 due to its faster memory bandwidth and speed from SSE4 instructions.

However, the situation changes drastically in Q2 09 with Bulldozer. Bulldozer will not only have more memory bandwidth than Nehalem but it will be able to use ordinary ECC DIMMs in the same way as registered memory. This should give AMD a tremendous boost since it would allow Bulldozer to use the faster desktop chips in servers instead of waiting for the lagging registered memory. These desktop chips will be faster and cheaper. It is also clear that AMD intends to move aggressively and bypass JEDEC which has up to now been heavily influenced by Intel. Bypassing JEDEC means that instead of waiting for memory to become an official standard, AMD will certify the memory directly when it becomes available from a manufacturer such as Micron, whether it becomes an official standard or not. Had AMD been able to do this earlier there would have been less reason to work on the memory controller change since DDR was available in unofficial speeds up to 600Mhz which would easily have matched the current DDR2-667 speeds. DDR2 would only have been needed with quad core.

However, the biggest change with Bulldozer is, without doubt, SSE5. I've seen some people trying to compare SSE5 with SSE4 but these two are not even remotely the same. SSE4 is another SSE update similar to SSE2 or SSE3. In contrast, SSE5 is an actual extension to the x86-64 ISA. And, either by design or by accident, AMD has introduced SSE5 at the worst possible time for Intel. The SSE5 announcement is too close to the release of Nehalem for Intel to extend the architecture to take advantage of it. This puts Intel in a very difficult position. Adding SSE5 is such a big improvement that it greatly weakens Itanium. However, not adding SSE5 means giving up substantial ground to AMD. Since Intel has given every indication that it will beef up its x86 line when forced to by AMD we'll have to assume that it will add SSE5. Intel however faces a second problem: by the time Bulldozer is released, SSE5 will already be supported. Intel could theoretically announce its own standard in the next six months; however, pressure similar to what occurred with AMD64 is likely to prevent Intel from going its own way. It is a clear sign that AMD is serious about SSE5 that it has dropped its own 3DNow! instructions with Bulldozer.

On the other hand, support for SSE5 will be difficult for Intel since it would have to be added to the 32nm shrink of Nehalem without any major changes in architecture. This basically means that Intel will have to substantially modify its predecoders and decoders to support the extended DREX byte decoding. This is an easier task for AMD since all of its decoders are complex whereas Intel uses only one complex decoder. This could mean a major overhaul of Nehalem's decoders. It also means that Intel gets to watch the value of its macro-ops fusion get substantially weakened by the new SSE5 vector conditional move. This also tends to reduce the value of some of C2D's features since 3-way instructions reduce loads. Finally, Intel has to figure out how to do 3-way (two source + one destination) operations without adding extra register move micro-ops. It can be done but it means tweaking the control logic way down in the core. I think Intel can do it but it wouldn't be easy. This could also mean modifying the current micro-ops structure to have room for the extra destination bits but a change like this most likely could not be done with just the 32nm die shrink of Nehalem. The bottom line is that Intel will not have a clear lead over AMD on into 2009 as some have suggested. Intel has some bright spots such as Q2 08 when it gets 45nm on track and later in Q1 09 when Nehalem will have memory bandwidth advantages. However, AMD has its own bright spots such as Q1 08 when K10 gets good volume and speed on the desktop, Q3 08 with the Shanghai release, and Q2 09 with Bulldozer and SSE5.

Thursday, August 23, 2007

2008 And Beyond

2007 is far from over but it seems that lately people prefer to talk about 2008. Perhaps this is because AMD is unlikely to get above 2.5Ghz with K10 and Penryn will only have a low volume of about 3%. I suppose this is not a lot to get excited about. So, we are encouraged to cast our gaze forward but what we see is not what we might expect.

AMD's server chip volume has dropped considerably since last year. There is little doubt, however, that this trend will reverse in Q3 and Q4 of 2007 with Barcelona. This is true because even at lower clock speeds, Barcelona packs considerably more punch than K8 Opteron at similar power draw. The 2.0Ghz Q3 chips should replace around half of AMD's current Opterons, with faster 2.5Ghz chips replacing even the fastest 3.0Ghz K8 Opterons in Q4. This should leave Intel with two faster server chip speeds in Q4 with this most likely falling to a single speed in Q1 08. However, Intel may be able to pull farther ahead in Q2 08. I'm sure this will be confusing to those who are comparing the Penryn launch with Woodcrest last year and assuming that the highest speed grades will be released right away. The problem with this view is that Penryn is leading 45nm in Q4 of this year whereas Woodcrest did not lead 65nm in 2006. Instead, Woodcrest was six months behind Presler which went into 65nm production in October 2005 and launched in December 2005. This explains why Woodcrest was able to hit the ground running and launch at 3.0Ghz. June 2006 was six months after 65nm Presler in December 2005. Taking this as the pattern for 45nm would mean top initial speeds wouldn't be available until Q2 2008. This seems true since Intel has been pretty quiet about Q1 08 release speeds. If the market expands in early 2008, Intel should get a boost as AMD feels the pinch in volume capacity caused by the scale down at FAB 30 and the increased die size of quad core K10. This combines with Intel's cost savings due to ramping 45nm to put Intel at its greatest advantage. However, by the end of 2008, this advantage will be gone and Intel won't see any new advantage until 2010 at the earliest.

To understand why Intel's window of advantage is so small you need to be aware of the differences in process introduction timelines, ramping speeds, base architecture speed, and changing die size advantages. A naive assumption would be that: 1.) Intel's timeline maintains a process launch advantage over AMD, 2.) that Intel transitions processes faster, 3.) that Penryn is considerably faster than Conroe and that Nehalem is considerably faster than Penryn, and 4.) that Nehalem maintains Penryn's die size advantage. However, each of these assumptions would be incorrect.

1.) Timeline

Q2 06 - Woodcrest
Q3 07 - Barcelona (trailing by 5 quarters)

Q4 07 - Penryn
Q3 08 - Shanghai (trailing by 3 quarters)

Q4 08 - Nehalem
Q2 09 - Bulldozer (trailing by 2 quarters)

Q4 09 - Westmere
Q1 10 - 32nm Bulldozer (trailing by 1 quarter)
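
For anyone who wants to check the trailing figures, here is a minimal Python sketch. It only assumes the launch quarters listed above; the function names are my own and purely illustrative.

```python
# Launch quarters as (year, quarter) pairs, copied from the timeline above.
LAUNCHES = [
    ("Woodcrest", (2006, 2), "Barcelona", (2007, 3)),
    ("Penryn", (2007, 4), "Shanghai", (2008, 3)),
    ("Nehalem", (2008, 4), "Bulldozer", (2009, 2)),
    ("Westmere", (2009, 4), "32nm Bulldozer", (2010, 1)),
]


def quarter_index(year, quarter):
    """Map a (year, quarter) pair onto a single running count of quarters."""
    return year * 4 + (quarter - 1)


def quarters_behind(lead, follow):
    """Number of quarters by which the second launch trails the first."""
    return quarter_index(*follow) - quarter_index(*lead)


def trailing_gaps():
    """Return (intel_chip, amd_chip, gap_in_quarters) for each pairing."""
    return [
        (intel, amd, quarters_behind(intel_q, amd_q))
        for intel, intel_q, amd, amd_q in LAUNCHES
    ]


if __name__ == "__main__":
    for intel, amd, gap in trailing_gaps():
        print(f"{amd} trails {intel} by {gap} quarter(s)")
```

The gaps come out as 5, 3, 2, and 1 quarters, which is the shrinking lead the timeline is meant to show.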

Intel's Tick Tock timeline is excellent but AMD's timeline steadily shortens Intel's lead over the next two and a half years. This essentially means that the dominance that C2D enjoyed for more than a year will not be repeated. I suppose it is possible that 45nm will be late but AMD continues to say that it is on track. The main reason I am inclined to believe them is the die size. When AMD moved to 90nm they only had a small shrink in die size at first and then they later had a second shrink. AMD only reduced Brisbane's die size to 70% and nine months later AMD could presumably do a second shrink. But they aren't; Barcelona shows the same 70% reduction as Brisbane. This suggests to me that AMD has skipped a second die shrink and is concentrating on the 45nm launch. I'm pretty certain that if 45nm were going to be late we would be seeing another shrink of 65nm as a stopgap.

2.) Process Transition

Most people who talk about Intel's process development only know that Intel launches a process sooner than AMD. However, the amount of time it takes Intel to actually field a new process is also important. Let's look at Intel's 65nm history starting with an Intel Presentation concerning process technology. Page 2:

Announced shipping 65nm for revenue in October 2005

CPU shipment cross-over from 90nm to 65nm projected for Q3/06

And, from Intel's website, 65-Nanometer Technology:

Intel has been delivering 65nm processors in volume for over one year and in June 2006 reached the 90-65nm manufacturing "cross-over," meaning that Intel produced more than half of total mobile, desktop and server microprocessors using industry-leading 65nm process technology.

So, we can see that Intel did quite well and even beat its own projection by reaching crossover in late Q2 instead of Q3. October 2005 to June 2006 would be eight months to 50% conversion. For AMD, the INQ had a rumor of 65nm shipping in October and we know that 65nm officially launched on December 5th 2006. Let's assume that the rumor is true since it matches Intel's pattern of an October revenue shipping date with a December release in 2005. The AMD Q1 2007 Earnings Transcript from April 19th 2007 says:

100% of our fab 36 wafer starts are on 65 nanometer technology today

October 2006 to April 2007 would be six months. So, this would mean that AMD made a 100% transition in two months less than it took Intel to reach 50%. Intel's projection for 45nm is very similar with crossover not occurring until Q3 08. What this means is that even though Intel launches 45nm with a headstart in Q4 07, AMD should be completely caught up by Q1 09.
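
The month counts above are simple to verify. The following is a small Python sketch using only the dates quoted in this section; note that AMD's October 2006 start is the assumed (rumored) shipping date, not a confirmed one, and the helper name is just for illustration.

```python
def months_between(start, end):
    """Whole months from one (year, month) pair to another."""
    (start_year, start_month), (end_year, end_month) = start, end
    return (end_year - start_year) * 12 + (end_month - start_month)


# Intel: 65nm revenue shipments (October 2005) to 50% crossover (June 2006).
intel_to_crossover = months_between((2005, 10), (2006, 6))

# AMD: assumed 65nm shipping start (October 2006, from the INQ rumor)
# to 100% of FAB 36 wafer starts on 65nm (April 2007).
amd_to_full_conversion = months_between((2006, 10), (2007, 4))

if __name__ == "__main__":
    print("Intel months to 50%:", intel_to_crossover)
    print("AMD months to 100%:", amd_to_full_conversion)
    print("Difference:", intel_to_crossover - amd_to_full_conversion)
```

Keep in mind that the two endpoints are not the same kind of milestone: Intel's figure is half of all processor output across its fabs, while AMD's is wafer starts at a single fab.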

3.) Base Architecture Speed

Intel made grand claims of a 25% increase in gaming performance (40% faster for 3.33Ghz Penryn versus 2.93Ghz Kentsfield). However, according to Anandtech's Wolfdale vs. Conroe Performance review, Penryn is 4.81% faster while HKEPC gets 5.53% faster. A 5% speed increase is similar to what AMD got when it moved from 130nm to 90nm. The problem that I see is not with Intel's exaggeration but that Nehalem seems to use the same core. In fact, other than HyperThreading there seem to be no major changes to the core between Penryn and Nehalem. The main improvements with Nehalem seem to be external to the core like an Integrated Memory Controller, point to point communications, L3 cache, and enhanced power management. The real speed increases seem to come primarily from GPU processing and ATA instructions; however, like HyperThreading, these are not going to make for significant increases in general processing speed. And, since Westmere is the same core on 32nm this means no large general speed increases (aside from clock increases) for Intel processors until 2010 at the earliest. I suppose this then leaves the question of whether AMD will get a larger general speed increase with Bulldozer. Presumably if AMD can manage it they could then pull ahead of Nehalem. Both Intel and AMD are going to use GPUs on the die and both are going to go to more cores. Nehalem might get ahead of Shanghai since while both can do 8 cores Nehalem can also do HyperThreading. But Bulldozer moves back ahead again by allowing 16 actual cores. At the moment it is difficult to imagine a desktop application that could effectively use 8 cores, much less 16, but who knows how it will be in two years.

4.) Die Size

For AMD the goal is to get through the first half of 2008 because the game looks quite different toward the end of 2008. By the time Nehalem is released Intel will already have gotten most of the benefit of 45nm while AMD will only be starting. Intel will lose its small die size MCM advantage because Nehalem is a monolithic quad die like Barcelona. Intel only got a modest shrink of 25% on 45nm and so far has only gotten a 10% reduction in power draw so AMD can certainly stay in the game. It is also a certainty that Nehalem will have a larger die size than quad Penryn. This will be true because Nehalem will have to have both an Integrated Memory Controller and the point to point CSI interface. Nehalem will also add L3 cache. It would not be surprising if the Nehalem die is larger than AMD's Shanghai die. The one positive for Intel is that although yields will be worse with a monolithic die, their 45nm process should be mature by then. However, AMD has shown considerably faster process maturity so yields should be good on Shanghai in Q1 09 as well.

An Aside: AMD's True Importance

Finally, I have to say that AMD is far more important than many give them credit for. I recall a half-baked editorial by Ed Stroligo, A World Without AMD, where he claimed that nothing much would change if AMD were gone. This notion shows a staggering ignorance of Intel's history. The driving force behind Intel's advance from 8086 to Pentium was Motorola whose 68000 line was initially ahead. It had been Intel's intention all along to replace x86 and Intel first tried this back in 1981 with iAPX 432. Its segmented 16MB addressing looked pretty good compared to 8086's 1MB segmented addressing. However, it looked a lot worse than 68000's flat 16MB addressing which had been released the year before. The very next year iAPX 432 became the Gemini Project which then became the BiiN company. iAPX 432 continued in development with the goal of replacing x86 until 1989. However, this project could not keep up with the rapid pace of x86 as it struggled to keep up with each generation of 68000. When BiiN finally folded, a stripped down version of iAPX 432 was released as the embedded i960 RISC processor. Interestingly, as the RISC effort ran into trouble Intel began working on VLIW and when BiiN folded in 1989 Intel released its first VLIW processor, i860. HP began work on EPIC the same year and five years later, Intel was committed to EPIC VLIW as an x86 replacement.

In 1995 Intel introduced Pentium Pro to take on the established RISC processors and grab more share of the server market. The important point though is that there is no indication that Intel ever intended Pentium Pro to be used on the desktop. We can infer this for a couple of reasons. First, Itanium had been in development for a year when Pentium Pro was introduced and an Itanium release was expected in 1998. Second, with Motorola out of the way (68000 development ended with 68060 in 1994), Intel was not expecting any real competition on the desktop. AMD and Cyrix were still making copies of 80486 so Intel had only planned some modest upgrades to Pentium until Itanium was released. However, AMD released K5 which thoroughly stunned Intel. Although K5 was not that fast it did have a RISC core (courtesy of AMD's 29050 RISC processor) which put K5 in the same class as Pentium Pro and a generation ahead of Pentium. Somehow AMD had managed the impossible and had skipped the Pentium generation. So, Intel went to an emergency plan and two years later released a cost reduced version of Pentium Pro for the desktop, Pentium II. The two year timeline indicates that Intel was not working on a desktop version prior to K5's release. Clearly, we owe Pentium II to K5.

However, AMD purchased Nexgen and released the powerful K6 (which also had a RISC core) just two years later meaning that it arrived at the same time as PII. Once again Intel was forced to scramble and release PIII two years later. We owe PIII to K6. But, AMD had been hard at work on a K5 successor and with the added technology from K6 and some Alpha tech it released K7. Intel was even more shocked this time because K7 was a generation ahead of Pentium Pro. Intel was out of options so it was forced to release the experimental Willamette processor and then follow up with the improved Northwood two years later. We owe P4 to K7. That P4 was experimental and never expected to be released is quite clear from the pipeline length. The Pentium Pro design had a 14 stage pipeline which was reduced to 10 stages in PII and PIII. Interestingly, Itanium also used a 10 stage pipeline. However, P4's pipeline was even longer than the original Pentium Pro's at 20 stages. Itanium II has an even shorter pipeline at 8 stages so it is clear that Intel does not prefer long pipelines. We can then see that P4 was an aberration caused by necessity and Prescott at 31 stages was a similar design of desperation. Without K8 there would be no Core 2 Duo today and without K10 there would be no Nehalem.

There is no doubt whatsoever that just as 8086's rapid advance against competition from Motorola 68000 stopped the iAPX 432 and shut down BiiN, Intel's necessity of advancing Pentium Pro rapidly on the desktop stopped Itanium. Intel already had experience with VLIW from i860 and would have delivered Merced on schedule in 1998. Given Itanium's speed it could have been viable at as little as 150Mhz. However, Pentium II was already at 450Mhz in 1998 with faster K7 and PIII speeds due the next year. The pace continued rapidly going from Pentium Pro's 150Mhz to PIII's 1.4Ghz. Itanium development simply could not keep up and the grand plans of 1997 for Itanium to become the dominant processor fell apart. The pace has been no less relentless since PIII and Itanium has been kept in a niche server market.

AMD is the sole reason why today Itanium is not the primary processor architecture. To suggest that nothing would change if AMD were gone takes an extraordinary amount of self-delusion. Intel would happily stop developing x86 and would put its efforts back into Itanium instead. The x86 line is also without any serious desktop replacement. Alpha, MIPS, and ARM stopped being contenders long ago. Power was the last real competitor but it fell out of the running when its desktop chips couldn't keep up and were dropped by Apple. This means that without AMD, Intel's sole competition for desktop processors is VIA. And, just how far behind is VIA? No AMD would mean higher prices and slower development and the eventual phase out of x86. Of course, I guess people can always hope that Intel has given up its goal of more than a quarter century of dropping the x86 line and moving the desktop to a completely proprietary platform.

Friday, August 17, 2007

2007: The Second Half

Amid all the rumblings and rumors there are signs of fundamental differences between this year and last. In almost every aspect of processors AMD and Intel have swapped places. This has left a virtual vacuum of analogy for AMD and Intel supporters alike since both are reluctant to compare their favorite to the competition. The situation today is not exactly the same but some comparisons do provide a view of where things are likely to go.

We can add up the various ways that Intel and AMD have swapped places and there are quite a few. In early 2006, AMD's K8 was the undisputed leader ahead of Intel's Presler and Yonah offerings. Today, C2D is the undisputed leader ahead of AMD's Opteron and Athlon 64 offerings. In late 2006, Intel introduced quad core which AMD has taken nearly a year to match. Today, AMD is ready to offer native quad core which it will take Intel about a year to match. In early 2006, Intel previewed the native dual core 2.93Ghz Conroe which looked great and then it was a matter of waiting for Intel to actually get them out the door in volume. Today, AMD has previewed native quad core 3.0Ghz K10 which looks great and once again it is a matter of waiting for AMD to get them out the door in volume. In 2006, Intel was recovering from revenue shocks caused by AMD's K8. Today AMD is recovering from revenue shocks caused by Intel's C2D. In 2006, Intel introduced a new architecture that was far ahead of its previous generation offerings while AMD was only able to offer secondary upgrades such as small clock increases, virtualization, and faster memory speeds. Today, AMD is offering a new architecture that is far ahead of its previous generation offerings while Intel is only able to offer secondary upgrades such as small clock increases, SSE4, and faster bus speeds.

It really is remarkable how similar each company's situation is to its competitor's last year. This is most fundamentally true on the desktop. I suppose Intel supporters would point out that AMD is not likely to take the top performance spot in Q4 when Phenom is launched as Intel did when Conroe was launched. That is true. However, I suppose AMD supporters could point out that AMD was never in the heat and power crunch that Prescott was. Mobile is the most fundamentally different. Intel took mobile by storm when it launched the Centrino platform and since that time AMD has only been slowly chipping away with Turion. Merom was nearly the opposite of Conroe. While Conroe added tremendous value to the desktop as it replaced the sagging P4 line, Merom on the other hand actually had worse power draw than Yonah. Intel finds that, having conquered the battery life and wireless LAN issues years ago, it has no place left to take mobile to gain an advantage. Turion has only had a small effect on Intel's mobile share but AMD should be fully competitive in 2008 with Griffin and Puma. It also looks like most of Intel's tweaks with Penryn are to try to stave off the coming attack from K10 Opteron. Intel is putting up a good fight with lower power draw, more cache, and faster FSB but it won't be enough. The fact is that when you've taken back as much server share as Intel has the only place left to go is down. SSE4 could be a big boost in HPC; however, Intel has already made most of its gains in the HPC low range with Woodcrest so Penryn would most likely be an upgrade to existing systems. SSE4 could be a boost in the top range but currently Intel has little presence there.

With Intel certain to have small losses in server share and no real change from the previous situation in mobile that leaves the desktop as the main battleground. AMD's average volume share in 2005 was 18% up noticeably from about 16.5% average for the previous several years. AMD's average volume share in 2006 was 23% and even though Intel has been fighting hard it remains at 23% in Q2 07. The actual price cuts have been a lot less than most people imagine. Intel's overall ASP is only down 16% from the nearly steady value of about $99 that it had been for three quarters. AMD's drop is similarly down 17% from the previous three quarter average of $60. AMD has had a desktop ASP drop of 42% since Q1 06 while Intel has dropped 38% in the same time. In Q2 07 AMD's desktop ASP was $49 versus Intel's $83. AMD's desktop ASP is substantially lower than Intel's but if it remains steady AMD will make more money as its margins improve with cost savings from 65nm. Although Intel's reorganization has so far only brought tiny changes in cost reduction it could see more in the 2H of 2007.
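
The percentages in the paragraph above can be cross-checked with a few lines of Python. This is only a rough sketch: it assumes that the $99 and $60 three-quarter averages and the $83 and $49 Q2 07 figures describe the same ASP measure, which the figures suggest but which I haven't stated outright. The Q1 06 values it prints are back-calculated from the quoted drops, not reported numbers.

```python
def apply_drop(start_asp, percent_drop):
    """ASP remaining after a given percentage decline."""
    return start_asp * (1 - percent_drop / 100)


def implied_start(end_asp, percent_drop):
    """Starting ASP implied by an ending ASP and a percentage decline."""
    return end_asp / (1 - percent_drop / 100)


if __name__ == "__main__":
    # Recent decline from the three-quarter averages quoted above.
    print(f"Intel: $99 less 16% = ${apply_drop(99, 16):.2f}")
    print(f"AMD:   $60 less 17% = ${apply_drop(60, 17):.2f}")

    # Back-calculated Q1 06 desktop ASPs (derived, not reported figures).
    print(f"Implied AMD Q1 06 desktop ASP:   ${implied_start(49, 42):.0f}")
    print(f"Implied Intel Q1 06 desktop ASP: ${implied_start(83, 38):.0f}")
```

The first two results land within a dollar of the $83 and $49 Q2 07 figures, so the numbers are at least consistent with each other.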

This does bring up the question of whether Intel will be able to bring additional price pressure to bear against AMD. Intel's Q2 07 earnings suggest that Intel reached its lower limit in pricing in Q2 and that it would need additional cost savings to be able to price lower. This plus the Q2 07 reductions in ASP for both server and mobile make it unlikely that we will see much in the way of lower prices during the rest of 2007. However, Intel is almost certain to resurrect this tactic in some fashion in 2008 as its ramping 45nm production reduces costs again. This should be interesting since AMD's ramping K10 desktop production should raise its desktop ASPs. It wouldn't surprise me to see a substantial bump of AMD's desktop ASP to $60 with a small cut of Intel's down to $79. This is possible if everything goes well and Intel still wants to keep prices down. Otherwise I would expect Intel to pull its desktop ASP back up to its preferred level of $99 and AMD to increase its to a preferred level of $70. This will likely be dependent on Intel's flash spin-off not being a $300 Million a quarter drain and AMD's new chipset division earning a profit. Mostly this means that Q4 07 will be more of a skirmish than the major battle that was expected. Presumably this will become a genuine battle in 2008 as Intel ramps Penryn while AMD ramps K10.

Some people seem to assume that Intel will ramp 45nm quickly and have large volumes available in 2007. However, the following ramp graph from Intel shows that 45nm will only be about 3% of production before the end of 2007. 45nm won't be a significant desktop volume for Intel until Q2 08 with crossover occurring the following quarter. Again, this is why 2008 will be the real battle.

Sunday, August 05, 2007

AMD: Limited Options

AMD trailed Intel's 65nm process by ten months in 2006 but the recent speed of conversion at FAB 36 has been impressive. Likewise, the announcement of only 2.0Ghz for Barcelona's release this quarter was disappointing. However, Fuad's article shows normal improvement for Barcelona with each stepping. This is good news because moving forward is really AMD's only option.

The microprocessor volume shares since 2000 are interesting:

2000 - AMD 16.5%, Intel 83.5%
2001 - AMD 20.7%, Intel 79.3%
2002 - AMD 16.4%, Intel 83.6%
2003 - AMD 17.0%, Intel 83.0%
2004 - AMD 16.1%, Intel 83.9%
2005 - AMD 18.5%, Intel 81.5%
2006 - AMD 23.4%, Intel 76.6%
2007 - AMD 21.0%, Intel 79.0%

We can see that up to 2004, AMD averaged about 16.5% to Intel's 83.5%. AMD had a slight bump in 2001 during K7's competitive period against PIII and Willamette P4 but then dropped back down again in 2002 against Northwood P4. Some people mistakenly think that AMD's fortunes improved right after it released K8 in 2003. This was clearly not the case since AMD's volume share was lower in 2004 than it was in 2003. In 2004 Intel was on its 90nm process and gaining volume from 300mm FABs like FAB 11X and FAB 24 plus D1C which had been freed up for production by the new development FAB, D1D. AMD, in contrast, was expanding FAB 30 as fast as it could in the new clean room space and trying not to lose volume with a K8 die that was twice the size of the K7 die. This did not really come together for AMD until 2005 when the late 2004 move to AMD's own 90nm process allowed a much smaller die and the new expansion ultimately allowed a 50% increase in capacity. However, the 2006 numbers are misleading because the volume fell off from 2005. Without this dropoff, AMD's volume share would have been only around 19-20% since FAB 30 was topped out and the new production from FAB 36 did not arrive until the second half of 2006.
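
The 16.5% baseline is easy to reproduce from the table. Here is a minimal Python sketch; it assumes that the baseline is meant to exclude the 2001 bump, which is how the numbers work out. The names are illustrative only.

```python
# AMD's yearly volume share in percent, copied from the table above.
AMD_SHARE = {
    2000: 16.5, 2001: 20.7, 2002: 16.4, 2003: 17.0,
    2004: 16.1, 2005: 18.5, 2006: 23.4, 2007: 21.0,
}


def mean_share(years):
    """Average AMD volume share over the given years."""
    years = list(years)
    return sum(AMD_SHARE[year] for year in years) / len(years)


# Assumption: the baseline leaves out 2001, the K7 bump year.
baseline = mean_share([2000, 2002, 2003, 2004])

# For comparison, the plain 2000-2004 average with the bump included.
with_bump = mean_share(range(2000, 2005))

if __name__ == "__main__":
    print(f"AMD baseline without 2001: {baseline:.2f}%")
    print(f"AMD average 2000-2004:     {with_bump:.2f}%")
    print(f"Intel baseline:            {100 - baseline:.2f}%")
```

With 2001 left out the average is 16.5%; with it included the figure rises to a little over 17%, so the bump year matters to the baseline.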

Some will no doubt see the latest numbers as just another bump like 2001. They will most likely assume that AMD will drop back down to its previous volume of 16%. However, that is not possible. There is a scene in the movie Vertical Limit where a woman has to jump from the tiny ledge she is standing on out into empty air and jam the crampon that is attached to her harness in a crack in the cliff face before she falls to her death. This is pretty much what AMD did when it built FAB 36 and purchased ATI. But, AMD had good reason for such a radical move. When AMD first introduced K7 and K8, Intel used its considerable muscle to prevent 3rd party companies from supporting them. Thus, no motherboard makers attended K8's launch and even companies like Acer were told not to attend. In both cases, AMD had to supply its own chipsets to get these platforms off the ground and there were still delays and areas not covered. For example, although AMD's K6 chip was capable of dual socket operation, there was no supporting chipset that allowed it. Likewise, AMD's 760 MP chipset was the only dual socket chipset produced for K7. Considering that K7's were used to build supercomputers, it would be difficult to suggest that the lack of other MP chipsets was due to K7 itself. Essentially, AMD's Dresden Design Center was able to produce a bare minimum of chipset support for AMD's processors. However, this was inadequate compared to the much greater support that Intel provided to its processors. AMD only got out of this box by purchasing ATI. That AMD gained a much more robust capability to develop its own supporting infrastructure can be seen with the 690G, RD790, and 780M chipsets. AMD's choice was to either buy ATI and gain competitive support for its processors or die a slow death.

FAB 36 was also not much of a choice. Processor volume grows slowly most years at 3-11%. AMD did a very good job in increasing FAB 30's capacity by 50%. However, expansion alone could not erase Intel's 300mm wafer advantage or prevent FAB 30 from eventually topping out. There was no doubt that AMD badly needed 300mm wafer facilities to remain competitive. It is theoretically possible that AMD could have added 300mm tooling to FAB 30 during its expansion in 2003 and 2004 and this would have reduced costs. However, FAB 30 would still have topped out and AMD would again be faced with the prospect of slipping every year in volume share until it became marginalized. For AMD, failing either to gain the necessary processor support or to increase capacity meant marginalization and the eventual loss of any capability to compete in front line processors. AMD's ability to fund research and development would have slipped until AMD was like VIA is today or like AMD used to be before K6.

Unfortunately, neither of these could be done in stages or phased in. You can't really build 10% of a FAB nor can you buy 10% of a graphics company. This is why AMD had to take a leap out into empty air and hope it didn't plummet to the rocks below. I've seen some people try to lay the blame for AMD's financial troubles entirely on the ATI purchase. Seemingly, these people do not realize that AMD will spend $1.8 Billion this year alone for tooling to outfit FAB 36; this is in addition to what it spent in 2006 and will spend in 2007 plus what the FAB itself originally cost. You can reasonably look at the issue either way. AMD could have easily afforded ATI if it hadn't built FAB 36 or AMD could easily have afforded FAB 36 if it hadn't bought ATI. Having to pay for both is what is difficult. I've wondered many times if there was any other option that could have gotten AMD out of the trap it was in. However, delaying the ATI purchase would have meant delayed benefit as well and AMD desperately needs the 780 mobile chipset that will come in 2008. Similarly, Fusion will not pay off until 2009 but, with Intel moving in the same direction, pushing this back any further would have meant an even greater competitive disadvantage to Intel. So, having built FAB 36 and bought ATI, AMD can now no longer survive on 16% of the market. AMD must move forward with at least 25% volume share or else it has no future.

That AMD intends to gain volume share is without question since in spite of the losses in Q4, Q1, and Q2 AMD's production plan is the same as it was in October 2005. AMD still plans to make 100 Million chips in 2008. This is more than double the number of chips that AMD made in 2005. However, the market is not growing that fast. The only way that AMD can sell 100 Million chips in 2008 is if it takes share from Intel. Naturally, Intel would prefer otherwise and will do its best to prevent this from happening. Looking at the volume share, one might easily think that nothing has changed over many years. However, this impression would be wrong. There has indeed been a strong dynamic element to the market even though Intel has garnered an advantage at each step. For example, Intel originally benefited by being ahead in terms of processor performance. This lasted until AMD introduced K6 in 1997. However, by that time Intel was making Pentium Pro and chipsets so it gained income from server and chipset sales that AMD couldn't match. It would be another four years before AMD released Athlon MP in 2001 and had a server processor offering of its own. However, with only a single chipset for support, Athlon MP never gained much marketshare and it would be another three years until AMD actually began taking server share in 2004 with K8.

By 2002 Intel had a steadily increasing advantage from 300mm wafer manufacturing compared to AMD's 200mm facility. AMD is only now benefiting from a majority of 300mm wafers in production. Likewise, it has only recently purchased ATI to gain stronger chipset support. Intel still has some advantages in mobile but these will essentially be gone in 2008 with AMD's new mobile platform. AMD will finally release a small volume of K10 server chips this quarter. In Q4, AMD should have more chipset offerings plus more K10 chips. However, Intel will begin releasing 45nm chips in Q4 and should hold onto the highest clock speed. It gets more interesting in Q1 08 because Intel will have a good volume of 45nm chips by then while AMD should have both faster K10 chips and a reasonable number of desktop K10 chips. On the face of it, Intel could lose its speed advantage in Q1 08, but with 45nm ramping it should see lower costs. AMD is unlikely to match 45nm until Q3 08. Intel's entire strategy seems to be based on lower-cost 45nm chips plus low prices to put pressure on AMD. This has certainly been effective up to now but may be less so as AMD's chip ratios move more strongly towards 65nm and 300mm wafers. Presumably, Intel swaps its 65nm to 90nm advantage for a 45nm to 65nm advantage. However, Intel does lose the 300mm advantage. This is a gain for AMD, but then AMD hasn't made any profit for three quarters. The final factor, though, is that K10 should have more value versus Penryn than K8 did against Conroe.

With AMD's recent $1.5 Billion in losses, the notion of any gain for AMD seems counterintuitive. Yet consider how Intel's advantages are trending:

Intel Chipsets – Declining as AMD moves forward with ATI
Intel mobile – Less than before, gone in 2008
300mm wafers – Declining rapidly, gone by 2008
65nm – Currently declining but will begin increasing again with 45nm in Q4
Quad core – Gone this quarter
C2D higher value – Declining once K10 is released, gone by Q2 08

So, except for the 45nm process, Intel is losing most of its advantages. This should make AMD much stronger by the start of 2008. Intel does still have Nehalem, which could be a great processor and put Intel back in the lead. However, it looks like AMD may be better prepared this time with Shanghai than it was with K8. On the face of it, I can't see any reason why AMD would not be able to gain share. And, since I've already shown that AMD has no middle ground, it is a certainty that it will be doing its best to do just that. I think we will also find out conclusively in Q4 whether Intel has truly cut costs and leaned out the company. The last thing that Intel wants is a bad 4th quarter, and it will need excellent cost control to maintain a price war with AMD in Q4. If AMD's only options are to gain share or go out of business, we also have to look at who would lose if AMD failed. It's pretty obvious that neither IBM, Sun, Cray, HP, Dell, Gateway, nor Apple would be happy with the absence of AMD. Other than Intel, the only companies I can think of that would benefit might be VIA and possibly Silicon Graphics. With that many companies benefiting from AMD's existence, the idea that it would go out of business soon does seem unrealistic.
