Monday, November 26, 2007

AMD: All Dressed Up, But No Place To Go

AMD's current position is frustrating at best. Although AMD seems to have gotten its ATI offerings in order, its K10 offerings lag expectations in almost every way.

AMD's previous 80nm ATI 29xx GPU offerings looked a bit dismal compared to nVidia's. The high power draw and low performance justified more than one lukewarm review. The best ATI product, the HD 2900XT, was only close to nVidia's 8800GT (at higher power draw) and nowhere near the 8800GTX. AMD's response to this seemed more than a little strange: they said that they intended to compete against the GTX with Crossfire. However, it was difficult to imagine anyone really wanting double the power draw and heat of the 2900XT. It seems, though, that with the new 38xx series on the 55nm process, AMD has gotten the power draw under control. With the reduced die size, AMD can probably sell these chips at the reduced 2900 price and still make money, and it looks like the 38xx will actually make Crossfire a viable competitor against the GTX.

AMD's new 55nm-based 7xx chipsets look like winners of the first order, and AMD's Overdrive utility looks very good. It is amazing to think about tweaking the overclock on a system with a factory control panel. The GPUs, the chipsets, and the Overdrive utility dress up any new AMD system fit for a ball. The problem is that these systems require an equally good processor, and, so far, AMD has failed to deliver one.

And it isn't just one problem with K10; it is many: lagging volumes, a clock ceiling of just 2.3GHz, slow northbridge and L3 cache speeds, and no clear indication of improvement over K8. Not only is there no clear indication of when things might be fixed, there is not even a clear indication of what the problem actually is. Low volumes would suggest poor yields and ramping problems. Low clocks could suggest either process problems or architecture problems. The lack of 2.4GHz speeds was blamed on a bug that won't be fixed until the next revision. This is in contrast with both reviews that overclocked to 2.6GHz with no trouble and AMD's own 3.0GHz demo. Under normal circumstances AMD should be able to deliver a demoed chip in volume about six months later. That AMD does not appear ready to deliver 3.0GHz K10s anytime in Q1 suggests a big problem of some kind. The poor performance also contrasts with statements from different sources suggesting that K10 had great performance at higher speeds.

The reviews aren't much help either. OC Workbench shows performance for Phenom close to that of Kentsfield at the same clock, with a few exceptions. For example, the Multimedia - Int x8 iSSE3 score indicates either a compiler problem or an architecture problem: Phenom's score is only about a third of what it should be after turning in reasonably good scores in the other categories. The Cinebench scores are odd since Phenom appears to speed up more than 4X as all four cores are used, compared to 3.53X for Kentsfield. A really good speedup would be 3.8-3.9X, while 4.0X should be impossible. Something funky is definitely going on to get 4.35X. Phenom also falls off quite a bit in the TMPGEnc 4.0 XPress test.
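
For reference, the scaling figures above are just the ratio of the all-cores score to the single-core score. Here is a minimal sketch of that calculation; the scores below are made-up placeholders, not the review's actual numbers:

```python
# Core-scaling calculation as used in the discussion above.
# The scores are hypothetical placeholders; see the OC Workbench
# review for the real Cinebench numbers.
def core_scaling(one_cpu_score: float, all_cpu_score: float) -> float:
    """Ratio of the multi-core score to the single-core score."""
    return all_cpu_score / one_cpu_score

print(round(core_scaling(2400, 10440), 2))  # 4.35 -- the odd Phenom-style result
print(round(core_scaling(2400, 8472), 2))   # 3.53 -- the Kentsfield-style result
```

On four cores, a figure at or above 4.0X means the single-core baseline is being dragged down somehow, or some caching effect is in play, which is why the 4.35X figure looks funky.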

The Anandtech Phenom Review suggests that Intel might be having some small difficulty with clocks on Penryn. However, this difficulty is so small compared to AMD's that it is hardly noticeable; it only means that 3.2GHz desktop chips won't be out until Q1. Since this is also when AMD is releasing 2.4GHz parts, this could easily put Intel at a 33% faster clock. Other statements in this review suggest that Intel may have streamlined its internal organization considerably, which would also be bad news for AMD. This review also mentions that K10's L3/NB can't clock higher than 2.0GHz. Anandtech does, however, confirm how nice AMD's Overdrive utility is. The Anandtech benchmarks show generally slower performance for Phenom than for Kentsfield at the same clock, with higher power draw for Phenom. So, the Anandtech scores don't really show us where the problem might be.

I was hoping that when the November 2007 Top 500 HPC list came out I would get some new test scores that would settle the question of whether K10 is better than K8. I downloaded the latest Top 500 list as a spreadsheet and was delighted to see a genuine Barcelona score. HPC systems typically have well-tuned hardware, operating systems, and software, so benchmarks from these should be accurate. I was hoping the HPC numbers would avoid any question of unfavorable hardware, OS, or compiler.

I began crunching numbers. I normalized the scores based on the number of processors and the clock speed. I was happy to see very tight clustering at 2.0 for dual-core Opteron. Then I ran the numbers for Barcelona and it showed 4.0, double the value for dual core. This seemed like a big problem to me: with twice as many cores and double the SSE width, K10's peak SSE throughput should be twice as high per core, or four times higher overall for Barcelona. In other words, I was expecting something around 8.0 and didn't see it. This would suggest a problem.

However, I then ran the Woodcrest and Clovertown numbers. Woodcrest clustered tightly at 4.0. This was not a surprise since it too has twice the SSE width of K8. Unfortunately, the Clovertown numbers also clustered at 4.0. This was a surprise since (with twice as many cores) I was expecting 8.0 for Clovertown as well. So, unfortunately, these scores are inconclusive: K10 shows the same normalized score as Clovertown, but neither shows any increase over Woodcrest. The questions remain unanswered.
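
For concreteness, here is a minimal sketch of the normalization described above. It assumes the spreadsheet's processor count refers to individual cores, and every Rmax, core-count, and clock value below is an illustrative placeholder, not a real Top 500 entry:

```python
# Minimal sketch of the per-core, per-GHz normalization described above.
# All Rmax, core-count, and clock values are illustrative placeholders,
# NOT real Top 500 entries.
def normalized(rmax_gflops: float, cores: int, clock_ghz: float) -> float:
    """Rmax per core per GHz -- roughly sustained FLOPs per clock per core."""
    return rmax_gflops / (cores * clock_ghz)

systems = [
    # (label, Rmax in GFLOPS, total cores, clock in GHz)
    ("dual-core Opteron (K8)", 10000, 2000, 2.5),   # clusters near 2.0
    ("Barcelona (K10)",        20000, 2000, 2.5),   # clusters near 4.0
    ("Woodcrest",              24000, 2000, 3.0),   # clusters near 4.0
    ("Clovertown",             21300, 2000, 2.66),  # clusters near 4.0
]

for label, rmax, cores, ghz in systems:
    print(f"{label:24} {normalized(rmax, cores, ghz):.1f}")
```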

The bottom line is that AMD is having problems. Whether these problems are due to process, design flaws, or unfixed bugs is hard to say. AMD is also under the gun on time. For example, if Phenom is only hitting 2.4GHz in Q1, that leaves no time to start production of Shanghai in Q2 for a Q3 release. I suppose it is possible that AMD could fix a serious design flaw with the Shanghai release, but it is by no means certain. What is certain is that AMD will have to do much better to have any chance of increasing its share to profitable levels in 2008.

Monday, November 05, 2007

Has Intel's Process Tech Put Them Leagues Ahead?

There has been a lot of talk lately suggesting that Intel is way ahead of AMD because of superior R&D procedures. Some of the ideas involved are rather intriguing, so it's probably worth taking a closer look.

You have to take things with a grain of salt. For example, there are people who insist that it wouldn't matter if AMD went bankrupt because Intel would do its very best to provide fast and inexpensive chips even without competition. Yet these same people will, in the next breath, also insist that the reason Intel didn't release 3.2GHz chips in 2006 (or 2007) was that "they didn't have to". I don't think I need to go any further into what an odd contradiction of logic that is. At any rate, the theory being thrown around these days is that Intel saw the error of its ways when it ran into trouble with Prescott. It then scrambled and made sweeping changes to its R&D department, purportedly mandating Restricted Design Rules (RDR) so that the design staff stayed within the limitations of the process staff. The theory is that this has allowed Intel to be more consistent with design and to leap ahead of AMD, which presumably has not instituted RDR. The theory also holds that AMD's Continuous Transistor Improvement has changed from a benefit to a drawback: rather than continuous changes allowing AMD to advance, these changes only produce chaos as each change spins off unexpected tool interactions that take months to fix.

The best analogy for RDR that I can think of is Group Code Recording (GCR) and Run Length Limited (RLL) recording. Consider magnetic media like tape or the surface of a floppy disk. Typically a '1' bit is recorded as a change in magnetic polarity while a '0' is no change. The problem is that the medium can only handle a certain density: if we pack transitions too closely together they blend, and the polarity change may not be strong enough to detect. Now, let's say that a given magnetic medium can handle 1,000 flux transitions per inch. If we record data directly, we can do 1,000 bits per inch. Modified Frequency Modulation (MFM), however, puts an encoding bit between data bits to ensure that we never get two 1 bits in a row. Since transitions are now at least two bit cells apart, we can actually put 2,000 encoded bits per inch, of which 1,000 bits are actual data. So although MFM expanded the bits by a factor of 2, there was no actual change in data density. By using more complex encoding, though, we can increase density. With (1,7) RLL we record the same 2,000 encoded bits per inch, but at a code rate of 2/3 we get 1,333 data bits. And with (2,7) RLL we space the 1 bits even further apart, at least three cells, so the encoded density can rise to 3,000 bits per inch; at a code rate of 1/2 this increases our data bits by 50% to 1,500. GCR is similar in that it maps a group of bits into a larger group, which allows the elimination of bad bit patterns. You can see a detailed description of MFM, GCR, and RLL at Wikipedia. The important point is that although these encoding schemes initially make the data bits larger, they actually allow greater recording densities.
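
The arithmetic above can be condensed into a tiny model. This is a simplification of my own (it assumes the encoded cell density can always be pushed up to the flux limit times the minimum transition spacing, which glosses over clock-recovery and jitter issues), but it reproduces the numbers used here:

```python
# Simplified (d,k)-RLL density model for the arithmetic above.
# A (d,k) code guarantees at least d zeros between 1 bits, so flux
# transitions are at least d+1 encoded cells apart and the encoded
# cell density can be raised to flux_limit * (d + 1).  Data density
# is the encoded cell density times the code rate.  This ignores
# the k constraint (clock recovery) and is illustrative only.
def data_bits_per_inch(flux_limit: int, d: int, rate: float) -> float:
    encoded_cells_per_inch = flux_limit * (d + 1)
    return encoded_cells_per_inch * rate

FLUX = 1000  # flux transitions per inch the medium can resolve

print(data_bits_per_inch(FLUX, d=1, rate=1/2))  # MFM:       1000.0
print(data_bits_per_inch(FLUX, d=1, rate=2/3))  # (1,7) RLL: ~1333.3
print(data_bits_per_inch(FLUX, d=2, rate=1/2))  # (2,7) RLL: 1500.0
```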

RDR would be similar to these encoding schemes: while it would initially make the design larger, it would eliminate problem areas, which would ultimately allow the design to be made smaller. RDR should also, in theory, greatly reduce delays. When we see that Intel's gate length and cache memory cell size are both smaller than AMD's, and we see the smooth transition to C2D and now Penryn, we would be inclined to give credit to RDR, much as EDN editor Ron Wilson did. You'll need to know that OPC is Optical Proximity Correction and that DFM is Design For Manufacturability. One example of OPC: you can't actually print square corners on a die mask, so the design is corrected by rounding the corners to a minimum radius. DFM just means that Intel tries very hard not to design something that it can't make. DFM is a good idea: there are many historical examples, from Da Vinci's attempt to cast a large bronze horse to the Soviet N1 lunar rocket, of designs that failed because manufacturing was not up to design requirements. There are also numerous examples, from the first attempts to lay the transatlantic telegraph cable (a nine-year delay) to the Sydney Opera House (an eight-year delay), of projects that foundered at high cost until manufacturing caught up to design.

I've read what both armchair and true experts have to say about IC manufacturing today, and to be honest I still haven't been able to reach a conclusion about the Intel/RDR Leagues Ahead theory. The problems with comparing manufacturing at AMD and Intel are numerous. For example, we have no idea how much is being spent on each side. We could set upper limits, but there is no way to tell exactly how much, and this does make a difference. If the IBM/AMD process consortium is spending twice as much as Intel on process R&D, then I would say Intel is doing great; if Intel is spending twice as much, I'm not so sure. We also know that Intel has more design engineers and more R&D money than AMD does for the CPU design itself. That could account for the smaller gate size just as much as RDR could. It is possible that differences between SOI and bulk silicon are factors as well. On the other hand, the fact that AMD has only one location (and currently just one FAB) to worry about surely gives it at least some advantage in process conversion and ramping. I don't really have an opinion as to whether AMD's use of SOI is a good idea or a big mistake. However, I do think that the recent creation of the SOI Consortium with 19 members means that neither IBM nor AMD is likely to stop using SOI any sooner than 16nm, which is beyond any current roadmap. I suppose it is possible that they see benefits (from Fully Depleted SOI, perhaps) that are not general knowledge yet.

There is at least some suggestion in Yawei Jin's doctoral dissertation that SOI could have continuing benefits. The paper is rather technical, but the important point is that planar SOI begins having problems at smaller scales:

"we found that even after optimization, the saturation drive current planar fully depleted SOI still can’t meet 2016 ITRS requirement. It is only 2/3 of ITRS requirement. The total gate capacitance is also more than twice of ITRS requirement. The intrinsic delay is more than triple of ITRS roadmap requirement. It means that ultra-thin body planar single-gate MOSFET is not a promising candidate for sub-10nm technology."

The results for planar double gates are similar: "we don’t think ultra-thin body single-gate structure or double-gate structure a good choice for sub-10nm logic device."

However, it appears that "non-planar double gate and non-planar triple-gate . . . are very promising to be the candidates of digital devices at small gate length." But, "in spite of the advantages, when the physical gate length scales down to be 9nm, these structures still can’t meet the ITRS requirements."

So, even though AMD and IBM have been working on non-planar, double-gate FinFET technology, this does not appear sufficient by itself. Apparently it would have to be combined with novel materials such as GaN in order to meet the requirements. It does appear possible, then, for AMD and IBM to continue using SOI down to scales smaller than 22nm. So, it isn't clear that Intel has any long-term advantage by avoiding SOI-based design.

However, even if AMD is competitive in the long run, that would not prevent AMD from being seriously behind today. Certainly, reports that AMD will not get above 2.6GHz in Q4 sound like anything but competitive. When we combine these limitations with glowing reports from reviewers who proclaim that Intel could do 4.0GHz by the end of 2008, the disparity seems insurmountable. The only problem is that the same source that says 2.6GHz Phenom will be out in December or January also says Fastest Intel for 2008 is 3.2GHz quad core:

"Intel struggles to keep its Thermal Design Power (TDP) to 130W and its 3.2GHz QX9770 will be just a bit off that magical number. The planned TDP for QX9770 quad core with 12MB cache and FSB 1600 is 136W, and this is already considered high. According to the current Intel roadmap it doesn’t look like Intel plans to release anything faster than 3.2GHz for the remainder of the year. This means that 3.2 GHZ, FSB 1600 Yorkfield might be the fastest Intel for almost three quarters."

But this is not definitive: "Intel is known for changing its roadmap on a monthly basis, and if AMD gives them something to worry about we are sure that Intel has enough space for a 3.4GHz part."

So, in the end we are still left guessing. AMD may or may not be able to keep up with SOI versus Intel's bulk silicon. Intel may or may not be stuck at 3.2GHz even at 45nm. AMD may or may not be able to hit 2.6GHz in Q4. However, one would imagine that even if AMD can hit 2.6GHz in December, only 2.8GHz would be likely in Q1 versus Intel's 3.2GHz. Nor does this look any better in Q2 if AMD is only reaching 3.0GHz while Intel manages to squeeze out 3.3 or perhaps even 3.4GHz. If AMD truly is the victim of an unmanageable design process, then it surely realized this by Q2 2006; but even assuming that AMD rushed to make changes, I wouldn't expect any benefits any sooner than 45nm. The fact that AMD was able to push 90nm to 3.2GHz is also inconclusive: getting better speed out of 90nm than Intel got out of 65nm could suggest more skill on AMD's part, or it could suggest that AMD had to concentrate on 90nm because of greater difficulty with transistors at 65nm's smaller scale. AMD was delayed at 65nm because of FAB36, while Intel needs a fixed process for distributed FAB processing. Too often we end up comparing apples to oranges when we try to compare Intel with AMD. And we have to wonder: if Intel is doing so well compared to AMD on power draw, why did Supermicro just announce the World's Densest Blade Server with Quad-Core AMD Opteron Processors?

To be honest, I haven't even been able to determine yet whether the K10 design is actually meeting its design parameters. There is a slim possibility that K10s could show up in the November Top 500 supercomputer list. This would be definitive because HPC code is highly tuned for best performance and there are plenty of K8 results for comparison. Something substantially less than twice as fast per core would indicate a design problem. Time will tell.