Wednesday, September 19, 2007

The Top Developments Of 2007

It looks like both AMD and Intel have been as forthcoming as they are likely to be for a while about their long range plans. The most significant items, however, have little to do with clock speeds or process size.

The two most significant developments have without doubt been SSE5 and motherboard buffered DIMM access. AMD has already announced its plan to handle motherboard buffered DIMMs with G3MX. This is significant because it means the end of registered DIMMs for AMD. With G3MX, AMD can use the fastest available desktop DIMMs with its server products. This is great for AMD and server vendors because desktop DIMMs tend to be both faster and cheaper than registered DIMMs. This is also good news for DIMM makers because it would relieve them of making registered DIMMs for a small market segment and allow them to concentrate on the desktop products. Intel may have the same thing in mind for Nehalem. There have been hints by Intel but nothing firm. I suppose Intel has reason to keep this secret since this would also mean the end of FBDIMM in Intel's long-term plans. If Intel is too open about this it could make customers think twice about buying Intel's current server products, which all use FBDIMM. So, whether this is the case with Nehalem or perhaps not until later, it is clear that both FBDIMM and registered DIMMs are on their way out. This will be a fundamental boost to servers since their average DIMM speed will increase. However, this could also be a boost to desktops since adding the server volume to desktop DIMMs should make them cheaper to develop. This also avoids splitting the engineering resources at memory manufacturers so we could see better desktop memory as well.

SSE5 is also remarkable. Some have been comparing this with SSE4 but this is a mistake. SSE4 is just another SSE upgrade like SSE2 and SSE3. However, SSE5 is an actual extension to the x86 ISA. If AMD had been thinking more clearly they might have called it AMD64-2. A good indication of how serious AMD is about SSE5 is that they will drop 3DNow support in Bulldozer. This clears away some bit codes that can be used for other things (like perhaps matching SSE4). Intel has already stated that they would not support it. On the other hand, Intel's statement means very little. We know that Intel executives openly lied about their intentions to support AMD64 right up until they did. And, Intel has every reason to lie about SSE5. The 3-way instructions can easily steal Itanium's thunder and Intel is still hoping (and praying) that Itanium will not get gobbled up by x86. Intel is also stuck in terms of competitiveness because it is too late to add SSE5 to Nehalem. This means that Intel would have to try to include it in the 32nm shrink, which is difficult without making core changes. This could easily mean that Intel is behind in SSE5 until 2010. So, it wouldn't help Intel to announce support until it has to, since supporting SSE5 now would only encourage development for an ISA extension that it will be behind in. Intel is taking the somewhat deceptive approach of working on a solution quietly while claiming not to be. Intel can hope that SSE5 won't become popular enough that it has to support it. However, if it does then Intel can always claim to be giving in to popular demand. It's dishonest but it is understandable for a company that has been painted into a corner.

AMD understands about being painted into a corner. Intel has had the advantage with MCM quad cores since separate dies mean both higher yields and higher clock speeds. For example, on a monolithic quad die you can only bin as high as the slowest core. However, Intel can pick and choose individual dies to put the highest binning ones together. Also, Intel can always pawn off a dual core die with a bad core as a lowly Conroe-L but it would be a much bigger loss for AMD to sell a quad die as a dual core. AMD's creative solution was the Triple Core announcement. This means that any quads with a bad core will be sold as X3's instead of X4's. This does make AMD's ASP look a bit better. I doubt Intel will follow suit on this but then it doesn't have to. For AMD, having an X4 knocked down to an X2 is a big loss but for Intel it just means having a Conroe knocked down to Conroe-L which is not so big. Simply put, AMD needs triple cores but Intel doesn't. On the other hand, just as AMD was forced to release a faster FX chip on the older 90nm process so too it seems Intel has been forced to deliver Tigerton not with the shiny new Penryn core but with the older Clovertown core. Tigerton is basically just Clovertown on a quad FSB chipset. This does suggest at least a bit of desperation since after working on this chipset for over a year Intel will be lucky if it breaks even on sales. To understand what a stumble Tigerton is you only have to consider the tortured upgrade path. In 2006 and most of 2007 Intel's 4-way platform meant Tulsa. Now we get Tigerton which uses the completely incompatible Caneland chipset. No upgrades from Tulsa. And, for anyone who buys a Tigerton system, oops, no upgrade to Nehalem either. In contrast, 4-way Opteron systems should be upgradable to 4-way Barcelona with just a BIOS update. And, if attractive, these should be upgradable to Shanghai as well. After Nehalem though, things become more even as AMD introduces Bulldozer on an incompatible platform. 2009 will without doubt be the year of new sockets.
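
The binning argument can be sketched with a toy model. The per-core clock limits below are hypothetical numbers picked for illustration, not measured data; the only rule taken from the text is that a package bins at the speed of its slowest core.

```python
# Toy model of speed binning. All clock values are hypothetical (Ghz).

def package_bin(core_clocks):
    """A package can only be sold at the speed of its slowest core."""
    return min(core_clocks)

def pair_dual_dies(dual_dies):
    """MCM approach: rank dual-core dies by their own bin, then pair neighbours."""
    ranked = sorted(dual_dies, key=package_bin, reverse=True)
    return [ranked[i] + ranked[i + 1] for i in range(0, len(ranked) - 1, 2)]

# Four hypothetical dual-core dies' worth of silicon.
dies = [(3.0, 3.0), (2.4, 2.6), (3.0, 2.9), (2.5, 2.4)]

# Monolithic case: cores that sit together on one quad die stay together.
monolithic = [dies[0] + dies[1], dies[2] + dies[3]]
monolithic_bins = [package_bin(quad) for quad in monolithic]

# MCM case: the same silicon, but dies are matched after testing.
mcm_bins = [package_bin(quad) for quad in pair_dual_dies(dies)]

print("monolithic quad bins:", monolithic_bins)
print("MCM quad bins:       ", mcm_bins)
```

With these made-up numbers both monolithic quads are held to their slowest core, while matching dies after test lets one MCM package bin noticeably higher.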

For the first time in quite a while we see Intel hitting its limits. Intel's 3.33Ghz demo had created the expectation of cool running 3.33Ghz desktop chips with 1600Mhz FSBs. It now appears that Intel will only release a single 45nm desktop chip in 2007 and it will only be clocked at 3.0Ghz. The chip only has a 1333Mhz FSB and draws a whopping 130 Watts. Thus we clearly see Intel straining to deliver something faster much as AMD did recently with its 3.2Ghz FX. However, Intel is not straining because of AMD's 3.2Ghz FX chip (which clearly is no competition). Intel is straining because of AMD's server volume share. In the past year, AMD's server volume has dropped from about 25% to only 13%. Now with Barcelona, AMD stands to start taking share back. There really isn't much Intel can do to prevent this now that Barcelona is finally out. But any server chip share that is lost is a double blow because server chips are worth about three times as much as desktop chips. This means that any losses will hurt Intel's ASP and boost AMD's by much more than a similar change in desktop volume would. So, Intel is taking its best and brightest 45nm Penryn chips and allocating them all to the server market to try to hold the line against Barcelona. Of the 12% that Intel has gained it is almost certain to lose half back to AMD in the next quarter or two, but if it digs in, then it might hold onto the other half. This means that the desktop gets the short end of the stick in Q1 2008. However, by Q2 2008, Intel should be producing enough 45nm chips to pay attention to the desktop again. I have to admit that this is worse than I was expecting since I assumed Intel could do a 3.33Ghz desktop chip by Q1. But now it looks like 3.33Ghz will have to wait until Q2.

AMD is still a bit of a wild card. It doesn't appear that they will have anything faster than 2.5Ghz in Q4 but 3.0Ghz might be doable by Q1. Certainly, AMD's demo would suggest a 3.0Ghz in Q1 but as we've just seen, demos are not always a good indicator. Intel's announcement that Nehalem has taped out is also a reminder that AMD has made no such announcement for Shanghai. AMD originally claimed mid 2008 for Shanghai and since chips normally appear about 12 months after tapeout we really should be seeing a tapeout announcement very soon if AMD is going to release by Q3 2008. There is little doubt that AMD needs 45nm as soon as possible to match Intel's costs as Penryn ramps up. A delay would seem odd since Shanghai seems to have fewer architecture changes than Penryn. AMD needs a tapeout announcement soon to avoid rumors of problems with its immersion scanning process.

Sunday, September 16, 2007

Untying The Gordian Knot -- Initial Barcelona Benchmarks

I can't recall the last time that there was so much controversy surrounding the launch of a new product nor can I recall the last time that testing was so poor and analysis so far off the mark. Barcelona launches amid a flurry of conflicting praise and criticism. The problem right now is that many ascribe the positive portrayals of Barcelona to simple acceptance of AMD slide shows and press releases while those who are strongly critical seem to only have the poor quality testing for backing.

I can't really criticize people for misunderstanding the benchmarks since these have included similar poor analysis by the testers themselves. It is unfortunate indeed that there is not one single review that has enough testing to be informative. It is equally unfortunate that proper analysis requires some technical understanding of Barcelona (K10) which has been lacking among typical review sites such as Tech Report and Anandtech.

K8 is able to handle two 64 bit SSE operations per clock. It takes the capacity of two of the three decoders to decode the two instructions each clock. The L1 Data cache bus is able to handle two 64 bit loads per clock. Finally, K8 has two math capable SSE execution units, FMUL and FADD and is able to handle one 64 bit SSE operation on each per clock. Core 2 Duo, however, does better but it is important to understand how. Simply doubling the number of 64 bit operations from two to four would be impossible since this would require four simple decoders and four SSE execution units. C2D only has three available simple decoders and two execution units. C2D gets around these limitations by having execution units that are 128 bits wide which allows two 64 bit operations to be run in pairs on a single execution unit. Consequently, by using 128 bit SSE instructions, C2D only needs two decoders plus two execution units to double what K8 can do. Naturally, though, C2D's L1 Data bus also has to be twice as wide to handle the doubled volume of SSE data. This 2:1 speed advantage for C2D over K8 is not really debatable as it has been demonstrated time and time again on HPC Linpack tests. Any sampling of the Linpack peak scores will show that C2D is twice as fast. Today though, Barcelona's SSE has been expanded in a very similar way to Core 2 Duo's. Whereas two of K8's decoders could process two 64 bit instructions per clock, Barcelona's decoders can process 128 bit SSE instructions at the same rate. The execution units have been widened to 128 bits and are capable of handling pairs of 64 bit operations at each clock. Likewise, the L1 Data cache bus has been widened to allow two 128 bit loads per clock.

The reason it is necessary to understand the architecture is to understand how the code needs to be changed in order to see the increase in speed. Suppose you have an application that uses 64 bit SSE operations for its calculations. This code will not run any faster on K10 or C2D. Since both K10 and C2D have only two SSE execution units, the code would be bottlenecked at the same speed as K8. The only way to make these processors run at full speed is to replace the original 64 bit operations with 128 bit operations allowing the 64 bit operations to be executed in pairs. Without this change, there is no change in SSE speed. It becomes very important therefore to fully understand whether or not a given benchmark has been compiled in a way that will make this necessary change in the code.
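
As a rough sketch of the arithmetic in the two paragraphs above, the peak rates can be written out using only the unit counts and widths given in the text. The function names are made up for this illustration, and the model deliberately ignores decode and load limits.

```python
# Peak 64-bit SSE operations per clock, per core, from the figures in the text.
# This is a simplified model: it looks only at execution unit count and width.

def peak_ops_per_clock(execution_units, unit_width_bits, operand_bits=64):
    """Each unit retires one instruction per clock; a wider unit packs more operands."""
    return execution_units * (unit_width_bits // operand_bits)

def achieved_ops_per_clock(execution_units, unit_width_bits, instruction_bits):
    """Code issued as narrow instructions cannot fill a wide unit."""
    used_bits = min(unit_width_bits, instruction_bits)
    return execution_units * (used_bits // 64)

k8_peak = peak_ops_per_clock(execution_units=2, unit_width_bits=64)     # FADD + FMUL
wide_peak = peak_ops_per_clock(execution_units=2, unit_width_bits=128)  # C2D and K10

# The same wide hardware running old 64 bit code versus recompiled 128 bit code.
old_code_on_wide = achieved_ops_per_clock(2, 128, instruction_bits=64)
new_code_on_wide = achieved_ops_per_clock(2, 128, instruction_bits=128)

print(k8_peak, wide_peak, old_code_on_wide, new_code_on_wide)
```

Under this model, 64 bit code on the wider cores lands at the same two operations per clock as K8, which is why the compile settings of a benchmark matter so much.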

Let's dive right into the first fallacy dogging K10: the claim that K10's SSE is no faster than what is seen on K8. This naive idea is easily proven false even with code that is optimized for C2D, Anandtech's Linpack test:



It is clear that this code is using 128 bit SSE for Xeon because it is more than twice the speed of K8. It also seems clear that Opteron 2350 (Barcelona) is using 128 bit SSE as well because it is also more than twice as fast as the pair of Opteron 2224's (K8). However, we can see that both paired Xeon 5160 (dual core C2D) and Xeon 5345 (quad core C2D) are significantly faster than Barcelona. This is because the code order is arranged to match the latency of instructions in C2D. However, when we look for a better test, Techware Labs Linpack test:

Barcelona 2347 (1.9Ghz) 37.5 Gflop/s
Intel Xeon 5150 (2.6Ghz) 35.3 Gflop/s


The problem with this data is that the article does not make clear exactly what the test configuration is. We can infer that Barcelona is dual since the motherboard is dual. We could also infer that 5150 is dual since 5160 was dual. But this is not stated explicitly. So, we either have a comparison where the author is correct and Barcelona is faster or we have Barcelona using twice as many cores to achieve a similar ratio to what we saw at Anandtech.
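
To see how much the unstated configuration matters, the two scores can be normalized per core and per Ghz under each reading. The core counts below are assumptions, since the article does not state them: eight Barcelona cores for a dual socket board, four Xeon 5150 cores for a dual socket board, and a four core Barcelona case for comparison.

```python
# Normalize the Techware Labs Linpack scores by core count and clock speed.
# Scores and clocks are from the article; the core counts are assumptions.

def per_core_per_ghz(gflops, cores, ghz):
    """Gflop/s delivered per core per Ghz, i.e. operations per core per clock."""
    return gflops / (cores * ghz)

BARCELONA_GFLOPS, BARCELONA_GHZ = 37.5, 1.9
XEON_GFLOPS, XEON_GHZ = 35.3, 2.6

# Reading A (assumed): dual Barcelona (8 cores) against dual Xeon 5150 (4 cores).
barcelona_8_cores = per_core_per_ghz(BARCELONA_GFLOPS, 8, BARCELONA_GHZ)
xeon_4_cores = per_core_per_ghz(XEON_GFLOPS, 4, XEON_GHZ)

# Reading B (assumed): equal core counts, 4 against 4.
barcelona_4_cores = per_core_per_ghz(BARCELONA_GFLOPS, 4, BARCELONA_GHZ)

print(f"Xeon 5150, 4 cores: {xeon_4_cores:.2f}")
print(f"Barcelona, 8 cores: {barcelona_8_cores:.2f}")
print(f"Barcelona, 4 cores: {barcelona_4_cores:.2f}")
```

The four core reading works out to nearly five operations per core per clock, which is more than the four 64 bit operations per clock described earlier, so the eight core reading looks like the more plausible one.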



This Sandra benchmark shows no increase in speed for C2D so it is clear that it is not using 128 bit instructions. Since the SSE code is not optimized there is no reason to assume that the Integer code is either. Therefore, we have to discard both the Whetstone and Dhrystone scores.

Tech Report Sandra Multimedia



Normalizing the scores for both core count and clock speed shows tight clustering in three groups. Using a ratio based on K8 Opteron gives:

K8 Opteron - 1.0
K10 Opteron - 1.6
C2D Xeon - 1.9

These ratios appear normal since C2D is nearly twice K8's speed. Also, K10's ratio is about what we would expect for code that is not optimized for K10. Properly optimized code for K10 should be a bit faster. This is similar to what we saw with Linpack.



Normalizing the scores for both core count and clock speed shows tight clustering in three groups. Using a ratio based on K8 Opteron gives:

K8 Opteron - 1.0
K10 Opteron - 1.9
C2D Xeon - 3.7

Barcelona's score compared to Opteron is what we would expect at roughly twice the speed. The oddball score is C2D since it is roughly four times Opteron's speed. This type of score might lead to the erroneous conclusion that C2D is nearly four times faster than K8 and still roughly twice the speed of Barcelona. Such a notion however doesn't stand up to scrutiny. A thorough comparison of K10 and K8 instructions shows that the only instructions that run twice as fast are 128 bit SIMD instructions. There is no doubt therefore that K10 is indeed using 128 bit SIMD. It is not of any particular importance whether K8 is using 64 or 128 bit Integer instructions since these run the same speed. This readily explains K10's speed. However, C2D's is still puzzling since its throughput for 128 bit Integer SIMD is roughly the same as K10's. The only clue to this puzzle is the poor code optimization seen in the previous SSE FP operations. Given that Sandra does exhibit poor optimization for AMD processors (and sometimes for Intel as well), we'll have to conclude that the slow speed of the code on AMD processors is simply due to poor code optimization. Someone might be tempted to attribute this to actual differences in architecture such as C2D's greater store capacity. However, this is clearly not the case as K8 has the same proportions and is still only one quarter the speed. In other words, if this were the case then K8 would be half the speed and then K10 would be bottlenecked at nearly the same speed. Given the K8/K10/C2D ratios the most likely culprit is code that is heavily optimized for C2D but which runs poorly on AMD processors.

We do have a second benchmark from the Anandtech review that is interesting.

zVisuel gives Intel an unfair advantage since it is, "a benchmark which is very SSE intensive and which is optimized for Intel CPUs." Barcelona nevertheless takes the lead. Although I don't care for using benchmarks that clearly favor Intel, I suppose I would at least agree with the author's assessment:

"The LINPACK and zVisuel benchmarks make it clear that Intel and AMD have about the same raw FP processing power (clock for clock), but that the Barcelona core has the upper hand when the application has to access the memory a lot."

Unfortunately, because these benchmarks did favor Intel we can't say for certain what Barcelona's top speed truly is. So, the initial benchmarks for Barcelona are not really indicative of its performance. Over time we should see reviews and benchmarks that show K10's true potential. If from no other source, HPC Linpack scores on supercomputers using K10 will eventually show K10's true SSE FP ability. Since SSE Integer operations have the same peak bandwidth with 128 bit width, this would also indicate maximum Integer speed as well. With K10's base architectural changes confirmed it should just be a matter of time before the reviews catch up. This should also mean that for AMD it is primarily a matter of increasing clock speed above 2.5 Ghz to be fully competitive with Intel.

Monday, September 10, 2007

AMD's K10 – A Good Start

K10 can be summed up pretty simply based on the benchmarks in reviews that popped up like mushrooms just after midnight Monday, September 10th as AMD's NDA expired. A 2.0Ghz Barcelona seems to be equal to a 2.33Ghz Clovertown. And, since Penryn is only showing a 5% increase in speed, this should make a 2.0Ghz Barcelona equal to a 2.2Ghz Penryn. Since AMD has promised 2.5Ghz in Q4 this should allow AMD to match the 2.83Ghz Penryn. This means that in Q4, Intel will still have two clock speeds, 3.0 and 3.16Ghz which are faster. In Q1 08, with 3.0Ghz, AMD should be able to match up to 3.33Ghz.
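
The clock matching above is simple proportional arithmetic, and a short sketch makes it easy to check. It assumes performance scales linearly with clock speed, which is a simplification, and uses only the two ratios given in the paragraph.

```python
# Convert a Barcelona clock to the Penryn clock of roughly equal performance.
# Assumes linear scaling with clock speed; both ratios are from the text.

CLOVERTOWN_PER_BARCELONA = 2.33 / 2.0  # 2.0Ghz Barcelona ~ 2.33Ghz Clovertown
PENRYN_GAIN = 1.05                     # Penryn ~5% faster than Clovertown per clock

def penryn_equivalent(barcelona_ghz):
    """Penryn clock needed to match a Barcelona running at barcelona_ghz."""
    clovertown_ghz = barcelona_ghz * CLOVERTOWN_PER_BARCELONA
    return clovertown_ghz / PENRYN_GAIN

for ghz in (2.0, 2.5, 3.0):
    print(f"{ghz:.1f}Ghz Barcelona ~ {penryn_equivalent(ghz):.2f}Ghz Penryn")
```

The 2.5Ghz case comes out a little under 2.83Ghz, so 2.83Ghz is best read as the nearest Penryn speed grade rather than an exact match.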

I've already seen analogies drawn with 2003 when AMD first launched K8. There are some similarities but some important differences as well. One difference is that K10 is slower at launch than Opteron was in 2003. Given the same speed ratios, Barcelona would need to be at 2.2Ghz. So, AMD's K10 launch is a bit worse than Opteron in 2003. However, nothing else is really the same. When K8 launched there was exactly one chipset (AMD's) that supported it and compatible motherboards only arrived very slowly. Today, there are several chipsets and dozens of boards that can handle K10 with nothing more than a BIOS update. Another difference though is process yield. K8 launched with a painfully low yielding 130nm SOI process (only about half the yield of AMD's K7 bulk 130nm process) and this new process took a year to reach yield maturity. Today, in spite of the low clocks, AMD's 65nm process for K10 has excellent yields. Intel's yields, in contrast, on its brand new 45nm process will take a couple of quarters to reach maturity. This plus Intel's much slower ramp gives AMD some breathing room until Intel catches up in Q2 08. Some people have overestimated Intel's volume and clock ramp because they were spoiled by the launch of C2D back in 2006. These two are not comparable however as Intel's 65nm process then already had six months of maturity from production of Presler in Q4 05. On the other hand, there is no reason to be pessimistic about Intel since Penryn is doing quite well and a long way from the painful start of Prescott (Pentium D) in 2004. AMD should be able to reach good volume and a 3.0Ghz clock speed by Q1 08 and if we use the same six month lead then Intel should be looking good with 45nm in Q2 08.

My view of processors was pretty good from 2003 up to the release of Yonah in early 2006. However, I can readily admit that I didn't see C2D coming. It was quite a surprise when C2D was considerably faster than I had expected. The normal design cycle for AMD has been pretty clear since K5. Specifically, AMD had a small team doing upgrades on 486 and a major team that had designed the 29000 RISC processor. AMD dismantled the 29000 team and formed a new design team which created K5. As K5 was finished and the team free again a new major team started work on K7. K6 was never a major design team at AMD since it was designed by NexGen; however, it is clear that another small upgrade team was formed to see K6 through introduction and the upgraded versions K6-2 and K6-III. Having finished K7 the major team began K8 while the upgrade team created K7 upgrades like Athlon MP and Barton. Once K8 was finished, a major team began work on K10 while an upgrade team saw K8 through X2 and Revision F with its new memory controller, virtualization, and RAS features. K10 has been essentially finished since mid 2006 so the major team would be working on Bulldozer while an upgrade team handles both the launch and the later Shanghai version. Obviously, these teams aren't fixed but can be stripped or dismantled and then added to or nearly newly created from members of previous teams. The teams do change but the size and amount of work they do stays about the same. The only real change has been fairly recent. AMD has had a secondary team working on Turion. AMD also acquired some new design personnel with the Geode architecture. It looks like staff from these two groups have formed a major design team to work on Bobcat while a second upgrade team worked on the Griffin upgrade of Turion. I would have to assume that these teams are smaller than the K10/Bulldozer teams since Griffin has only a few changes from K8 and the Bobcat architecture is simpler. Nevertheless this does give a stronger combined focus to the formerly separate Geode and Turion lines. Apparently, Intel has a similar secondary group working on Silverthorne.

Intel's design teams used to be similar. Intel's major team created Pentium Pro, Willamette, and then Prescott. The upgrade team had been working on Pentium and it moved on to the PII and PIII upgrades of Pentium Pro and then the Northwood upgrade of Willamette. Intel had another team working on Itanium and a small team in Israel working on Banias. The major team was supposed to be working on Nehalem after Prescott while the upgrade team would move to Tejas as the upgrade of Prescott. Then we started hearing about a new Indian team working on Whitefield. As far as I know the Whitefield team was disbanded and very little if any of their work was used in other designs. From what I can gather, Whitefield would have been a quad core version of Sossaman with an Integrated Memory Controller and Intel's CSI point to point interface. The talk had been that Conroe was an upgrade of Yonah so the speed was quite a surprise. Once it was released though it was clear that Conroe was not an upgrade at all but a completely new design. Yet, many months later I still saw web authors calling Conroe a derivative of Yonah. Others too were still under the mistaken notion that Intel only started working on Conroe when Whitefield was canceled. However, neither of these ideas is correct.

There is no way to know exactly what went on at Intel but given the fact that we know roughly what Intel knew we can make some reasonable guesses. It must have been clear to Intel in late 2002 that Prescott had serious problems. It is also clear that ES Banias samples were available at this same time. It therefore seems reasonable that Intel decided in 2002 to work on a Prescott successor based on Banias. The Whitefield team was apparently created as both a hedge and an attempt to get CSI out the door more quickly. So, it appears to be the case that the Banias team became a small upgrade team which then worked on Dothan and Yonah. The original Tejas team seems to have produced Smithfield and the 65nm shrink, Presler, plus the Tulsa upgrade. Taken in this perspective we can see that neither of these were major teams. This means that Intel had to have shifted away from the original P4 Nehalem design and on to Core 2 Duo back in late 2002. This would make three years to mid 2005 and just enough time to finish a new core for the 2006 release. I realize now that information about Core 2 Duo was both stifled and confused. It was stifled because Intel chose not to patent the architectural elements of Banias that went into C2D. And, it was confused because work on C2D got constantly mixed up with the work on Tejas and the original P4 based Nehalem and the work on Whitefield. However, there was also the problem of the FSB which was expected to limit performance. This notion was strongly bolstered by the Whitefield design which was to have an IMC. However, in a brilliant but unexpected move, Intel managed to overcome the crippling latency of the FSB in a completely different fashion by increasing the L2 cache size and substantially beefing up the prefetch hardware. This along with the widened cache buses and SSE units is what made Conroe so fast.

Hopefully, I've now got enough understanding of how we got here to see where we are going. However, I have to mention that when others try to analyze the current situation I've seen a strange element coming into the evaluation. This is the idea of fairness. Many people suggest that it was fair that Intel got back into the lead because supposedly AMD was complacent. This notion however is incorrect. Intel got into the lead because they made the decision to change course way back in 2002 when it became apparent that Prescott wasn't going to work. Also, AMD was not complacent. As I've described their design teams I can't see where they could have gotten more staff to do more interim work. I'm sure that AMD felt that their dual core design was enough to tide them over until K10 was released and I'm certain they didn't foresee Intel's use of MCM to create a quad core. By the time AMD knew about Kentsfield it would have been pointless to work on an MCM design of their own. Also, I can't really fault their 65nm ramp even though it was a year behind Intel's. AMD ran 90nm tests at FAB 36 when it became operational in Q1 06. AMD ran 65nm tests in Q2 and then began actual 65nm production in Q3 which arrived for launch in Q4. I can't see how this would have happened sooner unless AMD had gotten FAB 36 operational sooner. Fairness has nothing to do with what happened. AMD did the best they could and Intel got the benefit of its hard work from the previous three years. It's really as simple as that.

But, these arguments persist. I find people insisting that K10 can't be a good design because it is late or that AMD can't possibly ramp clocks, volume, or yield faster than Intel because Intel has been in the lead. At their heart, these are all fairness arguments with no basis in reasoning. The quality of a design is not related to how late it is nor is the speed of a ramp related to how fast the previous generation of processor was. It has been argued that AMD would go bankrupt after Penryn's launch because AMD would not be able to handle Intel's big cost savings at 45nm. The problem is that it is precisely because AMD is using the older, more mature 65nm process that it will indeed be able to ramp yield, volume, and clocks for K10 faster than Intel can for Penryn on its new 45nm process. Intel will improve its 45nm process and this should pay off by Q2. The process will be mature when Nehalem launches in Q4 08. However, Nehalem loses the die size advantage from MCM and also has to compete with Shanghai as 45nm versus 45nm. It is also clear that AMD will be able to convert to 45nm in about half the time that it takes Intel. Intel may then enjoy a lead in Q1 09 due to its faster memory bandwidth and speed from SSE4 instructions.

However, the situation changes drastically in Q2 09 with Bulldozer. Bulldozer will not only have more memory bandwidth than Nehalem but it will be able to use ordinary ECC DIMMs in the same way as registered memory. This should give AMD a tremendous boost since it would allow Bulldozer to use the faster desktop chips in servers instead of waiting for the lagging registered memory. These desktop chips will be faster and cheaper. It is also clear that AMD intends to move aggressively and bypass JEDEC which has up to now been heavily influenced by Intel. Bypassing JEDEC means that instead of waiting for memory to become an official standard, AMD will certify the memory when it becomes available from a manufacturer such as Micron directly whether it becomes an official standard or not. Had AMD been able to do this earlier there would have been less reason to work on the memory controller change since DDR was available in unofficial speeds up to 600 Mhz which would easily have matched the current DDR2-667 speeds. DDR2 would only have been needed with quad core.

However, the biggest change with Bulldozer is without doubt SSE5. I've seen some people trying to compare SSE5 with SSE4 but these two are not even remotely the same. SSE4 is another SSE update similar to SSE2 or SSE3. In contrast, SSE5 is an actual extension to the x86-64 ISA. And, either by design or by accident AMD has introduced SSE5 at the worst possible time for Intel. The SSE5 announcement is too close to the release of Nehalem for Intel to extend the architecture to take advantage of it. This puts Intel in a very difficult position. Adding SSE5 is such a big improvement that it greatly weakens Itanium. However, not adding SSE5 means giving substantial ground up to AMD. Since Intel has given every indication that it will beef up its x86 line when forced to by AMD we'll have to assume that it will add SSE5. Intel however faces the second problem that by the time Bulldozer is released SSE5 will already be supported. Intel could theoretically announce its own standard in the next six months; however, pressure similar to what occurred with AMD64 is likely to prevent Intel from going its own way. It is a clear sign that AMD is serious about SSE5 since it has dropped its own 3DNow! instructions with Bulldozer.

On the other hand, support for SSE5 will be difficult for Intel since it would have to be added to the 32nm shrink of Nehalem without any major changes in architecture. This basically means that Intel will have to substantially modify its predecoders and decoders to support the extended DREX byte decoding. This is an easier task for AMD since all of its decoders are complex whereas Intel uses only one complex decoder. This could mean a major overhaul of Nehalem's decoders. It also means that Intel gets to watch the value of its macro-ops fusion get substantially weakened by the new SSE5 vector conditional move. This also tends to reduce some of C2D's features since 3-way instructions reduce loads. Finally, Intel has to figure out how to do 3-way (two source + one destination) operations without adding extra register move micro-ops. It can be done but it means tweaking the control logic way down in the core. I think Intel can do it but it wouldn't be easy. This could also mean modifying the current micro-ops structure to have room for the extra destination bits but a change like this most likely could not be done with just the 32nm die shrink of Nehalem. The bottom line is that Intel will not have a clear lead over AMD on into 2009 as some have suggested. Intel has some bright spots such as Q2 08 when it gets 45nm on track and later in Q1 09 when Nehalem will have memory bandwidth advantages. However, AMD has its own bright spots such as Q1 08 when K10 gets good volume and speed on the desktop, Q3 08 with the Shanghai release, and Q2 09 with Bulldozer and SSE5.
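
The point about extra register moves can be illustrated with a toy instruction counter. The mnemonics and register names here are hypothetical and are not real x86 or SSE5 encodings; the sketch only assumes that a two operand instruction overwrites one of its sources while a 3-way instruction names a separate destination.

```python
# Toy comparison of two-operand and three-operand instruction forms.
# Mnemonics and register names are hypothetical, for illustration only.

def two_operand(dest, src1, src2, op):
    """Destructive form: 'op dest, src' overwrites dest, so src1 is copied first."""
    return [f"mov {dest}, {src1}", f"{op} {dest}, {src2}"]

def three_operand(dest, src1, src2, op):
    """3-way form: two sources and a separate destination, no copy needed."""
    return [f"{op} {dest}, {src1}, {src2}"]

# Compute r2 = r0 * r1 and r3 = r0 + r1 while keeping r0 and r1 live.
work = [("r2", "r0", "r1", "mul"), ("r3", "r0", "r1", "add")]

two_op_stream = [ins for args in work for ins in two_operand(*args)]
three_op_stream = [ins for args in work for ins in three_operand(*args)]

print(len(two_op_stream), two_op_stream)
print(len(three_op_stream), three_op_stream)
```

In this toy case the two operand stream is twice as long, and every extra entry is a register move. A core that only has two operand micro-ops internally would have to reintroduce those moves, which is the cost the paragraph above says Intel would need to avoid.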