Monday, July 14, 2008

Reviews And Fairness Or How To Make Intel Look Good

I've had people complain that I've been too tough on Anand but in all honesty Anandtech is not the only website playing fast and loose with reviews.

Anand has made a lot of mistakes lately that he has had to correct. But, aside from mistakes Anand clearly favors Intel. This is not hard to see. Go to the Anandtech home page and there on the left just below the word "Galleries" is a hot button to the Intel Resource Center. Go to IT Computing and the button is still there. Click on the CPU/Chipset tab at the top and not only is the button still there on the left but a new quick tab to the Intel Resource Center has been added under the section title right next to All CPU & Chipset Articles. Click the Motherboards tab and the button disappears but there is the quick tab under the section title. There are no buttons or quick tabs for AMD. In fact there are no quick tabs to any company except Intel. Clearly Intel enjoys a favored status at Anandtech.

What we've seen since C2D was released is a general shift in benchmarks that favor Intel. In other words instead of shooting at the same target we have reviewers looking to see where Intel's arrows strike and then painting a bullseye that includes as many of them as possible. For example, encryption used to be used as a benchmark but AMD did too well on this so it was dropped and replaced with something more favorable to Intel. There has been a similar process for several other benchmarks. Of course now it isn't just processors. Reviewers have carefully avoided comparing the performance of high end video cards on AMD and Intel processors. Reviews are typically done only on high end Intel quad cores. The claim is that this is for fairness but it also avoids showing any advantage that AMD might have due to HyperTransport. It is a subtle but definite difference where review sites avoid testing power draw and graphic performance with just Integrated Graphics where AMD would have an advantage. They then test performance with only a mid range graphic card to avoid any FSB congestion which again might give AMD an advantage. Then high end graphics cards are tested on Intel platforms only which avoids showing any problems that might be particular to Intel. We are also now hearing about speed tests being done with AMD's Cool and Quiet turned on which by itself is good for a 5% hit. I suppose reviewers could try to argue that this is a stock configuration but these are the same reviewers who tout overclocking performance. So, by shifting the configuration with each test they carefully avoid showing any of Intel's weaknesses. This is actually quite clever in terms of deception.

As you can imagine the most fervent supporters of this system are those like Anand Lal Shimpi who strongly prefer Intel over AMD. I had one such Intel fan insist that Intel will show the world just how great it was when Nehalem is released. However, I have a counter prediction. I'm going to predict that we will see another round of benchmark shuffling when Nehalem is released. And, I believe we will see a concerted effort to not only make Nehalem look good versus AMD's Shanghai but also to make Nehalem look good compared to Intel's current Penryn processor. It would be a disaster for reviewers to compare Nehalem and conclude that no one should buy it because Penryn is still faster . . . so that isn't going to happen.

An example is that since AMD uses separate memory areas for each processor it needs an OS and applications that work with NUMA. In the past reviewers have run OS's and benchmarks alike oblivious to whether they worked with NUMA or not. If anything seems to be overly slow they just chalk it up to AMD's smaller size, lack of money, fewer engineers, etc. Nehalem however also has separate memory areas and needs NUMA as well. I predict that these reviewers will suddenly become very sensitive to whether or not a given benchmark is NUMA compatible and will be quick to dismiss any benchmark that isn't. This may extend so far as to purposefully run NUMA on Penryn to reduce its performance. This would easily be explained away as a necessary shift while ignoring that it wasn't done for K8 or Barcelona. That would be explained away as well by saying that the market wasn't ready for it yet when K8 was launched. That was what happened with 64 bit code which was mostly ignored. However, if Intel had made the shift to 64 bits reviewers would have fallen all over themselves to do 64 bit reviews and proclaim AMD as out of date just as they did every time Intel launched a new version of SSE.

We see this today with single threaded code. C2D and Penryn work great with single threaded code but have less of an advantage with multi-threaded code and no actual advantage with mixed code. It is a quirk of Intel's architecture that sharing is much more efficient when the same code is run multiple times. If you compared multi-tasking by running a different benchmark on each core Intel would lose its sharing advantage and have to deal with more L2 cache thrashing. Even though mixed code tests would be closer to what people actually do with processors reviewers avoid this type of testing like the plague. The last thing they want to do is have AMD match Intel in performance under heavy load or worse still actually have AMD beat a higher clocked Penryn. But Nehalem uses HyperThreading to get its performance so I predict that reviewers will suddenly decide that single threaded code (as they prefer today) is old fashioned and out of date and not so important after all. They will decide that the market is now ready for multi-threaded code (because Intel needs it of course).

Cache tuning is another issue. P4EE had a large cache as did Conroe. C2D doubled the amount of cache that Yonah used and Penryns have even more. However, reviewers carefully avoid the question of whether or not processors benefit from cache. This is because benchmark improvements due to cache tend to be paper improvements that don't show up on real application code. So, it is best to avoid comparing processors of different cache sizes to see whether benchmarks are getting artificial boosts from cache. I did have one person try to defend this by claiming that programmers would of course write code to match the cache size. That might sound good to the average person but I've been a programmer for more than 25 years. Try and guess what would happen on a real system if you ran several applications that were all tuned to use the whole cache. Disastrous is the word that comes to mind. But you can avoid this on paper by never doing mixed testing. A more realistic test for a quad core processor is to run something like Folding@Home on one core and graphic encoding on another while using the remaining two to run the operating system and perhaps a game. Since the tests have to be repeatable you can't run Folding@Home as a benchmark but that isn't a problem since it is the type of processing that needs to be simulated rather than the specific code. For example you could probably run two different levels of Prime95 tests in the background while running a game benchmark on the other two cores to have repeatable results. And, if you do run a game benchmark on all four cores then for heaven's sake use a high end graphics card like a 9800X2 instead of an outdated 8800.
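As a rough sketch of the mixed test described above, here is one way the core assignment could be laid out. This is only an illustration under my own assumptions: the workload names and the `plan_mixed_load` and `pin_current_process` helpers are hypothetical, and the pinning call (`os.sched_setaffinity`) exists only on Linux.

```python
import os

def plan_mixed_load(background, foreground, total_cores=4):
    """Give each background job its own core; the foreground benchmark gets the rest."""
    if len(background) >= total_cores:
        raise ValueError("no cores left for the foreground benchmark")
    # Background jobs take cores 0, 1, ... in the order given.
    plan = {name: {core} for core, name in enumerate(background)}
    # The foreground benchmark runs on every remaining core.
    plan[foreground] = set(range(len(background), total_cores))
    return plan

def pin_current_process(cores):
    """Pin the calling process to the given cores (Linux only); returns False elsewhere."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cores)
        return True
    return False

# Hypothetical workload names: two Prime95 levels in the background, a game up front.
plan = plan_mixed_load(["prime95_small_fft", "prime95_large_fft"], "game_benchmark")
for name, cores in plan.items():
    print(name, sorted(cores))
```

Each workload would call `pin_current_process` with its own set of cores before starting, so that the same layout is used on every run and the results stay repeatable.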

Cache will be an issue for Nehalem because it not only has less than Penryn but it has less than Shanghai as well. It also loses most of its fast L2 in favor of much slower L3. My guess is that if any benchmarks are faster on Penryn due to unrealistic cache tuning they will be quickly dropped. That reviews shift with the Intel winds is not hard to see. Tom's Hardware Guide went out of its way to "prove" that AMD's higher memory bandwidth wasn't an advantage and that Kentsfield's four cores were not bottlenecked by memory. But now that Nehalem has 3 memory channels the advantage of more memory bandwidth is mentioned in every preview. We'll get the same thing when Intel's Quick Path is compared with AMD's HyperTransport. Reviewers will be quick to point to raw bandwidth and claim that Intel has much more. They of course will never mention that the standard was derated from the last generation of PCI-e and that in practice you won't get more bandwidth than you would with HyperTransport 3.0.

I could be wrong; maybe review sites won't shift benchmarks when Nehalem appears. Maybe they will stop giving Intel an advantage. I won't hold my breath though.

Addition:

We can see where Ars Technica discovers a PCMark 2005 error. Strangely the memory score gets faster when PCMark thinks the processor is an Intel than when it thinks it is an AMD. Clearly the bias is entirely within the software since the processor is the same in all three tests:



I've had Intel fans claim that it doesn't matter if Anandtech cheats in Intel's favor because X-BitLabs cheats in AMD's favor. Yet, here is the X-Bit New Wolfdale Processor Stepping: Core 2 Duo E8600 Review from July 28, 2008. In the overclocking section it says:

At this voltage setting our CPU worked stably at 4.57GHz frequency. It passed a one-hour OCCT stability test as well as Prime95 in Small FFTs mode. During this stress-testing maximum processor temperature didn’t exceed 80°C according to the readings from built-in thermal diodes.

This sounds good but the problem is that to properly test you have to run two separate copies of Prime95 with core affinity set so that one copy runs on each core. This article doesn't really say that they did that. There is a second problem as well dealing with both stability and power draw testing:

We measured the system power consumption in three states. Besides our standard measurements at CPU’s default speeds in idle mode and with maximum CPU utilization created by Prime95

This is actually wrong; the Prime95 test they performed was not maximum power draw; it was Prime95 in Small FFTs mode. But this doesn't agree with Prime95 itself which clearly states in the Torture Test options:

In-place large FFTs (maximum heat, power consumption, some RAM test)

So, the Intel processors were not tested properly. Using the small FFTs does not test either maximum power draw or temperature and therefore doesn't really test stability. If any cheating is taking place at X-Bit, it seems to be in Intel's favor.

Wednesday, July 02, 2008

Anand's Competence Reviewed: Crash And Burn

Anandtech's latest review, AMD's Phenom X4 9950, 9350e and 9150e: Lower Prices, Voltage Tricks and Strange Behavior shows a lot more about Anand's ability as a tester than anything about AMD's hardware.

My older brother used to work on aircraft avionics on the weekends while he was going to college at Purdue. Theoretically he was one of the junior technicians at the small, commuter airline. However, he had gotten his experience in the Marine Corps working on HAWK missile systems which required an elaborate set of five separate radar units to operate. On one occasion he accompanied a senior technician to another airport where one of their planes was down for maintenance. The other technician worked on the avionics for three hours without success and then quit to go to lunch. By the time he got back from lunch my brother had diagnosed and fixed the problem. I have to say that Mr. Lal Shimpi reminds me a lot of that senior technician. Maybe because I've never heard of anyone or any other review site wrecking systems like Anand does.

But it isn't just destroying systems that is troubling. In this article Anand says:

The first processor is the Phenom X4 9950 Black Edition. Clocked at 2.6GHz, the Black Edition moniker indicates that it ships completely unlocked. Unfortunately the unlocked nature doesn’t really help you too much as the 65nm Phenoms aren’t really able to scale much beyond 2.7GHz consistently

This is a strange claim indeed because everyone else seems to be able to clock these chips to 3.2Ghz with no trouble, and I've seen claims as high as 3.6Ghz. He also insists that Intel's quads will overclock to 4.0Ghz on air. But, in reality when you stress all four cores on an Intel quad they will overheat at 4.0Ghz without water cooling. So, the real difference between recent Intel quads and recent AMD quads when overclocking on air seems to be between 300Mhz and 500Mhz. This means that Anand has taken a probable 500Mhz advantage for Intel and stretched it to a completely fictitious 1.3Ghz advantage. No wonder Anand is seen as a minor deity among the Intel faithful.
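To make the arithmetic explicit, here is a small sketch using only the figures above. The variable names are mine, and the 0.5 figure is just the top of my own 300Mhz to 500Mhz estimate rather than a measurement.

```python
# Clock figures taken from the claims discussed above (in Ghz).
claimed_intel_ghz = 4.0   # Anand's claim for Intel quads on air
claimed_amd_ghz = 2.7     # Anand's claim for 65nm Phenoms

# Gap implied by Anand's two claims.
claimed_gap = round(claimed_intel_ghz - claimed_amd_ghz, 1)

# Upper end of the gap estimated in the text (an estimate, not a measurement).
likely_gap_high = 0.5

# How much the implied gap exceeds even the high end of that estimate.
overstatement = round(claimed_gap - likely_gap_high, 1)

print(claimed_gap)     # 1.3
print(overstatement)   # 0.8
```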

This article shows Anand's preferences. He likes Intel. He likes powerful chips. He likes newer chips. And, he likes a low price but he doesn't like giving up anything to get it. This is easy to see from statements like:

The new $133 golden boy, Intel's Core 2 Duo E7200, is actually selling for $129 these days - making it the new value leader from the boys in blue.

He chooses this 2.53Ghz, 45nm chip because he is ignoring the lower priced Allendales that are older, 65nm chips and have less cache. The lower priced Allendales are good chips but since they have less cache and don't overclock as well they are not clearly a better value than AMD chips. The more expensive E7200 is a bargain if you do overclock because it can easily run as fast as Intel's 3.2Ghz E8500 which costs twice as much. Similarly, the Q6600 had enjoyed the status of being the best value quad since it was introduced in early 2007. But, Anand doesn't heap the same praise on the 2.5Ghz Q9300 that he does on the E7200 perhaps because while the Q6600 has dropped to just $210, the Q9300 is still running $270. Anand's biggest problem at this point is that AMD is eroding Intel's lead and with it his perceived value of these chips. Keep in mind though that this is mostly in Anand's head; the Q6600 is still a good chip as are the Allendales. The problem for him is that he doesn't want a good chip; he wants a chip that is clearly better than AMD's, and that smug feeling of Intel superiority is getting a lot harder to come by.

For example, you could match the $130 (2.53Ghz) E7200 with an AMD $126 (2.9Ghz) 5600X2. Both chips are 65 watt but even at 400Mhz slower the Intel chip will still be a bit faster. The problem is though that Intel motherboards have poor integrated graphics. An equivalent Intel motherboard would need at least a low end graphics card to match AMD and you could apply this savings to a $160 (3.2Ghz) 6400X2 which with its 26% faster clock is not markedly slower. In other words if you stay with integrated graphics and stock speeds then Intel has no advantage because you will pay more money to get a faster system. However, adding a robust discrete graphics card neutralizes AMD's superior integrated graphics motherboards. And the better overclocking on Intel duals tends to neutralize AMD's lower dual price. This is probably why Anand always pushes a system with overclocking and discrete graphics.

But, things are slowly changing and I think Anand is catching disturbing glimpses of the handwriting on the wall. As the price of AMD's tri-cores gets lower the power of three cores tends to remove Intel's higher clocking dual advantage. Secondly, the B3 stepping greatly improves the overclockability of both AMD's tri-cores and quads. For example, multiply the E7200's $130 by 1.5 for a third core and you get $195. This makes the slightly slower 2.4Ghz Phenom 8750 X3 arguably a good value at $175. And, with its higher price tag the Q9300 is not a bargain unless it can overclock significantly better than the 9950 X4. Perhaps this is why Anand is in such denial about how well the newer AMD B3 chips can overclock.
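The price and clock comparisons above are simple enough to check in a few lines. This is only a sketch using the prices and clocks already quoted; the helper names `pct_faster` and `scaled_price` are my own.

```python
def pct_faster(clock_a_ghz, clock_b_ghz):
    """Percentage by which clock A exceeds clock B."""
    return (clock_a_ghz / clock_b_ghz - 1) * 100

def scaled_price(price, cores_from, cores_to):
    """Scale a chip's price linearly by core count (a crude value yardstick)."""
    return price * cores_to / cores_from

# 3.2Ghz 6400X2 versus 2.53Ghz E7200
print(round(pct_faster(3.2, 2.53)))   # 26

# E7200's $130 scaled from two cores to three
print(scaled_price(130, 2, 3))        # 195.0
```

Linear scaling by core count is obviously a simplification, but it is the same yardstick used in the paragraph above to compare a dual against a tri-core.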

I've also never heard of anyone wrecking so many motherboards during testing. This admission by Anand is a bit shocking:

Let's just say in the motherboard section of the labs that a halon fire extinguisher is now a standard item on the test bench. Call us unlucky, abusive, or having just dumb luck, but our results these past few weeks when overclocking IGP setups has not been good. In fact, it has been downright terrible as of this week.

You see, it is not every week when you can go through five boards in less than 48 hours while trying to make an article deadline.


However, we know that Anand has destroyed motherboards and systems before including systems that were working perfectly before he got his hands on them. This has been going on for a long time, not just this week as he implies. This explains why Anand does not work in a computer repair shop.

Finally, his power testing is typical Anand. He tests completely different motherboards with different integrated graphics and measures nothing but total system power draw. However, he then strangely claims that his results show only differences in CPU power draw:

With the exception of the Q9300, Intel's competing chips draw less power at idle than even the new energy efficient AMD chips

He then pretends he has a fair comparison because he is using top IGP boards from each:

The next set of tests is particularly interesting as we are comparing Intel's top integrated graphics platform (G35) to AMD's (780G). No external graphics card was used, this is strictly an IGP comparison

However, this comparison is laughable since Intel's G35 graphics are considerably less powerful than AMD's 780G and therefore probably draw less power. Anand then makes certain that he covers up Intel's weaker IGP by using discrete graphics cards for games testing. This bait and switch testing scheme is clearly and knowingly deceptive and shows that Anand is not merely incompetent but dishonest as well. Testing based on price, stock clock speeds, stock heatsinks, and integrated graphics would often favor AMD which is why these tests never end up in Anand's cherry basket.