Showing posts with label Reviews. Show all posts

Tuesday, November 10, 2009

Laying The Groundwork For Proper Testing

I've seen reviews of AMD's Phenom II and Intel's Nehalem. These reviews have varied a lot in quality, but none has really provided comprehensive results. It's time to run some proper tests of my own.

I originally bought an AMD Phenom II X3 720 Black Edition, mainly to have something to try out while I was deciding what quad core to buy. The 955 BE looked pretty good. The 965 had a higher base clock but was also more expensive and was rated at 140 watts. On the Intel side, the i7-920 was still much more expensive than the PII 965. But recently this all changed. AMD released a new C3 stepping of the PII 965 that is rated at 125 watts. Surprisingly, it was released at $200 instead of the $250 that most had been expecting. So I ordered one, along with an Asus M4A79X motherboard, which is similar to my M4A785 board but without integrated graphics. I also purchased an ATI HD 4650 and a couple of ATI HD 5770 graphics cards.

Then I noticed that the i5-750 was the same price, $200, and that I could get an Asus P55 motherboard without graphics that was almost identical to the 79X board at the same $120 price. I ordered those as well. This will give me two almost identical systems. Both will be native quad cores with an onboard memory controller and two memory channels. This should be an excellent head-to-head, dollar-for-dollar test. I'll use DDR3-1600 memory rated at CL 8. This makes the most sense because CL 7 memory is still less common, and faster memory rated at 1800-2100 MHz tends to be twice the cost. I'm getting a couple of moderate-sized third-party coolers to test overclocking, although I'm also interested in how much headroom there is with the stock HSF. Moderate-sized means close to 500 grams in weight, under 130mm tall, and using 92mm fans. This compares with the heavy coolers, which tend to be closer to 160mm tall, weigh upwards of 700 grams, and use 120mm fans.
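Since absolute memory latency is just cycles divided by clock, the CL 8 choice is easy to sanity-check. A quick sketch (Python used only for the arithmetic; the DDR3-1600 CL 8 figure is from above, and DDR3-1333 CL 7 is the setting my current system runs):

```python
# Effective CAS latency in nanoseconds: CL cycles divided by the DRAM I/O
# clock, which is half the DDR transfer rate (two transfers per clock).
def cas_latency_ns(transfer_rate_mts, cl):
    io_clock_mhz = transfer_rate_mts / 2   # e.g. DDR3-1600 -> 800 MHz
    return cl / io_clock_mhz * 1000        # cycles / MHz -> nanoseconds

print(cas_latency_ns(1600, 8))   # DDR3-1600 CL8 -> 10.0 ns
print(cas_latency_ns(1333, 7))   # DDR3-1333 CL7 -> ~10.5 ns
```

The two configurations land within half a nanosecond of each other, and the 1600 parts also deliver more bandwidth, which is why paying double for 1800-2100 MHz memory is hard to justify.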

The hardware is perfect; this is the closest match of AMD and Intel hardware that I've seen in a number of years. I don't think we've had this close a comparison since the K7/K8 single-core days. The question now is how to test, and I'm working on that. My game Dawn of War has a graphics check to see what settings level is playable. With my IGP 785 graphics the game is only playable with minimum settings. I can check this again with the HD 4650, HD 5770, and HD 5770 Crossfire. To be honest, I don't expect to see much difference between the i5 and PII 965 systems. I'll also compare with my X3 720 to see if having another core makes any difference. PassMark's PerformanceTest also includes both 2D and 3D graphics tests. I can say that my 785 graphics fail miserably on the last two of those. I'll give these a try, but I wouldn't be surprised if they are low enough stress that even the 4650 card passes. I'm hoping that Dawn of War will require a bit more, although it too may top out before reaching the level of Crossfire.

For integer testing I'm thinking about something based on GMP, since this library shouldn't be tuned for either AMD or Intel. Use of the Intel compiler is obviously out, since this would contaminate the metrics. I have Visual Studio, but this compiler is only middle of the road in terms of what it produces. The next version is looking much better and is just now available in beta, so we'll see. Better code would be nice, but bugs in the beta version could also contaminate the metrics. However, even the current version should be adequate with integer code; it is the SSE code that is more of a concern. SSE2 is getting a bit dated. SSE3 is about the minimum level that would be nice to test. Better would be SSE4a versus some or all of Intel's SSE4. I don't have a requirement for full SSE4 testing, since a fair bit of this will be replaced with Intel's next upgrade, much as SSE became less important as wider SSE functions were added.
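As a rough illustration of the kind of workload I mean, here is a hedged sketch, not the actual GMP harness: Python's built-in arbitrary-precision integers stand in for GMP's mpz type, with interpreter overhead a compiled C version wouldn't have.

```python
import time

# Multi-word multiply and divide on dense operands -- the same carry-heavy
# integer work a GMP benchmark would exercise, minus compiler tuning issues.
def bigint_bench(bits=200_000, rounds=10):
    a = (1 << bits) - 1            # operand with every bit set
    b = (1 << (bits // 2)) + 1
    start = time.perf_counter()
    for _ in range(rounds):
        product = a * b            # multi-word multiplication
        q, r = divmod(product, b)  # multi-word division
    assert q == a and r == 0       # sanity check on the arithmetic
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"elapsed: {bigint_bench():.3f} s")
```

The point of a workload like this is that neither vendor's compiler tricks are in play; the timing reflects raw integer throughput.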

My operating system is 64-bit. I'm using 8 GB of memory and see no reason to waste time with 32 bits. Anything that I compile will be 64-bit. I do plan to test with both 2 DIMMs and 4 DIMMs to see if there is any difference; with my current system I haven't seen any significant difference in timings or top speed. The two standard cases that I'll be using are both CoolerMaster cases with 200mm fans in the front and top and a 120mm fan in the back. I do have a smaller case with only a single 120mm fan in the back which I could test with. Personally, the notion of putting a $200 processor in a $60 case seems a little goofy, but this could show what type of environment would be acceptable. Frankly, I wouldn't be surprised if the 125 watt PII 965 were too much for that case.

I'm also glad that I got the X3 720 first, since it is rated at 95 watts just like the i5-750. This should give me a pretty good comparison of two 95 watt systems, although since the i5 is a quad core I would expect it to be more powerful. I suppose if the i5 turned in thermals similar to my 720 while matching the performance of the 965, that would be quite a feather in Intel's cap, since it would mean that i5s could be used in smaller cases with less cooling. It wouldn't be any great victory overall, though, since there is no price advantage. Of course, it has been suggested that Intel's power rating is bogus and that actual draw is higher than claimed. Others have tested and insisted that Intel draws less power. Again, I don't really care about previous power draw tests, since I have a 95 watt X3 and a 125 watt X4 to compare with. I suspect that since the i5 is on the lower end of the Nehalem range it will actually fall in between these two, but again I don't know without testing.

And I have two graphics-free motherboards, so I can test without the contaminating effect of integrated graphics. Given the huge gap between AMD and Intel integrated graphics, there really is no way to directly compare them. The simplest solution is to discard the integrated graphics and use the same discrete graphics card. For lower-level tests or small-case tests I would use the HD 4650, which is a pretty solid, middle-of-the-road card. I wouldn't expect a small case with one 120mm fan to be able to handle one 5770, much less two; nor would I expect it to handle an overclocked 125 watt processor. I know that I haven't had any thermal problems with my X3 720, but that case is well ventilated and it is only a 95 watt processor with integrated graphics. If the i5 with the 4650 does not pass a small-case test, then I can still project how well the i5 would do with integrated graphics by comparing with the 720 and the 785 motherboard. And if the tests are borderline, I picked up a heftier, 40 CFM, 120mm Scythe fan which would boost cooling in the small case. This should allow a pretty good inference for cooling with two regular 120mm fans. At any rate, thermal comparisons should settle any question of thermal issues for either Intel or AMD.

I still don't know if my thoughts about case testing are clear. I take issue with the open-case, huge-cooler testing that they do over at Anandtech. Likewise I take issue with the schizophrenic testing they do over at Tom's Hardware Guide. I mean, who in their right mind would put four graphics cards in a small case? I don't really care that much about testing power draw. If electricity is really a concern, you could always buy a lower-power system with slower memory and a 65 or even 45 watt processor and use integrated graphics. But most people are not that concerned about it. Of more concern is whether or not a given case will work with a given system. Generally, everything that people do to increase speed also increases heat. Voltage is increased on the memory, on the CPU, and even on the graphics. Higher clock speeds, more memory, and faster graphics all use more power. We all know that, at various times in the past, thermals were an issue. AMD had K7s before Barton that ran hot; Intel had Prescott, which ran very hot and begat the BTX case as a desperate solution. There had been rumors of higher-clocked AMD dual core and Intel quad core 65nm processors running hot. Of course, now everyone is on 45nm, but the top-end chips are still rated at 125 or even 140 watts. My results should have much more practical value to people who would like to build low and midrange systems rather than just people at the very top end.

Keep in mind that thermal testing is somewhat separate from performance testing. You can't really begin benchmarking unless you know a given system is reliable. Too often, it seems that reviewers arrive at a hasty estimate of maximum clock and then run their benchmarks without really knowing how stable the system is (unless it crashes during the tests). I suppose that, plus time pressure, is why they take so many shortcuts. It takes hours just to run memory stability tests and hours more to run system and stress tests. And this has to be repeated when trying to find a maximum overclock. Lots of variables, like memory voltage, northbridge speed, CPU voltage, and base clock versus multiplier, all add up to hours and hours of added testing. And this is before any actual benchmarks are run. I am confident in the settings on my system. I run the CPU at 3.4 GHz; I've tested it much higher. I run with auto voltage on both the CPU and NB, with the base clock at 200 MHz and the NB at 2.6 GHz. I have the memory at 1333 MHz with CL 7 and 1.545 volts. Auto doesn't work with the memory, since auto is 1.5 volts and it will get errors at 1.515 volts with these settings. I'm confident that this is maximum performance for my system. I've tested overclocking the graphics, but this isn't really worthwhile, since you can get many times better performance without stressing the chipset just by putting in a moderate graphics card like the HD 4650. I expect both the i5 and the 965 to show improved performance over my current system.
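To put numbers on that tuning effort, here is a quick back-of-envelope sketch. The candidate values are assumptions for illustration, not my actual BIOS menus:

```python
from itertools import product

# Assumed candidate values for each tuning variable mentioned above.
mem_voltage = [1.50, 1.545, 1.60]            # volts
nb_speed    = [2000, 2200, 2400, 2600]       # MHz
cpu_voltage = [1.30, 1.35, 1.40]             # volts
clocking    = ["base clock", "multiplier"]   # which knob is raised

combos = list(product(mem_voltage, nb_speed, cpu_voltage, clocking))
hours_per_stability_run = 3                  # memtest plus stress, optimistic

print(len(combos))                           # 72 combinations
print(len(combos) * hours_per_stability_run) # 216 hours before any benchmarks
```

Even this small grid of settings, tested exhaustively, runs to weeks of machine time, which is exactly why reviewers cut corners.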

Monday, July 14, 2008

Reviews And Fairness Or How To Make Intel Look Good

I've had people complain that I've been too tough on Anand, but in all honesty Anandtech is not the only website playing fast and loose with reviews.

Anand has made a lot of mistakes lately that he has had to correct. But, aside from mistakes Anand clearly favors Intel. This is not hard to see. Go to the Anandtech home page and there on the left just below the word "Galleries" is a hot button to the Intel Resource Center. Go to IT Computing and the button is still there. Click on the CPU/Chipset tab at the top and not only is the button still there on the left but a new quick tab to the Intel Resource Center has been added under the section title right next to All CPU & Chipset Articles. Click the Motherboards tab and the button disappears but there is the quick tab under the section title. There are no buttons or quick tabs for AMD. In fact there are no quick tabs to any company except Intel. Clearly Intel enjoys a favored status at Anandtech.

What we've seen since C2D was released is a general shift toward benchmarks that favor Intel. In other words, instead of shooting at the same target, we have reviewers looking to see where Intel's arrows strike and then painting a bullseye that includes as many of them as possible. For example, encryption used to be used as a benchmark, but AMD did too well on it, so it was dropped and replaced with something more favorable to Intel. There has been a similar process for several other benchmarks. Of course, now it isn't just processors. Reviewers have carefully avoided comparing the performance of high end video cards on AMD and Intel processors. Reviews are typically done only on high end Intel quad cores. The claim is that this is for fairness, but it also avoids showing any advantage that AMD might have due to HyperTransport. It is a subtle but definite pattern: review sites avoid testing power draw and graphics performance with just integrated graphics, where AMD would have an advantage. They then test performance with only a midrange graphics card to avoid any FSB congestion, which again might give AMD an advantage. Then high end graphics cards are tested on Intel platforms only, which avoids showing any problems that might be particular to Intel. We are also now hearing about speed tests being done with AMD's Cool'n'Quiet turned on, which by itself is good for a 5% hit. I suppose reviewers could try to argue that this is a stock configuration, but these are the same reviewers who tout overclocking performance. So, by shifting the configuration with each test, they carefully avoid showing any of Intel's weaknesses. This is actually quite clever in terms of deception.

As you can imagine, the most fervent supporters of this system are those, like Anand Lal Shimpi, who strongly prefer Intel over AMD. I had one such Intel fan insist that Intel will show the world just how great it is when Nehalem is released. However, I have a counter prediction. I predict that we will see another round of benchmark shuffling when Nehalem is released. And I believe we will see a concerted effort not only to make Nehalem look good versus AMD's Shanghai but also to make Nehalem look good compared to Intel's current Penryn processor. It would be a disaster for reviewers to compare Nehalem and conclude that no one should buy it because Penryn is still faster... so that isn't going to happen.

An example: since AMD uses separate memory areas for each processor, it needs an OS and applications that work with NUMA. In the past, reviewers have run OSes and benchmarks alike, oblivious to whether they worked with NUMA or not. If anything seemed overly slow, they just chalked it up to AMD's smaller size, lack of money, fewer engineers, etc. Nehalem, however, also has separate memory areas and needs NUMA as well. I predict that these reviewers will suddenly become very sensitive to whether or not a given benchmark is NUMA compatible and will be quick to dismiss any benchmark that isn't. This may extend so far as to purposely run NUMA on Penryn to reduce its performance. That would easily be explained away as a necessary shift, while ignoring that it wasn't done for K8 or Barcelona. And that would be explained away as well by saying that the market wasn't ready for it when K8 was launched. That is what happened with 64-bit code, which was mostly ignored. However, if Intel had made the shift to 64 bits first, reviewers would have fallen all over themselves to do 64-bit reviews and proclaim AMD as out of date, just as they did every time Intel launched a new version of SSE.

We see this today with single-threaded code. C2D and Penryn work great with single-threaded code but have less of an advantage with multi-threaded code and no real advantage with mixed code. It is a quirk of Intel's architecture that cache sharing is much more efficient when the same code is run multiple times. If you compared multi-tasking by running a different benchmark on each core, Intel would lose its sharing advantage and have to deal with more L2 cache thrashing. Even though mixed-code tests would be closer to what people actually do with processors, reviewers avoid this type of testing like the plague. The last thing they want is to have AMD match Intel in performance under heavy load, or worse still, actually have AMD beat a higher-clocked Penryn. But Nehalem uses HyperThreading to get its performance, so I predict that reviewers will suddenly decide that single-threaded code (which they prefer today) is old fashioned, out of date, and not so important after all. They will decide that the market is now ready for multi-threading (because Intel needs it, of course).

Cache tuning is another issue. P4EE had a large cache, as did Conroe. C2D doubled the amount of cache that Yonah used, and Penryns have even more. However, reviewers carefully avoid the question of whether or not processors actually benefit from all that cache. This is because benchmark improvements due to cache tend to be paper improvements that don't show up in real application code. So it is best to avoid comparing processors of different cache sizes, to see whether benchmarks are getting artificial boosts from cache. I did have one person try to defend this by claiming that programmers would of course write code to match the cache size. That might sound good to the average person, but I've been a programmer for more than 25 years. Try to guess what would happen on a real system if you ran several applications that were all tuned to use the whole cache. Disastrous is the word that comes to mind. But you can avoid this on paper by never doing mixed testing. A more realistic test for a quad core processor is to run something like Folding@Home on one core and video encoding on another while using the remaining two to run the operating system and perhaps a game. Since the tests have to be repeatable you can't run Folding@Home itself as a benchmark, but that isn't a problem, since it is the type of processing that needs to be simulated rather than the specific code. For example, you could probably run two different levels of Prime95 tests in the background while running a game benchmark on the other two cores and still have repeatable results. And if you do run a game benchmark on all four cores, then for heaven's sake use a high end graphics card like a 9800X2 instead of an outdated 8800.
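A minimal sketch of that kind of repeatable mixed-load test follows. This is my assumption of how such a harness might look, not an established methodology; it uses the Linux-only os.sched_setaffinity, where a Windows/Prime95 setup would instead set affinity from Task Manager or `start /affinity`:

```python
import multiprocessing as mp
import os
import time

def stress_worker(core, stop):
    """Synthetic Prime95-style background load pinned to a single core."""
    os.sched_setaffinity(0, {core})       # pin this process to one core
    x = 1
    while not stop.is_set():
        x = (x * 48271) % 2147483647      # deterministic integer churn

def timed_foreground(cores, rounds=500_000):
    """Time a fixed foreground workload restricted to the remaining cores."""
    os.sched_setaffinity(0, cores)
    start = time.perf_counter()
    x = 1
    for _ in range(rounds):
        x = (x * 16807) % 2147483647
    return time.perf_counter() - start

if __name__ == "__main__":
    stop = mp.Event()
    stress_cores = (0, 1) if os.cpu_count() >= 4 else (0,)
    workers = [mp.Process(target=stress_worker, args=(c, stop))
               for c in stress_cores]
    for w in workers:
        w.start()
    bench_cores = {2, 3} if os.cpu_count() >= 4 else {os.cpu_count() - 1}
    elapsed = timed_foreground(bench_cores)
    stop.set()
    for w in workers:
        w.join()
    print(f"foreground time under background load: {elapsed:.3f} s")
```

Because the background load is synthetic and deterministic rather than Folding@Home itself, the run is repeatable from one test to the next, which is the point made above.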

Cache will be an issue for Nehalem because it not only has less than Penryn, it has less than Shanghai as well. It also loses most of its fast L2 in favor of much slower L3. My guess is that any benchmarks that are faster on Penryn due to unrealistic cache tuning will be quickly dropped. That reviews shift with the Intel winds is not hard to see. Tom's Hardware Guide went out of its way to "prove" that AMD's higher memory bandwidth wasn't an advantage and that Kentsfield's four cores were not bottlenecked by memory. But now that Nehalem has three memory channels, the advantage of more memory bandwidth is mentioned in every preview. We'll get the same thing when Intel's QuickPath is compared with AMD's HyperTransport. Reviewers will be quick to point to raw bandwidth and claim that Intel has much more. They will never mention that the standard was derated from the last generation of PCI-e, and that in practice you won't get more bandwidth than you would with HyperTransport 3.0.

I could be wrong; maybe review sites won't shift benchmarks when Nehalem appears. Maybe they will stop giving Intel an advantage. I won't hold my breath though.

Addition:

We can see where Ars Technica discovered a PCMark 2005 error. Strangely, the memory score gets faster when PCMark thinks the processor is an Intel than when it thinks it is an AMD. Clearly the bias is entirely within the software, since the processor is the same in all three tests.



I've had Intel fans claim that it doesn't matter if Anandtech cheats in Intel's favor because X-Bit Labs cheats in AMD's favor. Yet here is X-Bit's New Wolfdale Processor Stepping: Core 2 Duo E8600 review from July 28, 2008. In the overclocking section it says:

At this voltage setting our CPU worked stably at 4.57GHz frequency. It passed a one-hour OCCT stability test as well as Prime95 in Small FFTs mode. During this stress-testing maximum processor temperature didn't exceed 80°C according to the readings from built-in thermal diodes.

This sounds good, but the problem is that to properly test you have to run two separate copies of Prime95 with core affinity set so that one copy runs on each core. The article doesn't really say that they did that. There is a second problem as well, dealing with both stability and power draw testing:

We measured the system power consumption in three states. Besides our standard measurements at CPU’s default speeds in idle mode and with maximum CPU utilization created by Prime95

This is actually wrong; the Prime95 test they performed was not maximum power draw, since it was Prime95 in Small FFTs mode. That doesn't agree with Prime95 itself, which clearly states in the Torture Test options:

In-place large FFTs (maximum heat, power consumption, some RAM test)

So the Intel processors were not tested properly. Using the small FFTs tests neither maximum power draw nor maximum temperature, and therefore doesn't really test stability. If any cheating is taking place at X-Bit, it seems to be in Intel's favor.

Wednesday, March 21, 2007

Could Toms Hardware Guide Get Its Soul Back?

Tom's Hardware Guide sold its soul back in 2002 when Intel released the Northwood version of the P4. The reviews and testing at THG have been slanted in Intel's favor ever since. Strangely though, a recent article in TG Daily actually questions Intel's ethics.

There is no doubt that the "testing" at Tom's Hardware Guide has been biased since 2002. Tom's once-commendable scepticism of Intel vanished, and they've been little more than a PR site for Intel ever since. But recently the article Did Intel rig its integrated graphics demo against AMD? was published in TG Daily.

"but we were puzzled by how bad AMD fared. Our personal experiences have shown that the X1600 card is very capable of video playback, when configured correctly. The graphics guys at Tom's Hardware Guide are now testing out their X1600 cards and will try to duplicate Intel's results.

But we aren't the only ones who were puzzled at the side-by-side comparison because Intel showed the same demo to other journalists, journalists who have told us that they will also run their own tests. Scott Wasson, editor in chief of The Tech Report, emailed us saying that his lab guys are testing out the card and will publish the results in a future article."


I haven't seen Tom's Hardware Guide question Intel's integrity like this since Intel tried to push the flaky 1.13GHz PIII onto the market. Pressure from THG and other review sites caused Intel to pull the chips until stable versions were available six months later. But it's been many years now since THG showed that kind of concern for consumers. If THG is indeed finally becoming serious about fair testing and reviews again, it would require a number of changes from past practices at THG.

  1. Use the Portland Group Compiler. Using Intel's compiler to test its and its competitor's products is an obvious conflict of interest.

  2. Use proper DIMMs. THG has a habit of using high speed but high latency DIMMs in its tests. This works great for Intel, especially when the FSB is overclocked. However, it works against AMD, because the additional DIMM speed doesn't help, while with its integrated memory controller AMD can make good use of low latency. When THG does this you can typically find lower speed, lower latency DIMMs on NewEgg for the same or lower price than what THG used. Match each processor to the DIMMs that it needs rather than putting AMD at a disadvantage by using the DIMMs that Intel needs.

  3. Compare stock with stock and overclock with overclock. It is very annoying when THG publishes a general review and the only overclocked chips are Intel's. There have been many times in the past when AMD would have won nearly every benchmark but THG threw in an overclocked Intel chip. General reviews should only be stock Intel versus stock AMD, because this is the way that the vast majority of these chips will be run by consumers. Then, in a separate overclocking article, you crank the chips up and whoever wins, wins. You don't put nitrous-injected, supercharged machines up against stock street machines on the drag strip, and you shouldn't do it in a review.

  4. Properly load the cores. When reviewing multi-core processors there is a lot of room for confusion, and THG's testing methods have only added to it. Typically THG uses almost all single-threaded benchmarks and then throws in some token multi-core loading at the end. This is like comparing touring buses by having them each carry six passengers. The proper way of testing is to fully load every core and compare multi-core to multi-core. Then give a comparison with single-core loads in a separate section. This should go along with commentary on whether there is even any reason to buy a multi-core.

  5. Publish the CPU activity meters. THG should follow Tech Report's example and publish the CPU activity meters so that the readers can tell that all of the cores are fully loaded. This would make it clear at a glance just how many cores a given benchmark was using.

  6. Do both interleaved and non-interleaved testing. AMD needs NUMA for dual and quad socket systems. When the OS is non-NUMA this can impose a severe penalty on AMD processors. The best way to demonstrate the quality of the NUMA support is to do both interleaved and non-interleaved testing. If the interleaved testing is faster then the NUMA support is poor or non-existent. If the interleaved testing is slower then the OS has at least some NUMA support. As long as THG gives some initial results with both interleaved and non-interleaved then the rest of the tests can be done with whichever is better.

It is unethical to throw overclocked chips in with standard reviews because the process of overclocking is neither simple nor without risks. It is entirely possible that a less experienced user could destroy the processor trying to duplicate THG's results. Overclocking is no different from putting street mods on your car to boost horsepower. These can be done safely and give a boost in performance, but these can also blow the engine. Overclocking should be done in a separate review. This would clearly divide the results into different categories for typical users and overclocking enthusiasts.

Having multi-core reviews filled mostly with single-threaded code is clearly unethical. Although some have tried to argue that THG is merely using benchmarks that reflect typical applications, even this argument is bogus. If typical applications cannot make use of the extra cores, then rather than hiding this fact by using single-threaded code, the conclusions at the end (and probably the first-page remarks) should clearly state that there is no reason to buy a multi-core. However, I have yet to see this conclusion in a THG review. You cannot insist that a multi-core CPU is worth buying while simultaneously under-testing the chips. If a given dual core or quad core processor is worth buying, then it will perform better than the competition when all cores are loaded.

Let's hope that THG will change its behavior and start doing real reviews that properly and fairly test the chips. Let's hope that THG starts caring about buyers of computer hardware again rather than about its best sources of advertising dollars. A new Tom's Hardware Guide would be a great benefit to the computing community; we can hope.

Addendum: I made these same suggestions to Wolfgang Gruener after he suggested that the testing at Tom's Hardware was unbiased (TG Daily -- Opinion: Is Intel copying AMD?). Let's see what happens.

Monday, September 25, 2006

Anandtech Melts Down

Anandtech used to be a good and honest website. However, since 2000 the opinions of Anand Lal Shimpi have changed nearly 180 degrees. This change in viewpoint has been accompanied by a similar change in quality and integrity.

It is difficult to place any credibility in a website that says two completely different things. For example, Tom Pabst at Tom's Hardware Guide said in 2000, in Intel Admits Problems With PIII:

  • On Intel's VC820 platform Sysmark 2000 crashed consistently. I was unable to finish even one run of Sysmark with this CPU and I certainly tried about 20 times. As soon as I plugged a Pentium III 1 GHz into the system the benchmark would run all the way through.

  • The most consistent error I got however was with my timed Linux kernel compilation. Even on the VC820 the Pentium III 1.13 GHz was utterly unable to finish the compilation even once. All other CPUs I used finished the compilation without the slightest flaw.

  • Interestingly, stress tests as Prime95 or CPUburn under Windows98 would not get my 1.13 GHz processor to fail on the VC820.

Today, however, in its Green Machine test, THG says:

This also lets us to limit the scope of this article to measuring power consumption at maximum and minimum CPU load, using our Prime95 torture test.

This is why Tom's Hardware Guide has little credibility today. Unfortunately, Anand has done the same thing. For example, Anand also used to criticize Intel over its paper launches:

Prior to Intel’s downward spiral, AMD would be the one we would accuse of “paper launching” processors, since you could never find a newly “released” AMD CPU until after its launch. Intel’s policy was exactly the opposite, upon the introduction of a new CPU, systems based on that CPU would be available the very same day.

Since the release of AMD’s Athlon, things have changed. Slowly but surely the roles of the two companies have reversed, now, Intel is the one being accused of “paper launching” processors while AMD CPUs are readily available and definitely affordable. These “paper launches” were at their worst with the release of the 1GHz Pentium III (March 2000) before the 850, 866 and 933MHz Pentium IIIs in an attempt to compete with AMD’s 1GHz Athlon that was released just days before. What began to make the community characterize Intel’s CPU releases as “paper launches” was the fact that you couldn’t actually go out and buy a 1GHz Pentium III whereas, by the end of the month, the Athlon was already available in speeds from 500MHz up to 1GHz in 50MHz increments.


Yet Anand's criticism of this had entirely vanished when Intel sent out its P4 EE for review in 2003. The P4 EE was sent out specifically to compete with AMD's FX in reviews, but Intel didn't actually deliver it until months later in 2004, while the FX was available shortly after launch. Today, AMD's chips are always available the day of release, while Intel's don't show up for as much as three months.

Anand's point of view has completely switched. Today he no longer criticizes Intel for delivering chips late after "release". He began being biased against AMD some time in 2002, when he began complaining about the late release of K8. When the Athlon 64 was released on September 23, 2003, he said:

Fast forward to almost two years and the Hammer is just finally being released on the desktop as the Athlon 64 and the Athlon 64 FX. AMD has lost a lot of face in the community and in the industry as a whole, but can the 64 elevate them back to a position of leadership?

AMD has also priced the Athlon 64 and Athlon 64 FX very much like the Pentium 4s they compete with, which is a mistake for a company that has lost so much credibility. AMD needed to significantly undercut Intel (but not as much as they did with the Athlon XP) in order to offer users a compelling reason to switch from Intel. However, given the incredible costs of production (SOI wafers are more expensive as well) and AMD's financial status, AMD had very little option with the pricing of their new chips.


What Anand is complaining about is that AMD originally had K8 listed on its unofficial roadmaps as being released in the first half of 2002. However, just one month later K8 had moved to the second half of 2002. It was actually released in Q2 2003. These roadmaps are not official documents, so complaining about changes seems a bit silly. It is remarkable that these complaints would be made at all, because both Tom and Anand had taken the opposite view of Intel in 2000. Both said that it would have been better for Intel to delay the release of the PIII 1.13GHz chip rather than ship a defective product. Yet when AMD delayed the launch of K8 to ensure quality and availability on the new SOI process, Anand was critical.

Anand was also critical because AMD had planned to release the desktop Clawhammer first and the server Sledgehammer later. Anand didn't like it when AMD switched and released the server version first. This is why he makes a point of saying "on the desktop" when Opteron had already been out for months. However, this contradicts what he said on May 14, 2002, after the release of the Athlon 4:

The MP server market is a very lucrative business for AMD to get into since the profit margins are so high, just look at the profit margins off of Intel's Pentium II Xeon and Pentium III Xeon parts to see the potential for AMD there. However the Athlon 4 will only be a stepping stone for AMD into this market; AMD's 64-bit solutions will truly be the ones to lead the company in this area.

His criticism of the pricing didn't make any sense either. Just a few years earlier he was concerned that AMD would be hurt by pricing too low, as he said on October 17, 2000:

we were afraid at the end of 1999 that Intel would begin to compete with the Athlon in a price war, something which AMD, being a smaller company than Intel would have some serious problems with.

His criticism is even more ludicrous considering that AMD had profitability problems all through 2002 and into the beginning of 2003. Yet, presumably, he wanted AMD to sell its best and still low-volume K8 at a bargain price. K7s were still AMD's main chips even two quarters after this article was written.

In contrast, by 2003 his view of Intel had become much more optimistic, and it didn't change even when that optimism was unwarranted. For example, both he and Tom believed that Itanium would become a desktop processor and compete directly with Opteron. There was no criticism when this never materialized. There was no criticism when Tejas was canceled. No criticism when Whitefield was canceled.

Anand was optimistic about Prescott. On February 1, 2004, he said:

Prescott becomes interesting after 3.6GHz; in other words, after it has completely left Northwood’s clock speeds behind.

Yet, he took it in stride when Prescott topped out at 3.8 GHz.

I have to admit that I find this one particularly interesting because, nearly a year earlier in 2003, I had said that I didn't believe Intel could get another generation out of the P4. At the time, everyone I knew of was saying the same thing as Anand: that Prescott would be great, that it would clock as high as 5.0 GHz and put Intel back into the lead. I don't recall anyone besides me who doubted Prescott before its release. My crystal ball has been pretty good since 2003, and none of the big websites has a track record anywhere near mine. That often amazes me, because the big websites should have much more information than I do. I don't know what the reason would be unless a pervasive bias leads them to consistently overestimate Intel.

Anand Lal Shimpi reached his personal low when he put his name on Spring IDF 2006 Conroe Preview: Intel Regains the Performance Crown. In this article he tosses away whatever ethics he had remaining and essentially becomes a spokesperson for Intel. Both the Intel Conroe system and the AMD FX-60 system were built by Intel. Intel would not allow Anand to look inside the cases or even at the BIOS settings. They would not allow him to bring any of his own benchmarks and only let him use what they had installed. Yet, based on this entirely Intel-controlled testing, Anand nevertheless proclaims, "Intel Regains the Performance Crown". Gone is Anand's once-strong criticism of Intel processors that were not available four months after review. Instead Anand cheerfully comments, "keep in mind that we are over six months away from the actual launch of Conroe, performance can go up from where it is today."

Anandtech's credibility as a whole has continued to deteriorate since 2002. Today, they too use THG's highly unethical technique of comparing overclocked Intel chips against stock AMD chips, as they do in Intel Core 2 Duo E6300 & E6400: Tremendous Value Through Overclocking. A fair comparison would have included an overclocked X2 3800+ and 4200+, but this was not done.

However, the latest sad chapter in Anandtech's increasing bias and incompetence was this comparison of Woodcrest, Opteron, and Sparc.

The Intel Woodcrest system used the excellent Intel Server Board S5000. This motherboard uses the robust Intel 5000 dual-bus chipset, which gives each processor its own front-side bus to the northbridge and thus excellent memory bandwidth.

The AMD Opteron system, however, used the MSI K8N Master2-FAR. This choice of motherboard for AMD shows either extreme incompetence or an outright attempt to cheat in Intel's favor. From the time of Opteron's release in 2003, Opterons have always had an independent memory bus for each processor. For Intel, however, this is a new thing, available only since late 2005. It makes sense that Woodcrest would use Intel's best dual-bus chipset. However, the motherboard chosen for Opteron was not (and still isn't) approved by AMD for use in servers. Although this board has two Opteron sockets, it has only a single memory bus. In other words, Intel's chips got the newest and best Intel dual-bus motherboard, whereas AMD's chips, which have always had dual buses, were put into a stripped-down, single-bus motherboard. This forced the two Opterons to share the single memory bus and greatly reduced the speed of the second processor. This comparison pretty much stripped Anandtech of whatever shreds of credibility they had left after Anand's participation in Intel's promotion of Conroe.

Friday, September 22, 2006

Tom's Hardware Guide Sells Its Soul

The professionalism and objectivity of a website is reflected in its general tone. If you find that a website tends to treat various manufacturers equally then it is reasonable to assume that their testing is equal as well. However, if a website always seems to have a positive outlook for one company regardless of their actual products then that website is probably also putting the same positive spin on testing.

It is always a surprise to find that a once-respected source has changed. I used to think very highly of Compute! magazine. However, Compute! was bought by ABC Publishing, and its reviews changed to reflect its desire for advertising revenue. This became very clear when Compute! reviewed a new word processing and layout application called "Outrageous Pages". Their review was positive and everything seemed to be fine. However, Info magazine reviewed the same software and had a completely different view: they said the software was too slow and hard to work with, and they couldn't recommend it. Which magazine was being honest was revealed when the software's release was canceled. Somehow, Compute! had stopped being a good, objective source of information and had lost its credibility. Unfortunately, today, this is also true of Tom's Hardware Guide.

The differences are not hard to see. There was no sign of bias in Tom's 1999 review of the Athlon: New Athlon Processor

There is no sign of bias in Tom's comments about the Intel 1.13 GHz PIII.

The very worst thing in terms of prestige damage happened back in Spring 2000, when AMD was the first x86-processor maker to introduce a CPU that runs at 1 GHz = 1000 MHz clock speed. Big Chipzilla countered with the release of the Pentium III at 1 GHz two days later, but this CPU was so unavailable that not even the press was equipped with any samples. Today, some four months later, the Giga-Pentium III is still hardly available anywhere

While the normal users out there might not know about this, people in the hardware reviewing scene are well aware of the fact that AMD has shipped their 1.1 GHz Thunderbird samples to publications already weeks ago, while Intel was just able to get the rare samples of the Pentium III 1.133 GHz to the reviewers in the second half of last week. AMD is planning to launch their Thunderbird-Athlon 1.1 GHz in late August, giving us the chance to review the sample with ample time. Intel however shipped out their samples in the last minute, which proves who of the two companies is really able to actually produce 'Beyond-Giga-Processors' right now.


Apparently, Tom was so fed up with Intel "releases" with no chips available that he titled the review Intel's Next Paper Release: The Pentium III at 1133 MHz

The contrast today is obvious. For the last three years Intel has been doing paper releases while AMD's chips are available the day of release. Tom's criticism of Intel's paper releases, however, has vanished.

Another THG flaw is positional marketing. The concept of association is well known in marketing, and surely it is known to the people at THG. It is interesting that they always seem to make sure an Intel chip is on top regardless of what is being compared. For example, consider this 2002 review of a Celeron: The New Generation

The presence of the Athlon XP 1600, 1800, and 1900 is reasonable, since by this time AMD had dropped the Duron line and these were comparable in price to the Celerons. To try to make Intel look better, however, THG first cheats by overclocking the Celeron from 2.0 GHz to 3.0 GHz. When this still isn't enough to move the Celeron to the top of the chart, THG cheats again by including the Pentium 4 2.26 GHz. The presence of the P4 in this chart makes no sense because it is much more expensive than the other chips. The only reason it was included seems to be to prevent AMD from having the top spot. By using both the severely overclocked Celeron and the Pentium 4, THG ensured that Intel had the top spot in most of the tests. This distracts from the real purpose: comparing a standard Celeron at 1.7, 1.8, and 2.0 GHz with the comparably priced Athlon XP 1600, 1800, and 1900 from AMD. The ethical way to handle overclocking is in a separate article. The bottom line is that if you have chips from both AMD and Intel, either both must be overclocked or both run at stock clocks so that you get a genuine comparison. Overclocking only one creates a false association.

We know that THG is perfectly capable of following these ethical guidelines when it wants to as it does in these two examples:

This is a proper overclocking comparison, only Intel's chips are shown.
Pentium D

This is a proper competitive comparison, no chips are overclocked.
Athlon 64 FX

However, THG cannot seem to stay on an ethical track. Here an overclocked P4EE 955 is compared with AMD chips, December 2005. Extreme Edition

The question at this point is whether or not THG shows a general bias and lack of professional reviews after December 2005.

Here we have a good review of the AMD X2 in May 2005. X2 is compared with the AMD 4000+ and Intel 840 and 660 processors. All are stock speeds. AMD X2

Then in January 2006 we have the FX-60 review, another confusing mass of 28 different processors, including overclocked chips set against the FX-60. In the DivX test, for example, the overclocked chips do manage to steal the top position from the FX-60. FX 60

The Pentium D 900 review contains no OC's but is another jumble of 22 different processors, January 2006. Pentium D 900

This review of Core Duo is much better. It compares with a Pentium M and a Turion, January 2006. Core Duo

The AM2 review is straightforward, comparing AM2 with Socket 939. Socket AM2

The P4 EE 3.73 review is proper because it overclocks both; however, it is another confusing jumble of 27 processors. Extreme Edition 965

Then we get to the Core 2 Duo review, and THG's ethics plunge again: overclocked Core 2 Duos are put up against stock AMD chips. This isn't as badly cluttered as the previous articles, but the article would make a lot more sense if the overclocks were dropped and all of the lower-clocked comparison processors eliminated. We don't really need the 4800+, 840, and FX-60 when there are a 5000+, 960, and FX-62. Core 2 Duo

It is clear that THG is cheating and knows that it is cheating because it never puts OC'ed AMD chips up against stock Intel chips in its reviews but frequently puts OC'ed Intel chips up against stock AMD chips. There are some troubling lapses in technical knowledge such as:

There is the issue of memory coherency, but e.g. the Opteron is smart enough to deal with it at up to four processors.

It really seems that, nearly three years after Opteron's release, Mr. Schmid should know that Opterons can handle 8-way configurations.

In spite of THG's problems with technical matters and its obvious bias toward Intel, many people will still suggest that THG's testing can be relied on. So, let's look at the actual tests. Let's look again at the recent overclocking comparison between the X6800 and the FX-62: Overclocked X6800 and FX-62 In this review, for a change, THG puts the overclocks in a separate article, as it should. This suggests that THG will give an honest and fair comparison. However, let's look at the details.

I'm not so sure that the top clock for AMD's FX-62 is fair. THG claimed they could only reach 3048 MHz with an HTT of 254 MHz, whereas Neoseeker's FX-62 review says: I was quite pleased by reaching 3.1GHz air cooled; and the 345MHz HT speed was very impressive as well.
Given THG's bias, it certainly raises the question of whether they put as much effort into overclocking the AMD chip.

The memory speed is also questionable. The Intel memory is clocked at 555 MHz whereas the AMD memory is clocked at only 508 MHz. It is not clear why THG didn't use a divider of 11 instead of 12 and clock the AMD memory to 554 MHz (nearly identical to Intel's). Even with the handicap of slower DIMMs, AMD still manages to outdo Intel:
This increases memory throughput from 9.2 GB/s to 10.7 GB/s - a noticeable improvement over what Intel can deliver. This is where the built-in memory controller really pays big dividends for AMD.
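The divider arithmetic above can be checked with a quick sketch. The formula used here, memory clock = 2 × core clock / divider, is my own inference from THG's figures (a divider of 12 at a 3048 MHz core clock yielding 508 MHz); THG does not state it:

```python
# Sketch of the memory divider arithmetic discussed above.
# Assumption (inferred from THG's numbers, not stated by them):
#   DIMM clock = 2 * CPU core clock / divider.

def mem_clock(core_mhz: float, divider: int) -> float:
    """Return the resulting DIMM clock in MHz."""
    return 2 * core_mhz / divider

core = 3048.0  # FX-62 overclock reported by THG, in MHz

print(round(mem_clock(core, 12)))  # 508 -- the divider THG used
print(round(mem_clock(core, 11)))  # 554 -- would nearly match Intel's 555 MHz
```

Under that assumption, a divider of 11 would have put the two platforms within 1 MHz of each other.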

In the temperature and load tests, it is not stated how THG loaded the systems. I've found that the term "under high load" can vary quite a bit from tester to tester. Since tests have been set up in ways that understate the power draw of the hungrier system, we need to know the actual procedure before the numbers have any credibility.

Now we'll look at the benchmarks themselves. A benchmark is only useful if it shows a significant spread among processors of varying speeds, and if faster processors always beat slower processors of the same model. Benchmarks can also be skewed by a large cache, but this is not always easy to detect.
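These two validity criteria can be expressed as simple checks. The scores below are invented for illustration; they are not THG's data:

```python
# Hypothetical sketch of the two sanity checks described above:
# 1) within one CPU model, a higher clock should never yield a lower score;
# 2) the score spread should be roughly proportional to the clock spread.
# All numbers here are invented for illustration.

def monotonic(results):
    """results: list of (clock_mhz, score) pairs for chips of the same model."""
    ordered = sorted(results)
    return all(a[1] <= b[1] for a, b in zip(ordered, ordered[1:]))

def scaling_ratio(results):
    """Ratio of relative score spread to relative clock spread (~1.0 is healthy)."""
    ordered = sorted(results)
    (c_lo, s_lo), (c_hi, s_hi) = ordered[0], ordered[-1]
    return (s_hi / s_lo - 1) / (c_hi / c_lo - 1)

clean   = [(2400, 96.0), (2600, 104.0)]   # passes both checks
suspect = [(2400, 101.0), (2600, 99.0)]   # lower-clocked chip wins: faulty data

print(monotonic(clean), monotonic(suspect))  # True False
print(round(scaling_ratio(clean), 2))        # 1.0
```

A result set that fails the first check, or whose scaling ratio differs wildly between vendors, points to a faulty benchmark or sloppy testing, which is exactly the pattern flagged in the results below.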

Call of Duty 2 – We see an anomaly where a 4800+ beats an FX-60. Both are dual core, both are Socket 939, and both have 2 x 1MB L2 cache. However, the 4800+ is clocked at 2.4 GHz while the FX-60 is clocked at 2.6 GHz. This article doesn't say which cores these two chips use; however, just one month earlier both were listed as Toledo cores in another review, so presumably they still are here. Since an identical but lower-clocked chip cannot truly be faster, we have to conclude that this is a symptom of a faulty benchmark, improper testing, or sloppy test records. Any of these would invalidate the tests; however, if the rest of the data is good, we can assume it is not the benchmark itself. We can also see that the score spread is not proportionate for either AMD or Intel. This makes the benchmark itself faulty regardless of procedure.

Quake 4 – same anomaly. Disproportionate spread for AMD and Intel. This benchmark is faulty.

Unreal Tournament 2004 – same anomaly. The scores also show other anomalies on the midrange AMD scores. The top and bottom scores are good and the scores for Intel are good. Therefore, this is indicative of sloppy testing but the benchmark is good.

Serious Sam – same anomaly. Disproportionate spread for AMD and Intel. This benchmark is faulty.
Fear min. – same anomaly. Disproportionate spread for AMD and Intel. This benchmark is faulty.
Fear av. - same anomaly. Disproportionate spread for AMD and Intel. This benchmark is faulty.

Xvid 1.1.0 – same anomaly.
Divx 6.22 – the data looks good.
Main Concept H.264 Encoder – the data looks good.
Windows Media Encoder – the data looks good.
Pinnacle Studio DV to Mpeg2 – Disproportionate spread for AMD and Intel. This benchmark is faulty.

Premier Pro 2.0 – same anomaly
Clone DVD 2.8 - the data looks good.
Lame 3.97 - the data looks good.
Ogg Vorbis - the data looks good.
Windows Media Encoder 9 - the data looks good.
iTunes - the data looks good.
WinRar 3.60 – same anomaly
Photoshop CS 2 rendering 5 pictures – the middle AMD scores are anomalous. However, the highest and lowest scores appear to be good so the anomalies are probably due to sloppy testing.

Photoshop CS 2 converting 150 pictures - the data looks good.
3D Studio Max - the data looks good.
MS Word 2003 pdf – same anomaly
MS PowerPoint pdf – same anomaly
AVG anti Virus - the data looks good.
Multitasking 1 – same anomaly. However, the use of AVG is also improper, as this software is severely I/O-bandwidth restricted and will not properly load the second core. This benchmark was poorly designed and is not useful.

Multitasking 2 – Given that there are anomalies for both Intel and AMD processors, this benchmark appears to have been poorly designed and executed. It is not useful.

Sandra arithmetic ALU – the data looks good.
Sandra arithmetic MFLOPS – the Pentium EE 965 score shows an anomalous 17% increase. This could be due to its additional memory bandwidth. The rest of the scores seem good.
Sandra Multimedia Integer – the C2D scores are amazingly high.
Sandra Multimedia FP – the data looks good.

It is not clear why the Multimedia Integer scores are so high. It can't be due to the 4-wide instruction issue, or it would have appeared in the ALU test. It can't be due to the faster SSE, because this is an integer test. That really leaves only the faster cache bus speed as an explanation. Unfortunately, if this benchmark is faster because of the cache bus, then the benchmark is useless. Also, real benchmarks show clustering of scores among similar workloads, and the other benchmarks for MP3, MPEG, and DVD conversion all show similar patterns for C2D. The Multimedia Integer benchmark therefore has to be considered faulty.

We'll skip the PC and 3D Mark tests.

The conclusions are not accurate. If we take the test data at face value, then an increase of 16.8% for the X6800 would actually be very poor compared to the FX's 7.2%. If the data were correct, the FX would be showing 100% scaling while the X6800 would be showing only 67%, and the analysis would be atrocious. For example, XviD is listed as a 20.3% increase when in reality the increase is nearly 25%. However, some of the benchmark scores, for things like Call of Duty, are faulty and should not be used. If you drop the bad benchmarks, then the X6800 shows at least 99% scaling.

Tom's Hardware Guide used to be a website with quality and integrity. Today, the testing is sloppy, the conclusions may not even match their own data, and they use benchmarks that are clearly faulty. Not all benchmarks are good, and the benchmarks themselves would have to be tested to determine their quality. When THG is willing to use clearly faulty benchmarks, we can have no confidence that they've done any such vetting to sort the good from the bad. Finally, THG's clear bias toward Intel greatly reduces the credibility of the website. Tom's Hardware Guide today is merely a shadow of the ethics and quality it once embodied.