Sunday, October 22, 2006

Tigerton or Kittenton? Memory Amnesia.

The glowing reports about Intel's roadmap for 2007 continue. Central to this enthusiasm is the native quad core Tigerton, which is supposed to be the platform that will keep Intel ahead of AMD's K8L technology. However, in evaluating this platform most people seem to have forgotten the memory bandwidth lessons from P4 and Athlon 64. This amnesia about memory requirements has led to exaggerated expectations of performance for Tigerton.



The diagram above for Tigerton looks impressive. Surely four buses with an aggregate bandwidth of 34 GB/sec would be enough. Or would it? Just think back to 2002 and 2003 when P4 was so strong against K7. I remember how Intel enthusiasts bragged about the 800MHz FSB for P4, which was well ahead of what K7 had. Curiously, while these same Intel enthusiasts thought 800MHz was necessary for one core, they now seem to be insisting that 1066MHz is enough for four cores. Let's consider this more carefully.

We know that when Intel increased its FSB from 400MHz to 800MHz, P4 didn't use all of the bandwidth. We also know that Athlon 64 seemed to get along okay on socket 754, which only contained one memory channel. One memory channel on socket 754 was basically the same as the 400MHz FSB for P4. It is difficult to imagine that anyone today would suggest that this was more bandwidth than these processors needed. I think this level for one core is clearly the most reasonable starting point. Let's see where this takes us.

If we need one memory channel with DDR 400 for one Athlon 64 core then the dual memory channels with socket 939 and 940 would seem to be enough for X2. Likewise, an 800MHz FSB with two channels of DDR 400 or DDR2 400 should be enough for Intel's dual core Preslers, Conroes, and Woodcrests. If this is the standard then both AMD and Intel clearly have more bandwidth than necessary for dual cores. AMD's Integrated Memory Controller (IMC) allows expansion beyond two sockets and Intel's 5000 series chipset with two buses again allows expansion to two sockets with similar memory bandwidth. Dual core today is well in hand. Let's look at more cores.

Intel will soon release the quad core Kentsfield and AMD will release the quad core Barcelona in Q2 07. The above diagram is for Tigerton, which most likely would arrive in Q4 07 but could arrive in Q3 07. Although Intel does say Q3 07, the chips may not arrive in volume for another two months if recent releases are any indication. At any rate, we know that dual core works well but will quad core work the same? If an 800MHz FSB is required for dual core then reasonably a 1600MHz FSB would be needed for Kentsfield or Tigerton. We know that Kentsfield will have a 1333MHz FSB at most and may only have 1066MHz. An FSB of 1333MHz would only be equivalent to a Conroe with a 667MHz FSB. This would be equivalent to a 333MHz FSB for P4. We can probably skip comparing with a 333MHz FSB single core P4 because it is possible that the longer pipeline changed the bandwidth requirements. Since Core 2 Duo (C2D) is not available in single core, it seems reasonable to compare with the single core Athlon 64, which also has a shorter pipeline. For Athlon 64 this would be the same as using the single channel socket 754 and using DDR 333.

It does seem curious that no one uses a Conroe with the FSB set at 667MHz since this is what is being claimed is adequate for Kentsfield. However, the question of whether or not Kentsfield can do 1333MHz on its FSB is moot since the above diagram for Tigerton only shows an aggregate of 34 GB/sec for four buses. 34 GB/sec for four buses is only 1066MHz per bus. This is only equivalent to a Conroe with a 533MHz FSB or a P4 with a 266MHz FSB. The equivalent for AMD would be an X2 or Athlon 64 with DDR 266 on socket 754. To make things worse, the memory speed listed in the diagram is incorrect. To keep up with four FSBs running at 1066MHz, Intel will need DDR2 (or DDR3) running at 1066MHz as well. The only way that the above diagram would work with 667MHz memory would be if the number of memory buses is wrong and it actually has six memory buses instead. However, even if the memory is able to keep up, the FSB speed is inadequate for four sockets.
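The memory side of this argument can be checked with a rough calculation. The sketch below assumes that every FSB and every memory channel is 64 bits (8 bytes) wide and ignores any protocol overhead; the function and variable names are made up for illustration and do not come from Intel's diagram.

```python
BYTES_PER_TRANSFER = 8  # assumption: 64-bit FSB and 64-bit memory channel

def bandwidth_mb_s(rate_mhz, links):
    """Peak bandwidth in MB/sec for `links` buses or channels at `rate_mhz`."""
    return rate_mhz * BYTES_PER_TRANSFER * links

fsb_total = bandwidth_mb_s(1066, 4)         # four front side buses
per_channel_667 = bandwidth_mb_s(667, 1)    # one channel of 667MHz memory
per_channel_1066 = bandwidth_mb_s(1066, 1)  # one channel of 1066MHz memory

# How many memory channels it takes to feed all four buses
channels_needed_667 = fsb_total / per_channel_667
channels_needed_1066 = fsb_total / per_channel_1066

print(f"FSB aggregate: {fsb_total / 1000:.1f} GB/sec")
print(f"Channels needed at 667MHz: {channels_needed_667:.1f}")
print(f"Channels needed at 1066MHz: {channels_needed_1066:.1f}")
```

Under these assumptions four channels only keep up at 1066MHz, while 667MHz memory needs a little over six channels, which is roughly the six-bus figure mentioned above.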

As far as I can tell the only reason for this recent outbreak of amnesia for memory requirements is the sloppy testing over at Tom's Hardware Guide. Apparently, THG is on some kind of crusade to prove that Intel has the best hardware even if THG has to fake the tests to prove it. It doesn't seem to do any good to criticize the testing at THG because the excuses from Intel enthusiasts seem to grow exponentially in proportion to how bad the testing is. However, no matter how sloppy or unprofessional or biased the testing is at THG there really is no escaping common sense. A Tigerton with four buses of 34 GB/sec aggregate bandwidth is only equivalent to a Conroe with a 533MHz FSB. I'm curious why THG doesn't run a Woodcrest test with the FSB set to 533MHz to show what the expected performance with Tigerton will be. Tigerton should be double a Woodcrest with 533MHz FSB. The comparative proof for Kentsfield is easy as well since you only need to set the FSB to half the FSB speed for Kentsfield to get the same memory bandwidth.

The notion that a Tigerton can run four cores with an FSB of 1066MHz is beyond ridiculous. This would only be 533MHz for Conroe and equivalent to DDR 266 for both X2 and Athlon 64 on socket 754. Even the old Barton Athlon XP 3200+ used an FSB of 400. To get as low as Tigerton we would have to go all the way back to the Athlon 2600+ which used a 266MHz FSB. None of the P4s were ever this low as even the original 1.4GHz Willamettes had 400MHz FSBs. We would have to go back to the PIII to get down to 266MHz on the FSB.

It is clear that memory bandwidth is going to be a serious problem for Intel and is going to set Tigerton's performance back badly. The question then is how well AMD stacks up in the memory bandwidth race. The socket 754 Athlon 64 used one channel of DDR400 memory. This was doubled to two channels with socket 939 and 940 so this would be the same equivalent bandwidth for X2. Presumably, Barcelona would need double this or 800MHz memory. This shouldn't be a problem as even the memory controller on Revision F is about to handle DDR2 800. This would be equivalent to 1600MHz and well ahead of the 1066MHz for Tigerton. In fact, a little math shows that this is fully 50% more bandwidth than Tigerton will have. This means that in memory intensive operations Tigerton will bog down to 33% slower than Barcelona or you could also say that Barcelona will be 50% faster.
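The "little math" here is just two ratios of the same pair of numbers. A minimal sketch, assuming performance in a memory bound workload scales directly with the equivalent bus rate (a simplification), with illustrative names:

```python
def relative_gap(faster, slower):
    """Return (percent the faster rate is ahead, percent the slower is behind)."""
    ahead = (faster / slower - 1.0) * 100.0
    behind = (1.0 - slower / faster) * 100.0
    return ahead, behind

barcelona_equiv_mhz = 1600  # two channels of DDR2 800, as described above
tigerton_fsb_mhz = 1066     # per-bus FSB rate implied by the diagram

ahead, behind = relative_gap(barcelona_equiv_mhz, tigerton_fsb_mhz)
print(f"Barcelona ahead by about {ahead:.0f}%")
print(f"Tigerton behind by about {behind:.0f}%")
```

The two percentages differ only because each one uses a different baseline: 50% is measured against Tigerton and 33% against Barcelona.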

Looking at the Caneland chipset I can say that moving to quad buses is a step up from the limited dual bus 5000 chipset. I can also say that the 64 MB snoop filter cache is a great idea. This is basically a map that prevents cache coherency traffic from being retransmitted across to other buses unnecessarily. However, this is bad because it indicates that the FSB will still carry the cache coherency messages. The snoop filter won't help when the cache coherency messages must be transmitted to another bus. It is not nearly as good as AMD's design where none of the cache coherency messages affect memory bandwidth. It is also disappointing to see that Intel will still be using the outdated MESI protocol rather than a more advanced protocol like MOESI as AMD uses. This too would reduce cache coherency saves and loads. When another chip needs to access a memory location that is in another chip's cache, MESI requires that the chip that has the data save it to memory whereas MOESI can allow sharing without having to save.

The Tigerton server chip is an upgrade to Woodcrest and will at least give Intel something to replace its aging Cedar Mill Xeon chips and finally let C2D go beyond 2-way. However, because of memory bandwidth constraints, Tigerton will be proportionately slower than Woodcrest. Secondly, the octopus-like design of the Caneland Northbridge with four FSBs and four memory buses is not going to be compact or inexpensive. Finally, this limited four bus design will not allow Tigerton to move up to bigger 8-way and 16-way server designs as AMD will have. In conclusion, it appears that although Tigerton looks good, it will be held back by the memory bandwidth of the Caneland platform, which is as low as what was used by PIII and Athlon 2600+. This will cause Tigerton to get beaten badly in memory intensive applications, making it more of a kitten than a tiger. Clearly, memory bandwidth will be Intel's biggest problem into 2009.

55 comments:

ashenman said...

It doesn't matter if the processor can do 1333 when the rest of the hardware can't. Kinda like trying to run an athlon xp 3200 on an nforce1 motherboard. You won't get to use a 400 mhz fsb, and thus won't actually have a 3200. Maybe it's just an error on the slide, or maybe it's something Intel didn't think about before the presentation was compiled and has since remedied.

I don't think your method of comparing these really works, scientia. AMD basically has a fsb for the memory, and an fsb for the rest of the system. Intel has only 1. So not only does all the memory bandwidth have to work over this constrained bus, but so do all the other tasks that are sent to the processor. This makes it really hard to compare a socket 754 processor with a pentium 4 and so on. But I guess you're just being a bit conservative since AMD basically has more bus than necessary for its processors.

I'm just saying, that it's likely that as better hardware comes along, people will write code that better utilizes it. However, this is just as true for Intel, if they're constrained by memory access. I guess it was kinda a stupid point, but one I didn't really think that much about making well.

He, as in scientia, because that's the other person who is a part of this debate, since it seems to be just us three and erlindo now.

ashenman said...

*edit* I guess he didn't do calculations, but he pointed out the problem of considering it.

Scientia from AMDZone said...

THG has tried to claim that it doesn't make any difference if you run Kentsfield with a 1066MHz FSB or 1333MHz. However, if this were actually true then it should have been possible for them to run Conroe with the same clock and 533MHz FSB to show that it had exactly half the performance. It is this type of in-depth testing that I think will be avoided during 2007. I just do not understand how anyone can argue that Conroe would run just as fast with a 533MHz FSB or perhaps 667MHz if Kentsfield is truly 1333MHz.

Yes, it is true that the I/O and cache coherency traffic also go over the bus for Intel. But, they do now with Woodcrest and they did with P4. Again, how can someone seriously argue that P4 or Athlon 64 would have worked the same with a 266 (or 333MHz) FSB?

Anonymous said...

Once again, your analysis of Tigerton includes other stuff beyond the technical. Why did you bother writing about those if you don't allow commenting on it?

I did not think of running with 533. Creative thinking;) They were trying to show that though Xeon has more, it does not necessarily hurt the desktop side.

Scientia from AMDZone said...

All of the past benchmarking for P4 and Athlon 64 and Dothan indicates that more bandwidth is faster. One of the reasons that Conroe is faster than Dothan is that Dothan is limited to DDR 333. However, this would be the same memory bandwidth as Kentsfield now has. But, according to the Tigerton diagram a single core would only have 266MHz. This cannot be as fast.

Scientia from AMDZone said...

I'm saying that you could project the performance for Tigerton today if you ran Woodcrest with a 533MHz FSB.

There is still a big question of whether or not the memory itself could keep up but assuming it did this would give you a good estimate.

Anonymous said...

THG is an enthusiast site. I don't believe most of us really care about Xeons, Itanium, servers, and nor do they;) I do not know how you got to your Tigerton numbers[not saying they're wrong]..But I can at least conclude that 12.8/8=1.6 and 34/16=2.125.

I'm not saying that memory bandwidth does not matter, but the reason that we're still arguing about it is because you said Kentsfield choked on DivX and others because of its supposed lack of bandwidth:)

Scientia from AMDZone said...

Your math has nothing to do with my article. Your calculations simply compare Caneland with the current Bensley platform.

This only shows that Bensley is worse for dual socket quad core than Caneland is for quad socket quad core. This does not show that Caneland allows Tigerton to be any better with a 1066MHz FSB than Woodcrest with a 533MHz FSB. Nor does it show that Tigerton would be competitive with Barcelona.

Azary Omega said...

Good analysis, Scientia.
However, you forgot to take into account the performance of each core. For example, each core in a 2.66GHz Conroe will perform like a 6.1GHz Pentium 4. But wait, haven't you calculated that this Pentium 4 will need to run (in comparison to this yet to come server Quady) on a 266MHz FSB?

Scientia from AMDZone said...

Yes, I would imagine that a Tigerton would need more bandwidth than a Yonah or P4 to keep up with the better SSE. I just think it is very dishonest for someone to claim that 1066MHz is fine when it is clear that this would only be 266 per core. Even in late 2002 the processors had more bandwidth than that. I would like to see Conroe benched at 667MHz and Woodcrest at 533MHz and see if they can maintain the performance. My guess is that they would drop off sharply.

ashenman said...

Scientia, I know you like to run a real tight blog, but I don't think it was necessary to block red's first post. I don't remember any inflammatory remarks in it.

Erlindo said...

I found this other link that will contribute to your thread Scientia:
Intel bares Tigerton

The fanboys at THG believe that CSI will be available with Clarksboro chipset and they even believe that K8L won't have HT 3.0.
Linky

Anonymous said...

I believe that people should have the freedom to like one product over another or one company rather than the other, even if it is 'inferior' as long as one can still see the reality of the situation. Scientia, you claim to hold no bias for AMD nor Intel..But you're not fooling anyone when all you do is rant about Intel;) I can declare that AMD will gain marketshare into 2007, that does not prove I'm not biased towards Intel. I don't believe that I am, but with all this Sharikou, MMM, AMD fanbois BSing everywhere..Maybe if I offer a different perspective to you you will also not be blinded about AMD's holiness and in turn can influence folks like The_Ghost:)

I doubt that Tigerton has CSI, but how is implying that it does, any different than those claiming K8L will have a fourth decoder or whatever? One random person declaring AMD won't have HT3 is also misinfo and does not represent THG as a whole when there are many in that thread that debunk that.

Anonymous said...

Again, I don't know how you came up with your Tigerton FSB numbers..Please explain:) And I don't see why technical details of Caneland vs Bensley is off topic..But do you find anything wrong with my statement that 1.6 < 2.125? Seems like a 38% improvement.

ashenman said...

Scientia and I are not biased. We simply love the Industry, and hate what Intel has done to it. Unlike Scientia, I'm a bit more obsessed with capitalism, which is why I don't mind as much of what Intel has done.

However, I realize the industry would be further along if Intel hadn't done what it has done in its marketing division. We would have better products, better pricing, and a more stable market from both AMD and Intel. I do hold a bias AGAINST Intel, not necessarily for AMD. However, wishing a company would do better, and actually watching it do better are completely different things, and I think both Scientia and I know that.

The other part of loving capitalism is watching a company's tactics backfire on them. That's another reason I enjoy this market, and don't mind as much about what Intel did.

ashenman said...

Scientia's numbers come from dividing the aggregate bandwidth by the number of cores and then dividing that by the size of those cores' interface. This gives you front side bus. In this case I think the interface is 16 bit? But I don't remember and have to go.

Darth Solarion said...

A question, I noticed there was a mention of a cache in the diagram. How does that figure in the overall scheme?

Pop Catalin Sever said...

Maybe you want to check this out. It's an article about Core 2 Shared cache efficiency.

http://www.digit-life.com/articles2/cpu/rmmt-l2-cache.html

As a note their X2 3800+ memory bandwidth test seems totally skewed to me even though they compare 400 MHz DDR bandwidth of an X2 3800+ with 800 MHz DDR2 of an X6800??? This is the kind of testing that has no real meaning except to put one of the sides in a very favorable light.

But anyways even so the tests indicate some of the Core 2 weaknesses and it is an interesting read, so if you happen to have any time I'd sure like to hear an opinion about this... tks

Scientia from AMDZone said...

The reason I deleted Red's first post is because I do not want to get off on a tangent about whether or not and how many and proportionately how many Intel fans are on Tom's Hardware Guide. That is exactly what happened in the last two articles and it is going to stop.

In a similar vein, I am not going to spend paragraph after paragraph explaining who "they" or "the fans" or "the enthusiasts" are. If Red doesn't understand who they are then that is too bad. Again, these types of sidetracking discussions will stop. We are also not going to start discussions about freedom of speech or the environment of this blog or the fairness of implementation. This is only a comment section, not an actual forum and off topic discussions of this nature quickly bury the thread.

Red, your comparison of Bensley with Caneland was not off topic, which was why I didn't delete it. However, my article was not about Bensley. You are correct to say that Caneland has more bandwidth than Bensley; however, Bensley is also constrained on bandwidth for quad core. Bensley only allows 200MHz of FSB per core, which is half of what I believe is adequate. Even PIII had 266 for a single core so 200 cannot be enough. The ratio that you calculated moves the FSB in Bensley up from 200MHz to 266 in Caneland. Again, 266 is not enough; we need at least 400. 400 was what Barton Athlon XP 3200+ used as well as Athlon 64 on socket 754. 400 is also what P4 had at introduction. I don't think anything lower than 400 is reasonable.

To get the numbers you have to remember that PIII, Barton and socket 754 only have a single memory channel. P4 and C2D use a quad pumped FSB which doubles the bandwidth. K8 on 939, 940, AM2 and F is twice as wide so it again doubles the bandwidth.

Kentsfield would be 333MHz per core if it indeed has a 1333MHz FSB. I think this is moot though if Tigerton only has 1066MHz. Cache is the same strategy that Intel used and is still using with P4 Xeon. They end up with 8MB of L2 and the performance still degrades. They use as much as 24MB with Itanium but this makes the dies very expensive. This strategy no longer works like it did in 2002 because AMD's movement into servers has lowered the price.

Intel is ahead on the desktop and will continue to be until Q3 07. However, this is not the topic and not mentioning it does not show bias on my part. Tigerton is not a desktop processor; it is a server processor. Clovertown will be outclassed in Q2 07 and I believe because of the memory bandwidth problems Tigerton is going to have problems as well even though it is a native quad design.

Scientia from AMDZone said...

Scientia, you claim to hold no bias for AMD nor Intel..But you're not fooling anyone when all you do is rant about Intel;)

Maybe if I offer a different perspective to you you will also not be blinded about AMD's holiness


Red, I've been trying to be somewhat lenient with your posts but this is your last warning. Future posts of yours may be deleted if they contain similar comments even if the rest of the comment is on topic. STOP.

Scientia from AMDZone said...

Well, I suppose the people over at Forumz can think what they want even if it doesn't make any sense. AMD bumped up HT 3.0. They were going to introduce HT 3.0 and DDR3 at the same time but instead they moved HT 3.0 forward.

I suppose someone can sit in the pumpkin patch waiting for CSI as Linus does for the Great Pumpkin however this notion doesn't match the diagram. The diagram clearly shows a Northbridge for Caneland. The diagram also states that this Northbridge has a snoop filter. The snoop filter would be 100% unnecessary if Tigerton had CSI. There is also no real reason to have CSI unless you also use an integrated memory controller. Obviously, the use of a Northbridge is not compatible with an IMC. So, if the folks over at Forumz are right then the diagram is completely wrong.

Erlindo said...

Maybe you want to check this out. It's an article about Core 2 Shared cache efficiency.

http://www.digit-life.com/articles2/cpu/rmmt-l2-cache.html

As a note their X2 3800+ memory bandwidth test seems totally skewed to me even though they compare 400 MHz DDR bandwidth of an X2 3800+ with 800 MHz DDR2 of an X6800??? This is the kind of testing that has no real meaning except to put one of the sides in a very favorable light.

But anyways even so the tests indicate some of the Core 2 weaknesses and it is an interesting read, so if you happen to have any time I'd sure like to hear an opinion about this... tks


You are looking at it the wrong way.

1) They didn't include socket AM2 in those tests. These processors have greater bandwidth than socket 939 because of DDR2.

2) Although, conroe's cache isn't more efficient than AMD's dedicated L2 caches. In conclusion, in real world scenarios, conroe suffers from cache thrashing (read the conclusions).

Pop Catalin Sever said...

"You are looking at it the wrong way.

1) They didn't include socket AM2 in those tests. These processors have greater bandwidth than socket 939 because of DDR2.

2) Although, conroe's cache isn't more efficient than AMD's dedicated L2 caches. In conclusion, in real world scenarios, conroe suffers from cache thrashing (read the conclusions)."

No, :) I was looking at the tests the right way, the way which Intel's use of cache or bandwidth use isn't the greatest ... you thought I was looking the other way around ...

Scientia from AMDZone said...

As far as the Digit Life article goes there are some points that are important and some that are not. If the data is shared then shared cache is faster. If the cache is stressed by streaming data or lots of branches, independent caches prevent thrashing. The test procedure itself is somewhat silly because they compare an X2 with 1MB of L2 total against an X6800 with 4MB of L2 total.

However, one point that is true is that they doubled the cache bus from Yonah to Conroe which helps a lot. This got rid of the 40% slowdown that Yonah had. K8 still has the same cache bandwidth as Yonah. However, with K8L, AMD gets not only double the cache bus bandwidth but independent buses as well which should be more efficient than just doubling the bus width. K8L also gets a dual independent/shared cache structure which should be the best of both worlds. Even with Tigerton, Intel is not going to get this if they just share the L2 among four cores (as has been suggested). Sharing L2 among four cores will simply increase the problem of thrashing.

So, the L2 on C2D works pretty well today (better than on Yonah) and is better than K8's in some circumstances. However, the cache on Tigerton may be inferior to K8L's in the great majority of cases.

Erlindo said...

No, :) I was looking at the tests the right way, the way which Intel's use of cache or bandwidth use isn't the greatest ... you thought I was looking the other way around ...

Sorry if I misunderstood you then. ;)

Anonymous said...

I see nowhere that indicates native quad Tigerton. Those should come with Nehalem I believe.

If Tigerton does have a supposed 38% improvement in bandwidth compared to Bensley, how does that compare to Barcelona? And how does Bensley compare to whatever AMD has now?

Since Erlindo is free to post his 'Forumz are fanbois!' and you also go 'off topic' I hope I'm granted the same privilege:) Also, why do you place so much emphasis on memory bandwidth? Surely you should be able to show us some apps that would take advantage of this? If Kentsfield is indeed starved as you say, shouldn't it stop gaining after a certain clock, when it can't crunch any faster than data moves in? You say Kentsfield only has a paltry 60% gain, yet I also showed some benches with similar gains for Opteron from just a year ago.

ashenman said...

Just going to point out quickly that part of the pd's problem was able to be attributed to a slower fsb. I think Intel's branch prediction is able to compensate for this much better, now that they have much shorter pipelines. You won't actually see any of these problems arise until programs are able to use all 4 cores, which is sad, because that means that either way, you're wasting the technology.

Scientia from AMDZone said...

I may have misunderstood; is Tigerton just the MP version of the DP Clovertown?

As far as bandwidth goes, Barcelona can do 400MHz per core by using 8 memory channels of DDR2-800. Caneland would max out at 6 channels of DDR2-667 assuming that it has that many memory channels. This would give only 266MHz per core.

Scientia from AMDZone said...

Since Erlindo is free to post his 'Forumz are fanbois!'

Where is Erlindo's comment of this on this page?

Anonymous said...

The fanboys at THG believe that CSI will be available with Clarksboro chipset and they even believe that K8L won't have HT 3.0.
Linky


There he goes, trying to prove the existence of Intelliots, yet I am shunned for trying to disprove them:)

Tigerton is the MP quad core Core 2 version of Tulsa. Sure, MP version of Clovertown if you like:)

Only 266? That's just a number to me:) High end Clovertowns are running 1333 and Intel is updating Core 2 to 1333 also.
http://www.dailytech.com/article.aspx?newsid=4589

So how about a comparison of current[you said Bensley, slides says Truland, Google agrees;)] vs Caneland, current vs whatever AMD has and Caneland vs Barcelona? Those FSB numbers don't mean much to me:)

Scientia, you don't dispute with me and ashenman's assessment that 'low' DivX scores are a result of threading not bandwidth?

ashenman said...

He's not trying to prove they exist, he's simply stating a commonly held misconception of a group of people he names intelliots. You should watch it red, this conversation does not need to go this way, and your posts will probably end up being deleted if all you argue about is whether or not they should be deleted. If you think this opinion is not common, read chips and salsa's latest posting.

On the 266 just being a number, I think you show bias by stating that, after what I said about pentium d's.

I didn't assess (as far as I remember) that Divx was thread limited. I did state, however, that it could be a bit of either. I almost completely agree on Scientia's analysis of thread and bandwidth limiting on those benchmarks, however.

Scientia from AMDZone said...

Erlindo was giving a link; there is a difference. I'm not trying to police every word; I just don't want sidetracking discussions about unimportant things nor do I wish to be insulted on my own blog.

Only 266? That's just a number to me:)

Then I don't know how else to explain it to you. You need 400 per core and you only have 266 with Tigerton. Barcelona is capable of 400 per core.

Current doesn't help much. K8 can currently do 800 per core and X6800 can do 667 per core. However, K8's is just room for K8L on the same socket. Opteron was 400 per core with X2 on 939 and 940 and 400 per core with 754. Again, no change.

Scientia, you don't dispute with me and ashenman's assessment that 'low' DivX scores are a result of threading not bandwidth?

I don't know based on the THG scores. I'm sure that some tests would show a bandwidth dropoff. Maybe we'll get better testing in the future.

Anonymous said...

Chip dude is a person, not a group. Of course there are always going to be some people that are going to be pro this and anti that:) I still haven't seen a thread of Forumz with ultimate AMD is suckage:)

Linking doesn't seem that different to me and my topic on Intelliots was about as sidetracky as Li Net;)

It is just a number, at least comparing FSB vs HT:) PD had 1066 at higher end later which didn't do so hot anyhows. I'm guessing you got 266 by 1066/4? There are some Tulsas with 'just' 667/2, including the fastest 2. Seeing as how Clovertown still has 1333 and Conroe will be refreshed with 1333, any reason you chose 1066? 1333/4=667/2

ashenman said...

1066 is the bus speed of the Tigerton platform described by Intel. Thus, regardless of processor speed, the best any system based on this can do is a 1066 fsb, even if the processor has a faster bus speed.

I was under the impression that the Tulsa platform was very slow and very hot, so the example given actually proves the point that this system will be very slow.

Core2Dude is just one person, but the majority of his commenters are not just one person;).

Yes, PD had a higher fsb later, which did lead to near performance parity with some x2s, so your point also strengthens mine.

Scientia from AMDZone said...

I didn't realize that you didn't understand the calculations.

34 GB/sec with 4 buses
34 / 4 = 8.5
8.5 GB/sec per bus
8.5 GB/sec = 8500 MB/sec

The FSB is 64 bits wide or 8 bytes
8500 / 8 = 1062.5
1062.5 MHz

The difference from 1066 MHz is rounding error in the 34 GB/sec figure

1066 MHz for 4 cores
1066 / 4 = 266
266 MHz per core
266 MHz would be equal to two memory channels of DDR 133 or one memory channel of DDR 266 for a single core processor.

K8 on socket 754 uses one channel of DDR 400.

Barton 3200+ on socket A used one channel of DDR 400.

Even Willamette at 1.4GHz had a 400MHz bus.
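The steps above can be written out in a few lines of Python. This is only a sketch of the same arithmetic, assuming a 64-bit bus and four cores sharing each bus; the function name is made up for illustration.

```python
def fsb_from_aggregate(aggregate_gb_s, buses, bus_width_bytes, cores_per_bus):
    """Convert an aggregate bandwidth figure into per-bus and per-core FSB rates (MHz)."""
    per_bus_mb_s = aggregate_gb_s * 1000 / buses   # 34 GB/sec over 4 buses
    per_bus_mhz = per_bus_mb_s / bus_width_bytes   # 8 bytes moved per transfer
    per_core_mhz = per_bus_mhz / cores_per_bus     # four cores share one bus
    return per_bus_mhz, per_core_mhz

per_bus, per_core = fsb_from_aggregate(34, buses=4, bus_width_bytes=8, cores_per_bus=4)
print(f"Per bus: {per_bus} MHz, per core: {per_core} MHz")
```

The per-bus result is the 1062.5 figure, and the per-core result is just under 266 because the 34 GB/sec input is itself rounded.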

Erlindo said...

Talking about bias, have you guys seen the recent review that THG made with the Woodcrest vs the s940 Opteron?

This will finally put the nail in the coffin to THG's credibility. ;)

THG's misleading Xeon vs Opteron review

...Looks like some of the members on forumz are complaining about the unfair tests made by THG's siblings while on the other side the fanboys worship it:
forumz

In conclusion, you have to be an @sshole to not know that socket-F Opterons were available a while ago. ;)

Darth Solarion said...

It's amazing how these ad hominem attacks go around, and it is really interesting when people don't think rationally and use silly words like "cool" and what not while disregarding the criticism.

Emperor's clothes indeed.

Wallachian said...

Scientia,

Please take a look at these results. Looks like Tulsa (ML570G4) performance is way better and cheaper than the Opteron (DL585G2), and both are from HP.

http://www.tpc.org/tpcc/results/tpcc_last_ten_results.asp

Pop Catalin Sever said...

"In conclusion, you have to be an @sshole to not know that socket-F Opterons were available a while ago. ;)"

Yes, I totally hate them!!! I wish they did benchmarks the old way where no aspect of a problem was left unanalyzed compared to now where the "isn't relevant" remark can be found to justify almost all of their sad testing.

TH is such a sad review site ! I used to be a huge TH fan and supporter but back then they totally deserved that, now all they do is to feed their "Intel fanboys mob" with biased reviews. I really can't express my disappointment in words...

ashenman said...

TPC is just a benchmark, and while most people only pay attention to that, there are many applications (arguably most) where the AMD platform outperforms the Tulsa platform.

Scientia from AMDZone said...

Please take a look at these results. Looks like Tulsa (ML570G4) performance is way better and cheaper than the Opteron (DL585G2), and both are from HP.

The Xeon box is more efficient in TPC-C. However, this box is not cheaper; it costs an additional $50,000. This extra money buys a better RAID controller and better RAID array. This is why it is faster.

Scientia from AMDZone said...

Talking about bias, have you guys seen the recent review that THG made with the Woodcrest vs the s940 Opteron?

Well, they bought an older Opteron system instead of socket F. Then they made sure that they only populated half of the memory slots so that FBDIMM latency wouldn't be too horrible. Both of these are unprofessional. Servers are normally tested with full memory slots since this is how they are used.

Also, they of course avoided running actual business, scientific, or engineering benchmarks.

Anonymous said...

Zomg, this entry's gone off topic;)

With (667/800)*2/8 for Truland, you get 166.75/200 per core. With (1066/1333)*4/16 for Caneland, you get 266.5/333.25. So what are these memory intensive apps that'll bog down Caneland? Your negative article seemed to imply that doubling the cores would result in poor memory performance compared to the past.
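
For anyone who wants to check those figures, here is a minimal Python sketch of the same formula (bus speed times number of buses, divided by number of cores). The bus and core counts are read straight off the expressions above; the dictionary layout and function name are just for illustration.

```python
def per_core(bus_mhz, buses, cores):
    """Aggregate bus speed shared evenly across all cores in the system."""
    return bus_mhz * buses / cores

# (buses, cores, bus speeds in Mhz), as used in the comment's expressions.
platforms = {
    "Truland": (2, 8, (667, 800)),
    "Caneland": (4, 16, (1066, 1333)),
}

for name, (buses, cores, speeds) in platforms.items():
    shares = [per_core(s, buses, cores) for s in speeds]
    print(name, shares)
# Truland [166.75, 200.0]
# Caneland [266.5, 333.25]
```

The 1333 figure for Caneland is the commenter's guess about a later refresh, not a published spec.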

That article was terrible, but hey, I've never said THG was good, just said they're reporting for daddy;)
http://tweakers.net/reviews/638/5
Just like AM2 though, Socket F doesn't seem to make a difference anyways.
http://www.tomshardware.com/2006/09/25/green_machine/
Anyone remember this regurgitated AMD press kit:)

Scientia from AMDZone said...

http://tweakers.net/reviews/638/5
Just like AM2 though, Socket F doesn't seem to make a difference anyways.


Actually, it does. However, this doesn't show up with a dual core 2.4Ghz chip. You would see a very small difference with 2.8Ghz and a little more with 3.0Ghz. With quad core however, it will make a big difference.

Wallachian said...

The Xeon box is more efficient in TPC-C. However, this box is not cheaper; it costs an additional $50,000. This extra money buys a better RAID controller and better RAID array. This is why it is faster.

The cost of the AMD system is around $549,000, and the Intel system is $596,000. Yet it says the Intel system costs just $1.88/tpmC while the AMD system costs $2.09 per tpmC. Isn't that the metric that matters a lot to corporations that buy these MP systems: how much money is being spent per transaction?
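
As a rough cross-check, the throughput implied by those numbers can be backed out in a few lines of Python. This is only a sketch from the rounded figures quoted above, not from the TPC result pages themselves, and the function name is made up.

```python
def implied_tpmc(system_cost_usd, usd_per_tpmc):
    """Price/performance is cost divided by throughput, so throughput = cost / (price per tpmC)."""
    return system_cost_usd / usd_per_tpmc

amd = implied_tpmc(549_000, 2.09)    # DL585G2 (Opteron), rounded cost from the comment
intel = implied_tpmc(596_000, 1.88)  # ML570G4 (Tulsa), rounded cost from the comment

print(f"AMD:   about {amd:,.0f} tpmC")
print(f"Intel: about {intel:,.0f} tpmC")
print(f"Intel throughput advantage: {intel / amd - 1:.0%}")
print(f"Intel cost premium:         {596_000 / 549_000 - 1:.0%}")
```

On these rounded inputs the Intel box comes out roughly a fifth faster for a bit under a tenth more money, which is where the lower cost per tpmC comes from.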

ashenman said...

While the money spent per transaction matters, the point is that if you added that same $50,000 worth of hardware upgrades into the AMD system, it would probably perform almost as well if not better. The comparison really isn't fair or representative of a real comparison of the platforms. Whoever does that system needs to readjust how they run those tests, because it seems like all they're doing is choosing random systems, and not bothering to try to show which one has more potential or which one has better scalability.

Scientia from AMDZone said...

I've mentioned this problem before. IBM, HP, and Dell only had 2-way systems before this year. It hasn't been that long since these three created 4-way systems and it will take some time for these to show up in the TPC numbers. Dell tends to have the lowest priced TPC.

All of the indicators are that AMD systems perform well in normal server operations. However, the RAID subsystems need to be configured the same for comparison.

Vavutsikarios said...

Barton 3200+ on socket A used one channel of DDR 400

If memory serves, nforce 2 ultra had 2 memory controllers 64b each.

ashenman said...

They did, but it barely helped performance. I had said chipset, and saw almost no performance benefit, even in benchmarking programs that should pick that up a lot better.

Vavutsikarios said...

True. DDR in single channel provided more bandwidth than AXP could take advantage of. Simple SDR memory did OK most of the time.

Having dual channel (or two channels) of DDR memory was more a matter of bragging rights in those days. Even more than that, incorporating dual channel was a necessary advancement for AMD chipsets in order to compete with Intel chipsets on even marketing ground.

Scientia from AMDZone said...

I think 400 Mhz or 400 DDR per core is a reasonable assumption for memory bandwidth. I think performance for any processor is going to drop with less. Since this has been shown over and over again since late 2002 I really have to assume that the THG benchmarks were faulty. Supposedly, Kentsfield with 1066Mhz was able to run full speed. However, this just does not seem reasonable as this would only be 266 Mhz per core.

Anonymous said...

Ooh, an anonymous poster;) Dupe post btw^^.

You did not dispute my 166.75/200 for Truland. With 266.5/333.25 for Caneland, that would be 30-60% improvement on the known specs, to 70-100% on my assumption based on the upcoming 1333 refreshes. Your whole article reminds us of how tiny 266 is but you fail to mention that Truland has less. So what exactly is this article ranting about when Tigerton will gain more bandwidth and AMD is still not using its excess?
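
The percentage ranges can be reproduced with a small Python sketch. It takes the per-core figures from this comment as given; the 333.25 value rests on the commenter's assumed 1333 refresh, and the helper name is illustrative.

```python
def pct_gain(new, old):
    """Percentage increase of new over old."""
    return (new - old) / old * 100

truland = (166.75, 200.0)  # per core at 667 and 800
caneland = {"1066": 266.5, "1333 (assumed refresh)": 333.25}

for label, share in caneland.items():
    low, high = sorted(pct_gain(share, old) for old in truland)
    print(f"Caneland {label}: {low:.0f}% to {high:.0f}% more per core than Truland")
# Caneland 1066: 33% to 60% more per core than Truland
# Caneland 1333 (assumed refresh): 67% to 100% more per core than Truland
```

So the quoted "30-60%" and "70-100%" are rounded versions of roughly 33-60% and 67-100%.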

R2K said...

: )

Ho Ho said...

Can anyone test how big a difference it makes to some CPUs with dual-channel and single-channel memory? I bet the difference is almost always <5%.

How big speed increase did it give to move to IMC when moving from K7 to K8? How much of the speed increase was from doubling the memory bandwidth when moving from s754 to s939? What benefits more on average, lower memory latency or higher bandwidth?


Lowering FSB speed will affect average performance much more since memory latency instantly gets worse. You can't say that with four cores on 1G FSB every core has 4x of the latency of single core on the same FSB. It will probably be a bit less but not that much less since FSB is not constantly fully loaded. Of course, if every core used the FSB to the max they would each have 1/4 of the bandwidth, but I know of no real-world usage scenarios where all cores would need that much bandwidth. Feel free to educate me; some examples would be nice, especially if I can duplicate them on Linux.

2.1GB/s is quite a bit of data to move around. I would understand if someone tried to rasterize on CPU it would need massive memory bandwidth but fortunately there are better ways to render stuff.

Scientia from AMDZone said...

Can anyone test how big a difference it makes to some CPUs with dual-channel and single-channel memory? I bet the difference is almost always <5%.

You must think very little of Intel then. Why then did they waste so much time increasing the FSB speed of P4 from 400 to 800Mhz when it was entirely unnecessary?

How big speed increase did it give to move to IMC when moving from K7 to K8?

On 4-way systems the difference was a factor of 2.

You can't say that with four cores on 1G FSB every core has 4x of the latency of single core on the same FSB. It will probably be a bit less but not that much less since FSB is not constantly fully loaded.

There is something very self-deceptive in suggesting that it makes sense to have four cores and then arguing that the four cores won't be fully utilized. The memory bus can, with some effort, be saturated today with just one core; it is trivial to do it with four.

I suppose it could be argued that typical software would not saturate the bus but then typical use would not need four cores. If you are going to impose artificial limitations when testing then there is no point in testing.