Tuesday, February 20, 2007

Chipsets And Chess Pieces

Coming into 2007, the chipset market has shifted dramatically with AMD's acquisition of ATI. This not only makes AMD's position considerably stronger but will also affect Intel's chipset direction.

While nVidia and ATI have been slugging it out in the discrete graphics market, Intel has enjoyed a fairly secure position in desktop and mobile integrated chipsets. Intel's nearest integrated competitor was ATI; however, ATI was not only a distant second but often depended on orders from Intel for significant sales. Intel has enjoyed its ability to flex its muscle in this area, as it did when it stopped purchases from VIA when VIA would not pay bus licensing fees. This reduced VIA from a strong chipset and graphics player to its current status as little more than a supplier in the budget niche. However, with AMD's purchase of ATI, Intel's position has weakened dramatically.

In one stroke, ATI not only gains secure funding for R&D to compete with nVidia but also gains early notification of strategic direction as well as close cooperation with AMD's CPU designers. This gives the ATI division a definite advantage over nVidia and puts ATI in a category that was the sole domain of Intel's chipset and graphics division. While most computer enthusiasts will see this as competition in the discrete area, this is not the big fight. I do expect ATI to gain back some of the graphics share it lost to nVidia; however, this won't be as dramatic as it will be for Intel. ATI and nVidia have been competing for a long time and are very used to it; the company that isn't used to this is Intel. Having lost its monopoly on factory-supplied integrated chipsets, Intel can expect competition to increase greatly for corporate and commercial sales.

The commercial sales area is defined much less by performance and more by cost and reliability. OEMs want systems to be easy to configure and maintain and as low in cost as possible. This is the stronghold of the integrated graphics chipset, where the graphics chip is already attached to the motherboard. ATI's presence in this market just got a big boost since ATI chipsets can now be shipped with AMD's name and official endorsement clearly stenciled on each package. There is no doubt that AMD is deadly serious about this market and is diving in with a head start over Intel. The reason for this head start has to do with cost.

Every chip on a motherboard costs something, and every chip connects to other chips and makes motherboard design more complex. The best way to reduce cost is to reduce the chip count. AMD already has a huge leg up on Intel because AMD motherboards don't need a Northbridge chip, as all Intel motherboards do. This reduces both cost and complexity. AMD is continuing down the chip-count-reduction road with its plan to put the graphics chip (GPU) on the same die as the CPU. This sort of thing has been done in the past, as when the MMU and FPU were integrated on the CPU die in the same fashion; AMD went further by including the memory controller as well. So, in 2008 there should be plenty of smiles from motherboard designers who can leave off the Northbridge and GPU and design a simpler, cheaper motherboard. These smiles should be passed along to commercial OEMs and finally to commercial customers. Simple, inexpensive, easy to validate and configure: this is exactly what the market needs.

Intel, however, seems to have been caught off guard in this area. Presumably, it has been financing the added cost of the Northbridge with the money it gets from bus licenses; however, it will no longer have this revenue from ATI. Intel had talked about putting the GPU inside the Northbridge, which would indeed reduce the chip count, and it appears that Intel has done this with the iG965. However, one has to wonder how bright this move is when the Northbridge will presumably disappear in 2009, when Intel releases the new x86 socket with CSI and the memory controller included on the CPU die. In other words, when Intel tries to follow AMD by dropping the Northbridge, whatever work it has put into having the GPU on the Northbridge will get tossed out. This certainly isn't additive development, but Intel probably has the money to waste. Moreover, since the iG965 still has to connect to memory and to the FSB of the CPU, even adding the GPU doesn't reduce complexity as much as it does on AMD motherboards. Also, it has been suggested that Intel will need a switching hub to connect CSI among the processors, which would add a new chip to the motherboard. This too creates doubt that Intel will be able to match AMD in reducing chip count and motherboard cost.

Intel may try to follow AMD by putting the GPU on the CPU die; however, I'm certain it will be at least a year behind AMD. Whereas AMD's K8 uses a crossbar to communicate among the HyperTransport links, the memory controller, and the CPU cache, Intel's chips have no such functionality. AMD can easily hook up a GPU by adding another port to the crossbar. However, since Intel chips have no crossbar, Intel will need to invent some method of communication. In the past this was done via Northbridge translation between the GPU and the CPU's FSB. Now that the FSB will be dropped along with the Northbridge, Intel will need something entirely new. I suppose the quick and dirty approach would be to use CSI. This could work, but CSI is a long way from HyperTransport's track record of robust, reliable communication. In fact, it is perhaps a somewhat embarrassing secret that some 3rd party Intel motherboards use AMD's HyperTransport to talk between the Northbridge and Southbridge chips. In the BIOS this is referred to by the original name, Lightning Data Transport or LDT.
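To make the crossbar point concrete, here is a toy sketch in Python (purely illustrative, not AMD's actual design) of why an on-die GPU is "just another port" on a K8-style crossbar rather than a whole new bus:

    # Toy model only: a crossbar is a set of ports that can all reach each other,
    # so attaching a GPU is just one more port, not a new protocol.
    class Crossbar:
        def __init__(self):
            self.ports = {}
        def attach(self, name, handler):
            self.ports[name] = handler
        def route(self, src, dst, payload):
            # any attached port can reach any other attached port
            return self.ports[dst](src, payload)

    xbar = Crossbar()
    xbar.attach("memctrl", lambda src, p: "memctrl serves '%s' for %s" % (p, src))
    xbar.attach("ht0", lambda src, p: "ht0 forwards '%s' from %s" % (p, src))
    xbar.attach("core0", lambda src, p: "core0 receives '%s' from %s" % (p, src))
    xbar.attach("gpu", lambda src, p: "gpu receives '%s' from %s" % (p, src))  # hypothetical Fusion-style port

    print(xbar.route("gpu", "memctrl", "read 64B @ 0x1000"))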

So, while Intel currently has the stronger position in terms of processors, it will have to scramble to keep from having the chipset rug pulled out from under it in a determined and focused assault by AMD. AMD's drive to reduce cost and complexity by reducing chip count will be hard to match, not only by Intel but by other 3rd party chipset makers like VIA. This does leave the question of whether low-end computers could be reduced in cost even further. And, since any further reductions in low-end system cost along with potential increases in power could overlap with the set-top market, one has to wonder whether the Xbox and PlayStation will still be around in 2009. However, I have little doubt that by 2009 ATI's share of the discrete graphics market will be much closer to nVidia's than it is now, and ATI's share of the desktop and mobile integrated markets will be much closer to Intel's.

124 comments:

enumae said...

I am not sure what to tell you except that the new AMD 690G chipset has both a North Bridge and South Bridge.

Sorry for double posting, it is in the previous post as well.

Woof Woof said...

The NB is significantly less complex than the ones in the Intel designs which have to serve as an arbitrator for 2x dual cores (in Clovertowns) as well as a memory controller.

howling2929 said...

Just a clarification: HyperTransport is not an AMD technology. It was developed by an industry consortium.

What is 100% AMD technology is a proprietary extension to HyperTransport called "Coherent HyperTransport", or ccHT for short.

While ccHT is a big achievement, one has to be careful to label things the way they should be labeled. Therefore, there is no embarrassment if some 3rd party Intel mobos use an industry STANDARD technology named HyperTransport.

enumae said...

Scientia

Just curious what your take is on Anandtech's article discussing AMD's Barcelona, and if you feel it to be an accurate description of what enhancements and changes have been made to AMD's current K8?

Barcelona Architecture: AMD on the Counterattack

Scientia from AMDZone said...

howling

Actually, it wasn't. HyperTransport 1.0 was developed solely by AMD and DEC. The HyperTransport Consortium didn't develop anything; it was simply the Lightning Data Transport Consortium renamed. As of version 1.0 the consortium merely approved the HT 1.0 standard but did no actual development. Since that time, we can give more credit to the consortium for versions 2.0 and 3.0. I think it is safe to call HT an AMD technology because clearly its only reason for existing is AMD. Neither HP nor Intel has shown any inclination to continue what DEC originally developed.

enumae

The article is excellent. I was surprised because it is unlike anything I've read at Anandtech since 2001. In the past few years Anand has written few articles himself, has been ignorant of K8 technology as late as 2005 and has been very critical of AMD since 2002. This article in contrast was written by Anand Lal Shimpi himself, showed excellent understanding of both K8 and Barcelona architecture, and was reasonably optimistic. He even covers some technical aspects that have not been previously published.

The only problems in the article are on the first and last pages where he is incorrect about AMD's development schedule.

Scientia from AMDZone said...

BTW, since this is about chipsets I guess I'll respond to claims made in the last two article comments about how AMD is supposedly losing all sorts of customers for its chipsets. We know that AMD actually gained versus nVidia in Q4 06. And this INQ 690 Article says:

unlike previous launches, this time AMD has 34 different motherboards from 12 different partners, such as Albatron, Asus, Asrock, Biostar, ECS, EPoX, Foxconn, Gigabyte United, MSI, PC Partner, Sapphire, Shuttle. This only includes desktops. Mobile platform named Red Kite or HTPC vision Kauai are not included here, since there is over 40 different designs planned in that segment.

I don't tend to rely on the INQ but this seems likely to be true.

Scientia from AMDZone said...

Another INQ article pretty well sums what I was writing about in my article:

AMD'S INTRODUCTION of the 690 chipset marks several milestones for AMD. This is the first chipset for the mobile market in AMD's history, the company's first integrated chipset for desktop market, and its first desktop chipset after five years

This is a huge change in the chipset market.

enumae said...

The article was a bit over my head, but I do understand that the enhancements or changes could result in significant performance gains.

Just to follow up on my first post, if ATI has to manufacture a North and South Bridge for the 690G (which, as of now, they do), how much cheaper can the chipset be when compared to Intel?

Also, what's your take on Intel's Robson technology that will be incorporated in the Santa Rosa platform?

I do not use a mobile device/laptop, but to have almost instant power-on capabilities would seem to be a great selling point and could strengthen Intel's hold on the mobile sector.

enumae said...

In regards to your other points about motherboard manufacturers...

All of this is at what expense to Nvidia?

That's who will be hurt more than Intel, at least initially.

Scientia from AMDZone said...

enumae
Just to follow up on my first post, if ATI has to manufacture a North and South Bridge for the 690G (which, as of now, they do), how much cheaper can the chipset be when compared to Intel?

As I said in the last article's comments, AMD does not use a northbridge; the 690G is a GPU. The cost advantage is small on single-socket systems. There is a small saving on the chipset itself and another on the cost of the motherboard. This cost difference is greater on 2-way and large on 4-way. This is one of the reasons why Intel has to go to an IMC and CSI.

Also, what's your take on Intel's Robson technology that will be incorporated in the Santa Rosa platform?

Well, if it makes the applications load faster and also saves battery power then that is quite good. It has been rumored that AMD will have this too when it releases SB700; the current Southbridge is SB600.

I do not use a mobile device/laptop but to have almost instant power on capabilities

It isn't; it just makes the loading faster.

would seem to be a great selling point and could strengthen Intels hold on the Mobile sector.

Not likely. Intel will lose mobile share during 2007. However, it will still retain most of it.

Scientia from AMDZone said...

enumae
Thats who will be hurt more than Intel, at least initially.


Then you don't understand who the customers are. nVidia isn't competing in this market.

enumae said...

Scientia

Sorry for so many posts, but looking at your article again it would seem your statement is incorrect...

Intel has talked about putting the GPU inside the Northbridge which would indeed reduce the chip count.

Looking at Wiki and an Xbit article, Intel is currently doing what you're saying they haven't. Intel's 965G uses a GMCH which, according to Wiki, is a Northbridge with integrated video: "Some northbridges also contain integrated video controllers, which are also known as a Graphics and Memory Controller Hub (GMCH)".

--------------------------

In other words, when Intel tries to follow AMD by dropping the Northbridge then whatever work it has put into having the GPU on the Northbridge will get tossed out.

Looking at a Hexus article, it would seem that the North Bridge controls the PCI-E and audio for AMD's 690G, so AMD, as of now, cannot drop the North Bridge.

enumae said...

Scientia
Then you don't understand who the customers are. nVidia isn't competing in this market.

Ok, that clarifies a question I was going to ask.

Scientia from AMDZone said...

enumae

I corrected the article to include mention of the iG965. However, AMD does not need a northbridge since PCIe functionality can be included on the Southbridge.

Intel currently sells its motherboards with its own chipsets to commercial OEM's. AMD needs a similar bundled capability to compete. This is an area where nVidia does not have sales; Intel currently has this market to itself.

However, another factor to consider is expenses. Assuming that AMD stops developing for Intel, ATI will no longer be paying bus licensing fees to Intel. Further, since ATI is part of AMD, it won't have to pay the HT Consortium fees. If AMD becomes cheaper to develop for, this does leave the question of how strongly nVidia will continue to support Intel for anything other than discrete graphics.

Scientia from AMDZone said...

roborat

You are now showing up in my Analytics traffic stats. It says I'm getting this much traffic from these sources:

0.80% <- Advanced Micro Devices
1.89% <- Roborat64.blogspot.com
4.13% <- Intel Corporation
6.77% <- Amdzone.com
8.09% <- Google

enumae said...

Scientia
However, AMD does not need a northbridge since PCIe functionality can be included on the Southbridge.

But it isn't yet and therefore your comparison in price would be incorrect.

I do not know economics, but I do understand that the more you make of something the cheaper it is, and in this case Intel's should be cheaper when comparing chipsets that each have both a north and south bridge.

...Assuming that AMD will stop developing for Intel ATI will no longer be paying bus licensing fees to Intel.

How much is this fee?

How much is this fee compared to the business lost?

How much is this fee compared to having someone else manufacturing your chipsets?

Further, since ATI is part of AMD it won't have to pay the HT Consortium fees.

When I looked it up it was only $40,000 a year, not what I would call a huge savings, not in relation to the size of these companies.

...this does leave the question of how strongly nVidia will continue to support Intel for anythiing other than discrete graphics.

If AMD can make their own chipsets, and enough of them, why would they look to Nvidia for anything on the corporate, desktop or mobile platforms?

The only place AMD would need Nvidia would be servers; again, this is down the road, but I would have to believe that Nvidia is aware of what could happen.

Nvidia may need business, and I believe that will come mainly from Intel, but this is a ways out.

Also, I feel we'll see SLI on Intel very soon and the support for Crossfire will be gone.

Scientia from AMDZone said...

enumae

You are being incredibly pessimistic about AMD and this is not warranted. Red tried to argue that AMD should have built another FAB instead of buying ATI but this idea still makes no sense. Without ATI, AMD's ability to continue to expand into new markets would be limited and having more chips wouldn't help. ATI is far more benefit than most people realize.

It has been suggested that AMD will make CrossFire an open standard to increase its use by other chipset manufacturers. This would seem consistent with AMD's past actions and philosophy. Therefore, your statement about SLI versus Crossfire is probably incorrect.

Secondly, you need to understand that AMD will compete directly with Intel in the commercial integrated graphics market, not with nVidia or VIA or anyone else. This is not a fight that Intel is prepared for as it has had this market all to itself.

Third, you need to understand that without ATI and with Intel's blackballing of VIA Intel now finds itself in a very poor position to negotiate with nVidia. Intel has no choice but to either try to supply all of its platform chipsets on its own, offer nVidia much better terms than it has in the past, or completely drop its bus licensing stance.

So, to recap:

AMD will benefit at Intel's expense.

AMD will undoubtedly take back share from nVidia.

nVidia may also benefit at Intel's expense or Intel may be hurt by the lack of a chipset supporting partner.

I get tired of hearing that nVidia will now be closer to Intel. Why would it make sense for nVidia to be closer to the world's biggest chipset and graphics competitor when AMD has given no indication that it wants to push nVidia out of the market? Sharikou180 has said this several times but he has never been able to explain it.

AMD will definitely support nVidia as a graphics and chipset partner. You really have it backwards when you ask why AMD would need nVidia. AMD will maintain a (license-free) relationship with nVidia simply because some vendors may prefer nVidia products. It wouldn't surprise me if AMD develops another server chipset, as it did when K8 was first launched; however, there is no doubt that the first priority will be commercial integrated and mobile integrated chipsets.

Fusion in 2008 on 45nm will reduce price, reduce power consumption on mobile chipsets and increase processor performance. As far as I know, Intel has nothing planned to match this in 2008 or even early 2009.

Christian H. said...

I corrected the article to include mention of the iG965. However, AMD does not need a northbridge since PCIe functionality can be included on the Southbridge.



I believe this is done to remove GPU IO from the other HT link to disk IO.

This seems like it would reduce contention.

Christian H. said...

The only place that AMD would need Nvidia would be Servers, again this is down the road but I would have to believe that Nvidia is aware of what could happen.

Nvidia may need business, and I believe that will come mainly from Intel, but this is a ways out.

Also, I feel well see SLI on Intel very soon and the support for Crossfire will be gone.




Wow, you have to be kidding. How can Intel compete with one GPU when AMD has two from either company?


Some people will only buy ATi. I doubt if Intel wants to alienate that many users.

Wirmish said...

Today, 28 February 2007 at 18:00 GMT+1, AMD's NDA ended.

Video interview with Phil Eisler & Giuseppe Amato :
http://www.syndrome-oc.net/articles.php?article=94&lang=en

"K10 40% perfs vs K8 is a very conservative number." - Giuseppe Amato

Scientia from AMDZone said...

theKhalif

Yes, but my understanding is that the 690G does not have its own HT link to the CPU and shares the link on the SB600. I suppose this is fine as long as the HT bandwidth is high enough. It has been suggested that the HT bandwidth is 4X what the GPU needs. Also, HT is capable of doing interrupted priority transfers. This is where, for example, the CPU is receiving packets from source A and in the middle it puts A on hold to receive priority packets from source B before resuming transfer from A.

Scientia from AMDZone said...

wirmish

Yes, I listened to the video; it was interesting.

BTW, the regular a href="" syntax for links is supported. You can find a tutorial Here

It's a shame that linked lists aren't supported too.

Ho Ho said...

scientia
"It has been suggested that the HT bandwidth is 4X what the GPU needs"

The IGP needs to read and write system RAM. On AMD that means going through the CPU's memory controller, which is connected to the IGP via an HT link. To get any kind of serious fillrate you need to be able to move data at as high a rate as possible. Low-end GPUs have onboard memory with >10GiB/s of bandwidth. Am I wrong when I say that HT1 doesn't have nearly as much bandwidth as you claim, especially when the link is shared with the SB?
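For reference, a rough back-of-the-envelope on the HT side (assuming a 16-bit link; actual link width and clock vary by board and chipset):

    # HyperTransport is DDR: two transfers per clock, link width in bits.
    def ht_gb_s_per_direction(clock_mhz, width_bits=16):
        return clock_mhz * 1e6 * 2 * (width_bits / 8) / 1e9

    print(ht_gb_s_per_direction(800))    # HT at 800 MHz: 3.2 GB/s each way
    print(ht_gb_s_per_direction(1000))   # HT at 1 GHz: 4.0 GB/s each way, 8 GB/s aggregate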



That video wasn't all that interesting. I personally didn't learn anything new from it.


They said K10 will be 300% faster than K8 in HPC.

They go from 2 FP per cycle to 4 FP per cycle, giving a theoretical 2x speed boost. Due to memory bottlenecks it will be a bit lower; with the help of the L3 it'll be 1.6x faster. Add two more cores, multiply 1.6 by two, and you have 3.2x.

That seems quite real but I wonder what kind of bottlenecks are they talking about. I thought the cache bandwidth was good enough to feed the SIMD units, especially as they improved their prefetchers.

I got a >2x speed boost in SSE2-intensive code going from a 3.8GHz P4D to a 3.1GHz 2M-L2 E6300. There didn't seem to be such bottlenecks on C2D for some reason. Perhaps it just has that much better prefetchers compared to Barcelona? After all, it has to make up for the much longer memory latency, and the P4 had a quite good prefetching mechanism that they improved even further when they put it into C2.


They also said they decreased the core frequency because of the large amount of cache. Current Intel quads have 2x4M L2 + 4x64k L1 = 8.25M of cache; Barcelona will have 2M L3 + 4x0.5M L2 + 4x128k L1 = 4.5M total. So Intel has a much higher core frequency and almost twice the cache. What is wrong with this picture?


But as usual, those numbers don't say much without some real-world benchmarks to back them up. So far all they have is only talk, no public examples.

abinstein said...

"IGP needs to read and write to system RAM. On AMD that means going through the CPU memory controller that is connected to the IGP via HT link. Low-end GPU's have onboard memory with >10GiB/s bandwidth. Am I wrong when I say that HT1 doesn't have nearly as much bandwidth, especially when the link is shared with SB, as you claim?"

Moot point; at any rate, HT will offer more bandwidth for the IGP than a shared FSB does. If the link goes through the south bridge, then it suffers the same bottleneck on both Intel and AMD.

abinstein said...

"I got >2x speedboost in SSE2 intensive code going from 3.8GHz P4D to 3.1GHz 2ML2 e6300. There didn't seem to be such bottlenecks on C2D for some reason. Perhaps it just has so much better prefetchers compared to Barcelona?"

Wait... what are you talking about? Conroe vs. Netburst or Conroe vs. Barcelona? What does a prefetcher have to do with SSE performance, which is probably the simplest type of data to prefetch accurately? In fact, my measurement of SSE on Conroe (4MB L2 cache) is about 1.6x better than K8 (512K L2 cache) per core. Barcelona certainly has per-core SSE matching or exceeding that of Core 2.

"They also said they decrease the core frequency because of lots of cache. Current intel quads have 2x4M+4x32k=8.25M cache, Barcelona will have 2M+4x0.5M+4x128=4.5M L2. So Intel has much higher core frequency and almost twice the cache. What is wrong with this picture?"

1. Intel's cache is inclusive. AMD's is exclusive. The complexity is different, and complexity (which is a function of size/associativity), not size alone, affects frequency.

2. Barcelona is a native quad core, which will adversely affect its available frequency.

abinstein said...

"AMD will maintain a (license free) relationship with nVidia simply because some vendors may prefer nVidia products."

Some end users (like me) also prefer nVidia products!!

enumae said...

Scientia
You are being incredibly pessimistic about AMD and this is not warranted.

I am not being negative or pessimistic towards AMD in any way; if you feel I have been, please show me the comments that were. I enjoy debate, and I thought that is what we were doing. I have made points and so have you, and when you do I acknowledge them.

ATI is far more benefit than most people realize.

Back to this: I have not implied that AMD should have done anything differently. The acquisition was a must and I understand that, and I also understand the long-term impact it can have. Most of my statements are only to question what you post, as I compare what you say to what I find.

I noticed you removed a few comments, but in a previous post you asked about AMD being able to put the PCI-E controller on the South Bridge; well, I don't see why not. My comment was "It isn't there yet...", and just to be clear, I thought we were talking about North and South Bridges pertaining to this new chipset, not servers.

Therefore, your statment about SLI versus Crossfire is probably incorrect.

Good point, as I had overlooked that.

Do you believe we will see SLI on Intel?

Secondly, you need to understand that AMD will compete directly with Intel in the commercial integrated graphics market, not with nVidia or VIA or anyone else.

Understood.

This is not a fight that Intel is prepared for as it has had this market all to itself.

This I cannot believe. Intel has to be well aware of what could happen; if they are not, then they deserve to lose market share.

Fusion in 2008 on 45nm will reduce price, reduce power consumption on mobile chipsets and increase processor performance.

In an Xbit article...

“The first Fusion CPUs will be available in prototype in late 2008 and in production in early 2009,” said Phil Hester, chief technology officer of AMD, in an interview with InfoWorld.

How complicated will this chip be?

What effect will it have on yields, and what kind of volume do you expect leaving 2009?

----------------------------

Just to make sure, I am not trying to be thick-headed. I understand I am not right all of the time, and I am willing to acknowledge it. I have no intention of being negative or pessimistic about AMD.

If you perceive my comments that way, I do apologize, for it is not my intention.

----------------------------

TheKhalif
How can Intel compete with one GPU when AMD has two from either company.

Good point, I overlooked that as well.

Scientia from AMDZone said...

enumae

The Southbridge is SB600. Fusion would be after the release of SB700. If Fusion doesn't arrive until 2009 it could be SB800. And, my point is still valid that Intel will be a year behind AMD.

So, you think Intel is prepared? This is from an INQ comparison of G965 and 690G:

We tried a ton of settings in order to find playable setups on G965, but the number of crashes and lack of texturing just made us feel we were doing something wrong.

Intel owns more than half of world's graphics market, and it does so with parts which are inadequate for gaming and do not pass Microsoft DirectX tests - DCT 3.0.

The only games that did not crash were NFS: Carbon, which runs in semi-texturing mode and WoW, which runs at 7-11 fps and drops to 1fps when you enter a town which has more occupants.


The point I was trying to make is that AMD's chipsets will ensure that AMD sells every processor it can make. Further, these chipsets will allow AMD to take share from Intel in the desktop commercial integrated graphics market as well as the mobile integrated graphics market. Some suggested in the last two articles that ATI's popularity was waning and that AMD would therefore have a shrinking market for ATI products. However, this suggests otherwise:

The AMD 690 series chipset will be widely available from partners, including Albatron Technology Co., ASUS, Biostar, ECS, ELITEGROUP COMPUTER SYSTEMS CO., LTD, EPoX Computer Company, Foxconn Technology Group, GIGABYTE United Inc, Jetway Info Co. Ltd., MSI Computer Ltd., PCPartner Ltd. and Sapphire.

In addition, numerous system integrators are on board including Atelco, Bas Group, Formoza, Multimedis, Onda, PC Box, Seethru, Systemax, Unika, Unika Multimedia and ZT Group.


ho ho

690G review at Tech Reports:

A four-lane PCIe interconnect gives the 690G 2GB/s of bandwidth, which looks a little pokey next to the GeForce 6150's 8GB/s HyperTransport interconnect. However, even Intel's high-end desktop chipsets are perfectly happy using a 2GB/s DMI interconnect, so it's unlikely the 690G will be starved for bandwidth.
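For reference, the PCIe arithmetic behind those figures works out roughly as follows (PCIe 1.x, with ~250 MB/s of payload per lane per direction after 8b/10b encoding):

    # PCIe 1.x: 2.5 GT/s per lane, 8b/10b encoding -> 0.25 GB/s payload per lane per direction.
    def pcie1_gb_s(lanes, directions=2):
        per_lane_per_dir = 2.5e9 * (8 / 10) / 8 / 1e9
        return per_lane_per_dir * lanes * directions

    print(pcie1_gb_s(4))    # x4 link: ~2 GB/s combined, the 690G figure quoted above
    print(pcie1_gb_s(16))   # x16 link: ~8 GB/s combined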

Ho Ho said...

abinstein
"Moot point, because in any rate, HT will offer more bandwidth for IGP than a shared FSB does."

You totally missed the point. IGP's in Intel NB have direct access to system RAM and don't have to go through FSB.

abinstein
"Wait... what are you talking about? Conroe vs. Netburst or Conroe vs. Barcelona?"

There I'm talking about C2 vs Netburst. I was just comparing how the widened SSE units combined with other improvements gave me much more than a 2x speedup in a core-to-core, clock-to-clock comparison.

abinstein
"What does prefetcher have anything to do with SSE performance, which is probably the simplest type of data to prefetch accurately?"

Just that they were talking about memory bottlenecks, and good prefetchers can lessen them, or at least make sure the data is available in the caches before it is needed. I didn't see any bottlenecks in the application I tested, even though Intel should have somewhat lower throughput.

You can find the benchmarking program I used in this thread. It's probably best to start reading it backwards; there were lots of updates to the original program.


abinstein
"1. Intel's cache is inclusive. AMD's is exclusive. The complexity is different, and complexity (which is a function of size/associativity), not size alone, affects frequency."

Complexity should only mean more transistors. From what I know, the difference between inclusive and exclusive caches is negligible and only present in the part of the chip that controls cache usage. The cache cells should be pretty much identical on both.

abinstein
"2. Barcelona is native quad core. It will adversely affect its available frequency."

Why so? Because it has too many transistors in one spot that all have to scale to a certain frequency, or because it gets too hot with all those transistors?

If the first, then you can see why dual-die can be a better solution. If the second, then Intel should have even more problems, since MCM solutions use more power than monolithic cores.

scientia, it seems you also mixed things up; this time it's memory access speed versus the connection speed between the northbridge and southbridge.

Woof Woof said...

I guess this is as "objective" and "unbiased" as Anand himself can be :P

"The K8 core was born as an evolution of the K7; with a slightly deeper pipeline, slight architectural improvements and an integrated Northbridge"

Hello?

Integrating the IMC and HT and adding the whole AMD64 64-bit extension is, IMHO, not trivial.

Later he goes on to grudgingly give credit that these slight changes still took Intel until C2D to win back the crown...

"..the Pentium M team at IDC was updating its architecture every year. Banias, Dothan, Yonah and Merom/Conroe all happened in a period of four years, and during that same time AMD's K8 remained unchanged."

Adding a crossbar switch, integrated dual core, SSE3 and the AMD-V extensions does not make the K8 unchanged over the last 3 years.

"Although there is a technical performance advantage to AMD's approach, we're unsure if it's something that will be visible in real world testing."

Didn't the "performance advantage" of X2 vs Pentium D already give us an idea how limited dual-die designs will scale?

"AMD lost the cache race to Intel long ago, but that's more of a result of manufacturing capacity than anything else. AMD knew it could not compete with Intel's ability to churn out more transistors on smaller processes faster, so it did the next best thing and integrated a memory controller."

I dunno.. I would have preferred an integrated memory controller to larger caches which will eventually run out, and IMHO only helps benchmarking, whereas an IMC will give you better returns for real world apps (unless your apps and data only fit in 4MB of cache)

He then goes on to talk about victim-cache architectures. If I am not mistaken, Intel doesn't implement this, which means its utilization isn't as efficient as with AMD's approach (which doesn't require a bigger cache).

But to be fair, he does get the new enhancements in the Barcelona core.

Ho Ho said...

woof woof
"Didn't the "performance advantage" of X2 vs Pentium D already give us an idea how limited dual-die designs will scale?"

How do you compare Core2Quad scaling versus the X2? I know one is a quad core and the other a dual core, but it shouldn't be impossible to compare scalability.


woof woof
"He then goes on to talk abt victim-cache architectures. If I am not mistaken, Intel's doesn't implement this, which means the utilization isn't as efficient as it is with AMD's approach (which doesn't require a bigger cache)"

Victim caches are not in the L1/2 caches. They are totally separate memory areas inside the core.

Roborat, Ph.D said...

Scientia said: "So, while Intel currently has the stronger position in terms of processors they will have to scramble to keep from having the chipset rug pulled out from under them in a determined and focused assault by AMD."

While this seems to be the key essence of your post, you never gave any concrete example of how AMD is going to accomplish this. Your example of chipset consolidation only applies to the AMD market.
While you can argue that AMD's ownership of ATI will be beneficial to its chipset business, one can easily offer a counter-argument. Intel will always be first to market with its CPU+chipset combos, making it easy to get design wins from major OEMs for next-gen products. This is the reason why Intel will always have 50% market share. The AMD-ATI merger doesn't change any of that.
Secondly, the chipset business is not as lucrative as you think. It's a system-enabler business. The only reason Intel has 50% share is because that's just how much they make. A company with a margin target of >50% doesn't wish to make too many of these, only enough to enable its customers to buy its core business. (It's the reason why you never hear of a chipset inventory buildup.)
I seriously think you're inventing a benefit for AMD as a result of its ATI purchase. The loss in margin is much more of a pain than the gain it gets from the chipset business. It's like buying back its flash business.

Scientia from AMDZone said...

ho ho
You'll need to be more descriptive; I have no idea what you are trying to say. Are you disagreeing with the TR article?

woof woof

This all started with Anand back in 2001 when AMD gave a demonstration of an 800MHz K8. The initial release date was 1H 02, but this changed to 2H 02 just a couple of months later. However, AMD continued to have problems getting the clock speed up on SOI until it finally purchased process technology from IBM. You have to understand that AMD had previously tried to partner with Motorola and then UMC, but neither was up to it. So, when Opteron was released in Q2 03, Anand ignored it and didn't do a review until September after Athlon 64 was released. In his review he sounded bitter and said that K8 was 18 months late and that AMD was pricing the chip too high.

Scientia from AMDZone said...

enumae

Dailytech: AMD continues Intel chipset development. So, it looks like AMD will continue to make Intel chipsets. I assume that if licensing is involved then this will continue. This does, however, bring up a curious situation in terms of flexibility, since AMD will be able to deliver chipsets for both platforms while Intel will only support its own.

DailyTech: Trevally mobile. Okay, we can clearly see that the reference platform "Trevally", which will compete with Santa Rosa, uses the SB700 and does not put a GPU on the CPU die. So, the 2009 timeframe with the SB800 sounds reasonable.

ho ho

According to this SB700 diagram, I had it backwards. The 690 connects via HT directly to the CPU, while the SB is connected only through a four-lane PCIe link. So, it is obviously not a question of the 690 being bottlenecked to memory, since its own HT connection is more than adequate. This should be adequate even for intensive operations with a local buffer attached.

Scientia from AMDZone said...

roborat
I seriously think you're inventing a benefit for AMD as a result of its ATI purchase. The loss in margin is much more of a pain than the gain it gets with the chipset business. its like buying back its flash business.


I haven't invented anything. AMD has stated that their commercial business is growing twice as fast as their consumer business. We know that AMD needs factory branded chipsets to really compete in this area. We also know that Intel has no competition in this market. I'll say again that this should enable AMD to maintain both desktop and mobile sales through 2007.

I don't know what else to say to you, Robo, other than that I am sorry you can't see past your bias. Comparing chipsets to flash is sad indeed.

enumae said...

Scientia
And, my point is still valid that Intel will be a year behind AMD.

In terms of Fusion, we don't know much about it. If Intel and AMD can do SOCs then they should be able to make Fusion, but AMD is trying to allow (I don't know the term) the GPU to work with the CPU. If Intel wanted to make an SOC with a Core derivative, why do you doubt they could by 2009?

So, you think Intel is prepared? This is from an INQ comparison of G965 and 690G:

You are talking about the corporate and mobile segments and the 690G, but you show video game benchmarks for a comparison. I am sorry, but the whole platform is what will allow AMD to make a surge at Intel's market share in the corporate and mobile segments.

Now I am not saying the 690G is not a good platform, and it may pull market share away from Intel, but I do not think it will have the impact you believe until the dual-core K10 derivatives are released and are AMD's mainstream product.

-----------------------------------

This does however bring up a curious situation in terms of flexibility since AMD will be able to deliver chipsets for both platforms while Intel will only support its own.

As quickly as everybody stopped working on the RD600 (X3200), why do you think AMD would do this?

I am not saying they won't, I'm just wondering why would they.

Unknown said...

Enumae, while gaming benchmarks are relatively misplaced, they do place actual demand on these chips and are a standardized way to see how they handle load. As such, they matter to companies that need these chipsets for graphics work, small amounts of AutoCAD/SolidWorks work, or video editing, but don't want to shell out the cash for a much nicer machine that may not be used that much.

I know from personal experience that you essentially can't run SolidWorks on Intel integrated graphics; even the X3000 has lots of problems and artifacting. However, you can run these programs on antiquated Matrox cards and nVidia 6150 chipsets. So, to universities and many design firms, this is already a big help, as not all the SolidWorks machines need to be able to do the analysis or assembly functions. This is just one area, and probably not too much of an oddball in the industry.

Roborat, maybe you can quit pulling a double standard here by actually giving an argument to explain why AMD and ATI can't pull off a first-to-market with their CPU-and-chipset combos? This just sounds like you're saying that Intel will always be ahead without giving any real argument or justification for such a statement.

abinstein said...

Ho Ho: "You totally missed the point. IGP's in Intel NB have direct access to system RAM and don't have to go through FSB."

You must be misled by some very misleading marketing slogan. Direct access to system RAM? So the system RAM is dual-ported and magically a DDR2-533 can offer 1GB/s of bandwidth? Tell me, what is the benefit of "direct access to system RAM" when the limit on bandwidth is the RAM itself?

Ho Ho: "Just that they were talking about memory bottlenecks and good prefetchers can lessen them, or at least make sure the data is availiable in caches before it is needed. I didn't see any bottlenecks in the application I tested even though Intel should have somewhat lower throughput."

A prefetcher is not going to help bandwidth, only latency. A prefetcher will actually hurt bandwidth because it will sometimes fetch something that isn't useful. Intel itself actually makes this point about enabling C2D's prefetcher.

Basic computer architecture tells us that nothing scales linearly. It's normal to get a 1.6x speedup for 2x the ALUs, since you could mispredict branches, run out of decode units, run out of reservation stations, be waiting on cache misses, etc. Actually, I believe the FP scaling from K8 to Barcelona is more like 1.9 (2x ALU) * 1.66 (2x core). It has nothing to do with whether the app you're testing is memory-bottlenecked or not.

Ho Ho: "Complexity should only mean more transistors. From what I know, inclusive vs exclusive cache the difference is neglible and only present in the part of chip that controls the cache usage. The cache cells should be pretty much identincal on both."

Larger cache size means longer distances, a more complex cache replacement algorithm, and more cache lines to look up. The 32-way associativity of the L3 cache also increases complexity a lot. These are certainly not negligible.

Inclusive and exclusive may have a negligible difference in effective cache latency, but definitely not in complexity. An exclusive cache trades complexity for a smaller cache size. That is, an exclusive cache could potentially lead to a slower processor, but more dies per wafer.

Ho Ho: "Because it has too many transistors in one spot that all have to scale to certain frequency or because it gets too hot because of all those transistors?"

Native quad core could be slower because, by physical laws, the die is larger, and thus it has longer paths from/to input/output, slower power propagation, a larger clock tree, and higher variation in transistor characteristics.

Ho Ho: "If the first then you can see now why dual-die can be a better solution. If the second then Intel should have even more problems since MCM solutions use more power than monolithic cores."

1. Dual-die is a "better" solution for the manufacturer because it's easier to get high-clock parts. It is, however, a worse solution for multithreaded performance. No matter how high your clocks are, going off-die will severely limit the bandwidth and increase the delay.

2. Intel doesn't even want to tell you the max TDP its MCMs are going to reach. Intel specifically says in its processor design guide that the TDP is not a max.

enumae said...

Greg
Enumae, while gaming benchmarks are relatively misplaced, they do place actual demand on these chips, and are a standardized way to see how they handle load.

While you bring up a good point correlating video performance to functionality, what percentage of the market do you think uses IGPs for CAD or other graphics-intensive applications?

I would lean towards a very small percentage; also, people in these fields understand technology, and they themselves understand that an IGP is not for them.

(PS: I feel sorry for you if you have to use a program like SolidWorks with an IGP; as I use AutoCAD 2007 about 8 hrs a day, I understand the importance of a DGP.)

Ho Ho said...

abinstein
"Direct access to system RAM?"

Yes, in the sense that it doesn't need to go through the CPU (and FSB).

abinstein
"So the system RAM is dual-ported and magically a DDR2-533 can offer 1GB/s bandwidth?"

No, it just uses the MC sitting on the same die. Also, DDR2-533 has a theoretical peak bandwidth of ~5.33GB/s, 10.66 in dual channel.

abinstein
"Prefetcher is not going to help bandwidth, but delay."

They can help with bandwidth too when they don't fetch the wrong stuff.

abinstein
"Basic computer architecture tells us that nothing scale linearly"

Then why did that SSE2 thingie scale superlinearly? Or are you claiming that is not a real-world application?

Of course there aren't many programs that spend >95% of their time in SIMD units, but it certainly isn't a synthetic benchmark.

Expect to see some game from those guys by summer this year. They already bought 2P 8-core beast for developing it.

Ho Ho said...

Sorry, I stated the wrong number for the 533MHz memory. I read it as PC5300 and gave the number for that memory. PC4200 (who the hell uses that kind of thing?) has a bandwidth of around 4.26GiB/s per channel.
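For anyone checking the arithmetic (standard 64-bit DDR channels assumed):

    # DDR2 peak bandwidth: transfer rate (MT/s) x 8 bytes per 64-bit channel.
    def ddr2_gb_s(mt_per_s, channels=1):
        return mt_per_s * 1e6 * 8 * channels / 1e9

    print(ddr2_gb_s(533))     # DDR2-533 / PC2-4200: ~4.26 GB/s per channel
    print(ddr2_gb_s(533, 2))  # dual channel: ~8.5 GB/s
    print(ddr2_gb_s(667, 2))  # DDR2-667 / PC2-5300 dual channel: ~10.7 GB/s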

Roborat, Ph.D said...

greg said: ".. maybe you can quit pulling the double standard here by actually giving an argument to explain why AMD and ATI can't pull a first to market with its cpu and chipset combos? This just sounds like you're saying that Intel will always be ahead without giving any real argument or justification..."

You need to read again what I said. I never said anything about AMD not being first to market with their OWN products. I was referring to Intel's CPU+chipset in the context of Scientia's suggestion that AMD will eat away at Intel's chipset share. There is no way AMD can be first to market to compete against all Intel-based systems, just as I don't expect Intel to be, from now on, for AMD systems.
I can't believe you're asking for proof when it's always been the case. You just need to look at what major OEMs use for first-of-a-kind systems: Intel's chipsets. It's the only way for them not to be left behind.
And this goes to my original point: how can AMD take this market share from Intel? Impossible.

abinstein said...

Ho Ho: "Yes, in sence that it doesn't need to go through the CPU (and FSB)
...
No, it just uses the MC sitting in the same die. Also, DDR2-533 has theoretical peak bandwidth of ~5.33GB/s, 10.66 in dualchannel."


Dual channel or not, the memory controller is the bottleneck here, not the HT link. The IGP on either the Intel or AMD platform gets strictly less than 8GB/s of memory bandwidth, the limit imposed by PCIe x16. In fact, they aren't going to get anywhere near that, or they will 1) starve the processor (down to 2GB/s, or about DDR2-266), and 2) saturate the CPU-GPU communication.

Ho Ho: "
abinstein
"Prefetcher is not going to help bandwidth, but delay."

They can help with bandwidth too when they don't fetch the wrong stuff."


Wrong, wrong, wrong... you are totally wrong. Prefetchers do not help bandwidth. Prefetching or not, you still have the one compulsory miss, on which bandwidth has to be spent for the memory access. You waste bandwidth if you prefetch wrongly, but you gain no bandwidth back if you prefetch correctly.

A prefetcher helps reduce the compulsory miss rate. It'll be more helpful if you have a large cache, because only then are your compulsory misses a significant enough fraction to be worth reducing.

You should re-read some computer architecture textbooks before continuing this discussion.

Ho Ho: "Then why did that SSE2 thingie scale superlinearly? Or are you claiming that is not a real-world application?

Of cource there aren't many programs that spend >95% of their time in SIMD units but it certainly isn't synthetic benchmark."


SSE2 is more than just 2x the ALUs of MMX: a different number of registers, different memory alignment, different instructions, etc., lots of things. But if you simply double the number of SSE2 ALUs, I guarantee your speedup will be strictly less than 2x.

Ho Ho: "Expect to see some game from those guys by summer this year. They already bought 2P 8-core beast for developing it."

The "2P 8-core beasts" are no more better than any 4P 8-core workstation/server. They are memory starved and pathetic.

Ho Ho said...

abinstein
"The IGP in either Intel or AMD platform has strictly below 8GB/s memory bandwidth imposed by PCIe x16"

Excuse me, but what does PCIe have to do with anything? On Intel, the RAM is attached directly to the NB's MC; on AMD it is connected to the MC inside the CPU, which is connected to the IGP over an HT link. Where do you see the PCIe?

abinstein
"In fact, they aren't going to get anywhere near that or they will 1) starve the processor (to 2GB/s or about DDR2-266), 2) saturate the CPU-GPU communication"

What makes you think that? IGPs have shown fillrates of over 2Gpix/s; that means at minimum 8GiB/s.

abinstein
"Prefetcher helps reduce compulsory miss rate. It'll be more helpful if you have large cache size, because only then your compulsory miss is relatively significant enough to be reduced."

What I meant with that is that good prefetchers are better than the ones that fetch wrong stuff. I think I worded it a bit badly before.

abinstein
"SSE2 is more than 2x ALU than MMX. Different # of registers, different memory alignment, different instructions, etc., lots of things. But if you simply double the number of SSE2 ALU, I guarantee your speedup will be strictly less than 2x."

I'll try to explain again:
I had a program that ran almost exclusively on SSE2. On a 3.8GHz P4D it was running at X FPS. I changed to a 3.1GHz Core2 and the same program ran roughly 2x faster. Scaling to the same clock speed, it is around 2.23x faster. How much of that comes from twice the SSE ALUs, and how much comes from a generally better core?

Also, cache wasn't really helping, at least not on the C2D. On the P4D there was 2MiB of L2 per core; on the Core2 I had 2MiB of it shared between the two cores.

abinstein
"The "2P 8-core beasts" are no more better than any 4P 8-core workstation/server."

Yes, but it is considerably cheaper. Also, to get the same CPU power from an AMD dual-core based system they would have had to get an 8P machine. Every core on Core2 is twice as fast as one K8 core. Of course Barcelona, with its full-width SSE units, will be about as fast as Core2.

abinstein
"They are memory starved and pathetic."

Ray tracing doesn't really need all that much memory bandwidth. A shared 6-8GiB/s over 8 cores is enough for it, and last I heard they got their program to scale linearly from 1 to 8 cores. RT scales perfectly with added computing power; it has been shown to scale linearly up to 128-CPU clusters connected by a 100Mbit link.

abinstein said...

Ho Ho: "On Intel RAM is directly attached to NB MC, on AMD it is connected to MC inside CPU that is connected with IGP over HT link. Where do you see the PCIe?"

IGP is connected to the chipset via PCIe. IGP does not connect to MC directly.

Ho Ho: "What makes you think that? IGP's have shown fillrate of over 2Gpix , that means at minimum 8GiB/s."

I have no idea how that's possible. Let's say you want 100fps on a 1600x1280 screen. That would be roughly 200Mpix/s, or 800MB/s for 32-bit pixels, still strictly less than 1GB/s. But how often do you achieve 100fps at 1600x1280 with an IGP?

Ho Ho: "I had a program that almost exclusevly run on SSE2. On 3.8GHz P4D it was running at X FPS. I changed to 3.1GHz Core2 and the same program ran roughly 2x faster."

First, you can't compare P4D and Core2 at the same frequency. P4D has a deep pipeline and reaches high frequencies much more easily than Core2, whose better performance partly comes from its lower frequency.

Second, if your app is multithreaded, then Core2 will outperform P4D even with the same amount of SSE, because the former is a native dual core.

Third, Core2's cache latency is much lower than P4D, even with the same size.

Fourth, Core2's memory loads/stores are more advanced than P4D's, and its memory bandwidth is actually better than P4D's.

All of the above contributes to C2D's better performance, much more than the extra SSE resources.

Ho Ho: "Yes but it is considerably cheaper. Also to get same CPU power from AMD dualcore based system they would have had to get 8P machine. Every core on Core2 is twice as fast as one K8 core."

Okay, so an 8-core, 2P C2Q desktop is cheaper than an 8-core, 4P K8 server; big deal. But the 8-core C2Q will have worse performance for server/workstation workloads. For most desktop apps, a 2P C2Q would outperform a 1P C2Q by less than 30%, mostly in the range of 15%. This is what I call pathetic.

Then, your idea of Core2 being twice as fast is total bullshit. For SSE-intensive programs, Core2 is almost 2x as fast as K8. Core2 is also faster than K8 when it comes to media compression, AI and path finding. Yet for many other workloads, such as XML processing, high-order maths, cryptography, string matching, compilation, simulation, and source coding, K8 is either as fast or slightly faster than Core2.

But you won't care, and probably don't know any of this, because nowadays the enthusiast websites only benchmark rendering, games and media transcoders, and no matter how many programs they use, these all fall into the small subset of { SSE, AI / path finding, compression }. Do you really think computers are sold and bought mostly for these few purposes?

In conclusion, I have to admit that Intel has strong (and nasty) marketing tactics. Intel produces the compiler, sponsors the benchmarks, funds the projects, fuels the websites, and in the end brainwashes the general public.

Ho Ho: "Ray tracing doesn't really need all that much memory bandwidth. Shared 6-8GiB/s over 8 cores is enough for it and last I've heard they got their program speed to scale linearly from 1-8 cores."

Your point seems to be that the main advantage of those 8-core C2Q systems is cost-saving.

Scientia from AMDZone said...

Well, I see a lot of statements and different points of view. The amazing thing is that even with so much disagreement most of the things being said are correct.

Roborat

Your statement that Intel should be able to just put a GPU on the CPU die is not correct. AMD can do this because the CPU includes a crossbar, which Intel CPUs lack. The only fast way for Intel to hook up a GPU on-die would be to put a PCIe controller on the die, and this would eat up more space. However, when Intel adds CSI they could probably just add PCIe as an extra port. It would be a bit sloppy, but it would work. The beauty for AMD of going through the crossbar is that the GPU can access memory without affecting the CPU; in effect, the GPU becomes just another core. In contrast, a GPU using PCIe would still need some driver code to work, and this would take some CPU time.

I'm not saying that AMD will take Intel's chipset share; not at all. I'm saying that AMD will take share by supplying both AMD processors and chipsets. In other words, an OEM will use an AMD motherboard solution instead of an Intel one. This should happen in both the commercial integrated desktop and integrated mobile markets.

I see AMD's Intel-compatible chipsets as important, but I don't believe they will have much effect on Intel's sales. I think they are important in terms of flexibility and building stronger relationships with OEMs.

Greg and Enumae

You are both right about CAD and IGPs. Obviously, a dedicated CAD department is going to have equipment suitable for its work, which would be DGP. However, the point is that the bulk of systems bought by any company will be IGP (lowest cost). And AMD's IGPs seem to offer better functionality than Intel's current IGPs. In other words, AMD's chipsets could still handle some lighter graphics work that Intel's could not. Given the normal replacement cycles of commercial equipment, I'm thinking this will be important.

However, this will not by any means take huge share from Intel as AMD will still be capacity limited. But, it should keep AMD profitable and growing.

Ho Ho and Abinstein

Ho Ho, you are overestimating the memory access demands of IGP. It is less than 2GB/sec, not 8GB/sec.

Prefetching does tend to reduce latency and not increase bandwidth. However, prefetching could increase bandwidth if there were unused memory access cycles that were filled by prefetching. Likewise, prefetching can decrease bandwidth if cycles are used to fetch something that isn't used.
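A toy illustration of that point (a deliberate simplification, not a real memory model): with a perfect prefetcher the stalls shrink but the bytes moved stay the same, while a wrong prefetch moves extra bytes for nothing.

    # Toy model: N cache-line reads, LAT cycles of memory latency each,
    # WORK cycles of compute per line.
    LINE, LAT, WORK, N = 64, 100, 20, 1000

    def run(prefetch_ahead, wasted_prefetches=0):
        stall = max(0, LAT - prefetch_ahead)          # latency not hidden by prefetch
        cycles = N * (stall + WORK)
        bytes_moved = (N + wasted_prefetches) * LINE  # useful lines + mispredicted ones
        return cycles, bytes_moved

    print(run(0))        # no prefetch:      (120000, 64000)
    print(run(80))       # good prefetch:    (40000, 64000) same traffic, fewer stalls
    print(run(80, 200))  # wrong prefetches: (40000, 76800) extra bandwidth wasted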

MCM does tend to increase chip speed since both halves can be matched separately. However, the MCM uses more power and decreases FSB speed.

The difference in memory access between Barcelona and Clovertown is also interesting. With all four cores on the same die, every memory request is sorted and queued for the most efficient access pattern by the IMC. In contrast, the two separate dies of Clovertown will fire off uncoordinated memory requests that tend to flog the memory controller and take access hits from page switching. I'm certain that the NB memory controller does some optimization, but this comes at the expense of latency.

The numbers for SSE with Barcelona have been confusing lately and this was not helped when it was misstated in the interview in France. Let me see if I can clarify. The increase per core on SSE is 2X, not 1.8 or 1.6.

However, Barcelona adds two more cores and it is well known that typical tasks incur some overhead when split. AMD has typically estimated this as 1.8X for doubling cores. In the interview, the figure was more conservative and stated as 1.6X. However, he misstated it by associating this figure with SSE width doubling instead of core doubling as he should have. Now, if you look at Cray's estimates for optimized code, they give Barcelona's SSE speed as 4X that of dual core Opteron.

The 1.6X has nothing to do with memory bottlenecks. It is simply a rough estimate based on overhead incurred by dividing a task among four cores. In terms of actual memory access, AMD should be much better off with Barcelona than Intel is with Clovertown.
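
Putting that arithmetic in one place, here is a rough sketch (the 2X, 1.8X and 1.6X figures are the estimates discussed above, not new measurements):

per_core_sse_gain = 2.0      # 128-bit SSE units: 2X per-core throughput for K10 vs. K8
typical_core_scaling = 1.8   # AMD's usual estimate for doubling the core count
conservative_scaling = 1.6   # the more conservative figure given in the interview

print(per_core_sse_gain * typical_core_scaling)   # ~3.6X over dual-core Opteron
print(per_core_sse_gain * conservative_scaling)   # ~3.2X at the low end
print(per_core_sse_gain * 2.0)                    # 4X, matching Cray's estimate for optimized code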

Given the changes between K10 and K8 we should see a pretty dramatic jump in memory access speed and bandwidth between K8 and K10.

Ho Ho said...

abinstein
"IGP is connected to the chipset via PCIe. IGP does not connect to MC directly."

Have you got any idea how fast that PCIe link is supposed to be? Also, are you sure they use in-chip PCIe?


abinstein
"I have no idea how that's possible"

If you knew how rasterizing on GPU's work you'd understand. Basically there is much more to it than overwriting pixels in framebuffer.


abinstein
"Lets say you want 100fps on 1600x1280 screen. That would be roughly 200Mpi/s, or 800MB/s for 32-bit pixels, still strictly less than 1GB/s"

What about zbuffer, stencil buffer, blending and reading other textures and vertex data? Only writing solid pixels to the colour/z/stencil buffer would double the needed bandwidth to 1.6GiB/s. Adding in blending you've almost tripled it, and that still doesn't count reading data from other textures and vertices. Add in a full-screen post-processing effect and you again add at least 2*width*height*4 bytes of bandwidth for each and every pass over the framebuffer, assuming you won't touch anything but the colourbuffer. That is around 16MiB per frame of additional needed bandwidth at that 1600x1280 resolution.
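
As a rough sketch of that arithmetic (illustrative figures only; real GPUs add caching, overdraw and texture traffic on top):

w, h, fps = 1600, 1280, 100
pixels_per_second = w * h * fps              # ~205 Mpix/s

colour_write  = pixels_per_second * 4        # 32-bit colour write,   ~0.8 GB/s
depth_stencil = pixels_per_second * 4        # Z/stencil write,       ~0.8 GB/s
blend_read    = pixels_per_second * 4        # blending reads colour, ~0.8 GB/s
post_pass     = pixels_per_second * 2 * 4    # one full-screen pass,  ~1.6 GB/s

total = colour_write + depth_stencil + blend_read + post_pass
print(total / 1e9)   # ~4.1 GB/s before any texture fetches, AA or AF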

Please note that no AA and AF is counted in those computations. A single 16x AF texture fetch can touch tens of texels in a texture, all of which need to be inside the GPU to get the real colour value. Bilinear filtering accesses four texels per sample, trilinear filtering twice that from two different texture mipmaps.

This should be the most basic knowledge known by anyone who claims to know anything about rasterizing and/or GPU's.

Of course newer games usually aren't fillrate limited but shader limited, but the basic point remains that IGP's can use a lot of bandwidth, a lot more than 1-2GiB/s.


abinstein
"Second, if your app is multithreaded, then Core2 will outperform P4D even with the same amount of SSE, because the former is a native dual core."

That program scaled perfectly linearly from one core to two cores on both P4D and Core2. It also continued scaling to 4 and 8 cores. Being native has absolutely no effect whatsoever for this program. Of course you can say this program is an exception; most programs don't scale that well.


abinstein
"Third, Core2's cache latency is much lower than P4D, even with the same size."

How much lower? IIRC, C2 had even shorter latency to L1 than P4 did. Also C2 had half the total cache of P4D.


abinstein
"Forth, Core2's memory load/store are more advanced than that of P4D, and its memory bandwidth is actually better than P4D's."

Yes but as I said ray tracing is very light on bandwidth. I wouldn't expect big gains from better bandwidth.


abinstein
"Okay, so an 8-core, 2P C2Q desktop is cheaper than an 8-core, 4P K8 server, big deal; but the 8-core C2Q will have worse performance for server/workstation workloads"

I know that but that wasn't the usage scenario I was talking about. I was talking about one specific program. In current context that 2P 8core thing beats anything else under the Sun.


abinstein
"For most desktop apps, 2P C2Q would outperform 1P C2Q by less than 30%, mostly in the range of 15%. This is what I call pathetic."

Whose fault is it that most programs don't scale that well when going from 1->2->4->8 cores? Certainly not Intel's.


abinstein
"Then, your idea of Core2 twice as fast is total bullshit"

Numbers I saw proved otherwise. You have access to the program, prove me wrong.


abinstein
"For SSE-intensive programs, Core2 is almost 2x as fast as K8"

Not almost, it really was 2x faster, mostly thanks to double-width SSE.


abinstein
"Yet for many other workloads, such as XML processing, high-order maths, cryptography, string matching, compilation, simulation, and source coding, K8 is either as fast or slightly faster than Core2."

How many of those listed things are a bottleneck in your regular desktop application?

Can you prove that K8 is faster at compiling? I can give you this link to think about.

If you don't want to bother yourself reading the topic then at least read these two posts:
e6600
4800+

Both are running 64bit Gentoo and use same version of GCC. Only difference is minor version number of the compiled kernel. On Core2 the compile takes ~37% less time.


abinstein
"But you won't care, and probably won't know these at all, because nowadays the enthusiast websites only benchmark rendering, games and media transcoders, and no matter how many programs they use, these all fall into the small subset of { SSE, AI / path finding, compression }. Do you really think that computers are sold/bought mostly for these few purposes?"

Yes, the vast majority of CPU's are sold in desktop PCs. What kind of other CPU intensive applications did you have in mind? Or do you claim that most people run XML processing, high-order maths, cryptography, string matching, simulation, and source coding daily? Also, do you know any applications that use a lot of XML processing and are used daily on most PCs?

And yes, I don't care about those other things because I don't use them that much and when I do they aren't CPU limited. I don't care that Power5+ and Itanium have far superior FP performance and memory bandwidth compared to most other CPUs since I can't use them for most things I like to do. Why should I care about things that don't concern me?


abinstein
"In conclusion, I have to admit that Intel has strong (and nasty) marketing tactics. Intel produces the compiler, sponsors the benchmarks, funds the projects, feuls the websites, and in the end mind-washes the general public"

FYI, that ray tracer was compiled with GCC. ICC can't reach the performance of GCC, especially under 64bit. MSVC is far, far behind both. Also most of what I think about Core2 I've got from personal experience.


abinstein
"Your point seems to be that the main advantage of those 8-core C2Q systems is cost-saving."

They bought what gave them the most bang per buck. That thing cost them ~€3200. For that money they couldn't have bought enough CPU's for an 8P Opteron-based box. The Opteron box would have needed at least 4GiB of RAM (512M per socket) whereas they only used 1GiB for the 2P UMA box.


scientia
"In contrast, a GPU using PCIe would still need some driver code to work and this would take some CPU time."

Are you sure? Even standalone GPU's can directly access system memory without ever bothering CPU.


scientia
"Ho Ho, you are overestimating the memory access demands of IGP. It is less than 2GB/sec, not 8GB/sec."

I'm not saying it demands so much, I just say they are capable of it and benefit from it.

abinstein said...

Ho Ho: "Have you got any ideas how fast that PCIe link is supposed to be? Also, are you sure they use in-chip PCIe?"

PCIe x16 is 8GB/s. The IGP tunnels through the PCIe switch, of course, not over a physical PCIe wire.

Ho Ho: "If you knew how rasterizing on GPU's work you'd understand. Basically there is much more to it than overwriting pixels in framebuffer.
...
...
"


You were saying that IGP's are capable of a minimum of 8GB/s, or need > 10GB/s memory bandwidth, to get serious fillrate (2Gpix/s). I was telling you that it's totally false. First, dual-channel DDR2 memory only gives you 12.8GB/s bandwidth max. The IGP can't take 10GB/s of it without starving the CPU. Second, 2Gpix/s fillrate is simply not needed unless you have a screen size 10 times larger than 1600x1200, irrespective of how much computation power it requires. Third, every graphics API call goes through the CPU as well as the GPU - you can't assume memory accesses in those calls touch GPU-memory bandwidth but not CPU-memory bandwidth; in the end it is a memory controller bottleneck. Fourth, you're never going to get 100fps @1600x1280 with any of today's IGPs, but less than 1/10 of it; if you do all the extra effect processing, expect something less than 2fps, or 1/50 of what I described. Thus, your IGP will never have the opportunity to use that much memory bandwidth.
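
For reference, the 12.8GB/s figure is just the theoretical peak of dual-channel DDR2-800:

transfers_per_second = 800e6    # DDR2-800 runs at 800 MT/s
bytes_per_transfer   = 8        # each channel is 64 bits wide
channels             = 2

peak = transfers_per_second * bytes_per_transfer * channels
print(peak / 1e9)               # 12.8 GB/s, shared between the CPU and the IGP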

Ho Ho: "This should be the most basic knowledge known by anyone who claims to know anything about rasterizing and/or GPU's.

Of cource newer games aren't usually fillrate limited but shader limited but the basic point remains that IGP's can use a lot of bandwidth, a lot more than 1-2GiB/s."


No, IGP's don't take a lot more than 2GB/s. They can have memory bandwidth usage on the order of 2GB/s, though.

Ho Ho: "abinstein
"Second, if your app is multithreaded, then Core2 will outperform P4D even with the same amount of SSE, because the former is a native dual core."

That program scaled perfectly linearly from one core to two cores on both, P4D and Core2. It also continued scaling to 4 and 8 cores. Being native has absolutely no effect what so ever for this program."


So what's your point? Using one single program to prove that C2D SSE scales superlinearly against P4D SSE? How do you scale across different microarchitectures, different generations, and different number of SSE units? Stop this mumbling, will you?

Ho Ho: "How much lower? IIRC, C2 had even shorter latency to L1 than P4 did. Also C2 had half the total cache of P4D."

Cache performance does not only depend on cache size, but also cache latency and hit rates. C2D has better cache than P4D, period.

Ho Ho: "abinstein
"Forth, Core2's memory load/store are more advanced than that of P4D, and its memory bandwidth is actually better than P4D's."

Yes but as I said ray tracing is very light on bandwidth. I wouldn't expect big gains from better bandwidth."


"Advanced" load/store is different from higher bandwidth. Out-of order, speculation, latency, efficiency, all add up to it. Again, Core2's memory load/store is better than P-D's, and this contributes to its better program performance, too, SSE or not.

Ho Ho: "Whose fault it is that most programs don't scale that good when going from 1->2->4->8 cores? Certainly not Intels."

It's certainly Intel's. The difference between multithreaded and multiprocess programs is that the former are more fine-grained, in that threads may synchronize/communicate often. With MCM and a starved FSB, any inter-thread communication is bandwidth limited and incurs long latency.

Ho Ho: "Numbers I saw proved otherwise. Yoy have access to the program, prove me wrong."

As I said, the "programs" you talk about only do { SSE, compression, AI, and path-finding }, exactly the things that C2D does better than K8. Eve you run 1000 different programs of those types, they still do only the things above.

Ho Ho: "Can you prove that K8 is faster at compiling? I can give you this link to think about.

If you don't want to bother yourself reading the topic then at least read these two posts:
e6600
4800+
...
"

First, they compile different kernel versions. Second, I think the C2D is compiling a 32-bit target. Third, their memory bandwidths are much different. Fourth, the C2D is obviously a much newer machine and could be using a faster HD, which we don't know.

The conclusion is that this comparison is totally worthless. If you want to prove that C2D performs compilation 37% faster than K8, you have to use something else.
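
A minimal sketch of what a controlled comparison could look like - identical source tree, identical .config, identical GCC version and flags on both machines; the source directory and job count below are hypothetical placeholders:

import subprocess, time

def timed_kernel_build(src_dir="linux-2.6.20", jobs=2):
    # clean first so both machines compile exactly the same amount of work
    subprocess.run(["make", "-C", src_dir, "clean"], check=True)
    start = time.time()
    subprocess.run(["make", "-C", src_dir, "-j%d" % jobs], check=True)
    return time.time() - start

print(timed_kernel_build())   # run on both boxes and compare wall-clock seconds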

Ho Ho: "Or do you claim that most people run XML processing, high-order maths, cryptography, string matching, simulation, and source coding daily?"

Many business applications use XML internally for data exchange and representation. In fact, those PCs that do not need XML (enthusiast or home) are either niche market or low-end market.

Cryptography is used everyday on all browsers and most network transfer. String matching is about the most basic and important operation that a computer does. The others are not used daily by average Joe, but they are very important for a server/workstation market much larger than the gaming one.

Ho Ho: "And yes, I don't care about those other things because I don't use these things that much and when I do they aren't CPU limited."

You actually do use them, just don't realize what you're using (or more accurately what your computer is doing).

Ho Ho: "FYI, that ray tracer was compiled with GCC. ICC can't reach the performance of GCC, especially under 64bit. MSVC is far, far behind both. Also most of what I think about Core2 I've got from personal experience."

Funny for you, who do not care about XML or cryptography or string matching, to like ray-tracing so much - how applicable is this to everyday computing? Getting 2fps on your games?

Ho Ho: "[2P C2Q] cost them ~€3200. For that money they couldn't have bought enough CPU's for 8P Opteron-based thing. Opteron box would have needed at least 4GiB of RAM (512M per socket) whereas they only used 1GiB for the 2P UMA box."

I have no idea why you want to compare the price of a 16-core 8P Opteron with that of an 8-core 2P Clovertown. The fact is, 2P C2Q market is as niche as AMD's 4x4, while multisocket Opteron is sold in millions.

Unknown said...

Hoho, you need to quit assuming a market is niche just because you can't imagine a use for it. This is inversely like what Airbus is doing with the A380 (and why it is failing). You have to look at numbers to prove the market, not use your imagination. Like Abinstein said, the 4P Opteron systems have been sold in the millions. The dual socket quad core machines from Intel have as well. Thus, the 4P Opteron is not niche.

You also need to step back a bit when you try to talk about what type of power is useful. While you may have some sort of insatiable appetite for rasterizing, chances are, no one else who posts here or reads this does (j/k, but you probably do reflect a fairly small market).

Also, I don't know who hijacked your account, but Core2 is not twice as fast as K8 in anything. On a core-for-core, clock-for-clock basis, there is an average 10-18% advantage on Intel's side. I don't know when you started thinking otherwise, as you seemed to agree with this before, but seriously...

Also, referring back to my Solidworks point. I used Inventor on my high school's computers, and it ran fine on three-year-old Matrox cards. But our teacher's brand new laptop with "high end" Intel integrated X3000 graphics couldn't take it. At my current school, we use either Nvidia 6150s or discrete 7300s. The 7300 machines are few and far between, were more expensive, and can't be placed in every room due to their size and the need to have as many computers in one room as possible. Considering that every state in the US has about 2-4 large universities, each with anywhere from 500-10,000 engineering students, and how limited their budgets normally are, an integrated solution able to address those needs is logically extremely lucrative. Intel cannot address that; only AMD and Nvidia can. For the foreseeable future, that's the way it'll stay.

Now, I know I'm not looking at the big picture. There are hundreds of times more high schools per state with pre-engineering programs, many other applications taught throughout our budget-ridden country that cannot be run on integrated graphics, and many business applications for all of the above that I haven't mentioned either. But I hope I give you some idea of why being able to afford to give more people access to these programs might be something people are interested in, and thus something that could be a big deal.

enumae said...

Greg
Also, referring back to my Solidworks point...

You make some great points, but looking up some numbers, "engineering graduate students are up 27 percent, to 127,375". Now take that number and divide it in half to account for the number of computers actually used for educating those students, since they don't all get their own; let's say roughly 65,000.

Now take that number and put it into perspective: there were about 250 million computers sold around the world in 2006, and about 1/5th of those were sold in the US, so that 65,000 is only about 0.13%.

Just for an example, let's take that 65,000 and multiply it by 10; now we have 650,000 computers, or 1.3% of the market, still not enough. So I do not feel it will be a big deal in terms of money or market share for AMD.
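
Spelling out that arithmetic (same numbers as above):

engineering_students = 127375
machines = engineering_students // 2          # roughly 65,000 shared machines
us_pcs_2006 = 250000000 // 5                  # about 50 million PCs sold in the US in 2006

print(100.0 * machines / us_pcs_2006)         # ~0.13% of the US market
print(100.0 * machines * 10 / us_pcs_2006)    # ~1.3% even at ten times the estimate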

enumae said...

Scientia

What's your take on... "Approve an amendment to our Restated Certificate of Incorporation to increase the number of authorized shares of Common Stock from 750 million to 1.5 billion shares."

Link

Also, while looking at the 10-K SEC filings, Contractual Obligations for both Intel and AMD...

AMD = $8.977 Billion

Intel = $9.856 Billion

Could you elaborate on what this means for both AMD and Intel.

Woof Woof said...

I am not sure I am getting enumae's points or if he is deliberately obtuse.

If a person could buy an integrated graphics solution for less money that could handle those 3D graphics and better Aero implementation today, it is still a compelling value proposition.

It doesn't matter if they are engineering students or not. Heck, these days, even Media students use a lot of 3D animation in their work.

And if Vista is a draw for the upgrade, the choice IMHO is even clearer.

enumae said...

Woof Woof
I am not sure I am getting enumae's points or if he is deliberately obtuse.

I do not respond well to insult, but I would recommend you take a closer look at my comments before making such statements.

If you still fail to understand my point I will try and help make it clear.

enumae said...

Azary Omega
enumae, Woof does have a very good point and if you are trying to say that better performance of IGP's isn't a big deal - then you don't have a point.

I am not saying that a better IGP is not a big deal.

What I am saying is that for mainstream computers (not gamers or people who work with CAD), the better IGP will go unnoticed and will not reflect large market share gains.

Yes AMD has a platform now.

Yes AMD has better IGP's.

But as a whole, AMD's platform, when you factor in CPU performance is not going to be better than Intel for mainstream users due to K8.

I have tried to make it clear in earlier posts that because of this, AMD will not see a large penetration into corporate sales until its platform is better than Intel's (the release of mainstream K10).

CAD functionality on IGP in a non-educational environment is not really relevant; machines in this commercial space need productivity, hence the expensive video cards to achieve it.

To be clear...

1. A better IGP is a good thing.

2. A better CPU is a good thing.

3. IGP for education of engineering students is OK, but for commercial use it will require DGP.

4. A better platform (Intel's) is what will keep AMD from penetrating the enterprise segment until the release of mainstream K10's.

abinstein said...

enumae: "What I am saying is that for mainstream computers (not gamers or people who work with CAD), the better IGP will go unnoticed and will not reflet a large market share gains."

You got it backwards. IGP matters more for mainstream buyers who are most price sensitive and who notice mostly responsiveness from computers.

Very few enthusiast or CAD people use IGP; they buy high-end discrete. Very few engineers would care about IGP performance; they do simulations and maths and coding with little graphics.

AMD has a better performing IGP now, but that matters little at this moment because Intel bundles the chipset and IGP together and manufacturers may be given incentives to use Intel's chipset rather than nVidia's or AMD's.

Woof Woof said...

Actually if you bother looking at worldwide market share, DGP is a very small percentage of the installed base, ESPECIALLY FOR Commercial Clients, where it is overwhelmingly IGP.

So, based on your arguments then, the AMD platform with a better IGP and better overall SYSTEM performance (rather than just CPU performance) would weigh in better for commercial clients.

enumae said...

abinstein
IGP matters more for mainstream buyers who are most price sensitive and who notice mostly responsiveness from computers.

I understand that, but my comments are in regards to AMD's 690G chipset, AMD's platform and AMD's market share as a result of the release of this new chipset. If you were not in a graphically intensive segment you would be hard pressed to see a difference between Intel's and AMD's graphics capabilities, which is why I do not believe that AMD will gain a large amount of market share with their current platform offerings.

---------------------------

Woof Woof
Actually if you bother looking at worldwide market share, DGP is a very small percentage of the installed base...

First, I was staying in the context of my comments with Greg.

Second, DGP is a small percentage, as are the engineering/CAD segments, which were the original context.

...ESPECIALLY FOR Commercial Clients, where it is overwhelmingly IGP.

The commercial clients who do CAD and Engineering do not use IGP, not if they intend to be productive; again, in the context of CAD and Engineering, IGP is almost useless.

...the AMD platform with a better IGP and a better overall SYSTEM performance (rather than just CPU performance) would weigh in better for commercial clients.

If AMD has the better system performance it would weigh in better for commercial clients.

Woof Woof, I am left wondering if you have a point relevant to the original discussion?

Woof Woof said...

Enumae asked: "Woof Woof, I am left wondering if you have a point relative to which the original discussion was based?"

Enumae, commercial clients are by definition the machines used in office environments, and they are predominantly IGP based.

You were thinking of workstation clients, where it is virtually 100% DGP. Perhaps it is your frame of reference that needs to change.

In commercial clients, I have encountered installed bases of 1000s if not 10s of thousands, PER customer. And every dollar saved goes a long way in such an environment. This has traditionally been a huge market for Intel, and where they had traditionally not had much competition, as Scientia has maintained, time and time again.

Today, vendors like HP and Dell are "strongly encouraging" their customers to upgrade to a discrete graphics solution to meet Windows Vista compliance. This is a big change from the past when a basic 945G based motherboard would suffice.

Even at cost, an entry Vista compliant card would cost 30-50 bucks. And that's before HP/Dell add their margins on. Now factor that by a thousand or tens of thousands, and you see why a CTO might have headaches.

Now contrast this with an AMD based RADEON Express 1250 or even the earlier 1150 based solution which could handle Windows Vista, without the added cost of the DGP.

enumae said...

Woof Woof

This is going nowhere. If you truly have a point that has not been previously made by me, please make it; otherwise we already agree.

IGP is for the mainstream or enterprise segment.

DGP is for the workstation or the CAD and engineering commercial segment.

There is no debate here.

Today, vendors like HP and Dell are "strongly encouraging" their customers to upgrade to a discrete graphics solution to meet Windows Vista compliance.

Now, are you debating that AMD's 690G is better graphically when using Vista compared to Intel's G965 in the enterprise space, which is not graphically intensive?

I highly doubt that when using Excel or Word, browsing online, or doing any other enterprise segment task you would see a difference.

In regards to market penetration, I believe it will be an uphill battle for AMD, but like I have said we will know more in about a month when they release earnings.

Scientia from AMDZone said...

greg

C2D is twice as fast as K8 in SSE. K10 will fix this.

enumae

I don't see anything significant in the 10-K forms. Basically, you can say that AMD has a lot more debt and that not all of Intel's materials purchases are obligated. Intel will obviously purchase much more than the listed obligations.

The stock issue really isn't significant either other than the fact that it could raise some additional capital for AMD. Let's say that the issuance of stock reduces the price to $10 a share. Then issuing 750 Million shares would raise $7.5 Billion in new capital. This would be enough to erase AMD's debt. The new stock could be justified in the increased value of AMD after the purchase of ATI. However, it is unlikely that AMD would issue the full amount; a third would be sufficient in the near term and would have a less depressing effect on the stock price.
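
The arithmetic behind that, for illustration (the $10 share price is the assumption stated above, not a prediction):

assumed_share_price = 10.0      # assumed post-issuance price in USD
new_shares = 750e6              # the full additional authorization

print(assumed_share_price * new_shares / 1e9)        # $7.5 billion if the full amount were issued
print(assumed_share_price * new_shares / 3 / 1e9)    # ~$2.5 billion if only a third were issued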

I have to say though that I don't understand your use of terms:

will not reflect large market share gains.

AMD will not see a large penetration into corporate sales until its platform is better than Intel's (the release of mainstream K10).

I do not believe that AMD will gain a large amount of market share with their current platform offerings.


You keep using the term "large" and I have no idea why. I've said that I think AMD would gain an additional 2.5%. Do you consider this "large"? AMD will be capacity limited. It would be impossible for AMD to boost its share by any "large" amount.

is not going to be better than Intel for mainstream users due to K8.

This is obviously false. For anyone who does not use SSE extensively AMD's processor prices are comparable to Intel's.

better platform (Intel's) is what will keep AMD from penetrating the enterprise segment until the release of mainstream K10's.

This is obviously false as well. Again, most commercial users do not use SSE extensively. AMD's commercial sales have been growing twice as fast as their consumer sales. This has been taking place before the 690 and before K10; there is no reason to assume that this would stop. The logical assumption is that this would accelerate given that AMD now has its own chipset.

enumae said...

Scientia
...Do you consider this "large"? AMD will be capacity limited. It would be impossible for AMD to boost its share by any "large" amount.

When I say large, please keep in mind that this is just my perception of your comments, as well as others'.

These are from your original post/article...

1. "There is no doubt that AMD is deadly serious about this market and is diving in with a headstart over Intel."

2. "So, while Intel currently has the stronger position in terms of processors they will have to scramble to keep from having the chipset rug pulled out from under them in a determined and focused assault by AMD."

3. "However, I have little doubt that by 2009 ATI's share of the discrete graphics market will be much closer to nVidia's than it is now and ATI's share of the desktop and mobile integrated markets will be much closer to Intel's."

If I have misinterpreted this, or overlooked a previous post, I apologize.

--------------------

...The logical assumption is that this would accelerate given that AMD now has its own chipset.

We will see in a little over a month. I am unable to make my point clear, and am actually tired of trying.

Wise lnvestor said...

I just wanted to add something interesting I read on the INQ.

690G graphics overclocked

It's from Legit Reviews.

Enjoy!

Wise lnvestor said...

I also wish to make a point... A motherboard with a capable IGP could sell in the (hundreds of) millions in Asia, Latin America and Africa. When an inexpensive discrete GPU sells for $50 US, it means a small fortune for those people.

So naturally the potential is much, MUCH bigger there than over here.

Even with millions of systems being sold every year, there are still billions of people waiting to get their hands on their first PC.

The current war isn't just a price war; it's a war of scale.

It's called a VISION from the combined AMD/ATI.

That's why you see so many partners jump in!

Roborat, Ph.D said...

There are certain things I find erroneous in the discussion. I find the chipset and the business model surrounding it less dramatic than CPUs so I’ll just point out some paradigms without directly pointing to anyone.
1) It doesn't matter if the system has discrete or integrated GP; the motherboard will always have a chipset regardless of whether the integrated graphics is utilized or not. This is in response to some who believe that the two are separate markets, at least not for those who actually purchase them.
2) End user preference has little to do with chipset purchase decisions. The distance and the time difference are extremely large. Distance in terms of supply chain. End-user feedback affecting a MB maker’s decision to use a certain chipset is unheard of. Time difference – decisions to purchase X volume of chipsets occurs several months prior to system reaching end-user. Instead quantity of purchase is based on volume orders from OEM’s/channel who caters to an already defined demand (+/- a few % change).
3) Chipsets are designed for the system and not the other way around. Some here point out that IGP’s can’t be used for certain applications. I don’t see the point for argument as market segmentation isn’t divided by the chipsets but instead defined by the CPU and the discrete graphics used. Clearly some here are thinking the wrong way round. Performance on end user applications is never the reason why a chipset is bought and sold.
4) The chipset market isn't homogeneous. Intel/AMD creates the CPU and therefore creates the first chipsets that enable enhancements, i.e., BUS changes. This is the reason why Intel has a strong hold on its 50+% share of the IGP business. And now AMD/ATI can do the same on its platform. First to market is so vital in the computer industry that OEMs have no choice but to create systems based on an Intel/AMD platform rather than wait for other chipset makers (VIA, NVIDIA) to come up with their own version. Intel (and now AMD) will always have an advantage.

Woof Woof said...

enumae, I am not making this up. HP and Dell do strongly recommend their customers upgrade to a DGP for Windows Vista, even for their traditional commercial clients, who have thus far been using 945G and 965G based IGP chipsets. It doesn't matter if they run Excel or Word. It's to meet MS's recommended specifications for an optimal Windows Vista experience.

This is an added cost, a huge one for corporate accounts, at the end of the day.

The AMD solution doesn't require the DGP. It gives a better overall experience for Windows Vista, without incurring additional costs.

It's not rocket science, and I don't see how you can not understand that.

Ho Ho said...

abinstein
"PCIe x16 are 8GB/s. IGP tunnels through the PCIe switch, of course not a physical PCIe wire."

I wasn't asking how much bandwidth PCIe x16 has, I asked how fast the connection between the memory controller and the IGP is. Btw, that 8GB/s is actually 4GB/s in each direction. Though at least with dedicated GPU's, the readback rate is lower than the write throughput.


abinstein
"You were saying that IGP's capable of minimum 8GB/s, or needs > 10GB/s memory bandwidth, to get serious fillrate (2Gpix/s)."

I said "IGP's have shown fillrate of over 2Gpix , that means at minimum 8GiB/s.". I said that after you said that IGP's can't/won't use that much bandwidth since it would starve th CPU. And yes, to reach >2Gpix/s fillrate in non-synthetic benchmarks you will need more than 8GB/s memory bandwidth. If you can't understand why then read my previous posts, I think I explained it simply enough. If not then tell me to clarify the things you didn't understand.


abinstein
"First, the dual-channel DDR2 memory only gives you 12.8GB/s bandwidth max."

For others who might not know that is for DDR2 800, or PC6400.


abinstein
"IGP can't take 10GB/s of it without starving the CPU"

Depends on what the CPU has to do. Usually IGPs and CPUs don't need that much bandwidth. IGPs are more busy crunching shader programs or waiting for the CPU to run vertex shaders for them. Also please remember that current Intel quadcores manage quite nicely with a shared 1066MHz bus; that is just a little more than 2GB/s per core if you expect them all to use as much bandwidth as possible.
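
That per-core figure comes straight from the bus arithmetic (theoretical peak, ignoring protocol overhead):

fsb_transfers = 1066e6    # a 1066MHz FSB performs 1066 million transfers per second
bus_width     = 8         # the FSB is 64 bits (8 bytes) wide
cores         = 4

peak = fsb_transfers * bus_width
print(peak / 1e9)             # ~8.5 GB/s for the whole socket
print(peak / cores / 1e9)     # ~2.1 GB/s per core if all four try to saturate it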


abinstein
"Second, 2Gpix/s fillrate is simply not needed unless you have a screen size 10 times larger than 1600x1200, irrespecitve how much computation power it requires"

Yes, that high fillrate isn't usually needed but high memory bandwidth is quite important. Please reread my previous talk about different buffers, blending and other textures.


abinstein
"Third, every graphics API call goes through CPU as well as GPU - you can't assume memory accesses in those calls touch GPU-memory but not CPU-memory bandwidth"

Commands take an awfully small amount of memory capacity and bandwidth. A single moderate-sized texture (512x512) takes a MiB of RAM; a single drawing command takes a few tens of bytes.
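
That MiB figure assumes 32-bit texels and no mipmaps:

texels = 512 * 512
texture_bytes = texels * 4                   # 32 bits per texel
print(texture_bytes / (1024.0 * 1024.0))     # 1.0 MiB, versus a few tens of bytes per draw command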


abinstein
"Fourth, you're never going to get 100fps @1600x1280 with any of today's IGP, but less than 1/10 of it"

Yes, you won't. I have never said it would be possible to get framerates nearly that high, and I said why: (colour buffer size)*(framerate) != total memory bandwidth. Please try to understand that.


abinstein
"Thus, your IGP will never have opportunity to use that much memory bandwidth."

That is correct. Now can you tell me where I said that in normal situations it might use that much (>8GiB/s) bandwidth?


abinstein
"No, IGP's don't take a lot more than 2GB/s"

And you know that because ...?


abinstein
"So what's your point? Using one single program to prove that C2D SSE scales superlinearly against P4D SSE? How do you scale across different microarchitectures, different generations, and different number of SSE units? Stop this mumbling, will you?"

I was just showing that things can and do scale linearly. Of course not all things do, but it is possible.


abinstein
"Cache performance does not only depend on cache size, but also cache latency and hit rates"

Yes, and that was exactly what I asked: how much better latency and hit rates does C2D have compared to P4D? Instead of giving me a straight answer you simply state something without any proof. You seem to know the answer or otherwise you wouldn't be so sure about it. Please share your knowledge, I'd like to know it since I currently haven't compared the two.


abinstein
"Ho Ho: "Whose fault it is that most programs don't scale that good when going from 1->2->4->8 cores? Certainly not Intels."

It's certainly Intel's. The difference between multithread and multiprocess programs is that the former are more fine-grained, in that threads may synchronize/communicate often. With MCM and starved FSB, any inter-thread communication is bw limited and takes long latency."


Yes, I know it takes more time but that was not what I asked. Thread synchronization has been quite a big problem since multithreaded programming began. AMD simply lightened the performance hit a bit but it is still there. For the most part it is the fault of programmers who don't know how to write proper multithreaded applications.


abinstein
"As I said, the "programs" you talk about only do { SSE, compression, AI, and path-finding }, exactly the things that C2D does better than K8. Eve you run 1000 different programs of those types, they still do only the things above."


abinstein
"First, they compile different kernel versions."

This has a negligible effect, certainly not on the order of >50%.


abinstein
"Second, I think the C2D is compiling 32-bit target"

You are wrong.


abinstein
"Third, their memory bandwidth are much different"

... but that particular K8 had much better latencies. Also I haven't seen evidence that memory bandwidth is a big factor in compiling. It mostly consists of tree traversal, and latency has much more to do with it than raw bandwidth.


abinstein
"Fourth, the C2D is obviously a much newer machine and could be using a faster HD, which we don't know"

That has almost no effect, certainly not very big.


abinstein
"Conclusion is this comparison is totally worthless. If you want to prove that C2D performance compilation 37% faster than K8, you have to use something else."

It was you who said K8 is much better at compiling than C2D. You should be the one showing benchmarks, not me. If you have a decent K8 box you can always come up with some benchmarks yourself.


abinstein
"Many business applications use XML internally for data exchange and representation"

Yes, they do and almost all of them can run on PC's that were considered low-end years ago.


abinstein
"Cryptography is used everyday on all browsers and most network transfer"

Have you got any idea how much processing it takes to load a webpage that uses encrypted transfers? I'd say it is done in a couple of milliseconds on any >1GHz CPU.


abinstein
"You actually do use them, just don't realize what you're using (or more accurately what your computer is doing)."

I know my CPU does a lot of things and I know for a fact that the things you are talking about are not the bottleneck in almost any program.


abinstein
"Funny for you who do not care about XML or crytography or string matching to like ray-tracing so much - how applicable is this to everyday computing? Getting 2fps on your games?"

As I said, almost all CPU's are fast enough to run your basic applications. Ray tracing is something I deal with as a hobby and performance in that area is the most important thing for me personally.

As for games, I don't play much. In Serious Sam 2 I get around 30-40FPS at 1024x768 and highish settings on my 6600GT under Linux. That is good enough for me.


abinstein
"I have no idea why you want to compare the price of a 16-core 8P Opteron with that of an 8-core 2P Clovertown."

Do I really have to reiterate what I said? Read carefully: the most bang per buck for the thing that they are interested in. Going with AMD would have cost them about twice as much.


abinstein
"The fact is, 2P C2Q market is as niche as AMD's 4x4, while multisocket Opteron is sold in millions"

Have you got any numbers? I'd like to see the amount of 2P boxes sold vs 4P+ boxes.


greg
"Also, I don't know who hijacked your account, but core2 is not twice as fast as k8 in anything"

As was said, it is 2x faster in SSE and Barcelona will at least match it. In other things there is a 10-30% speed difference (it could go up to 100% when including SSE programs), depending on the application.


abinstein
"What I am saying is that for mainstream computers (not gamers or people who work with CAD), the better IGP will go unnoticed and will not reflet a large market share gains."

I agree. The vast majority of computers sold with IGP's don't run any 3D applications. You can step into some of your local businesses or simply look around at your own workplace to see it.

Also, an interesting fact is that in the company where I work we don't use any 3D applications either, but except for the servers there is no PC with an IGP; they all have dedicated low-end GPU's. You can get decent GPU's for <$40; even at roughly $20 they are quite good.


enumae
"I have tried to make it clear in earlier post that because of this, AMD will not see a large penetration into corperate sales until its platform is better than Intels (the release of mainstream K10)."

CPU performance isn't all that important for corporations. Smaller TCO has much bigger effect on sales.


woof woof
"In commercial clients, I have encountered installed bases of 1000s if not 10s of thousands, PER customer. And every dollar saved goes a long way in such an environment."

So it is. A good enough office PC costs <$700. Average salary per worker is how high? Would having a DGP improve their working efficiency enough to justify it?


woof woof
"Even at cost, an entry Vista compliant card would cost 30-50 bucks"

Why on earth would you need Aero Glass on an enterprise desktop? First, the additional cost is more than a low-end DGP. Second, it has been shown to lower productivity. Even if someone manages to prove they need Aero Glass, most Intel IGP's would be good enough for it.


scientia
"For anyone who does not use SSE extensively AMD's processor prices are comparable to Intel's. "

So it is, at least for low-end dualcores and before the next price drop (some time in March?). Then again P4D dualcores are dirt cheap also, not to mention singlecores and Celerons that are both more than enough for almost everything.


Sorry for the long post, I was offline for a while.

Scientia from AMDZone said...

roborat
1) It doesn’t matter if the system has discrete or integrated GP


Then I guess you have missed a major point. AMD will sell far more IGP systems than DGP systems; this is why Intel makes IGP and not DGP.

2) End user preference has little to do with chipset purchase decisions.

Which is why I've used the term "OEM" over and over.

Instead quantity of purchase is based on volume orders from OEM’s/channel

Yes, and as I've said, having ATI makes AMD based motherboards more attractive to OEM's.

3) Chipsets are designed for the system and not the other way around. Some here point out that IGP’s can’t be used for certain applications. I don’t see the point for argument as market segmentation isn’t divided by the chipsets but instead defined by the CPU and the discrete graphics used. Clearly some here are thinking the wrong way round. Performance on end user applications is never the reason why a chipset is bought and sold.

Well, not exactly. Not all chipsets have IGP and not all chipsets have the same I/O capabilities. There are varying reasons beyond graphics performance to prefer one chipset over another. However, I will say again that having an AMD branded chipset is more attractive to some vendors.

4) The chipset market isn't homogeneous. Intel/AMD creates the CPU and therefore creates the first chipsets that enable enhancements, i.e., BUS changes. This is the reason why Intel has a strong hold on its 50+% share of the IGP business.

Really? Perhaps you could name what bus changes have occurred since the introduction of sockets 775 and 939. You've had slight increases in clock speed which would have been within the original spec anyway. For example, I'm certain AMD has already published the specs on AM2 and AM3. I don't see a technical advantage for AMD versus its other partners. The advantage I see for AMD is that it can target whatever markets it feels need chipsets. This increases market coverage and again makes AMD based solutions more attractive.

And now AMD/ATI can do the same on its platform. First to market is so vital in the computer industry that OEMs have no choice but to create systems based on an Intel/AMD platform rather than wait for other chipset makers (VIA, NVIDIA) to come up with their own version. Intel (and now AMD) will always have an advantage.

You are thinking about this backwards because you are only thinking in terms of chipset versus chipset on the same platform. Intel is now at risk for having its motherboards replaced by AMD motherboards. System versus system, not chipset versus chipset. This is an important distinction because Intel actually loses the advantage that you think it retains by making chipsets in-house.

Scientia from AMDZone said...

enumae

Okay, I see your confusion. Think of it this way: Intel gains desktop share in the consumer market while simultaneously losing share in the commercial desktop market. Overall, Intel loses share. I'm saying that AMD's chipsets and gains in commercial share will ensure that it doesn't lose overall share even with Intel's advantage in C2D.

ho ho

Obviously, a 2P 8C system is cheaper than a 4P 8C system but it still remains that the 4P 8C system is better for some applications. In other words, it hurts Intel as much right now to not have a 4P Clovertown solution as it does AMD to not have a quad core Opteron solution.

abinstein wasn't saying that Intel 2P was niche; he said that 2P with quad core was niche which was true during 2006. This won't be niche by end of 2007.

Finally, AMD's prices are comparable for all dual cores, not just the low end models; AMD delivered the 6000+ at a competitive price. The only chips that aren't really competitive are the FX chips.

You do have a point about Celeron chips which is why AMD cut its percentage of Semprons in half. The last time AMD did this was in 2004 when it moved out of the low end market with K7 and allowed Intel to take more with Prescott Celeron. This suggests very strongly that AMD's K10 prices will remain high.

Unknown said...

Hey Scientia, I don't know if you caught this bit of news:

UPDATE - AMD says unlikely to meet revenue forecast

http://us.rd.yahoo.com/finance/external/reuters/SIG=11vg30000/*http://yahoo.reuters.com/financeQuoteCompanyNewsArticle.jhtml?duid=mtfh25617_2007-03-05_14-52-35_n05267637_newsml

I could probably take some time and go back through the old postings to see whether your prediction or mine was more accurate for AMD, but I don't really think that is necessary unless you want to argue the point that I was 100% right and you were 100% wrong.

Ho Ho said...

scientia
"Obviously, a 2P 8C system is cheaper than a 4P 8C system but it still remains that the 4P 8C system is better for some applications"

Yes, it is better for some, most likely not for most applications. I haven't said otherwise.


scientia
"In other words, it hurts Intel as much right now to not have a 4P Clovertown solution as it does AMD to not have a quad core Opteron solution."

Of course a 4P setup with C2 based CPU's would be a nice thing for Intel. Even with only dualcores it would be quite good. But still, 4P 8C is not that much worse than 2P 8C, especially for Intel since either way it has shared memory.

When an application works nicely on NUMA and needs a lot of bandwidth it would of course work better on AMD, no matter whether with single, dual or quad cores.

Not Penix said...

http://www.techreport.com/reviews/2007q1/quad-core/index.x?pg=13


Seems that NUMA hasn't made much of an impact under Vista

Scientia from AMDZone said...

real
UPDATE - AMD says unlikely to meet revenue forecast


Okay.

I could probably take some time and go back through the old postings to see whether your prediction or mine was more accurate for AMD

Have fun but I don't recall making any prediction about first quarter revenues versus AMD's forecast.

I was 100% right and you were 100% wrong.

Well, I don't think I can be wrong about something I didn't talk about. Did you make a prediction about AMD's forecast?

Scientia from AMDZone said...

not penix
Seems that NUMA hasn't made much of an impact under Vista


I can't tell. I would need to see benchmarks both with and without interleaving to evaluate Vista's NUMA ability.

Woof Woof said...

This is what I know.

Microsoft provides large Corporate accounts with Corporate licences which make it compelling to switch to Vista, and XP Pro licences translate directly to a Vista Business Premium licence. No additional cost to them.

One of the biggest changes in Vista is in the user interface aka Aero. Even if corporations aren't planning to switch to Vista today, they are planning to, so any current PC purchases today need to be Vista Premium Aero interface ready for investment protection.

And customers I know and talk to directly have two choices: opt for the Intel and upgrade to DGP (as recommended by their systems vendors) or buy the AMD/ATI solution.

Not all of them will switch, mind you. There is still a lot of resistance to switching to an underdog, but the value proposition does make sense to them.

Not Penix said...

http://www.hothardware.com/viewarticle.aspx?page=5&articleid=911


Keep in mind they didn't test on the final release of Vista, although I'm quite surprised at a 50% performance difference; anyone know why? Would it maybe need to be incorporated at the code level, much like the switch from single-threaded to multi-threaded apps?

enumae said...

Woof Woof
...so any current PC purchases today need to be Vista Premium Aero interface ready for investment protection....opt for the Intel and upgrade to DGP (as recommended by their systems vendors) or buy the AMD/ATI solution.

Well, if that's what you know, then you are not informed.

Looking on line, Legit Reviews shows an AMD 690G chipset scoring 3.0 in the Windows Experience Index.

AMD Athlon 64 X2 5200+
MSI K9AGM2 690G
2GB Corsair PC2-8888C4

Intel has scored 3.4 in the Windows Experience Index, here.

Core 2 Duo E6400
Foxconn G9657MA-8EKRS2H
1GB of 533MHz DDR2

"Microsoft says this about base score level 3.0 computers:

This level represents the value end of machines that will ship at the end of 2006 and into 2007. This is the lowest capability Windows Premium Logo PC that will ship with Windows Vista™ pre-installed. Windows Vista will generally enable Aero automatically on level 3 machines. Aero will perform quite well on level 3 machines with single monitors. With dual monitors (especially larger than 1280x1024), users may see noticeable performance issues from time to time, especially on machines with scores less than 3.5 and/or 128MB of graphics memory."


I have shown you real world systems from both ATI and Intel that are quite capable of using Windows Vista Ultimate (and Aero).

Do you want to continue saying Intel is not capable?

You don't need DGP unless you are in a graphics-intensive segment that truly requires it.

PS: You could save your customers money.

abinstein said...

enumae: "Looking on line, Legit Reviews shows an AMD 690G chipset scoring 3.0 in the Windows Experience Index.

Intel has scored 3.4 in the Windows Experience Index, here.


You are not comparing the 690G and G965 here. You are comparing the 3D capability (as Windows Vista assesses it) of the ATi RX1250 and the Intel GMA3000. There are a few problems with this approach to comparison.

First, Windows Vista doesn't have a good 3D business and gaming benchmark other than Aero. The fact that the X1250 scores only 3.0 in this area even when overclocked shows that Vista probably uses some internal heuristic that does not detect true 3D performance well (see the Legit Reviews article - same 3.0 score from Vista, but 11% better performance in HL2).

Second, systems in different articles can't be compared side-by-side. They use different CPUs, memory, FSB/HT speeds, all affect benchmark results.

Third, the Intel system doesn't show the detailed score list. If you want to compare Aero, that is the one you should be looking at. Intel's 3D performance score could be higher simply because it supports shader model 3 - something that's not prevalent outside high-end graphics apps, and thus pretty useless for integrated graphics.

That said, I believe one point that you made is correct, that Intel graphics/chipset should run Vista just fine.

abinstein said...

not penix: "Seems that NUMA hasn't made much of an impact under Vista"

There are two problems with your "seems." First, as Scientia said, you have to compare memory interleaving to be certain how much NUMA helps. Second, a multi-threaded software that is not programmed with NUMA in mind will not benefit from OS's NUMA support.

BTW, if you compare FX-74/6000+ and QX6700/E6700 you'd find that 4x4 scales better than C2Q. This is even true with these same old benchmarks.

abinstein said...

Ho Ho, you are like many Intel fan enthusiasts who live in your own world in terms of computer performance, and by living such a mental life you help Intel spread its marketing messages to the rest of (pretty ignorant) computer users/buyers.

If you want to compare compiling performance, get a gcc benchmark from SPEC and run it. It's pretty much independent of operating systems, but it'd be nice if you used exactly the same platform. Memory bandwidth is important because compiling has about the highest cache miss rate, and every miss consumes bandwidth. Memory access latency is not that big a deal if you have a large cache line with a basic prefetch (which easily gets 128B - or ~3 basic blocks - for each main memory delay).

If you think cryptography and XML performance are not important, then you are dead wrong. These are probably two of the most wanted things out there in the real (business, engineering, and scientific) world. AI, on the other hand, is really not that important at all.

Also, you're being totally unreasonable to compare the cost of a 16-core Opteron box with that of an 8-core C2Q, no matter how many times you repeat yourself. People buy 16-core 8P Opterons because they want a monster to run software that is high bandwidth and mission critical, none of which is targeted by dual C2Q, which is starved at its FSB, period. Intel's FSB is sweating heavily to keep up with a single C2Q, not to mention two of them.

Last, (this also to Scientia), when I say 2P C2Q is a niche market, I mean it is a niche market. I don't say it is niche by the end of 2007 - and why would I say that? When I say it is, I mean now.

savantu said...

Last, (this also to Scientia), when I say 2P C2Q is a niche market, I mean it is a niche market. I don't say it is niche by the end of 2007 - and why would I say that? When I say it is, I mean now.

1.8 million x86 servers were sold in Q4.
90000 were 4 socket and up.
1500 were over 8 sockets.

Intel holds +70% of the UP/DP market and +60% of the 4/8 socket market.

Above 8 sockets it is Intel-only territory; only the likes of IBM and Unisys have x86 servers that go over 8 sockets. Guess what, they are all Xeon based.

Woodcrest accounts for over 75% of the UP/DP Intel server share (that is around 900k servers vs. 500k for all types of Opteron).

Looking at the BS you spew out I can think of only 2 possibilities:

-you're disingenuous
-you're disingenuous and an idiot

Take your pick.

Ho Ho said...

abinstein
"If you want to compare compiling performance, get a gcc benchmark from spec and run it"

Exactly what benchmark are you talking about? I'd like to know the name of it so I could at least google for it.

From what I know there is no standard benchmark that measures the speed of compiling. There are benchmarks that measure the speed of the resulting program. Though that has absolutely nothing to do with compiling speed. So far I've seen people using Firefox and Linux kernels to measure compiling speed. Have you ever seen that mystical benchmark being used anywhere?


abinstein
"Memory access latency is not that much a big deal if you have large cache line with a basic prefetch (easily gets 128B - or ~ 3 basic blocks - for each main memory delay)."

For every cache miss, a 2.4GHz C2 has to idle for around 170 CPU cycles; a 2.4GHz AMD around half that. You said cache misses are quite frequent in compiling, so having larger caches doesn't really help all that much. But still it seems like C2 is a lot faster than K8. Got any theories why?

Also K8 has a 64-byte cache line size, the same as Core2. Where did you get that 128-byte cache line idea anyway?
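
Those cycle counts follow from converting memory latency into clock cycles; the ~70ns and ~35ns latencies below are illustrative assumptions, not measurements:

def stall_cycles(latency_ns, clock_ghz):
    # cycles lost per cache miss = latency in ns multiplied by cycles per ns
    return latency_ns * clock_ghz

print(stall_cycles(70, 2.4))   # ~170 cycles for a 2.4GHz Core 2 going over the FSB
print(stall_cycles(35, 2.4))   # ~85 cycles for a 2.4GHz K8 with its on-die memory controller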

Am I the only one who gets the feeling abinstein doesn't know half the stuff he is talking about? First it was the rasterizing memory bandwidth usage and compiling speed, now he is taking some other numbers out of thin air.


abinstein
"If you think cryptography and XML performances are not important, then you are dead wrong. These are probably two of the most wanted thing out there in the real (business, engineering, and scientific) world"

For the Nth time, please list some programs that are bound by XML processing and cryptography speed. I'd also like to know how much slower C2 is in them compared to K8. You seem to know this, so bringing examples shouldn't be difficult.


abinstein
"People buy 16 core 8P Opteron because they want a monster to run software that are high bandwidth and mission critical"

Other people want maximum CPU speed and they don't need all the bandwidth. There are different people with different needs.


abinstein
"Intel's FSB is sweating heavily to keep up with a single C2Q, not to mention two of them."

Depends on the application. There are lots of things that can get by just fine with the little bandwidth the FSB provides. E.g. one such is Oracle APEX, which I use at work and which will eventually run enterprise level business web applications.


abinstein
"Last, (this also to Scientia), when I say 2P C2Q is a niche market, I mean it is a niche market."

Yes, it is a niche but I bet it isn't nearly as small niche as 4P and up.

savantu said...

ho_ho , why do you even bother replying to him ?

Intel produces around 300k QCs/quarter which is roughly 1/3 of AMD's Opteron production.

Is that a niche ?

Ho Ho said...

savantu
"ho_ho , why do you even bother replying to him ?"

Mostly because I'm bored and I like discussions/arguing since I can practice my English skills. Also I've learnt a couple of interesting things while searching for information. I also like to see what excuses people come up with when they are cornered :)

Scientia from AMDZone said...

abinstein
Last, (this also to Scientia), when I say 2P C2Q is a niche market, I mean it is a niche market.


Yes, that is what I said. Although the Kentsfield market is currently larger than 4x4 it is still very small. And, Clovertown doesn't yet represent a large market segment although both are growing. Clearly, both Kentsfield and Clovertown will outgrow both 4x4 and 4-way systems. My point was that Intel is being hurt also by not having a 4-way solution and by the time it has one AMD will have quad core as well.

I don't say it is niche by the end of 2007 - and why would I say that?

That isn't what I said. I said it won't be a niche market by the end of 2007:

This won't be niche by end of 2007.

Scientia from AMDZone said...

savantu
Above 8 sockets it is Intel only territory , only the likes of IBM and Unisys have x86 servers that go over 8 sockets.Guess what , they are all Xeon based.


This is true if you are talking about 8 sockets on one motherboard. However, sun does have systems with more than 4 processors. This is also true of Opteronics and Cray. Secondly, AMD will release DC 2.0 in 2008 and this will move AMD up to 32-way. This will in fact surpass anything that Intel has planned on X86 through both 2008 and 2009.

-you're disingenous and an idiot

Stop with the insults or stop posting here; take your pick.

enumae said...

abenstein
That said, I believe one point that you made is correct, that Intel graphics/chipset should run Vista just fine.

That was the only point I was making, it was not my intention to compare anything.

I was simply showing Woof Woof that Intel is indeed, according to Microsoft, capable of the Aero interface without DGP.

Unknown said...

Hasn't tyan already released multiple board models with HTX slots that are used to add a 4socket daughterboard to current 4p server boards? In fact, haven't a couple other motherboard makers done this as well? I remember seeing tons of pictures of these from the last CeBit, and seem to remember something about them already being in production. I even remember several servers being shown using these boards or derivatives of them.

Also, please describe these 8 socket xeon boards. I have no doubt they exist, but substantially doubt their usefulness and market prominence. I also doubt that they use normal chipsets. I'm mainly interested in seeing what they look like, it is for no arguments sake that I ask this.

savantu said...

Scientia This is true if you are talking about 8 sockets on one motherboard.

There are no such systems.
What you're trying to say is SSI , single system image ( a SSI system is nP , where n is the number of CPUs )


However, sun does have systems with more than 4 processors. This is also true of Opteronics and Cray.


Those are 8P systems while Xeon scales to 32P ( IBM 460 and Unisys ES7000/one ).

Secondly, AMD will release DC 2.0 in 2008 and this will move AMD up to 32-way. This will in fact surpass anything that Intel has planned on X86 through both 2008 and 2009.


64 way systems exist now from IBM and Unisys.With Tigerton we'll see 128 way Xeon systems. Which is head and shoulders above anything Opteron can muster.

Btw , do you know something we don't about CSI ?

Woof Woof said...

Do you guys ever run 965G with Vista Aero? During a preview Vista event, there was a lot of artefacts and some of the Aero features could not be enabled on the Intel systems on display. The Radeon Express 1150 ran everything, all the features, ran everything smoother and without those artefacts.

Unknown said...

Holy cow abinstein, calm down. Five bucks says your post gets removed and your arguments (though they're good ones) will be wasted.

I think this will be helpful in determining how little we should care about xeon 8+way servers.

http://www.aceshardware.com/SPECmine/top.jsp

Also, seeing how xeons haven't had any supercomputer wins lateley, and opteron has (via scientia's previous article) I have the feeling that the industry likes opteron better, in terms of scalability.

Speaking of disingeuous, I happen to spy some 8p+ servers in that above list with power processors in that. Care to explain how IBM only uses Xeon now?

Oh ya, and I don't see any xeon systems in there with the number of processors you say they have. I'm not doubting there are 8p+ servers, but apparently they're not in significant enough volume, or they don't perform well enough in normal, non-extremely specialized situations to be benchmarked with spec.

Also, your post responding to scientias doesn't actually respond to him at all, it only throws out numbers that are in no way specific enough to mean anything. They don't show quad core 2p numbers, they don't show 4p dual core numbers, and they don't show 2p dual core numbers. Again, who is being disingenuous (and yes, inflationary statements are disingenuous).

Scientia from AMDZone said...

Abinstein's comment edited and reposted
~A bit less abusive language please~

Ho Ho: "From what I know there is no standard benchmark that measures the speed of compiling."

176.gcc specint2000
403.gcc specint2006

Look at SpecInt2006 and you'll see Core2's gcc score is about the same as K8. There is no 30% faster as claimed by you (or other enthusiasts).

Ho Ho: "For every cache miss, 2.4Ghz C2 has to idle for around 170 CPU cycles. 2.4GHz AMD around half that."

This is bogus. You have totally no idea, dude. In terms of main memory latency, K8 vs Core2 is about 50 cycles vs 60 cycles. Greatly reduced memory latency is partly why Core 2 outperforms Netburst so much.

Ho Ho: "You said, cache misses are quite frequent in compiling, therefore having larger caches doesn't really help all that much. But still it seems like C2 is a lot faster than K8. Got any theories why?"

Where did I say larger cache doesn't help?

Compilation jobs can require high bandwidth because they have high miss rate. But actually compilation miss rate reduces quite well with a larger cache due to good locality.

In the two gentoo compile that you referenced, the K8 has 1MB L2 cache whereas the C2D has 4MB; K8 will experience probably 25% higher miss rate. You get slightly (~5%) higher effective memory latency on K8, and a much lower bandwidth.

Conclusion: the comparison is worthless. The 30% something is not processor difference.

Ho Ho: "Also K8 has 64byte cache line size, the same as Core2. Where did you get that 128byte cache line size idea anyway?"

Did I say Core 2 having larger cache line than K8? What I say was large cache line reduce cache misses but could lay burden on memory bandwidth. Also true with prefetch algorithm - and a basic prefetch would give you 128B per miss.

Ho Ho: "Am I the only one who gets the feeling abinstein doesn't know half the stuff he is talking about? First it was the rasterizing memory bandwidth usage and compiling speed, now he is taking some other numbers out of thin air."

Oh yeah, I don't care what type of feeling you get because you obviously can't read and know little of the reality yourself. Why don't you just go see specint2006 results and tell me how much Core2 is better than K8? I hint that you it's between +10% to -10%.

And you are the person who claims that IGP would have 2Gpix/s fillrate and take 10GB/s memory bandwidth bullshit. I say IGP at most have about 50Mpix/s (1600x1280x25) fillrate, and its memory bandwidth will be in the order of 2GB/s, or 1/4th of the smoke you come up with.

Scientia from AMDZone said...

savantu
There are no such systems.
What you're trying to say is SSI , single system image ( a SSI system is nP , where n is the number of CPUs )


Typical 8-way Opteron systems use two quad socket daughterboards connected via HyperTransport. This is not a significant distinction since the boards function as a single motherboard. Minicomputer processors used to be built the same way.

Those are 8P systems while Xeon scales to 32P ( IBM 460 and Unisys ES7000/one ).

Xeon itself has no native capability to scale to those sizes. IBM has created 32-way systems by using the X3 Hurricane chip. Newisys has a similar chip for Opteron called "Horus" although I don't know of any motherboards that use it yet.

64 way systems exist now from IBM and Unisys.

To the best of my knowledge, IBM's X3 only goes up to 32-way (such as on the X3950). If you know of any system with 64 processors then please post a link.

With Tigerton we'll see 128 way Xeon systems.

Really? I'm not aware of any connecting technology in Tigerton. IBM's X3 uses Infiniband. Post a link.

Which is head and shoulders above anything Opteron can muster.

I think you are dreaming.

Btw , do you know something we don't about CSI ?

It was clear from Intel's silence on CSI at the 2006 IDF that it wasn't going to be ready soon. To the best of my knowledge CSI will not be released on X86 in 2008. It is clear from Intel's renewed interest in the FSB that they intend this to be the mainstay through 2008.

Do you have any technical information that suggests that CSI is capable of greater than 4-way cache coherency?

Finally, you should keep in mind that IBM spent a lot of money developing X3. The problem is that X3 won't work with CSI. Intel will be starting from scratch with CSI and following where AMD now has years of experience. It is possible that Intel could catch up but I seriously doubt that will happen before 2010.

enumae said...

Woof Woof
Do you guys ever run 965G with Vista Aero?

I do not, do you?

If not, do you use the Radeon Express 1150?

During a preview Vista event, there was a lot of artefacts and some of the Aero features could not be enabled on the Intel systems on display.

You are using an argument that is not able to be upheld.

If you go your local computer store and play with a few machines you will see Aero works fine with Intels IGP... provided it has the correct video drivers.

Here is a link showing the Intel GMA950 running Vista Aero (its a video with a cheesy song).

The Radeon Express 1150 ran everything, all the features, ran everything smoother and without those artefacts.

I do not have any doubts that the ATI IGP did well, but I have shown reviews and videos saying Intel is capable and apparently you don't believe it.

Why?

abinstein said...

abinstein: "Why don't you just go see specint2006 results and tell me how much Core2 is better than K8? I hint that you it's between +10% to -10%."

Oops... to prevent possible confusion, I meant to say specint2006 gcc benchmark results. The overall specint2006 score of Core2 is about 15% better than K8 largely due to Core 2's better AI, path-finding, and media compression advantages.

Scientia from AMDZone said...

Ho Ho's comments edited and reposted

abinstein
Look at SpecInt2006 and you'll see Core2's gcc score is about the same as

How on earth could you compare the results on Spec page? For the 2006 version I couldn't find K8 and C2 based machines that would have been directly comparable. If you could then please link to them. I couldn't find two systems with same OS and compiler version. If I would want to I could bring you examples with different OS'es and compilers that show C2 to be 50%+ faster in compiling than K8 on some other OS and using some other compiler. I guess you wouldn't like that so show us the systems we should compare. Also, as AMD itself said, SpecCPU2000 is old, 2006 is the new one that should be used.

Note: it is pointless to post Spec 2000 scores compiled with Intel's compiler. However, I'm not aware of any current Spec scores which show AMD in the lead.

Also, is that benchmark availiable for free? I'd be glad to run it. I tried to find the 2000 version some time ago but it seemed it cost a lot of money.

K8 vs Core2 is about 50 cycles vs 60 cycles

First, those benches show the difference in nanoseconds, not cycles. Also you should know those tests used older version of SiSoft that used not-so-god access patterns and Core2 could guess them and fetch data to caches. Here are couple of other benchmarks that have a bit more real-world numbers with the updated version of the program, again times are in nanoseconds:
regular DDR2, 70-81ns
FBDIMM, 109-120ns, DDR2, 85-105ns.

Here, for e6600 it is 240 clock cycles of idle time for cache miss. For 3GHz Xeon it is 327 cycles, for x6800 it is 249 clock cycles. It seems that I was rather conservative with my numbers and 170 cycles sleep time is the best-case scenario.

In comparison, it takes around 60 cycles to intersect four rays with a triangle and compute the exact intersection coordinates, assuming that data is in registers.

Note: why do you have two diferent speeds for DDR2 listed. Also why are you mixing together memory speed, gpu speed, and cache miss idle time? Exactly how are these three things related?

You missed my point. It was said that in compiling, cache misses are very frequent. That means that caches are not as effective since there are awfully lot of misses. Having bigger cache will not increase its usefulness that much, most certainly not 25%. You'd be lucky to get even 5% better efficiency out of it. If you doubt just see the benchmarks where they have measured CPU's with same architecture and clock speed and different cache sizes. The biggest difference I've seen is around 15% in media encoding where the working dataset started to fit entirely to the cache. In other tasks it was mostly in order of 1-5%. Doubling the cache size will never decrease misses 2x.

I'll try to get working on my machine to measure it. If you have a K8 based machine I suggest you to do the same so we could test our theories. http://icl.cs.utk.edu/papi/

Note: I edited this because you put a large block of text in the href link.

Wow, you really don't know anything about data structures. I wouldn't call tree-like structures very local in memory. Actually they are one of the worst data structures when it comes to prefetching. Linked lists and hash tables would follow them. Also as I said, increasing cache size will not increase miss rate that much. I've tested it way too much to believe anything that states otherwise.

Note: Real compiling jobs don't fit in memory and therefore aren't very good for benchmarking. Typically you could fit the symbol table in memory which would increase lookup speed but often the rest of the code will be limited by harddrive speed. I don't see the point in arguing small details over something highly synthetic. In general compiler speed is proportionate to integer computation speed which we all know is currently faster in C2D.

I could also look up several research papers where people have measured how big impact does increasing cache size have to the cache misses. E.g in ray tracing, secondary rays usually have around 50% cache hit ratio with 8k cache size. Increasing it to 64k would increase hit rate to around 65%. Bigger caches can help a lot when the entire dataset fits into it.

Now you say that DDR1 400 with CL2 has only 5% lower latency. Have you ever seen what kind of latency that gives? It is somewhere around 45ns, much lower than the 70ns+ for Core2. That should be at least 40-50% better latency.

Next Core2 does have total of 4M L2 but that K8 has 1M per core. As compiling isn't multithreaded byt multiprocessed that means that Core2 has effectively around 2x more cache than K8, not 4x more. having twice the cache alone will never increase miss rates to 25%.

Note: It is 4X. Each core on Woodcrest has access to all 4MB of L2 whereas each Opteron core only has access to 1 MB.

abinstein I say IGP at most have about 50Mpix/s (1600x1280x25) fillrate, and its memory bandwidth will be in the order of 2GB/s, or 1/4th of the smoke you come up with.

Now a little bit of math.

Note: How about some common sense instead? ATI is not stupid enough to bottleneck its GPU's with only 1/4 to 1/5 of the necessary bandwidth. Secondly, if your calculations were accurate at all then the nVidia IGP would have a huge advantage. Since no advantage has yet been demonstrated it seems more reasonable to assume that IGP's do not use 8-10GB/sec of bandwidth.

greg Care to explain how IBM only uses Xeon now?

My guess is they are cheaper and can manage the tasks they are bought for. Power5 will surely do better but would be an overkill.

Note: If Power 5 is overkill then why is it falling in HPC? Opteron is currently the fastest rising HPC system.

scientia, would it be possible to make the comments to be as wide as the entire page? Having only 25% of usable space kind of sucks.

Note: I would edit the template for this page if I could but I don't have access to it. This post has been edited to roughly half the size even with my comments. It originally ran 2,000 words. When you quote the same person multiple times it isn't necesarry to post the person's name over and over. Also, it isn't usually necessary to quote word for word everything someone says.

abinstein said...

"Holy cow abinstein, calm down. Five bucks says your post gets removed and your arguments (though they're good ones) will be wasted."

Well, okay, apology to all who have read my tough language. I only wanted to stress how discontent I was to be quoted with some wrong statements that I never said.

"I think this will be helpful in determining how little we should care about xeon 8+way servers.

http://www.aceshardware.com/SPECmine/top.jsp
"

Thanks for the link, though the info there is not completely up-to-date... :-p

To be honest, in reality people don't even 100% believe specint comparisons, or in general any benchmarking results. Only PC enthusiasts would believe a geometric average of a few games and marks that much.

So I went back to read the page that Ho Ho linked initially. I have to say that by the information revealed there the E6600 box does seem to be the best choice (without considering of price) for compiling the gentoo Linux kernel. This however shows only one program being compiled in a few totally uncontrolled environments, thus its accuracy in reflecting the processor's ability to do compilation should be worse than that of SpecInt_rate2006 (for multi-processor).

abinstein said...

Ho Ho, I'm not going to respond to each of your points, partly because Scientia has done much of the needed responses, and partly because you're probably never going to be convinced no matter what.

There are still two thing that I'd like to say. First, operating systems make a big difference on performance if lots of system calls made, or under heavy multithreading. Since compiler is multiprocessed and has relatively low IO, it doesn't depend on OS that much.

Second, Core 2 simply does not wait 2x than K8 for every L2 miss. You can't calculate main memory latency using raw memory cell delay, because main memory is highly banked, interleaved, prefetched, and accessed in groups. The fact is L2 cache miss of Core 2 vs. K8 is about 6 vs 5 (or 5 vs 4, depending on whether you look at the average or worse case).

Ho Ho said: "It was said that in compiling, cache misses are very frequent. That means that caches are not as effective since there are awfully lot of misses. Having bigger cache will not increase its usefulness that much, most certainly not 25%."

You are wrong again. Cache size affects program performance in a wildly different manners. Compilation, or more specifically the gcc benchmark in specint2000, has higher cache miss rate than other programs when the effective cache size is small (< 1MB), but improves greatly with larger cache. From 1MB to 4MB cache miss rate reduces 20-25%.

Ho Ho said...

abinstein
"Oops... to prevent possible confusion, I meant to say specint2006 gcc benchmark results."

Can you list some specific results? I skimmed through most of these last night and didn't find a single benchmark that would have proven your point. Your point about K8 compiling speed has so far had no proof, only several examples of the opposite.


scientia
"it is pointless to post Spec 2000 scores compiled with Intel's compiler"

Do you really expect that ICC would use largely different codepaths when running on different arhitectures? If yes then what might be the difference between those different codepaths* and how big would be the efficiency difference? Could you list some results from Spec 2006 that would be even remotely comparable?
*)I'm talking about the used machine instructions.


"why do you have two diferent speeds for DDR2 listed"

First, thanks for removing the links. That will surely make the point more clear that Intel has a lot higher latency.
Secondly, what speeds were different and how big impact do you think that has?

For reference, here are the links again you removed from my post:
regular DDR2, 70-81ns running at 800MHz. This should fit nicely to compare with the results showed by abinstein. You can even see how much does different memory settings affect latency on Core2. (3-3-3-9 vs 5-5-5-15 makes a difference of around 16%)
FBDIMM, 109-120ns, DDR2, 85-105ns. Both at 667MHz. This should be a good refecence to see how big latency increase FSBDIMM brings. It is especially useful when comparing SpecInt scores as most machines there use FBDIMMs.

Do you expect that DDR2 800 would make that big difference on intel when comparing against 667? Intel has fixed latency from CPU to NB, only increasing FSB speed can improve this. Anything that effects the latency of RAM goes between RAM and NB. 800 vs 667 has raw difference of 20%, that can never translate to 20% better overall latency on Intel. It could on AMD where there is no extra delay from CPU-MC communication.
If anyone does think that RAM makes a huge difference in Core2 latency then please bring some links to prove your point but please avoid showing bugged programs.


"Also why are you mixing together memory speed, gpu speed, and cache miss idle time?"

Where do I do that? I have said nothing about GPU speed.


"Exactly how are these three things related? "

Two things, actually. Cache miss idle time comes directly from memory latency time. What is wrong with that?
I brought in ray tracing to show how much time is wasted on cache misses. CPU's are insanely powerful but a lot of their performance gets lost thanks to cache misses. How much exactly is dependant on program.


"Real compiling jobs don't fit in memory and therefore aren't very good for benchmarking"

What kind of "real compiling jobs" are you talking about? Highest I've seen compiling a single file was ~450MiB of RAM. That will surely increase cache misses quite a lot but still fits in memory. Are you really suggesting that real compiling takes more than two times as much RAM? Could you bring examples of such programs?

Having been using Gentoo for more than three years that half a gig was the highest memory usage that I've ever seen. If you are talking about all the resulting object files then this is a moot point. Even OOo doesn't generate that much that won't fit into memory, even not during linking.


"Typically you could fit the symbol table in memory which would increase lookup speed but often the rest of the code will be limited by harddrive speed"

Say what? Are you talking about swapping or reading source code from disk and writing object files back to it? If the first you are 100% wrong assuming your PC has >128M ram. If the latter then this has very little effect overall. Compiling kernel on my machine with clear disk cache vs compiling cached files made a difference of around five seconds or around 5%. Also HDD's have speed a lot faster than any CPU can output compiled object files. You'd be lucky to see >2MiB/s throughput when compiling on high-end Core2Quad, less on other CPU's.


"It is 4X. Each core on Woodcrest has access to all 4MB of L2 whereas each Opteron core only has access to 1 MB."

As it was said, compiling is multiprocessed. That means every core runs its own separate copy of GCC and compiles a different program with different data. On average both cores have access to half the cache or 2M per core, they can't share almost anything.


"How about some common sense instead? ATI is not stupid enough to bottleneck its GPU's with only 1/4 to 1/5 of the necessary bandwidth"

What would be the bottleneck? Can you list the connection throughput between IGP and RAM when considering all kind of inter-chip connections also. So far I haven't heard anything that would prove your point about this bottleneck.


"Secondly, if your calculations were accurate at all then the nVidia IGP would have a huge advantage"

What exactly is different between that radeon and NV IGP's that would give NV that advantage? On same platform they should have same memory architecture and bandwidth. Only real differene is the GPU and its shaders speed.


"Since no advantage has yet been demonstrated it seems more reasonable to assume that IGP's do not use 8-10GB/sec of bandwidth."

As I've stated several times those numbers that abinstein use are unreal. I just used those to come up with the bandwidth requirements at those framerates and resolution to compare against what he calculated (read: took from thin air). It is not my fault he picked bad input values.

I've also said that in real world IGP's struggle with shaders, not with memory bandwidth. That is the reason why we don't see so high framerates at those resolutions, memory throughput doesn't have much to do with it. In real games bandwidth is a bit lower since IGP can't process things fast enough.


"Power 5 is overkill then why is it falling in HPC? Opteron is currently the fastest rising HPC system"

Simple: not enough bang for buck.
How many Opterons can you get for one Power5? What about their power usage?


"I would edit the template for this page if I could but I don't have access to it"

Are you sure the global blogger template has nothing to do with comments page? I haven't got time to find it out myself but I've seen quite different comment pages than the default one.


"Also, it isn't usually necessary to quote word for word everything someone says."

I do it to avoid confusion. There is too much of it even without me doing it. A forum would be a much better environment for this discussion but we'll have to try to get buy with what here is.


abinstein
"Thanks for the link, though the info there is not completely up-to-date... :-p"

Well, I don't remember that AMD has made anything interesting during the last 5 monthns. DDR2 support was there long before the last update. Also I still don't see you bringing any better examples.


"This however shows only one program being compiled in a few totally uncontrolled environments"

Are you capable of bringing better examples? After all, it was you who said that K8 is quite good at compiling, that should mean you have seen it before and it should be trivial to prove.


"Ho Ho, I'm not going to respond to each of your points, partly because Scientia has done much of the needed responses, and partly because you're probably never going to be convinced no matter what."

You should since he was wrong with most of his answers. Also why do you think I can't be convinced? I've changed my oppinions before when faced with evidence. So start showing links that support your claims.


"The fact is L2 cache miss of Core 2 vs. K8 is about 6 vs 5 (or 5 vs 4, depending on whether you look at the average or worse case)."

How do you know that? Have you got any source for that information or is this yet another thing you took out from thin air?


"From 1MB to 4MB cache miss rate reduces 20-25%."

Can you bring some links to prove this? Also as was said, compiling when using several cores, single core will only have half the cache usable. How much will this decrease hit rate?


For those who are interested, here is my entire last post withouth the edits and a couple of fixed typos. There I also have the calculation of the used memory bandwidth of one hypothetical game and some other interesting things not present in the cencored post. There is also a small list of things that so far haven't got an answer. I'd like to see those things answered.

Scientia from AMDZone said...

Ho ho

My mistake, you are right that compiling would use both cores and therefore C2D would only have 2X L2.

As far as compiling itself itself though, you couldn't be more wrong. Obviously, you are not familiar with compiler design. With the last project I worked on the make file alone was five separate files. The only possible way that you could use a regular project as a benchmark would be to first load all of the source files and libraries into a RAMDisk. The gcc benchmark in spec is a trivial example of compiling.

As far as the Intel compiler goes, yes it has been demonstrated many times that it does not create optimal code for AMD processors. This is one of the cheats that THG uses.

I'm not arguing that DDR2 speed has any real effect on Intel CPU latency. Every review that I have seen on THG has used memory that fits well with C2D. In contrast, it does effect K8 because the latency is too high. This is another standard cheat that THG uses.

As I've stated several times those numbers that abinstein use are unreal. I just used those to come up with the bandwidth requirements at those framerates and resolution to compare

Then you've wasted a lot of time. I think we can stop talking about imaginary hardware.

As far as the template goes, I am only aware of the main page definition. If the comments page defintion is there somewhere I haven't located it yet.

Yes, I agree longer discussions would be better in a forum. I regular post at AMDZone.

Scientia from AMDZone said...

Ho Ho

Why did you skip my questions about CSI, your current claim of 64-way, and your future claim of 128-way Xeons?

I can link to the IBM page showing that X3950 is 32-way.

As far as when CSI will be released, the best estimate I've seen is late 2008 at the earliest for Itanium and not until 2009 for Xeon.

The CPU redefined Torrenza and CSI

In time CSI will replace the frontside bus of both the Intel Xeon and Intel Itanium. Just as with HyperTransport the processors on CSI will be connected collectively through fast point-to-point data busses with Intel promising transfer speeds of up to 6.4 Gigatransfers a second. Intel is also planning to crossover to an integrated memory controller, based on FB-DIMM, at around the same time that it will release its CSI platform. The first processor to make use of CSI will be the next generation of Itanium processors (codenamed Tukwilla) which is planned for release in 2008. The implementation of CSI for Intel's Xeon, and possibly its desktop processors, will happen in 2009.

6.4G Transfers/sec for CSI
20.8G Transers/sec for HT

CSI is only equivalent to HT 1.0.

HT 3.0 also has power saving mode, split mode, long distance mode, and is hot pluggable. I don't see Intel's catching up to this in 2009, maybe by late 2010.

Woof Woof said...

I guess enumae didn't read my earlier posts. I played with the Vista Preview machines and as I said, there were glitches and artefacts with the GMA notebooks.

As it turned out, I had a chance to mess with Vista Premium and Basic again. FWIW, none of the GMA notebooks were configured for Aero. Zilch, Nada, not the one. Turned out the MS guys mentioned they had issues getting it to run, so they only enabled it in the Centrino Duo machines with the Nvidia GeForce Go chipsets.

Not Penix said...

--->so they only enabled it in the Centrino Duo machines with the Nvidia GeForce Go chipsets.<----

Centrino Duo means that it would be an intel chipset, not an Nvidia one. I"m not sure why people keep on saying the GMA chipsets can't run VIsta Aero, i've got 2 notebooks at this moment one with GMA 900 the other with 950 and they both run aero without a hitch, and its been proven again and again that the GMA chipsets do run Aero.

enumae said...

Woof Woof
I guess enumae didn't read my earlier posts...

Actually I did, but just to refresh your memory here is what you said... "During a preview Vista event, there was a lot of artefacts and some of the Aero features could not be enabled on the Intel systems on display."

I do not see where you used the machine, just where you saw a demonstration or preview of Vista.

As it turned out, I had a chance to mess with Vista Premium and Basic again. FWIW, none of the GMA notebooks were configured for Aero. Zilch, Nada, not the one.

When? What chipset? What video drivers? These are very important factors, as of now your comments are a little to vague to be considered valid.

Turned out the MS guys mentioned they had issues getting it to run...

Without the current drivers it will not work properly, look at the link I provided for the Intel machine in an earlier post, they discuss this.

Ho Ho said...

scientia
"With the last project I worked on the make file alone was five separate files. The only possible way that you could use a regular project as a benchmark would be to first load all of the source files and libraries into a RAMDisk. The gcc benchmark in spec is a trivial example of compiling."

What has the number of files got anything to do what I said? Yes, most programs consist of more than one file but so what? Compiler will read those in one by one, concat #includes and compile it all together to single object file. When a bunch of object files are done they are linked together to executable or .so.

I was pointing out that with that kind of compiling I haven't seen too big memory usage, ~450MB max and that was when using kdeenablefinal USE flag that concats all files in a directory into single huge file. Mostly compiling big files takes around 50-150MiB.

As for disk access, yes, with multiple files you have multiple file reads. Any half decent OS caches those files on first access and that means you'll only read them once from disk. No big impact.

On the whole having several files have almost no impact in benchmarks assuming that compared systems have remotely similar drives. By huge difference I mean bandwidth and latency differences of >2x, anything under that won't be much of a problem, unless you use lots of cores in parallel (>>4).

Even if there would be a big effect because of disk acces you can always see how much time is spent in system calls. Waiting for disk reads is directly visible from there as compiler itself doesn't use any other system calls.


"As far as the Intel compiler goes, yes it has been demonstrated many times that it does not create optimal code for AMD processors."

I know ICC generates different code for different architectures. Mostly that means not using SIMD stuff on AMD and other non-intel CPUs. I wasn't talking about that, I was asking about ICC itself and if it uses separate codepaths while compiling programs on Intel and on AMD HW. I can't imagine what parts of compiler could use SIMD instructions for compiling.


"I'm not arguing that DDR2 speed has any real effect on Intel CPU latency. Every review that I have seen on THG has used memory that fits well with C2D. In contrast, it does effect K8 because the latency is too high"

Does that mean the CL2 DDR1 memory using AMD in the Gentoo forum post was actually in much better position than similar AMD with DDR2 memory would be?

Btw, whats with all the THG references? Was any of the sources I brought not credible enough?


"Then you've wasted a lot of time. I think we can stop talking about imaginary hardware."

I did it only to show how wrong abinstein is in his bandwidth "calculations". He claimed to know the stuff he was talking about but yet again couldn't prove his points.

Also it didn't take all that much time, about ten minutes at most. I've studied this stuff for the past six to eight years or so so I know those things by heart :)


"Why did you skip my questions about CSI, your current claim of 64-way, and your future claim of 128-way Xeons?"

You are messing me up with someone else (greg?) as I didn't claim this. All I said about >8P machines was that some old P4 and P3 based things were listed on the spec top20 table and no AMD machines.


I just remembered that KDE has a nice application called KSysguard. I could run an example compile job on my box and show all kinds of statistics like CPU load, memory usage, disk read/write accesses, disk bandwidth, context changes and some other things. If people are interested in that kind of thing I can create a small webpage somewhere and describe the results I get.

Just for fun I recompiled my kernel at work today on my 3GHz P4 HT box (1 core, -j3). Average disk bandwidth was around 10-15kiB/s with up to 500kiB spikes after every few seonds (write cache flush?). Nothing too big. IIRC there were around 3-5 disk access operations per second with occasional spikes of 20-50 accesses though I think these include disk cache accesses, normal drives can't give that many IOPS. I didn't notice spikes in system calls CPU usage that would match with those access spikes.

I had a relatively old 80G IDE drive and with that there was at most 2-3% of the compiling time spent in system calls, meaning disk access was not a big deal.

abinstein said...

Ho Ho, discussing with you so far is totally fruitless because 1) you can't read properly and went about assuming what others were saying, 2) you simply don't admit errors but keep going on and drag things from out of context to aid an argument that's going off track further and further.

First, I never said large cache doesn't help, nor did I say Core 2 has larger cache line. What I said was that compiling has high miss rate, and main memory latency is mitigated by prefetech and large cache line. These are facts; go read any standard computer architecture textbook and stop pesting me with proof. You can't just go around assuming what I said, and even insist you are right to do so. Please don't sink lower.

Second, I have no answer to your little list of questions (I don't intend to answer them even if I know the answers), because you reframe, rephrase, and distort my statements and make these questions completely meaningless. For example, I didn't say desktop apps bottlenecked on crypto and XML; I said the latter two are important for business and engineering IT. You can find out yourself how much C2D cache performs better than P4D, or how much more multi-socket Opteron sold than C2Q. The facts remain, that C2D loses to K8 in some apps (amid 15% better in average), and dual-socket C2Q is much more a niche market (desktop enthusiasts) than the multi-socket Opteron.

Third, your little bit of math is based on nothing but false assumption. Can you show us any IGP system, which is >5x cheaper than any DGP counterpart, that get 25fps with these shading processing? You'd be lucky to get 10fps with any meaningful shading at 1600x1280. If you reduce resolution to 1024x768, you half the bandwidth requirement; reducing to 800x600, you half it again.

At the end of the day, IGP shares its memory bandwidth with CPU. On both Intel and AMD platforms, the memory bottleneck is the memory and memory controller, not the link between IGP and the MC. Even if you use DGP, increase CPU's memory bandwidth from 8GB/s to 10GB/s (DDR2-533 to DDR2-667) will noticeably improve gaming performance. How can you get higher performance with IGP getting 8GB/s, leaving the CPU only 2GB/s?

You were just wrong.

abinstein said...

You'd be lucky to get 10fps with any meaningful shading at 1600x1280. If you reduce resolution to 1024x768, you half the bandwidth requirement; reducing to 800x600, you half it again.

I think I should explain this a bit more clearer, since there are people who are particularly obtuse when reading opposite arguments.

Lets say the 5GB/s memory bandwidth, based on 2M pix on screen, is about accurate. This is assuming 25fps with 1600x1280. We've not seen that, though, but two other scenarios:

1) 1600x1280 with ~10 fps.
2) 1024x768 with ~20fps

In both cases, the memory bandwidth is 2.5x less, or about 2GB/s.

Ho Ho said...

abinstein
"What I said was that compiling has high miss rate, and main memory latency is mitigated by prefetech and large cache line"

So it is. Am I correct when I say that having larger cache somewhat lowers the hit of memory latency for C2 compared to K8 and brings them to (very) roughly similar memory starving rates(C2 misses less often but has larger penalty, K8 misses more often but doesn't have a big penalty)? If so then am I correct to say that memory subsystem doesn't affect compiling speed that much when comparing the two architectures and most of the performance comes down to the actual CPUs that are crunching the code?


"For example, I didn't say desktop apps bottlenecked on crypto and XML; I said the latter two are important for business and engineering IT"

You said "Do you really think that computers are sold/bought mostly for these few purposes?"
As you were talking about apps that most computers are running I made a logical conclusion you were talking about desktop applications. If that conclusion was wrong I apologise but it was the only thing I could conclude since vast majority of computers sold are in fact desktops running regular desktop apps. Don't you agree?


"Third, your little bit of math is based on nothing but false assumption"

As I said it is based on the data you gave: 1600x1280@25FPS. I just calculated what kind of memory bandwidth would it need to render at that rate. It doesn't matter what renders it, bandwidth requirements would still be the same. You said 2GiB/s, I said 5GiB/s. Whose calculations were more accurate? I'm also still waiting for your calculations that gave you that 2GB/s number in the first place. Please don't be shy. When I see your mistake I can fix it and in the end it would benefit you.


"Can you show us any IGP system, which is >5x cheaper than any DGP counterpart, that get 25fps with these shading processing?"

I also said several times your numbers were absolutely unreal since no IGP reaches those speeds you gave since it is bottlenecked by shader processing. I was only trying to show that your assumption of fillrate and bandwidth usage were completely wrong, nothing else. The calculations were never not meant to show IGP memory bandwidth usage in real-world situations.


"If you reduce resolution to 1024x768, you half the bandwidth requirement; reducing to 800x600, you half it again."

Wrong. You only lower the bandwidth that is directly connected with framebuffer. Textures and triangle data that takes >70% of the bandwidth will still need just as much bandwidth as with higher resolutions. That means at best you would go from 200MiB per frame down to around 150MiB when going from 1600x1280 to 800x600.


"On both Intel and AMD platforms, the memory bottleneck is the memory and memory controller, not the link between IGP and the MC"

I agree and I haven't said otherwise. Some other people were talking about all sorts of buses between MC and IGP and CPU that would be bottlenecking when IGP would try to use lots of memory bandwidth. I was only trying to find out how big throughput those buses have.


"Even if you use DGP, increase CPU's memory bandwidth from 8GB/s to 10GB/s (DDR2-533 to DDR2-667) will noticeably improve gaming performance"

Please prove it. Also it would be nice when you could use some trickery to not decrease memory latency while doing it. Using singlechannel vs dualchannel would do the trick and it would show the difference even better since you'd cut the bandwidth in half, much more than merely going from 667 down to 533.


"How can you get higher performance with IGP getting 8GB/s, leaving the CPU only 2GB/s?"

You might not believe it but most games don't need much CPU<->RAM bandwidth. Even 2-4GIB/s would be more than enough. They don't need to stream tens to hundreds of megabytes of texture data through CPU every frame, they are operating on much smaller datasets. Yes, they do use lots of RAM capacity but only small parts of it are used per frame. E.g all you have in video RAM you also have in RAM but CPU almost never touches these things. Also sound takes quite a lot of memory but for an average song you'll use only a few kb's per second.

Don't believe me? Take some high-end s939 dualcore K8, OC it to hell and use 400MHz RAM in it in singlechannel setup. Next compare the FPS of games against the same machine with two sticks or RAM in dualchannel setup. I'm not telling there will be zero difference, only that having half the bandwidth won't bring performance down a lot, I estimate around 10-15% at most.

How much do you think having twice the bandwidth relly helps in games? Here is some food for thought. Of cource you all will probably start talking about incresed latency but please see here. Not that big differenec, is it? Actually highest end DDR2 has lower latency than highest end DDR1 so you can compare the two by only considering the throughput diffecence. In benchmarks they used DDR1 400 2-3-2-10 and DDR2 800 4-4-4-12.

You can also see that DDR2-533 vs DDR2-667 at best latencies have bandwidth difference of around 21% and latency difference of ~19%. How big difference is there in games?

As I said, games are not bandwidth starved. They are much more depending on latency. Also you can reread my calculations to see that even at redicilously high settings IGP's do not take 8GB/s. Assuming a bit more normal game they will be using much less than that. That means when IGP uses 80% of memory bandwidth there will still be plenty left for CPU.

Of cource GPU speed is the most important thing but for sake of simplicity lets say it is not the limiting factor.


"In both cases, the memory bandwidth is 2.5x less, or about 2GB/s."

Please share the formulaes you used to reach those numbers. I shared mine, it shouldn't be too hard for you to do the same. Also please read what I said about halving the framebuffer size.

Unknown said...

scientia had hoho mixed up with savantu, and hoho had savantu mixed up with me.

Scientia, my question was meant to point out an obvious flaw in savantu's argument. That being that IBM only offered Xeon based large multiproc systems. Obviously, they'd offer systems with their own processors in them. That would be like owning a clothing factory with a store in front, and selling melons in the store, but no clothes.

abinstein said...

Ho Ho: "As I said it is based on the data you gave: 1600x1280@25FPS."

What a lier you are! You can't read, you can't keep facts from estimates, and you can't quote correctly. I've never said 25fps @1600x1280, certainly not under shading processing. What I said was that (the most powerful) IGP today would be hard to get 100fps @1600x1280 without any 3D processing. You don't mess up with computational processing when you want to estimate the memory bandwidth usage.

Show me an IGP system that can get >100fps @1600x1280 at all, with or without any polygon/texel/pixel processing, before you let out the sh*t that IGP can take 8GB/s memory bandwidth.

And for your excessively obtuse mind to grasp the most basic concept, the 2.5x less bandwidth of my estimate comes from the fact that you get only 10fps instead of 25fps at the same 1600x1280, or you get 20fps for half the number of pixels @1024x768.

BTW, you do amaze me by sinking lower.

abinstein said...

Ho Ho: "You can also see that DDR2-533 vs DDR2-667 at best latencies have bandwidth difference of around 21% and latency difference of ~19%. How big difference is there in games?"

Stop using indirect and obscure "evidence" to help your argument. They won't work. It's about the first thing (pitfall) you should learn from any computer architecture course.

How much would bandwidth help on games? That's not nearly the question at all. The question was, will 3D performance improve with CPU memory bandwidth going from 2GB/s to 8GB/s? It's actually quite easy to answer this. Just compare 3D performance with lowest-end DDR2-400 single channel and highest-end DDR2-800 dual channel. Do you think the different won't be much?

Ho Ho: "Also you can reread my calculations to see that even at redicilously high settings IGP's do not take 8GB/s. Assuming a bit more normal game they will be using much less than that. That means when IGP uses 80% of memory bandwidth there will still be plenty left for CPU."

Please, Ho Ho, stop fooling around, will you? It was you who claimed in the first place that AMD's IMC is not suitable for IGP because of the HT "bottleneck" at 8GB/s. Now you're saying the exact inverse. What's the problem with you?

abinstein said...

Ho Ho: "Am I correct when I say that having larger cache somewhat lowers the hit of memory latency for C2 compared to K8 and brings them to (very) roughly similar memory starving rates(...)? If so then am I correct to say that memory subsystem doesn't affect compiling speed that much when comparing the two architectures and most of the performance comes down to the actual CPUs that are crunching the code?"

This "if ... then ..." statement is totally false, in that the if does not induce the then at all.

Tell me then, based on your if argument, what workload would be affected by memory subsystem, if not compilation?

abinstein said...

Ho Ho: "You said "Do you really think that computers are sold/bought mostly for these few purposes?"
As you were talking about apps that most computers are running I made a logical conclusion you were talking about desktop applications.
"

So your logical conclusion is wrong. Please don't show off your bad logic to us. I was not talking about most computers, but specifically named business and engineering computation. You really can't read, and you didn't even try to read my words, dude.

Ho Ho: "If that conclusion was wrong I apologise but it was the only thing I could conclude since vast majority of computers sold are in fact desktops running regular desktop apps. Don't you agree?"

First, this has nothing to do with my statement which claims that XML and cryptography processing are important for business and engineering IT where performance is needed.

Second, those majority of computers sold of yours saying would do just fine with cheaper K8 and 15% less IPC. Hell, most of them are even Celerons and Semprons.

enumae said...

Scientia, I now have a alot of questions after I read this.

I am wondering what you think about the new beta driver for Intel's G965 IGP, which according to the article are almost complete.

How will it compare to AMD's 690G?

Difference between DX9 and DX10 in gaming or typical use with Vista?

Could this effect AMD's 690G market penetration considering that existing G965 (as I understand it the X3000 GMA) could update there drivers and have DX10 capabilities?

Or is the new driver for new Intel G965 chipsets that will support DX10 or existing?

Maybe a comparison to AMD's DX10 chipset?

I look forward to your thoughts.

Thanks

enumae said...

I appologize, the first article I had read was a little deceiving, the drivers are a few months away.

Unknown said...

As we already all know, Intel obviously did something to gimp the ATI machine. However, realizing they couldn't have done too much, this is almost impressive, to say the least. It would actually be impressive if they listed what resolutions, settings, and "similar specs" of the systems, but it's still almost close.

I'm guessing, since Intel offloads much of the graphics workload to the main processor through the g965 chipset (which is why they've always been able to just use very small amounts of transistors for their cores) that they can just add an extra layer of software for dx10. Chances are this'll slow down the games that could use it too much to matter anyway. Seriously, who would want something like a 6200 with dx10?

This really shouldn't effect the 690's market penetration at all, seeing how the main factor for wanting to adopt the chip would be cost, power usage, and (in the most extreme scenarios) the ability to use industry graphics programs (which I've already stated, often can't run correctly on Intel's chipsets for some reason).

Even then, due to random things that no one can really fully explain, the market will even out, and AMD will gain market share due to this. This has happened in every market where a new capable competitor is suddenly present (and AMD is far from incapable). However, if this helps, five bucks says that Dell, HP, and IBM want AMD to have more market share so that they can have more leverage on either AMD or Intel when one has the upper hand and wants to charge too much for their processors.

enumae said...

Greg
It would actually be impressive if they listed what resolutions, settings, and "similar specs" of the systems, but it's still almost close.

While I havn't found system specs, I did find some settings...

"Half-Life 2 running at 1024 x 768 with most settings on high, no HDR or AA turned on, with an average of 34.25 running on a PC equipped with Vista."

...they can just add an extra layer of software for dx10.

DirectX 10 is not software based.

Unknown said...

But, as I understand it, when using an integrated graphics solution from Intel, it's simply offloading much of the work to the cpu (again, why they can make the graphics portion of their chipset so small). As such, if they could possibly release a patch to enable directx10, it would mean that the cpu is doing some hardware emulation in order to process the graphics (which would explain its problems with software like solidworks). If not, then it's impossible for them to simply release a patch to use dx10.

Also, HL 2 is a pretty non-graphics intensive game. I could play it at those settings on my 5200 and probably get that same if not slightly lower frame rate. Again, this is less than impressive, and upon further elaboration, becomes even less so.

enumae said...

Greg
...it would mean that the cpu is doing some hardware emulation in order to process the graphics (which would explain its problems with software like solidworks).

That is what I can find on the capabilities of the X3000.

Intel X3000 PDF

If it is possible for Intel to release a new driver with better support for gaming, then those capabilities should also apply to 3D applications.

Again, this is less than impressive, and upon further elaboration, becomes even less so.

Could you give me some details about your Nvidia 5200 video card.

Was it discrete, how much RAM, AGP or PCI?

Could you run COD2 at those settings?

Thanks.

Unknown said...

I'm pretty sure I ran COD 2 at 1024 though with slightly lower settings. However, I had the 128 MB version slightly overclocked, and also had an athlon 2400+ (cause I was dirt poor back then). Obviously it was discrete and agp (I don't think they made any other models).

It's been a while though, so I may be missing a few things. Regardless, that doesn't sound much more capable than a 5200, which isn't necessarily bad, just not that great. DX10 and shader model 3 aren't really useful if the gpu isn't capable of handling games that actually use them.

Scientia from AMDZone said...

ho ho

Am I correct when I say that having larger cache somewhat lowers the hit of memory latency for C2 compared to K8


You are not really correct. Intel has split memory channels, better branch prediction, and larger cache. All of these combined aren't enough when compared to K8's better memory bandwidth and latency for strictly memory intensive applications. However, these things work very well for achieving higher code hit rates. This means that Intel can be slower, the same, or faster depending on what you are running.

Actually highest end DDR2 has lower latency than highest end DDR1

I see. You are going by the link to the Sandra and Sciencemark scores. Interesting but I wouldn't count on that.

Scientia from AMDZone said...

greg

Yes, IBM's largest systems use Power processors, Sun's largest systems use Sparc, and HP's largest systems use Itanium.

Power is currently doing well but is slipping. I'm sure IBM will continue to prefer it for it's mainframe systems for several more years. IBM does sell both Xeon and Operon based systems as well.

Scientia from AMDZone said...

abinstein

Please, Ho Ho, stop fooling around, will you? It was you who claimed in the first place that AMD's IMC is not suitable for IGP because of the HT "bottleneck" at 8GB/s. Now you're saying the exact inverse. What's the problem with you?


Since he was talking about imaginary hardware maybe his arguments were imaginary too.

Scientia from AMDZone said...

ho ho
On the whole having several files have almost no impact in benchmarks assuming that compared systems have remotely similar drives.


This statement is absurd. It makes no difference at all if the two drives are the same because the drives are several orders of magnitude slower than the processors. And, supposedly this is testing the processors.

I know ICC generates different code for different architectures. Mostly that means not using SIMD stuff on AMD and other non-intel CPUs.

You are running a compiler benchmark compiled with the Intel Compiler and you don't think that makes a difference? The entire Spec group is suspect unless compiled with a neutral compiler and currently that means the PG compiler.

Btw, whats with all the THG references? Was any of the sources I brought not credible enough?

THG has a well established history of cheating on its benchmarks in favor of Intel. For example, they use very fast DIMMs to suggest that the memory is not a bottleneck. In reality it is because the high speed DIMMs are always high latency.

Unknown said...

Scientia, my point was that Savantu had no idea what he was talking about, but your point helps me do that, so thank you.