Wednesday, May 21, 2008

Approaching Midyear

Intel's lead remains clear, while the steps AMD might take to become profitable again remain a mystery. The layoffs might save $25 Million a quarter and reduced capital expenditure might be good for another $100 Million, but this is a long way from the $500 Million that AMD talked about.
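As a rough check on those figures, the short sketch below totals the two estimated savings and compares them with the $500 Million target. The dollar amounts are the estimates quoted above; treating all three on the same per-quarter basis is my assumption, and the variable names are only illustrative.

```python
# Rough estimates quoted in the post, in millions of dollars.
# Assumption: all three figures are on the same per-quarter basis.
layoff_savings = 25
capex_savings = 100
target = 500  # the figure AMD talked about

identified = layoff_savings + capex_savings
shortfall = target - identified
share = identified / target

print(f"Identified savings: ${identified}M ({share:.0%} of target)")
print(f"Still unaccounted for: ${shortfall}M")
```

On these numbers the two measures cover only a quarter of the target, which is why the remaining steps matter so much.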

Lately, I've been comparing Intel's and AMD's processors. A couple of things are noticeable. SOI is supposed to confer heat tolerance and AMD's processors now have a very respectable temperature limit of 70 C. Yet, Intel's processors running on bulk silicon now have a top end of 71.5 C. The voltage spread is also noteworthy. AMD's spread ranges from 1.05 to 1.25 volts. Intel's spread is much larger, ranging from 0.85 to 1.3625 volts. Intel's processors seem to run just fine on a much larger voltage spread again on bulk silicon. I also came across a comment in one of IBM's journal articles:

Chips with shorter channels typically run faster but use considerably more power because of higher leakage. In previous-generation processors, these parts would have been discarded because of excessive power dissipation but now are usable by operating at lowered voltages. In addition, chips with longer channels typically run slower, so some of these parts also would not have been used in earlier generation processors because of their low operating frequency, but now they also are made usable by increasing their operating voltages.

This piece was from an article about IBM's Power6 development. However, since this concerns IBM's 65nm process, which presumably would be nearly identical to AMD's, I have to wonder. It reminded me a lot of AMD's initial Barcelona launch. I do wonder if the 2.3Ghz 130 watt chips were ones that would have been scrapped in previous generations. To be honest, the current B3 stepping looks a lot more like AMD's planned initial offering. And, with this delay, it becomes more doubtful that AMD will hit its target of 2.8Ghz at 95 watts on 65nm. This too presumably would have allowed 3.0Ghz at 130 watts. It now looks like AMD won't get there until 45nm. At least we assume so; as yet no 45nm ES chips have been reviewed. Hopefully, this won't be like the Barcelona release where AMD snuck the chips out on a Friday to delay review over the weekend. Hopefully 45nm will be something that AMD can finally be proud of.

I know some time ago Ed at Overclockers had suggested that Intel might have had an advantage due to RDR (restricted design rules). I had doubted this at the time but have since changed my mind. I realized that Intel had considered their own production marginal without DFM (design for manufacturability) and that AMD has now apparently become serious about using DFM at 45nm and below. It is possible that Intel's current process advantage at 65nm is due to just a slight edge gained by making the chip layout easier to produce. If AMD really has made use of DFM at 45nm then the addition of immersion scanning should remove any limits they had at 65nm. This would be fine were it not for the fact that Intel is using high-K at 45nm, which is looking more and more like the right choice. It has yet to be shown whether or not AMD can be competitive with low-K.

In fact, it is difficult to imagine how things could be better for Intel than they are right now. Fuad has been suggesting that Intel is having trouble with 45nm production. The only evidence I've seen of this is the lack of 45nm quads at NewEgg. You can find 45nm dual cores from 2.5 to 3.2Ghz but quads are still scarce. Of course, even if this is true, Intel still has the excellent G0 65nm stepping which is still comfortably ahead of AMD. Considering that AMD won't produce any significant volume of 45nm until Q4, this gives Intel a good six months to tweak 45nm production. Even at that, AMD is unlikely to challenge Intel's current top speeds before 2009.

Intel's chips seem to have a fair amount of headroom as well. The dual core chips seem to be able to hit 4.0Ghz without much trouble. The quad cores might also be able to hit this but it looks like they are limited by socket wattage. If socket 775 were rated for 175 watts Intel could hit 3.5Ghz with their quads with room to spare. Of course, they could always follow AMD and introduce triple cores. This would allow a 175 watt quad to fit within the current socket limit. The problem with this idea is that Intel's marketing department already got up on their soapbox and proclaimed AMD's triple cores to be crippled chips. They would look a bit foolish to go back on this now.

Right now of course this isn't a problem because AMD's B3 chips don't have much headroom and the current triple cores are not being released at higher clocks. But, if triple cores were pushing faster than 3.33Ghz in, say, Q1 2009 what would Intel do? They could up the clock on their dual cores or eat crow and release triple cores of their own. Some suggest that Intel is secretly overrating the TDP limits of its 45nm chips. This presumably would mean that Intel would only have to change the labels to fix the problem. I'm not sure this has any truth to it because while the chips will indeed clock higher the quad TDP ratings seem to match the dual core TDP ratings. So, there is no evidence that Intel is derating its chips. Or perhaps I should clarify that by saying that there is no evidence that Intel is derating the quad chips; the duals do indeed seem capable of clocking higher while staying within the socket limits.
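The socket-limit argument behind the triple core idea comes down to simple proportion. The sketch below assumes that power scales linearly with the number of active cores (ignoring cache, uncore, and leakage from the disabled core) and uses an assumed socket limit of 136 watts, which as far as I know is the TDP rating of the QX9770. Neither assumption is an Intel figure, and the function name is only illustrative.

```python
def scaled_power(quad_watts: float, active_cores: int, total_cores: int = 4) -> float:
    """Naive estimate: package power is split evenly across the cores."""
    return quad_watts * active_cores / total_cores


SOCKET_LIMIT_WATTS = 136.0  # assumption for illustration, not an Intel specification
quad_watts = 175.0          # the hypothetical 175 watt quad from the post

triple_watts = scaled_power(quad_watts, active_cores=3)

print(f"Quad:   {quad_watts:.2f} W, fits socket: {quad_watts <= SOCKET_LIMIT_WATTS}")
print(f"Triple: {triple_watts:.2f} W, fits socket: {triple_watts <= SOCKET_LIMIT_WATTS}")
```

Under these assumptions the quad exceeds the limit while the same silicon with one core disabled lands just under it. A real chip would not scale this cleanly, so treat the result as a back-of-the-envelope figure.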

Of course, AMD may not be much competition in the near term. AMD might need high-K to actually pose any threat and by that time Intel will be close to or already producing 32nm. Then again, perhaps Nehalem is much better than has been suggested. I'm a bit divided about the direction that Intel has gone with Nehalem. The IMC is a great idea as is QP. And, putting the loop buffer after the decoding stage does seem like an improvement and probably a lot better than the trace cache on P4. I had been wondering how Intel was going to make SMT work without the old trace cache and this seems to be the answer. But, I still have doubts. The IMC is good but in terms of the benchmarks it seems like a lot of this will be traded off with less use of prefetching with the smaller cache. In terms of real performance this is an improvement but I wonder how it will be on the artificial benchmarks that Intel has been so fond of. Secondly, SMT seems to be one of those ideas that looks great on paper but not in practice. Most likely this was added to give Intel some greater hitting power versus Sun's strongly threaded Niagara. And, it probably will.

The problem is that as a programmer you can't really program for SMT. If you only have one application running then this can work. However, as soon as you run other applications the processor state changes and you can find yourself running slower with SMT instead of faster due to things like sharing buffers, cache, and branch prediction. And, these days with virtualization moving up it becomes more and more likely that servers will be running not just multiple applications but multiple OS's as well. This simply cannot be optimized at compile time. Without some method of changing the SMT profile at runtime I can't see how this would work. It seems more likely that SMT will end up being an option that most server managers leave turned off. Curiously, AMD seems to be one who is moving towards dynamic load profiling but it has never given any indication that it is interested in SMT.

Intel has a tough path to follow trying to make Xeon competitive with Opteron but not so good that it kills Itanium. Once Xeon has 256 bit SSE, Itanium will be holding on by only an artificial RAS differentiation. But RAS will undoubtedly become more and more robust in x86 server processors. As bad as Nehalem might be for Itanium, it is most certainly what Intel needs for Xeon. I'm eager to see the benchmarks to find out if it is good for the desktop as well.

49 comments:

SPARKS said...

“If socket 775 were rated for 175 watts Intel could hit 3.5Ghz with their quads with room to spare.”

Academic, I type this comment at 4.06, on an Asus P5E3 Premium X48 motherboard, 24/7, air, 54 C. I would suggest that your inference that INTC has indeed “de-rated” its chips to suit current market conditions has already been policy. I call it sandbagging, so did ‘MaximumPC’.

The head room in my chip, QX9770 (with an unlocked multiplier) far surpasses the relative thermals and obtainable clock speeds of the Q6600 (GO) by roughly 750 MHz, easily. I own both.


“Of course, they could always follow AMD and introduce triple cores.”


Sorry Sci, this is where you and I part ways. Frankly, I’m surprised you even consider it as a viable strategic option. I’d bet every share I own in INTC this would NEVER happen.

Besides, on a manufacturing level, it costs the same for INTC to build a quad, as it would a triple! Why bother? Simply give the 4th core away.


Why would they bother when there are quad cores that could easily be within reach of the same performance/price points as an AMD triple?

Or, just simply keep Q6600 the last remaining 65nM chip in production to fill Intel’s gap between dual and quads at a competitive performance/price point. This is what will happen.

SPARKS

Scientia from AMDZone said...

sparks

"Academic, I type this comment at 4.06, on an Asus P5E3 Premium X48 motherboard, 24/7, air, 54 C."

This doesn't tell me anything. What temperature would you reach while stressing the cores with Prime95 and how much power would be pulled through the socket?

"I would suggest that your inference that INTC has indeed “de-rated” its chips"

Actually that isn't what I said. I said the dual core chips have been derated. This has not been shown for the quads.

"The head room in my chip, QX9770 (with an unlocked multiplier) far surpasses the relative thermals and obtainable clock speeds of the Q6600 (GO) by roughly 750 MHz, easily. I own both."

Again, this doesn't tell me anything. Most OC'ers figure headroom based on several different arbitrary levels of stability which often exceeds what the manufacturer is willing to gamble on. The duals would obviously fit the current criteria at clocks above 3.33Ghz.

"Sorry Sci, this is where you and I part ways. Frankly, I’m surprised you even consider it as a viable strategic option. I’d bet every share I own in INTC this would NEVER happen."

I'm always amazed when I get misquoted on roborat's blog which seems to be about half the time. Did you really have that much trouble understanding what I wrote? I don't think Intel would release triple cores unless they were really cornered. They would look very foolish after trashing the concept in their ads.

"Besides, on a manufacturing level, it costs the same for INTC to build a quad, as it would a triple! Why bother? Simply give the 4th core away."

You don't understand. AMD triple cores are quad cores with a defective core; they were never intended to be made. Intel has been dumping its bad core chips as low end single core Celerons. There is nothing to prevent these from being paired with another good dual core to make a triple core.

"Why would they bother when there are quad cores that could easily be within reach of the same performance/price points as an AMD triple?"

They might indeed. Again, what I said had absolutely nothing to do with price or performance. It only had to do with the rated wattage limit of the socket. I thought this point was pretty clear but okay: If a quad core exceeds the wattage limit of the socket then a triple core could fit. That concept should be pretty straightforward.

"Or, just simply keep Q6600 the last remaining 65nM chip in production to fill Intel’s gap between dual and quads at a competitive performance/price point."

Which again has nothing at all to do with what I wrote.

SPARKS said...

“You don't understand. AMD triple cores are quad cores with a defective core; they were never intended to be made.”

I knew that, trust me.

Perhaps, I don’t understand the essence of your supposition.

As so many naysayers have pointed out in the past, INTC’s MCM approach has been called inelegant. However, it left INTC with many more binning options. 2 top bin dual cores can either be 1 QX9770 or 2 E8600’s, the only difference being the unlocked multiplier in the top binned Quad. INTC can play mix and match all the way down the line, binning speed being the criterion, for both dualies and quads. Again, why even mention 3 core as you did in your essay?

I agree with you wholeheartedly: INTC going to three cores would be absolutely ridiculous no matter how “back against the wall” they found themselves. With their current performance lead, the 45nM transition, clock per watt speeds, and performance per dollar, it would be marketing suicide. Besides, INTC doesn’t want to play a “copy AMD” strategy before Nehalem’s imminent launch, or sell anything that may appear to be “broken”, not at this late stage.

AMD, however, because of its more “elegant” quad core solution, has no such options presently.


“ Again, this doesn't tell me anything. Most OC'ers figure headroom based on several different arbitrary levels of stability which often exceeds what the manufacturer is willing to gamble on. The duals would obviously fit the current criteria at clocks above 3.33Ghz.”

I can’t speak for “most overclockers”, but I can speak for myself, an avid overclocker. When I can play “Crysis” at 2 in the morning, and found the machine has run a virus scan AND defragged the hard drive after I’m done playing, I find this is extraordinary. All of this, with nary a glitch in gameplay. This is real world performance at 4 GHz, 24/7, no gamble, 100 percent stable.

I suppose one could take a Bugatti Veyron and run it at top rpm and speed for a week, 24/7, wait for something to fail, and say ‘this tells me something’.

I have no interest in power usage; the board is qualified to handle the processor's requirements, and I am 100 percent stable on air. Does this tell you something? If not, here is a link below that will; further, it also concurs with my personal observations, not synthetic benchmarks.

http://www.overclockersclub.com/reviews/intel_qx9770/



SPARKS

Ho Ho said...

scientia
"Curiously, AMD seems to be one who is moving towards dynamic load profiling but it has never given any indication that it is interested in SMT"

What AMD proposed was a standardized way of doing low-level CPU profiling. That kind of low-level profiling has been available on CPUs for years (IIRC since the first Pentiums); it is just that with every new CPU, or sometimes even a new revision of a CPU, the events you could monitor changed, and it was almost impossible to write cross-platform/CPU code that could do such performance analysis.

Scientia from AMDZone said...

sparks

"Perhaps, I don’t understand the essence of your supposition."

Even after I explained it four times? I don't know what to say.

"INTC’s MCM approach has been called inelegant."

I never said anything negative about Intel's MCM so this is unrelated.

"Again, why even mention 3 core as you did in your essay?"

See the above comment about already explaining this four times.

"When I can play “Crysis” at 2 in the morning, and found the machine has run a virus scan AND defragged the hard drive after I’m done playing,"

Right so your cores are hardly ever fully utilized.

"This is real world performance at 4 GHz, 24/7, no gamble, 100 percent stable. "

So you've never tried running Prime95 on all four cores for half an hour. That would be something to brag about.

"I have no interest in power usage"

Intel does.

"Does this tell you something?"

Not really.

Scientia from AMDZone said...

ho ho

"What AMD proposed was a standardized way of doing low-level CPU profiling."

You are getting confused with AMD's Light-Weight Profiling Proposal.

However, I wasn't talking about a proposal. I was referring to IBS which is new in Barcelona.

"That kind of low-level profiling has been available on CPUs for years (IIRC since the first Pentiums)"

I know that IBS does not exist in K8. The best you can do with K8 is performance counter sampling with PMC. IBS includes new hardware support for sampling which puts it at least a generation ahead of PMC.

Name the Penryn equivalent of IBS.

SPARKS said...

SCI as per your suggestion-

[May 23 22:22] Work thread starting
[May 23 22:22] Beginning a continuous self-test to check your computer.
[May 23 22:22] Please read stress.txt. Choose Test/Stop to end this test.
[May 23 22:22] Test 1, 4000 Lucas-Lehmer iterations of M19922945 using FFT length 1024K
-
-
-
-
[May 23 22:54] Test 2, 560000 Lucas-Lehmer iterations of M210415 using FFT length 10K.
[May 23 22:54] Torture Test ran 31 minutes - 0 errors, 0 warnings.
[May 23 22:54] Work thread stopped.


The highest speed for this completely unrealistic CPU usage for QX9770, air cooled, was:

8.5 (mult) X 450 (fsb) =3825.1
Vcore 1.408 V
DRAM 7-7-7-21 2T @ 1800 2.0V
CPU PLL Voltage 1.80
FSB Terminal Voltage 1.30
Northbridge Voltage 1.51

After these runs, I increased the multiple back to 9.0 for 4.08 GHz for real world, dynamically loaded, everyday usage.

It would not run Prime95 longer than 5 minutes at 4.0. The Zalman 9700 cannot keep up with the heat with the CPU loaded in Prime95. I suspect there would be no problem with water cooling at 4.0 GHz.

By the way, during this final run, I was reading my favorite tech sites/news.

I’m not bragging Sci, it’s just a simple fact that this is one terrific piece of hardware.

Will you admit that?

Finally, does THIS tell you something?

SPARKS

zdzichuBG said...

SMT isn't for every workload, that's sure. And Sun doesn't market their T1/T2/T2plus (Niagara/Niagara 2/Victoria Falls) CPUs for every workload. Like you wrote, CMT excels with single, threaded applications. Like webservers. This is exactly what UltraSPARCs T* are for. And they are unstoppable, when you have dual-socket, 32-core, 128-thread webserver in 1U... x86 CPUs can't even compete here.
I wonder if going to SMT in general-purpose CPUs is some marketing ploy or a simple necessity with slow memory.

sharikouisallwaysright said...

Besides the daily business, I am now using a Phenom X4 9850 Black Edition on a cheap ALiveSATA2-GLAN from ASRock and it works!

IMHO AMD is doing the right thing in making such things possible and not forcing you to buy a new board.

And VIA is well remembered for this too - rest in peace!

Pop Catalin Sever said...

If AMD can't compete performance wise with Intel it will die slowly like Via & Cyrix.

I don't know what the hell H.R. is doing but none of the actions or decisions he makes (and his board) seem to take the company in the direction of being competitive on the performance front.

When Nehalem comes out I expect a bigger blow to AMD than Core 2 was. And then everyone will realize how flawed, underperforming and short sighted the H.R. era was.

Scientia from AMDZone said...

sparks

I did look through some of the comments in the latest thread on roborat's blog. Most of the comments I saw were a waste of time.

For example, the discussion about 450mm wafers. AMD trailed Intel's adoption of 300mm by several years; they will of course do the same thing with 450mm, putting off the transition as long as possible.

The discussions about the lawsuit (apparently prompted by roborat's speculations) were similar. The bottom line is that if by some miracle AMD received a whopping judgement tomorrow of, say, $5 Billion Intel would just pay it out of available cash. Intel's stock would probably dip a little but recover within one quarter. So, no real effect on Intel. However, even with less debt AMD would still have to come up with competitive hardware and a balanced budget. In other words, with an extra $5 Billion in hand AMD's competitive position (and therefore outlook) would be hardly changed from what it is today.

There was similar speculation about shifting production to a foundry which obviously won't work. I looked over most of the comments which mentioned me specifically and each one that I looked at had incorrect ideas attributed to me. The ratio was even worse than the roughly 50% correct ratio that I saw in the past. Sometimes it was both interesting and sad to see someone invent an idea attributed to me and then see several other posters jump in and energetically do battle with this strawman.

I noticed that there does seem to be a similar phenomenon as I saw at Toms Hardware Guide where incorrect information bad for Intel is likely to be corrected quickly whereas other information is not. For example, I don't believe anyone ever corrected the assertion that AMDZone was blocking guest viewing of the forums which is obviously incorrect.

The only comments that I saw that seemed to be worthwhile were the technical discussions of gate properties, process technology, and production. These were indeed worthwhile. However, that still means that the bulk of comments are garbage. I might have to try to scan more for technical comments in the future.

Scientia from AMDZone said...

sparks

I would say that your results show exactly what I said: If Intel sockets were designed to handle more wattage then Intel could easily increase clock speed. But since the great majority of Intel systems are stock this doesn't help Intel much right now. However, you could make the argument that Intel could bump dual core speed enough to remain competitive with AMD's triple cores, but I doubt Intel would actually do this.

Scientia from AMDZone said...

zdzichuBG

I think Intel added SMT to P4 to try to mitigate the large penalties from pipeline stalls. With its shorter pipeline this doesn't seem to be a problem for Core. So, I'm guessing that Intel simply added this as a bonus to help with some server applications (web searches, database searches, text lookup, etc). I doubt it will have much value for desktops and again not much value for mixed application servers.

Scientia from AMDZone said...

Pop Catalin Sever

"If AMD can't compete performance wise with Intel it will die slowly like Via & Cyrix."

I agree; AMD would become marginalized.

"none of the actions or decisions he makes (and his board) seem to take the company in the direction of being competitive on the performance front."

Some improvement (a bump from 2.3 to 2.5Ghz) but more is definitely needed. And, more importantly AMD has to get its finances under control.

"When Nehalem comes out I expect a bigger blow to AMD than Core 2 was."

Why do you think so?

Pop Catalin Sever said...

'"When Nehalem comes out I expect a bigger blow to AMD than Core 2 was."

Why do you think so?'

Because Core 2 had just IPC improvements over an old infrastructure based on FSB and still managed to damage Athlon heavily.

Nehalem will have even more IPC, more bandwidth, lower memory latency and multi processor scaling, basically will have 4 advantages over K10 where Core 2 had only 1. (Not to mention power consumption)

Ho Ho said...

scientia
"The best you can do with K8 is performance counter sampling with PMC. IBS includes new hardware support for sampling which puts it at least a generation ahead of PMC."

With PMC you can count single instructions/events exactly as well as with IBS, only difference is missing virtual and physical address collecting. Only thing you have to worry about is not trying to count too many different things at once. As long as there are enough counters you get 100% exact results. You can measure cache latency, cache misses, numbers of specific instructions made (or a family of instructions), stalls etc with PMC (there were >500 events and event sets measurable on Core2, IIRC). Only major thing I see with IBS is support for memory virtualization. The rest is just making stuff sound nice, similar to how Intel tends to label their things with well-sounding names even when they aren't actually worth much.

Scientia from AMDZone said...

Ho Ho

"With PMC you can count single instructions/events exactly as well as with IBS, only"

No.

"Conventional performance counter sampling is not precise making it difficult, if not impossible, to attribute events to specific instructions."


"difference is missing virtual and physical address collecting."

True.

"The virtual and physical addresses of load/store operands are collected."


"Only thing you have to worry about is not trying to count too many different things at once."

Yes, this is a difference and an advantage of IBS.

"A wide range of events are monitored and collected with each IBS sample. Either multiple sampling runs or counter multiplexing must be used to collect the same range of information with conventional performance counter sampling. "


"As long as there are enough counters you get 100% exact results."

We've already shown that isn't the case.

"You can measure cache latency,"

No. This is apparently new with IBS or at least this is what AMD indicates is one of the differences between IBS and performance counter sampling.

"Latency is measured for key performance parameters such as data cache miss latency."

Scientia from AMDZone said...

Pop Catalin Sever

"Because Core 2 had just IPC improvements over an old infrastructure based on FSB"

That statement is not accurate. Yonah had several improvements over Banias which likewise had several over PIII. C2D doubled Yonah's bus width and SSE execution width which greatly increased SSE performance. They added a third simple decoder and widened the pipeline to a width of four which greatly enhanced Integer performance. These are not small changes.

Nehalem uses a similar core to Penryn. The major changes seem to be some additional SSE4 instructions, SMT, and IMC. I have doubts as to how much improvement we will see with the IMC since the prefetching and caching on Penryn are so good already.

"Nehalem will have even more IPC, more bandwidth, lower memory latency and multi processor scaling, basically will have 4 advantages over K10 where Core 2 had only 1."

I'm not sure what you mean. K10 already has multi-processor scaling and this will improve with DC 2.0 and HT 3.0. K10 again already has low memory latency.

I'm also not sure what you mean about "even more IPC". As far as I can tell, Penryn is about 16% faster than K10 in Integer while K10 is about 16% faster in FP. I have no idea how much gain Shanghai might get but even if it were the same as Nehalem I expect Intel to hold onto the clock lead. Or were you talking about IPC only when using SMT?

I would agree about bandwidth. So, bandwidth and clock speed would be two advantages rather than four.

enumae said...

Scientia
"As far as I can tell, Penryn is about 16% faster than K10 in Integer while K10 is about 16% faster in FP."


This is not true until you start running SPECint®_rate2006 and SPECfp®_rate2006.

"* The SPEC speed metrics (e.g., SPECint2006) are used for comparing the ability of a computer to complete single tasks.

* The SPEC rate metrics (e.g., SPECint_rate2006) measure the throughput or rate of a machine carrying out a number of tasks.
"

So when using the rate metrics, you are bringing platform capabilities into the equation and no longer talking about IPC.

If using the speed metric (which would be more realistic of IPC), Penryn and Barcelona are about even in FP, and with INT, Intel is about 25-30% faster.

Ho Ho said...

scientia
"We've already shown that isn't the case."

You don't seem to know much about PMC. Only way not to get 100% accurate results is when you overload your counters and have to multiplex. As long as you have enough counters you will have 100% exact results. This is something AMD has conveniently left out from its documentation.


"K10 already has multi-processor scaling "

Indeed, K10 had, Core2 didn't. Now Nehalem will have it too and K10 will face some serious competition in multi-processor platforms. Compared to Penryn Nehalem will have lower memory latency and better scaling. Compared to Barcelona Shanghai will hardly get noticeably better in those two things.

Pop Catalin Sever said...

Scientia from AMDZone said...
"Nehalem uses a similar core to Penryn. The major changes seem to be some additional SSE4 instructions, SMT, and IMC. I have doubts as to how much improvement we will see with the IMC since the prefetching and caching on Penryn are so good already."

Not similar, "improved Penryn Core"
SMT, SSE 4.2, Improved dispatch, improved prefetching...

Quote from HardwareSecrets:

"First Nehalem will have four dispatch units instead of three. So what does that mean? This means that internally the CPU can have four microinstructions processing at the same time instead of three like on other Core-based CPUs (Core 2 Duo, for example). This represents a 33% improvement in the CPU processing capability. Translation: this CPU will be faster than Core 2 Duo CPUs under the same clock rate because it can process four microinstructions at the same time instead of three."

Core 2 has 3 dispatch units but can dispatch 4 instructions per clock because of macro ops fusion that fuses 2 ops into 1 under certain conditions. Nehalem will be able to dispatch more than 4 instructions per clock if macro ops fusion is used (I think 5).

Scientia from AMDZone said...
"I'm not sure what you mean. K10 already has multi-processor scaling and this will improve with DC 2.0 and HT 3.0. K10 again already has low memory latency."

Yes but this time Nehalem will also have multi CPU scaling and because of higher bandwidth it might scale better.

Scientia from AMDZone said...
"I'm also not sure what you mean about "even more IPC". As far as I can tell, Penryn is about 16% faster than K10 in Integer while K10 is about 16% faster in FP. I have no idea how much gain Shanghai might get but even if it were the same as Nehalem I expect Intel to hold onto the clock lead. Or were you talking about IPC only when using SMT?"

Intel said about 30% more performance than Core 2 at the same clock.

"I would agree about bandwidth. So, bandwidth and clock speed would be two advantages rather than four."

Not only this but Intel seems to want to increase the clock speed of Nehalem over Penryn also ... they seem to plan launching 3.2 GHz Nehalems (that's what they demoed at least), and I'm pretty sure sooner or later they will get to that speed. Intel touts a 4x bandwidth increase over Harpertown that uses 1600 FSB.

Scientia from AMDZone said...

Ho Ho

"Only way not to get 100% accurate results"

Again, you are ignoring the fact that events cannot be related to actual instructions. This makes counter sampling less capable than IBS. If you want to consider counter sampling 100% then I guess IBS would be more than 100%. My original point still holds.

"Now Nehalem will have it too and K10 will face some serious competition in multi-processor platforms. Compared to Penryn Nehalem will have lower memory latency and better scaling. Compared to Barcelona Shanghai will hardly get noticeably better in those two things."

This is all nice but completely irrelevant. The original point that Pop made was that these were advantages for Nehalem; they cannot be advantages if they are the same as K10.

Scientia from AMDZone said...

enumae

"This is not true until you start running SPECint®_rate2006 and SPECfp®_rate2006."

Why would you run a single threaded benchmark on a quad core processor? Is your front door tall enough if you can crawl through it? Is your furnace good enough if it can keep 1/4 of your house warm? If you are only concerned about single threaded benchmarks then you don't need a dual core, much less a quad.

Scientia from AMDZone said...

Pop Catalin Sever

"Not similar, "improved Penryn Core" SMT, SSE 4.2, Improved dispatch, improved prefetching..."

I see but you are ignoring that the Shanghai core is also improved. I'm also not sure just what of Nehalem's changes are really improvements when SMT isn't used.

"Core 2 has 3 dispatch units but can dispatch 4 instructions per clock because of macro ops fusion that fuses 2 ops into 1 under certain conditions. Nehalem will be able to dispatch more than 4 instructions per clock if macro ops fusion is used (I think 5)."

True indeed but I'm concerned about execution unit bottlenecks.

"Yes but this time Nehalem will also have multi CPU scaling and because of higher bandwidth it might scale better."

That doesn't seem likely. Socket scaling is not a function of memory bandwidth. It's more a function of inter-socket communication (which is why Opteron stalls at 4-way with HT 1.0). DC 2.0 gives 8 HT links compared to the three that AMD has with DC 1.0. Secondly, these links are about 50% faster. That is pretty good connectivity.

"Intel said about 30% more performance than Core 2 at the same clock."

That would be remarkable indeed if they can pull off a 30% increase in IPC without SMT. If that is true then they could probably take the lead in SPECFP_rate.

"they seem to plan launching 3.2 GHz Nehalems (that's what they demoed at least)"

If Intel can increase IPC 30% without SMT while releasing at 3.2Ghz then they will have no trouble staying ahead of AMD.

"Intel touts a 4x bandwidth increase over Harpertown that uses 1600 FSB."

How do you figure 4X? As I recall Nehalem will use 3 memory channels for the server chips. That would be 50%. And, you get a bit more bandwidth (at the expense of latency) by moving to DDR3. This wouldn't add up to 4X though.

enumae said...

Scientia
Why would you run a single threaded benchmark on a quad core processor?


What creates the bottleneck on an Intel platform while running SPEC (rate metric)?

Would running a single instance of SPEC (speed metric) eliminate that bottleneck?

If that bottleneck is potentially removed (QPI & IMC), what does that allow the faster core (speed metric) on the Intel platform to do when running multiple copies (rate metric)?

This is why I look at the speed metric (no bottleneck) and not rate metric (bottlenecked) when comparing Nehalem to Penryn to Barcelona.

Scientia from AMDZone said...

Pop

If DDR3 tops out at 1600 compared to 1066 for DDR2 then this would be 50% faster. Three memory channels versus two would also be 50%. That looks like 2.25X to me rather than 4X.
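The 2.25X figure is simply the two 50% gains multiplied together. A minimal sketch of that arithmetic, using only the peak numbers quoted above (the variable names are illustrative, and these are theoretical peaks rather than measured bandwidth):

```python
# Rough check of the 2.25X estimate; theoretical peak figures only.
ddr2_rate = 1066    # top DDR2 speed quoted above
ddr3_rate = 1600    # top DDR3 speed quoted above
old_channels = 2    # memory channels today
new_channels = 3    # memory channels on the Nehalem server parts

speed_ratio = ddr3_rate / ddr2_rate            # about 1.5
channel_ratio = new_channels / old_channels    # exactly 1.5
total_ratio = speed_ratio * channel_ratio      # the two gains compound

print(f"speed x{speed_ratio:.2f}, channels x{channel_ratio:.2f}, total x{total_ratio:.2f}")
```

The gains multiply rather than add, which is why two 50% improvements give 2.25X and not 2X.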

Of course AMD will also shift to DDR3. But, AMD won't get more memory channels until MX which is looking like it won't arrive until 2010.

Scientia from AMDZone said...

enumae

"What creates the bottleneck on an Intel platform while running SPEC (rate metric)?"

What do you mean? Using all four cores is considerably more powerful than just using one.

"Would running a single instance of SPEC (speed metric) eliminate that bottleneck?"

No, you would do a lot less work. This would be like using a class I hitch on a full one ton truck.

"If that bottleneck is potentially removed (QPI & IMC), what does that allow the faster core (speed metric) on the Intel platform to do when running multiple copies (rate metric)?"

Again, we'll have to see because SPEC probably won't benefit from SMT. It sounds though like you are trying to say that SPEC is primarily limited by memory bandwidth. This would be more likely for Integer than FP but Intel leads on SPECint_rate.

If I were to guess I might think that the extra cycle to access the FP units might be a factor. Has this delay been removed with Nehalem? K10 and Penryn also have different load/store ratios so Penryn works best with a 1:1 ratio while AMD works best with 3:2. Doesn't Nehalem still have the same load/store ratio as Penryn?

"This is why I look at the speed metric (no bottleneck) and not rate metric (bottlenecked) when comparing Nehalem to Penryn to Barcelona."

Are you talking about speed with just one core? This would be like claiming that you have the best V8 engine as long as it is only running on 2 cylinders.

enumae said...

Scientia

1. What creates the bottleneck on an Intel platform while running SPEC (rate metric)?

e.g. FSB, i/o...

2. Would running a single instance of SPEC (speed metric) eliminate that bottleneck?

Let's start here.

Scientia from AMDZone said...

enumae

I'm not sure what you are trying to say. If the SPECfp_rate were bottlenecked because of memory bandwidth then wouldn't the SPECint_rate be also? Wouldn't Penryn then do worse on both benchmarks?

enumae said...

I am trying to explain my point, but I need you to answer the questions.

Pop Catalin Sever said...

"How do you figure 4X? As I recall Nehalem will use 3 memory channels for the server chips. That would be 50%. And, you get a bit more bandwidth (at the expense of latency) by moving to DDR3. This wouldn't add up to 4X though."

It's 2 socket memory bandwidth, the reference is this
slide

Ho Ho said...

scientia
"Again, you are ignoring the fact that events cannot be related to actual instructions"

I'm not ignoring it because you can count individual instructions. Well, at least on the Intel machines I've used, haven't had the pleasure of trying it on AMD CPUs.


"The original point that Pop made was that these were advantages for Nehalem; they cannot be advantages if they are the same as K10."

My point was that Nehalem will be a much bigger upgrade over Penryn than Shanghai is over Barcelona. Basically with Nehalem Intel architecture scales, say 100% better in 4P setup over Penryn whereas Shanghai will be much less of an upgrade over Barcelona. Same with memory latency/bandwidth upgrade. Compared to their older brothers Nehalem has much bigger upgrades than Shanghai.


"I see but you are ignoring that the Shanghai core is also improved."

Continuing my last paragraph, what has had bigger improvements, Shanghai over Barcelona or Nehalem over Penryn?


"It's more a function of inter-socket communication (which is why Opteron stalls at 4-way with HT 1.0). DC 2.0 gives 8 HT links compared to the three that AMD has with DC 1.0. Secondly, these links are about 50% faster. That is pretty good connectivity."

Now compare that to what Nehalem will have and see the relative improvement. Penryn with its pretty pathetic connectivity does relatively acceptably on 2P already, imagine what happens with much improved platform.


"That would be remarkable indeed if they can pull off a 30% increase in IPC without SMT."

Quoting yourself, "Why would you run a single threaded benchmark on a quad core processor?"


"As I recall Nehalem will use 3 memory channels for the server chips. That would be 50%."

You forgot to take real-world FSB throughput into account. It isn't too close to its theoretical throughput.


"Again, we'll have to see because SPEC probably won't benefit from SMT."

I disagree


"If the SPECfp_rate were bottlenecked because of memory bandwidth then wouldn't the SPECint_rate be also?"

Why should it be? They are different program sets with different memory requirements, after all. Basically your question is the same when I replace SPECfp_rate with liquid simulation and spec_int with superpi; they are not directly comparable.

enumae said...

Scientia

Can you answer these two questions?

1. What creates the bottleneck on an Intel platform while running SPEC (rate metric)?

e.g. FSB, i/o...

2. Would running a single instance of SPEC (speed metric) eliminate that bottleneck?

Scientia from AMDZone said...

enumae

One of your assertions is simply wrong. You claim that the SPEC cpu2006 benchmarks measure system level performance. This is incorrect. As SPEC says

"SPEC designed CPU2006 to provide a comparative measure of compute-intensive performance across the widest practical range of hardware using workloads developed from real user applications."

Your description would instead describe SPECsfs2008:

"SPECsfs2008 is a system-level benchmark that heavily exercises CPU, mass storage and network components. The greatest emphasis is on I/O, especially as it relates to operating and file system software."

If you really want to discuss the SPEC results in detail I suppose we can but that will take time.

Scientia from AMDZone said...

Ho Ho

"I'm not ignoring it because you can count individual instructions."

Of course you can count instructions with the older method. However, my statement had nothing to do with counting instructions.

"My point was that Nehalem will be a much bigger upgrade over Penryn than Shanghai is over Barcelona."

Which again has nothing to do with the original point.

"Continuing my last paragraph, what has had bigger improvements, Shanghai over Barcelona or Nehalem over Penryn?"

Which for the third time has nothing to do with my point.

Let me make this simpler for you. You have two starting points: A1 and I1. A1 moves to A2 and I1 moves to I2. You are attempting to assert that:

If (I2 - I1) > (A2 - A1) then I2 > A2

This is not valid. You need to prove some additional items to make this necessarily true.
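A single counterexample is enough to show the inference fails. The scores below are made up purely for illustration and are not benchmark results:

```python
# Hypothetical scores chosen only to illustrate the logic; not benchmark data.
a1, a2 = 100, 110   # A improves by 10
i1, i2 = 70, 100    # I improves by 30

bigger_gain = (i2 - i1) > (a2 - a1)   # I improved more than A
ends_ahead = i2 > a2                  # but does I finish ahead of A?

print(bigger_gain, ends_ahead)        # prints: True False
```

One additional item that would make the conclusion necessarily true is I1 >= A1: a larger gain from an equal or higher starting point does guarantee a higher finish.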

You've now created three separate strawman arguments which you are working to demolish. However, disproving these ideas that you've invented in no way disproves anything I've said. Are you:

1.) having an argument with yourself?
2.) knowingly and dishonestly creating strawmen because they are easier to disprove?
3.) unable to follow what I actually say?


"Now compare that to what Nehalem will have and see the relative improvement."

With DC 2.0 AMD will have improved 4.3X since the original HT 1.0 and DC 1.0 spec. Now, while Intel seems to work just fine with 4-way I simply don't know what the comparative Penryn metric would be. Unless you can quantify the increase from Penryn to Nehalem I'm not sure what point you are trying to make.

Quoting yourself, "Why would you run a single threaded benchmark on a quad core processor?"

Let me see if I am understanding you. You want to run single threaded on one core when it benefits, run 2 threads on one core when it benefits, run one thread per core on all four cores when it benefits, and run two threads per core using SMT on all cores when it benefits. Then pick among each result and claim the best of each as Intel's new top metrics. I'm not sure what you would accomplish by doing this other than having the fastest scores on paper.

"I disagree"

Your link doesn't support your claim. If you actually compare you'll discover that Sparc is nearly identical to C2D on a core to core basis. The massive threading doesn't seem to be any benefit. Again we will have to see if it does anything for Nehalem.

Scientia from AMDZone said...

Pop Catalin Sever

"It's 2 socket memory bandwidth, the reference is this
slide"


I can appreciate Intel's marketing attempts to put the best possible spin on the increase. However, you can look at the Skulltrail diagram and see that Intel currently has 4 memory channels of 800Mhz apiece for 2 sockets. Comparing that with 3 channels of 1333Mhz apiece per socket (6 channels across 2 sockets) gives 2.5X. I'm not sure what math Intel is using.
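For anyone who wants to check the 2.5X figure, here is a rough sketch. It assumes 3 channels per socket on Nehalem (6 across 2 sockets) and compares theoretical peaks as channels times speed; real-world throughput will differ:

```python
# Peak two-socket memory bandwidth in relative units (channels x speed).
# Assumption: Nehalem has 3 channels per socket, so 6 across 2 sockets.
current_channels, current_rate = 4, 800        # Skulltrail: 4 channels total
nehalem_channels, nehalem_rate = 3 * 2, 1333   # 3 per socket x 2 sockets

ratio = (nehalem_channels * nehalem_rate) / (current_channels * current_rate)
print(f"{ratio:.2f}x")
```

Getting to 4X from these numbers would require some extra assumption, such as comparing against effective rather than theoretical FSB throughput.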

Scientia from AMDZone said...

Pop

In all fairness though you could make this assertion for 4-way since Intel's Caneland only has 4 memory channels for all four sockets. Of course when I made the same assertion last year that Tigerton would be memory bound several Intel fans threw a fit. So, you are saying that I was right then?

Also, if Caneland is currently bottlenecked then how badly is Dunnington going to get slammed with 50% more cores on the already choked memory channels?

enumae & ho ho

Ars Technica does seem to agree with you that SPECint is not memory bound:

"It's also the case that Tigerton/Caneland's integer performance is rock solid and stands up well to the competition. (This is why Intel touted the specint numbers in its forthcoming press release.) Not only are most integer workloads not memory-bound, but they also love Tigerton's 8MB (2x4MB) L2 cache. These factors are behind Tigerton's very good specint_rate scores (178/214 vs. 105/114 for Opteron), and the great TPC-C and SAP that Intel is touting today."

This does raise the question of whether Nehalem will fare as well with less cache; however, it does support your assertion that a good score in SPECint does not show good memory bandwidth, so we'll assume that is the case.

Again, more detailed comparison of SPECfp will take time. Perhaps I'll see if I can detect a dropoff with increasing clock; this would suggest a higher score if dropoff did not occur.

enumae said...

Scientia
One of your assertions is simply wrong. You claim that the SPEC cpu2006 benchmarks measure system level performance...


They also say...

"Q5. What does SPEC CPU2006 measure?

SPEC CPU2006 focuses on compute intensive performance, which means these benchmarks emphasize the performance of:

* the computer processor (CPU),
* the memory architecture, and
* the compilers.

It is important to remember the contribution of the latter two components. SPEC CPU performance intentionally depends on more than just the processor.

SPEC CPU2006 contains two components that focus on two different types of compute intensive performance:

* The CINT2006 suite measures compute-intensive integer performance, and
* The CFP2006 suite measures compute-intensive floating point performance.

SPEC CPU2006 is not intended to stress other computer components such as networking, the operating system, graphics, or the I/O system. For single-CPU tests, the effects from such components on SPEC CPU2006 performance are usually minor. For large rate runs, operating system services may affect performance, and the I/O system - number of disks, speed, striping - can have an effect. Note that there are many other SPEC benchmarks, including benchmarks that specifically focus on graphics, distributed Java computing, webservers, and network file systems."

----------------------------------

Scientia
Ars Technica does seem to agree with you that SPECint is not memory bound...


Please stop. I did not claim that SPECint does not have memory-bound tests. Please look at pages 20 and 21 of this PDF Bandwidth and Latency Challenges.

If you want to believe that SPECrate is not a platform test, fine, but ask yourself this: why doesn't Intel scale as well in SPECrate as AMD?

Could it be the platform?

Ho Ho said...

scientia
"However, my statement had nothing to do with counting instructions."

If you meant "events cannot be related to actual instructions" then you are still wrong. You can get 1:1 mapping of events and instructions.

Scientia from AMDZone said...

Ho Ho

"You can get 1:1 mapping of events and instructions."

You seem to disagree with the factory: AMD paper on IBS, page 2, section 3.

"It is difficult to relate a hardware event to the instruction that triggered it because the restart address is not the location after the trigger instruction. Contemporary superscalar machines such as AMD quad-core processors use out-of-order execution to exploit instruction-level parallelism. Up to 72 execution operations may be in-flight at any time. Due to operation reordering and in-order instruction retirement, the sampling interrupt triggered by an execution event may be significantly delayed. The delay is indeterminate and is not fixed. The reporting delay is called “skid.” Due to skid, the reported IP value is only in the general neighborhood of the instruction causing the event and may be up to 72 instructions away.

Inaccuracies due to skid accumulate as the program profile is built up. Events that belong to a single instruction are attributed to instructions throughout the neighborhood of the culprit instruction. The ability to isolate a performance issue to any single instruction is lost."


Since Intel hardware operates similarly with many instructions in flight and instruction reordering the above description applies to Intel as well. If you are still claiming otherwise then please cite a technical reference.

Scientia from AMDZone said...

enumae

That is a good link because you have actual data from Intel so we don't have to worry about the accuracy.

It indicates on page 20 that 5 out of 12 tests on SPECint scale close to 100% when going from dual core to quad core while the remaining 7 tests seem to average about 55% or about 73% for the whole.

For SPECfp 5 out of 17 scale close to 100% while the remaining 12 tests seem to average about 36% or about 54% for the whole.
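The "whole" figures are just averages across the tests, with each test weighted equally. A small sketch of that calculation, using the approximate values read off the graphs (the function name is illustrative):

```python
def overall_scaling(n_full, n_partial, partial_pct, full_pct=100.0):
    """Average scaling across all tests, weighting each test equally."""
    total = n_full * full_pct + n_partial * partial_pct
    return total / (n_full + n_partial)

specint = overall_scaling(5, 7, 55)    # (5*100 + 7*55) / 12
specfp = overall_scaling(5, 12, 36)    # (5*100 + 12*36) / 17

print(f"SPECint: {specint:.1f}%  SPECfp: {specfp:.1f}%")
```

Since the inputs are eyeballed from charts, the results should be treated as ballpark numbers that land close to the 73% and 54% quoted above.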

The graphs on page 21 do indeed fall close to the expected results if memory is the limiting factor. So, we can clearly say that both SPECint and SPECfp are bandwidth limited for Intel on dual socket with 3.2Ghz quad cores.

Okay, so back to the original question which is about how much of an increase Intel will see when using 3 channels of memory with IMC. From these graphs it appears that Intel needs 46% more bandwidth than they can get from the 1333Mhz FSB. You could get this from 2 channels of DDR2/3-1066. So, it looks like the 3rd channel won't be needed until Intel goes above 4 cores per die.

Scientia from AMDZone said...

AMD has a press event late tonight at Computex. I suppose if they say anything particularly interesting I'll start a new article.

Mo said...

What do you think Sci

http://anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3326

Scientia from AMDZone said...

mo

What do I think of Anandtech's preview of Nehalem? Well, the testing was pretty bad. I've seen indications of more than 10% error in Anand's numbers. So, if we allow for this we would still come up with something like a 15-20% increase rather than, say, a 30% increase. I'm sure we'll have more reviews in Q3 and Q4.

I assume the comparison that everyone is waiting for is Nehalem versus Shanghai. From what I've seen though Shanghai is going to need 1066 memory, particularly if AMD is planning at least 2.6Ghz release speed. Once we have a proper Nehalem/Shanghai comparison and know what the release speeds will be we'll know how things stack up.

I also wonder how long it will take before we see a review of the desktop version of Nehalem which is presumably what most people would be buying. Are these going to be available before 2009?

Scientia from AMDZone said...

Is Cinebench a typical benchmark? If I recall this is one that Intel enthusiasts usually point to because it shows a better than 15% advantage for C2D over Barcelona with Penryn scoring even higher.

However, when I look at the numbers that Anand reports I get a 16% increase for Nehalem with multithreading but only get an 8% increase for single threaded over Penryn. Is this what was expected?

Khorgano said...

The only single threaded bench was a Cinebench with scores of

Nehalem: 3015
Penryn: 2396

Which is a 25% increase. How do you calculate only 8% single threaded?
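For reference, the percentage from the two scores as listed works out like this (a quick sketch, nothing more):

```python
# Scores exactly as listed above in this comment.
nehalem, penryn = 3015, 2396

increase_pct = (nehalem - penryn) / penryn * 100
print(f"{increase_pct:.1f}%")
```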

Also, what benches are you referring to that have a 10% margin of error?

Khorgano said...

Never mind my last comment about the single threaded Cinebench scores. It looks like Anand had a typo or something; he's updated it and is now showing a 2% increase on the Nehalem system, so basically parity in this benchmark.

SPARKS said...

“looks like Scientia has banned everyone on his latest blog.”

DOC-

He hasn’t banned me,---yet.
Nor, will I give him the opportunity.
He knows I am no INTC paid liar.

I absolutely refuse to be baited into an argument, as you saw with my last series of posts concerning the performance of QX9770 with that ridiculous Prime95, half hour challenge.

In fact, I was so disappointed in his response to the challenge he suggested, I will never again waste my time posting anything there again.

He couldn’t bring himself to answer my challenge:


“I’m not bragging Sci, it’s just a simple fact that this is one terrific piece of hardware.

Will you admit that?

Finally, does THIS tell you something?”


Frankly DOC, the Nehalem/Anand benchmark issue is a steaming pile of horseshit. Perhaps we can get a general idea on the performance, but to draw conclusions based on substandard immature hardware is complete nonsense.

While the Anand article was a rushed preview to make headlines and sell soap, in contrast, I gave Sci real world benchmarks on 100% retail purchased products. His last response was far less than objective.

As with anything in this life, you put your money where your mouth is, I did. It’s the difference between a wimp who talks tough, talks about the hot cars and hot women, and the guy who’s out there, living on the edge.

Living on the edge ain’t easy, and it’s gonna cost ya. There’s always some asshole that’s gonna put you down, one way or another.

I’ll stay here in the “locker room” with you.

Sincerely,

SPARKS

Scientia from AMDZone said...

sparks

"looks like Scientia has banned everyone on his latest blog."

I'm sorry; I have no idea what this refers to. It must be something you copied off roborat's blog and I'm already familiar with the quality of posts there.

"He hasn’t banned me,---yet.
Nor, will I give him the opportunity. He knows I am no INTC paid liar."


Again, banned? What are you talking about?

"I absolutely refuse to be baited into an argument, as you saw with my last series of posts concerning the performance of QX9770 with that ridiculous Prime95, half hour challenge."

Again, I'm not sure what you are referring to. You were bragging about your overclock so I asked if you had run Prime95 on all cores. Your response was:

"It would not run Prime95 longer than 5 minutes at 4.0. The Zalman 9700 cannot keep up with heat with CPU loaded in Prime95. I suspect there would be no problem with water cooling at 4.0 GHz."

I'm sorry but I'm not impressed with water cooling. I've seen it claimed over and over and over that Penryn draws very little power. If you have to resort to water cooling then you've proven otherwise.

"In fact, I was so disappointed in his response to the challenge he suggested, I will never again waste my time posting anything there again."

Suit yourself.

"He couldn’t bring himself to answer my challenge:

“I’m not bragging Sci, it’s just a simple fact that this is one terrific piece of hardware.

Will you admit that?"


That was a challenge of some kind? This seems to me more like an issue of common sense than any kind of challenge. You are apparently stretching your claim that Penryn can be reasonably overclocked to 4.0Ghz. And, because I have reservations about that you think I am ducking some kind of challenge? You need to acquaint yourself with reality instead:

Today, AMD enthusiasts are talking about overclocking Black Edition chips to 3.0Ghz while I have doubts that Penryn can be routinely overclocked to 4.0Ghz. Even if 4.0Ghz is too high and the real number is something less like 3.6Ghz this would still be significantly ahead of Barcelona in terms of overclocking.

"While the Anand article was a rushed preview to make headlines and sell soap, in contrast, I gave Sci real world benchmarks on 100% retail purchased products."

Again, you are apparently quoting part of a conversation from roborat's blog. Are you referring to the link you gave? In that article he said:

"400 x 11 at 1.50 volts came easy as well but the required voltage was more than my poor water cooling setup could handle under load."

Again, I am not impressed by water cooling. And, in what way does this refute my suggestion that Intel would be able to clock higher with quad core if the socket were able to handle more power?