Scientia's Blog: Updates And Old Patterns

Monday, April 14, 2008

Updates And Old Patterns

Amid AMD's torrent of bad news (the exit of Phil Hester, the reduced estimates for Q1 earnings, and the announced 10% layoffs), we can at least give AMD a small amount of praise for getting the B3 stepping out the door. It's a small step on a long road.

We can finally settle the question of whether AMD's 65nm process is broken. AMD's fastest 65 watt, 90nm K8 runs 2.6Ghz while the fastest 65 watt, 65nm K8 runs 2.7Ghz. So, the 65nm process is at least a little better than the old 90nm process. People still keep clamoring for Ruiz to step down. Frankly, I doubt Ruiz had any direct involvement with K10's design or development so I'm not sure what this would accomplish. I think a better strategy would be for AMD to get the 2.6Ghz 9950 out the door as soon as possible and try hard to deliver at least a 2.6Ghz Shanghai in Q3. Since Shanghai has slightly higher IPC a 2.6Ghz model should be as fast or faster than a 2.7Ghz Barcelona. I would say that AMD needs at least that this year although this would leave Intel with the top three slots.
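To put a rough number on that last claim: if performance is modeled simply as IPC times clock (a simplification that ignores memory and platform effects), the sketch below computes how much extra IPC a 2.6Ghz Shanghai would need to match a 2.7Ghz Barcelona. The function name is illustrative and the model is an assumption, not a measurement.

```python
def required_ipc_gain(target_clock_ghz: float, new_clock_ghz: float) -> float:
    """Fractional IPC gain the lower-clocked part needs to match the
    higher-clocked one, assuming performance = IPC * clock."""
    return target_clock_ghz / new_clock_ghz - 1.0


gain = required_ipc_gain(2.7, 2.6)
print(f"Required IPC gain: {gain:.1%}")
```

Under this model the required gain is a little under 4%, so even a modest IPC improvement would be enough.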

AMD's current strategy seems to recognize that they are not competitive at the top and won't get there soon. The collection of quads without L3 cache, Tri-core processors, and the current crop of low priced quads including the 9850 Black Edition all point to a low end strategy. This is the same pattern AMD fell into back in 2002 when it devalued its K7 processors. Of course in 2002 AMD didn't have competitive mobile and its only server processors were Athlon MP. So perhaps Puma and a genuine volume of B3 Opterons will help. AMD's excellent 7xx series chipset should help as well but apparently not enough to get back into profitability without layoffs.

The faster B3 steppings are an improvement but you get the feeling they should have been here last year. You get a similar feeling when Intel talks about the next Itanium release. Although Itanium began with hope as a new generation architecture its perpetual delays keep that feeling well suppressed. And, one has to wonder how much of Itanium will be left standing when Intel implements AVX in 2010. We all know that HP is the only thing holding up Itanium at this point. Any reduction in support by HP will be the end of Itanium. And, we get a similar feeling about Intel's 4-way offerings which always seem to lag nearly a year behind everything else. For example, although Nehalem will be released in late 2008 the 4-way version of Nehalem won't come out until late 2009. Some still speculate that this difference is purely artificial and Intel's way of giving Itanium some breathing room.

However, as bad as AVX might be for Itanium it has to be a double shock for AMD coming not long after the announcement of SSE5. AVX seeks to copy SSE5's 3 and 4 operand instructions while bumping the data width all the way up to 256 bits. It looks like AMD only has two choices at this point. They can either drop SSE5 and adopt both SSE4 and AVX or they can continue with SSE5 and try to extend with GPU instructions. Following AVX would be safer but would put AMD behind since it is unlikely at this point that they could optimize Bulldozer for AVX. Sticking with SSE5 and adding GPU extensions would be a braver move but could work out better if AMD has its Fusion ducks in a row. Either way, Intel's decision is likely to fuel speculation that Larrabee's architecture isn't strong enough for its own Fusion type design. Really though it is tough to say at this point since stream type processing is just beginning to take off. However, GPU processing does demonstrate sheer brute power on Folding @ Home protein sampling. This makes one wonder why OC'ers in particular cling to the use of SuperPi which is based on quaint but outdated x87 instructions as a comparative benchmark.

There is also the question of where memory is headed. Intel is already running into this limitation with Nehalem where only the server and top end chips will get three memory channels. I'm sure Intel had intended that the mid desktop would get three as well, but without FBDIMM three channels would make the motherboards too expensive. This really doesn't leave Intel anywhere to go to get more bandwidth. Supposedly, AMD will begin shifting to G3MX which should easily allow it to match three or even four channels. However, it isn't clear at this point if AMD intends G3MX on the desktop or just the servers and high end like Intel. With extra speed from DDR3 this shift probably doesn't have to happen in 2009 but something like this seems inevitable by 2010.

36 comments:

Scientia from AMDZone said...: I guess we'll find out on the 17th how bad this quarter was. Given the delay of K10 and the fact that AMD never bothered to make a 65nm version of K8 Opteron I guess I'm not exactly surprised. The volume of server processors in Q1 must have been down considerably.; April 14, 2008 6:29 PM
Unknown said...: 1)What probabilities does Shanghai have to include SSE5 instructions?

2) About AVX, why didn't AMD thought of SSE5 to be 256 bit wide instead of being 128 bit? Why wait for Intel to do this move instead of taking the initiative? Sometimes I have the feeling that AMD is short sighted. :/; April 14, 2008 7:19 PM
Polonium210 said...: G3MX is for servers IIRC-there is no space on an ATX MB to hold more than four DIMMS and this is all about large RAM and cost.

I think it would be wise of AMD to ramp-up 45nm production ASAP. Shanghai and Montreal are essential to the company's recovery and quite frankly Barcelona is too late to be of much value in this respect. I also think they would do well to produce a native dual core 45nm processor with 1MB or more L2 cache per core for the client space. The L3 cache seems to be of limited value on the DT so they could probably afford to omit it-for the moment.; April 14, 2008 7:58 PM
Bradley Coleman said...: the new graphics chip looks sweet, and the next one seems to be coming well according to fudzilla.

puma has 100 design wins, which seems to be awesome.

and the 780G/45nm dual core seems to be an unbeatable sweet spot platform.; April 14, 2008 9:50 PM
Bradley Coleman said...: erlindo said, "AMD is short sighted"

really, amd's done tons of innovation.

first to dual core. fusion. 64 bit. DTX. and more, which i can't remember. whats 128 bits here and there?; April 14, 2008 9:52 PM
Mark S. said...: Erlindo,

I think Intel is being shortsighted for making AVX only 256-bits. Why not 512 or even 1024? Any number less than something I pull out of my patootie must be short sighted!

Seriously, though, once again it is Intel responding to AMD's innovation (SSE5) instead of vice versa. AMD is trying to come up with genuine solutions to computational needs, whereas Intel once again tries to grab headlines and mindshare by throwing out vague promises of future performance by pronouncements of having bigger numbers.

AMD's problem for the past two years has been a failure to execute. Heads must roll, as they are, and then AMD needs to just hunker down and execute, execute, execute. That's all that matters at this point, because no one's going to believe anything they say without concrete proof.

Cheers!; April 14, 2008 10:14 PM
Scientia from AMDZone said...: Erlindo

"What probabilities does Shanghai have to include SSE5 instructions?"

When AMD released the SSE5 spec last Fall they specifically said it would show up in Bulldozer. This makes it unlikely that AMD could get even a first version operational before mid 2009.

"About AVX, why didn't AMD thought of SSE5 to be 256 bit wide instead of being 128 bit?"

256 bit wide operations doesn't buy you much without having internal 256 bit data paths. My guess is that AMD had intended to put wide workloads like this off on the GPU because this is what they are good at.

Polonium

There is more room than that on ATX. Still, I would have to wonder how much adding the MX chips would affect cost. For example, could you do a high end board with 3 MX chips and 6 DIMMs or could you do a mainstream board with 3 MX chips and only 3 DIMMs? I would wonder as well if using 2 MX chips with 2 DIMMs each would gain any bandwidth.; April 14, 2008 11:27 PM
Unknown said...: Bradley Coleman wrote:

really, amd's done tons of innovation.

first to dual core. fusion. 64 bit. DTX. and more, which i can't remember. whats 128 bits here and there?

I know AMD made lots of improvements to their micro-architectures, but what I meant to say is that AMD needs to be ahead of whatever intel is planning just for sake of being competitive (technologically speaking). I wonder if you've read The Art of War by SunTzu. ;)

Anyhow, speaking about why SSE5 is not 256 bit wide from start, is because it's not an easy task to do since it requires a total overhaul of the processor (according to what I understood from Scientia). Still, it should have been AMD's work to foresee this. (They still can...) ;); April 15, 2008 12:02 AM
Polonium210 said...: @Scientia
Sorry, my post was ambiguous. When I mentioned cost I meant that G3MX was designed to enable the use of unregistered DIMMs instead of more expensive registered ones. It is also designed to increase the total system RAM. I do not know of any applications on the DT that require as much RAM as a server load so I just don't see G3MX being implemented on the DT. As for getting more than four DIMMs on an ATX MB, if you look at the component population and expansion slots, I doubt that you could. I for one haven't seen an ATX MB with more than four DIMMs. I suppose you could dispense with the expansion slots!; April 15, 2008 1:04 AM
Scientia from AMDZone said...: polonium

You can fit 4 DIMM slots on a micro-ATX board so you should be able to fit 6 on ATX.; April 15, 2008 3:03 AM
Ho Ho said...: scientia
"We can finally settle the question of whether AMD's 65nm process is broken. AMD's fastest 65 watt, 90nm K8 runs 2.6Ghz while the fastest 65 watt, 65nm K8 runs 2.7Ghz"

Ok, but what if AMD chose not to release faster 65W 90nm CPUs? They've chosen to do all kinds of weird stuff before, why not now? It doesn't want to compete with itself, after all. Considering how many 90nm upgrades there have been after that 2.6GHz 90nm CPU (released over a year ago) they surely could have released a higher clocking 90nm dualcore.

"The faster B3 steppings are an improvement but you get the feeling they should have been here last year."

They are only faster if you compare against patched system.

"AVX seeks to copy SSE5's 3 and 4 operand instructions while bumping the data width all the way up to 256 bits.
...
Following AVX would be safer but would put AMD behind since it is unlikely at this point that they could optimize Bulldozer for AVX."

So you are saying that even though Bulldozer won't be here for years it is difficult to adapt to AVX but it is trivial for Intel to copy SSE5 with its 2010 CPUs? I thought that it takes around four years of development for such big changes so neither could have copied it from each other, it's just the way industry moves: more SIMD instructions. The fact they had a bit different solutions only proves that they aren't copying from each other.

"Either way, Intel's decision is likely to fuel speculation that Larrabee's architecture isn't strong enough for its own Fusion type design."

How does AVX show anything about Larrabee? It should have a different vector instruction set compared to other x86 CPUs, at least that was said some time ago when Intel was describing it. Do you have any other details I've missed?

"This makes one wonder why OC'ers in particular cling to the use of SuperPi which is based on quaint but outdated x87 instructions as a comparative benchmark."

And how is F@H that much better? At least for GPUs it benchmarks only a tiny subset of GPUs. For CPUs it has been used before by some sites but as it takes quite long time and often people don't have that they don't use it as much.

"Intel is already running into this limitation with Nehalem where only the server and top end chips will get three memory channels"

Do you think two channels of DDR3 isn't enough for normal users? Even dual channel DDR2 on a 1066MHz FSB on quad cores is enough for most todays needs, Nehalem with its IMC will easily double that, if not more.; April 15, 2008 3:36 AM
Ho Ho said...: erlindo
"1)What probabilities does Shanghai have to include SSE5 instructions?"

Zero

"2) About AVX, why didn't AMD thought of SSE5 to be 256 bit wide instead of being 128 bit?"

It is a bit easier to work with 128bit than with 256bit (2x2 vs 2x4 quads). Of course if you are smart, plan ahead and can make your algorithms work nicely for 256bit then it would be quite a bit better (roughly half the amount of instructions for same amount of work). Possibly AMD just didn't want to make things more difficult for developers, 128bit is good enough, even though 256bit would offer a bit more performance.

bradley coleman
"first to dual core. fusion. 64 bit. DTX. and more, which i can't remember."

How is DTX living? It was supposed to be something that would conquer the world but I haven't seen or heard much about it, even newegg is pretty much empty.

mark s
"I think Intel is being shortsighted for making AVX only 256-bits. Why not 512 or even 1024?"

I too would have liked 512bit much better as I could have used quads of 4x4 floats to do calculations. Though probably it would have gotten a bit too big and complex for their processing tech and they chose 256bit as sweet spot between performance and complexity.

"AMD is trying to come up with genuine solutions to computational needs, whereas Intel once again tries to grab headlines and mindshare by throwing out vague promises of future performance by pronouncements of having bigger numbers."

In the end all that matters is what sells better and what clients really need. There is no stopping for AMD to improve their stuff, you can't blame others for doing things better than you are doing.

scientia
"256 bit wide operations doesn't buy you much without having internal 256 bit data paths."

I haven't actually checked if Intel CPUs are going to have double cache bandwidth to match 256bit SIMD or not but the non-destructive nature of the instructions should help a bit with the bandwidth concerns. Though I do wish they would have doubled the register count to 32.

"You can fit 4 DIMM slots on a micro-ATX board so you should be able to fit 6 on ATX."

Is 24GB/s theoretical peak bandwidth too little for desktop? I don't think so and thus dual channel DDR3 should be good enough for a long time. Also capacity isn't much of a problem, 8GB should also work for a long time.

Long story short, I don't see G3MX having any benefits on desktop machines. It isn't cheaper and it isn't less complex than directly using two channels of RAM.; April 15, 2008 3:42 AM
Pop Catalin Sever said...: 256 bit wide instructions don't necessarily mean higher performance (2x or close), it all depends on the hardware pipeline width. Intel might implement 256 bit wide sse (AVX) over 128 bit wide hardware pipeline, just like SSE that was implemented over 64 bit wide pipeline prior to Core 2, Barcelona, using instruction breakup.

Also AMD could easily provide an AVX compatible instruction set even if it doesn't use a 256 bit wide hardware.

Basically AMD is free to implement AVX in any upcoming architecture (even with 128 bit wide SSE units), but the question is once again if they will manage to make a better performing or worse performing implementation of AVX.; April 15, 2008 8:34 AM
Ho Ho said...: Even with two-pass SIMD it might be faster thanks to less instructions needed to be loaded-decoded.; April 15, 2008 10:03 AM
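The trade-off raised in the last few comments can be sketched with a toy model. It counts instructions decoded and micro-ops executed for a fixed number of 32-bit floats, assuming a wide instruction is split into datapath-sized micro-ops when the hardware is narrower than the instruction. The function name and the one-instruction-per-vector model are illustrative assumptions; real decoders and schedulers are far more complicated.

```python
import math


def simd_cost(num_floats: int, instr_bits: int, datapath_bits: int) -> tuple[int, int]:
    """Return (instructions decoded, micro-ops executed) for one pass over the data."""
    floats_per_instr = instr_bits // 32
    instructions = math.ceil(num_floats / floats_per_instr)
    # A wide instruction on narrower hardware is broken into several micro-ops.
    uops_per_instr = max(1, math.ceil(instr_bits / datapath_bits))
    return instructions, instructions * uops_per_instr


for label, instr_bits, datapath_bits in [
    ("128-bit SSE on 128-bit hardware", 128, 128),
    ("256-bit AVX on 128-bit hardware", 256, 128),
    ("256-bit AVX on 256-bit hardware", 256, 256),
]:
    decoded, executed = simd_cost(1024, instr_bits, datapath_bits)
    print(f"{label}: {decoded} decoded, {executed} micro-ops")
```

In this model the two-pass case halves the decode work but not the execution work, so any gain would come from the front end rather than from the SIMD units themselves.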
Scientia from AMDZone said...: Ho Ho

"Ok, but what if AMD chose not to release faster 65W 90nm CPUs?"

It is clear from the TDP that neither the 90nm nor 65nm process could go above 3.2Ghz without exceeding 130 watts.

"They are only faster if you compare against patched system."

Actually, 2.4 and 2.5Ghz are faster than 2.3Ghz, patched or otherwise.

"So you are saying that even though Bulldozer won't be here for years it is difficult to adapt to AVX but it is trivial for Intel to copy SSE5 with its 2010 CPUs?"

AMD would barely have 18 months whereas assuming Intel started with the SSE5 announcement they would have 36 months.

"The fact they had a bit different solutions only proves that they aren't copying from each other."

It is possible that Intel's similarity of AVX is a coincidence but I think unlikely. For example, AMD copied the popcnt instruction from Itanium so Intel added it to SSE4. Both actions were related.

"And how is F@H that much better?"

I wasn't suggesting using F@H as a benchmark. How about a less ancient version of SuperPi with SSE3 instructions and maybe one with GPU instructions?

"Do you think two channels of DDR3 isn't enough for normal users?"

It is right now; I'm not sure about in 2010.

"I don't see G3MX having any benefits on desktop machines. It isn't cheaper and it isn't less complex than directly using two channels of RAM."

Intel is planning to use 3 channels on its top desktop systems with Nehalem.; April 15, 2008 3:01 PM
Scientia from AMDZone said...: AVX over the existing 128 bit pathways wouldn't gain much speed. Also, in K8 the 128 instructions were issued as two 64 bit micro-ops which meant that it took the same amount of decoding time as two 64 bit instructions would have.

It could be easier for AMD to do 256 bit operations on the GPU. However, the GPU would have to be beefed up some and AMD would have to think of some way to tightly couple the instructions rather than implementing GPU as an external device. In other words, the GPU would have to have some connection to the actual instruction pipeline and that may not be possible unless AMD has already been working on it.; April 15, 2008 3:14 PM
Polonium210 said...: OK Scientia, it looks like you CAN fit 6 DIMMs on an ATX MB-it'll be a tight fit though and I'd like to see someone try it. I have no idea how big the G3MX chips are so there may not be enough room to fit those as well. I should caution that some MB manufacturers call a 305mm x 224mm MB an ATX and I am using the Standard ATX definition of 305mm x 244mm. Better to use EATX though-305mm x 330mm.; April 15, 2008 7:37 PM
Ho Ho said...: scientia
"It is clear from the TDP that neither the 90nm nor 65nm process could go above 3.2Ghz without exceeding 130 watts."

Are you saying that even though their 65nm can get 100MHz more on 65nm they will still top out at same speed at 125W? That isn't all that much improvement I'd say.

"Actually, 2.4 and 2.5Ghz are faster than 2.3Ghz, patched or otherwise."

Ah, I see, I thought you were saying that B3 magically got higher IPC than B2.

"AMD would barely have 18 months whereas assuming Intel started with the SSE5 announcement they would have 36 months."

36 months is still not enough for a major design change I would assume. Intel was talking about its future CPUs quite a long time ago. Remember that slide where they had Larrabee and Sandy Bridge side by side and their FP/memory systems were compared?

Let's assume that indeed Intel copied AMD SSE5. Is it bad that they took the idea and improved it further?

"How about a less ancient version of SuperPi with SSE3 instructions and maybe one with GPU instructions?"

First someone has to write it, though there are much better Pi calculation programs already out there. If it would be using GPU instructions it wouldn't be CPU benchmark any more.

"It is right now; I'm not sure about in 2010."

Perhaps it is just me but I'm not seeing that in couple of years memory bandwidth requirements triple, definitely not in desktop space. How long time ago it was when we felt that 3GB/s (dual channel DDR1-200/PC1200 RDRAM) wasn't enough? I'd say that you can get almost everything done today with only 3GB/s bandwidth.

"Intel is planning to use 3 channels on its top desktop systems with Nehalem."

Their current top-end has two server quads and FBDIMM, I don't call that practical either.

"It could be easier for AMD to do 256 bit operations on the GPU."

Unless you take latency into account, of course. Also current R6xx GPUs have much bigger effective vectors than 256bits.; April 16, 2008 3:08 AM
Scientia from AMDZone said...: Ho Ho said
"Are you saying that even though their 65nm can get 100MHz more on 65nm they will still top out at same speed at 125W? That isn't all that much improvement I'd say."

I didn't say it was a big improvement; just that it isn't worse than 90nm as has been claimed.

"Ah, I see, I thought you were saying that B3 magically got higher IPC than B2."

No, although the IPC is supposed to increase some with Shanghai.

"36 months is still not enough for a major design change I would assume."

Stop dithering. 36 months is twice as much time as 18 months so obviously Intel could do a lot more.

"Let's assume that indeed Intel copied AMD SSE5. Is it bad that they took the idea and improved it further?"

Well, it's kind of a copy of a copy. 3 and 4 way instructions are common on RISC instruction sets. It would be more accurate to say that AMD's announcement of SSE5 compelled Intel to add similar functionality to their future processors. Their biggest constraint is trying to not overshadow Itanium.

"First someone has to write it, though there are much better Pi calculation programs already out there. If it would be using GPU instructions it wouldn't be CPU benchmark any more."

No, it would be a useful benchmark which is not the case with SuperPi.

"I'd say that you can get almost everything done today with only 3GB/s bandwidth."

Okay, we can figure this. A 2.0Ghz K8 needs one channel of DDR-400 minimum. A quad processor should therefore need 4X DDR-400. Doubling the SSE rate with K10 would then require 8X DDR-400. And, we should be looking at 3.0Ghz so 1.5X more. This would be a minimum of 12X DDR-400. So:

12X DDR-400
6X DDR2-800 therefore:
3 channels of DDR3-1600 or
2 channels of DDR4-2400
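A quick check of that arithmetic, as a sketch: it assumes 64-bit (8-byte) channels and peak theoretical transfer rates, and the function name is only illustrative.

```python
def peak_bandwidth_gbs(transfers_mt_s: int, channels: int, bus_bytes: int = 8) -> float:
    """Peak theoretical bandwidth in GB/s for `channels` channels of the given width."""
    return transfers_mt_s * bus_bytes * channels / 1000


configs = {
    "12X DDR-400": (400, 12),
    "6X DDR2-800": (800, 6),
    "3 channels of DDR3-1600": (1600, 3),
    "2 channels of DDR4-2400": (2400, 2),
}

for name, (rate, channels) in configs.items():
    print(f"{name}: {peak_bandwidth_gbs(rate, channels):.1f} GB/s")
```

Under those assumptions all four configurations come to the same peak figure, which is why they can be treated as equivalent here.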

"Unless you take latency into account, of course. Also current R6xx GPUs have much bigger effective vectors than 256bits."

You could reduce latency if you hooked the GPU into the CPU pipeline. Yes, theoretically you could do more than one VFX operation at a time.; April 16, 2008 5:18 PM
Ho Ho said...: scientia
"I didn't say it was a big improvement; just that it isn't worse than 90nm as has been claimed."

I'd say if you get to a lower tech node and do not increase clocks at same thermals on a die-shrink design then things aren't looking well.

"A quad processor should therefore need 4X DDR-400"

For what tasks it would need that much? Any links you can share where one can see that some CPU is limited by memory bandwidth, preferrably in a desktop scenario as we are talking about desktop CPUs. Btw, intel quads with 1066MHz FSB have effective bandwidth of around 2/3 of that 4x DDR400 and are still doing perfectly fine.

"You could reduce latency if you hooked the GPU into the CPU pipeline."

How would that look like? Something similar to the old addon-x87 that processed every single instruction in parallel with CPU? Fact remains that transferring any instructions out from CPU has huge latency and would only pay off for rather large patches of code and data. If you are talking about a CPU that has some GPU instructions(which ones?) built-in then this is simply a regular CPU with extended instruction set.; April 17, 2008 3:52 AM
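The "around 2/3" figure above can be checked roughly. This sketch assumes a 64-bit front-side bus at 1066 MT/s and 64-bit DDR-400 channels, both at peak theoretical rates; the variable names are illustrative.

```python
BUS_BYTES = 8  # assumed 64-bit width for both the FSB and each DDR channel

fsb_gbs = 1066 * BUS_BYTES / 1000            # 1066 MT/s front-side bus
ddr400_x4_gbs = 4 * 400 * BUS_BYTES / 1000   # four channels of DDR-400

ratio = fsb_gbs / ddr400_x4_gbs
print(f"FSB: {fsb_gbs:.1f} GB/s, 4X DDR-400: {ddr400_x4_gbs:.1f} GB/s, ratio: {ratio:.2f}")
```

Peak numbers say nothing about sustained throughput, so this only confirms the ratio, not whether either configuration is sufficient for a given workload.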
Pop Catalin Sever said...: A very interesting article:

Analysis: AMD Asset Lite strategy will create MAD AMD

I'll post only the conclusion:

"AMD is not going down any time soon and even after the AMD + ATI vs. MAD AMD LLC split, cooperation with IBM, TSMC, Chartered, ANGSTREM will not stop. In fact, it will expand into another alliance, but that is subject of future stories.
Current corporate climate has to change; otherwise AMD will continue to be occasional challenger to industry heavy-weights Intel and Nvidia. This is also one of primary reasons why the deal with Mubadala Abu Dhabi fund was not announced earlier.

One thing is certain: doomsayers claiming that AMD is dead forgot to check the facts. Just like they forgot to check actual facts of Ferrari in 1993, Apple in 1997, Airbus SAS in 2006, Nvidia in 2002, and Microsoft in 2007. This is big business, and big changes do not happen overnight. And success or failure of one product cannot change the destiny of a company.
"

I must say, splitting AMD, along with the new infusion of capital, changes AMD's whole status quite a bit.; April 17, 2008 8:10 AM
Scientia from AMDZone said...: Ho Ho

Right now, the GPU is accessed as a completely separate, external device using library code. You have to specifically transfer data and instructions to the GPU's memory to do any computations.

The big question is whether you could have a GPU act on current MMX or SSE instructions sharing the MMX and XMM registers or whether you would have to have new instructions. As I said, I don't think this is possible unless AMD has already been working on it since early 2007.; April 17, 2008 4:45 PM
Unknown said...: Good news folks:

Seems that AMD is planning a Dodeca core processor, and it seems that it will be some variation of Shanghai.

AMD engineers reveal details about the company's upcoming 45nm processor roadmap, including plans for 12-core processors

"Shanghai! Shanghai!" the reporters cry during the AMD's financial analyst day today. Despite the fact that the company will lay off nearly 5% of its work force this week, followed by another 5% next month, most employees interviewed by DailyTech continue to convey an optimistic outlook.

The next major milestone for the CPU engineers comes late this year, with the debut of 45nm Shanghai. Shanghai, for all intents and purposes, is nearly identical to the B3 stepping of Socket 1207 Opteron (Barcelona) shipping today. However, whereas Barcelona had its HyperTransport 3.0 clock generator fused off, Shanghai will once again attempt to get HT3.0 right.

Original roadmaps anticipated that HT3.0 would be used for socket-to-socket communication, but also for communication to the Southbridge controllers. Motherboard manufacturers have confirmed that this is no longer the case, and that HT3.0 will only be used for inter-CPU communication.

"Don't be disappointed, AMD is making up for it," hints one engineer. Further conversations revealed that inter-CPU communication is going to be a big deal with the 45nm refresh. The first breadcrumb comes with a new "native six-core" Shanghai derivative, currently codenamed Istanbul. This processor is clearly targeted at Intel's recently announced six-core, 45nm Dunnington processor.

But sextuple-core processors have been done, or at least we'll see the first ones this year. The real neat stuff comes a few months after, where AMD will finally ditch the "native-core" rhetoric. Two separate reports sent to DailyTech from AMD partners indicate that Shanghai and its derivatives will also get twin-die per package treatment.

AMD planned twin-die configurations as far back as the K8 architecture, though abandoned those efforts. The company never explained why those processors were nixed, but just weeks later "native quad-core" became a major marketing campaign for the AMD in anticipation of Barcelona.

A twin-die Istanbul processor could enable 12 cores in a single package. Each of these cores will communicate to each other via the now-enabled HT3.0 interconnect on the processor.

The rabbit hole gets deeper. Since each of these processors will contain a dual-channel memory controller, a single-core can emulate quad-channel memory functions by accessing the other dual-channel memory controller on the same socket. This move is likely a preemptive strike against Intel's Nehalem tri-channel memory controller.

Motherboard manufacturers claim Shanghai and its many-core derivatives will be backwards compatible with existing Socket 1207 motherboards. However, processor-to-processor communication will downgrade to lower HyperTransport frequencies on these older motherboards. The newest 1207+ motherboards will officially support the HyperTransport 3.0 frequencies.

Shanghai is currently taped out and running Windows at AMD.

This is the original source from DailyTech: Dodeca-core: The Megahertz Race is Now Officially the Multi-core Race; April 18, 2008 2:40 AM
Ho Ho said...: scientia
"The big question is whether you could have a GPU act on current MMX or SSE instructions sharing the MMX and XMM registers or whether you would have to have new instructions"

In order for this to work efficiently those GPU instructions must be implemented inside the CPU itself. Even a GPU core added to the package will not be efficient enough for majority of the stuff.

"The rabbit hole gets deeper. Since each of these processors will contain a dual-channel memory controller, a single-core can emulate quad-channel memory functions by accessing the other dual-channel memory controller on the same socket."

For that to work the socket must have four memory channels connected to it. I doubt most motherboards have that as only very tiny subset of customers will buy this monster 12-core CPU. That kind of special-socketed motherboards quite probable. Assuming that those 6-core CPUs have more cache than regular quads fitting two of those in a regular socket won't be simple anyway.; April 18, 2008 4:09 AM
Ho Ho said...: To add to the "gpu instructions in CPU", you are basically describing what SSE5/AVX will be. I wouldn't call either of those as "adding GPU instructions to CPU". Only when CPU gets texture sampling instructions/HW I might call it a hybrid GPU. Just adding a bunch of SIMD instructions won't do it. If they would then Power-CPUs have actually been GPUs for years.; April 18, 2008 4:12 AM
enumae said...: Patrick Wang - Wedbush Morgan Securities

Okay, great, thanks. And then just one last question, just on 45-nanometer, I know that you said that you expect production of that material some time in the summer. Any more color you can provide there, just to help us better understand what’s happening?

Derrick R. Meyer

We’ll start the production ramp in the summertime and start to ship products in volume in Q4.

What's your take on this?; April 18, 2008 12:12 PM
Scientia from AMDZone said...: ho ho

I've tried explaining this to you but you still don't seem to understand. Putting a GPU in the same package connected by HT is no big deal. Just putting a GPU on the same die is functionally equivalent. Both of these would have to be accessed the same as is currently done. No change.

However, if the GPU were connected to the actual CPU pipeline then it seems possible that it could be used for at least some SSE operations. However, this is a much more difficult prospect so AMD would have already had to have been working on it since early 2007 to have any hope of getting it out by 2010. This is theoretically where AMD is headed but again I don't know the timeline. Widening the instructions to larger than 128 bits is a separate issue and may not actually be necessary. Again, if AMD has plans for hybrid GPU processing then they may decide to use this with SSE5 and not adopt AVX. They could use AVX but I'm assuming this would cost them at least a year's delay.

AMD's G3MX is capable of 3 or 4 channels in the same socket style as socket F (although it would have pins reassigned and therefore not compatible with current socket F). I assume these boards would have to be larger than ATX. These wouldn't be for the desktop. This could match some of the current layouts which put each memory channel on a daughter card. These would compete with the 3 channel Nehalem systems.; April 18, 2008 4:22 PM
Scientia from AMDZone said...: enumae

I have no idea from that description. It sounds similar to what was said before but nothing specific. By that statement he could mean release in Q3 with volume in Q4 or release and volume in Q4. It could also mean either early Q4 or late Q4. It's still too vague.; April 18, 2008 4:25 PM
Anonymous said...: Look, here's the thing:

SSE/X87 have a fairly short latency and a fairly low bandwidth. Say, roughly around 4 cycles for a complex SSE op. However, the GPU has latency in the hundreds of cycles. At first glance, this sounds ridiculous. Under closer examination, one realizes that the GPU can do one DP FP op per cycle, MUL/DIV/etc regardless.

GPUs are insane for these tasks, trust me, I've worked with them. The problem for AMD is this: what is going to be more important, latency or bandwidth? Only the apps will tell...; April 18, 2008 9:31 PM
Anonymous said...: Well, I was rushed and said a few things a bit off in the above post. Still, the latency issue is extreme...; April 19, 2008 7:50 AM
Scientia from AMDZone said...: Most of the latency for GPU operations comes from having to transfer to and from its memory since it is treated as an external device. When executed properly GPU operations can still be 5-20X faster. GPU operations make sense when you are dealing with data blocks of sufficient size.; April 20, 2008 12:01 PM
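
The point about "data blocks of sufficient size" can be illustrated with a rough break-even model. This is only a sketch: the function names and every number in it (throughput rates, link speed, fixed overhead) are hypothetical assumptions chosen for illustration, not measurements of any real CPU or GPU.

```python
def cpu_time(n_bytes, cpu_bytes_per_s):
    """Time for the CPU to process the block in place."""
    return n_bytes / cpu_bytes_per_s


def gpu_time(n_bytes, gpu_bytes_per_s, link_bytes_per_s, fixed_overhead_s):
    """Time for the GPU path: fixed setup cost, copy in, compute, copy out."""
    transfer = 2 * n_bytes / link_bytes_per_s  # to the device and back
    compute = n_bytes / gpu_bytes_per_s
    return fixed_overhead_s + transfer + compute


def break_even_bytes(cpu_bytes_per_s, gpu_bytes_per_s, link_bytes_per_s,
                     fixed_overhead_s):
    """Block size above which the GPU path is faster, or None if it never is.

    Solves n/cpu = overhead + 2n/link + n/gpu for n.
    """
    gain_per_byte = (1 / cpu_bytes_per_s
                     - 1 / gpu_bytes_per_s
                     - 2 / link_bytes_per_s)
    if gain_per_byte <= 0:
        return None  # transfer cost eats the whole compute advantage
    return fixed_overhead_s / gain_per_byte


# Hypothetical figures: GPU compute 10x the CPU rate, a 4 GB/s link,
# and 20 microseconds of fixed setup cost per offload.
CPU_RATE = 1e9
GPU_RATE = 10e9
LINK_RATE = 4e9
OVERHEAD = 20e-6

threshold = break_even_bytes(CPU_RATE, GPU_RATE, LINK_RATE, OVERHEAD)
```

Under these assumed figures the break-even block is about 50,000 bytes: smaller blocks are faster on the CPU, larger ones on the GPU. If the link is slow enough that the transfer costs more per byte than the CPU needs to do the work itself, the function returns None and offloading never pays off.
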
Ho Ho said...: scientia
"Most of the latency for GPU operations comes from having to transfer to and from its memory since it is treated as an external device."

Wrong. GPUs have had caches for ages. Where do you get your information, anyway?; April 22, 2008 2:40 AM
Scientia from AMDZone said...: ho ho

"Wrong. GPUs have had caches for ages. Where do you get your information, anyway?"

Let me see if I get this straight. You fancy that telling me that GPU's have local memory is new information? Remarkable that you could be so confused. I was referring to getting information into the GPU, not execution from local memory.; April 22, 2008 2:20 PM
Ho Ho said...: What is the difference between GPU pulling textures, vertices and shaders into its caches/local memory and CPU pulling the same stuff to its caches?

Or was your point about the "external memory" that CPU has to feed all the data over PCIe to GPU memory? If so then this is not the only big bottleneck in GPGPU. What I and spam were talking about was the latency there is when all the required data is already in the VRAM.; April 23, 2008 2:19 AM
Pop Catalin Sever said...: "Ho Ho said...

What is the difference between GPU pulling textures, vertices and shaders into its caches/local memory and CPU pulling the same stuff to its caches?"

There's a big difference.
As nVidia said regarding CUDA, GPUs aren't optimized for memory access but for ALU operations. On video processors there are no implicit memory access caches (yes, there are no caches of this type) but only caches that store explicit data (meaning you have to get the data into the local cache and then use it: no prefetch, no automatic discarding of cache lines, and no coherency with main memory). Simple cache access takes from 4 to 64 cycles on an nVidia GPU (depending on the thread banks that access the data, or warps as nVidia calls them). Random video memory access takes between 600 and 800 cycles. That's 5-10x more than a K8/Core2 class CPU needs to access system memory.

Here's a reference to what I've said: CUDA Performance; April 24, 2008 11:09 AM
Scientia from AMDZone said...: pop

I'm not as familiar with the nVidia products as the ATI. I know with ATI they suggest overlapping some types of manipulations with the moves to local memory to increase the effective throughput. By doing this you can cut the processing time in half.
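
The "cut the processing time in half" figure follows from a simple two-stage pipeline model. The sketch below assumes the work is split into equal chunks, that copying a chunk and processing a chunk can run at the same time, and that both stages take the same time per chunk. The function names and timings are hypothetical and do not come from ATI's documentation.

```python
def serial_time(n_chunks, t_copy, t_compute):
    """Copy each chunk, then process it, with no overlap."""
    return n_chunks * (t_copy + t_compute)


def overlapped_time(n_chunks, t_copy, t_compute):
    """Copy chunk i+1 while chunk i is being processed (double buffering).

    The first copy and the last compute cannot be hidden; every step in
    between costs only as much as the slower of the two stages.
    """
    if n_chunks == 0:
        return 0.0
    return t_copy + (n_chunks - 1) * max(t_copy, t_compute) + t_compute


# Hypothetical case: 100 chunks, copy and compute each take 1 time unit.
serial = serial_time(100, 1.0, 1.0)          # 200.0
overlapped = overlapped_time(100, 1.0, 1.0)  # 101.0
ratio = overlapped / serial
```

With equal stage times the overlapped run approaches half the serial time as the number of chunks grows. When one stage is much slower than the other, the saving shrinks, because only the shorter stage can be hidden.
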

Again, I don't know about the nVidia products but ATI has DMA so it can access memory on its own. However, this would be very inefficient for random access. And, this access has to be controlled by a program running on the CPU. I agree GPU's do not run like autonomous processors.

ho ho

Yes, there is latency inside the GPU. There isn't much you can do about the latency. ATI uses tricks to maintain total throughput such as having many instructions in flight and interleaving operations. But, again, this works better for large block operations.
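
How many operations need to be in flight can be estimated with Little's law (operations in flight = latency x issue rate). This is a simplified sketch: it assumes every operation is independent and pays the full memory latency, takes the 600 to 800 cycle figure quoted earlier in this thread, and assumes one operation issued per cycle. The function names are hypothetical.

```python
def required_in_flight(latency_cycles, ops_per_cycle=1.0):
    """Little's law: independent operations needed to keep the unit busy."""
    return latency_cycles * ops_per_cycle


def utilization(in_flight, latency_cycles, ops_per_cycle=1.0):
    """Fraction of peak throughput reached with a given number in flight."""
    needed = required_in_flight(latency_cycles, ops_per_cycle)
    return min(1.0, in_flight / needed)


# Using the 600 and 800 cycle latencies mentioned above.
low = required_in_flight(600)       # 600.0 operations in flight
high = required_in_flight(800)      # 800.0 operations in flight
starved = utilization(150, 600)     # 0.25 of peak
saturated = utilization(1000, 800)  # 1.0, latency fully hidden
```

Under this model a small job that cannot supply hundreds of independent operations leaves the GPU mostly idle, which is consistent with the point that these tricks work better for large block operations.
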

This is not unlike the difference between scalar and vector processing on supercomputers. GPU manipulation may become the equivalent of the supercomputer vector array.; April 26, 2008 12:35 PM
