Comments on Scientia's Blog: Updates And Old Patterns

popI'm not as familiar with the nVidia products as...

2008-04-26T12:35:00.000-04:00

pop

I'm not as familiar with the nVidia products as the ATI. I know with ATI they suggest overlapping some types of manipulations with the moves to local memory to increase the effective throughput. By doing this you can cut the processing time in half.
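A rough way to see where the "half" comes from is a toy timing model. This is only a sketch: the function names and the per-chunk timings are made up for illustration and do not come from any ATI or nVidia documentation. It assumes the data is split into equal chunks and that the copy engine and the ALUs can work on different chunks at the same time.

```python
def serial_time(chunks, transfer, compute):
    """Each chunk is copied to local memory, then processed, one after another."""
    return chunks * (transfer + compute)


def overlapped_time(chunks, transfer, compute):
    """Copy of chunk i+1 runs while chunk i is being processed.

    Only the first copy and the last compute are exposed; in between,
    the slower of the two stages sets the pace.
    """
    if chunks == 0:
        return 0.0
    return transfer + (chunks - 1) * max(transfer, compute) + compute


if __name__ == "__main__":
    # Hypothetical numbers: 100 chunks, copy and compute each take 1 time unit.
    s = serial_time(100, 1.0, 1.0)
    o = overlapped_time(100, 1.0, 1.0)
    print(f"serial={s}, overlapped={o}, speedup={s / o:.2f}x")
```

The gain approaches 2x only when the copy and the compute take about the same time per chunk; if one of them dominates, overlapping hides only the smaller one.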

Again, I don't know about the nVidia products but ATI has DMA so it can access memory on its own. However, this would be

"Ho Ho said... What is the difference between G...

2008-04-24T11:09:00.000-04:00

"Ho Ho said...

What is the difference between GPU pulling textures, vertices and shaders into its caches/local memory and CPU pulling the same stuff to its caches? "

There's a big difference.
As nVidia said regarding CUDA, GPUs aren't optimized for memory access but for ALU operations. On video processors there are no implicit memory access caches (yes there are no

What is the difference between GPU pulling texture...

2008-04-23T02:19:00.000-04:00

What is the difference between GPU pulling textures, vertices and shaders into its caches/local memory and CPU pulling the same stuff to its caches?

Or was your point about the "external memory" that CPU has to feed all the data over PCIe to GPU memory? If so then this is not the only big bottleneck in GPGPU. What I and spam were talking about was the latency there is when all the

ho ho"Wrong. GPUs have had caches for ages. Where ...

2008-04-22T14:20:00.000-04:00

ho ho

"Wrong. GPUs have had caches for ages. Where do you get your information, anyway?"

Let me see if I get this straight. You fancy that telling me that GPUs have local memory is new information? Remarkable that you could be so confused. I was referring to getting information into the GPU, not execution from local memory.

scientia"Most of the latency for GPU operations co...

2008-04-22T02:40:00.000-04:00

scientia
"Most of the latency for GPU operations comes from having to transfer to and from its memory since it is treated as an external device."

Wrong. GPUs have had caches for ages. Where do you get your information, anyway?

Most of the latency for GPU operations comes from ...

2008-04-20T12:01:00.000-04:00

Most of the latency for GPU operations comes from having to transfer to and from its memory since it is treated as an external device. When executed properly GPU operations can still be 5-20X faster. GPU operations make sense when you are dealing with data blocks of sufficient size.
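To put "sufficient size" into numbers, here is a minimal break-even sketch. Everything in it is an assumption for illustration: the cost model (a fixed setup cost plus a per-item transfer cost), the parameter names, and the example figures are hypothetical, not measurements of any real card.

```python
def cpu_time(items, cpu_per_item):
    """Time to process the whole block on the CPU."""
    return items * cpu_per_item


def gpu_time(items, cpu_per_item, speedup, transfer_per_item, fixed_overhead):
    """Time on the GPU: fixed setup, plus per-item transfer and faster compute."""
    return fixed_overhead + items * (transfer_per_item + cpu_per_item / speedup)


def break_even_items(cpu_per_item, speedup, transfer_per_item, fixed_overhead):
    """Block size above which the GPU wins, or None if it never does."""
    saved_per_item = cpu_per_item - transfer_per_item - cpu_per_item / speedup
    if saved_per_item <= 0:
        return None  # the transfer eats the whole compute advantage
    return fixed_overhead / saved_per_item


if __name__ == "__main__":
    # Hypothetical figures in arbitrary time units, with a 10X raw speedup.
    print(break_even_items(cpu_per_item=10.0, speedup=10.0,
                           transfer_per_item=1.0, fixed_overhead=80000.0))
```

Under this model a small block never recovers the fixed cost of going out to the device, which is the point about block size above.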

Well, I was rushed and said a few things a bit off...

2008-04-19T07:50:00.000-04:00

Well, I was rushed and said a few things a bit off in the above post. Still, the latency issue is extreme...

Look, here's the thing:SSE/X87 have a fairly short...

2008-04-18T21:31:00.000-04:00

Look, here's the thing:

SSE/x87 have a fairly short latency and a fairly low bandwidth. Say, roughly around 4 cycles for a complex SSE op. However, the GPU has latency in the hundreds of cycles. At first glance, this sounds ridiculous. Under closer examination, one realizes that the GPU can do one DP FP op per cycle, MUL/DIV/etc. regardless.

GPUs are insane for these tasks
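The latency versus throughput point can be sketched with a simple pipeline model. This is an idealized illustration, not a description of any specific GPU: it assumes all ops are independent, that one op can be issued per cycle, and the 400 cycle latency is just a stand-in for "hundreds of cycles".

```python
def pipelined_cycles(ops, latency, issue_interval=1):
    """Cycles to finish `ops` independent operations on a pipelined unit.

    The first result appears after `latency` cycles; each further result
    follows `issue_interval` cycles after the previous one.
    """
    if ops == 0:
        return 0
    return latency + (ops - 1) * issue_interval


def amortized_cycles_per_op(ops, latency, issue_interval=1):
    """Average cost per op once the latency is spread over the whole batch."""
    return pipelined_cycles(ops, latency, issue_interval) / ops


if __name__ == "__main__":
    for n in (1, 1_000, 1_000_000):
        print(n, pipelined_cycles(n, latency=400),
              amortized_cycles_per_op(n, latency=400))
```

With a single op the full latency is paid; with a large batch of independent ops the average cost tends toward one cycle per op, which is why the long latency matters less than it first appears.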

enumaeI have no idea from that description. It sou...

2008-04-18T16:25:00.000-04:00

enumae

I have no idea from that description. It sounds similar to what was said before but nothing specific. By that statement he could mean release in Q3 with volume in Q4 or release and volume in Q4. It could also mean either early Q4 or late Q4. It's still too vague.

ho hoI've tried explaining this to you but you sti...

2008-04-18T16:22:00.000-04:00

ho ho

I've tried explaining this to you but you still don't seem to understand. Putting a GPU in the same package connected by HT is no big deal. Just putting a GPU on the same die is functionally equivalent. Both of these would have to be accessed the same as is currently done. No change.

However, if the GPU were connected to the actual CPU pipeline then it seems

Patrick Wang - Wedbush Morgan SecuritiesOkay, grea...

2008-04-18T12:12:00.000-04:00

Patrick Wang - Wedbush Morgan Securities

Okay, great, thanks. And then just one last question, just on 45-nanometer, I know that you said that you expect production of that material some time in the summer. Any more color you can provide there, just to help us better understand what’s happening?

Derrick R. Meyer

We’ll start the production ramp in the summertime and

To add to the "gpu instructions in CPU", you are ba...

2008-04-18T04:12:00.000-04:00
To add to the "gpu instructions in CPU", you are basically describing what SSE5/AVX will be. I wouldn't call either of those "adding GPU instructions to CPU". Only when the CPU gets texture sampling instructions/HW might I call it a hybrid GPU. Just adding a bunch of SIMD instructions won't do it. If it did, then Power CPUs have actually been GPUs for years.

scientia"The big question is whether you could hav...

2008-04-18T04:09:00.000-04:00
scientia
"The big question is whether you could have a GPU act on current MMX or SSE instructions sharing the MMX and XMM registers or whether you would have to have new instructions"

In order for this to work efficiently those GPU instructions must be implemented inside the CPU itself. Even a GPU core added to the package will not be efficient enough for majority

Good news folks:Seems that AMD is planning a Dodec...

2008-04-18T02:40:00.000-04:00
Good news folks:

Seems that AMD is planning a Dodeca core processor, and it seems that it will be some variation of Shanghai.

AMD engineers reveal details about the company's upcoming 45nm processor roadmap, including plans for 12-core processors

"Shanghai! Shanghai!" the reporters cry during the AMD's financial analyst day today. Despite the fact that the company

Ho HoRight now, the GPU is accessed as a completel...

2008-04-17T16:45:00.000-04:00
Ho Ho

Right now, the GPU is accessed as a completely separate, external device using library code. You have to specifically transfer data and instructions to the GPU's memory to do any computations.

The big question is whether you could have a GPU act on current MMX or SSE instructions sharing the MMX and XMM registers or whether you would have to have new instructions. As

A very interesting article: Analysis: AMD Asset L...

2008-04-17T08:10:00.000-04:00
A very interesting article:

Analysis: AMD Asset Lite strategy will create MAD AMD

I'll post only the conclusion:

"AMD is not going down any time soon and even after the AMD + ATI vs. MAD AMD LLC split, cooperation with IBM, TSMC, Chartered, ANGSTREM will not stop. In fact, it will expand

scientia"I didn't say it was a big improvement; ju...

2008-04-17T03:52:00.000-04:00
scientia
"I didn't say it was a big improvement; just that it isn't worse than 90nm as has been claimed."

I'd say if you get to a lower tech node and do not increase clocks at the same thermals on a die-shrink design then things aren't looking good.

"A quad processor should therefore need 4X DDR-400"

For what tasks would it need that much? Any

Ho Ho said"Are you saying that even though their 6...

2008-04-16T17:18:00.000-04:00
Ho Ho said
"Are you saying that even though their 65nm can get 100MHz more on 65nm they will still top out at same speed at 125W? That isn't all that much improvement I'd say."

I didn't say it was a big improvement; just that it isn't worse than 90nm as has been claimed.

"Ah, I see, I thought you were saying that B3 magically got higher IPC than B2."

scientia"It is clear from the TDP that neither the...

2008-04-16T03:08:00.000-04:00
scientia
"It is clear from the TDP that neither the 90nm nor 65nm process could go above 3.2Ghz without exceeding 130 watts."

Are you saying that even though their 65nm can get 100MHz more on 65nm they will still top out at same speed at 125W? That isn't all that much improvement I'd say.

"Actually, 2.4 and 2.5Ghz are faster than 2.3Ghz, patched or

OK Scientia, it looks like you CAN fit 6 DIMMs on ...

2008-04-15T19:37:00.000-04:00
OK Scientia, it looks like you CAN fit 6 DIMMs on an ATX MB. It'll be a tight fit though, and I'd like to see someone try it. I have no idea how big the G3MX chips are, so there may not be enough room to fit those as well. I should caution that some MB manufacturers call a 305mm x 224mm MB an ATX, and I am using the Standard ATX definition of 305mm x 244mm. Better to use EATX though: 305mm x 330mm.

AVX over the existing 128 bit pathways wouldn't ga...

2008-04-15T15:14:00.000-04:00
AVX over the existing 128 bit pathways wouldn't gain much speed. Also, in K8 the 128 bit instructions were issued as two 64 bit micro-ops, which meant that it took the same amount of decoding time as two 64 bit instructions would have.

It could be easier for AMD to do 256 bit operations on the GPU. However, the GPU would have to be beefed up some and AMD would have to think of some way to

Ho Ho"Ok, but what if AMD chose not to release fas...

2008-04-15T15:01:00.000-04:00
Ho Ho

"Ok, but what if AMD chose not to release faster 65W 90nm CPUs?"

It is clear from the TDP that neither the 90nm nor 65nm process could go above 3.2Ghz without exceeding 130 watts.

"They are only faster if you compare against patched system."

Actually, 2.4 and 2.5Ghz are faster than 2.3Ghz, patched or otherwise.

"So you are

Even with two-pass SIMD it might be faster thanks ...

2008-04-15T10:03:00.000-04:00
Even with two-pass SIMD it might be faster thanks to fewer instructions needing to be loaded and decoded.

256 bit wide instructions don't necessarily mean hig...

2008-04-15T08:34:00.000-04:00
256 bit wide instructions don't necessarily mean higher performance (2x or close); it all depends on the hardware pipeline width. Intel might implement 256 bit wide SSE (AVX) over a 128 bit wide hardware pipeline, just as SSE was implemented over a 64 bit wide pipeline prior to Core 2 and Barcelona, using instruction breakup.
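As a toy illustration of instruction breakup, the sketch below runs a 256 bit wide add over a 128 bit wide datapath. It is a simplified model, not how any actual Intel or AMD core is built: vectors are plain Python lists of eight 32 bit lanes, and `add_128` and `add_256_split` are hypothetical names standing in for the hardware unit and the split instruction.

```python
LANES_128 = 4  # four 32 bit lanes fit in a 128 bit datapath


def add_128(a, b):
    """One pass through the 128 bit wide execution unit."""
    assert len(a) == len(b) == LANES_128
    return [x + y for x, y in zip(a, b)]


def add_256_split(a, b):
    """One 256 bit instruction executed as two 128 bit passes.

    Returns the eight lane result and the number of passes it cost.
    """
    assert len(a) == len(b) == 2 * LANES_128
    low = add_128(a[:LANES_128], b[:LANES_128])
    high = add_128(a[LANES_128:], b[LANES_128:])
    return low + high, 2


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    b = [10, 20, 30, 40, 50, 60, 70, 80]
    result, passes = add_256_split(a, b)
    print(result, passes)
```

The result is the same as a true 256 bit add, but it still costs two passes through the execution unit, so the gain is in having one instruction to fetch and decode rather than in doubled arithmetic throughput.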

Also AMD could easily provide an AVX compatible instruction set

erlindo"1)What probabilities does Shanghai have to...

2008-04-15T03:42:00.000-04:00
erlindo
"1)What probabilities does Shanghai have to include SSE5 instructions?"

Zero

"2) About AVX, why didn't AMD think of making SSE5 256 bit wide instead of 128 bit?"

It is a bit easier to work with 128bit than with 256bit (2x2 vs 2x4 quads). Of course if you are smart, plan ahead and can make your algorithms work nicely for