Saturday, September 02, 2006

Beyond FBDIMM

It is no secret that as processors have gotten faster, they've outpaced memory speed. There was a time when memory was actually faster than the processor, but this has slowly changed. Over time, processor designs began including Level 1 (L1) cache and then Level 2 (L2), and it now appears that most processors will soon need L3 as well. Each cache level is smaller and faster than the one below it: L3 is smaller than main memory but faster; L2 is smaller but faster than L3; and L1 is smaller again but faster than L2. Without these caches, the processor would spend a lot of time waiting on main memory.
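
As a rough illustration of what the cache hierarchy buys, here is a small sketch; the hit rates and cycle counts are hypothetical placeholders, not figures for any particular processor.

```python
# Rough sketch of average memory access time for a cache hierarchy.
# Hit rates and cycle counts are hypothetical, chosen only to illustrate
# why each level keeps the processor from waiting on main memory.

levels = [
    # (name, hit rate for accesses that reach this level, latency in cycles)
    ("L1", 0.90, 3),
    ("L2", 0.80, 14),
    ("L3", 0.50, 40),
]
DRAM_LATENCY = 200  # assumed cycles to reach main memory

def average_access_time(levels, dram_latency):
    avg = 0.0
    reach = 1.0                      # fraction of accesses that get this far
    for _name, hit_rate, latency in levels:
        avg += reach * latency       # every access reaching this level pays its latency
        reach *= 1.0 - hit_rate      # only the misses continue downward
    return avg + reach * dram_latency

print("With caches:    %.1f cycles per access" % average_access_time(levels, DRAM_LATENCY))
print("Without caches: %.1f cycles per access" % DRAM_LATENCY)
```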

Memory has valiantly tried to keep up. It has gone from regular DRAM to DDR to DDR2 and soon DDR3. However, as speed has gone up, fanout has decreased, so it is not possible to put as many DDR3 DIMMs on a board as DDR DIMMs. The number of usable DIMMs has been traded for speed. The only reason this is somewhat acceptable is that memory chip capacity is going up as well, so the greater memory on each DIMM partly makes up for having fewer DIMMs.

Intel has introduced what it feels is a way to solve this problem: FBDIMM. Unfortunately, FBDIMM creates just as many problems as it solves. Although FBDIMM is fairly fast and has good fanout, it introduces very large latencies and draws much more power. An FBDIMM is basically a DDR2 DIMM with an extra chip on it, an Advanced Memory Buffer (AMB). The AMB provides two serial ports that form an upstream channel and a downstream channel. These connect from the memory controller to the first FBDIMM, and additional FBDIMMs are simply daisy-chained to the first one. The problem is that each FBDIMM in the chain adds latency. Consider how poor the performance of the 8th FBDIMM in a chain would be: the request has to hop over the previous seven FBDIMMs, and the data then has to make the same seven hops going back up. This can drastically slow down the memory and makes FBDIMMs impractical at maximum fanout. So at large fanout we've simply come full circle and traded away speed again to increase capacity.

Secondly, the AMB chip draws a lot of power. This chip by itself draws as much power as the rest of the DIMM combined, thereby doubling the power draw. That is not a good tradeoff at a time when operators of large-scale computers are becoming sensitive to the high cost of electric power for these systems. Processors themselves have become much more energy efficient, and Intel likes to brag about the low power draw of its newer Core 2 Duo designs versus its older Pentium D designs. However, these gains will easily be lost to the extra power needed for Intel's FBDIMM design. Power draw for the processor is simply replaced by power draw from memory, and no real advancement is made.
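
To see how the hop penalty accumulates, here is a small sketch of round-trip latency to the Nth FBDIMM in a chain; the per-hop and DRAM timings are made-up placeholders rather than measured AMB figures.

```python
# Sketch of round-trip latency to the Nth FBDIMM in a daisy chain.
# HOP_NS and DRAM_NS are illustrative placeholders, not measured AMB numbers;
# the point is simply that each earlier DIMM adds two hops (down and back up).

HOP_NS = 5    # assumed pass-through delay of one AMB, one direction
DRAM_NS = 50  # assumed access time of the DRAM on the target DIMM

def fbdimm_round_trip(position):
    """Latency to the FBDIMM at `position` (1 = closest to the memory controller)."""
    hops_each_way = position - 1          # the request passes through every earlier AMB
    return 2 * hops_each_way * HOP_NS + DRAM_NS

for pos in (1, 4, 8):
    print("FBDIMM #%d: %d ns round trip" % (pos, fbdimm_round_trip(pos)))
# The 8th DIMM pays 14 extra hops (7 down, 7 back up) on top of the DRAM access.
```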

Currently there is nothing better, but there could be. Suppose we replace the AMB chip with a HyperTransport chip with three 16-bit links. We'll call this HTDIMM, or just HTD for short. One link goes back to the processor and the other two provide a fanout tree: the first HTD connects to two HTDs, and those two then connect to two more, making seven total (1 + 2 + 4). A fanout of seven would be almost as good as the maximum fanout of eight for FBDIMM. However, whereas FBDIMM needs seven hops to reach the last DIMM, HTD would only need two. Power draw and cost could be an issue, but the last four HTDs only need a single link, so they could be cheaper and draw less power. Cost is also not likely to be an issue since HT was designed for low cost and current FBDIMMs cost about the same as regular DDR2 DIMMs. Three links might seem complicated compared to FBDIMM, but all of them combined would be narrower than the current DIMM bus, so motherboard circuit trace complexity would go down. Also, because each 16-bit path is independent, the very wide serpentine traces required by today's parallel memory bus are no longer needed.
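
A quick sketch of the hop counts makes the comparison concrete; the only inputs are the two topologies described above.

```python
# Worst-case hop counts: an FBDIMM daisy chain versus the proposed
# HTDIMM binary fanout tree (1 + 2 + 4 = 7 modules, two levels deep).

def chain_hops(n_dimms):
    """Hops from the first DIMM to the farthest DIMM in a daisy chain."""
    return n_dimms - 1

def tree_hops(n_dimms):
    """Hops from the first HTDIMM to the farthest HTDIMM in a full binary tree."""
    depth, reachable = 0, 1
    while reachable < n_dimms:
        depth += 1
        reachable += 2 ** depth   # each new level adds twice as many DIMMs as the last
    return depth

print("8 FBDIMMs in a chain: %d hops to the last DIMM" % chain_hops(8))  # 7
print("7 HTDIMMs in a tree:  %d hops to the last DIMM" % tree_hops(7))   # 2
```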

A fanout of seven would be a good compromise for most server configurations; however, this is not a hard limit. There is no reason you couldn't add another hop and another level of fanout, giving eight more DIMMs for a total of fifteen (1 + 2 + 4 + 8). This might be necessary for some high-powered servers. Even with three hops we would still be at less than half of the seven-hop maximum for FBDIMM, but at about double the fanout. And, again, the last eight HTDs would only need a single link.
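
The same pattern generalizes: each extra hop adds a new level with twice as many HTDIMMs as the last, as this short sketch shows.

```python
# How the binary fanout scales with depth: the DIMMs reachable within N hops
# of the first HTDIMM total 1 + 2 + 4 + ... = 2**(N+1) - 1.

def htdimms_within(hops):
    return sum(2 ** level for level in range(hops + 1))

for hops in range(4):
    print("%d hop(s): %2d HTDIMMs" % (hops, htdimms_within(hops)))
# 2 hops give the 7-DIMM configuration above; 3 hops give the 15-DIMM version.
```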

AMD's K8 now uses an on-die dual-channel memory controller. The processor could use four HT channels instead of the current memory controller, which would greatly reduce both the pincount and the number of circuit traces needed for memory access. This would reduce both motherboard and processor complexity.

HT would work better than FBDIMM for several reasons. HT was designed long ago for use on motherboards and is rooted in technology from the DEC Alpha. It became an open standard and was designed for low cost, low complexity, low latency, and high bandwidth. It has been successfully used by AMD, IBM, Apple, and all of the third-party chipset makers like Nvidia, ATI, and VIA. In fact, there are motherboards today for Intel processors that use HT to communicate between the Northbridge and Southbridge chips. HT is so robust that it is a superset of PCI, USB, AGP, PCI-X, PCI-E, and even Intel's proposed CSI standard; it is capable of transporting data from all of these protocols.

HT also has an advantage over the AMB chip in terms of flexibility and speed. HT can handle a tree structure, whereas AMB uses a ring structure. HT is also much faster: whereas FBDIMM can move 6.4 GB/sec, HTDIMM would be capable of moving 20.6 GB/sec. That is about as fast as the fastest proposed DDR3 memory, yet HTDIMM would also be capable of handling multiple simultaneous DIMM accesses and data transfers. With four HT channels, the processor would have a peak memory bandwidth of about 83 GB/sec. Since this is well beyond what current processors can use, desktop systems would probably only use two channels while budget systems used one. The reason for this speed is that the base technology for HT comes from the very fast DEC Alpha, and it has continued to be developed since it was used with DEC's Alpha and AMD's Athlon MP processors. Today, HyperTransport is a mature protocol at version 3.0, and the speed has doubled with each version.
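
Working through the numbers quoted above (the per-link figures come from the text itself; only the aggregation across channels is added here):

```python
# Aggregating the bandwidth figures quoted in the article. The 6.4 GB/s and
# 20.6 GB/s per-link values are taken from the text; the multi-channel totals
# are just multiplication.

FBDIMM_CHANNEL_GBS = 6.4   # quoted FBDIMM channel bandwidth
HTDIMM_LINK_GBS = 20.6     # quoted bandwidth of one 16-bit HT link

print("FBDIMM channel: %.1f GB/s" % FBDIMM_CHANNEL_GBS)
for channels in (1, 2, 4):
    total = channels * HTDIMM_LINK_GBS
    print("%d HT channel(s): %.1f GB/s peak" % (channels, total))
# Four channels give 82.4 GB/s, i.e. the ~83 GB/s peak cited above.
```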

This type of memory would work for Intel as well. HyperTransport is an open protocol and can be used royalty free. Replacing the current FBDIMM ports on the 5000 Northbridge with HT ports would be fairly simple for Intel to do, and it could pick up additional desktop chipsets from Nvidia since they already use HT. This would require no change to the processor or FSB design if Intel wanted to retain those. Using HyperTransport with the current DIMM design would be the best way to reverse the lag in memory speed and increase fanout. It would also reverse the trend toward more and more cache. The DIMM design itself would be superior to FBDIMM in every way. HTDIMM is what the next JEDEC standard should be.

15 comments:

Anonymous said...

From what I understand this is pretty similar to RDRAM RIMMs. I had a Dell Dimension 8100 with RDRAM; I read the manual once, and the basic design is to daisy-chain a bunch of RIMMs to the memory controller in a ring.

Anonymous said...

Scientia, as far as I know, the only way to use HyperTransport in your components is to become a member of the consortium first, something Intel is very unlikely to do.

I agree these HTDIMMs would be a lot better than current implementations, but I heard about something similar to FB-DIMMs that works like them but doesn't require the AMB. I cannot remember what it is, but it's supposed to be much more advanced than FB-DIMMs and something that could put an early end to the life of FB-DIMMs.

Anonymous said...

I really found your post informative.

Thanks.

Scientia from AMDZone said...

No, I wouldn't expect Intel to support HTDIMM at first. However, they probably would eventually if it were superior to FBDIMM. My comments were to show that it is an open standard and wouldn't exclude Intel if they chose to use it.

Anonymous said...

Brilliant stuff as usual, Scientia. I read it a couple of times to try to get it all.

I would really like to know one thing. What the hell does Intel have against using open-standard tech? Does Intel think it is somehow a fatal blow to their business model? Do they fear that the public at large will rake Intel over the coals and claim they have surrendered to AMD if they use an open-standard protocol?

As far as I am concerned, if Intel actually did decide to embrace an open standard, it would only help them in the long run.

Anonymous said...

Considering AMD chips reportedly already use 400+ pins for two DDR2 channels, there doesn't seem to be a pressing need to daisy-chain HTDIMMs at all. 400 pins will fit a lot of 14-bit HT 3.0 links (9 GByte/s) for point-to-point connections to hypothetical HTDIMM slots.

There isn't much of a need today for a single processor to have more than the 20+ HTDIMM slots that could be supported with 400 pins worth of 14-bit HT links. That's 40GB of RAM even without considering future DRAM density increases.

Scientia from AMDZone said...

Intel has a lot of vertical integration. They make the CPUs, the chipsets, and the motherboards.

When a company like Nvidia or ATI wants to make a chipset for an Intel motherboard, they have to pay a royalty license to Intel. Since HyperTransport is an open standard, it can be used without royalties. It is my understanding that there is some cost associated with being a member of the HT Consortium; however, that also gives the member a vote on the proposed standards, which AMD has to follow as well.

So, this basically brings up the question of what would happen if Intel started using HyperTransport. For one thing, it would make chipset design a lot easier. In fact, there are already 3rd party chipsets for Intel processors that use HyperTransport to communicate between the North and South Bridge chips.

Intel would lose licensing income and would lose lead time on chipset development. Combined with the lower cost of development for 3rd parties, this would tend to make 3rd party chipsets more competitive. Intel could lose half of their chipset and motherboard income.

This idea is foreign to Intel. They don't see themselves as being a competitor. They view their job as grabbing as much market share as possible by whatever means necessary. They tend to view third parties like Nvidia and ATI as a necessary evil rather than as strategic partners. Centrino was ideal for Intel because it required that the OEMs buy the entire package; Intel wouldn't sell the Pentium M chips by themselves.

However, this is upside down from the normal business model. OEMs like HP, Dell, Sun, and IBM want to create strong brands associated with their companies. Centrino, however, is a supplier brand rather than an OEM brand; it is not associated with any particular company except Intel. Supplier branding tends to be secondary, like "Dolby" for audio equipment. Imagine how it would be if you had to buy hardware directly from Dolby to use the Dolby name.

Normally, the customer has more leverage than the supplier. Walmart has tremendous leverage over its suppliers, as do companies like Procter & Gamble, General Motors, and Boeing. This is normal because a customer can always switch suppliers. Intel, however, has been practically immune to pressure from other suppliers and now sees itself as more important than its own customers. Intel's customers don't like this.

Intel basically set up a reinforcing loop in the past. It worked something like this: Intel pays customers to create commercials that push the Intel brand. The buying public identifies with the Intel brand and buys in high volume. The high volumes trap OEM customers because they can't replace those volumes with chips from AMD. The high volumes give Intel money and leverage. Intel uses its leverage to block sales of AMD products and thereby maintains both its high volume and its high brand value. It uses the money to continue its exclusive marketing and keeps growing.

That reinforcing arrangement finally cracked in Q4 05. Although Intel's margins for that quarter were excellent and it had record revenues, it also increased inventory while AMD cut its own inventories by more than half. Q1 06 saw a big drop in Intel revenue and margin while AMD's revenues increased. Intel's inventories continued to increase and increased again in the second quarter. Intel had stopped growing and started shrinking.

Now, Intel is in a jam because AMD has a second 300mm fab ramping and a second supplier in Chartered. This gives AMD enough volume to replace large blocks of Intel orders, such as a guarantee of 10 million chips to Dell, at the same time that Intel is not able to fill all of the demand for C2D chips. Suddenly, Intel's OEM customers have another option. Also, the company that pushed Opteron most heavily for servers was Sun; Sun's share of the x86 server market grew by 48% while no one else grew by more than 3%. This is why IBM, HP, and Dell all quickly announced new Opteron server models. Apparently, the brand value of Xeon has declined a bit.

Intel will eventually have to rethink the way it sees itself, stop bullying its own customers, and stop trying to promote itself ahead of its customers' brands. However, by going with a proprietary standard like CSI, Intel shows that it still thinks in terms of being a monopoly and maximizing its market position rather than delivering good products to its customers.

Anonymous said...

" Walmart has tremendous leverage over its suppliers"

Wal-Mart also goes to the extent of buying the supplier, which it has done in some third-world countries.

Intel is the supplier and seller; they basically became the Wal-Mart of the tech world (sometimes cheap prices, but always shitty quality).

Anonymous said...

Interesting idea.
I have one question though. In this binary tree scheme, won't the 1st HTDIMM become the new memory bandwidth bottleneck?

Scientia from AMDZone said...

In this binary tree scheme, won't the 1st HTDIMM become the new memory bandwidth bottleneck?

On the current shared-bus design of regular DIMMs there is a bottleneck because only one transaction can take place on the bus, regardless of how many DIMMs are on it. The first HTDIMM wouldn't actually be a bottleneck because its speed would be the same as the link on the CPU.

However, unlike regular DIMMs, you can have transactions taking place behind the first HTDIMM. This, combined with mixed packet delivery, tends to hide chip latencies. In other words, with HTD you can make multiple read or write requests and the data can be delivered interleaved. This removes the need to wait until a particular chip is ready and allows the data to stream at a higher rate.
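
As a rough sketch of the difference (the timings are made-up placeholders, not measurements): on a shared bus each read must finish before the next starts, while with outstanding transactions the DRAM waits of different DIMMs overlap and only the data transfers still share the link.

```python
# Illustrative model of why outstanding, interleaved transactions hide latency.
# DRAM_WAIT_NS and TRANSFER_NS are placeholders, not measured values.

DRAM_WAIT_NS = 50   # assumed time for a DIMM to have its data ready
TRANSFER_NS = 10    # assumed time to move one result over the link
REQUESTS = 8        # outstanding reads spread across different DIMMs

# Shared bus: each access occupies the bus for its full wait plus transfer.
serialized_ns = REQUESTS * (DRAM_WAIT_NS + TRANSFER_NS)

# Split transactions: all DIMMs work in parallel, then the replies stream
# back-to-back over the link, with packets from different DIMMs interleaved.
overlapped_ns = DRAM_WAIT_NS + REQUESTS * TRANSFER_NS

print("Serialized bus:      %d ns for %d reads" % (serialized_ns, REQUESTS))  # 480 ns
print("Overlapped requests: %d ns for %d reads" % (overlapped_ns, REQUESTS))  # 130 ns
```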

Anonymous said...

AMD began developing the HyperTransport™ I/O link architecture in 1997.

Pre-Consortium Versions of the Specification
AMD has released these two pre-consortium documents which define two revisions of "LDT" [Lightning Data Transport], as HT was known before the HT Consortium was formed.
http://www.hypertransport.org/tech/tech_specs.cfm


[2001]
AMD has disclosed HyperTransport technology specifications under non-disclosure agreement (NDA) to over 170 companies interested in building products that incorporate this technology.

Multiple partners have signed the license agreement for HyperTransport technology, including, among many others: Sun Microsystems, Cisco Systems, Broadcom, Texas Instruments, NVIDIA, Acer Labs, Hewlett-Packard, Schlumberger, Stargen, PLX Technology, Mellanox, FuturePlus, API Networks, Altera, LSI Logic, PMC-Sierra, Pericom, and Transmeta.


AMD is releasing the specifications to an industry-supported non-profit trade association in the fall of 2001. The HyperTransport Consortium will manage and refine the specifications and promote the adoption and deployment of HyperTransport technology. It is also expected to consist initially of a Technical Working Group and a Marketing Working Group. Subordinate task forces will do the work of the consortium. Anticipated technical task forces include:
Protocol Task Force
Connectivity Task Force
Graphics Task Force
Technology Task Force
Power Management Task Force
Information on joining the HyperTransport Technology Consortium can be found at this website: http://www.hypertransport.org

HyperTransport Technology I/O Link (white paper) [PDF]




San Jose, Calif., July 24, 2001 -- A coalition of high-tech industry leaders today announced the formation of the HyperTransport™ Technology Consortium, a nonprofit corporation that supports the future development and adoption of AMD's HyperTransport I/O Link specification.

[...] More than 180 companies throughout the computer and communications industries have been engaged with AMD in working with the HyperTransport technology.
— hypertransport.org press release





HyperTransport technology has a daisy-chain topology, giving the opportunity to connect multiple HyperTransport input/output bridges to a single channel. HyperTransport technology is designed to support *up to 32 devices per channel* and can mix and match components with different link widths and speeds.



AMD, coherent HyperTransport: "proprietary coherent version which lets processors communicate directly, sharing cache data"



Henri Richard, EVP and chief sales and marketing officer at AMD:
I think our announcement of Torrenza is a very important event for the industry. I think it demonstrates that we're willing to license our Coherent HyperTransport technology and create an open ecosystem where people that are capable of creating dedicated silicon will be able to plug their solution into an HTX slot or, further down the road, an AMD64 socket. That's really going to help the industry continue to expand this notion of a connected business model and provide both OEMs and end users a variety of choices and opportunities for optimal differentiation.

Chris Hall, DigiTimes.com, Computex show in Taipei [Friday 23 June 2006]

Scientia from AMDZone said...

Intel is the supplier and seller, they basically became the Wal-Mart of the tech world

I don't see the correlation. Intel is a supplier but doesn't own any retail or mail-order sellers. For example, Pepsi owns Pizza Hut, Taco Bell, and KFC. Intel doesn't own anything like this.

Anonymous said...

Scientia, nice idea; there's only one problem: not enough pins to power your DIMMs.
3 x 16-bit HT links, that's:

3 x 80 pins = 240

Current DDR2 DIMMs have 240 pins, including power pins.

Philip

Scientia from AMDZone said...

HyperTransport technology has a daisy-chain topology

The topology is daisy-chained at the lowest level. With three links you can create other topologies; HT is not limited in this way.

Scientia from AMDZone said...

Philip,

Scientia, nice idea; there's only one problem: not enough pins to power your DIMMs.
3 x 16-bit HT links, that's:

3 x 80 pins = 240

Current DDR2 DIMMs have 240 pins, including power pins.


I'm not sure how you get from 3 HT links to 3 x 80 (that would be 5 pins per bit). With the actual 2 pins per bit this would be 32 pins per link. 32 x 3 = 96.

In fact, even if we doubled the links to 32 bits that would still only be 192 pins. We should have 48 pins left over for power and ground.
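
To make the arithmetic explicit, here is a small sketch using the same assumption as above (2 pins per bit); a real HT link pinout would also include clock, control, power, and ground pins, so treat this purely as a rough budget estimate.

```python
# Pin-budget sketch under the reply's assumption of 2 pins per bit per link.
# A real HyperTransport pinout also includes clock, control, power, and
# ground pins, so these are rough budget numbers, not an actual pinout.

PINS_PER_BIT = 2
DIMM_PIN_BUDGET = 240  # pin count of a standard DDR2 DIMM

def link_pins(width_bits):
    return width_bits * PINS_PER_BIT

for width in (16, 32):
    used = 3 * link_pins(width)
    print("3 x %2d-bit links: %3d pins, %d left for power and ground"
          % (width, used, DIMM_PIN_BUDGET - used))
# 3 x 16-bit links:  96 pins, 144 left
# 3 x 32-bit links: 192 pins,  48 left
```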