By Richard Solomon, Sr. Technical Marketing Manager, PCI Express Controller IP
On November 29, 2011, the PCI-SIG® announced that its 4.0 specification would include a 16 GT/s maximum data rate, exactly double the 8 GT/s maximum specified by the PCI Express® 3.0 Base Specification. In the intervening years, most of the focus and discussion has centered on the challenges of developing a PHY/SerDes to implement the new data rate. The challenge of running such high speeds over least-common-denominator FR4 material certainly should not be minimized, but there are also a number of significant challenges facing the digital designers building the PCI Express controllers which will run those PHYs.
One of the hallmarks of the PCI family of interface standards has long been backwards compatibility, and PCIe® 4.0 is no exception. The new 4.0 specification will include all the previous data rates (2.5, 5, and 8 GT/s) in addition to the new 16 GT/s data rate, and interoperability amongst devices running any of those rates is required. While this means that a designer can still build a 4.0-compliant PCI Express device which runs at a maximum of 2.5 GT/s, most implementers will be concerned with the latest 16 GT/s data rate.
Designers who have followed the PCI Express spec for the last several generations will recall the complexity of the 5 GT/s to 8 GT/s transition, where new equalization and encoding schemes were introduced to double the bandwidth without doubling the bit rate. The transition from 8 to 16 GT/s is expected to be much less disruptive, more akin to the 2.5 to 5 GT/s transition that accompanied the PCIe 2.0 specification. The same 128b/130b encoding scheme introduced with PCIe 3.0 and 8 GT/s will be used in PCIe 4.0’s 16 GT/s mode.
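For concreteness, here is a rough back-of-the-envelope sketch (ignoring protocol overhead) of effective per-lane bandwidth across the signaling rates, showing how the 8b/10b and 128b/130b encoding schemes figure into each doubling:

```python
# Rough per-lane effective bandwidth (Gb/s), ignoring protocol overhead.
rates = {
    "PCIe 1.x (2.5 GT/s, 8b/10b)":    2.5 * 8 / 10,
    "PCIe 2.0 (5 GT/s, 8b/10b)":      5.0 * 8 / 10,
    "PCIe 3.0 (8 GT/s, 128b/130b)":   8.0 * 128 / 130,
    "PCIe 4.0 (16 GT/s, 128b/130b)": 16.0 * 128 / 130,
}
for name, gbps in rates.items():
    print(f"{name}: {gbps:.2f} Gb/s per lane per direction")
# Prints roughly 2.00, 4.00, 7.88, 15.75: each step about doubles the usable bandwidth.
```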
There has been some discussion within the industry of a connector change for PCIe 4.0, but it is important to note that this connector would remain pinout-compatible with the existing PCIe card definitions. In other words, some connector design changes are necessary for the higher 16 GT/s data rate, but new connectors will remain fully backwards compatible with older PCIe cards. Likewise, new PCIe 4.0 cards will plug into existing PCIe sockets without issue.
The so-far-immutable laws of physics dictate that higher-speed signaling is effective over shorter and shorter distances, so the PCI-SIG is specifying “extension devices” (retimers and re-drivers) to extend the reach of PCIe and achieve today’s required 16-20 inch channel lengths at 16 GT/s.
A number of high-bandwidth markets are driving this new data rate. GPUs in the consumer market constantly require higher-bandwidth interconnects to feed 4K gaming and the ever-higher levels of visual realism demanded by gamers. Traditional data center applications like servers, supercomputers, and their associated networks (e.g., 40G and 100G Ethernet) and storage controllers (e.g., SAS RAID and SANs) are also consumers of this new specification. The phenomenal bandwidth of solid-state disks has driven these devices from traditional SATA/SAS interfaces to PCI Express, and they are positioned to use as much of this newly available bandwidth as their purchasers will buy.
Five areas concern designers building PCI Express 4.0 controllers: extension devices (retimers and re-drivers), link equalization, the PIPE interface, handling multiple packets per clock cycle, and the tradeoff between datapath width and clock frequency.
For each of these topics, the degree of difficulty will vary with the designer’s familiarity with existing PCI Express specifications and with both their current and new target bandwidths.
The PCI-SIG is defining two types of software-transparent extension devices: re-drivers, which are not protocol aware, and retimers, which are protocol aware. Re-drivers are generally analog-only devices with limited utility at 16 GT/s and are not visible to the end devices on their PCI Express link. Retimers, however, form two electrically separate “sub-links,” one to each link partner.
There are no required changes for PCI Express devices to work with retimers, though devices that implement downstream ports (switches and root complexes) are optionally allowed to detect and report retimer presence on their links via a new bit in the PCIe training sequence. Retimers set this new bit, allowing switches and root complexes to report the retimer’s presence to system software and potentially tune system parameters to optimize performance around the retimer. The specification does not say what switches or root complexes should do with this information, other than to “consider retimer latency” when determining values such as replay timer timeouts and buffer sizes.
Endpoint designers may safely ignore retimers, but should monitor the PCI Express 4.0 Base Specification drafts in case this changes. Switch and Root Complex designers should likewise watch to see if more guidance is provided to them in future drafts. As a general suggestion, designers should err on the side of slightly larger buffers and timeouts to accommodate potential slight latency increases from retimers.
PCI Express 3.0 added a new Link Equalization mechanism for use with 8 GT/s signaling, whereby the two link partners perform link training and exchange equalization coefficients. This four-phase process will be extended for PCIe 4.0’s 16 GT/s mode, but in a two-step procedure: the link first switches to 8 GT/s and performs equalization, then switches to 16 GT/s and repeats the equalization process.
Designers should watch this open area of the specification closely, particularly for any changes that become required to the Link Training and Status State Machine (LTSSM). In truth, the likelihood of design changes here is low to moderate, so the overall risk of executing early is low. Given that there are other, bigger changes involved, either implementing what’s in the current 0.3 draft or waiting to see if the mechanism changes is a valid choice.
Most PCI Express controllers utilize the PHY Interface for PCI Express (PIPE) standard for interfacing to their PHY. The choice of width and frequency for this interface is obviously the first-order concern in moving to 16 GT/s. Common practice has been 16 bits per lane, with 500 MHz as a typical maximum frequency. One of those two will have to change to accommodate the new data rate, and the tradeoffs are discussed later in this article.
Unfortunately there is a long history of the PIPE specifications lagging the PCI Express specification, so keep a close eye on the PIPE specification and plan ahead to collaborate with your PHY vendor in case of missing or incomplete PIPE functionality. Pay particular attention to controls for equalization, clocking, or any other new signaling which might be introduced.
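As a first-order illustration (a simple sketch that ignores encoding overhead on the parallel side), the relationship between data rate, per-lane PIPE width, and PIPE clock frequency is straightforward division:

```python
# First-order estimate of the PIPE clock needed for a given per-lane data
# rate and per-lane interface width (encoding overhead ignored).
def pipe_clock_ghz(rate_gt_s: float, bits_per_lane: int) -> float:
    return rate_gt_s / bits_per_lane

print(pipe_clock_ghz(8, 16))   # 0.5 GHz: the common 8 GT/s configuration today
print(pipe_clock_ghz(16, 16))  # 1.0 GHz: keep 16 bits per lane, double the clock
print(pipe_clock_ghz(16, 32))  # 0.5 GHz: keep 500 MHz, double the width
```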
This issue first appeared when PCIe 3.0 introduced 8 GT/s, and while there are several different ways to explain why it happens, we will consider the simplest high-level view. The smallest PCIe Transaction Layer Packet (TLP) is made up of three 32-bit “DWORDs” plus a single 32-bit LCRC value, for a total of 128 bits. At 8 GT/s it is common for the PIPE interface to run 16 bits wide at 500 MHz, so consider what happens in an 8-lane (x8) implementation: 8 lanes × 16 bits yields a 128-bit datapath. This width is the same as a minimum-sized packet, so at most one complete packet can be on the interface at once. However, when the link is 16 lanes (x16), 256 bits are required, and it now becomes possible to receive two complete packets in each cycle.
The problem worsens with PCIe 4.0’s 16 GT/s, as keeping the PHY interface from exceeding 500 MHz would require 32 bits per lane. A x4 16 GT/s link still fits into the 128-bit datapath (32 bits × 4 lanes), but a x8 link (32 bits × 8 lanes = 256 bits) now exhibits the same two-packets-per-cycle issue as a x16 8 GT/s link did. A x16 link running 16 GT/s would require 512 bits (32 bits × 16 lanes) and could therefore have as many as four packets in each cycle.
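The arithmetic behind these scenarios can be summarized in a short sketch (illustrative only; the 128-bit minimum TLP size comes from the three header DWORDs plus LCRC described above):

```python
MIN_TLP_BITS = 128  # 3 x 32-bit header DWORDs + 32-bit LCRC

def max_packets_per_cycle(lanes: int, bits_per_lane: int) -> int:
    """Worst-case number of complete minimum-sized TLPs arriving per clock cycle."""
    return (lanes * bits_per_lane) // MIN_TLP_BITS

print(max_packets_per_cycle(8, 16))   # 1: x8 at 8 GT/s with a 16-bit PIPE
print(max_packets_per_cycle(16, 16))  # 2: x16 at 8 GT/s with a 16-bit PIPE
print(max_packets_per_cycle(4, 32))   # 1: x4 at 16 GT/s with a 32-bit PIPE
print(max_packets_per_cycle(8, 32))   # 2: x8 at 16 GT/s with a 32-bit PIPE
print(max_packets_per_cycle(16, 32))  # 4: x16 at 16 GT/s with a 32-bit PIPE
```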
As application logic in an SoC generally depends on receiving packets one at a time, the PCI Express controller designer must make a choice of how to handle multiple packets per cycle.
Option 1: Limit interface width
Limiting the interface width to 128 bits removes the multiple-packet issue but requires increased clock frequency: 8 GT/s x16 = 1 GHz; 16 GT/s x8 = 1 GHz; 16 GT/s x16 = 2 GHz. While this is easy from an architectural and RTL design standpoint for both the controller designer and the controller user (SoC designer), gate-level implementation will be extremely challenging. The difficulty of closing timing from flop to flop is apparent, but even more difficult may be finding RAMs of suitable speed in the desired sizes. Fast memories are often not large memories.
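The frequencies above follow directly from holding the datapath at 128 bits; a minimal sketch of that calculation (a first-order estimate, with encoding and protocol overhead ignored):

```python
# Core clock needed to sustain line rate through a fixed-width datapath.
def core_clock_ghz(rate_gt_s: float, lanes: int, datapath_bits: int = 128) -> float:
    return rate_gt_s * lanes / datapath_bits

print(core_clock_ghz(8, 16))   # 1.0 GHz for an 8 GT/s x16 link
print(core_clock_ghz(16, 8))   # 1.0 GHz for a 16 GT/s x8 link
print(core_clock_ghz(16, 16))  # 2.0 GHz for a 16 GT/s x16 link
```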
Option 2: Provide multiple packet paths
A second option for the controller designer is to push the problem to the controller user (SoC designer) and insist that the user accept multiple packets per cycle. This forces the controller user to multi-thread an application interface that is traditionally single-threaded, and pushes significant responsibility for obeying PCIe ordering rules back into the application. While this option theoretically provides the best possible performance, the cost to the controller user makes it impractical.
Option 3: Serialize the data stream
The controller designer can instead guarantee never to issue multiple packets per clock by using internal buffering and logic duplication to hold back any packets that would appear simultaneously and present them to the application in sequence. This is the easiest approach for the controller user (SoC designer) but comes at an implementation cost to the controller designer. It also opens up the possibility of performance below the theoretical maximum – a continuous stream of minimum-sized packets must eventually fill the controller’s buffers and force backpressure onto the PCI Express link.
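A toy model of the serializing approach (the names and structure here are purely illustrative, not taken from any real controller) shows both behaviors: at most one packet per cycle toward the application, and backpressure once the internal buffer fills under a sustained worst-case stream:

```python
from collections import deque

class PacketSerializer:
    """Toy model: several packets may arrive per link-side cycle, but at most
    one is presented to the application per cycle."""
    def __init__(self, depth: int):
        self.buffer = deque()
        self.depth = depth

    def cycle(self, incoming):
        # If the new packets would overflow the buffer, refuse them this cycle;
        # a real controller would instead throttle the link via PCIe flow control.
        backpressure = len(self.buffer) + len(incoming) > self.depth
        if not backpressure:
            self.buffer.extend(incoming)
        to_application = self.buffer.popleft() if self.buffer else None
        return to_application, backpressure

# Worst case for a 256-bit datapath: two minimum-sized TLPs every cycle.
ser = PacketSerializer(depth=8)
for n in range(10):
    pkt, stalled = ser.cycle([f"tlp{2 * n}", f"tlp{2 * n + 1}"])
    print(pkt, "(backpressure)" if stalled else "")
```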
Option 4: Combination of Option 1 (Limit interface width) and Option 3 (Serialize the data stream)
The most attractive option is a combination of limiting interface width and serialization to balance implementation complexity against maximum clock frequency. As shown in Table 1, the controller designer can optimize for an attainable frequency by adjusting the datapath width.
For example, a x16 link might be designed at 256 bits and accept two packets per cycle to limit the maximum frequency to 1 GHz, versus the 2 GHz that would be required to avoid the issue altogether. Likewise, that 1 GHz frequency might be more attractive than the complexity of handling four packets per cycle, which the 512-bit choice would require.
| Link configuration | Datapath width | Required clock frequency | Max packets per cycle |
|---|---|---|---|
| x8 at 16 GT/s | 128 bits | 1 GHz | 1 |
| x8 at 16 GT/s | 256 bits | 500 MHz | 2 |
| x16 at 16 GT/s | 128 bits | 2 GHz | 1 |
| x16 at 16 GT/s | 256 bits | 1 GHz | 2 |
| x16 at 16 GT/s | 512 bits | 500 MHz | 4 |
Table 1: Optimizing for frequency by adjusting datapath width
Fundamentally, the higher bandwidth of PCI Express 4.0’s 16 GT/s signaling requires some tradeoffs, as we’ve noted above. The central tradeoff is clock speed versus data bus width, and at the higher end, with x8 and x16 links, the choice comes down to whether the designer would rather contend with higher clock speeds or with a more complex architecture.
512-bit data busses may cause routing issues when physical design begins, and 1 GHz or 2 GHz clock frequencies will certainly make timing closure challenging, if not impossible. Memory availability, silicon geometry, process variation, and voltage-threshold (VT) cell mix all factor into whether a given design will be able to achieve high 16 GT/s speeds.
Design teams implementing PCI Express 4.0’s new 16 GT/s signaling rate should start architectural work now to accommodate the demands of its higher bandwidth – particularly in x8 and x16 links. Decisions about datapath widths and clock frequency permeate the entire SoC and may require far-reaching architectural changes in the application logic. These fundamental choices are driven by the new data rates, and will be independent of any anticipated specification changes. Switch and Root Complex designs, in particular, must take into account the potential impact of new PCIe retimers. Designers should closely monitor the PCI Express 4.0 Base and PIPE Specifications for the latest information and updates.
For more information about PCI Express 4.0 implementation, including how you can begin 16 GT/s design work today, please visit http://www.synopsys.com/pcie.