How CXL Is Improving Latency in High-Performance Computing

Gary Ruggles, Richard Solomon, Varun Agrawal

Aug 08, 2023 / 7 min read

From the dawn of civilization through 2003, roughly five exabytes of data were created in total, according to Eric Schmidt, former CEO of Google. By 2025, global data creation is expected to reach 180 zettabytes. Since 180 zettabytes is 180,000 exabytes, this means that within the span of a single generation, we will have created roughly 36,000 times as much data as had been created through 2003. That's a lot of data! To accommodate this data explosion, the installed base of storage capacity is expected to increase at 19.2% CAGR through 2025, and the data center accelerator market is expected to grow by 25% CAGR through 2028.

It doesn't stop there.

Managing data—created, copied, stored, consumed, and otherwise proliferated from the data center to the edge—creates unique challenges for SoC designers. This includes mounting pressure to move the data through systems faster and with greater efficiency and security: Lower power. Smaller area. Lower latency. And with data confidentiality and integrity. It's essential for the interconnects in multi-die systems to have low latency along with enough flexibility to manage a variety of bandwidths and throughput. Complying with the right industry standards can help ensure design success.

One of the newer kids on the standards block—and quickly gaining traction—is Compute Express Link (CXL), an open interface specification with its own consortium for processors, accelerators, and memory expansion. Read on to learn more about the CXL standard and when you might want to consider CXL for improving latency in your next SoC design.

What Is the CXL Standard?

The CXL standard is made up of three protocols that negotiate up within one link that leverages the PCI Express® (PCIe®) electrical layer: CXL.io, CXL.cache, and CXL.memory (CXL.mem). Each of these CXL protocols has a separate stack, multiplexed at the PHY level, and each is suited to a different context:

  1. CXL.io: This protocol is carried over from the PCIe protocol layer, with minor tweaks to allow it to be carried concurrently with CXL.mem and CXL.cache. CXL.io is used for link configuration and management tasks, and it can also be used for register reads and writes and for large data transfers, such as those used in block storage and traditional networking.
  2. CXL.cache: This CXL protocol delivers cache coherency and extremely low latency. It works for smaller amounts of data (for example, cache lines or individual bytes). The extremely low latency comes from the fact that snoops (processor logic that checks for data changes in the cache) do not require copying data back and forth across the system. Cache coherency means that accelerators have direct access to the same data that the host processors do, enabling faster and more efficient computational offload.
  3. CXL.mem: This protocol provides a low latency access mechanism for memory resident on the CXL link. This capability allows for a number of interesting system architectures, and it inherently comprehends the idea of non-volatile memory. For example, CXL.mem allows pooling memory resources from across different devices so that multiple memory modules can be connected to a server without going through a traditional memory controller and DRAM interface — it gives processors access to shared memory resources.
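The division of labor among the three protocols can be sketched as a simple lookup. This is only an illustration of the descriptions above: the traffic category names and the `protocol_for` helper are hypothetical, and in real hardware the protocols are selected and multiplexed by the controller, not by application code.

```python
from enum import Enum


class CxlProtocol(Enum):
    IO = "CXL.io"
    CACHE = "CXL.cache"
    MEM = "CXL.mem"


# Hypothetical traffic categories, named here only for illustration.
TRAFFIC_TO_PROTOCOL = {
    "link_configuration": CxlProtocol.IO,      # link setup and management
    "register_access": CxlProtocol.IO,         # register reads and writes
    "bulk_transfer": CxlProtocol.IO,           # block storage, networking
    "coherent_cache_line": CxlProtocol.CACHE,  # small, coherent accesses
    "device_memory_access": CxlProtocol.MEM,   # memory resident on the link
}


def protocol_for(traffic: str) -> CxlProtocol:
    """Return the protocol the text above associates with a traffic category."""
    try:
        return TRAFFIC_TO_PROTOCOL[traffic]
    except KeyError:
        raise ValueError(f"unknown traffic category: {traffic!r}") from None


for kind in TRAFFIC_TO_PROTOCOL:
    print(f"{kind:22s} -> {protocol_for(kind).value}")
```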

Low-latency interfaces such as CXL unlock new ways of computing: they enable efficient heterogeneous computing architectures, accelerate data-intensive workloads, and facilitate advanced real-time analytics. CXL's computational offloading and memory pooling, coupled with the ability to interoperate with the ubiquitous PCIe standard, open up a wide array of design possibilities. CXL extends the new paradigm of disaggregation and composability in multi-die systems to include cache and memory.

Accelerators: Avoid Copying Data Around the System with Direct Memory Access

If a processor offers a CXL interface, accelerators can have access to the same data as the processor, avoiding the need to replicate data across the system.

Here's an example of how this helps the efficiency of your system:

Imagine you are designing a security camera application. There's a physical camera, and it dumps frames of data into system memory at maybe 30, 60, or 100 frames per second, or more. The processor takes those frames of data in the memory, and it recognizes a face, and another face, and another. The processor needs to parse out which face is Ted, which is Michael, and which is Sophia.

In the past, there was a lot of back and forth of the control and the copying of data to do this kind of operation. The CPU would have to tell the driver to copy the frames of data from memory and deliver it to the accelerator through the system bus. After the data was delivered to a memory buffer in the accelerator, the accelerator would analyze the data to determine who those faces were. All that data would then have to travel back through the system to the CPU that would write the names associated with the faces into the memory.

With CXL, instead of the driver copying the face data over to a buffer on the accelerator through the system bus, the accelerator has direct memory access. This means that the CPU can simply send pointers to the accelerator that say (for instance), "Look at the addresses 1,000,000, 1,100,000, and 1,200,000 in the memory. Those are faces. Let me know who those faces are." The accelerator can update the system memory directly, defining the faces as Ted, Michael, and Sophia without sending the data back and forth through the system.

With CXL, data only gets moved as the co-processor needs it. Even then, when it accesses a face, it does not copy all the data across the system bus; it copies only the information that is absolutely necessary, not the entire frame. This equates to less software overhead and latency, freeing your system up for better die-to-die communications.
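To make the difference concrete, here is a minimal sketch that counts bytes in both flows. It is a toy model in plain Python, not a CXL API: the frame size, face-region size, label size, and the location of the name records are all assumptions chosen for illustration, and the "recognition" step is a stand-in list of names. A `memoryview` plays the role of the accelerator's direct window onto system memory.

```python
FRAME_BYTES = 2_000_000   # assumed frame size; not from the article
FACE_BYTES = 4_096        # assumed size of one face region
LABEL_BYTES = 16          # assumed size of one name record
FACE_ADDRESSES = [1_000_000, 1_100_000, 1_200_000]  # addresses from the example
NAMES = ["Ted", "Michael", "Sophia"]                 # stand-in for real recognition


def copy_based_traffic(num_faces: int) -> int:
    """Bytes crossing the bus when the driver copies the whole frame to the
    accelerator and the labels travel back to the CPU."""
    return FRAME_BYTES + num_faces * LABEL_BYTES


def shared_memory_traffic(memory: bytearray, label_base: int) -> int:
    """Bytes touched when the accelerator works on system memory in place."""
    view = memoryview(memory)  # shares the buffer; no copy is made
    touched = 0
    for i, (addr, name) in enumerate(zip(FACE_ADDRESSES, NAMES)):
        face = view[addr:addr + FACE_BYTES]  # zero-copy window onto one face
        touched += len(face)
        slot = label_base + i * LABEL_BYTES
        # Write the name record straight into system memory.
        view[slot:slot + LABEL_BYTES] = name.encode().ljust(LABEL_BYTES, b"\0")
        touched += LABEL_BYTES
    return touched


system_memory = bytearray(FRAME_BYTES)
label_base = 1_900_000  # hypothetical location for the name records
print("copy-based bytes moved:   ", copy_based_traffic(len(NAMES)))
print("shared-memory bytes moved:", shared_memory_traffic(system_memory, label_base))
print("first label:", bytes(system_memory[label_base:label_base + 3]))
```

Under these assumed sizes, the shared-memory flow touches only the three face regions and their labels, while the copy-based flow moves the full frame regardless of how little of it is relevant.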

Memory: Pooling Memory Resources Results in Memory Allocations that Fit Your Application

With today's demands in multi-die systems, the number of processor cores is growing rapidly, and more processor cores require more memory. At the same time, a given core may not need all of a one-size-fits-all memory allocation. The CXL protocol addresses this by allowing memory expansion with coherence, meaning that processor cores can share memory resources in a way that increases efficiency.

Prior to CXL, designers had to make non-volatile memory interfaces look like DRAM so that the memory would persist even if the power was cut. While the methods to manipulate products to look like DRAM might work great for dynamic memory applications, they are anything but streamlined for non-volatile memory. It's like trying to hammer a square peg into a round hole. A few companies even went so far as to build products that look like DRAM and physically plug into a DIMM socket on your server. That way, the DRAM could be copied off to NAND Flash or MRAMs when the power failed, or the DRAM could otherwise interface to more complex technology. To do all of this effectively, you had to be really clever and creative, and then? Well, you needed to hope for the best.

Enter CXL (because hope really isn't a great strategy). While DRAM solutions work well when in close physical proximity to your processor, when you’re running hundreds of bits across even a few inches of PC board, you run into all kinds of skew problems. CXL.mem is better suited for moving data over longer distances.

In addition, the CXL.mem protocol part of the specification enables you to:

  • Put more DRAM in your system.
  • Move DRAM to a different place.
  • Enable DRAM-like behavior along with other attributes, for instance attributes of non-volatile storage, storage-class memories, linear storage, or byte-addressable storage.

Imagine if your system could talk to your disk drive like it were a memory and you didn’t have to worry about sectors, heads, and tracks. What if your system could get 3 bytes of memory from a device in one place and 100 bytes from a device in a different place? In short, what if memory was pooled as a common resource?

In the past, if you rented 100 processor cores from a server farm, you probably also rented 800 GB (or some other fixed amount) of memory. But maybe you didn't need all that memory, so some of it was stranded (never used). CXL mitigates this with memory pooling, so you can converge on near-perfect memory utilization, reducing wasted capacity and latency.
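The stranded-memory arithmetic can be sketched in a few lines. The three jobs and their memory needs below are made-up numbers, and the pool is a plain counter, not a model of a real CXL memory device or fabric manager; only the 800 GB per 100 cores bundle comes from the example above.

```python
from dataclasses import dataclass

FIXED_GB_PER_100_CORES = 800  # the fixed bundle from the example above


@dataclass
class Job:
    name: str
    cores: int
    memory_needed_gb: int


def stranded_with_fixed_bundles(jobs) -> int:
    """GB rented but never used when memory comes bundled with cores."""
    stranded = 0
    for job in jobs:
        bundles = -(-job.cores // 100)  # ceiling division: whole bundles only
        rented = bundles * FIXED_GB_PER_100_CORES
        stranded += max(rented - job.memory_needed_gb, 0)
    return stranded


def allocate_from_pool(jobs, pool_gb: int):
    """Grant each job exactly what it asks for from one shared pool."""
    grants, remaining = {}, pool_gb
    for job in jobs:
        if job.memory_needed_gb > remaining:
            raise ValueError(f"pool exhausted while placing {job.name}")
        grants[job.name] = job.memory_needed_gb
        remaining -= job.memory_needed_gb
    return grants, remaining


# Hypothetical workloads, each renting 100 cores.
jobs = [Job("analytics", 100, 300), Job("render", 100, 750), Job("web", 100, 120)]
grants, free_gb = allocate_from_pool(jobs, pool_gb=2400)
print("stranded with fixed bundles:", stranded_with_fixed_bundles(jobs), "GB")
print("left in pool for other jobs:", free_gb, "GB")
```

With fixed bundles the unused capacity is locked to the job that rented it. With a pool, the same capacity stays available to whichever job needs it next.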

How to Implement State-of-the-Art CXL Solutions

With CXL, you can create virtual machines that have the right mixture of memory, processing, and acceleration for your specific job. Synopsys offers a comprehensive solution for implementing CXL to help you get started, including our CXL PHY and controllers with IDE security and verification IP. Synopsys CXL controllers behave like a superset of a PCIe controller, leveraging the speed of PCIe along with the PHY. We also have hardware help in the form of IP prototyping kits.

Synopsys Verification IP and protocol verification solutions for CXL (up to 3.0) on Synopsys ZeBu® and HAPS® hardware-assisted platforms provide an IP-to-system-level methodology to verify CXL bus latency and identify system bottlenecks in compliance with Chapter 13 of the CXL specification.

As part of our complete solution, our experts can help you make the right decisions for your subsystems. This is useful when you are building very complex SoCs, for example with 20 different combinations for bifurcation cases (16 lane, 2×8 lane, 8×2 lane, 4×4 lane, etc.). Supporting a number of combinations requires instantiating a lot of different controllers and handling the various clock and reset logic carefully, and potentially even integrating the PHYs into a single subsystem. We have deep experience helping customers on the leading edge of adoption and beyond, and we use this background to help you.
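As a rough illustration of why bifurcation multiplies the integration work, the sketch below enumerates only the equal-width splits of a 16-lane port. It assumes power-of-two link widths and ignores mixed-width splits (such as one x8 link plus two x4 links), so it yields fewer cases than the 20 combinations mentioned above. The function name is hypothetical.

```python
def uniform_bifurcations(total_lanes: int = 16):
    """List (number_of_links, lanes_per_link) for every equal-width split.

    Assumes total_lanes is a power of two; mixed-width splits are not covered.
    """
    if total_lanes < 1 or total_lanes & (total_lanes - 1):
        raise ValueError("total_lanes must be a positive power of two")
    options = []
    width = total_lanes
    while width >= 1:
        options.append((total_lanes // width, width))
        width //= 2
    return options


for links, width in uniform_bifurcations(16):
    # Each link in a given mode needs its own controller instance.
    print(f"{links} x{width}: {links} controller(s) active")
```

Even in this reduced model, the widest split needs sixteen independent links, which hints at how quickly controller, clock, and reset handling grows once mixed-width cases are added.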

Not only can we help ease your design journey and lower your risk, but we've also got you covered with interface security. We have solutions for Integrity and Data Encryption (IDE) for both PCIe and CXL, providing confidentiality, integrity, and replay protection, including support for the TEE Device Interface Security Protocol (TDISP) for virtualized environments, and more. Our IDE security module gives you a complete, fully integrated, and configurable solution with very low latency overhead.

To learn more, check out how XConn achieved first-time silicon success for its CXL switch SoC with Synopsys CXL and PCIe IP products, or download our Synopsys IDE Secure Module for CXL 2.0 datasheet.
