Error Correction Code (ECC) in DDR Memories

 Vadhiraj Sankaranarayanan, Sr. Technical Marketing Manager, Synopsys

Introduction

Double Data Rate Synchronous Dynamic Random-Access Memory (DDR SDRAM or simply DRAM) technology is widely used for main memory in almost all applications today, ranging from high-performance computing (HPC) to power- and area-sensitive mobile applications. This is due to DDR’s many advantages, including high density with a simple architecture, low latency, and low power consumption. JEDEC, the standards organization that specifies memory standards, has defined and developed four DRAM categories to help designers meet their memory requirements: standard DDR (DDR5/4/3/2), mobile DDR (LPDDR5/4/3/2), graphics DDR (GDDR3/4/5/6), and high-bandwidth DRAM (HBM2/2E/3). Figure 1 shows a high-level block diagram of a memory subsystem in a typical system-on-chip (SoC), which comprises a DDR memory controller, DDR PHY, DDR channel, and DDR memory. As per JEDEC's definition, the DDR channel is composed of Command/Address and data lanes. The simplified DDR memory shown in Figure 1 can represent a DRAM memory component from any of the four categories.

Figure 1: Memory subsystem block diagram in an SoC

As with any electronic system, errors in the memory subsystem are possible due to design failures/defects or electrical noise in any one of the components. These errors are classified as either hard errors (caused by design failures) or soft errors (caused by system noise or memory array bit flips due to alpha particles, etc.). As the names suggest, hard errors are permanent and soft errors are transient in nature. Although it is logical to expect the DRAMs, with their large memory arrays that grow denser on a smaller process node with every generation of the standard, to be the main source of memory errors, end-to-end protection from the controller to the DRAMs is highly desirable for overall memory subsystem robustness.

To handle these memory errors during runtime, the memory subsystem must have advanced RAS (Reliability, Availability, and Serviceability) features that extend overall system uptime when memory errors occur. Without RAS features, the system will most likely crash due to memory errors. With RAS features, the system can continue operating when there are correctable errors, while logging the details of uncorrectable errors for future debugging purposes.

ECC as a Memory RAS Feature

One of the most popular RAS schemes used in the memory subsystem is Error Correction Code (ECC) memory. By generating SECDED (Single-bit Error Correction and Double-bit Error Detection) ECC codes for the actual data and storing them in additional DRAM storage, the DDR controller can correct single-bit errors and detect double-bit errors in the data received from the DRAMs.

The ECC generation and check sequence is as follows:

  • The ECC codes are generated by the controller based on the actual WR (WRITE) data. The memory stores both the WR data and the ECC code.
  • During a RD (READ) operation, the controller reads both the data and respective ECC code from the memory. The controller regenerates the ECC code from the received data and compares it against the received ECC code.
  • If there is a match, then no errors have occurred. If there are mismatches, the ECC SECDED mechanism allows the controller to correct any single-bit error and detect double-bit errors.
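The sequence above can be sketched in Python with an extended Hamming code over an 8-bit data word. This is only an illustration under stated assumptions, not the code any particular controller uses: the function names `secded_encode` and `secded_check`, the 8-bit data width, and the bit layout are all hypothetical choices made for brevity.

```python
PARITY_POS = (1, 2, 4, 8)                 # power-of-two positions hold check bits
DATA_POS = (3, 5, 6, 7, 9, 10, 11, 12)    # remaining positions hold the 8 data bits
WORD_LEN = 13                             # position 0 holds the overall parity bit


def secded_encode(data):
    """WR path: build a 13-bit codeword (list of 0/1) from an 8-bit value."""
    word = [0] * WORD_LEN
    for i, pos in enumerate(DATA_POS):
        word[pos] = (data >> i) & 1
    for p in PARITY_POS:
        # Check bit p gives even parity over every position whose index has bit p set.
        word[p] = sum(word[pos] for pos in range(1, WORD_LEN)
                      if pos & p and pos != p) % 2
    word[0] = sum(word[1:]) % 2           # overall parity enables double-bit detection
    return word


def secded_check(word):
    """RD path: return (status, data); data is None when uncorrectable."""
    word = list(word)
    syndrome = 0
    for pos in range(1, WORD_LEN):
        if word[pos]:
            syndrome ^= pos               # XOR of set positions is 0 for a valid word
    parity_odd = sum(word) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd and syndrome < WORD_LEN:
        word[syndrome] ^= 1               # single-bit error: syndrome is its position
        status = "corrected"
    else:
        return "uncorrectable", None      # even number of flips, or invalid position
    data = sum(word[pos] << i for i, pos in enumerate(DATA_POS))
    return status, data


stored = secded_encode(0xA5)
single = list(stored)
single[6] ^= 1                            # one bit flips somewhere in the subsystem
double = list(stored)
double[3] ^= 1
double[9] ^= 1                            # two bits flip
print(secded_check(stored)[0], secded_check(single)[0], secded_check(double)[0])
```

In this sketch a clean read reports `ok`, a single flipped bit is repaired and reported as `corrected`, and two flipped bits are reported as `uncorrectable` without any attempt at repair.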

Such an ECC scheme provides end-to-end protection against single-bit errors that can occur anywhere in the memory subsystem between the controller and the memory.

Based on where the ECC codes are stored, the ECC scheme can be of two types: side-band ECC or inline ECC. In side-band ECC, the ECC codes are stored on separate DRAMs, and in inline ECC, the codes are stored on the same DRAMs as the actual data.

As DDR5 and LPDDR5 support much higher data-rates than their predecessors, they support additional ECC features for enhancing the robustness of the memory subsystem. On-die ECC in DDR5 and Link-ECC in LPDDR5 are two such RAS schemes to further bolster the memory subsystem RAS capabilities. 

Different Schemes of ECC

Side-band ECC

The side-band ECC scheme is typically implemented in applications using standard DDR memories (such as DDR4 and DDR5). As the name suggests, the ECC code is sent as side-band data along with the actual data to memory. For instance, for a 64-bit data width, 8 additional bits are used for ECC storage. Hence, the DDR4 ECC DIMMs, commonly used in today’s enterprise-class servers and data centers, are 72 bits wide. These DIMMs have two additional x4 DRAMs or a single x8 DRAM for the additional 8 bits of ECC storage. Hence, in side-band ECC, the controller writes and reads the ECC code along with the actual data. No additional WR or RD overhead commands are required for this ECC scheme. Figure 2 describes the WR and RD operation flows with side-band ECC. When there are no errors in the received data, side-band ECC incurs minimal latency penalty as compared to inline ECC.
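The 64 + 8 = 72-bit width can be checked with a short calculation. The sketch below assumes a standard extended-Hamming SECDED construction (the article does not state which code is used), and the helper names `sec_check_bits` and `secded_check_bits` are hypothetical.

```python
def sec_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error-correction bound)."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits):
    """Extended Hamming: one extra overall-parity bit adds double-bit detection."""
    return sec_check_bits(data_bits) + 1


data_width = 64                                  # data width from the text
ecc_width = secded_check_bits(data_width)        # check bits under the assumption above
dimm_width = data_width + ecc_width
print(f"{data_width} data bits + {ecc_width} ECC bits = {dimm_width}-bit DIMM")
print(f"storage overhead: {ecc_width / data_width:.1%}")
```

Under that assumption, 64 data bits need 7 bits for single-bit correction plus 1 for double-bit detection, which matches the 8 ECC bits and 72-bit DIMM width described above.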

Figure 2: WR and RD operation flows with side-band ECC

Inline ECC

The inline ECC scheme is typically implemented in applications using LPDDR memories. As the LPDDR DRAMs have a fixed channel width (16 bits for an LPDDR5/4/4X channel), side-band ECC becomes an expensive solution with these memories. For instance, for a 16-bit data width, an additional 16-bit LPDDR channel needs to be allocated for side-band ECC for the 7- or 8-bit ECC code-word. Moreover, the 7- or 8-bit ECC code-word fills the additional 16-bit channel only partially, resulting in storage inefficiency and also adding extra load to the command/address channel, possibly limiting performance. Hence, inline ECC becomes a better solution for LPDDR memories.

Instead of requiring an additional channel for ECC storage, the controller in inline ECC stores the ECC code in the same DRAM channel where the actual data is stored. Hence, the overall data-width of the memory channel remains the same as the actual data-width.  

In inline ECC, the 16-bit channel memory is partitioned such that a dedicated fraction of the memory is allocated to ECC code storage. Because the ECC code is not sent along with the WR and RD data, the controller generates separate overhead WR and RD commands for ECC codes. Hence, every WR and RD command for the actual data is accompanied by an overhead WR or RD command, respectively, for the ECC data. High-performance controllers reduce the penalty of such overhead ECC commands by packing the ECC data of several consecutive addresses into one overhead ECC WR command. Similarly, the controller reads the ECC data of several consecutive addresses from memory in one overhead ECC RD command and can apply the read-out ECC data to the actual data from the consecutive addresses. Hence, the more sequential the traffic pattern, the smaller the latency penalty due to such ECC overhead commands. Figure 3 describes the WR and RD operation flows with inline ECC.
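The effect of traffic pattern on overhead commands can be illustrated with a toy model. The sketch below assumes that one overhead ECC RD fetches the ECC for 8 consecutive data bursts and that the controller keeps only the most recently fetched ECC burst; both the ratio `BURSTS_PER_ECC_BURST` and the function `count_ecc_reads` are hypothetical and do not describe any specific controller.

```python
BURSTS_PER_ECC_BURST = 8   # assumption: one ECC burst covers 8 consecutive data bursts


def count_ecc_reads(burst_addresses):
    """Count overhead ECC RD commands for a sequence of data-burst addresses."""
    last_group = None              # ECC burst currently held by the controller
    ecc_reads = 0
    for addr in burst_addresses:
        group = addr // BURSTS_PER_ECC_BURST
        if group != last_group:    # ECC for this address is not on hand: fetch it
            ecc_reads += 1
            last_group = group
    return ecc_reads


# Same 64 data bursts, visited in two different orders.
sequential = list(range(64))
scattered = [8 * (i % 8) + i // 8 for i in range(64)]   # stride-8 access order

print("sequential:", count_ecc_reads(sequential), "ECC reads for 64 data reads")
print("scattered: ", count_ecc_reads(scattered), "ECC reads for 64 data reads")
```

In this model the sequential order needs one overhead read per 8 data reads, while the strided order needs one overhead read for every data read, which is the worst case described above.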

Figure 3: WR and RD operation flows with Inline ECC

On-die ECC

With each DDR generation, it is common for the DRAM capacity to increase. It is also common for DRAM vendors to shrink the process technology to achieve both higher speeds and better economies of scale in production. With the higher capacity and speed coupled with the smaller process technology, the likelihood of single-bit errors increases on the DRAM memory arrays. To further bolster the memory channel, DDR5 DRAMs have additional storage dedicated to ECC. On-die ECC is an advanced RAS feature that the DDR5 system can enable for higher speeds. For every 128 bits of data, DDR5 DRAMs have 8 additional bits for ECC storage.

The DRAMs internally compute the ECC for the WR data and store the ECC code in the additional storage. On a read operation, the DRAMs read out both the actual data as well as the ECC code and can correct any single-bit error on any of the read data bits. Hence, on-die ECC provides further protection against single-bit errors inside the DDR5 memory arrays. As this scheme does not offer any protection against errors occurring on the DDR channel, on-die ECC is used in conjunction with side-band ECC for enhanced end-to-end RAS on memory subsystems. Figure 4 describes the WR and RD operation flows with on-die ECC.
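As a quick sanity check on the 128 + 8 arrangement, the sketch below counts syndromes: a single-error-correcting code needs one distinct syndrome for each bit position in the codeword plus one for "no error". The variable names are illustrative only, and the actual code used inside a given DRAM is not described in this article.

```python
DATA_BITS = 128   # data bits per on-die ECC word (from the text)
ECC_BITS = 8      # ECC bits stored with them (from the text)

codeword_bits = DATA_BITS + ECC_BITS

# One syndrome per possible single-bit error position, plus the error-free case.
syndromes_needed = codeword_bits + 1
syndromes_available = 2 ** ECC_BITS

can_correct_single = syndromes_available >= syndromes_needed
storage_overhead = ECC_BITS / DATA_BITS

print(f"codeword: {codeword_bits} bits")
print(f"syndromes: {syndromes_available} available, {syndromes_needed} needed")
print(f"single-bit correction possible: {can_correct_single}")
print(f"storage overhead: {storage_overhead:.2%}")
```

Eight ECC bits are therefore enough to locate any single flipped bit in a 136-bit codeword, consistent with the single-bit correction described above, at half the relative storage overhead of 8 bits per 64 bits.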

Figure 4: WR and RD operation flows with On-die ECC

Link-ECC

The Link-ECC scheme is an LPDDR5 feature that offers protection against single-bit errors on the LPDDR5 link or channel. The memory controller computes the ECC for the WR data and sends the ECC on specific bits along with the data. The DRAM generates the ECC on the received data, checks it against the received ECC data, and corrects any single-bit errors. The roles of the controller and the DRAM are reversed for the read operation. Note that link-ECC does not offer any protection against single-bit errors on the memory array. However, inline ECC coupled with link-ECC strengthens the robustness of LPDDR5 channels by providing end-to-end protection against single-bit errors. Figure 5 describes the WR and RD operation flows with link-ECC.
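The coverage of the four schemes discussed in this article can be summarized as a small lookup table. The sketch below is only a restatement of the text in code form; the segment labels `"link"` and `"array"` and the helper names are hypothetical simplifications.

```python
# Which part of the memory subsystem each scheme protects against single-bit
# errors, as described in this article ("link" = DDR channel, "array" = DRAM cells).
COVERAGE = {
    "side-band ECC": {"link", "array"},   # controller generates and checks: end to end
    "inline ECC": {"link", "array"},      # controller generates and checks: end to end
    "on-die ECC": {"array"},              # generated and checked inside the DRAM
    "link-ECC": {"link"},                 # generated at one end of the link, checked at the other
}


def protected_segments(*schemes):
    """Union of the segments covered by the enabled schemes."""
    covered = set()
    for scheme in schemes:
        covered |= COVERAGE[scheme]
    return covered


def layers(segment, *schemes):
    """Enabled schemes that cover a given segment, in alphabetical order."""
    return sorted(s for s in schemes if segment in COVERAGE[s])


print(protected_segments("link-ECC"))
print(layers("link", "inline ECC", "link-ECC"))
print(layers("array", "side-band ECC", "on-die ECC"))
```

Read this way, link-ECC on its own leaves the array unprotected, while pairing it with inline ECC gives the link two layers of protection and keeps the array covered.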

Figure 5: WR and RD operation flows with Link-ECC

Conclusion

One of the most widely used memory RAS features is the Error Correction Code (ECC) scheme. Applications using standard DDR memories typically implement side-band ECC, while applications using LPDDR memories implement inline ECC. With the higher speeds and hence more pronounced signal integrity (SI) effects on DDR5 and LPDDR5 channels, ECC is now supported even on DDR5 and LPDDR5 DRAMs in the form of on-die ECC and link-ECC, respectively. Synopsys’ DesignWare® DDR5/4 and LPDDR5/4 IP solutions offer advanced RAS features including all of the ECC schemes highlighted in this article.