# **DesignCon 2019**

# 6.4Gb/s Single-Ended Transceiver Techniques for DDR5 Server Application

Tingting Pang, Huawei Technologies

Tianyu Liang, Huawei Technologies

Zhihua Xu, Huawei Technologies

# Abstract

To meet the rapidly growing demand for higher bandwidth and larger density in server applications, the single-ended DDR5 I/O operating rate will be driven to 6.4 Gb/s. Because the multi-drop, parallel single-ended signaling architecture will still be used in server applications, transmission-line effects of the channel, such as intersymbol interference (ISI) and cross-talk, will increasingly erode the final margin as the data rate rises. System designers will therefore face a major challenge in achieving 6.4 Gb/s signal transmission. At the same time, DDR5 SDRAM will adopt the source synchronous unmatched structure (SSUS), which is already used in current LPDDR SDRAM to meet high-speed design requirements. The SSUS weakens the ability of the DDR system to track correlated power supply induced jitter (PSIJ) between data and strobe, and it will reduce the timing margin significantly if not handled carefully. To enable 6.4 Gb/s server applications, this paper analyzes the DDR5 transceiver requirements based on realistic channel topologies, covering a combination of equalization techniques and PSIJ management techniques.

# Author(s) Biography

Tingting Pang joined Huawei Technologies in 2017 as a signal/power integrity engineer. She has been working on DDR modeling, simulation and measurement. Her areas of interest are signal/power integrity and high-speed transceiver channel design. She received her master's degree from Beihang University, for research in device and circuit design for spin-based nonvolatile memory.

Tianyu Liang joined Huawei Technologies in 2012 as a principal engineer of signal and power integrity. His work focuses on DDR system architecture design, modeling and simulation methodology, and electrical validation methodology. He received his master's degree from Chongqing University of Posts and Telecommunications, for research on baseband DSP algorithms and FEC technology in communication systems.

Zhihua Xu joined Huawei Technologies in 2016 as a signal/power integrity engineer. He has been working on DDR modeling, simulation and measurement. He focuses on the development of tools and methodologies for efficient signal/power integrity analysis. He received his master's degree from Xidian University, for research in signal/power integrity and fast time-domain channel simulation.

## Introduction

In server applications, the demand for higher bandwidth and larger density drives single-ended DDR5 I/O to operate at up to 6.4 Gb/s. However, several challenges constrain DDR5 I/O signaling at this speed, the most prominent being ISI, cross-talk and PSIJ. To clear these roadblocks on the way to 6.4 Gb/s, this paper analyzes DDR5 transceiver requirements based on several current, realistic server scenarios, and tries to identify the transceiver characteristics needed for future applications.

For a long, lossy channel that acts as a low-pass filter, a signal with a higher data rate suffers more severe attenuation. Although the Standard Committee has confirmed that the PTH connector will be replaced by an SMT connector in the DDR5 generation, ISI still needs to be addressed at 6.4 Gb/s. Fortunately, it can be greatly relieved by equalization techniques such as a continuous time linear equalizer (CTLE) and a decision feedback equalizer (DFE) at the receiver (RX) side, and a feed-forward equalizer (FFE) at the transmitter (TX) side [1]. However, the trade-off between area, power consumption and margin deserves attention. Based on realistic channels and a statistical analysis algorithm, this paper analyzes both the benefit and the cost of these equalizers for ISI management, and then recommends possible equalization combinations for opening the eye at 6.4 Gb/s.

DDR5 SDRAM still employs parallel single-ended signaling, which leads to high-density layout and compact placement on server boards. Channel cross-talk is therefore becoming a major barrier for 6.4 Gb/s applications. Conventionally, there are two ways to reduce channel cross-talk. (a) At the system level, increasing the distance between signal lines or adding shielding lines in the physical design effectively weakens the coupled energy. However, this needs more layout space on the PCB and the SoC package, which increases cost dramatically, so it is infeasible for a high-density parallel link system. (b) At the circuit level, a cross-talk equalizer can be applied at TX or RX to cancel far-end cross-talk (FEXT). Although the circuit-based approach increases circuit complexity and power consumption, its good cross-talk cancellation performance makes it necessary for a high-speed, low-cost, parallel single-ended system. The cross-talk management section analyzes how TX and RX cross-talk equalization techniques perform in an eight-lane single-ended bus application, and investigates the preferable design parameters.

Besides signal integrity (SI) management, the power integrity (PI) problem in DDR5 should also be considered carefully. High-speed source-synchronous systems such as DDR3 and DDR4, which employ a forward clocking structure with a 90 degree skew between DQS and DQ, are able to track jitter that is correlated between data and strobe, such as PSIJ [4]. Unlike previous DDR generations, DDR5 SDRAM uses an unmatched DQS-DQ path architecture (> 90 degree), which is also employed in LPDDR4. The DQS strobe therefore arrives at the SDRAM ball ahead of the DQ signal by tDQS2DQ, which may amplify high-frequency jitter and weaken PSIJ tracking. Timing margin is then lost because of untracked jitter between DQ and DQS. The PSIJ management section analyzes the impact of PSIJ on the skewed DQS-DQ path in a DDR5 system running at 6.4 Gb/s, and provides PI design suggestions for this new sampling architecture.

## **Channel Characteristics**

Multi-drop topology has long been employed in DDR systems to obtain larger capacity. In a typical server memory system, each motherboard channel usually connects the SoC to multiple slots that can be populated with dual in-line memory modules (DIMMs), as shown in Figure 1. With two slots per channel (2SPC), one DIMM per channel (1DPC) and two DIMMs per channel (2DPC) are the most popular configurations in the DDR4/5 generation. In the 1DPC configuration, the far slot is populated while the near slot is left empty. In the 2DPC configuration, both slots are populated as shown in Figure 1, where the far DIMM is called DIMM0 and the near one DIMM1. In addition, each dual-rank DIMM drives two SDRAMs, and signals are transferred bidirectionally between the SoC and the SDRAMs. In this single-point to multi-point system, reflections arising from impedance discontinuities occur easily at each component connection point. The traces to the untargeted SDRAMs or the untargeted DIMM also act as stubs for the target DRAM, so signal integrity design becomes more difficult as the data rate rises, for example to 6.4 Gb/s.



Figure 1. PTH and SMT topology

In the DDR5 generation, the Standard Committee has confirmed that the PTH connector will be replaced by an SMT connector. Figure 1 shows the insertion loss (IL) comparison between the PTH design and the SMT design. The bandwidth at about 10 dB of insertion loss is extended from 3.5 GHz to 6 GHz, which greatly reduces the bandwidth limitation of the system. However, the signal still suffers severe attenuation in this low-pass channel as the data rate increases. To represent the server application, a realistic long channel consisting of a motherboard with two SMT connectors is adopted for simulation. The write operation mode (WROM) and read operation mode (RDOM) of the 1DPC and 2DPC configurations are both considered in this paper.

The IL and FEXT of these cases are shown in Figure 2, with all IL results normalized to the DC value. For IL, the attenuation of 1DPC is around 3 dB larger than that of 2DPC at 3.2 GHz in both WROM and RDOM. DIMM1 of 2DPC shows more attenuation because of the round-trip reflection caused by DIMM0, and IL is more severe in RDOM. For FEXT, the lower plot in Figure 2 shows the case with the largest coupled energy, which is mainly caused by parasitic capacitive coupling from the nearest aggressor. The FEXT magnitude at 3.2 GHz is about -30 dB, and FEXT in RDOM is around 5 dB higher than in WROM. From these IL and FEXT results, RDOM can be expected to be the more challenging mode for margin.
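The channel figures above are quoted at 3.2 GHz because that is the Nyquist frequency of a 6.4 Gb/s NRZ stream. The short sketch below makes the arithmetic explicit and converts the quoted -30 dB FEXT magnitude into a linear amplitude ratio. The function names are ours and are only illustrative.

```python
def unit_interval_ps(data_rate_gbps):
    """Return the unit interval (one bit time) in picoseconds."""
    return 1000.0 / data_rate_gbps


def nyquist_ghz(data_rate_gbps):
    """Return the Nyquist frequency in GHz for NRZ signaling (half the bit rate)."""
    return data_rate_gbps / 2.0


def db_to_amplitude(db):
    """Convert a magnitude in dB to a linear amplitude (voltage) ratio."""
    return 10.0 ** (db / 20.0)


ui = unit_interval_ps(6.4)          # bit time at 6.4 Gb/s
f_nyq = nyquist_ghz(6.4)            # frequency at which IL and FEXT are quoted
fext_ratio = db_to_amplitude(-30)   # FEXT magnitude from the text, as a ratio
print(f"UI = {ui:.2f} ps, Nyquist = {f_nyq:.1f} GHz, FEXT ratio = {fext_ratio:.4f}")
```

The eye widths reported later in picoseconds can be read against this unit interval.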



Figure 2 Channel characteristics of server application

Based on the channels analyzed above, the following sections investigate the effect of ISI and cross-talk on the final margin at a bit error rate (BER) of 1E-16, in WROM and RDOM with 1DPC and 2DPC configurations, together with the corresponding management techniques.

## **ISI Management**

ISI can be caused by reflections due to an improper termination scheme, by large capacitive loads, and by channel dispersion, where different frequencies are attenuated by different amounts. To minimize ISI, a combined equalization strategy is used, such as CTLE and DFE at RX and FFE at TX. As far as we know, a 4-tap DFE will be adopted in the SDRAM to compensate ISI in WROM, and a variety of equalization techniques can be chosen at the SoC to make the system more robust. The following sections analyze the benefit each equalizer brings to the system using the ISI probability density function (PDF), and then investigate different equalization combinations for opening the eye at 6.4 Gb/s.

In the following analysis, an equivalent IO buffer model is used for both the SoC and the SDRAM. The FFE, CTLE and DFE behavioral models are built as mathematical algorithm models in Matlab. Note that cross-talk and power noise are not taken into account in these ISI simulations.
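As an illustration of the kind of statistical ISI analysis referred to above, the sketch below builds an ISI probability distribution by convolving the contributions of individual cursors. It is not the Matlab model used in this paper. It assumes random, independent data in which each cursor adds +h or -h with equal probability, and the cursor values are hypothetical.

```python
from collections import defaultdict


def isi_pmf(cursors_mv):
    """Return {isi_mv: probability}, assuming each cursor adds +h or -h with probability 0.5."""
    pmf = {0: 1.0}  # before any cursor is added, the ISI is zero with certainty
    for h in cursors_mv:
        nxt = defaultdict(float)
        for level, prob in pmf.items():
            nxt[level + h] += 0.5 * prob  # neighbouring bit pushes the sample up
            nxt[level - h] += 0.5 * prob  # neighbouring bit pushes the sample down
        pmf = dict(nxt)
    return pmf


# Hypothetical ISI cursors in mV (integers keep the voltage grid exact).
cursors = [25, 100, 12, 6, 4]
pmf = isi_pmf(cursors)
worst = max(pmf)
print(f"worst-case ISI = {worst} mV, probability = {pmf[worst]}")
```

Removing a cursor from the list, as an ideal equalizer tap would, narrows the distribution, which is how the PDF plots in the following figures can be read.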

#### **1DPC Configuration**



Figure 3 Pulse response of 1DPC in WROM without any equalization

Figure 3 shows the channel pulse response of the 1DPC configuration with a dual-rank RDIMM in WROM without any equalization. To deal with reflections, the target rank is terminated with 240 ohm and the other rank is also terminated, with 60 ohm. The main cursor amplitude is 370 mV, the pre-cursor is 25 mV, and the response takes several UIs to settle to zero. The most significant post-cursors fall within about 6 UI, and the largest is about 100 mV. As mentioned before, a 4-tap DFE at the SDRAM is used to cancel the post-cursors. Furthermore, an FFE such as multi-tap de-emphasis can be applied at the SoC to enhance ISI cancellation. The FFE can handle both pre- and post-cursors, whereas the DFE can only resolve post-cursors.
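The different reach of DFE and FFE can be illustrated with a simple peak-distortion estimate, in which the worst-case eye height is the main cursor minus the sum of the uncancelled ISI magnitudes. This is a minimal sketch, not the BER-based analysis used for the results in this paper. The main cursor (370 mV), pre-cursor (25 mV) and largest post-cursor (100 mV) come from the text; the remaining post-cursor values are hypothetical, and the DFE is assumed ideal.

```python
def worst_case_eye_height(main_mv, pre_mv, post_mv, dfe_taps=0):
    """Peak-distortion estimate: main cursor minus the sum of uncancelled |ISI| cursors."""
    residual_post = post_mv[dfe_taps:]  # an ideal DFE removes the first dfe_taps post-cursors
    isi = sum(abs(x) for x in pre_mv) + sum(abs(x) for x in residual_post)
    return main_mv - isi


main = 370                          # mV, from the pulse response described above
pre = [25]                          # mV, from the text
post = [100, -12, 6, -4, 2, 1]      # mV, only the 100 mV value is from the text

no_eq = worst_case_eye_height(main, pre, post)
with_dfe = worst_case_eye_height(main, pre, post, dfe_taps=4)
print(f"no equalization: {no_eq} mV, ideal 4-tap DFE: {with_dfe} mV")
```

The pre-cursor term is untouched by the DFE in this estimate, which is the part an FFE pre-tap would address.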

The ISI cancellation performance of the different equalization schemes is shown in Figure 4. The optimal FFE tap coefficients are selected adaptively based on the channel characteristics, the number of taps and the data rate. Both pre- and post-cursor energy are weakened by an FFE with 1 pre-tap and 1 post-tap at the SoC: the amount of ISI is reduced from 150 mV to 87 mV. The DFE alone reduces it from 150 mV to 74 mV. However, combining equalizers can lead to over-compensation. The benefit of FFE is not so obvious when the DFE is enabled; in other words, FFE plus a 4-tap DFE does not give the sum of their individual benefits.



Figure 4 ISI PDF of different equalization scheme in WROM

The benefits of different FFE designs are shown in Table 1. A 1 pre-tap FFE is not very effective, and the optimal equalization combination is a 4-tap DFE with a 1 post-tap FFE. Furthermore, a 1 pre-tap FFE increases the write latency of the system by more than 1 UI, so the 1 post-tap FFE is the practical choice for the write operation design.

| DFE | FFE                    | EH (mV) | EW (ps) |
|-----|------------------------|---------|---------|
| ×   | ×                      |         | 40.5    |
| ✓   | ×                      | 188     | 106.0   |
| ×   | 1 pre-tap              | 58      | 45.4    |
| ×   | 1 post-tap             | 166     | 107.1   |
| ×   | 2 post-tap             | 172     | 102.0   |
| ×   | 1 pre-tap & 1 post-tap | 179     | 102.0   |
| ✓   | 1 pre-tap              | 197     | 101.3   |
| ✓   | 1 post-tap             | 216     | 111.9   |
| ✓   | 2 post-tap             | 212     | 107.0   |
| ✓   | 1 pre-tap & 1 post-tap | 210     | 105.5   |

Table 1 Margin of 1DPC with different equalization combination in WROM
The margin results at BER = 1E-16 for the different equalization schemes are shown in Figure 5. Without any equalization, the eye is nearly closed. With the DFE, the eye height increases to 188 mV and the eye width to 106 ps. With a 1 post-tap FFE only, the eye height increases to 166 mV and the eye width to 107 ps. With the optimal scheme, the eye height increases to 212 mV and the eye width to 107 ps, which is a reasonable margin in the absence of cross-talk.



Figure 5 BER Eye Diagram of 1DPC with different equalization combination in WROM

In practice, we found that FFE can deteriorate the final margin for some DQs when large cross-talk exists. This is because FFE amplifies the high-frequency component of the cross-talk and thereby shrinks the final eye, as analyzed in the cross-talk management section.

Another consideration in FFE implementation is the quantization effect, because FFE is realized by controlling unit-impedance blocks (UIBs). A fine-step FFE that can adapt to the characteristics of every byte lane requires a larger impedance per UIB, such as 480 ohm or 960 ohm, and the finer the step, the more units must be placed in the IO. A coarse-step design degrades FFE performance seriously, but better performance costs more area and IO design complexity.
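The quantization effect can be sketched as follows. We assume, for illustration only, a segmented driver in which N identical UIBs in parallel form the driver impedance, so that N = UIB impedance / driver impedance and a tap weight can only be set in steps of 1/N. The 40 ohm driver impedance, the 240 ohm reference case and the ideal tap weight are hypothetical values, not taken from this paper.

```python
def quantize_ffe_tap(coeff, uib_ohm, driver_ohm=40.0):
    """Snap an FFE tap weight to the grid set by the number of parallel UIBs."""
    n_units = round(uib_ohm / driver_ohm)     # parallel units needed to reach driver_ohm
    step = 1.0 / n_units                      # each unit carries 1/n of the drive strength
    units_for_tap = round(abs(coeff) / step)  # whole units assigned to this tap
    sign = 1.0 if coeff >= 0 else -1.0
    return n_units, sign * units_for_tap * step


target = -0.12  # hypothetical ideal post-tap weight
for uib in (240.0, 480.0, 960.0):
    n, q = quantize_ffe_tap(target, uib)
    print(f"UIB {uib:.0f} ohm: {n} units, tap {q:+.4f}, error {abs(q - target):.4f}")
```

Under these assumptions, a larger UIB impedance means more units and a smaller quantization error, at the cost of area, which is the trade-off described above.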

As discussed before, the IL in RDOM is more severe. The black line in Figure 6 shows the channel pulse response of the 1DPC configuration in RDOM. As in WROM, a 4-tap DFE at the SoC can be applied to weaken the post-cursor influence effectively. In addition, RX CTLE is a common equalization technique in high-speed serial systems. The CTLE behavioral model is realized as a two-pole, one-zero filter, which amplifies the components around 3.2 GHz and filters out higher frequencies.
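A two-pole, one-zero CTLE of this kind can be sketched with the transfer function H(f) = G_dc (1 + jf/f_z) / ((1 + jf/f_p1)(1 + jf/f_p2)). The zero and pole frequencies below are hypothetical values chosen only to place the peaking near 3.2 GHz; they are not the settings of the behavioral model used in this paper.

```python
import math


def ctle_gain_db(freq_hz, f_zero_hz, f_pole1_hz, f_pole2_hz, dc_gain_db=0.0):
    """Magnitude in dB of a two-pole, one-zero CTLE at freq_hz."""
    num = complex(1.0, freq_hz / f_zero_hz)
    den = complex(1.0, freq_hz / f_pole1_hz) * complex(1.0, freq_hz / f_pole2_hz)
    return dc_gain_db + 20.0 * math.log10(abs(num / den))


# Hypothetical zero and pole placement.
F_ZERO, F_POLE1, F_POLE2 = 1.0e9, 3.2e9, 8.0e9

gain_dc = ctle_gain_db(0.0, F_ZERO, F_POLE1, F_POLE2)
gain_nyq = ctle_gain_db(3.2e9, F_ZERO, F_POLE1, F_POLE2)
gain_hf = ctle_gain_db(20e9, F_ZERO, F_POLE1, F_POLE2)
ac_dc_gain = gain_nyq - gain_dc  # the "AC-DC gain" in the sense used in the text
print(f"AC-DC gain = {ac_dc_gain:.2f} dB, gain at 20 GHz = {gain_hf:.2f} dB")
```

Moving the zero lower or the poles higher raises the AC-DC gain, which is the parameter swept in the CTLE margin analysis.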



Figure 6 Pulse response of 1DPC in RDOM with and without CTLE

Note that the AC-DC gain of the CTLE needs to be adjusted carefully to optimize the ratio of high-frequency amplification to low-frequency attenuation. If the low-frequency components are attenuated too much, it becomes hard to distinguish the useful signal from noise, which results in errors. The margin of the CTLE with different AC-DC gains is shown in Table 2. When the AC-DC gain is as low as 3 dB, the eye is closed at 6.4 Gb/s even without cross-talk. As the gain increases, the eye opens gradually, but when the gain reaches 9 dB the eye shrinks again. This is because the CTLE amplifies the post-cursors as well as the main cursor. According to the PDF results shown in Figure 7, with only the CTLE enabled, ISI increases from 200 mV to 226 mV. Once the main-cursor improvement is smaller than the added post-cursor distortion, the margin degrades, so the CTLE gain should be obtained by training.

| AC-DC gain (dB) | EH (mV) | EW (ps) |
|-----------------|---------|---------|
| 3               | 0       | 0       |
| 5               | 37      | 51.68   |
| 7               | 64      | 57.37   |
| 9               | 38      | 38.04   |

Table 2 Margin with only CTLE of different AC-DC gain



Figure 7 ISI PDF of different equalization scheme in RDOM

Because the CTLE amplifies ISI, it needs to work together with a DFE, which performs well at ISI cancellation. The combined effect of the SoC RX equalizers is shown in Figure 8 and Table 3. The CTLE alone does not open the system eye enough. With a 4-tap DFE and the CTLE together, the eye height increases dramatically to 237 mV, which is sufficient for the 6.4 Gb/s system.

| DFE   | CTLE | EH (mV) | EW (ps) |
|-------|------|---------|---------|
| ×     | ×    | 0       | 0       |
| 4-tap | ×    | 111     | 97.12   |
| 6-tap | ×    | 124     | 97.83   |
| ×     | ✓    | 64      | 57.37   |
| 4-tap | ✓    | 237     | 87.94   |
| 6-tap | ✓    | 250     | 86.47   |

Table 3 Margin of 1DPC with different equalization combination in RDOM

In addition, the CTLE also amplifies cross-talk, which is a critical issue at the targeted frequency and is discussed in the following section. However, considering the eye-height requirement, the CTLE is still necessary for a 6.4 Gb/s system with the 2SPC RDIMM application.



Figure 8 BER Eye Diagram of 1DPC with different equalization combination in RDOM

#### **2DPC Configuration**

As for 1DPC, the performance of the combined equalization schemes is analyzed in both WROM and RDOM.

For WROM, the margins of the different equalization schemes are shown in Table 4. The DFE still does a good job in the 2DPC application. Both DIMM0 and DIMM1 have an open eye: the eye height of DIMM0 increases to 181 mV and the eye width to 106 ps, with a similar improvement for DIMM1. As in 1DPC, with the 4-tap DFE present, FFE provides only a small additional benefit, the largest coming from the 1 pre-tap & 1 post-tap FFE. With that FFE, the eye height improves from 181 mV to 198 mV for DIMM0 and from 164 mV to 178 mV for DIMM1. As discussed for 1DPC, this 10 mV to 20 mV benefit will shrink because of the quantization effect, and alleviating the quantization effect increases the area and complexity of the IO design.

| DFE   | FFE                    | DIMM0 EH (mV) | DIMM0 EW (ps) | DIMM1 EH (mV) | DIMM1 EW (ps) |
|-------|------------------------|---------------|---------------|---------------|---------------|
| ×     | ×                      | 76            | 67.37         | 68            | 70.56         |
| 4-tap | ×                      | 181           | 106.37        | 164           | 105.02        |
| ×     | 1 pre-tap              | 84            | 68.14         | 77            | 69.54         |
| ×     | 1 post-tap             | 133           | 94.64         | 132           | 103.85        |
| ×     | 2 post-tap             | 156           | 107.2         | 130           | 103.59        |
| ×     | 1 pre-tap & 1 post-tap | 143           | 96.29         | 136           | 102.73        |
| 4-tap | 1 pre-tap              | 188           | 106.51        | 172           | 104.41        |
| 4-tap | 1 post-tap             | 189           | 103.63        | 167           | 100.8         |
| 4-tap | 2 post-tap             | 182           | 104.16        | 168           | 100.47        |
| 4-tap | 1 pre-tap & 1 post-tap | 198           | 103.64        | 178           | 99.86         |

Table 4 Margin of 2DPC with different equalization combination in WROM

For RDOM, the margin results for the SoC RX with different equalization schemes are shown in Table 5. With a 4-tap DFE, the eyes of both DIMM0 and DIMM1 open as predicted, and a 6-tap DFE brings a slight further improvement to the final margin. On top of a 4-tap DFE at the SoC, the CTLE greatly improves the final eye height, as in the 1DPC system.

Table 5 Margin of 2DPC with different equalization combination in RDOM

| DFE   | CTLE | DIMM0 EH (mV) | DIMM0 EW (ps) | DIMM1 EH (mV) | DIMM1 EW (ps) |
|-------|------|---------------|---------------|---------------|---------------|
| ×     | ×    | 0             | 0             | 0             | 0             |
| 4-tap | ×    | 103           | 97.96         | 100           | 103.67        |
| 6-tap | ×    | 108           | 97.99         | 105           | 107.12        |
| ×     | ✓    | 130           | 80.87         | 106           | 86.3          |
| 4-tap | ✓    | 213           | 81.70         | 208           | 102.57        |
| 6-tap | ✓    | 219           | 80.16         | 207           | 102.06        |

#### **Summary of ISI Management**

According to the discussion above, a 4-tap DFE at RX is necessary for both WROM and RDOM. For WROM, FFE may not be efficient for 6.4 Gb/s applications. For RDOM, CTLE remains important for improving the eye-height margin if the 2SPC RDIMM application is still under consideration.

## **Cross-talk Management**



Figure 9 The impact of FFE on cross-talk pulse response

As mentioned above, cross-talk becomes the most critical issue as the data rate rises above 4.8 Gb/s. Although FFE, CTLE and DFE can deal with ISI, they can hardly relieve cross-talk; even worse, they may amplify it. FFE is implemented as a finite impulse response (FIR) filter and amplifies the high-frequency component, as shown in Figure 9. The same principle applies to CTLE, as shown in Figure 10, and the cross-talk amplification of CTLE is even stronger than that of FFE.
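The high-frequency emphasis of an FIR-based FFE can be seen from its frequency response, H(f) = sum over k of c_k exp(-j 2 pi f k T), where T is the unit interval. The sketch below evaluates a 2-tap de-emphasis filter at DC and at the Nyquist frequency. The tap weights are hypothetical and normalized so that the sum of their magnitudes is 1.

```python
import cmath
import math


def ffe_gain_db(taps, freq_hz, ui_s):
    """Magnitude in dB of an FIR FFE with UI-spaced taps at freq_hz."""
    h = sum(c * cmath.exp(-2j * math.pi * freq_hz * ui_s * k) for k, c in enumerate(taps))
    return 20.0 * math.log10(abs(h))


UI = 1.0 / 6.4e9        # unit interval at 6.4 Gb/s
taps = [0.8, -0.2]      # hypothetical main tap and one post-tap (de-emphasis)

gain_dc = ffe_gain_db(taps, 0.0, UI)
gain_nyq = ffe_gain_db(taps, 3.2e9, UI)
boost_db = gain_nyq - gain_dc   # high-frequency gain relative to DC
print(f"DC: {gain_dc:.2f} dB, Nyquist: {gain_nyq:.2f} dB, boost: {boost_db:.2f} dB")
```

Because FEXT energy is concentrated at high frequency, this relative boost applied to the aggressors raises cross-talk relative to the low-frequency content of the victim signal.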



Figure 10 The impact of CTLE on cross-talk pulse response

Cross-talk couples in from neighboring channels during symbol transitions, and the more high-frequency energy the signals contain, the greater its influence. As the data rate increases to 6.4 Gb/s, channel cross-talk becomes a major barrier. Figure 11 shows the eye of 1DPC in RDOM with and without cross-talk, with a 4-tap DFE and CTLE enabled. Thanks to the DFE for ISI reduction and the CTLE for main-cursor amplification, the eye opens significantly when there is no cross-talk. With cross-talk, however, the eye is almost closed. This shows that cross-talk has reached the point where it must be solved. As discussed before, increasing the gap between signal lines in the physical design is very costly for a high-density parallel link system, so a cross-talk equalizer (CTE) could be a better choice for a 6.4 Gb/s DDR5 system.



Figure 11 BER eye of 1DPC with and without cross-talk in RDOM

CTE can be implemented at TX or RX, and the design concept is shown in Figure 12. A TX CTE is generally implemented at the output driver with several taps to deal with the pre- and post-cursors of cross-talk. Optimal tap coefficients, controlled by the bit streams of adjacent DQs, are fed into the main driver, and the main driver changes its output impedance accordingly to cancel FEXT. For RX CTE, there are generally two design types [2]. One is the continuous time crosstalk canceller (CTXC), which uses analog filters to differentiate the signals received from the aggressors and then compensates the victim with an appropriate gain to remove FEXT. The underlying principle of CTXC is that FEXT is proportional to the derivative of the aggressor signal. The other is the decision feedback based crosstalk canceller (DFXC), which cancels the aggressor effect by feeding the decisions made on the aggressors into the next victim bit. This type of canceller can only cancel the post-cursors of FEXT.
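The CTXC principle, FEXT proportional to the derivative of the aggressor signal, can be sketched numerically. The example below is an idealized illustration, not a circuit model: the victim carries only FEXT, the aggressor edge shape and the coupling constant are hypothetical, and the derivative is taken with finite differences.

```python
import math


def derivative(samples, dt):
    """Central-difference derivative (one-sided at the two ends)."""
    n = len(samples)
    out = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        out.append((samples[hi] - samples[lo]) / ((hi - lo) * dt))
    return out


def ctxc_cancel(victim_rx, aggressor_rx, dt, gain_s):
    """Subtract gain_s * d(aggressor)/dt from the received victim waveform."""
    slope = derivative(aggressor_rx, dt)
    return [v - gain_s * s for v, s in zip(victim_rx, slope)]


def peak(waveform):
    """Largest absolute value of a waveform."""
    return max(abs(x) for x in waveform)


DT = 1e-12                       # 1 ps time step
COUPLING_S = 5e-12               # hypothetical coupling constant, in seconds
t = [i * DT for i in range(1000)]
aggressor = [0.5 * (1.0 + math.tanh((ti - 0.5e-9) / 30e-12)) for ti in t]  # rising edge, volts
fext = [COUPLING_S * s for s in derivative(aggressor, DT)]                 # victim sees FEXT only

matched = ctxc_cancel(fext, aggressor, DT, COUPLING_S)        # filter gain equals the coupling
drifted = ctxc_cancel(fext, aggressor, DT, 0.8 * COUPLING_S)  # gain off by 20 percent
print(f"FEXT peak {peak(fext):.4f} V, matched {peak(matched):.4f} V, drifted {peak(drifted):.4f} V")
```

The residual left by the mismatched gain is one way to see why the filter gain has to track the actual coupling.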

In implementation, the analog filter settings of CTXC need adaptive calibration against process, voltage and temperature (PVT) variation, and the calibration algorithm is highly complex and occupies bus resources. Furthermore, the largest cross-talk may not come from the adjacent lanes, so a full byte-lane cross-talk cancellation strategy is more suitable for server scenarios. The power consumption and area of CTXC are larger than those of DFXC, which makes CTXC infeasible for full byte-lane cancellation. Therefore, only DFXC is analyzed here for RX CTE. This section analyzes how well TX and RX CTE cancel cross-talk for 1DPC and 2DPC in WROM and RDOM.
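A DFXC can be sketched at the bit level as follows. The example assumes a single aggressor with ideal decisions and a FEXT response with one same-UI cursor and two post-cursors; all cursor values are hypothetical. The canceller subtracts the post-cursor terms using past aggressor decisions, so the same-UI term remains.

```python
import random


def dfxc(victim_samples, aggressor_decisions, taps):
    """Subtract taps[k-1] * aggressor_decisions[n-k] from victim sample n, for k = 1..len(taps)."""
    out = list(victim_samples)
    for k, weight in enumerate(taps, start=1):
        for n in range(k, len(out)):
            out[n] -= weight * aggressor_decisions[n - k]
    return out


rng = random.Random(0)
agg = [rng.choice((-1.0, 1.0)) for _ in range(64)]   # aggressor decisions
X0, X1, X2 = 8.0, 15.0, 5.0                          # hypothetical FEXT cursors in mV

# FEXT seen at the victim sampling instants (victim signal itself set to zero).
xtalk = [
    X0 * agg[n]
    + (X1 * agg[n - 1] if n >= 1 else 0.0)
    + (X2 * agg[n - 2] if n >= 2 else 0.0)
    for n in range(len(agg))
]

residual = dfxc(xtalk, agg, taps=[X1, X2])
print(f"peak FEXT before: {max(abs(x) for x in xtalk):.1f} mV, "
      f"after 2-tap DFXC: {max(abs(x) for x in residual):.1f} mV")
```

The remaining same-UI term illustrates why a decision-feedback canceller is limited to post-cursors.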



Figure 12 TX CTE and RX CTE design concept

#### **1DPC Configuration**

Figure 13 shows the FEXT PDF with TX and RX CTE. As the tap number of the TX CTE increases to 2, its better cross-talk cancellation becomes more obvious in WROM. For the RX CTE, however, the benefit appears to saturate when the tap number increases to 2 or more. Since the RX CTE can only deal with post-cursors, its performance is weaker than that of the TX CTE.



Figure 13 Cross-talk PDF with TX and RX CTE in WROM

The margin improvements are compared in Table 5. In the WROM analysis, a 4-tap DFE is enabled at the SDRAM side. As the tap number increases, CTE@SoC improves eye height and eye width significantly. Although CTE@SDRAM boosts eye width less than CTE@SoC, it also improves the eye-height margin. In conclusion, TX CTE is more valuable for a timing-constrained system, and the CTE should have at least 2 taps.

|       | CTE@SoC ΔEH (mV) | CTE@SoC ΔEW (ps) | CTE@SDRAM ΔEH (mV) | CTE@SDRAM ΔEW (ps) |
|-------|------------------|------------------|--------------------|--------------------|
| 1Tap  | -0.2             | 0.4              | 6.5                | 2                  |
| 2Taps | 25.4             | 17.8             | 14.8               | 3.3                |
| 3Taps | 39.3             | 25.1             | 17.8               | 3.4                |

Table 5 CTE margin improvement of 1DPC in WROM



Figure 14 Cross-talk PDF with TX and RX CTE in RDOM

For RDOM with a 4-tap DFE at the SoC, the FEXT PDF and the margin improvement are shown in Figure 14 and Table 6 respectively. Similarly, once the tap number increases to 2, CTE@SDRAM performs better than CTE@SoC, and a CTE design with 2 or more taps is recommended.

|       | CTE@SDRAM ΔEH (mV) | CTE@SDRAM ΔEW (ps) | CTE@SoC ΔEH (mV) | CTE@SoC ΔEW (ps) |
|-------|--------------------|--------------------|------------------|------------------|
| 1Tap  | -1                 | -1.5               | 21               | 14.7             |
| 2Taps | 27                 | 21                 | 24.4             | 15.3             |
| 3Taps | 41.9               | 29.7               | 28.4             | 16.5             |

Table 6 CTE margin improvement of 1DPC in RDOM

#### **2DPC Configuration**

As for 1DPC, the performance of CTE is analyzed in both WROM and RDOM. As in 1DPC, a TX CTE with 2 or more taps cancels cross-talk with better performance, whether in WROM or RDOM. Meanwhile, the benefit obtained from TX CTE in the 1DPC configuration is much better than that in 2DPC, and DIMM1 of 2DPC gets more improvement than DIMM0.

|       |       | CTE@SoC ΔEH (mV) | CTE@SoC ΔEW (ps) | CTE@SDRAM ΔEH (mV) | CTE@SDRAM ΔEW (ps) |
|-------|-------|------------------|------------------|--------------------|--------------------|
| 1Tap  | DIMM0 | -0.2             | 0.3              | 8.5                | -0.45              |
| 1Tap  | DIMM1 | -0.1             | 0.1              | 10.3               | 1.64               |
| 2Taps | DIMM0 | 33.1             | 16.3             | 14.3               | 0.5                |
| 2Taps | DIMM1 | 49.9             | 26.7             | 16                 | 3.3                |
| 3Taps | DIMM0 | 44.5             | 22.3             | 20.3               | 0.9                |
| 3Taps | DIMM1 | 60.9             | 32.6             | 27.1               | 5.6                |

Table 7 CTE margin improvement of 2DPC in WROM

|       |       | CTE@SDRAM ΔEH (mV) | CTE@SDRAM ΔEW (ps) | CTE@SoC ΔEH (mV) | CTE@SoC ΔEW (ps) |
|-------|-------|--------------------|--------------------|------------------|------------------|
| 1Tap  | DIMM0 | -1.6               | 0.01               | 7                | 3.6              |
| 1Tap  | DIMM1 | -1                 | 0.07               | 17.3             | 11.4             |
| 2Taps | DIMM0 | 21.9               | 19.1               | 11.6             | 4.23             |
| 2Taps | DIMM1 | 32.4               | 26.1               | 23.4             | 13.5             |
| 3Taps | DIMM0 | 33.3               | 24.6               | 15.8             | 5.5              |
| 3Taps | DIMM1 | 46.9               | 35                 | 29.5             | 15.6             |

Table 8 CTE margin improvement of 2DPC in RDOM

#### **Summary of Cross-talk Management**

According to the discussion above, a TX CTE with 2 or more taps is recommended in both WROM and RDOM. It is a powerful technique for enabling 6.4 Gb/s systems.

## **PSIJ Management**

DDR systems employ a forward clocking structure. One advantage is that jitter correlated between data and strobe can be tracked, which makes the system margin insensitive to periodic jitter such as PSIJ. Unlike the matched DQS-DQ path design of DDR4 SDRAM, DDR5 uses an unmatched DQS-DQ architecture, like LPDDR4, to remove the timing design constraints and high-speed requirements on the SDRAM, as illustrated in Figure 15. This architecture leads to an offset of more than 90 degrees between DQ and DQS in the SoC for sampling. The strobe must therefore arrive at the SDRAM ball ahead of the DQ signal by tDQS2DQ, which brings two issues for the DDR5 system.



Figure 15 Diagram of DDR4 matched structure and DDR5 unmatched structure

1. Skew is generated during passing through the sub-system, such as clock tree (CKT). Thus the PSIJ in skew CKT path will increase. As shown in Figure 15, because of the tDQS2DQ skew tree exists in SDRAM, SoC needs to add corresponding offset in DQ path to keep the sampling relationship between DQ and DQS. Even though the noise coupled in DQ and DQS path is almost same, the skew in CKT will make the net timing margin lost, due to the jitter between each other cannot be completely tracked out. Moreover, although the same amount and opposite phase of skew exist in SoC and SDRAM side, it will also cause untracking jitter for the different amplitude and frequency of noise suffered at SoC and SDRAM respectively. In conclusion, the SSUS will deteriorate the final timing margin and be more sensitive to power supply noise definitely.

2. Skew is generated before the signals enter a sub-system, such as the IO buffer, so the PSIJ generated in the IO increases as the input phase skew increases. Unlike the mechanism of skew generated in the CKT, the skew between DQ and DQS is already fixed before the signals enter the IO buffer. This skew also affects the PSIJ sensitivity profile of the IO part.
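The loss of jitter tracking with skew can be illustrated with a simplified first-order model, which is an assumption made here and not the simulation behind the figures in this paper. Suppose DQ and DQS carry the same sinusoidal supply-induced jitter of amplitude A at noise frequency f, but one path is offset in time by a skew τ. The residual jitter A·sin(2πft) − A·sin(2πf(t − τ)) then has amplitude 2A·|sin(πfτ)|, which is zero for a matched path and grows with skew at low frequency. The skew values and noise frequencies in the sketch below are hypothetical and chosen only for illustration.

```python
import math


def untracked_jitter_gain(freq_hz: float, skew_s: float) -> float:
    """Residual DQ-DQS jitter amplitude relative to the per-path amplitude.

    Assumes both paths carry the same sinusoidal jitter and differ only by a
    fixed time offset skew_s (first-order model, not the paper's simulation).
    """
    return 2.0 * abs(math.sin(math.pi * freq_hz * skew_s))


def first_peak_hz(skew_s: float) -> float:
    """Lowest noise frequency at which the residual jitter reaches its maximum."""
    return 1.0 / (2.0 * skew_s)


if __name__ == "__main__":
    skews_ps = [100, 200, 400]        # hypothetical DQ-DQS skews
    freqs_mhz = [50, 100, 200, 500]   # hypothetical supply-noise frequencies
    for skew_ps in skews_ps:
        skew_s = skew_ps * 1e-12
        gains = ", ".join(
            f"{f} MHz: {untracked_jitter_gain(f * 1e6, skew_s):.3f}"
            for f in freqs_mhz
        )
        peak_ghz = first_peak_hz(skew_s) / 1e9
        print(f"skew {skew_ps} ps -> first peak {peak_ghz:.2f} GHz | {gains}")
```

In this model a larger skew moves the first sensitivity peak, located at 1/(2τ), toward lower frequency and raises the residual jitter at every frequency below that peak.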



Figure 16 PSIJ sensitivity of CKT part with different tDQS2DQ values



Figure 17 PSIJ sensitivity of IO part with different tDQS2DQ values

The PSIJ sensitivity profiles of these two mechanisms are shown in Figures 16 and 17. Generally, a larger skew makes the system more sensitive to power noise in the low and middle frequency ranges. In addition, there are two power domains for the IO power supply, VDD and VDDIO, and the effects of both need to be considered in system design. As the skew increases, both the setup and hold jitter sensitivity curves shift left and their amplitude increases below 500 MHz, where most of the noise energy lies. Compared with DDR4, the setup and hold jitter sensitivities of DDR5 to VDD and VDDIO power noise increase by approximately 0.4 ps/mV and 0.1 ps/mV in the worst case, so this timing margin loss cannot be neglected in 6.4 Gb/s system design. Comparing the results in Figure 16 and Figure 17, the CKT part is more sensitive to the skew than the IO part, which means more attention should be paid to the PI performance of the CKT part.



Figure 18 PDN of different optimized scheme

Generally, the timing loss due to PSIJ is determined by three factors: the characteristics of the PDN, the switching current in the ICs, and the jitter sensitivity of the circuits. To attain better jitter sensitivity, transistors with a larger size or a lower threshold can be used to reduce circuit delay. However, overusing transistors with higher drive capability leads to more power consumption, so a trade-off is needed at the system level. On the other hand, optimizing the PDN by adding on-die and on-package decoupling capacitors (decaps) to control the noise amplitude and frequency range is effective. Figure 18 shows the PDN with on-package decaps: the peak PDN impedance decreases from 1.6 ohm to 0.25 ohm with 20 extra package decaps. Moreover, on-die voltage regulation can be adopted to shorten the power supply path, which also brings a significant improvement in PI performance.
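As a rough illustration of how the three factors combine, the sketch below estimates PSIJ at a single noise frequency as jitter sensitivity × PDN impedance × switching current. Only the two peak impedance values (1.6 ohm and 0.25 ohm) come from the text. The 20 mA switching-current component and the 0.4 ps/mV sensitivity are assumed, illustrative numbers (the latter borrows the magnitude of the worst-case sensitivity increase quoted above), and the function names are hypothetical.

```python
UI_PS = 1e12 / 6.4e9  # unit interval at 6.4 Gb/s, in picoseconds (156.25 ps)


def supply_noise_mv(z_ohm: float, i_ma: float) -> float:
    """Supply noise amplitude in mV from PDN impedance (ohm) and current (mA)."""
    return z_ohm * i_ma


def psij_ps(sens_ps_per_mv: float, z_ohm: float, i_ma: float) -> float:
    """Single-frequency PSIJ estimate: sensitivity x impedance x current."""
    return sens_ps_per_mv * supply_noise_mv(z_ohm, i_ma)


if __name__ == "__main__":
    sens = 0.4        # ps/mV, assumed illustrative sensitivity
    current_ma = 20   # mA, assumed switching-current component at the PDN peak
    for label, z_peak in [("baseline PDN", 1.6), ("with extra pkg decaps", 0.25)]:
        jitter = psij_ps(sens, z_peak, current_ma)
        print(f"{label}: noise {supply_noise_mv(z_peak, current_ma):.1f} mV, "
              f"PSIJ {jitter:.1f} ps ({100 * jitter / UI_PS:.1f}% of UI)")
```

Under these assumptions the jitter estimate drops from about 12.8 ps to about 2 ps, the same 6.4x ratio as the impedance reduction. A real budget would integrate over the full noise spectrum and the frequency-dependent sensitivity profile.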

## **Open Discussion**

Admittedly, even if we equip the TX and RX with the equalizer strategies recommended above to mitigate ISI, cross-talk and PSIJ, it is still not easy to reach the 6.4 Gb/s target in the current server configuration. Physical design constraints such as compact placement, high-density routing and multi-drop topology limit the bandwidth of the server memory channel and make further data-rate increases difficult. RDIMM multi-drop applications will therefore face bigger challenges, and it seems that the era of LR-DIMM and D-DIMM is coming. However, if the number of slots per channel is reduced to one, could RDIMM keep its leading position for longer? Per-DIMM capacity will continue to grow with multi-die and 3D-stacked package technology. Furthermore, thanks to very large package technology for SoCs, more DDR channels can be integrated into a single SoC. The capacity of the memory system may therefore not be a problem in a 1SPC configuration.

The performance of the 1SPC-1DPC configuration with an equalizer combination of a 4-tap DFE, a 2-tap TX CTE and a CTLE is analyzed. The IL results show that the 1SPC channel relaxes the bandwidth limit of the system. The comparison is shown in Figure 19: the eye height in both write and read operation increases by approximately 50 mV, and the eye width also increases to some extent, which greatly eases the difficulty of designing a system for 6.4 Gb/s or higher data rates.



Figure 19 BER diagram and IL comparison of 1SPC-1DPC and 2SPC-1DPC

## **Summary**

In this work, the challenges of 6.4 Gb/s DDR5 system design, namely ISI, cross-talk and PSIJ, have been analyzed based on realistic server scenarios. Corresponding recommendations have been proposed to clear these roadblocks for future 6.4 Gb/s applications: a 4-tap RX DFE to eliminate ISI, a TX CTE with more than two taps to cancel cross-talk, and PDN design optimization to minimize the PSIJ effect. Finally, an open discussion of the 1SPC-1DPC configuration for future server applications is presented.

## References

[1] S. Lehmann, F. Gerfers, "Channel analysis for a 6.4Gb/s DDR5 data buffer receiver front-end," in *New Circuits and Systems Conference (NEWCAS)*, Jun. 2017, pp. 109-112.

[2] C. Aprile, et al., "An Eight-Lane 7-Gb/s/pin Source Synchronous Single-Ended RX With Equalization and Far-End Crosstalk Cancellation for Backplane Channels," *IEEE J. Solid-State Circuits*, vol. 53, no. 3, pp. 861-872, Mar. 2018.

[3] S.J. Bae, et al., "A 40nm 2Gb 7Gb/s/pin GDDR5 SDRAM with a Programmable DQ Ordering Crosstalk Equalizer and Adjustable Clock-Tracking BW," in *Solid-State Circuits Conference Digest of Technical Papers (ISSCC)*, Feb. 2011, pp. 498-500.

[4] J. Zerbe, et al., "System-Level Clock Jitter Modeling for DDR Systems," in *Electronic Components and Technology Conference (ECTC)*, May 2013, pp. 1350-1355.