# A Low-Power Reduced Swing Global Clocking Methodology

#### Farhad Haj Ali Asgari and Manoj Sachdev

Abstract—In this brief, we investigate the potential of reduced swing clock networks for low-power applications. We designed and laid out a full swing conventional and a reduced swing H-tree clock distribution network in 0.13-$\mu$m CMOS technology operating at 500 MHz. In the reduced swing clock network, the swing was reduced in the global clock distribution network and was restored to the full swing in the local clock distribution domains. The post-layout simulation results show that a power saving of 22% under nominal operating conditions is feasible.

*Index Terms*—Clock skew, digital CMOS, high-performance, low-power design, reduced-swing.

## I. INTRODUCTION

Once considered negligible, interconnections are becoming a dominant part of the propagation delay in submicron technologies. Global wires are getting longer as designers integrate more blocks on a single chip, which gives rise to larger chip sizes and, as a consequence, longer global wire lengths. Therefore, in contemporary technologies, wire parasitics contribute a significant amount of delay and power that can no longer be ignored [1].

The same scenario holds for VLSI clock distribution. In a complex VLSI chip, the synchronous clock must be distributed all over the chip with the minimum possible skew. The clocking network consumes a significant amount of power in complex VLSI chips [2]. The global nature of the clock distribution interconnects, and their increased parasitics with scaling, further increase the power consumption. In high-performance applications in particular, clocking power can be 20%–50% of the total consumed power [3].

High-performance applications demand zero or near-zero skew clock distribution networks, such as the H-tree and the clock grid [4], [5]. Building such a clock distribution tree is imperative to ensure zero skew and a sharp slew rate for the clock edge. Typically, buffers are inserted within the clock network to isolate the downstream capacitance, thus reducing the transition times. On the other hand, the clock distribution network and buffer insertion increase the power consumption substantially. Given the growing importance of low-power designs for portable electronics, it is necessary to develop strategies that reduce the power dissipation of the clock network while maintaining the performance objectives.

The total power dissipation in a clock network, like that of any other CMOS digital circuit, consists of three components: (i) leakage, (ii) short-circuit, and (iii) dynamic. The leakage current depends on the technology and is a relatively small component in a clock network. Similarly, keeping proper rise and fall times throughout the clock tree keeps the short-circuit power component small. The clock network has high switching activity; therefore, the dynamic power consumption is the dominant factor. Ignoring the leakage and short-circuit contributions, the clock network power dissipation is given as

$$P = f C_L V_{\rm DD} V_{\rm sw} \tag{1}$$

where $f$, $C_L$, $V_{\rm DD}$, and $V_{\rm sw}$ are, respectively, the clock frequency, the total capacitance of the clock tree, the supply voltage, and the output swing of the buffer. If the output of the buffer swings from *GND* to $V_{\rm DD}$, then $V_{\rm sw} = V_{\rm DD}$, and the formula reduces to

$$P = f C_L V_{\rm DD}^2. \tag{2}$$

Manuscript received May 2, 2003; revised August 6, 2003. This work was supported by the Natural Scientific and Engineering Research Council of Canada's Low Power Strategic Grant.

The authors are with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada (e-mail: ffhajali@ece.uwaterloo.ca; msachdev@ece.uwaterloo.ca).

Digital Object Identifier 10.1109/TVLSI.2004.826204

As is apparent from (1), the power can be reduced by reducing the clock frequency. However, the frequency $f$ cannot be changed without significant architectural changes. Alternatively, power can be reduced by (i) reducing the total load capacitance, $C_L$, on all nodes; (ii) reducing $V_{\rm DD}$, which gives a quadratic reduction if $V_{\rm sw}$ is simultaneously reduced by the same factor; or (iii) reducing $V_{\rm sw}$ without reducing $V_{\rm DD}$, which gives a linear reduction in the power dissipation.
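The difference between options (ii) and (iii) can be illustrated with a minimal Python sketch of (1). The function name and the numerical values are assumptions chosen for illustration: 500 MHz and 1.2 V are the operating point used in this brief, while the 2500 fF load and the 0.5 V swing are example values only.

```python
def clock_dynamic_power(f_hz, c_load_f, v_dd, v_sw=None):
    """Dynamic clock power from (1): P = f * C_L * V_DD * V_sw, in watts.

    If v_sw is omitted, the buffer is assumed to swing rail to rail,
    so V_sw = V_DD and (1) reduces to (2).
    """
    if v_sw is None:
        v_sw = v_dd
    return f_hz * c_load_f * v_dd * v_sw


# Illustrative operating point (assumed values, see lead-in).
F_CLK = 500e6        # clock frequency, Hz
C_LOAD = 2500e-15    # total load capacitance, F
V_DD = 1.2           # supply voltage, V
SCALE = 0.5 / 1.2    # reduction factor applied to the swing

# Full swing reference, equation (2).
p_full = clock_dynamic_power(F_CLK, C_LOAD, V_DD)

# Option (iii): reduce only the swing, keep V_DD. Power scales linearly.
p_swing_only = clock_dynamic_power(F_CLK, C_LOAD, V_DD, v_sw=SCALE * V_DD)

# Option (ii): reduce V_DD and the swing together. Power scales quadratically.
p_both = clock_dynamic_power(F_CLK, C_LOAD, SCALE * V_DD)

print(f"full swing:        {p_full * 1e3:.3f} mW")
print(f"swing only (iii):  {p_swing_only * 1e3:.3f} mW")
print(f"V_DD and swing (ii): {p_both * 1e3:.3f} mW")
```

The sketch covers only the dynamic term of (1); the leakage and short-circuit contributions, and any power spent inside the buffers themselves, are ignored here.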

In this brief, we investigate the potential of a reduced swing ($V_{\rm sw}$) clocking technique for lowering the total clock network power. This brief is organized as follows. Section II reviews previous approaches to reduced swing clock networks. In Section III, we introduce full swing to reduced swing (FS-RS) and reduced swing to reduced swing (RS-RS) buffers and their application in an H-tree clock network. We constructed two H-tree based clock networks (reduced swing and full swing) to compare their power consumption. This comparison of power, power-delay product (PDP), and design robustness under varying process and environmental conditions is described in Section IV. Finally, conclusions are drawn in Section V.

## II. PREVIOUS REDUCED SWING CLOCKING SCHEMES

Several types of clock network topologies have been investigated [6]–[9], each suitable for certain applications. Restle and Deutsch gave an overview of clock networks [6]. The grid-based clock distribution used in the DEC Alpha series of processors enabled low clock skew [7]. The H-tree structure [8] achieves low clock skew at an acceptable power consumption owing to its relatively small parasitic capacitance. The length-matched serpentine structure can match the H-tree structure in providing low clock skew, but its number of interconnects and total wire capacitance are significantly larger [6]. In our research, we selected the H-tree structure for its ability to provide low clock skew and because its total wiring capacitance is smaller than that of grid-based and serpentine structures [6]. Owing to these desirable characteristics, the H-tree is one of the most popular techniques for clock network routing. However, the proposed reduced swing clocking technique is not limited to the H-tree and can be used in any clock distribution structure.

In general, reduced swing clocking schemes can be categorized into two groups: (i) dual power supply voltage, and (ii) regular power supply voltage. The former achieves the reduced swing by using a separate reduced supply voltage, usually generated on chip. This adds circuit complexity and extra area to the overall chip design and layout, which is not desirable [10]. However, a separate power supply for generating the reduced voltage swing reduces the number of clock network transistors, which leads to improved power saving [11]. The second method uses only one power supply and achieves the reduced swing voltage with circuit methods. However, the design of reduced swing buffers becomes challenging in the absence of the second power supply. Several papers have proposed reduced swing buffers. Most of them utilize a pMOS for passing the low logic level and an nMOS for passing the high logic level [3], [12]. Such a technique results in poor rise and fall times, making it impractical for high-performance applications [13]. Zhang et al. [3] described various circuit level techniques to design full swing to reduced swing and reduced swing to full swing buffers. They applied these techniques to reduce the swing on general interconnect lines. Other researchers investigated a half-swing clocking scheme to save clock power [14]. However, that scheme requires two different clocking networks, and an additional clocking network may cost substantial area on a VLSI chip. Moreover, the method also requires a complex receiver design to restore the clock signal. In a recent paper, Pangjun and Sapatnekar proposed a reduced swing, source follower buffer configuration [15] for a clocking network. A source follower buffer suffers from slow transition times and may cause a significant increase in the short-circuit current in subsequent buffers. In addition, the design of the receiver buffer-amplifier becomes a nontrivial task.

Fig. 1. A typical H-tree clock distribution network.

## III. PROPOSED REDUCED SWING SCHEME

In a typical clock distribution network, the buffer at the source of the clock tree drives a number of first stage buffers. These first stage buffers in turn drive a few second stage buffers. This process is repeated depending on the complexity and performance requirements, and the number of buffers in a given stage increases as we move away from the center. Finally, the final stage (leaf) buffers drive a number of flip-flops. Since there are a large number of leaf buffers, they should be made as power efficient as possible.

Fig. 1 illustrates the designed H-tree clock distribution network. The buffer at the source of the tree [full swing to reduced swing (FS-RS)] reduces the clock swing to a predetermined value. The subsequent stage buffers [reduced swing to reduced swing (RS-RS)] provide the buffering of the clock signal until it reaches the final stage buffers. The final stage [reduced swing to full swing (RS-FS)] buffers restore the clock signal to its full swing value. Since the clock swing is restored to its nominal value before it reaches flip-flops, conventional flip-flops can be used.

In order to evaluate the benefits of the proposed technique, we designed two similar clock networks. The first network was constructed with conventional full swing buffers, while the second was constructed with FS-RS, RS-RS, and RS-FS buffers. For a fair comparison, buffer insertion in both networks was driven by the following considerations: (i) as few buffers as possible should be used while maintaining the required rise and fall times under the given load conditions, and (ii) the number of buffers should be the same in both networks.

### A. Reduced Swing Buffers

As mentioned in Section II, most circuit techniques for achieving reduced swing signals result in poor rise and fall times. For example, in a source-follower-based FS-RS buffer [3], the output voltage is restricted to the range between $|V_{t_p}|$ and $(V_{\rm DD} - V_{t_n})$. As the output voltage approaches these limits, the transistor drive current decreases exponentially, resulting in poor rise and fall times.

We use the circuit shown in Fig. 2 as an FS-RS buffer. This circuit is also used as an RS-RS driver. The first stage shown in the figure is a simple inverter (regenerator). It acts as the input buffer in the FS-RS buffer, while in the RS-RS buffer it also acts as the level restorer. The output of the regenerator is sent to the delay chain, which creates a precise time window ($t_d$) between signals A and B. In our case, this delay chain provides a delay of approximately 125 ps. Signals A and B are applied to transistor pairs $P_2/P_3$ and $N_2/N_3$, so the output capacitance, $C_L$, is charged or discharged for a predetermined time. Therefore, for a given $C_L$, a desired voltage swing can be achieved by controlling the transistor sizes and the delay time.
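The dependence of the swing on the time window, the drive strength, and the load can be made concrete with a rough first-order sketch. It assumes, purely for illustration, that the output stage delivers a constant average current for the whole window $t_d$, so that $\Delta V \approx I_{\rm avg}\, t_d / C_L$; real transistor currents vary with the output voltage, so this is not the authors' design equation. The function names, the 2500 fF load, and the 0.5 V target swing are example assumptions; the 125 ps window is the delay quoted above.

```python
def swing_from_window(i_avg_a, t_window_s, c_load_f):
    """Voltage change on C_L when charged by a constant average current
    i_avg_a for t_window_s seconds (constant-current approximation)."""
    return i_avg_a * t_window_s / c_load_f


def current_for_swing(v_swing, t_window_s, c_load_f):
    """Average drive current needed to move C_L by v_swing within the window."""
    return c_load_f * v_swing / t_window_s


T_WINDOW = 125e-12   # delay-chain window t_d, s
C_LOAD = 2500e-15    # assumed load capacitance, F
V_SWING = 0.5        # assumed target swing, V

# Average current the output stage must supply during the window.
i_needed = current_for_swing(V_SWING, T_WINDOW, C_LOAD)
print(f"required average current: {i_needed * 1e3:.1f} mA")

# With the same drive and window, doubling the load halves the swing,
# which is why driver widths are resized for different C_L values.
v_double_load = swing_from_window(i_needed, T_WINDOW, 2 * C_LOAD)
print(f"swing at 2x load: {v_double_load:.3f} V")
```

Under this approximation the transistor width sets $I_{\rm avg}$ and the delay chain sets $t_d$, which are the two design knobs named in the text.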

Fig. 2. FS-RS/RS-RS cell schematic diagram.

Fig. 3. Full swing input, reduced swing output, and voltage waveforms at nodes A and B.

Fig. 3 illustrates the simulation results for this FS-RS buffer working at 500 MHz in 0.13-$\mu$m CMOS technology. As is apparent from this figure, the output swing is from 0.35 to 0.85 V, instead of from 0 to 1.2 V. The output swing is controlled by several variables such as $t_d$, $V_t$, the width $W$ of the output stage transistors, and the load capacitance ($C_L$). For a given $W$, $C_L$, and $t_d$, usage of low $V_t$ transistors allows us to maintain sharp rise and fall times. Similarly, a set of FS-RS drivers can be designed for varying $C_L$ conditions by correspondingly changing the width of the driver transistors. It is desirable to reduce the swing as much as possible in order to reduce the power consumption. However, the recovery of the signal becomes increasingly difficult as the swing is reduced. Furthermore, to keep the design of repeaters and RS-FS buffers simple, the following relationship is maintained:

$$V_{H\_R} > V_{\rm DD} - \left| V_{t_{p\_h}} \right| \tag{3}$$

$$V_{L\_R} < V_{t_{n\_h}} \tag{4}$$

where  $V_{H\_R}$  and  $V_{L\_R}$  represent the high and low voltage levels of the reduced swing. The  $V_{t_{p\_h}}$  and  $V_{t_{n\_h}}$  represent the higher threshold voltages of pMOS and nMOS transistors, respectively.
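Conditions (3) and (4) can be checked for a candidate swing with a small Python sketch. The swing levels of 0.35 V and 0.85 V and the 1.2 V supply are the values reported for Fig. 3; the high-$V_t$ threshold magnitudes are not given in the text, so the 0.4 V and 0.3 V values below are hypothetical, as is the function name.

```python
def swing_levels_ok(v_high, v_low, v_dd, vt_p_high, vt_n_high):
    """Return True when the reduced swing levels satisfy (3) and (4).

    (3): V_H_R > V_DD - |V_tp_h|   (high level above the high-Vt pMOS limit)
    (4): V_L_R < V_tn_h            (low level below the high-Vt nMOS threshold)
    """
    satisfies_3 = v_high > v_dd - abs(vt_p_high)
    satisfies_4 = v_low < vt_n_high
    return satisfies_3 and satisfies_4


V_DD = 1.2
V_HIGH, V_LOW = 0.85, 0.35   # reduced swing levels reported for Fig. 3

# Hypothetical high-Vt thresholds of 0.4 V magnitude: both conditions hold.
ok_with_0v4 = swing_levels_ok(V_HIGH, V_LOW, V_DD, vt_p_high=-0.4, vt_n_high=0.4)

# Hypothetical thresholds of 0.3 V magnitude: the same swing violates both.
ok_with_0v3 = swing_levels_ok(V_HIGH, V_LOW, V_DD, vt_p_high=-0.3, vt_n_high=0.3)

print(ok_with_0v4, ok_with_0v3)
```

A check of this kind makes the trade-off explicit: a smaller swing saves more power according to (1), but it must still clear the higher threshold voltages on both sides.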

We also utilized the FS-RS buffer as the RS-RS buffer, with some sizing modifications to minimize the short-circuit current. In the design of the RS-RS buffer, the input stage is implemented with high $V_t$ transistors. Under these circumstances, the reduced levels are able to switch off the high $V_t$ transistors, so the short-circuit current is reduced. Furthermore, the signal is recovered easily. From our investigation, we concluded that the reduced swing input must first be amplified to the full level and subsequently reduced to the desired reduced swing. Such a strategy results in sharp rise and fall times. Fig. 4 depicts the simulation results of the RS-RS repeater; as can be seen, the waveforms have fast rise and fall times.



Fig. 4. Simulation results of RS-RS buffer (reduced swing input and output).



Fig. 5. Reduced swing to full swing (RS-FS) buffer schematic.

In order to conserve the clock power, a simple design of reduced swing to full swing restorer is extremely important. We used a simple, noninverting buffer for this purpose. This circuit is illustrated in Fig. 5. The first inverter of the buffer is designed with high  $V_t$  transistors resulting in relatively smaller short-circuit current. The second inverter is designed to drive a load of 60 fF with 100 ps rise and fall times. Fig. 6 illustrates the simulation results of the RS-FS buffer with 60 fF load. The circuit has rise and fall times of 100 ps.

The FS-RS and RS-RS buffers were compared with a conventional buffer for propagation delay, power, and power delay product (PDP) for a given load ( $C_L = 2500$  fF). The conventional buffer was constructed using two cascaded inverters capable of driving the given load. Table I illustrates these results. As is apparent from this table, the FS-RS buffer consumes approximately 45% less power compared to the conventional buffer. Furthermore, we simulated power and PDP savings of FS-RS buffer with respect to the conventional buffer as a function of the load capacitance. For this purpose the savings are defined as follows:

$$\operatorname{Saving}_{\operatorname{power}}\% = \frac{P_{\operatorname{conv.}} - P_{\operatorname{prop.}}}{P_{\operatorname{conv.}}} \times 100\% \tag{5}$$

$$\operatorname{Saving}_{\operatorname{PDP}} \% = \frac{\operatorname{PDP}_{\operatorname{conv.}} - \operatorname{PDP}_{\operatorname{prop.}}}{\operatorname{PDP}_{\operatorname{conv.}}} \times 100\%. \tag{6}$$

Fig. 6. Simulation results of RS-FS buffer.

TABLE I FS-RS, RS-RS, AND CONVENTIONAL BUFFER SIMULATION RESULTS FOR $C_L = 2500$ fF

| Parameter                    | FS-RS buffer | RS-RS buffer | Conventional buffer |
|------------------------------|--------------|--------------|---------------------|
| Propagation delay $t_d$ (ps) | 104.2        | 136.8        | 102.9               |
| Power (mW)                   | 2.21         | 2.4          | 3.75                |
| PDP (fJ)                     | 230.3        | 328.5        | 386.5               |

Fig. 7. Power and PDP efficiency of the FS-RS over the conventional buffer.

Fig. 7 depicts the power and PDP savings as a function of the load capacitance ($C_L$). At low values of $C_L$, the reduced swing buffer is less efficient in terms of power and PDP because of the power consumed in the delay chain. As the load increases, the delay chain power becomes a smaller fraction of the total power, resulting in improved power and PDP efficiencies. In general, the reduced swing buffer also exhibits a larger propagation delay than the conventional buffer; therefore, its PDP efficiency is relatively poor compared to its power efficiency. However, for clock networks this increased delay is of minor consequence.
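As a worked check of (5) and (6), the Python sketch below applies them to the Table I entries. The function and dictionary names are illustrative and not from the original text. The PDP is also recomputed as power times propagation delay (mW × ps = fJ) only to confirm that it agrees with the tabulated PDP to within rounding.

```python
def saving_percent(conventional, proposed):
    """Relative saving in percent, as defined in (5) and (6)."""
    return (conventional - proposed) / conventional * 100.0


# Table I values for C_L = 2500 fF.
TABLE_I = {
    "FS-RS":        {"delay_ps": 104.2, "power_mw": 2.21, "pdp_fj": 230.3},
    "RS-RS":        {"delay_ps": 136.8, "power_mw": 2.40, "pdp_fj": 328.5},
    "Conventional": {"delay_ps": 102.9, "power_mw": 3.75, "pdp_fj": 386.5},
}

conv = TABLE_I["Conventional"]
savings = {}
for name in ("FS-RS", "RS-RS"):
    row = TABLE_I[name]
    savings[name] = {
        "power": saving_percent(conv["power_mw"], row["power_mw"]),
        "pdp": saving_percent(conv["pdp_fj"], row["pdp_fj"]),
    }
    print(f"{name}: power saving {savings[name]['power']:.1f}%, "
          f"PDP saving {savings[name]['pdp']:.1f}%")

# Cross-check: PDP (fJ) = power (mW) * delay (ps), up to table rounding.
recomputed_pdp = {
    name: row["power_mw"] * row["delay_ps"] for name, row in TABLE_I.items()
}
```

With the tabulated values, the FS-RS buffer comes out at roughly 41% power saving and 40% PDP saving relative to the conventional buffer at this single load point, and the RS-RS buffer at 36% and 15%, respectively.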

### B. Reduced Swing Clock Distribution Network

The on-chip clock distribution network can be divided into (i) a global clock distribution network, and (ii) a local clock distribution network. The clock network that distributes the clock signal to the sub-blocks is called the global clock network. The wires in the global clock network are longer, and we need a clock distribution topology that minimizes the total capacitance and the clock skew. The clock network in sub-blocks is generally described as the local clock network. The local clock network is more susceptible to noise and has larger fan-out. Since reconstruction of a full swing clock signal is power inefficient, especially for large fan-out applications, we amplified the reduced global clock swing before the local clock network. Furthermore, in our research, we did not want to put any restriction on flip-flop usage; in other words, flip-flops are driven by full swing clocks. Therefore, the swing was only reduced in the global clock distribution network.

Fig. 1 illustrates a complete clock distribution network. In the H-tree clock network, the primary clock driver is connected to the main "H." In a regular H-tree clock architecture the clock signal is then buffered at the four corners of the main "H." This process is repeated as shown in Fig. 1. The conductor widths are progressively tapered in an H-tree as we move away from the clock source. The output resistance of a buffer is usually much larger than the driven interconnect impedance. In deep-submicron technologies where interconnect impedance is not negligible, extra buffers (repeaters) must be introduced to improve the clock skew, jitter and dynamic parameters [5]. The number of the buffer stages between the clock source and the final clocked registers depends on several factors such as: (i) the total load capacitance comprising the combined interconnect and buffer input capacitance, (ii) the clock skew tolerance of the system, (iii) the latency requirements of the clock network, and (iv) the nature (e.g., asymmetrical/symmetrical) of clock distribution network.
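The recursive "H" construction described above can be sketched in a few lines of Python. This is a geometric illustration only: it assumes Manhattan routing (a horizontal run followed by a vertical run at each level) and a half-size that halves from one level to the next. The function name, the choice of three levels, and the initial half-size of 0.375 cm on a 1.5 cm die are hypothetical and are not taken from the designed network.

```python
def h_tree_leaves(center, half_size, levels):
    """Return a list of (x, y, path_length) leaf taps of an H-tree.

    Each level places four taps at the corners of an 'H' centred on every
    tap of the previous level, then halves the size of the 'H'.
    """
    taps = [(center[0], center[1], 0.0)]
    h = half_size
    for _ in range(levels):
        next_taps = []
        for x, y, length in taps:
            for dx in (-h, h):
                for dy in (-h, h):
                    # Route: horizontal run of h, then vertical run of h.
                    next_taps.append((x + dx, y + dy, length + 2 * h))
        taps = next_taps
        h /= 2
    return taps


# Hypothetical example: 1.5 cm x 1.5 cm die, three H levels.
leaves = h_tree_leaves(center=(0.75, 0.75), half_size=0.375, levels=3)
path_lengths = {length for _, _, length in leaves}

print(f"{len(leaves)} leaf taps, wire length to each leaf: {path_lengths}")
```

Every leaf sees the same wire length from the clock source, which is the property that gives the H-tree its nominally zero skew; the number of taps grows by a factor of four per level, which is why the leaf buffers dominate the buffer count.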

The interconnect resistance can be reduced by using wider wires, but increasing the wire width does not necessarily decrease the interconnect delay: the increased width results in higher interconnect capacitance and, as a consequence, the *RC* time constant does not change significantly [16]. However, wire widths are often constrained by acceptable current densities to avoid electromigration. On the other hand, tapered wires help to reduce the overall Elmore delay [17]. The wires close to the end nodes are usually designed to be close to the minimum feature size to satisfy dense integration demands. Considering the structure of the H-tree, which suits the tapering strategy, and based on the layout design rules, 16-$\mu$m-wide wires were used for the first stage. In each subsequent stage the wire width is reduced. The last stage wires were approximately 0.5 $\mu$m wide.
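The weak dependence of the *RC* time constant on wire width can be seen from a first-order model, assumed here for illustration only. Consider a wire of length $L$ and width $W$ with sheet resistance $R_\square$, area capacitance $c_a$ per unit area, and fringe capacitance $c_f$ per unit length; these symbols are introduced for this sketch and do not appear in the original text. Then

$$R = \frac{R_\square L}{W}, \qquad C = \left(c_a W + c_f\right) L$$

$$RC = R_\square L^2 \left(c_a + \frac{c_f}{W}\right)$$

Widening the wire reduces only the fringe term $c_f / W$. Once the area capacitance dominates, the product approaches $R_\square c_a L^2$ and further widening brings little benefit, while the total capacitance, and with it the dynamic power in (1), keeps growing with $W$.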

The local clock network is often implemented using a grid, as shown in Fig. 8. Such a topology reduces the local clock skew in spite of nonuniform local load distribution [18], although its capacitance is higher than that of a tree. Furthermore, such a grid can be routed early in the physical design. For example, Tam *et al.* [19] divided the clock distribution network into (i) a balanced global H-tree network, (ii) a regional clock grid, and (iii) a local clock distribution network.

## IV. RESULTS AND COMPARISON

We designed two identical global H-tree clock distribution networks in 0.13-$\mu$m CMOS technology. The first network was constructed using conventional, noninverting buffers, while the second was constructed using FS-RS, RS-RS, and RS-FS buffers. Designing both networks allowed us to compare them in terms of latency, power consumption, and PDP. In both cases, the clock networks were designed for a hypothetical chip with an area of 1.5 cm $\times$ 1.5 cm. In the latter network, the full swing clock signal was reduced to a swing of approximately 500 mV by the FS-RS buffer. Subsequently, RS-RS buffers were inserted to maintain the desired rise and fall times. Each of these buffers was designed to drive a load of approximately 2400 fF. Finally, the reduced swing clock signal was restored to its full value by the RS-FS buffers before reaching the local clock grid. Therefore, the local clock grid remains unaffected.



Fig. 8. Global and local clock network interface.



Fig. 9. Efficiency of reduced swing clock network with respect to the conventional as a function of load capacitance.

The frequency of simulation was set at 500 MHz with 50% duty cycle and rise/fall times of 100 ps. The value of interconnect distributed impedance was calculated from the technology data sheet. In this exercise, we considered area and fringe capacitances for the clock network. The interwire coupling capacitance was ignored. Although this component can be significantly large, it requires capacitance extraction from the layout. In this study, we laid out only clock distribution networks, hence it is not possible to compute or estimate the interwire coupling capacitance. Furthermore, clock can be routed so as to reduce/minimize this component. Finally, for this comparative study, our conclusions are not influenced if we ignore the mutual coupling capacitance. The distributed impedance of the interconnecting wires was modeled by the  $\pi$  model [20].
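To illustrate how a distributed wire can be represented with $\pi$ segments, the Python sketch below splits a wire of total resistance $R$ and capacitance $C$ into $n$ segments and evaluates the Elmore delay to the far end. The Elmore metric is used here only as a convenient hand estimate; the results in this brief come from circuit simulation, not from this calculation. The function names and the 100 Ω and 1 pF wire values are hypothetical; the 60 fF load matches the local network load used in this brief.

```python
def pi_ladder(r_total, c_total, n_segments):
    """Split a distributed RC wire into n pi-segments.

    Each segment has a series resistance R/n with C/(2n) at each end.
    Returns (series_resistances, node_capacitances) with n + 1 nodes.
    """
    r_seg = r_total / n_segments
    c_seg = c_total / n_segments
    resistors = [r_seg] * n_segments
    node_caps = [0.0] * (n_segments + 1)
    for i in range(n_segments):
        node_caps[i] += c_seg / 2       # near-end half of segment i
        node_caps[i + 1] += c_seg / 2   # far-end half of segment i
    return resistors, node_caps


def elmore_delay(resistors, node_caps, c_load=0.0):
    """Elmore delay from the driven end (node 0) to the far end of the ladder."""
    caps = list(node_caps)
    caps[-1] += c_load
    # Capacitance at node 0 sits directly on the source and adds no delay.
    downstream = sum(caps[1:])
    delay = 0.0
    for i, r in enumerate(resistors):
        # Resistor i feeds every capacitor at nodes i+1 .. n.
        delay += r * downstream
        downstream -= caps[i + 1]
    return delay


R_WIRE, C_WIRE, C_LOAD = 100.0, 1e-12, 60e-15   # hypothetical wire, 60 fF load

delay_1 = elmore_delay(*pi_ladder(R_WIRE, C_WIRE, 1), c_load=C_LOAD)
delay_10 = elmore_delay(*pi_ladder(R_WIRE, C_WIRE, 10), c_load=C_LOAD)
print(f"1 segment: {delay_1 * 1e12:.1f} ps, 10 segments: {delay_10 * 1e12:.1f} ps")
```

For a uniform wire the Elmore delay of the $\pi$ ladder equals $RC/2 + R\,C_{\rm load}$ for any number of segments, which is one reason the $\pi$ representation is a popular lumped approximation of a distributed line.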

### A. Power and PDP Saving

Fig. 9 illustrates the power, propagation delay, and PDP saving of the proposed clock network with respect to the conventional network as a function of the load capacitance of the final stage buffers. As is evident from this figure, the simulation results show considerable power and PDP savings for the proposed network compared to the conventional network. The power (PDP) efficiency varies from 32% (22%) to 19% (10%) for these load conditions. For larger loads, the efficiency of the proposed buffers declines, because the short-circuit current of the reduced swing buffer is larger than that of the conventional buffer. A larger load capacitance requires larger RS-FS buffers, and therefore the short-circuit current increases proportionately. As a consequence, the power and PDP efficiency goes down. The short-circuit current was minimized by using dual $V_t$ transistors in the RS-FS buffers: using low $V_t$ transistors for reducing the clock swing and high $V_t$ transistors in the first inverter for reconstructing the full swing reduces the short-circuit current substantially. The short-circuit current can also be reduced by increasing the number of buffering levels. The proposed clock network has approximately 12%–13% larger delay than the conventional network.

Fig. 10. Output pulse width of the conventional H-tree as a function of process and temperature.

### B. Output Sensitivity to Process Corners and Temperature Variations

Two H-tree testbenches were also simulated for different process corners to estimate the influence of process variations on the robustness of the circuit. The rise and fall times of the proposed and conventional networks under nominal conditions were approximately 100 ps. The worst and best case rise and fall times of the conventional network were found to be 120 ps and 96 ps, respectively. Similarly, for the proposed network, the worst and best case rise and fall times were found to be 112 ps and 86 ps, respectively.

In addition, we also simulated the variation in pulse width with process and temperature. Figs. 10 and 11 show the output signal pulse widths of conventional and proposed H-tree for different process corners and temperatures, respectively.

In general, the proposed H-tree exhibits increased sensitivity to process and temperature variations compared to the conventional H-tree. For the FS (fast nMOS, slow pMOS) and SF (slow nMOS, fast pMOS) process corners, the pulse width at room temperature decreases by about 70 ps and increases by about 89 ps, respectively, relative to the pulse width at the TT (typical nMOS, typical pMOS) corner at room temperature. In contrast, the conventional H-tree network has a pulse width variation of 9.5 ps and 10.5 ps under the same conditions. Similarly, the power supply noise sensitivity of both networks must be examined, since power supply noise contributes significantly to the clock skew [21]. Simulations were carried out to compare the clock skew in both networks due to power supply variations. The power supply voltage of one RS-RS buffer and its corresponding RS-FS buffer was varied by up to $\pm 10\%$ in steps while keeping the rest of the circuit at the nominal supply voltage. The simulation results are shown in Fig. 12. As is apparent from the figure, the proposed network exhibits larger skew than the conventional network.

Fig. 11. The output pulse width of the proposed H-tree as a function of process and temperature.

Fig. 12. Clock skew versus $V_{\rm DD}$ variation.

The increased sensitivity of the proposed H-tree to process, temperature and power supply variations could be an issue for high performance applications. However, for low power applications where timing is not critical, the reduced swing methodology could result in reduced power consumption while not compromising timing robustness.

### C. Post-Layout Simulation Results

Both global clock distribution networks were laid out in 0.13-$\mu$m CMOS technology in order to obtain a realistic comparison. The proposed reduced swing (FS-RS and RS-RS) buffers occupy approximately 65% more area than the conventional buffer. However, the increased area of the proposed buffers should not result in any significant increase in the overall chip area. A 60 fF capacitor was laid out as an equivalent load of the local clock distribution network. This capacitance was estimated based on various design and technology parameters. Both the distributed resistive and capacitive parasitics were extracted from the layout and used in the post-layout simulations. Fig. 13 shows the normalized, post-layout power and propagation delay as functions of power supply variations for the conventional and proposed clock distribution networks. Each parameter was normalized with respect to its value at nominal voltage ($V_{\rm DD} = 1.2$ V) and temperature ($T = 25^{\circ}$C) conditions. The power supply voltage was varied by $\pm 10\%$ of the typical $V_{\rm DD}$. As is apparent from the figure, the proposed clock network exhibits slightly larger power and propagation delay variations with respect to the power supply voltage than the conventional clock network. Similarly, Fig. 14 depicts the normalized power and propagation delay variations as a function of temperature. The temperature was varied from -25 °C to 125 °C. All parameters degrade with increasing temperature; however, the normalized power of the proposed clock network increases at a higher rate with temperature.

Fig. 13. Normalized, post-layout power and propagation delay versus $V_{\rm DD}$.

Fig. 14. Normalized, post-layout power and propagation delay versus temperature.



Fig. 15. Post-layout power efficiency versus  $V_{\rm DD}$ .



Fig. 16. Post-layout power efficiency versus temperature.

Figs. 15 and 16 illustrate the power efficiency as functions of $V_{\rm DD}$ and temperature. The power efficiency decreases as $V_{\rm DD}$ is increased: as Fig. 13 shows, the power of the proposed reduced swing clock network increases at a higher rate with $V_{\rm DD}$ than that of the conventional clock network.

Similarly, the increasing temperature reduces the power efficiency. Reduced swing signals in the proposed clock network are not able to fully switch off transistors. Therefore, the leakage in such circuits is relatively higher. Moreover, this leakage exhibits stronger temperature dependence. As a consequence, increase in temperature results in relatively higher leakage in the proposed clock network compared to the conventional clock network resulting in lower power efficiency with respect to the temperature.

From Figs. 15 and 16 we can conclude that the proposed reduced swing clock network consumes 22% less power than the conventional clock network under nominal conditions. We also observed a 7% reduction in the post-layout power efficiency compared to the pre-layout power efficiency.

## V. CONCLUSION

In this brief, we investigated the potential of the reduced swing clock network for low power applications. We designed and laid out a full swing conventional and a reduced swing H-tree clock distribution network in 0.13-$\mu$m CMOS technology operating at 500 MHz. In the reduced swing clock network, the swing was reduced in the global clock distribution network and was restored to the full swing in the local clock distribution domains. The post-layout simulation results of this research show that a power saving of 22% under nominal operating condition is feasible.

The reduced swing clock distribution network exhibits greater sensitivity to temperature and voltage variations making it less suitable for high performance applications. However, for low power applications it may result in substantial power savings.

## ACKNOWLEDGMENT

The authors acknowledge and appreciate constructive criticism by reviewers.

## REFERENCES

- [1] J.-P. Schoellkopf, "Impact of interconnect performances on circuit design," in *Proc. IEEE Int. Interconnect Technology Conf.*, June 1–3, 1998, pp. 53–55.
- [2] D. Duarte, V. Narayanan, and M. J. Irwin, "Impact of technology scaling in the clock system power," in *Proc. IEEE Symp. VLSI*, Apr. 2002, pp. 52–57.
- [3] H. Zhang, V. George, and J. M. Rabaey, "Low-swing on-chip signaling techniques: Effectiveness and robustness," in *Proc. Int. Symp. Low-Power Electronics and Design*, Aug. 16–17, 1999, pp. 145–150.
- [4] J. Pangjun and S. Sapatnekar, "Clock distribution using multiple voltages," *IEEE Trans. VLSI Syst.*, pp. 145–150, 2000.
- [5] E. G. Friedman, Clock Distribution Networks in VLSI Circuits and Systems. Piscataway, NJ: IEEE Press, 1999.
- [6] P. J. Restle and A. Deutsch, "Designing the best clock distribution network," in *1998 Symposium on VLSI Circuits Digest of Technical Papers*, June 11–13, 1998, pp. 2–5.
- [7] B. A. Gieseke, R. L. Allmon, D. W. Bailey, B. J. Benschneider, S. M. Britton, J. D. Clouser, H. F. Fair III, J. A. Farrell, M. K. Gowan, C. L. Houghton, J. B. Keller, T. H. Lee, D. L. Leibholz, S. C. Lowell, M. D. Matson, R. J. Matthew, and V. Peng, "A 600 MHz superscalar RISC microprocessor with out-of-order execution," in *Proc. IEEE Int. Solid-State Circuits Conf.*, 1997, pp. 176–177.
- [8] E. G. Friedman, "Clock distribution networks in synchronous digital integrated circuits," *Proc. IEEE*, vol. 89, pp. 665–692, May 2001.
- [9] G. Geannopoulos and X. Dai, "An adaptive digital deskewing circuit for clock distribution networks," in *IEEE Int. Solid-State Circuits Conf. ISSCC* '98, 1998, pp. 400–401.
- [10] R. Golshan and B. Haroun, "A novel reduced swing CMOS BUS interface circuit for high speed low power VLSI systems," in *Proc. IEEE Int. Symp. Circuits and Systems*, vol. 4, May 1994, pp. 351–354.
- [11] Y. Moisiadis, I. Bouras, and A. Arapoyanni, "High performance level restoration circuits for low-power reduced-swing interconnect schemes," in *Proc. IEEE 7th Int. Circuits and Systems (ICECS)*, vol. 1, 2000, pp. 619–622.
- [12] Y. Nakagome, K. Itoh, M. Isoda, K. Takeuchi, and M. Aoki, "Sub-1-V swing internal bus architecture for future low power ULSIs," *IEEE J. Solid-State Circuits*, vol. 28, pp. 414–419, Apr. 1993.
- [13] A. Rjoub and O. Koufopavlou, "Efficient drivers, receivers and repeaters for low power CMOS bus architectures," in *Proc. 6th IEEE Int. Symp. Circuits and Systems (ICECS)*, 1999, pp. 789–794.
- [14] H. Kojima, S. Tanaka, and K. Sasaki, "Half-swing clocking scheme for 75% power saving in clocking circuitry," *IEEE J. Solid-State Circuits*, vol. 30, pp. 432–435, Apr. 1995.
- [15] J. Pangjun and S. Sapatnekar, "Low-power clock distribution using multiple voltages and reduced swings," *IEEE Trans. VLSI Syst.*, vol. 10, pp. 309–318, June 2002.
- [16] H. B. Bakoglu, *Circuits, Interconnections, and Packaging for VLSI*. Reading, MA: Addison-Wesley, 1990, pp. 208–211.
- [17] J. P. Fishburn and C. A. Schevon, "Shaping a distributed-RC line to minimize Elmore delay," *IEEE Trans. Circuits Syst. I*, vol. 42, pp. 1020–1022, Dec. 1995.

- [18] K. Bernstein, K. M. Carrig, C. M. Durham, P. R. Hansen, D. Hogenmiller, E. J. Nowak, and N. J. Rohrer, *High Speed CMOS Design Styles*. Norwell, MA: Kluwer, 1999, pp. 267–268.
- [19] S. Tam and S. Rusu, "Clock generation and distribution for the first IA-64 microprocessor," *IEEE J. Solid-State Circuits*, vol. 35, pp. 1545–1552, Nov. 2000.
- [20] J. M. Rabaey, *Digital Integrated Circuits—A Design Perspective*. Englewood Cliffs, NJ: Prentice-Hall, 1996, pp. 471–476.
- [21] D. Harris and S. Naffziger, "Statistical clock skew modeling with data delay variations," *IEEE Trans. VLSI Syst.*, vol. 9, pp. 888–898, Dec. 2001.

# Small Area Parallel Chien Search Architectures for Long BCH Codes

#### Yanni Chen and Keshab K. Parhi

Abstract—To implement parallel BCH (Bose–Chaudhuri–Hocquenghem) decoders in an area-efficient manner, this paper presents a novel group matching scheme that reduces the Chien search hardware complexity by 60% for the BCH(2047, 1926, 23) code, as opposed to only 26% when the iterative matching algorithm is applied directly. The proposed scheme exploits substructure sharing within a finite field multiplier (FFM) and among groups of FFMs.

Index Terms—BCH (Bose–Chaudhuri–Hocquenghem) code, Chien search, low complexity.

#### I. INTRODUCTION

Forward error correction codes used in long-haul optical communication systems should provide significant coding gain [an error floor may occur only at a much lower bit error rate (BER), such as $10^{-15}$] with high code rate and moderate complexity. In International Telecommunication Union (ITU-T) G.975, the (255, 239) Reed–Solomon (RS) code has been standardized to resist burst errors in optical fiber submarine cable systems [1]. With only 7% overhead, this RS code not only provides approximately 5.5 dB of coding gain at a BER of $10^{-12}$ for random error correction, but also corrects bursts of up to 64 bits [2]. BCH and RS codes form the core of the most powerful known algebraic codes and are widely used [3]. In our simulation using hard-decision, errors-only decoding over an AWGN channel, an additional coding gain of approximately 0.6 dB is observed for binary BCH codes compared to RS codes of similar code rate and codeword length. Hence, the BCH code and its decoder architecture are of great interest.
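The overhead and rate figures quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `block_code_stats` is hypothetical, and defining overhead as parity symbols per information symbol, $(n-k)/k$, is an assumption chosen because it reproduces the roughly 7% figure for the (255, 239) RS code.

```python
def block_code_stats(n, k):
    """Rate and redundancy overhead of an (n, k) block code.

    Overhead is taken as (n - k) / k, i.e., parity symbols per
    information symbol (an assumed definition, see the lead-in).
    """
    parity = n - k
    return {"parity": parity, "rate": k / n, "overhead": parity / k}


rs = block_code_stats(255, 239)     # RS(255, 239), 8-bit symbols
bch = block_code_stats(2047, 1926)  # binary BCH(2047, 1926, 23)

# An RS code with n - k parity symbols corrects t = (n - k) // 2 symbol errors.
rs_t = rs["parity"] // 2

for name, s in (("RS(255, 239)", rs), ("BCH(2047, 1926)", bch)):
    print(f"{name}: parity={s['parity']}, rate={s['rate']:.4f}, "
          f"overhead={s['overhead']:.2%}")
```

Under this definition the two codes have closely matched rates (about 0.937 and 0.941), which is the "similar code rate" condition under which the 0.6 dB comparison is stated.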

To increase the decoding throughput, a parallel decoder is derived by developing parallel architectures for the various building blocks. Among the three major building blocks of the syndrome-based BCH decoder, i.e., the syndrome generator unit, the key equation solver, and the Chien search, the parallel Chien search block is the most area-consuming unit according to [2]. It occupies more than 65% of logic core for both 10-

Manuscript received June 11, 2003; revised October 30, 2003. This work was supported by the Army Research Office under Grant DA/DAAD19-01-1-0705.

Y. Chen was with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA. She is now with DSP Solutions Research and Development Center, Texas Instruments Incorporated, Dallas, TX 75243 USA.

K. K. Parhi is with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA (e-mail: ynchen@ece.umn.edu; parhi@ece.umn.edu).

Digital Object Identifier 10.1109/TVLSI.2004.826203