An 8–16 Gb/s, 0.65–1.05 pJ/b, Voltage-Mode Transmitter With Analog Impedance Modulation Equalization and Sub-3 ns Power-State Transitioning

Young-Hoon Song, Member, IEEE, Hae-Woong Yang, Student Member, IEEE, Hao Li, Student Member, IEEE, Patrick Yin Chiang, Member, IEEE, and Samuel Palermo, Member, IEEE

Abstract—Serial link transmitters which efficiently incorporate equalization, while also enabling fast power-state transitioning to leverage dynamic power scaling, are necessary to meet future systems’ I/O requirements. This paper presents a scalable voltage-mode transmitter which offers low static power dissipation and adopts an impedance-modulated 2-tap equalizer with analog tap control, thereby obviating driver segmentation and reducing pre-driver complexity and dynamic power. Topologies that allow for rapid power-up/down, including a replica-biased voltage regulator to power the output stages of multiple transmit channels and per-channel quadrature clock generation with injection-locked oscillators (ILO), enable fast power-state transitioning. Energy efficiency is further improved with capacitively driven low-swing global clock distribution and supply scaling at lower data rates, while output eye quality is maintained at low voltages with automatic phase calibration of the local ILO-generated quarter-rate clocks. A prototype fabricated in a general purpose 65 nm CMOS process includes a 2 mm global clock distribution network and two transmitters that support an output swing range of 100–300 mVppd with up to 12 dB of equalization. The transmitters achieve 8–16 Gb/s operation at 0.65–1.05 pJ/b energy efficiency and sub-3 ns power-up/down times.

Index Terms—Capacitance, high-speed I/O, injection-locked oscillator, low-power, power management, timing error calibration, transmit equalization, voltage-mode driver.

I. INTRODUCTION

Supporting the dramatic growth in high-performance and mobile processors’ I/O bandwidth [1], [2] requires per-channel data rates to increase well beyond 10 Gb/s due to packaging technology allowing only modest increases in I/O channel count. At these relatively high data rates, complying with thermal design power limits in high-performance systems and battery lifetime requirements in mobile platforms necessitates improvements in I/O circuit energy efficiency [3], [4] and dynamic power management techniques [2], [3].

Serial-link transmitters consume both significant dynamic power due to the high-speed serialization operation and static power due to driving the low-impedance channel. The inclusion of equalization at high data rates to compensate for frequency-dependent channel loss adds to the design complexity and power consumption. Circuit and parasitic mismatch also create challenges in long-distance clock distribution and in maintaining proper phase spacing for the critical serialization clocks which determine the output eye quality. In order to improve I/O energy efficiency at high data rates, static and dynamic power consumption must both be reduced in a manner that allows for robust operation both at low voltage and with the growing mismatch found in nanometer CMOS technologies.

Significant static power savings are possible by utilizing low-swing voltage-mode drivers [4]–[7], as differential channel termination allows the same output voltage swing at one-quarter the current consumption of current-mode drivers. However, implementing transmit equalization with voltage-mode drivers is generally more difficult, with resistive divider [6], channel-shunting [7], [8], impedance-modulation [9], and hybrid current-mode [5] approaches being proposed. These topologies often set the equalizer taps’ weighting via output stage segmentation [6]–[9], which adds complexity to the high-speed predriver circuitry and degrades the transmitter dynamic power efficiency.

Scaling the power supply voltage with data rate is an effective technique to achieve nonlinear dynamic power scaling at reduced speeds [10], [11]. While architectures which utilize a high multiplexing factor allow for reduced frequency operation of the transmit slices, and thus the potential for low supply voltages, they are more sensitive to timing offsets amongst the multiple clock phases [4], [10], [12], [13]. Furthermore, efficient generation and distribution of these multi-phase clocks is challenging in large channel-count transmitters.

Another effective approach to saving I/O power is to dynamically operate the required number of channels in a burst-mode manner based on the system bandwidth demand at a given time [2]. In order to effectively leverage this technique, transmitters with rapid turn-on/off capabilities are necessary. It is important to quickly disable both switching and static power, which can be particularly challenging with voltage-mode drivers due to output-stage regulator decoupling capacitance.

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

2 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 49, NO. 11, NOVEMBER 2014

Fig. 1. Multi-channel serial-link transmitter architecture with dynamic power management.

This paper presents a scalable high-data-rate transmitter architecture that achieves low overall power consumption while supporting dynamic power management to optimize system performance for varying workload demands. Section II reviews key low-power design techniques employed in this design, including capacitively driven wires for long-distance clock distribution [14] and impedance-modulation equalization [9]. An overview of the proposed multi-channel transmitter architecture, which is able to maintain low-swing quarter-rate clocking through the global distribution and local multi-phase generation, is given in Section III. Section IV discusses the power/data rate scalable transmitter channel design which adopts an impedance-modulated 2-tap equalizer with analog tap control, employs automatic phase calibration for low-voltage operation, and utilizes a replica-biased voltage regulator to enable fast power-state transitioning. Experimental results from a GP 65 nm CMOS prototype are presented in Section V. Finally, Section VI concludes the paper.

II. LOW-POWER TRANSMITTER DESIGN TECHNIQUES

A typical low-power multi-channel serial-link transmitter architecture is shown in Fig. 1. In order to amortize clocking power, the output of a global clock generation circuit, such as a phase-locked loop (PLL), is distributed to all of the transmit channels. Here efficient global clock distribution techniques, such as low-swing CML signaling [11], [15], are often employed in high channel count systems which span several mm. Each transmit channel performs parallel data serialization, implements equalization to compensate for frequency-dependent channel loss, and allows for dynamic power management (DPM) with rapid turn-on/off capabilities. This section reviews key low-power design techniques employed in this design, including capacitively driven wires for long-distance clock distribution [14] and impedance-modulation equalization [9], with further improvements offered in Sections III and IV.

A. Global Clock Distribution

Distributing high-frequency clock signals over on-chip wires with multi-mm lengths is challenging due to wire RC parasitics that limit bandwidth, resulting in amplified input jitter and excessive power dissipation with repeated full-swing CMOS signaling [16]. As shown in Fig. 2(a), in order to reduce clocking power and avoid excessive jitter accumulation, low-swing non-repeated global clock distribution with an open-drain CML buffer driving on-die resistively terminated transmission lines has been previously implemented [11]. However, maintaining a minimum clock swing at high frequencies can still result in significant static power dissipation due to the transmission lines’ loss and relatively low impedance. While reduction of this static power is possible with inductive termination of the distribution wire [15], this creates a narrow-band resonant structure that prohibits scaling the per-channel data rates over a wide range. Another non-repeated technique to drive long wires involves AC-coupling a full-swing CMOS driver to the distribution wire through a series capacitor, as shown in Fig. 2(b). Relative to simple DC-coupling, this technique allows for smaller drivers due to the reduced effective load capacitance, savings in signaling power due to the reduced voltage swing on the long wire, and bandwidth extension due to the inherent high-frequency emphasis caused by the capacitive coupling [14].

In order to compare the CML-based and capacitive-coupled low-swing clock distribution techniques, the global distribution circuits of Fig. 2 are both designed for a 0.25 V low-frequency amplitude. The 65 nm CMOS simulation results of Fig. 3 show that, relative to CML clock distribution with 50 Ω termination, the capacitively driven approach offers 1.7× bandwidth extension and 73.1% power savings when distributing a differential 4 GHz clock over a 2 mm distance. Also, the power of the capacitively driven approach reduces significantly at lower clock frequencies. This provides the potential for further power savings at a given data rate, provided that there is efficient multi-phase clock generation and low-to-high-swing conversion at the local transmit channels. Also, no major phase noise penalty is observed with the 0.25 V capacitively driven distribution, as simulations with a 4 GHz LC-oscillator driving the input buffer show that at the end of the distribution wire there is only a 0.1 dB degradation at a 1 MHz offset.

B. Voltage-Mode Transmit Equalization

While it is relatively easy to implement FIR equalizer structures at the transmitter by summing the outputs of parallel current-mode stages weighted by the filter tap coefficients onto the channel and a parallel termination resistor [11], voltage-mode implementations are more difficult due to the series termination control. As shown in Fig. 4, these voltage-mode topologies often set the equalizer taps’ weighting via output stage segmentation [6]–[9]. One approach is to distribute the output segments among the main and post-cursor taps to form a voltage divider that produces the four signal levels necessary for a 2-tap FIR filter [6]. Here, all segments operate in parallel during a transition (\( X[n] \neq X[n-1] \)) to yield the maximum signal level, while the post-cursor segments shunt to the supplies to produce the de-emphasis level for run lengths greater than one (\( X[n] = X[n-1] \)). As ideally all the segments have equal conductance, a constant channel match is achieved independent of the equalizer setting. However, shunting the post-cursor segments to the supplies results in dynamic current being drawn from the regulator powering the output stage and a significant increase in current consumption with higher levels of de-emphasis [7]. To address this, adding a shunt path in parallel with the channel can either eliminate dynamic current variations [8] or allow for a decrease in current consumption with higher levels of de-emphasis [7]. Further power reduction is possible if a constant channel match is sacrificed by implementing the different output levels via impedance modulation, allowing for minimum output stage current [9]. Here all segments are on during a transition to yield the maximum signal level, while for run lengths greater than one the post-cursor segments are tri-stated to generate a higher output resistance and produce the de-emphasis level.
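The four signal levels of a 2-tap FIR filter can be illustrated with a minimal Python sketch. It assumes the common normalization y[n] = (1 - alpha) x[n] - alpha x[n-1] on ±1 symbols, which may differ from the scaling used in Fig. 4; the function name `two_tap_fir` and the coefficient name `alpha` are chosen here for illustration only.

```python
def two_tap_fir(bits, alpha):
    """Apply y[n] = (1 - alpha) * x[n] - alpha * x[n-1] to a 0/1 bit sequence.

    Assumed model: bits map to +/-1 symbols, and the bit preceding the first
    one is taken to equal the first bit (no initial transition).
    """
    symbols = [1.0 if b else -1.0 for b in bits]
    out = []
    prev = symbols[0]
    for x in symbols:
        out.append((1.0 - alpha) * x - alpha * prev)
        prev = x
    return out


# Transition bits reach the full +/-1 level; repeated bits fall back to
# the de-emphasized +/-(1 - 2*alpha) level.
levels = two_tap_fir([0, 1, 1, 0, 0], alpha=0.25)
print(levels)  # [-0.5, 1.0, 0.5, -1.0, -0.5]
```

Under this normalization the ratio between the transition and de-emphasized levels is 1/(1 - 2*alpha), regardless of which circuit technique realizes the levels.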

As shown in the 10 Gb/s pulse response simulation results of Fig. 5, the amount of residual ISI with a 2-tap equalizer depends on the equalization technique and channel type. In order to compare an impedance-modulated driver with an ideal 50 Ω driver, equal de-emphasis settings are utilized and the residual ISI is quantified by summing the absolute values of five pre-cursors and fifty post-cursors and normalizing by the main cursor value. For 20″ backplane channels [17], an ideal 50 Ω driver displays similar residual ISI for middle- (M20) and bottom-trace (B20) channels with 13.1 dB and 11.7 dB de-emphasis, respectively. When an impedance-modulated driver is used, Fig. 5(b) shows that reflections with the middle-trace channel (M20) degrade the residual ISI performance by 26.9% relative to the 50 Ω driver. However, this performance difference shrinks to only 12.8% for the bottom-trace channel (B20). With a shorter 12″ bottom-trace channel (B12) that offers less overall ISI, but also less reflection attenuation, with 9.4 dB de-emphasis the residual ISI improves for both drivers and the relative ISI increase is less than 18.4% with the impedance-modulated driver. For the well-designed single-board co-planar waveguide (CPW) channel used in the Section V experimental results, which has performance comparable to channels proposed for high-density I/O systems [3], with 6.0 dB de-emphasis the ISI performance of the two
Fig. 5. 10 Gb/s voltage-mode 2-tap FIR transmit equalization performance comparison. (a) Channel frequency responses. The three backplane channels have 5.2″ total linecard traces and 12″ (B12), and 20″ bottom- (B20) and middle-layer (M20) backplane traces. The CPW channel is a single-board 5.8″ FR4 trace and 0.6 m SMA cable. (b) Simulated 10 Gb/s pulse response with M20 BP trace. (c) Simulated 10 Gb/s pulse response with CPW channel. (d) Residual ISI, normalized to the main-cursor amplitude, with ideal 50 Ω and impedance-modulated output drivers. Error bars account for ±15% RX termination mismatch.

drivers is almost identical. The impact of receive-side termination mismatch is also considered, with the error bars of Fig. 5(d) showing that a ±15% mismatch of the ideal 100 Ω differential termination results in less than a 2% difference in the relative performance of the transmitters.
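The residual ISI metric described above (the sum of the absolute values of five pre-cursors and fifty post-cursors, normalized by the main cursor) can be sketched in Python as follows. The function name, the argument names, and the sample values are hypothetical and are not taken from the simulations of Fig. 5.

```python
def residual_isi(cursors, main_index, n_pre=5, n_post=50):
    """Sum |pre-cursor| and |post-cursor| samples, normalized by the main cursor.

    `cursors` is assumed to hold the equalized pulse response sampled once
    per unit interval, with the main cursor at `main_index`.
    """
    main = cursors[main_index]
    pre = cursors[max(0, main_index - n_pre):main_index]
    post = cursors[main_index + 1:main_index + 1 + n_post]
    return (sum(abs(c) for c in pre) + sum(abs(c) for c in post)) / abs(main)


# Made-up pulse response: two pre-cursors, a main cursor, three post-cursors.
example = [0.0, -0.02, 0.5, 0.05, -0.03, 0.01]
print(residual_isi(example, main_index=2))  # approximately 0.22
```

Comparing this figure for two drivers on the same channel gives the relative ISI increase quoted in the text.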

While impedance-modulated equalization may yield the best signaling current consumption, the output stage segmentation associated with this and other approaches can result in significant complexity and power consumption in the predriver logic. Overall, this predriver dynamic power, which increases with data rate and equalizer resolution, should be addressed in order to not diminish the benefits offered by a voltage-mode driver.

III. MULTI-CHANNEL TRANSMITTER ARCHITECTURE

Fig. 6 shows a conceptual diagram of the proposed multi-channel transmitter architecture, with 10 transmitter channels spanning across a 2 mm distance. All transmitters share both a global regulator to set the nominal output swing, and two analog loops to set the driver output impedance during the maximum and de-emphasized levels of the implemented 2-tap FIR equalizer. Utilizing a single global regulator to provide a stable bias signal that is distributed to all the channels provides for independent fast power-state transitioning of each output driver, as explained in more detail in Section IV. The sharing of these global analog blocks allows for their power to be amortized by the channel number and improves the overall I/O energy efficiency.

In order to reduce dynamic power, low-swing clocks are maintained throughout the global distribution and local generation of the quarter-rate clocks used by the transmitters. Rather than distributing four quarter-rate clocks globally, which offers challenges in maintaining low static phase errors and power consumption, a differential quarter-rate clock is distributed globally in a repeater-less manner via capacitively driven low-swing wires [14]. A voltage swing of

\[ V_{\text{sw,clk}} = \frac{C_s}{C_s + C_w} \cdot V_{\text{DD}} \tag{1} \]

is present on the long global distribution wires from the voltage divider formed by the series coupling capacitor, \( C_s \), and the clock wire capacitance, \( C_w \). The \( C_s \) value is set for a swing of \( V_{\text{DD}}/4 \), which is 250 mV for the 4 GHz clocks used in 16 Gb/s operation with a 1 V supply. These low-swing distributed clocks are then buffered on a local basis by AC-coupled inverters with resistive feedback for injection into a two-stage injection-locked oscillator (ILO) which produces four full-swing quadrature clocks that are shared by a two-channel bundle. Utilizing a \( V_{\text{DD}}/4 \) distribution swing allows the ILO to achieve a locking range greater than 250 MHz, which ensures locking over 5% power supply variations. Simulation results show that the clock swing degrades by only 1% at the end of the 2 mm distribution wire. As transmit architectures which utilize quarter-rate clocks for serialization are sensitive to timing offsets amongst the four clock phases, particularly with the aggressive supply scaling employed in this low-power design, digitally calibrated buffers controlled by an automatic phase calibration (PC) loop produce the final clocks that control the data serialization.
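As a worked example of the capacitive divider relation above, the following Python sketch solves for the series capacitor that gives a target swing ratio. The wire capacitance of 300 fF is a purely hypothetical value chosen for round numbers; the function names are likewise illustrative.

```python
def coupling_cap_for_swing(c_wire, swing_ratio):
    """Series capacitor C_s such that C_s / (C_s + C_w) equals swing_ratio."""
    if not 0.0 < swing_ratio < 1.0:
        raise ValueError("swing_ratio must be between 0 and 1")
    return c_wire * swing_ratio / (1.0 - swing_ratio)


def wire_swing(c_series, c_wire, vdd):
    """Swing on the distribution wire from the C_s / C_w voltage divider."""
    return vdd * c_series / (c_series + c_wire)


c_w = 300.0  # fF, hypothetical wire capacitance (not a value from the paper)
c_s = coupling_cap_for_swing(c_w, swing_ratio=0.25)
print(c_s)                        # 100.0 fF, i.e. C_s = C_w / 3
print(wire_swing(c_s, c_w, 1.0))  # 0.25 V with a 1 V supply
```

A swing of one quarter of the supply therefore corresponds to C_s = C_w/3 for any wire capacitance.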

Fig. 7 shows the two-stage ILO schematic, where quadrature output phase spacing is improved by AC-coupling the injection clocks, adding dummy injection buffers, and optimizing the locking range via digital control of the injection buffers’ drive strength. The ILO employs cross-coupled inverter delay cells which, relative to current-starved delay cells [4], generate a rail-to-rail output swing with better phase spacing over a wide frequency range. Coarse frequency control is achieved via a dedicated power supply equal to \( DVDD \), but separated on-chip for noise isolation. The gated analog voltage, \( EN_{\text{VCTL}} \), finely controls the ILO frequency by setting the delay cell pull-down strength. While not implemented in this prototype, a periodically activated control loop could set \( EN_{\text{VCTL}} \) such that the ILO free-running frequency is equal to the injection clock [18] to reduce quadrature phase errors and provide increased robustness to PVT variations. This analog control voltage can also be rapidly switched between GND and its nominal value, enabling fast power-up/shut-down of the clock signals at a two-channel resolution.

IV. TRANSMITTER CHANNEL DESIGN

A block diagram of a transmitter channel is shown in Fig. 8. The transmitter exhibits two operating modes to provide transmitter equalization at higher data rates, while dramatically scaling energy efficiency at lower data rates by reducing the digital serialization and pre-driver supply (\( DVDD \)) and disabling equalization when it is not required. While an external supply is used to set the scalable \( DVDD \) in this prototype, an adaptive switching regulator [10] could efficiently generate this supply. Eight bits of parallel input data are serialized with an initial 8:4 multiplexer followed by two parallel 4:1 stages that produce the main and post-cursor tap signals for the 2-tap equalizer implemented in the differential low-swing impedance-modulated voltage-mode driver. The serialized data passes through level-shifting pre-drivers [4] that boost the voltage swing by a full scalable supply value, \( DVDD \), above the nominal nMOS threshold voltage, enabling reduced output stage transistor sizing for a given impedance value. Power is saved by disabling the post-cursor tap pre-driver at lower data rates where equalization is not required. The clocks which synchronize the serialization are produced by passing the ILO quadrature outputs through buffers with duty-cycle and quadrature spacing correction via 5 bits of p-n strength and 5 bits of delay capacitance adjustment, respectively. Two of these phases are divided by two to perform the initial 8:4 multiplexing, while all four phases pass through conventional CMOS logic to generate the pulse-clock signals that switch the secondary 4:1 CMOS muxes.

A. Automatic Quadrature-Phase Calibration

While a transmitter architecture which utilizes quarter-rate clocks for serialization allows for reduced supply voltages in the data path, this low-voltage operation results in increased phase-spacing variations amongst the critical serialization clock signals [4]. The resultant output deterministic jitter from static phase errors and duty cycle distortion of the quadrature clocks can severely degrade eye height and timing margins for data rates well in excess of 10 Gb/s. This design addresses this important issue and enables high-speed operation at low supply voltages by implementing the closed-loop calibration scheme detailed in Fig. 8. In calibration mode, the transmitter output for two complementary fixed patterns is sampled with a comparator clocked by an asynchronous 100 MHz signal. The uniformly spaced output samples obtained by employing this asynchronous clock provide information about the duty cycle and quadrature phase spacing errors [19], [20]. First, the duty cycle is corrected by comparing the count value obtained for a “1100” output pattern and its complement, followed by an FSM that adjusts the p-n strength of the local clock buffers. Second, quadrature phase correction is realized by utilizing a “1010” pattern and its complement, with the FSM then adjusting the relative delay of the buffers through capacitive tuning.
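The duty-cycle step of this calibration can be illustrated with a simplified behavioral sketch in Python. Everything in the model is an assumption made for illustration: an ideal comparator, sample phases spread uniformly over one pattern period, a linear 5-bit control code whose LSB changes the high time by a fixed fraction, and a one-step-per-iteration update rule. It is not the FSM implemented in the prototype, and the quadrature (“1010”) step is not modeled.

```python
def count_ones(high_fraction, n_samples=2000):
    """Count samples landing in the high part of a periodic waveform.

    Asynchronous sampling is modeled as n_samples phases spread uniformly
    over one pattern period (an idealization).
    """
    return sum(1 for k in range(n_samples) if (k + 0.5) / n_samples < high_fraction)


def calibrate_duty_cycle(raw_error, lsb=0.005, n_samples=2000):
    """Step a hypothetical 5-bit code until pattern and complement counts match.

    raw_error is the excess high time of the "1100" pattern as a fraction of
    its period; each code step is assumed to shift it by one lsb.
    """
    code = 16  # mid-scale of the 5-bit control word
    for _ in range(32):
        error = raw_error + (code - 16) * lsb
        ones_pattern = count_ones(0.5 + error, n_samples)
        ones_complement = count_ones(0.5 - error, n_samples)
        if ones_pattern > ones_complement and code > 0:
            code -= 1  # pattern is high for too long: shorten the high time
        elif ones_pattern < ones_complement and code < 31:
            code += 1  # pattern is high for too short: lengthen the high time
        else:
            break
    return code, raw_error + (code - 16) * lsb


code, residual = calibrate_duty_cycle(raw_error=0.02)
print(code, residual)
```

The point of the sketch is that equal counts for a pattern and its complement indicate equal high and low durations, so the sign of the count difference is enough to steer the correction.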

B. Impedance-Modulated Output Driver

Fig. 9 shows the low-swing all-nMOS output stage, where a new impedance modulation technique [9] is introduced. In addition to the M1 switch transistors controlled by the main-cursor data, extra transistors M3–5 are stacked to achieve 2-tap impedance-modulated equalization. Analog control of the stacked transistor impedance values provides the potential for high-resolution equalization tap control with a non-segmented output stage, dramatically reducing pre-driver complexity and resulting in significant power savings. During a transition bit in equalization mode (Fig. 9(a)) the maximum output swing is achieved with nearly a 50 Ω output impedance, when both the higher-impedance single-transistor M3 and lower-impedance two-transistor paths (M4 and M5) controlled by the post-cursor data are activated in parallel.

\[ R_{\text{tran,2bit}} = (R_{M4} + R_{M5}) \parallel R_{M3} + R_{M1} = Z_o \tag{2} \]

where \( Z_o \) is the characteristic channel impedance (50 Ω). The sizing overhead of this effective three-transistor stack is minimized because the switch transistors controlled by the main and post-cursor data see a large level-shifted overdrive voltage, \( V_{LS} = DVDD + V_{thn} \), when turned on. Only the higher-impedance single-transistor M3 pull-up/pull-down path is activated for run-lengths greater than one (Fig. 9(b)), with the de-emphasis level set by the analog control voltages, \( V_{zmeqUP} \) and \( V_{zmeqDN} \), provided by the global de-emphasis impedance modulation loop.

\[ R_{\text{de-em}} = R_{M3} + R_{M1} = \frac{1 + 2\alpha}{1 - 2\alpha} Z_o \tag{3} \]

where \( \alpha \) is the equalization coefficient (Fig. 4) and the peaking ratio between the maximum and minimum output signal levels is

\[ \frac{V_{pp,\text{max}}}{V_{pp,\text{min}}} = \frac{1}{1 - 2\alpha}. \tag{4} \]
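To give a feel for the numbers implied by (3) and (4), the following Python sketch evaluates the de-emphasis resistance and the peaking ratio for a given equalization coefficient, and inverts (4) to find the coefficient for a target peaking in dB. The function names are illustrative, and the coefficient is written as `alpha` following (3).

```python
import math


def deemphasis_resistance(alpha, z0=50.0):
    """Per-side driver resistance for the de-emphasized level, from Eq. (3)."""
    return z0 * (1.0 + 2.0 * alpha) / (1.0 - 2.0 * alpha)


def peaking_db(alpha):
    """Peaking ratio of Eq. (4), expressed in dB."""
    return 20.0 * math.log10(1.0 / (1.0 - 2.0 * alpha))


def alpha_from_peaking_db(target_db):
    """Invert Eq. (4): alpha = (1 - 1/ratio) / 2 for a peaking ratio in dB."""
    ratio = 10.0 ** (target_db / 20.0)
    return 0.5 * (1.0 - 1.0 / ratio)


print(deemphasis_resistance(0.25))  # 150.0 ohms
print(peaking_db(0.25))             # about 6.02 dB
alpha_12db = alpha_from_peaking_db(12.0)
print(alpha_12db, deemphasis_resistance(alpha_12db))
```

With a 50 Ω channel, a coefficient of 0.25 corresponds to about 6 dB of peaking and a 150 Ω de-emphasis resistance, while the maximum 12 dB setting requires a coefficient near 0.37 and a resistance of roughly 348 Ω.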

In non-equalization mode, the output stage is placed in a standard configuration with a single series impedance-control transistor M2 in the pull-up/pull-down paths, where the control voltages, VzCUP and VzCDN, are provided by the global impedance control loop. Furthermore, the post-cursor pre-drivers are disabled to save power.

C. Global Impedance Control and Modulation Loop

The global replica-bias loops that produce the impedance control bias voltages for the 2-tap transmitter output stages are shown in Fig. 10. A 50 $\Omega$ channel match is obtained with the left circuit that contains two feedback loops which force a value of $(3/4)V_{\text{REF}}$ and $(1/4)V_{\text{REF}}$ on the positive and negative outputs, respectively, of a replica transmitter loaded by a precision off-chip 100 $\Omega$ resistor. Configuring the circuit in non-equalization mode places the stacked M2 impedance control transistors in the feedback loops to produce the VzCUP and VzCDN control voltages that bias the M2 transistors of the output stages. In equalization mode, the stacked parallel M3 and M4–5 paths are placed in the feedback loops to produce the VzCeqUP and VzCeqDN control voltages that bias the output stages' M4 transistors to achieve a 50 Ω match during a transition bit. De-emphasis-level reference voltages (3/4)VREF-(1/2)αVREF and (1/4)VREF+(1/2)αVREF are used in the right circuit to produce the M3 bias voltages, VzmeqUP and VzmeqDN, for the high resistance values used when the data run-length is greater than one. For all settings the M1 and M5 replica switch transistors bias is generated by a diode-connected nMOS whose source is connected to the scalable DVDD, producing a voltage level, VLS = Vth + DVDD, consistent with the level shifting pre-driver output.
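A short derivation shows how these reference levels relate to the resistances in (2) and (3). It is a sketch under stated assumptions, not taken from the paper: the replica driver is treated as an ideal resistive divider between a supply equal to VREF and ground, with equal pull-up and pull-down resistance \( R \) and a differential load of \( 2Z_o = 100\ \Omega \).

\[
\begin{aligned}
V_{+} - V_{-} &= V_{\text{REF}} \, \frac{2Z_o}{2R + 2Z_o} = V_{\text{REF}} \, \frac{Z_o}{R + Z_o}, \\
R = Z_o \;&\Rightarrow\; V_{+} - V_{-} = \tfrac{1}{2} V_{\text{REF}} = \tfrac{3}{4} V_{\text{REF}} - \tfrac{1}{4} V_{\text{REF}}, \\
V_{+} - V_{-} = \left(\tfrac{1}{2} - \alpha\right) V_{\text{REF}} \;&\Rightarrow\; R = \frac{1 + 2\alpha}{1 - 2\alpha} \, Z_o .
\end{aligned}
\]

The last line uses the difference of the two de-emphasis references, \( (3/4 - \alpha/2)V_{\text{REF}} - (1/4 + \alpha/2)V_{\text{REF}} \), and reproduces the resistance of (3), while the ratio of the two swings gives the peaking ratio of (4).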

High-resolution equalization settings are possible with low power overhead via a low-frequency global DAC to set the de-emphasis voltage levels used in the replica bias loop. This compares favorably with achieving tap value control via a highly segmented output stage, which requires complex pre-driver circuitry switching at the full data rate [6]–[9]. While there is some power overhead associated with the global analog feedback loops, power amortization in a multi-channel system minimizes the impact on the overall transmitter energy efficiency.

D. Fast Power-State Transitioning Global Voltage Regulator

A global regulator with a replica output stage load similar to Fig. 10 is utilized to set the output swing value and the transmitters’ output supply, VREG, as shown in Fig. 11. A dual-supply topology is employed for the global regulator to both improve error amplifier accuracy and reduce the output stage supply voltage. An error amplifier powered from the nominal 1 V supply produces the bias voltage for a 0.5 V source-follower stage that supplies current to a replica transmit output driver. This error amplifier output voltage is also distributed globally to bias the source-follower stages present at the transmitter channels’ output stages. The voltage headroom provided by the nominal 1 V supply allows the error amplifier to have higher gain and support the output bias levels to increase the tunable transmit swing range to 100–300 mVpp. As with the global impedance control loops, this global regulator’s power is also amortized in a multi-channel system.

Since the global regulator stability is not a function of the transmit channels’ output stage decoupling capacitance value, this replica-bias architecture allows this capacitance to be switched in/out on a per-channel basis to speed up the power up/down process. When a particular channel is powered-down, the nMOS source follower gate bias is simultaneously switched from the global bias voltage to GND while the decoupling capacitance is disconnected from the VREG node and connected to the global VREF node. As shown in the simulation results of Fig. 12, this allows for the output nodes to discharge much quicker relative to a design with the decoupling capacitance always connected. Fast power-up is also achieved on a per-channel basis through staggered switching of the output stage’s decoupling capacitance. Delayed enabling of the output stage’s decoupling capacitance by $\sim 550$ ps relative to when the nMOS source follower is switched to the global bias voltage allows for faster charging of VREG and minimal charge sharing when the decoupling capacitance is reapplied. Note that it is important to connect the decoupling capacitor to the global VREF node during power-down, as simulations indicate that the VREG node takes in excess of 7 ns for 95% settling if the decoupling capacitance is completely discharged.
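The switching order described above can be summarized as a small event schedule. The Python sketch below only restates the sequence in the text; the function names, the event labels, and the representation as (time, action) pairs are illustrative choices, and the 550 ps stagger is the approximate value quoted above.

```python
DECAP_DELAY_PS = 550  # approximate stagger between bias enable and decap reconnect


def power_up_events(t0_ps=0):
    """Per-channel power-up: enable the source follower first, then the decap."""
    return [
        (t0_ps, "source-follower gate -> global bias"),
        (t0_ps + DECAP_DELAY_PS, "decoupling cap -> VREG"),
    ]


def power_down_events(t0_ps=0):
    """Per-channel power-down: both switches change at the same time."""
    return [
        (t0_ps, "source-follower gate -> GND"),
        (t0_ps, "decoupling cap -> global VREF"),
    ]


for time_ps, action in power_up_events() + power_down_events(t0_ps=5000):
    print(f"{time_ps:>5} ps  {action}")
```

Parking the decoupling capacitor on the global VREF node during power-down keeps it charged, which is what allows it to be reconnected with minimal charge sharing.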

While this replica-bias scheme allows for fast power-state transitions, there is the potential for mismatch between the output stages’ regulated voltage VREG and the global VREF due to the impedance modulation amount and mismatches in the channel and termination impedance. The source follower transistors are sized to keep this mismatch under 10% in transient operation with up to 12 dB de-emphasis and 5% RX termination mismatch. Note that, while not implemented in this prototype, higher accuracy can be achieved by segmenting the output stages’ source follower transistors for fine adjustment via digital control.

V. EXPERIMENTAL RESULTS

Fig. 13 shows a die microphotograph of the proposed transmitter, fabricated in a general purpose 65 nm CMOS process. While chip area constraints prevented a full 10-channel prototype, the concept is accurately emulated by placing a two-transmitter bundle at the end of a snaked on-chip 2 mm clock distribution. Each transmitter channel occupies 0.006 mm$^2$, and the combined area of the injection-locked oscillator, global impedance control and modulation loop, bias circuitry, and global regulator is 0.014 mm$^2$. ESD diodes with $\sim 40$ fF parasitic capacitance are present at the high-speed transmitter outputs.

The functionality of the automatic phase calibration is demonstrated with a chip-on-board test setup, with the die directly wirebonded to the FR4 board and the transmitters driving short 2" traces. Lower data rates display worse inherent phase spacing performance due to the reduced voltage operation, with Fig. 14 showing a 28.5% uncorrected eye width variation at 8 Gb/s and a 0.75 V supply. These phase errors are reduced to 4.7% when the closed-loop phase calibration is enabled. At 16 Gb/s and 1 V operation, the phase calibration loop improves the eye width variation from an uncorrected 13.1% to 5.4%, limited by nonlinearities in the duty-cycle tuning range. Note that while a 1 V DVDD is required for 16 Gb/s operation, transient simulations indicate that the level-shifted pre-drive signals generate maximum $V_{GS}$ and $V_{GD}$ values that do not exceed 1.1 V in the switched output stage transistors due to the stacked design. These voltage levels are below the limits required for a 10-year lifetime.

A channel consisting of a 5.8" FR4 trace and a 0.6 m SMA cable (Fig. 5), with 15.5 dB loss at 8 GHz, is used to characterize the transmitter’s equalization capabilities. Fig. 15 shows that the global impedance modulation loop precisely controls the required impedance for a given equalization coefficient to within 7% of the ideal value, while low-frequency output patterns with a peak 300 mVppd output swing verify the equalizer’s functionality up to the maximum 12 dB setting. The transmitter transient performance at a maximum 16 Gb/s data rate is verified with the $2^7-1$ PRBS eye diagrams shown in Fig. 16, where a previously near-closed eye is opened to a 55 mV height and 33.4 ps width when the impedance-modulation equalization is enabled.

In order to demonstrate the transmitter’s scalable energy efficiency over data rate, the eye diagrams of Fig. 17 are produced with the same channel as Fig. 16 and with a minimum 50 mVppd eye height and $\sim 0.5$ UI eye width. From these eye diagrams, the total jitter can be decomposed into deterministic and random components of 31.6 ps and 2.27 ps, respectively, at 8 Gb/s, and 25.2 ps and 1.19 ps at 12 Gb/s. For this performance level the transmitter achieves 8–16 Gb/s operation at 0.65–1.05 pJ/b (Fig. 18) by optimizing the transmitter’s scalable supply and output swing and disabling equalization at the lowest 8 Gb/s data rate. While the dynamic clocking and serialization power dominates, as shown in the power versus data rate of Fig. 18(b) and the detailed 16 Gb/s power breakdown of Table I, scaling the digital supply reduces this contribution and allows for overall improved energy efficiency at lower data rates.
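The energy efficiency figures translate directly into per-channel power, since 1 pJ/b at 1 Gb/s equals 1 mW. The short Python sketch below performs this conversion; pairing 0.65 pJ/b with 8 Gb/s and 1.05 pJ/b with 16 Gb/s is an assumption based on the ordering of the quoted ranges, and the function name is illustrative.

```python
def power_mw(energy_pj_per_bit, data_rate_gbps):
    """Convert energy per bit to power: 1e-12 J/b * 1e9 b/s = 1e-3 W."""
    return energy_pj_per_bit * data_rate_gbps


print(power_mw(0.65, 8))   # about 5.2 mW per channel at 8 Gb/s
print(power_mw(1.05, 16))  # about 16.8 mW per channel at 16 Gb/s
```

Under this pairing, halving the data rate reduces the power by more than a factor of three, which reflects the combined effect of supply scaling and disabling equalization.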

By routing the internal power-state enable signal off-chip using a channel that is delay-matched to the transmitter output, the power-state transitioning behavior is observed in Fig. 19. The ability to disable the source-follower bias voltage of the local regulators and disconnect the decoupling capacitance allows for a rapid 0.5 ns disable time. A longer 2.9 ns is required for...

Fig. 17. Transmitter eye diagrams and jitter decomposition at (a) 8 Gb/s and (b) 12 Gb/s.

Table I

<table>
<thead>
<tr>
<th>TRANSMITTER POWER BREAKDOWN AT 16 Gb/s</th>
</tr>
</thead>
<tbody>
<tr>
<td>Global Regulator (amortized across 2 TX) &amp; Output Driver (300mVpp with EQ)</td>
</tr>
<tr>
<td>Serializer, Pre-drivers, Clocking</td>
</tr>
<tr>
<td>Global Impedance Control &amp; Modulation Loop, Bias Circuit (amortized across 2 TX)</td>
</tr>
<tr>
<td>Global Clocking (amortized across 2 TX)</td>
</tr>
<tr>
<td>ILO (amortized across 2 TX)</td>
</tr>
<tr>
<td><strong>Total Energy Efficiency</strong></td>
</tr>
</tbody>
</table>

to observe a stable transmit output amplitude during power-up due to the ILO start-up time. As shown in Table II, the performance achieved in this work compares favorably to other designs which have emphasized fast power-state transitioning [2], [3], [21].

Table III compares this work with recent voltage-mode transmitters with 2-tap equalization. The low-voltage architecture allows for a dramatic increase in data rate at near 1 pJ/b energy efficiency, which is only achieved at 10 Gb/s in [7], while the impedance-modulated equalization is capable of obtaining open eyes over the highest loss channel. Moreover, as indicated by the 16 Gb/s power breakdown in Table I, further improvements in energy efficiency are possible through increased amortization with a higher channel count and by scaling the design to a more advanced CMOS process that allows for reduced dynamic power.

VI. CONCLUSION

This paper has presented a low-power, scalable, high-data-rate transmitter architecture. In order to reduce clocking power, low-swing clocks are maintained throughout the capacitively driven low-swing global distribution and local ILO quadrature phase generation. Improved dynamic power consumption is achieved with aggressive supply scaling at lower data rates, while automatic quadrature phase calibration allows for uniform output eyes at low voltages. By realizing a 2-tap equalizer with analog-controlled impedance modulation, output stage current is reduced and driver segmentation is obviated, allowing for reduced pre-driver complexity and further dynamic power savings. Employing a global regulator that provides a replica-bias voltage to the transmit channels, along with staggered switching of the output stage decoupling capacitance, allows for rapid enabling/disabling of the output drivers on a per-channel basis. Leveraging the proposed transmitter design can allow for low-power operation with the capabilities to efficiently support equalization for high-data-rate operation and enable dynamic power management to optimize system performance for varying workload demands.

REFERENCES


**Young-Hoon Song** (M’14) received the B.S. and M.S. degrees in electrical engineering from the University of Texas at Arlington, TX, USA, in 2002 and 2004, respectively, and the Ph.D. degree in electrical engineering from Texas A&M University, College Station, TX, USA, in 2014. In 2008 and 2012, he was an intern at Broadcom, Irvine, CA, USA, and at IBM T. J. Watson Research Center, Yorktown Heights, NY, USA. He is now with Freescale Semiconductor as a high-speed transceiver design engineer. His research interests include mixed-signal integrated circuit design primarily as applied to power-efficient serial link transceivers for high-speed digital communications.

**Hae-Woong Yang** (S’13) received the B.S. and M.E. degrees in electrical and computer engineering from Texas A&M University, College Station, TX, USA, in 2007 and 2009, respectively. He is currently working toward the Ph.D. degree at the Analog and Mixed Signal Center (AMSC) of Texas A&M University.

His interests include low-power high-speed electrical link circuits, clock generation circuits, and signal integrity.

**Hao Li** (S’14) received the B.S. degree in microelectronics from Tsinghua University, Beijing, China, in 2009, and the M.S. degree in computer architecture from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2012. He has been working towards the Ph.D. degree in electrical engineering at Oregon State University, Corvallis, OR, USA, since 2012.

His current research interests include energy-efficient high-speed circuits for electrical and optical I/O.

**Patrick Yin Chiang** (S’99–M’04) received the B.S. degree in electrical engineering and computer sciences from the University of California, Berkeley, CA, USA, in 1998, and the M.S. and Ph.D. degrees in electrical engineering from Stanford University, Stanford, CA, USA, in 2001 and 2007, respectively.

He is an Associate Professor (on sabbatical) at Oregon State University, Corvallis, OR, USA. He is currently a 1000-Talents Young Professor at the ASIC and System State Key Laboratory at Fudan University, Shanghai, China. In 1998, he was a design engineer at Datapath Systems (now LSI), where he designed a low-power standard cell library for xDSL. In 2003, he was a research intern at Velio Communications (now Rambus), investigating 10 GHz clock synthesis techniques. In 2004, he was a consultant at Telegent Systems (now Spreadtrum), where he analyzed low-phase-noise VCOs for TV tuners. In 2006, he was a visiting NSF Research Fellow at Tsinghua University, China, investigating super-regenerative RF transceivers. His interests are in energy-efficient circuits and systems, such as low-power wireline and photonic interfaces, energy-constrained medical sensors, and reliable near-threshold computing.

Dr. Chiang was a recipient of a 2010 Department of Energy Early Career award and a 2012 NSF CAREER award. He is an associate editor of *IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS*, and is on the technical program committee for the IEEE Custom Integrated Circuits Conference.

**Samuel Palermo** (S’98–M’07) received the B.S. and M.S. degrees in electrical engineering from Texas A&M University, College Station, TX, USA, in 1997 and 1999, respectively, and the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, USA, in 2007.

From 1999 to 2000, he was with Texas Instruments, Dallas, TX, USA, where he worked on the design of mixed-signal integrated circuits for high-speed serial data communication. From 2006 to 2008, he was with Intel Corporation, Hillsboro, OR, USA, where he worked on high-speed optical and electrical I/O architectures. In 2009, he joined the Electrical and Computer Engineering Department of Texas A&M University, where he is currently an Assistant Professor. His research interests include high-speed electrical and optical links, clock recovery systems, and techniques for device variability compensation.

Dr. Palermo was a recipient of a 2013 NSF CAREER award. He is a member of Eta Kappa Nu. He currently serves as an associate editor for *IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II*, and served on the IEEE CAS Board of Governors from 2011 to 2012. He received, as a coauthor, the Jack Raper Award for Outstanding Technology Directions Paper at the 2009 IEEE International Solid-State Circuits Conference.