#### UNIVERSITY OF CALIFORNIA

Los Angeles

## Energy-Performance Characterization of CMOS/Magnetic Tunnel Junction (MTJ) Hybrid Logic Circuits

A thesis submitted in partial satisfaction of the requirements for the degree Master of Science in Electrical Engineering

by

Fengbo Ren

2011

© Copyright by Fengbo Ren 2011

The thesis of Fengbo Ren is approved.

Kang L. Wang

Chih-Kong Ken Yang

Dejan Marković, Committee Chair

University of California, Los Angeles 2011

To my dear parents, REN Shusen and CHENG Qiu.

### TABLE OF CONTENTS

- 1 Introduction
  - 1.1 Magnetic Tunnel Junctions
  - 1.2 Motivation for Integrating MTJ with CMOS for Logic Design
  - 1.3 Overview of Previous Work
  - 1.4 Thesis Outline
- 2 MTJ Model
  - 2.1 Considerations for MTJ Modeling
  - 2.2 MTJ Modeling
- 3 Energy-Performance Characterization of Logic-in-Memory MTJ Logic Circuit
  - 3.1 Circuit Architecture
    - 3.1.1 Dynamic Current-Mode Logic (DyCML)
    - 3.1.2 LIM-MTJ
  - 3.2 Energy-Performance Comparison
    - 3.2.1 Comparing Method and Simulation Setup
    - 3.2.2 Simulation Results and Discussions
  - 3.3 Switching Energy Analysis of MTJ
    - 3.3.1 Modeling the Switching Energy of MTJ
    - 3.3.2 Scaling Trend
- 4 Energy-Performance Characterization of MTJ Reading Circuits
  - 4.1 Circuit Architecture
    - 4.1.1 CMSA-Based Reading Circuit
    - 4.1.2 XINV-Based Reading Circuit
  - 4.2 Energy-Performance Comparison
    - 4.2.1 Simulation Setup
    - 4.2.2 Simulation Results and Discussions
- 5 Energy-Performance Characterization of CMOS/MTJ Hybrid Look-Up Table Based Logic Architectures
  - 5.1 Circuit Architecture
    - 5.1.1 CMOS-LUT
    - 5.1.2 CMOS/MTJ Hybrid LUT
  - 5.2 Energy-Performance Comparison
    - 5.2.1 Simulation Setup
    - 5.2.2 Simulation Results and Discussions
- 6 Conclusions
  - 6.1 Summary of Research Contributions
  - 6.2 Future Work
- References

### LIST OF FIGURES

- 1.1 Sketch of basic MTJ structure and illustration of MTJ resistance
- 1.2 Illustration of STT writing scheme
- 1.3 Example R-I curve of the MTJ
- 2.1 Normalized critical current density $J_C$ as a function of current pulse width $\tau$
- 2.2 $J_C$ as a function of $\tau$ at each switching probability
- 2.3 Simulated R-I curve of MTJ
- 3.1 Illustration of DyCML logic style
- 3.2 Schematic of SCMOS 1-bit full adder
- 3.3 Illustration of LIM-MTJ logic style
- 3.4 Switching Waveform of LIM-MTJ 1-bit Full Adder
- 3.5 Illustration of energy-delay tradeoff in logic circuits
- 3.6 Energy-delay comparison of 1-bit adder implementations in SCMOS, DyCML and LIM-MTJ logic styles
- 3.7 Switching energy of MTJ as a function of switching time
- 4.1 Illustration of the reading operation of a CMSA-based reading circuit
- 4.2 Illustration of the reading operation of an XINV-based reading circuit
- 4.3 Waveform of XINV-based reading circuit
- 4.4 Energy-delay comparison between XINV-based and CMSA-based reading circuit at various TMR Ratios
- 4.5 Instant power comparison between XINV-based and CMSA-based reading circuits
- 4.6 Read error rate comparison between XINV-based and CMSA-based reading circuit at various TMR Ratios
- 5.1 Architecture of CMOS-LUT
- 5.2 Architecture of Hybrid-LUT1
- 5.3 Schematic of "READ1XMTJ" block
- 5.4 Architecture of Hybrid-LUT2
- 5.5 Schematic of "READ8XMTJ" block
- 5.6 Illustrations of power gating in idle mode
- 5.7 Configuration energy comparison between CMOS-LUT, Hybrid-LUT1 and Hybrid-LUT2
- 5.8 Delay comparison between CMOS-LUT, Hybrid-LUT1 and Hybrid-LUT2
- 5.9 Leakage power comparison between CMOS-LUT, Hybrid-LUT1 and Hybrid-LUT2
- 5.10 Operation energy (100 MHz) comparison between CMOS-LUT, Hybrid-LUT1 and Hybrid-LUT2
- 5.11 Operation energy (250 MHz) comparison between CMOS-LUT, Hybrid-LUT1 and Hybrid-LUT2
- 5.12 Operation energy (500 MHz) comparison between CMOS-LUT, Hybrid-LUT1 and Hybrid-LUT2
- 5.13 Summary of LUT architectures

### LIST OF TABLES

- 2.1 MTJ Characteristics
- 5.1 Summary of Device Count
- 5.2 Summary of Stand-By Power at Each Technology Node

#### ACKNOWLEDGMENTS

First, I would like to sincerely thank my advisor, Professor Dejan Marković, for all the support and guidance he has given me throughout this study. I have learned a lot, not only from his words but also from the example he sets with his diligence, passion and precision in research, all of which have deeply influenced me. His tirelessness in giving helpful advice and in sharing his knowledge and brilliant ideas with me, along with his friendliness and sense of humor, has truly inspired my enthusiasm and made this study an enjoyable experience. There is no way I could have done this work without his help and support.

I am also very grateful to Professor Chih-Kong Yang and Professor Kang Wang for being on my thesis committee and providing useful comments, which have helped a lot in revising the thesis.

Special thanks go to Dr. Ajey Jacob from Intel, who has given us tremendous help by providing technology updates, useful data and insights for our research. His help with the manuscript of my paper is also appreciated.

In addition, I would like to thank our group members for their help in various aspects. I wish to thank Richard Dorrance for providing the quality MTJ model. It is a pleasure to thank Fang-Li Yuan for sharing his ideas and useful tools with me. I would also like to show my gratitude to Chengcheng Wang, Tsung-Han Yu, Victoria Wang, Chia-Hsiang Yang and Vaibhav Karkare for their patience in answering my questions and for sharing their knowledge and experience with me. In particular, I want to thank Sarah Gibson, Yuta Toriyama and Richard Dorrance for proofreading my thesis and for their great help in revising it. Fruitful discussions with other group members during group meetings are also greatly appreciated. Acknowledgement is also due to Amr Amin for the helpful discussions on MTJ reading circuits. My acknowledgements also go to my friend Wenyao Xu for his thoughtful comments and suggestions on this thesis.

Last, I would like to acknowledge the Western Institute of Nanoelectronics for funding this project. Above all, I wish to express my measureless gratitude towards my parents for their never-ending giving and love. I also want to give great thanks to my beautiful Shufan and her "Maomao" for giving me the greatest support.

#### Abstract of the Thesis

## Energy-Performance Characterization of CMOS/Magnetic Tunnel Junction (MTJ) Hybrid Logic Circuits

by

#### Fengbo Ren

Master of Science in Electrical Engineering

University of California, Los Angeles, 2011

Professor Dejan Marković, Chair

Magnetic tunnel junction (MTJ) devices are CMOS-compatible and offer high stability, high reliability and non-volatility. These features are promising for building non-volatile CMOS/MTJ hybrid logic circuits that consume no off-state leakage current and support ultra-low-power operation. However, most existing proposals for this purpose lack an energy-performance analysis and a comparison to CMOS circuits. In this work, we analyze and compare the energy-performance characteristics of a wide range of CMOS/MTJ hybrid circuits at the device, circuit and architecture levels. The study covers device switching energies, a logic-in-memory MTJ (LIM-MTJ) logic circuit, two MTJ reading circuits and two CMOS/MTJ hybrid look-up table (LUT) architectures. Our analysis shows that the existing LIM-MTJ logic style has no energy-performance advantage over its equivalent CMOS design, and that, once the switching energy of the MTJ is considered, a CMOS/MTJ hybrid circuit requiring frequent MTJ switching is hardly energy efficient. Our simulation results also show that the cross-coupled inverter based MTJ reading circuit has 4 times greater performance and 30 times lower energy than the current-mirror sense amplifier based reading circuit. Finally, we show that the proposed CMOS/MTJ hybrid LUT based logic architecture, which requires no MTJ switching during logic operations, can exploit the non-volatility of the MTJ to alleviate the leakage problem of CMOS, thereby supporting ultra-low-power operation in advanced technology nodes (32-nm and beyond).

### CHAPTER 1

### Introduction

The explosive growth of the semiconductor industry over the past decade has been driven by the rapid scaling of complementary metal-oxide-semiconductor (CMOS) technology. However, evolutionary CMOS scaling has run into physical constraints and will likely become very difficult at and below the 22-nm node. As the physical gate length of CMOS devices approaches its physical limit [1], many short-channel effects arise, resulting in very high device leakage and performance instability, which greatly deteriorate the energy efficiency and functionality of CMOS circuits. CMOS circuits lose their information during unexpected power supply interruptions (volatility), and the high leakage gives rise to high standby power, making it difficult to implement designs for low-power applications.

In order to extend the scaling and to reduce the energy dissipation for ultra-low-power applications, various emerging approaches for realizing new electrical switches with a variety of nano-scale technologies have been suggested in the ITRS roadmap [2]. However, CMOS technology will continue to advance along the prescribed lines over the next decade and to lead technology innovations despite its increasing scaling problems [2]. Thus, in the short term, people will keep looking for new switches that supplement CMOS, are CMOS-compatible and can support low-power operation. Spin-based devices are among the candidates for these goals, as the energy needed to change an electron spin is much smaller than what is needed to move the electronic charge [3].

Figure 1.1: Sketch of basic MTJ structure and illustration of resistive states, (a) $R_P$, (b) $R_{AP}$.

#### 1.1 Magnetic Tunnel Junctions

The magnetic tunnel junction (MTJ) is one of the most basic and most significant spin-based devices. The basic structure of the MTJ is shown in Fig. 1.1. The MTJ consists of two layers of ferromagnetic material separated by an extremely thin, nonconductive tunneling barrier (MgO, $Al_2O_3$, etc.). The thicker layer, which has a certain layer stack structure (not shown in Fig. 1.1) fixing its magnetic orientation, is called the fixed layer or the pinned layer. The thinner layer, whose magnetic orientation can be changed freely by an external magnetic field, is called the free layer. The MTJ exhibits two resistive states depending on the relative orientation of the magnetization directions of the two ferromagnetic layers, due to the spin-dependent tunneling involved in the electron transport between the majority and minority spin states. If the spin orientations are parallel (P), electrons are more likely to tunnel through the thin barrier without being strongly scattered when a voltage is applied across the MTJ, resulting in a high current flow and, therefore, a low resistance $(R_P)$. On the other hand, the resistance is high $(R_{AP})$ if the spin orientations are anti-parallel (AP). The resistance change is measured using the tunnel magnetoresistance (TMR) ratio, which is defined as $\Delta R/R = (R_{AP} - R_P)/R_P$. A high TMR ratio is one of the key parameters desired in both logic and memory applications. With the MgO barrier, the TMR ratio can reach 500% at room temperature and 1010% at 5 K [5]. Most practical MTJs have TMR ratios between 50% and 150%.
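As a quick numerical illustration of the TMR definition above, the following minimal sketch computes the ratio from the two resistance states. The function name `tmr_ratio` is our own choice for illustration; the sample values are the $R_P$ and $R_{AP}$ listed in Table 2.1.

```python
def tmr_ratio(r_p: float, r_ap: float) -> float:
    """Return the TMR ratio (R_AP - R_P) / R_P as a fraction (1.0 means 100%)."""
    if r_p <= 0:
        raise ValueError("R_P must be positive")
    return (r_ap - r_p) / r_p


if __name__ == "__main__":
    # R_P = 700 ohm and R_AP = 1400 ohm, as in Table 2.1
    print(f"TMR = {tmr_ratio(700.0, 1400.0):.0%}")
```

With these values, the resistance doubles between the P and AP states, which corresponds to a TMR ratio of 100%.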

The conventional writing operation of the MTJ (in memory applications) is carried out by applying two "half-select" magnetic fields generated by currents flowing through metal wires on top of the free layer [4]. However, the current required in this writing scheme is extremely high, and it scales inversely with the device size [5]. The discovery of the spin-transfer-torque (STT) phenomenon in 1996 brought a breakthrough in the writing scheme [6]. Slonczewski's theory indicates that the magnetization orientation of magnets can be controlled by the direct transfer of spin angular momentum from a spin-polarized current. Therefore, a current flowing through an MTJ is polarized by the fixed layer and exerts a torque on the magnetization of the free layer, which may eventually switch the magnetization direction if the current density is sufficiently high. The STT writing scheme is illustrated in Fig. 1.2. In STT writing, the switching between $R_P$ and $R_{AP}$ is controlled by the direction of the writing current. A writing current flowing from the free layer to the fixed layer writes the MTJ into the parallel state $(R_P)$, while a current flowing in the opposite direction results in the anti-parallel state $(R_{AP})$. To ensure switching, the density of the writing current has to be higher than the critical current density $J_C$, which is defined as the minimum current density required to switch the MTJ for a given switching time.



Figure 1.2: Illustration of STT writing scheme. (a) write from AP to P, (b) write from P to AP.

MgO-barrier MTJs have been shown in the literature to exhibit a wide range of $J_C$, from $8 \times 10^5$ to $2 \times 10^7 \ A/cm^2$ [5]. Most practical MTJs have $J_C$ in the range of $2$-$7 \times 10^6 \ A/cm^2$ [7] (equivalent to a switching current of 0.5-1.5 mA, assuming a practical MTJ size). Consequently, STT writing consumes much less energy than conventional writing. More importantly, the current required for STT writing scales down linearly with the dimensions of the MTJ [5].
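To make the link between critical current density and switching current concrete, the sketch below converts $J_C$ into a current via $I = J_C \cdot A$. The elliptical 200 nm by 150 nm cross-section is a hypothetical pillar size chosen only for illustration (the text does not specify the "practical MTJ size"), and both function names are our own.

```python
import math


def ellipse_area_cm2(major_nm: float, minor_nm: float) -> float:
    """Area of an elliptical pillar in cm^2; axes are full lengths in nm."""
    nm_to_cm = 1e-7
    return math.pi / 4.0 * (major_nm * nm_to_cm) * (minor_nm * nm_to_cm)


def switching_current_ma(j_c_a_per_cm2: float, area_cm2: float) -> float:
    """Switching current in mA for a critical current density in A/cm^2."""
    return j_c_a_per_cm2 * area_cm2 * 1e3


if __name__ == "__main__":
    area = ellipse_area_cm2(200.0, 150.0)  # hypothetical pillar size
    for j_c in (2e6, 7e6):                 # practical J_C range quoted from [7]
        print(f"J_C = {j_c:.0e} A/cm^2 -> I = {switching_current_ma(j_c, area):.2f} mA")
```

Under this assumed geometry the two ends of the practical $J_C$ range map to roughly 0.47 mA and 1.65 mA, which is of the same order as the 0.5-1.5 mA range quoted above. The sketch also shows why the STT writing current shrinks with the device: for a fixed $J_C$, the current is proportional to the junction area.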

With the STT writing scheme, the MTJ can be used in circuit design as a current or bias voltage controlled variable resistance device. Knowing how the resistance of the MTJ changes with current is therefore as important as understanding the I-V curve of a CMOS transistor. Fig. 1.3 shows an example of the resistance-current (R-I) curve of the MTJ. As shown in this figure, the MTJ can have asymmetric switching currents. The switching current of AP to P ( $I_{S(AP->P)}$ ) can be up to 3 times smaller than that of P to AP ( $I_{S(P->AP)}$ ). However, such currents can be easily driven by a 90-nm CMOS transistor, which can deliver 1 mA of current per 1 $\mu$m of gate width. Therefore, the MTJ is compatible with CMOS technology from this point of view.

Figure 1.3: Example R-I curve of the MTJ. Data is from [8].

#### 1.2 Motivation for Integrating MTJ with CMOS for Logic Design

It has been demonstrated that MTJs can play significant roles in spin-torque-transfer random access memory (STT-RAM) [10][11], which is considered to be a strong candidate for universal memory [4][12]. Any memory device can be used to build a logic circuit, at least in theory, and the MTJ is no exception, as it has a relatively high TMR ratio, which has kept improving since the introduction of MgO as the tunneling barrier. Also, the MTJ is CMOS-compatible and offers high stability, reliability and non-volatility [5]. In addition, the MTJ can be directly fabricated on top of CMOS devices (3D stack) to reduce the area cost [8]. All these features are promising for building a 3D-stacked, non-volatile CMOS/MTJ hybrid logic circuit that does not consume off-state leakage current, thereby alleviating the leakage problem of CMOS.

As the leakage in CMOS devices tends to increase exponentially with technology scaling, leakage power has exceeded dynamic power and has become the major power consumption in advanced CMOS technology [13], and it will continue to increase. Moreover, the volatility of CMOS devices limits the usage of leakage reduction techniques, such as power gating, in many memory intensive applications, resulting in high standby power. By introducing the MTJ's non-volatility into CMOS in these applications, the stand-by power can be reduced. Therefore, CMOS/MTJ hybrid circuits may be able to support ultra-low-power operation at more advanced technology nodes, as their advantage of saving leakage power will become increasingly significant with technology scaling.

#### 1.3 Overview of Previous Work

So far, several CMOS/MTJ hybrid computing architectures have been proposed in the literature. Among these proposals, some suggest implementing logic by using the magnetic field interaction caused by current input lines passing over the MTJ element to change the magnetization of the free layer [14], or by using a sense amplifier to read the total resistance difference between two groups of MTJ stacks [15][16][17]. Others propose using MTJs as memory cells and CMOS as the control circuits that conduct the writing and reading operations, in order to implement a non-volatile flip-flop [18].

However, almost all of the proposals on CMOS/MTJ hybrid circuits are conceptual, with little energy and performance analysis or comparison to CMOS circuits. Many of these studies lack circuit simulations [15][16][17][18]. Only one paper reports simulation results on power and performance in comparison to the CMOS implementation of a 1-bit adder design [19]. The paper claims that a logic-in-memory MTJ (LIM-MTJ) 1-bit full adder has both lower dynamic and lower static power than the static CMOS (SCMOS) implementation. In Chapter 3, however, we will show that [19] omits the dynamic CMOS implementation, considers only one point in the energy-delay space, and does not include the time and energy for writing an MTJ cell. Besides, [19] models the MTJ as a simple resistor in circuit simulations, which omits many non-ideal characteristics of the MTJ, so its conclusions are less convincing.

Therefore, the aim of this work is to analyze the energy-performance characteristics of CMOS/MTJ hybrid logic circuits using simulations at the device, circuit and architecture levels, in order to determine which structure is best for this new technology. We also compare these circuits with their equivalent CMOS implementations to see how much improvement can be achieved. In our simulations, we use a compact Verilog-A MTJ model [21] that is accurate to within $\pm 3\%$ of micro-magnetic simulation.

#### 1.4 Thesis Outline

The subsequent chapters will present in detail the MTJ model we used, the circuit structures we studied, the comparison methods and the results. Chapter 2 briefly introduces MTJ modeling and discusses some considerations in MTJ modeling. Chapter 3 presents an energy-performance analysis of the LIM-MTJ logic style on the circuit level, along with a switching energy analysis of the MTJ on the device level. The energy, performance and reliability comparison of two different MTJ reading circuits is discussed in Chapter 4. Chapter 5 presents an architecture-level study of the energy-performance characteristics of CMOS/MTJ hybrid LUT based logic architectures, which we believe are the most suitable structures for CMOS/MTJ hybrid logic, and their comparison to the CMOS-LUT architecture. Chapter 6 summarizes the contributions of this work, concludes the thesis and discusses future work.

### CHAPTER 2

### MTJ Model

Computer-aided-design (CAD) tools play a significant role in modern circuit design [22]–[28]. With device models, we can simulate and verify the functionality of circuits to avoid failures before fabrication. However, the modeling of MTJs for circuit simulation purposes is still in its initial stage. In this chapter, some considerations for MTJ modeling from a circuit point of view are discussed. Also, the compact Verilog-A MTJ model presented in [21] is briefly introduced. This model is used for all the circuit simulations in this study.

#### 2.1 Considerations for MTJ Modeling

In circuit design, the MTJ is usually used as a current or bias voltage controlled variable resistance ( $R_P$ and $R_{AP}$ ) device. Thus, an accurate R-I curve is the key to MTJ modeling. The MTJ has much more complicated resistance characteristics than a resistor with a constant resistance. As shown in Fig. 1.3, $R_{AP}$ is highly dependent on the current flowing through the MTJ, while $R_P$ is more stable and varies little with the current. The current-induced lowering of $R_{AP}$ can significantly deteriorate the effective TMR ratio, which may cause reading errors. This current, or equivalently bias voltage, dependency is therefore an important consideration in MTJ modeling. Another important consideration is the asymmetry of the switching currents of the MTJ. For recently developed MTJs, $I_{S(P->AP)}$ is usually larger than $I_{S(AP->P)}$, and the skew ratio can be as high as 2-3. Accurate modeling of this asymmetry is crucial for estimating the writing margin, so that both AP to P and P to AP switching can be guaranteed in the writing operation.

MTJs are sensitive to thermal noise, as ferromagnetic materials are sensitive to temperature variation. Higher temperatures tend to increase the thermal fluctuation of magnets, resulting in a larger initial angle between the magnetization directions of the free layer and the fixed layer [30]. At higher temperatures, MTJs therefore exhibit a lower switching current and a smaller TMR ratio, so reading errors and accidental switching may occur when the temperature increases. Unfortunately, MTJs can easily heat up in a real circuit environment, either from the currents flowing through them or from direct heat propagation from the switching CMOS devices beneath. Thus, temperature dependency should be considered in MTJ modeling.

In fact, the critical current density $(J_C)$ of the MTJ is not fixed but is a function of the current pulse width $(\tau)$ [9]. In other words, $J_C$ is a function of the switching time $(t_s)$. Fig. 2.1 shows a typical relation between $J_C$ and $\tau$ for 50% switching probability. MTJ switching takes place in three regions. In the thermally activated switching region, $J_C$ decreases linearly with the logarithm of $\tau$, while in the precessional switching region, $J_C$ is inversely proportional to $\tau$. The middle region, called dynamic reversal, is a combination of precessional and thermally activated switching. This indicates that, for a given MTJ characteristic, there are many design options to choose from: fast switching with a higher current, or slow switching with a lower current. Thus, we have to find the optimal design point with respect to a certain metric, e.g. energy. Considering this insight, we find that modeling the dependency of $J_C$ on $\tau$ is very useful. This modeling will be introduced in Section 3.3.1.

Figure 2.1: Critical current density $J_C$ (normalized to $J_{C0} = 5 \times 10^6 \ A/cm^2$) as a function of current pulse width $\tau$ for 50% switching probability. Data is from [9].
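The two limiting regimes described for Fig. 2.1 can be sketched numerically as follows. This is only an illustration of the stated trends, not the model of Section 3.3.1: the thermal-activation form $J_C = J_{C0}\,[1 - \ln(\tau/\tau_0)/\Delta]$ is a commonly used expression that we assume here, and the stability factor `DELTA`, the attempt time `TAU0` and the constant `C` are hypothetical values.

```python
import math


def jc_thermal(tau_s: float, jc0: float, delta: float, tau0_s: float) -> float:
    """Thermally activated regime: J_C falls linearly with ln(tau).

    Assumed form: J_C = J_C0 * (1 - ln(tau / tau0) / delta).
    """
    return jc0 * (1.0 - math.log(tau_s / tau0_s) / delta)


def jc_precessional(tau_s: float, c: float) -> float:
    """Precessional regime: J_C is inversely proportional to tau (J_C = c / tau)."""
    return c / tau_s


if __name__ == "__main__":
    JC0 = 5e6     # A/cm^2, normalization value quoted in Fig. 2.1
    DELTA = 40.0  # hypothetical thermal stability factor
    TAU0 = 1e-9   # s, hypothetical attempt time
    C = 1e-2      # A*s/cm^2, hypothetical proportionality constant
    for tau in (1e-6, 1e-5, 1e-4):
        print(f"thermal      tau={tau:.0e} s  J_C={jc_thermal(tau, JC0, DELTA, TAU0):.3e} A/cm^2")
    for tau in (1e-9, 2e-9, 4e-9):
        print(f"precessional tau={tau:.0e} s  J_C={jc_precessional(tau, C):.3e} A/cm^2")
```

In the thermal regime each decade of pulse width lowers $J_C$ by the same amount, whereas in the precessional regime doubling the pulse width halves $J_C$. This is the trade-off between fast, high-current switching and slow, low-current switching mentioned above.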

To be more specific, $J_C$ is a function of $\tau$ at each switching probability, which means the curve in Fig. 2.1 is actually a band consisting of a series of curves at different switching probabilities. Fig. 2.2 shows an example of $J_C$ as a function of $\tau$ at each switching probability. In any MTJ-based application, switching in the writing operation should always be guaranteed, while accidental switching in the reading operation should always be avoided. Therefore, for a given MTJ characteristic, the design regions for the writing and reading operations should be the areas in red and blue in Fig. 2.2, respectively. We should leave enough margin for both reading and writing operations to avoid the band in between, where switching may happen with a certain probability. However, modeling the switching probability is not an easy task. Modeling $J_C$ for 0% and 100% switching probability would be enough, since these are the only boundaries we care about.



Figure 2.2:  $J_C$  as a function of  $\tau$  at each switching probability. [Courtesy of Prof. J.P. Wang, UMN.]

Therefore, for circuit simulation purposes, circuit designers have a great need for an MTJ model that provides accurate R-I curves and takes the following into account:

- Bias voltage dependency
- Asymmetric switching current  $(I_{S(P->AP)} > I_{S(AP->P)})$
- Temperature dependency
- Current pulse width dependency
- Probability of switching

#### 2.2 MTJ Modeling

The MTJ model used in this work is the compact Verilog-A model presented in [21]. It incorporates asymmetric switching current, bias voltage dependency and temperature dependency. In this MTJ model, the motion of the free-layer magnetization $(\vec{M})$ in the presence of STT is described by the generalized Landau-Lifshitz-Gilbert (LLG) equation,

$$\frac{\partial \vec{m}}{\partial t} = -\gamma M_S \vec{m} \times \left(\frac{\vec{H}_{eff}}{M_S} + \frac{J_e}{J_p} b(\theta) (\vec{m} \times \vec{p}) - \alpha \frac{\partial \vec{m}}{\partial t}\right), \tag{2.1}$$

where $\vec{m}$ is the unit vector in the direction of $\vec{M}$, t is time, $\gamma$ is the absolute value of the gyromagnetic ratio, $M_S$ is the saturation magnetization, $\vec{H}_{eff}/M_S$ is the effective magnetic field, $\vec{p}$ is the unit vector in the direction of the magnetization of the fixed layer $(\vec{P})$, $J_e$ is the current density, $\theta$ is the angle between $\vec{M}$ and $\vec{P}$ ($\theta = 0^\circ$ for P and $\theta = 180^\circ$ for AP), and $\alpha$ is the Gilbert damping constant $(\alpha > 0)$. $J_p$ is the characteristic current density defined by

$$J_p = \gamma M_S \frac{eM_S d}{g_e \mu_b},\tag{2.2}$$

where e is the absolute value of the electron charge, d is the thickness of the free layer, and $g_e$ and $\mu_b$ are constants.

$b(\theta)$ in Eq. 2.1 is the spin-polarization efficiency factor, given by

$$b(\theta) = \frac{P}{X + Y\cos(\theta)} \tag{2.3}$$

where P is the percentage of electrons in currents polarized in the direction of  $\vec{P}$ , X and Y are two fitting parameters that model the difference of spin-polarization between P and AP states, thereby modeling the asymmetry of switching currents.
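To see how the fitting parameters X and Y in Eq. 2.3 introduce asymmetry, the sketch below evaluates $b(\theta)$ at the P ($\theta = 0^\circ$) and AP ($\theta = 180^\circ$) states. The numerical values of P, X and Y are hypothetical and are not the fitted values of the model in [21]; the function name is our own.

```python
import math


def spin_efficiency(theta_deg: float, p: float, x: float, y: float) -> float:
    """Efficiency factor b(theta) = P / (X + Y * cos(theta)) from Eq. 2.3."""
    denom = x + y * math.cos(math.radians(theta_deg))
    if denom == 0:
        raise ValueError("X + Y*cos(theta) must be non-zero")
    return p / denom


if __name__ == "__main__":
    P, X, Y = 0.6, 1.0, 0.5  # hypothetical illustration values
    b_p = spin_efficiency(0.0, P, X, Y)     # parallel state: P / (X + Y)
    b_ap = spin_efficiency(180.0, P, X, Y)  # anti-parallel state: P / (X - Y)
    print(f"b(P) = {b_p:.3f}, b(AP) = {b_ap:.3f}, ratio = {b_ap / b_p:.2f}")
```

For any Y > 0 the efficiency is larger in the AP state than in the P state, by the factor $(X + Y)/(X - Y)$. Since the spin-torque term in Eq. 2.1 is proportional to $J_e\, b(\theta)$, a larger efficiency would be expected to lower the current needed to leave that state, which is in line with $I_{S(AP->P)} < I_{S(P->AP)}$.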

The temperature dependencies of $M_S$ in Eq. 2.1 and P in Eq. 2.3 are described as

$$M_S(T) = M_{S0}\left(1 - \frac{T}{T_C}\right)^{\beta}, \tag{2.4}$$

and

$$P(T) = P_0\left(1 - \alpha_{sp}T^{\frac{3}{2}}\right), \tag{2.5}$$

where $M_{S0}$ is the saturation magnetization at absolute zero, $P_0$ is the spin polarization at absolute zero, $T_C$ is the Curie temperature, and $\beta$ and $\alpha_{sp}$ are material-dependent constants.
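Eqs. 2.4 and 2.5 translate directly into two small helper functions. The sketch below is a minimal illustration; the parameter values ($T_C$, $\beta$, $\alpha_{sp}$, $M_{S0}$, $P_0$) are hypothetical placeholders, not the calibrated values of the model in [21].

```python
def saturation_magnetization(t_kelvin: float, m_s0: float, t_curie: float, beta: float) -> float:
    """Eq. 2.4: M_S(T) = M_S0 * (1 - T / T_C) ** beta, valid for 0 <= T < T_C."""
    if not 0.0 <= t_kelvin < t_curie:
        raise ValueError("temperature must satisfy 0 <= T < T_C")
    return m_s0 * (1.0 - t_kelvin / t_curie) ** beta


def spin_polarization(t_kelvin: float, p0: float, alpha_sp: float) -> float:
    """Eq. 2.5: P(T) = P_0 * (1 - alpha_sp * T ** 1.5)."""
    if t_kelvin < 0.0:
        raise ValueError("temperature must be non-negative")
    return p0 * (1.0 - alpha_sp * t_kelvin ** 1.5)


if __name__ == "__main__":
    # Hypothetical material constants, for illustration only
    M_S0, T_C, BETA = 1000.0, 1120.0, 0.4  # emu/cm^3, K, dimensionless
    P_0, ALPHA_SP = 0.6, 2e-5              # dimensionless, K^-1.5
    for t in (0.0, 300.0, 400.0):
        ms = saturation_magnetization(t, M_S0, T_C, BETA)
        p = spin_polarization(t, P_0, ALPHA_SP)
        print(f"T = {t:5.1f} K  M_S = {ms:7.1f}  P = {p:.3f}")
```

Both quantities decrease monotonically with temperature, which is the origin of the smaller TMR ratio and lower switching current at elevated temperatures discussed in Section 2.1.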

The MTJ conductance is modeled as a function of  $\theta$ ,

$$G(\theta) = G_T\left[1 + P^2 \cos(\theta)\right] + G_{SI}, \tag{2.6}$$

where $G_T$ is the conductance component due to direct elastic tunneling and $G_{SI}$ is the conductance component due to imperfections in the barrier layer. Since P and G are both temperature-dependent parameters, the TMR ratio is also temperature-dependent. According to the definition, the TMR ratio with zero applied voltage is given by

$$TMR_0 = \frac{1/G(180^\circ) - 1/G(0^\circ)}{1/G(0^\circ)}. \tag{2.7}$$

Replacing  $G(\theta)$  with Eq. 2.6, and substituting P in Eq. 2.6 with Eq. 2.5, Eq. 2.7 is then expressed as

$$TMR_0(T) = \frac{2P_0^2\left(1 - \alpha_{sp}T^{\frac{3}{2}}\right)^2}{1 - P_0^2\left(1 - \alpha_{sp}T^{\frac{3}{2}}\right)^2 + \frac{G_{SI}}{G_T}}. \tag{2.8}$$
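For completeness, the intermediate steps between Eq. 2.7 and Eq. 2.8 can be written out as follows, assuming the conductance of Eq. 2.6 takes the values $G(0^\circ) = G_T(1 + P^2) + G_{SI}$ and $G(180^\circ) = G_T(1 - P^2) + G_{SI}$ in the P and AP states:

$$\begin{aligned}
TMR_0 &= \frac{1/G(180^\circ) - 1/G(0^\circ)}{1/G(0^\circ)} = \frac{G(0^\circ) - G(180^\circ)}{G(180^\circ)} \\
&= \frac{2 G_T P^2}{G_T(1 - P^2) + G_{SI}} = \frac{2P^2}{1 - P^2 + \frac{G_{SI}}{G_T}}.
\end{aligned}$$

Substituting $P = P_0(1 - \alpha_{sp}T^{\frac{3}{2}})$ from Eq. 2.5 then gives Eq. 2.8. In the limit $G_{SI} \to 0$ the expression reduces to $2P^2/(1 - P^2)$, so the imperfection conductance $G_{SI}$ always lowers the TMR ratio.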

The bias voltage dependency is included by adding a simple fitting function, given by

$$TMR(T,V) = \frac{TMR_0(T)}{1 + \left(\frac{V}{V_0}\right)^2}, \tag{2.9}$$

where  $V_0$  is a fitting parameter.
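The effect of the bias fitting function in Eq. 2.9 can be illustrated with a short sketch. The value of $V_0$ below is hypothetical, and the helper `r_ap_effective` additionally assumes that $R_P$ is independent of bias, which the text describes as only approximately true; both function names are our own.

```python
def tmr_biased(tmr0: float, v: float, v0: float) -> float:
    """Eq. 2.9: TMR(T, V) = TMR_0(T) / (1 + (V / V_0) ** 2)."""
    if v0 == 0:
        raise ValueError("V_0 must be non-zero")
    return tmr0 / (1.0 + (v / v0) ** 2)


def r_ap_effective(r_p: float, tmr0: float, v: float, v0: float) -> float:
    """Effective R_AP = R_P * (1 + TMR(V)), assuming a bias-independent R_P."""
    return r_p * (1.0 + tmr_biased(tmr0, v, v0))


if __name__ == "__main__":
    R_P, TMR0 = 700.0, 1.0  # R_P and zero-bias TMR ratio from Table 2.1
    V0 = 0.5                # V, hypothetical fitting value
    for v in (0.0, 0.25, 0.5):
        print(f"V = {v:.2f} V  TMR = {tmr_biased(TMR0, v, V0):.0%}  "
              f"R_AP = {r_ap_effective(R_P, TMR0, v, V0):.0f} ohm")
```

By construction, $V_0$ is the bias at which the TMR ratio falls to half of its zero-bias value. Under the assumptions above, $R_{AP}$ drops from 1400 $\Omega$ at zero bias to 1050 $\Omega$ at $V = V_0$, which illustrates the current-induced lowering of $R_{AP}$ discussed in Section 2.1.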

In this model, three fitting parameters, X, Y and $V_0$, are used so that the model can be adjusted to fit a wide range of MTJ characteristics. For this work, they are tuned to fit a scaled MTJ with the characteristics shown in Table 2.1. An MTJ with such low switching currents may be realized within several years if the device size keeps scaling down. Therefore, by using this MTJ model we hope to obtain predictive energy and performance numbers for future CMOS/MTJ hybrid circuit technology.

Table 2.1: MTJ Characteristics

| Parameter                  | Value          |
|----------------------------|----------------|
| $R_P$                      | 700 $\Omega$   |
| $R_{AP}$                   | 1400 $\Omega$  |
| TMR Ratio                  | 100%           |
| $I_{S(P->AP),t_s=3ns}$     | 500 $\mu$A     |
| $I_{S(AP->P),t_s=3ns}$     | 228 $\mu$A     |

The simulated R-I curve and temperature dependency are shown in Fig. 2.3. Fig. 2.3 (a) demonstrates that the MTJ characteristics (Table 2.1) are well modeled at room temperature. The temperature dependency shown in Fig. 2.3 (b) has been calibrated to the data extracted from [29]. It is shown that at  $T=125 \ ^{\circ}C$ , the TMR ratio and switching currents drop by about 23% and 20%, respectively.



Figure 2.3: Simulated R-I Curve of MTJ (a) at room temperature, (b) at each temperature from -25 to 125  $^\circ C.$ 

## CHAPTER 3

# Energy-Performance Characterization of Logic-in-Memory MTJ Logic Circuit

Most proposals on CMOS/MTJ hybrid circuits so far lack an energy-performance analysis and comparison with CMOS circuits [14]–[18]. Only one paper reports any power and performance comparison with a CMOS implementation, for a 1-bit adder design. The logic-in-memory MTJ (LIM-MTJ) logic style is proposed in [19], whose authors claim that a LIM-MTJ 1-bit full adder has both lower dynamic and static power than a static CMOS (SCMOS) implementation. In this chapter, we analyze the energy and performance of the LIM-MTJ 1-bit full adder and compare it with two CMOS implementations. This work has been published in [20]. Furthermore, the switching energy of the MTJ and its scaling trend are also analyzed.

#### 3.1 Circuit Architecture

#### 3.1.1 Dynamic Current-Mode Logic (DyCML)

DyCML circuits combine the advantages of MOS current-mode logic (MCML) circuits with those of dynamic logic families to achieve high performance at a low voltage-swing and a low power dissipation [33]. Fig. 3.1 (a) shows the general structure of DyCML logic. A function F is implemented using two pull-down



Figure 3.1: Illustration of DyCML logic style. (a) General structure, (b) Schematic of DyCML 1-bit full adder.

networks, one implementing F and the other F'. Either the F or the F' branch will turn on, causing the logic output to evaluate. During the pre-charge phase (CLK = 0), both outputs are pre-charged to "1" and the capacitance transistor  $(C_L)$  is fully discharged. During the evaluation phase (CLK = 1), the pull-down network with the lower resistance will discharge its output to "0". At the same time, the cross-coupled PMOS transistor in the opposite branch will turn on to compensate for the leakage current and keep its output at "1". As a result, the voltage levels of the two outputs separate and become complementary.  $C_L$  serves as a virtual ground during the evaluation phase and eliminates static power. By adjusting the width of the  $C_L$  transistor, the voltage swing can be controlled, allowing the circuit to trade off speed against power consumption.

A 1-bit full adder implemented with DyCML circuit is shown in Fig. 3.1 (b). It consists of 32 transistors as compared to 28 transistors in a SCMOS realization shown in Fig. 3.2.



Figure 3.2: Schematic of SCMOS 1-bit full adder.

#### 3.1.2 LIM-MTJ

Fig. 3.3 (a) shows the general structure of the LIM-MTJ logic style. For a function F, two logic networks are constructed from MTJs and CMOS transistors satisfying the inequalities R(X,Y) < R'(X,Y) when F = 0 and R(X,Y) > R'(X,Y) when F = 1. A current comparator senses the current difference (or resistance difference) between the two pull-down networks. If I > I', Z = 0; otherwise (I < I'), Z = 1. The LIM-MTJ logic is implemented using the DyCML structure (Fig. 3.1 (a)). The only difference between LIM-MTJ and DyCML is that the pull-down network in LIM-MTJ contains, in addition to regular CMOS transistors, MTJs that serve as both memory and functional inputs. Therefore, LIM-MTJ can be considered an MTJ-based DyCML.

Fig. 3.3 (b) shows a 1-bit full adder implemented with a LIM-MTJ circuit. It consists of 34 CMOS transistors (26 for logic, 8 for MTJ writing) and 4 MTJs. The use of MTJs cuts down the number of logic transistors to 26, but requires another 8 transistors to perform MTJ writing, giving no advantage in transistor



Figure 3.3: Illustration of LIM-MTJ logic style. (a) General structure, (b) Schematic of LIM-MTJ 1-bit full adder.

count. The MTJ is used to store complementary inputs (B and B'). In this case,  $R_{AP}$  represents "0", and  $R_P$  represents "1". The B and B' inputs are written via STT by the transistors shown in the shaded area and controlled by external WL and BL signals. The writing transistors are upsized to ensure that they can provide enough current to the MTJ to flip the magnetic state. Other transistors are sized to ensure they do not accidentally flip the MTJ while the circuit is in the evaluation mode. In order to best utilize the non-volatility feature of MTJs, the stored input should always be the one that is most infrequently changed, which is presumed to be the most significant bit (MSB) of the circuit in 2's complement arithmetic.

Fig. 3.4 shows an example waveform of switching in a LIM-MTJ 1-bit full adder. In this example, the clock is running at 100 MHz and the voltage swing is  $V_{DD}/2$ . For a certain input vector (for example A = 1, B = 1,  $C_i = 1$ ), both pull-down networks in the sum circuit will have relatively low resistance, differing by  $R_{AP} - R_P$ . Subsequently, both networks will drive at the beginning of the



Figure 3.4: Switching waveform of LIM-MTJ 1-bit Full Adder. The data is from HSPICE simulation with 90-nm predictive technology model.

evaluation phase. However, the branch with B = 1 ( $R_P$ ) will drive faster and turn on the cross-coupled PMOS of the B' branch to prevent its output (= 1) from being pulled down. This contention results in glitches on S and  $C_0$  as shown in Fig. 3.4. Since outputs usually serve as inputs to the next stage, this glitch (the voltage drop of output "1" at the beginning of the evaluation phase) is undesirable and can degrade performance or even cause incorrect evaluation in the next stage. The voltage drop depends on both the absolute resistance of the pull-down network (with output "1") and the relative resistance difference between the two branches. The higher the resistance and the resistance difference, the smaller the voltage drop. Also, signal degradation of S caused by leakage can be observed in the waveform (Fig. 3.4) for certain input vectors. This leakage current is caused by the DC current flowing through the cross-coupled PMOS



Figure 3.5: Illustration of energy-delay tradeoff in logic circuits

and the pull-down network with the higher resistance. It should be noted that a device with a higher TMR ratio would reduce the amount of leakage.

#### 3.2 Energy-Performance Comparison

#### 3.2.1 Comparing Method and Simulation Setup

To evaluate the potential improvements in performance and energy provided by new devices, we plot the energy-delay curve (EDC) for various circuit functions and compare designs in the new device technology with those in CMOS. The EDC is plotted by tuning circuit parameters such as transistor size, supply voltage and threshold voltage. As shown in Fig. 3.5, the EDC is plotted as time-per-operation versus energy-per-operation. This plot not only shows the best-performance and lowest-energy design points, but also indicates the best energy-delay tradeoff that can be achieved. The solid line in Fig. 3.5 shows the optimal EDC that can be achieved with a certain circuit topology and device. All design points in the region above the solid line are suboptimal, while the ones below are infeasible. The EDC is bounded by the minimum-delay point (MDP) and the minimum-energy point (MEP), where one variable usually hits its upper or lower bound (e.g.  $V_{DD}$  is at its upper bound at the MDP). From a circuit point of view, our goal in investigating the suitability of new devices is to find circuit implementations that operate at points marked as "X". We expect "X" points below the solid line of CMOS designs to be more likely to lie in the low-power region below the MEP than in the high-performance region beyond the MDP, because one of the premises of new device technologies is to alleviate the leakage problem of CMOS. Previous work [34] has shown that for a minor delay increase (less than 25% from the MDP), sizing is the most efficient way to reduce energy, while for a delay increase greater than 25%,  $V_{DD}$  scaling is the most efficient. Therefore, the EDC in the ultra-low-energy region (the region of interest to us) can be quickly estimated by simply sweeping  $V_{DD}$ .
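The notion of an optimal EDC can be made concrete with a short Python sketch that extracts the Pareto-optimal points from a set of simulated (delay, energy) design points. The function name `optimal_edc` and the sample points are hypothetical and serve only to illustrate the idea; they are not data from this work.

```python
def optimal_edc(points):
    """Return the Pareto-optimal (delay, energy) points, sorted by delay.

    A design point is kept only if it has strictly lower energy than every
    point that is at least as fast; all other points are suboptimal.
    """
    front = []
    best_energy = float("inf")
    for delay, energy in sorted(points):
        if energy < best_energy:
            front.append((delay, energy))
            best_energy = energy
    return front


# Hypothetical design points in arbitrary normalized units.
SAMPLE_POINTS = [(1.0, 10.0), (1.5, 6.0), (2.0, 7.0),
                 (3.0, 3.0), (4.0, 3.5), (5.0, 2.0)]

if __name__ == "__main__":
    front = optimal_edc(SAMPLE_POINTS)
    print("MDP:", front[0], "MEP:", front[-1])
```

With the points sorted by delay, the first element of the returned list plays the role of the MDP and the last element the role of the MEP.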

Since LIM-MTJ can be regarded as MTJ-based DyCML, its true CMOS counterpart is DyCML, not SCMOS. Therefore, the EDCs of LIM-MTJ, DyCML and SCMOS 1-bit full adders are compared in HSPICE using predictive technology models (PTM). For insight into scaling trends, each EDC is plotted by scaling  $V_{DD}$  using 180nm, 90nm and 65nm PTM models, respectively. The capacitance transistor of LIM-MTJ is sized to achieve a voltage swing of approximately 50%  $V_{DD}$ , which ensures that the cross-coupled PMOS is fully turned on to stop the pull-down network from discharging the output "1". For a fair comparison, all three adders are loaded with a fan-out-of-4 output load, and LIM-MTJ and DyCML are designed for the same voltage swing of 50%  $V_{DD}$ , as compared to a full voltage swing in SCMOS. For the lowest possible energy of LIM-MTJ, the stored input is pre-written into the MTJ as a constant value and assumed static during the energy-delay simulations. We assume  $R_P = 1250 \ \Omega$  and TMR = 100%



Figure 3.6: Energy-delay comparison of 1-bit adder implementations in SCMOS, DyCML and LIM-MTJ logic styles with (a) 180nm, (b) 90nm and (c) 65nm model.

for the MTJs as in [19]. The Energy/Op. information for each adder is extracted by measuring the total energy  $(E_{tot})$  over a time interval in which 8 operations are performed, so that Energy/Op. =  $E_{tot}/8$ . The Delay/Op. information is extracted by measuring the worst-case delay of  $V_{DD} - V_{swing}/2$  switching between the input and output on the critical path for LIM-MTJ and DyCML, as compared to  $V_{DD}/2$  switching for SCMOS.

## 3.2.2 Simulation Results and Discussions

Fig. 3.6 shows the EDC results. The plots are normalized to the MEP and MDP of the SCMOS design. Results across all technology nodes indicate the same trend. Both LIM-MTJ and DyCML are better than SCMOS in the energy-delay space. For the 180nm technology, DyCML achieves 10 times higher performance than SCMOS, while LIM-MTJ is about the same as SCMOS. For the 65nm design, both DyCML and LIM-MTJ achieve a 3 times energy reduction compared to SCMOS. It is interesting to note that both LIM-MTJ and DyCML comparatively lose speed but gain an energy reduction with technology scaling. The relative

speed degradation makes sense as we move away from using dynamic logic in high-performance designs today. However, it is important to underscore that DyCML always has a better energy-delay tradeoff than LIM-MTJ, even without considering the switching energy of the MTJ, which is analyzed in the next section. This clearly demonstrates LIM-MTJ to be suboptimal and impractical.

## 3.3 Switching Energy Analysis of MTJ

The plots in Fig. 3.6 show only the best-case energy of LIM-MTJ, where the input stored on the MTJ is assumed a constant and no switching energy of MTJ is considered. This essentially implies an activity factor of zero, which is unrealistic for digital logic. The MTJ switching energy needs to be included in the energy estimates for any practical operation.

### 3.3.1 Modeling the Switching Energy of MTJ

The switching energy  $(E_S)$  of an MTJ is defined as the energy dissipated as heat in the MTJ while a switching current  $(I_S)$  flows through the MTJ stack. This energy is given as

$$E_S = I_S^2 \cdot R \cdot t_s, \tag{3.1}$$

where  $I_S$  can be calculated as the product of the critical current density  $(J_C)$  and the cross-sectional junction area (A),  $I_S = J_C \cdot A$ . The resistance R is calculated using the RA product  $(\delta)$ ,  $R = \delta/A$ , where A is usually proportional to the square of the junction size (L). For example, an MTJ with an ellipse shape (Fig. 1.1) and an aspect ratio (W/L) of 0.5 has  $A = 0.5 \cdot \pi \cdot L^2$ . Therefore, A can be expressed as  $A = K \cdot L^2$ , where K is a constant. Parameter  $t_s$  is the switching time, which can be assumed to be the same as the current pulse width  $\tau$  in Fig. 2.2. Thus, by substituting  $I_S$  and R,  $E_S$  is expressed as

$$E_S = K \cdot J_C^2 \cdot \delta \cdot L^2 \cdot t_s. \tag{3.2}$$
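Written out step by step, the substitution of  $I_S = J_C \cdot A$ ,  $R = \delta/A$  and  $A = K \cdot L^2$  into Eq. 3.1 that leads to Eq. 3.2 is:

$$E_S = (J_C \cdot A)^2 \cdot \frac{\delta}{A} \cdot t_s = J_C^2 \cdot \delta \cdot A \cdot t_s = K \cdot J_C^2 \cdot \delta \cdot L^2 \cdot t_s.$$

One factor of A cancels against the RA product, which is why  $E_S$  grows with the junction area even though the resistance falls with it.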

Recall from Fig. 2.2 that  $J_C$  is a function of  $t_s$  ( $\tau$ ) at each switching probability. We should use the curve for 100% switching probability to analyze the switching energy of the MTJ for practical designs, since switching should always be guaranteed in the writing operation. As suggested by [9],  $J_C$  can be well modeled in three switching regimes separately as

$$J_{C1}(t_s) = J_{C0} \cdot \left[1 - \ln\left(\frac{t_s}{t_0}\right)/\Delta\right], \quad \text{for } t_s > 10 \text{ ns}, \qquad (3.3a)$$

$$J_{C2}(t_s) = J_{C1}(t_s) \cdot \exp(-B_1 \cdot (t_s - 3)) \cdot \frac{10 - t_s}{10 - 3} + J_{C3}(t_s) \cdot \exp(-B_2 \cdot (10 - t_s)) \cdot \frac{t_s - 3}{10 - 3}, \quad \text{for } 3 < t_s \le 10 \text{ ns}, \qquad (3.3b)$$

$$J_{C3}(t_s) = J_{C0} + \frac{C}{t_s}, \quad \text{for } t_s \le 3 \text{ ns}, \qquad (3.3c)$$

where  $J_{C0}$  is the intrinsic critical current density and  $t_0$  is the intrinsic switching time, which is on the order of 1 ns in most cases.  $\Delta = E/(k_B \cdot T)$  is the thermal stability factor, where E is the energy potential between two spin states,  $k_B$  is the Boltzmann constant, and T is the temperature. A thermal stability of  $\Delta \geq 40$  corresponds to a data retention time of approximately ten years or more.  $B_1$  and  $B_2$  in Eq. 3.3b and C in Eq. 3.3c are fitting parameters. By substituting  $J_C$  in Eq. 3.2 with Eq. 3.3,  $E_S$  is now expressed as a function of  $t_s$ , given by

$$E_{S}(t_{s}) = \begin{cases} K \cdot J_{C0}^{2} \cdot \left[1 - \ln\left(\frac{t_{s}}{t_{0}}\right)/\Delta\right]^{2} \cdot \delta \cdot L^{2} \cdot t_{s}, & \text{for } t_{s} > 10 \text{ ns}, \\ K \cdot \left[J_{C1}(t_{s}) \cdot \exp(-B_{1} \cdot (t_{s} - 3)) \cdot \frac{10 - t_{s}}{10 - 3} + J_{C3}(t_{s}) \cdot \exp(-B_{2} \cdot (10 - t_{s})) \cdot \frac{t_{s} - 3}{10 - 3}\right]^{2} \cdot \delta \cdot L^{2} \cdot t_{s}, & \text{for } 3 < t_{s} \le 10 \text{ ns}, \\ K \cdot \left(J_{C0} + \frac{C}{t_{s}}\right)^{2} \cdot \delta \cdot L^{2} \cdot t_{s}, & \text{for } t_{s} \le 3 \text{ ns}. \end{cases} \qquad (3.4)$$



Figure 3.7: Switching energy of MTJ as a function of switching time. For Ref. MTJ,  $\delta = 4.5 \ \Omega \cdot \mu m^2$ ,  $J_{C0} = 5.9 \times 10^6 \ A/cm^2$ ,  $\Delta = 55$  and  $L = 65 \ nm$ . It is based on  $J_C$  modeling for 100% switching probability.

Eq. 3.4 indicates that  $E_S$  of an MTJ depends on  $t_s$  for given MTJ parameters  $\delta$ ,  $J_{C0}$ ,  $\Delta$  and L. Recently developed MTJs are ellipse-shaped with  $\delta$  between 3-20  $\Omega \cdot \mu m^2$ ,  $J_{C0}$  in the range of  $2-7 \times 10^6 \ A/cm^2$ ,  $\Delta$  of 30-70 [7] and L in the range of 50-200 nm. Fig. 3.7 shows  $E_S$  as a function of  $t_s$  for a reference MTJ with parameters  $\delta = 4.5 \ \Omega \cdot \mu m^2$ ,  $J_{C0} = 5.9 \times 10^6 \ A/cm^2$ ,  $\Delta = 55$  and  $L = 65 \ nm$ . The MEP is found to be 1.1 pJ at 8.7 ns in the dynamic reversal region, which indicates that the dynamic reversal region is more energy efficient than the other two switching regions. Precessional switching requires too much current, while thermally activated switching requires too much time. It is interesting to note that the increase in both switching time and energy renders thermally activated switching a suboptimal design region. Similar to the result shown in Fig. 3.7, the minimum writing energies reported in most references [5][19][31][35][36][37] are on the order of 0.1-1 pJ. Considering that the switching energy of a CMOS gate (e.g. in 65-nm technology) is on the order of only a few fJ, the switching energy of an MTJ is about 2-3 orders of magnitude larger than that of a CMOS gate. Taking into account the energy dissipated in the transistor stack due to the MTJ writing current, and the fact that a practical switching current is usually 2-4 times larger than the minimum required switching current, the writing energy  $(E_W)$  of an MTJ should be even higher. As a result, we can conclude that, once the switching energy of the MTJ is considered, a CMOS/MTJ hybrid logic circuit requiring frequent MTJ switching is hardly energy efficient. However, this conclusion must be taken cautiously, since MTJ technology is still in the early stages of development.
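As a rough numerical illustration, the thermally activated branch of Eq. 3.4 (i.e. Eq. 3.3a substituted into Eq. 3.2) can be evaluated for the reference MTJ with a few lines of Python. This is only a sketch under stated assumptions:  $t_0 = 1$  ns (the text gives only its order of magnitude) and  $K = 0.5 \cdot \pi$  (taken from the ellipse example above). The dynamic reversal and precessional branches are omitted because the fitting parameters  $B_1$ ,  $B_2$  and C are not listed here. The function names are hypothetical.

```python
import math


def jc_thermal(t_s_ns, jc0, delta_th, t0_ns=1.0):
    """Eq. 3.3a: critical current density for t_s > 10 ns.

    t0_ns = 1.0 is an assumption (text: "on the order of 1 ns").
    """
    if t_s_ns <= 10.0:
        raise ValueError("Eq. 3.3a applies only for t_s > 10 ns")
    return jc0 * (1.0 - math.log(t_s_ns / t0_ns) / delta_th)


def switching_energy(jc, ra_product, length, t_s, k=0.5 * math.pi):
    """Eq. 3.2 in SI units: E_S = K * J_C^2 * delta * L^2 * t_s (joules).

    k = 0.5 * pi is an assumption taken from the ellipse example.
    """
    return k * jc ** 2 * ra_product * length ** 2 * t_s


# Reference MTJ of Fig. 3.7, converted to SI units.
RA = 4.5e-12       # 4.5 ohm*um^2 expressed in ohm*m^2
JC0 = 5.9e10       # 5.9e6 A/cm^2 expressed in A/m^2
DELTA_TH = 55.0    # thermal stability factor
LENGTH = 65e-9     # junction size L in m


def energy_at(t_s_ns):
    """Switching energy in joules at a pulse width t_s_ns > 10 ns."""
    jc = jc_thermal(t_s_ns, JC0, DELTA_TH)
    return switching_energy(jc, RA, LENGTH, t_s_ns * 1e-9)


if __name__ == "__main__":
    for t_s_ns in (20.0, 50.0, 100.0):
        print(t_s_ns, "ns:", energy_at(t_s_ns) * 1e12, "pJ")
```

Under these assumptions the energy at 20 ns comes out near 1.9 pJ and keeps rising for longer pulses, in line with the observation that thermally activated switching is a suboptimal region.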

## 3.3.2 Scaling Trend

A significant decrease in each MTJ parameter in Eq. 3.2 will help to make the switching energy of the MTJ more competitive with CMOS devices. As indicated by Eq. 3.4,  $E_S$  scales linearly with  $\delta$  and quadratically with  $J_C$  and L. However, there is very little room left for scaling  $\delta$  and  $J_C$ . Scaling  $\delta$  usually relies on thinning the tunnel barrier, which also reduces the breakdown voltage, while scaling  $J_C$  degrades thermal stability. Consequently, significant scaling of the device size L is desired to further scale down  $E_S$ . Future MTJs with parameters of  $\delta \leq 3 \ \Omega \cdot \mu m^2$ ,  $J_{C0} = 0.6 - 1 \times 10^6 \ A/cm^2$  and  $L \leq 20 \ nm$  are expected to exhibit switching energy on the fJ level. Such a scaled device would be very compelling for integration with CMOS for a variety of applications.
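As a rough worked example, assume that  $t_s$  and K stay fixed and that  $J_C$  scales in proportion to  $J_{C0}$ . Eq. 3.2 then gives the energy ratio between a future MTJ at the upper bounds quoted above ( $\delta = 3 \ \Omega \cdot \mu m^2$ ,  $J_{C0} = 1 \times 10^6 \ A/cm^2$ ,  $L = 20 \ nm$ ) and the reference MTJ of Fig. 3.7:

$$\frac{E_{S,future}}{E_{S,ref}} = \frac{3}{4.5} \cdot \left(\frac{1}{5.9}\right)^2 \cdot \left(\frac{20}{65}\right)^2 \approx 1.8 \times 10^{-3}.$$

Applied to the 1.1 pJ minimum of the reference device, this corresponds to roughly 2 fJ, consistent with the fJ-level estimate.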

# CHAPTER 4

# Energy-Performance Characterization of MTJ Reading Circuits

In most CMOS/MTJ hybrid circuits, MTJs are used as storage elements. The writing and reading operations are carried out by CMOS transistors. Thus, the design of writing and reading circuits is a crucial task in the design of CMOS/MTJ hybrid circuits. The energy and performance of MTJ writing circuits are less commonly considered since writing is limited by the switching energy and time of the MTJ. On the other hand, great demands for high-performance and low-energy operation have been put on the design of MTJ reading circuits. Many reading circuits [10][11][38] use current-mirror sense amplifiers (CMSA) to sense and compare the reading current with a reference to read out the data. In this chapter, we present a better MTJ reading circuit utilizing the positive feedback of cross-coupled inverters (XINV). Our simulation results show that it achieves a 4 times higher performance and 30 times lower energy as compared to a CMSA-based reading circuit.



Figure 4.1: Illustration of the reading operation of a CMSA-based reading circuit.

## 4.1 Circuit Architecture

## 4.1.1 CMSA-Based Reading Circuit

The idea of using CMSA to read out data on an MTJ is based on current sensing. Since the resistive states of MTJ can be reflected by reading currents  $(I_R)$  flowing through it  $(I_{RP} \text{ for } R_P \text{ and } I_{RAP} \text{ for } R_{AP})$ , CMSA is used to sense the reading current and compare it with a reference current  $I_{ref} = (I_{RP} + I_{RAP})/2$ . The difference between  $I_R$  and  $I_{ref}$  will charge or discharge the output so that a voltage difference between the output and reference node can be captured and amplified by a sense amplifier to get the data read out.

Fig. 4.1 shows a general structure of a CMSA-based reading circuit. In this example, two MTJs are read at a time due to the symmetric design. Two reference resistors are used to provide  $I_{ref}$ ,  $R_{ref0} = R_P$  and  $R_{ref1} = R_{AP}$ . Since all PMOS transistors are biased by  $V_{ref}$  and the middle two branches are connected,  $I_1$ ,  $I_2$ ,  $I_3$  and  $I_4$  will always end up with  $I_1 = I_2 = I_3 = I_4 = I_{ref} = (I_{RP} + I_{RAP})/2$ . Thus,  $I_{ref}$  is mirrored to  $I_1$  and  $I_4$ . Similarly, since all NMOS transistors are biased by  $V_{bias}$ ,  $I_{RP/RAP}$  will be mirrored to  $I_5$  and  $I_6$  based upon the resistive



Figure 4.2: Illustration of the reading operation of an XINV-based reading circuit.

states of MTJs  $(R_{P/AP})$ . If  $I_{5/6} > I_{1/4}$ ,  $V_{MTJ0/1}$  will be discharged and a negative  $(V_{MTJ0/1} - V_{ref})$  will be sensed and amplified by the sense amplifiers to output a "0". If  $I_{5/6} < I_{1/4}$ ,  $V_{MTJ0/1}$  will be charged and a positive  $(V_{MTJ0/1} - V_{ref})$  will be sensed and amplified by sense amplifiers to output a "1".
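The decision rule described above can be summarized in a purely behavioral Python sketch. It assumes ideal current mirrors and a fixed read voltage across the junction; the function name `cmsa_read` and the default `v_read` value are hypothetical, and the resistances in the example are those of Table 2.1.

```python
def cmsa_read(r_mtj, r_p, r_ap, v_read=0.1):
    """Behavioral model of the CMSA current comparison; returns the read bit.

    v_read is a hypothetical read voltage across the MTJ (assumption).
    """
    # Reference current: I_ref = (I_RP + I_RAP) / 2
    i_ref = 0.5 * (v_read / r_p + v_read / r_ap)
    i_read = v_read / r_mtj
    # I_R > I_ref discharges V_MTJ -> "0"; I_R < I_ref charges it -> "1".
    return 0 if i_read > i_ref else 1


if __name__ == "__main__":
    print(cmsa_read(700.0, 700.0, 1400.0))   # parallel state
    print(cmsa_read(1400.0, 700.0, 1400.0))  # anti-parallel state
```

In this idealized picture the parallel state reads as "0" and the anti-parallel state as "1"; the sketch ignores mismatch and the dynamics of the sense amplifier.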

CMSA-based reading circuits are slow and power hungry because their critical paths involve at least 2 stages - current sensing and amplification. Both stages consume DC currents, resulting in constant static power, which greatly limits the energy efficiency.

#### 4.1.2 XINV-Based Reading Circuit

The basic principle of reading data from an MTJ in an XINV-based reading circuit is similar to that of a CMSA-based reading circuit. The difference is that in an XINV-based reading circuit, the sensing voltage difference is generated and amplified within the same stage in parallel. Also, no static power is consumed during operations.

Fig. 4.2 shows a simplified structure of an XINV-based reading circuit. It



Figure 4.3: Waveform of XINV-based reading circuit.

utilizes a differential dynamic logic structure, therefore requiring a CLK signal. When CLK="0" (pre-charge phase), both D0 and D1 are pre-charged to  $V_{DD}$ . When CLK="1" (evaluation phase), both pull-down networks start to discharge D0 and D1. A small voltage difference V(D0/1) - V(D1/0) will be generated due to the resistance difference between the MTJ and  $R_{ref}$ , where  $R_{ref} = (R_P + R_{AP})/2$ , and this difference will be directly amplified by the cross-coupled inverters.

Fig. 4.3 shows an example waveform of an XINV-based reading circuit. Both D0 and D1 are discharged at the beginning of the evaluation phase. When  $R_{AP}$  is read, D0 is pulled down by a stronger branch ( $I_{ref} > I_{RAP}$ ), which turns on the PMOS transistor on the opposite branch and stops D1 from being pulled down. Thus, a voltage difference between D0 and D1 is amplified and digitized into differential outputs due to the positive feedback of the cross-coupled inverters.

## 4.2 Energy-Performance Comparison

We use the same method as introduced in Section 3.2.1 – an EDC comparison – to compare the two reading circuits. Also, a stability study is conducted to figure out which reading circuit is less sensitive to mismatch and can handle a lower TMR ratio.

#### 4.2.1 Simulation Setup

To compare the energy and performance, the reading of a 1 kb MTJ array is performed with both reading circuits in Cadence simulations using a 90-nm CMOS technology model and the compact Verilog-A MTJ model presented in Chapter 2. The assumed MTJ characteristics are shown in Table 2.1. For fair comparison, both reading circuits read one MTJ at a time. Twenty reads are performed for each circuit with half "0"s ( $R_P$ ) and half "1"s ( $R_{AP}$ ).

Unlike what has been suggested in [10], the reading current direction chosen in this study is the anti-parallelizing direction, in which read currents flow from the free layer to the fixed layer. This direction causes no disturbance to the  $R_{AP}$  state, so that only accidental switching from P to AP can occur during reading. We choose it because our MTJ has highly skewed switching currents: by using the anti-parallelizing direction,  $I_R$  is constrained by the higher bound  $I_{S(P->AP)}$  rather than by  $I_{S(AP->P)}$  as in the parallelizing case. To avoid accidental switching, transistors are sized in the XINV-based reading circuit and  $V_{bias} = 0.7V$  is adopted in the CMSA-based reading circuit so that  $I_{RP} < 0.5 \cdot I_{S(P->AP)}$ .

The EDCs of both reading circuits, plotted by sweeping  $V_{DD}$  from 0.8V to 1.5V at each TMR ratio, are compared. To compare the stability, the read error



Figure 4.4: Energy-delay comparison between XINV-based and CMSA-based reading circuit at various TMR Ratios.

rates of performing 800 reads at each TMR ratio are studied for both circuits through Monte Carlo simulations with all possible mismatches and process corners considered.

### 4.2.2 Simulation Results and Discussions

Fig. 4.4 shows the EDC results. Overall, the XINV-based reading circuit outperforms the CMSA-based reading circuit in the energy-performance space, with 30 times lower energy and 4 times higher performance. The results also show that the XINV-based reading circuit is less sensitive to a TMR drop. With the TMR ratio changing from 145% to 25%, the CMSA-based reading circuit suffers a 2 times performance degradation, while the XINV-based reading circuit has only about 20% performance degradation. Therefore, the XINV-based reading circuit is more robust against the TMR drop induced by higher temperature and changing bias voltage as shown in Fig. 2.3. With  $V_{DD} = 1.3V$  and  $TMR = 100\%$ , a read time of < 500 ps can be achieved by the XINV-based reading circuit, which shows promise for operating frequencies on the order of GHz.

Figure 4.5: Instant power comparison between XINV-based and CMSA-based reading circuits.

Fig. 4.5 shows the instant power of the two reading circuits while reading the MTJ array. It is clearly shown that the CMSA-based reading circuit has a high static power consumption due to the DC currents in the CMSA and sense amplifiers. These DC currents constantly burn power regardless of whether the circuit is reading or not. Turning off the CMSA and sense amplifiers by shutting down the bias voltage when they are idle can save stand-by power. However, the long setup time needed for them to recover from off-mode makes it impractical to shut



Figure 4.6: Read error rate comparison between XINV-based and CMSA-based reading circuit at various TMR Ratios.

them down between two adjacent reads, so energy is wasted. In contrast, the XINV-based reading circuit only consumes dynamic power while doing useful reads, which explains why it is much more energy efficient.

Fig. 4.6 shows the results from Monte Carlo simulations. According to the results, the XINV-based reading circuit is less reliable than the CMSA-based reading circuit. Both circuits exhibit a zero read error rate beyond a TMR ratio of 50%. The CMSA-based reading circuit can tolerate an even lower TMR ratio (30%) with zero error rate. Since TMR ratios of practical MTJs are usually in the range of 100%-150% and keep improving, the XINV-based reading circuit is reliable enough for most MTJ applications.

# CHAPTER 5

# Energy-Performance Characterization of CMOS/MTJ Hybrid Look-Up Table Based Logic Architectures

In Chapter 3, we concluded that, because of the switching energy of the MTJ, a CMOS/MTJ hybrid logic circuit requiring frequent MTJ switching is not very energy efficient. From this point of view, the look-up table (LUT) based logic style, which requires no switching of the memory cell during logic operations, becomes an attractive architecture for CMOS/MTJ hybrid logic. In this chapter, two CMOS/MTJ hybrid LUT based architectures utilizing the XINV-based reading circuit (Section 4.1.2) are proposed. Their energy-performance characterization and comparison with CMOS-LUT are also presented.

## 5.1 Circuit Architecture

### 5.1.1 CMOS-LUT

The LUT is the basic element of many reconfigurable logic applications, such as FPGAs [39]–[45]. As its name suggests, the basic concept of LUT based reconfigurable logic is to store the truth table of a binary logic function in storage cells and read out the data based on input selections. Thus, different logic functions can be

implemented with the same hardware with different configurations of storage cells. The data stored in the storage cells are called configuration bits. For the same logic functionality, configuration is only needed once. No configuration is required in operation mode.

For a CMOS look-up table (CMOS-LUT), storage cells are usually SRAM cells or register files. According to a previous study [46], a register file with a scan chain has a better trade-off in the energy-performance-area space than an SRAM with a decoder for ASIC synthesis of small memories. Therefore, in this study, we use a flip-flop based CMOS-LUT as the reference for comparison.

Fig. 5.1 shows the structure of two 3-input CMOS-LUTs, which can be configured to implement a 1-bit full adder. Each 3-input LUT consists of a storage block with eight D flip-flops and a logic block with two 4-1 muxes and one 2-1 mux. In total, 672 transistors are needed to implement two 3-input LUTs in a CMOS standard cell realization.

When EN="1", configuration bits can be shifted into the storage block from DIN through the scan chain. When EN="0", the configuration bits are latched, and can be read out through muxes based upon the 3-bit input selection, S[2:0]. CMOS-LUT is volatile, which means the functionality will be lost when the power supply turns off, and reconfiguration is required each time the power supply is turned on.
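The LUT concept can be illustrated with a small behavioral Python sketch of two 3-input LUTs configured as a 1-bit full adder. The mapping of the adder inputs A, B and  $C_i$  onto S[2], S[1] and S[0] is an assumption made for illustration, as are the function names; the sketch models only the truth-table lookup, not the scan chain or the flip-flops.

```python
def configure_full_adder():
    """Return (sum_bits, carry_bits): two 8-entry truth tables indexed by S[2:0].

    Assumed input mapping: A -> S[2], B -> S[1], C_i -> S[0].
    """
    sum_bits, carry_bits = [], []
    for index in range(8):
        a, b, c = (index >> 2) & 1, (index >> 1) & 1, index & 1
        sum_bits.append(a ^ b ^ c)
        carry_bits.append((a & b) | (b & c) | (a & c))
    return sum_bits, carry_bits


def lut_read(config_bits, s):
    """Model the mux tree: two 4-1 muxes followed by one 2-1 mux."""
    low = config_bits[0:4][s & 0b11]      # first 4-1 mux, selected by S[1:0]
    high = config_bits[4:8][s & 0b11]     # second 4-1 mux, selected by S[1:0]
    return high if (s >> 2) & 1 else low  # 2-1 mux, selected by S[2]


if __name__ == "__main__":
    sum_lut, carry_lut = configure_full_adder()
    for s in range(8):
        print(format(s, "03b"), lut_read(sum_lut, s), lut_read(carry_lut, s))
```

Loading a different pair of 8-bit tables would implement a different 3-input function on the same hardware, which is the reconfigurability that the configuration bits provide.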

## 5.1.2 CMOS/MTJ Hybrid LUT

One simple way to build a CMOS/MTJ hybrid LUT is to replace all of the D flip-flops with an XINV-based reading circuit and some writing circuits as shown in Fig. 5.2.

In this architecture, each element "READ1XMTJ" is an XINV-based MTJ



Figure 5.1: Architecture of CMOS-LUT. This example includes two 3-input LUTs.

reading circuit as shown in Fig. 5.3. For each 3-input LUT, the logic block is the same as a CMOS-LUT, but the storage block is replaced by STT-RAM consisting of 8 "READ1XMTJ" blocks and a simple writing circuit which is composed of transmission gates and buffers. In total, 448 transistors and 16 MTJs are needed in this realization of two 3-input LUTs. We denote this architecture (Fig. 5.2) as Hybrid-LUT1.

It is important to note that in Hybrid-LUT1, the MTJs are the storage cells,



Figure 5.2: Architecture of Hybrid-LUT1. This example includes two 3-input LUTs.

and all of the reading circuits read out the configuration bits in parallel only once at the beginning of operation mode, of which, only 1 bit is selected as the output at a time. Therefore, we can further increase the utilization rate of the reading circuits by MTJ sharing. For each 3-input LUT, the same reading circuit can be shared by all of the MTJs so that the density can be significantly increased. Based upon this idea, another CMOS/MTJ hybrid LUT architecture, denoted as Hybrid-LUT2, is proposed as shown in Fig. 5.4.

Hybrid-LUT2 is a completely different LUT architecture. In this architecture,



Figure 5.3: Schematic of "READ1XMTJ" block

each element "READ8XMTJ" is an XINV-based reading circuit shared by 8 MTJs as shown in Fig. 5.5. For each 3-input LUT, the storage block has only 1 XINV-based reading circuit along with the writing circuit. Instead of muxes, an 8-word decoder is used in the logic block to access the corresponding MTJ for reading based on input selections. The decoding process is carried out in the pre-charge phase of the reading operation so that the delay overhead is minimized. Since the XINV-based reading circuit has a differential output, a mux is needed to select the correct output. In the schematic in Fig. 5.5, OUT0 is the output when MTJ0 to MTJ3 are accessed, while OUT1 is the output when MTJ4 to MTJ7 are accessed. A CLK signal is needed in the operation mode, since the reading circuit has to update the data by reading a different MTJ every time the input changes. Realizing two 3-input LUTs in Hybrid-LUT2 requires only 174 transistors and 16 MTJs.

Table 5.1 summarizes the device count of two 3-input LUTs implemented in CMOS-LUT, Hybrid-LUT1 and Hybrid-LUT2. In the CMOS-LUT realization, the storage block accounts for about 80% of the total transistor count. From CMOS-LUT to Hybrid-LUT1, the transistor count is reduced by one third due to the



Figure 5.4: Architecture of Hybrid-LUT2. This example includes two 3-input LUTs.

use of MTJs. In the Hybrid-LUT2 implementation, a total reduction of about 75% in transistor count is achieved due to MTJ sharing. Because the MTJs sit on top of the CMOS devices and therefore consume no additional area, the proposed CMOS/MTJ hybrid LUTs should have a smaller area cost than CMOS-LUT.



Figure 5.5: Schematic of "READ8XMTJ" block

| Structure   | Transistors (Logic) | Transistors (Storage) | Transistors (Writing) | MTJs | Total      |
|-------------|---------------------|-----------------------|-----------------------|------|------------|
| CMOS-LUT    | 128                 | 544                   | N/A                   | 0    | 672T       |
| Hybrid-LUT1 | 128                 | 240                   | 80                    | 16   | 448T+16MTJ |
| Hybrid-LUT2 | 80                  | 46                    | 48                    | 16   | 174T+16MTJ |

Table 5.1: Summary of Device Count
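The percentages quoted for Table 5.1 follow directly from the device counts. The short sketch below recomputes them; the dictionary layout and helper names are illustrative only, and the "N/A" writing entry of CMOS-LUT is represented as 0.

```python
# Device counts copied from Table 5.1 (two 3-input LUTs per structure).
device_count = {
    "CMOS-LUT":    {"logic": 128, "storage": 544, "writing": 0,  "mtj": 0},   # writing is N/A
    "Hybrid-LUT1": {"logic": 128, "storage": 240, "writing": 80, "mtj": 16},
    "Hybrid-LUT2": {"logic": 80,  "storage": 46,  "writing": 48, "mtj": 16},
}


def transistors(name: str) -> int:
    """Total transistor count of one structure (MTJs are not transistors)."""
    row = device_count[name]
    return row["logic"] + row["storage"] + row["writing"]


def reduction_vs_cmos(name: str) -> float:
    """Fractional transistor-count reduction relative to CMOS-LUT."""
    return 1.0 - transistors(name) / transistors("CMOS-LUT")


storage_share = device_count["CMOS-LUT"]["storage"] / transistors("CMOS-LUT")
print(f"CMOS-LUT storage share: {storage_share:.1%}")                         # 81.0%
print(f"Hybrid-LUT1 reduction:  {reduction_vs_cmos('Hybrid-LUT1'):.1%}")      # 33.3%
print(f"Hybrid-LUT2 reduction:  {reduction_vs_cmos('Hybrid-LUT2'):.1%}")      # 74.1%
```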

## 5.2 Energy-Performance Comparison

### 5.2.1 Simulation Setup

In order to compare the energy and performance of the circuits, two 3-input LUTs are implemented in CMOS-LUT, Hybrid-LUT1 and Hybrid-LUT2 in HSPICE using 90-nm, 65-nm, 45-nm and 32-nm PTM models and the MTJ model presented in Chapter 2. The MTJ characteristics used in simulation are shown in Table 2.1. Both configuration and logic operations are simulated. All LUTs are first configured into a 1-bit full adder, and then operate at different frequencies.



Figure 5.6: Illustrations of power gating in idle mode. (a) CMOS-LUT, (b) CMOS/MTJ hybrid LUT.

$V_{DD} = 1$ V for all technology nodes. The configuration energy, delay, leakage power, energy per operation, and stand-by power are simulated, plotted, and compared. The configuration energy is extracted by measuring the total energy in configuration mode, in which 16 bits are configured. The configuration of CMOS-LUT is carried out by a scan chain running at 1 GHz, while the configuration of the hybrid LUTs runs at 250 MHz (a 4 ns period, longer than the 3 ns MTJ switching time). The delay is extracted by measuring the worst-case delay from all inputs to all outputs in operation mode. The total leakage power is measured during logic operations, and is broken down into two parts: logic and storage. The energy per operation is calculated as the total energy consumed over a time interval in which 8 input vectors are applied, divided by 8; the test is run at 100 MHz, 250 MHz and 500 MHz. The energy per operation is broken down into four categories: logic leakage, logic dynamic, storage leakage and storage dynamic.
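To make the 16 configured bits concrete, the sketch below derives the truth tables that two 3-input LUTs would hold when configured as a 1-bit full adder: 8 bits for the sum output and 8 bits for the carry output. The bit ordering (inputs enumerated from 000 to 111 with `a` as the most significant bit) and the function name are assumptions; the ordering used in the actual simulation is not stated here.

```python
from itertools import product
from typing import List, Tuple


def full_adder_config() -> Tuple[List[int], List[int]]:
    """Return (sum_bits, carry_bits), one entry per input vector (a, b, cin)."""
    sum_bits, carry_bits = [], []
    for a, b, cin in product((0, 1), repeat=3):
        sum_bits.append(a ^ b ^ cin)                        # sum = a XOR b XOR cin
        carry_bits.append((a & b) | (a & cin) | (b & cin))  # carry = majority(a, b, cin)
    return sum_bits, carry_bits


sum_bits, carry_bits = full_adder_config()
print(sum_bits)                         # [0, 1, 1, 0, 1, 0, 0, 1]
print(carry_bits)                       # [0, 0, 0, 1, 0, 1, 1, 1]
print(len(sum_bits) + len(carry_bits))  # 16 configuration bits in total
```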

In order to minimize the stand-by power, dynamic sleep transistors are used to gate the power supply in simulations as shown in Fig. 5.6. Note that the storage block of CMOS-LUT cannot be gated in idle mode because SRAM cells



Figure 5.7: Configuration energy comparison between CMOS-LUT, Hybrid-LUT1 and Hybrid-LUT2.

are volatile. In the hybrid LUTs, however, both storage and logic blocks can be power gated due to the non-volatility of MTJs, which reduces the leakage further. The sleep transistors are sized so that the maximum voltage drop across them is less than 5% of $V_{DD}$ in active mode. Stand-by power is measured as the average power of the whole circuit when SLEEP = "1".
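The 5% sizing rule can be read as an upper bound on the on-resistance of the sleep transistor. The sketch below is a first-order estimate that treats the device as a linear resistor carrying the peak active current; the function name and the 100 µA peak current are hypothetical values chosen for illustration, not figures from the simulations.

```python
def max_sleep_on_resistance(vdd: float, i_active_peak: float,
                            max_drop_fraction: float = 0.05) -> float:
    """Largest on-resistance (ohms) that keeps the drop below the given fraction of VDD.

    Uses V_drop = I * R_on <= max_drop_fraction * VDD (linear-resistor assumption).
    """
    if i_active_peak <= 0:
        raise ValueError("peak active current must be positive")
    return max_drop_fraction * vdd / i_active_peak


# Hypothetical example: VDD = 1 V and an assumed 100 uA peak active current.
r_on_max = max_sleep_on_resistance(vdd=1.0, i_active_peak=100e-6)
print(f"R_on must stay below about {r_on_max:.0f} ohms")
```

A wider sleep transistor lowers the on-resistance but also leaks more in sleep mode, so the bound is usually met with little margin.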

### 5.2.2 Simulation Results and Discussions

Fig. 5.7 shows the configuration energy comparison result. The configuration energy of Hybrid-LUT1 and Hybrid-LUT2 is about 10 times larger than that of CMOS-LUT, and it varies little with technology scaling. This is because the configuration energy of Hybrid-LUT1 and Hybrid-LUT2 is dominated by the high switching energy of the MTJs, a result that matches our switching energy analysis in Section 3.3. It is also noted that from the 45-nm to 32-nm



Figure 5.8: Delay comparison between CMOS-LUT, Hybrid-LUT1 and Hybrid-LUT2.

technology node, the configuration energy of the CMOS-LUT increases rather than decreases. This is because the increased leakage power exceeds the dynamic power and becomes the dominant component of power consumption at the 32-nm technology node. Since configuration is required only once for a given logic function, the switching energy overhead of the MTJs in Hybrid-LUT1 and Hybrid-LUT2 can be ignored.

Fig. 5.8 shows the delay comparison result. CMOS-LUT and Hybrid-LUT1 have exactly the same logic block structure, and therefore have the same delay, which is just the gate delay of two muxes. The delay of Hybrid-LUT2 comes from an 8-word decoder plus an XINV-based reading circuit, resulting in about 1.5 times delay overhead.

Fig. 5.9 presents the leakage power comparison result. In CMOS-LUT and Hybrid-LUT1, more than 70% of the leakage power is from storage cells, which



Figure 5.9: Leakage power comparison between CMOS-LUT, Hybrid-LUT1 and Hybrid-LUT2.

is also reflected in the transistor count shown in Table 5.1. It is noted that in CMOS-LUT, leakage power is significantly reduced from the 65-nm to the 45-nm technology node. This is because, in the PTM models, the 90-nm and 65-nm models are bulk technology, while the 45-nm and 32-nm models are silicon-on-insulator (SOI) technology. Therefore, the leakage drop is due to the switch from bulk to SOI. However, at 32-nm, even with SOI technology, the leakage power jumps by about 10 times and becomes much larger than at 65-nm, arising from the exponential nature of the leakage problem [13]. Overall, Hybrid-LUT2 is shown to exhibit 5 times lower leakage power than CMOS-LUT, which comes from a 10 times reduction in storage leakage and a 2 times reduction in logic leakage.



Figure 5.10: Operation energy (100 MHz) comparison between CMOS-LUT, Hybrid-LUT1 and Hybrid-LUT2.

The 10 times storage leakage reduction results from the fact that configuration bits are stored on MTJs in magnetic form, so that no power supply is needed to retain the data once the MTJs are programmed, which significantly reduces the leakage power and improves the energy efficiency. It can also be explained by the significant transistor count reduction (Table 5.1) due to the use of MTJs.
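The overall 5 times figure is consistent with the two component reductions once the share of storage leakage is taken into account. The sketch below combines them; the 75% storage share is an assumed value within the "more than 70%" range stated for CMOS-LUT, and the function name is illustrative.

```python
def total_leakage_reduction(storage_share: float,
                            storage_factor: float = 10.0,
                            logic_factor: float = 2.0) -> float:
    """Overall leakage reduction when storage and logic leakage shrink by different factors.

    storage_share is the fraction of the original leakage that comes from storage.
    """
    if not 0.0 <= storage_share <= 1.0:
        raise ValueError("storage_share must be between 0 and 1")
    remaining = storage_share / storage_factor + (1.0 - storage_share) / logic_factor
    return 1.0 / remaining


# With an assumed 75% storage share: 0.75/10 + 0.25/2 = 0.2 of the original leakage remains.
print(f"{total_leakage_reduction(0.75):.2f}x")  # 5.00x
# At exactly 70% the combined reduction is a little lower.
print(f"{total_leakage_reduction(0.70):.2f}x")  # 4.55x
```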

Figure 5.11: Operation energy (250 MHz) comparison between CMOS-LUT, Hybrid-LUT1 and Hybrid-LUT2.

Figure 5.12: Operation energy (500 MHz) comparison between CMOS-LUT, Hybrid-LUT1 and Hybrid-LUT2.

| Structure   | 90 nm ($\mu W$) | 65 nm ($\mu W$) | 45 nm ($\mu W$) | 32 nm ($\mu W$) |
|-------------|-----------------|-----------------|-----------------|-----------------|
| CMOS-LUT    | 6.5             | 12.8            | 3.3             | 29.9            |
| Hybrid-LUT1 | 1.7             | 1.8             | 0.47            | 1.0             |
| Hybrid-LUT2 | 0.84            | 0.63            | 0.2             | 0.23            |

Table 5.2: Summary of Stand-By Power at Each Technology Node

Fig. 5.10, Fig. 5.11 and Fig. 5.12 show the operation energy comparison results at different frequencies. For both CMOS-LUT and Hybrid-LUT1, storage leakage dominates the total energy consumption, especially at lower frequencies and at smaller technology nodes. Therefore, Hybrid-LUT2, with 10 times smaller storage leakage power, exhibits lower total energy, especially at lower frequencies and at smaller technology nodes. For example, at 100 MHz and the 32-nm node, Hybrid-LUT2 exhibits 4 times lower total energy than CMOS-LUT. It is important to note that Hybrid-LUT2 has a storage dynamic energy overhead which does not exist in CMOS-LUT and Hybrid-LUT1. This deteriorates its energy efficiency at higher frequencies and in larger technology nodes, where dynamic energy components consume the most energy. For example, at 500 MHz and the 90-nm node, Hybrid-LUT2 even has a larger total energy because of the dynamic energy overhead from the storage block. However, it is shown that as technology scales down, the storage dynamic energy overhead decreases and leakage energy increases dramatically. Therefore, the leakage advantage of Hybrid-LUT2 will become increasingly significant with technology scaling.
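The frequency dependence discussed above can be illustrated with a first-order model in which each operation costs a fixed dynamic energy plus the leakage power integrated over one clock period. The sketch below is an illustration under that assumption only: the function names are hypothetical, and the numerical values are made up to show the trend, not taken from the simulation results.

```python
def energy_per_operation(total_energy_j: float, n_vectors: int = 8) -> float:
    """Measured energy per operation: total energy over the interval divided by the vector count."""
    return total_energy_j / n_vectors


def modeled_energy_per_op(e_dynamic_j: float, p_leak_w: float, freq_hz: float) -> float:
    """First-order model: dynamic energy plus leakage power times one clock period."""
    return e_dynamic_j + p_leak_w / freq_hz


def crossover_frequency(e_dyn_a: float, p_leak_a: float,
                        e_dyn_b: float, p_leak_b: float) -> float:
    """Frequency at which designs A and B consume the same energy per operation."""
    return (p_leak_a - p_leak_b) / (e_dyn_b - e_dyn_a)


# Hypothetical designs: A has low dynamic energy but high leakage (CMOS-LUT-like),
# B has extra dynamic energy but 5 times lower leakage (Hybrid-LUT2-like).
A = {"e_dyn": 10e-15, "p_leak": 10e-6}
B = {"e_dyn": 25e-15, "p_leak": 2e-6}

for f in (100e6, 250e6, 500e6, 1e9):
    ea = modeled_energy_per_op(A["e_dyn"], A["p_leak"], f)
    eb = modeled_energy_per_op(B["e_dyn"], B["p_leak"], f)
    print(f"{f / 1e6:6.0f} MHz: A = {ea * 1e15:6.1f} fJ, B = {eb * 1e15:6.1f} fJ")

f_cross = crossover_frequency(A["e_dyn"], A["p_leak"], B["e_dyn"], B["p_leak"])
print(f"crossover near {f_cross / 1e6:.0f} MHz for these made-up numbers")
```

In this model the low-leakage design wins below the crossover frequency and loses above it, which mirrors the qualitative behavior described for Hybrid-LUT2; as leakage grows with scaling, the crossover moves to higher frequencies.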

Table 5.2 summarizes the stand-by power of all three LUT architectures. The stand-by power of CMOS-LUT is limited by the leakage power of the storage block, which is not power gated (Fig. 5.6 (a)). In contrast, the stand-by power of Hybrid-LUT1 and Hybrid-LUT2 is significantly reduced due to complete power gating (Fig. 5.6 (b)). This clearly demonstrates that by introducing the non-volatility of MTJs into CMOS circuits, stand-by power can be further reduced.
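The size of the stand-by power reduction at each node can be read off Table 5.2 as a ratio to CMOS-LUT. The sketch below computes these ratios from the tabulated values; the dictionary layout and function name are illustrative.

```python
# Stand-by power in microwatts, copied from Table 5.2.
standby_uw = {
    "CMOS-LUT":    {"90nm": 6.5,  "65nm": 12.8, "45nm": 3.3,  "32nm": 29.9},
    "Hybrid-LUT1": {"90nm": 1.7,  "65nm": 1.8,  "45nm": 0.47, "32nm": 1.0},
    "Hybrid-LUT2": {"90nm": 0.84, "65nm": 0.63, "45nm": 0.2,  "32nm": 0.23},
}


def standby_reduction(structure: str, node: str) -> float:
    """Ratio of CMOS-LUT stand-by power to that of the given structure at one node."""
    return standby_uw["CMOS-LUT"][node] / standby_uw[structure][node]


for node in ("90nm", "65nm", "45nm", "32nm"):
    r1 = standby_reduction("Hybrid-LUT1", node)
    r2 = standby_reduction("Hybrid-LUT2", node)
    print(f"{node}: Hybrid-LUT1 {r1:5.1f}x lower, Hybrid-LUT2 {r2:5.1f}x lower")
```

For the tabulated values the ratio for Hybrid-LUT2 grows from under 8 at 90 nm to about 130 at 32 nm.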

Finally, Fig. 5.13 summarizes the comparison result of the three LUT architectures.

| Metric                | CMOS-LUT               | Hybrid-LUT1  | Hybrid-LUT2  |
|-----------------------|------------------------|--------------|--------------|
| Volatility            | Volatile               | Non-Volatile | Non-Volatile |
| Config. Energy        | Low                    | High         | High         |
| Speed                 | High                   | High         | Medium       |
| Leakage Power         | High                   | High         | Low          |
| Stand-by power        | Storage Leakage        | Minimum      | Minimum      |
| Boot-up from off mode | Reconfiguration needed | Instant      | Instant      |

Figure 5.13: Summary of LUT architectures.

# CHAPTER 6

## Conclusions

## 6.1 Summary of Research Contributions

In summary, the main contribution of this work is an analysis and comparison of the energy-performance characteristics of a wide range of CMOS/MTJ hybrid circuits at the device, circuit and architectural levels. This includes device switching energies, the LIM-MTJ logic circuit, two MTJ reading circuits and two CMOS/MTJ hybrid LUT architectures. The analysis reveals that CMOS/MTJ hybrid LUT-based logic architectures, which require no MTJ switching during logic operations, are the most suitable structure for CMOS/MTJ hybrid logic. The comparison between different circuits quantifies the advantages and improvements that can be achieved.

Our analysis results show that the existing LIM-MTJ logic style has no advantage in energy-performance over its equivalent CMOS design. The switching energy of MTJs is on the order of 0.1 to 1 pJ, which is about 2-3 orders of magnitude larger than that of a CMOS gate. Significant scaling of MTJ parameters, targeting values such as $\delta \leq 3 \ \Omega \cdot \mu m^2$, $J_{C0} = (0.6\text{-}1) \times 10^6 \ A/cm^2$ and $L \leq 20$ nm, is expected in the future to achieve a switching energy at the fJ level. Devices with such scaling would be very compelling for integration with CMOS for a variety of applications. Currently, however, we conclude that with the present switching energy of MTJs, a CMOS/MTJ hybrid logic circuit requiring frequent MTJ switching is hardly energy efficient.
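A rough order-of-magnitude check of the fJ-level target can be made with a simple Joule-heating estimate, $E \approx I^2 R \, t$. This is a sketch under several assumptions and is not the switching energy analysis of Section 3.3: the parameter quoted in $\Omega \cdot \mu m^2$ is read as a resistance-area product (an assumption based on its unit), the junction is taken as a square of side $L$, the switching current is taken as $J_{C0}$ times the area, and a 3 ns pulse is assumed. The function name is hypothetical.

```python
def joule_switching_energy(ra_ohm_um2: float, jc_a_per_cm2: float,
                           side_nm: float, pulse_s: float) -> float:
    """Rough switching energy in joules from E = I^2 * R * t (all inputs are assumptions)."""
    area_um2 = (side_nm * 1e-3) ** 2             # square junction, nm -> um
    current_a = jc_a_per_cm2 * 1e-8 * area_um2   # 1 cm^2 = 1e8 um^2
    resistance_ohm = ra_ohm_um2 / area_um2       # R = RA / area
    return current_a ** 2 * resistance_ohm * pulse_s


# Target values quoted above, with an assumed 3 ns pulse width.
energy_j = joule_switching_energy(ra_ohm_um2=3.0, jc_a_per_cm2=1e6,
                                  side_nm=20.0, pulse_s=3e-9)
print(f"estimated switching energy: {energy_j * 1e15:.2f} fJ")
```

With these inputs the estimate lands below 1 fJ, in line with the fJ-level expectation; actual devices switched with nanosecond pulses typically need currents above the critical value, so this should be read as a lower-bound style estimate.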

For MTJ reading, our simulation results show that the XINV-based MTJ reading circuit outperforms the CMSA-based reading circuit, with 4 times higher performance and 30 times lower energy. However, the CMSA-based reading circuit is more reliable and can handle a lower TMR ratio.

The proposed CMOS/MTJ hybrid LUT based logic architectures, which require no MTJ switching during logic operations, have been shown to be capable of utilizing the MTJ's non-volatility to alleviate the leakage problem of the CMOS-LUT, especially in advanced technology nodes (32-nm and beyond), as well as supporting instant boot-up from off mode.

## 6.2 Future Work

So far, our analysis for CMOS/MTJ hybrid LUT architectures is based upon the LUT block only. For future work, suitable applications for the proposed architectures, such as hybrid LUT based non-volatile FPGA, will be investigated. Improvements and advantages at the system level will be evaluated.

## References

- [1] Victor V. Zhirnov, Ralph K. Cavin, III, James A. Hutchby and George I. Bourianoff, "Limits to Binary Logic Switch Scaling – A Gedanken Model", *Proceedings of the IEEE*, vol. 91, no. 11, pp. 1934-1939, 2003.
- [2] (2007), International Technology Roadmap for Semiconductors (ITRS), process integration, devices and structures. Available: http://www.itrs.net/Links/2007ITRS/2007\_Chapters/2007\_PIDS.pdf
- [3] G. Zorpette, "The quest for the SPIN transistor," *IEEE Spectrum*, vol. 39, no. 12, pp. 30-35, Dec. 2001.
- [4] C. Chappert, A. Fert, and F. Nguyen Van Dau, "The emergence of spin electronics in data storage," *Nature Materials*, vol. 6, no. 11, pp. 813-823, Nov. 2007.
- [5] S. Ikeda, J. Hayakawa, M. L. Young, F. Matsukura, Y. Ohno, T. Hanyu, and H. Ohno, "Magnetic tunnel junctions for spintronic memories and beyond," *IEEE Trans. Electron Devices*, vol. 54, no. 5, pp. 991-1002, May 2007.
- [6] J. C. Slonczewski, "Current-driven excitation of magnetic multilayers", Journal of Magnetism and Magnetic Materials, vol. 159, no. 1-2, pp. L1-L7, Jun. 1996.
- [7] H. Ohno, "A hybrid CMOS/magnetic tunnel junction approach for nonvolatile integrated circuits", Symposium on VLSI Technology, 2009, pp. 122-123.
- [8] M. Sekikawa, "A novel SPRAM (SPin-transfer torque RAM)-based reconfigurable logic block for 3D-stacked reconfigurable spin processor", *IEEE International Electron Devices Meeting*, 2008, pp. 1-3.
- [9] Yiming Huai, "Spin-transfer Torque MRAM (STT-MRAM): Challenges and prospects", AAPPS Bulletin, Vol. 18, No. 6, Dec. 2008. Available: http://www.cospa.ntu.edu.tw/aappsbulletin/data/18-6/33spin.pdf
- [10] T. Kawahara, et al., "2Mb Spin-Transfer Torque RAM (SPRAM) with Bit-by-Bit Bidirectional Current Write and Parallelizing-Direction Current Read", *IEEE ISSCC*, 2007, pp. 480-617.
- [11] K. Tsuchida, et al., "A 64Mb MRAM with clamped-reference and adequate-reference schemes", *IEEE ISSCC*, 2010, pp. 258-259.

- [12] J. M. Slaughter, "Recent advances in MRAM technology," in 65th Annu. Device Research Conf., South Bend, IN, Jun. 2007, pp. 245-246.
- [13] T. Sakurai, "Perspectives on power-aware electronics", *IEEE ISSCC*, 2003, vol. 1, pp. 26-29.
- [14] J. Wang, H. Meng, and J.-P. Wang, "Programmable spintronics logic device based on a magnetic tunnel junction element," J. Appl. Phys., vol. 97, 10D509, 2005.
- [15] H. Meng, J. Wang, and J.-P. Wang, "A spintronics full adder for magnetic CPU," *IEEE Electron. Devices Lett.*, vol. 26, No. 6, pp. 360-362, Jun. 2005.
- [16] S. Lee, S. Seo, S. Lee, and H. Shin, "A full adder design using serially connected single-layer magnetic tunnel junction elements," *IEEE Trans. Electron Devices*, vol. 55, No. 3, pp. 890-895, Mar. 2008.
- [17] S. R. Patil, X. Yao, H. Meng, J.-P. Wang, and D. J. Lilja, "Design of a spintronic arithmetic and logic unit using magnetic tunnel junctions," ACM Press Proc. of the 5th Conf. on Computing Frontiers, New York, New York, USA, May 2008, pp. 171-178.
- [18] W. Zhao, E. Belhaire, C. Chappert, F. Jacquet, and P. Mazoyer, "New non-volatile logic based on spin-MTJ," *phys stat sol (a)*, vol. 205, No. 6, pp. 1373-1377, Jun. 2008.
- [19] S. Matsunaga, J. Hayakawa, S. Ikeda, K. Miura, H. Hasegawa, T. Endoh, H. Ohno, and T. Hanyu, "Fabrication of a nonvolatile full adder based on logic-in-memory architecture using magnetic tunnel junctions," *Appl. Phys. Express*, vol. 1, pp. 091301 (3 pages), 2008.
- [20] F. Ren and D. Markovic, "True Energy-Performance Analysis of the MTJ-Based Logic-in-Memory Architecture (1-Bit Full Adder)," *IEEE Trans. Electron Devices*, vol. 57, no. 5, pp. 890-895, May 2010.
- [21] Richard Dorrance, et al., "A Compact MTJ Model Incorporating Temperature and Voltage Dependencies using Verilog-A", manuscript in preparation.
- [22] Fang Gong, et al., "An Algorithm Based on Mixed Boundary Element Integral Formulations for Extracting Frequency-Dependent Impedances of 3D VLSI Interconnects", Journal of Computer Aided Design and Computer Graphics, Vol.19, No.10, Oct., 2007.

- [23] Shan Zeng, et al., "Efficient Frequency-Dependent Reluctance Extraction for Large-Scale Power/Ground Grid", 9th International Conference on Solid-State and Integrated-Circuit Technology, 2008, pp. 2292.
- [24] Fang Gong, et al., "Efficient Techniques for 3-D Impedance Extraction Using Mixed Boundary Element Method", in Proc. IEEE Asia South Pacific Design Automation Conference 2008 (ASPDAC'08), Seoul, Korea, Jan.21-24, 2008, pp. 158-163.
- [25] Xinmin Xu, et al., "Large girth LDPC codes and the improvement using kroneker product", International Journal of Communication Systems (IJCS), 2011, submitted.
- [26] Fang Gong, et al., "An Efficient Non-Iterative Parametric Yield Estimation with Performance Constraints", 47th IEEE Design Automation Conference (DAC'10), Anaheim, CA, June 13-18, 2010.
- [27] Mahsan Rofouei, et al., "Computing with Uncertainty in a Smart Textile Surface for Object Recognition", *IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI)*, Sept., 2010.
- [28] Fang Gong, et al., "A Parallel and Incremental Extraction of Variational Capacitance with Stochastic Geometric Moments", *IEEE Transactions on Very Large Scale Integration Systems (TVLSI)*, 2011, submitted.
- [29] C.J. Lin, et al., "45nm Low Power CMOS Logic Compatible Embedded STT MRAM Utilizing a Reverse-Connection 1T/1MTJ Cell", *IEDM*, 2009, pp. 1-4.
- [30] Se-Chung Oh, et al., "Bias-voltage dependence of perpendicular spin-transfer torque in asymmetric MgO-based magnetic tunnel junctions", Nature Physics, vol. 5, pp. 898-902, Oct. 2009.
- [31] M. Hosomi, et al., "A Novel Nonvolatile Memory with Spin Torque Transfer Magnetization Switching: Spin-RAM", *IEDM Tech. Dig.*, Dec., 2006, pp. 473-476.
- [32] A. Mochizuki, H. Kimura, M. Ibuki, and T. Hanyu, "TMR-based logic-in-memory circuit for low-power VLSI," *IEICE T. Fund. Electr.*, vol. E88-A, no. 6, pp. 473-476, Jun. 2005.
- [33] M. W. Allam and M. I. Elmasry, "Dynamic current mode logic (DyCML): a new low-power high-performance logic style," *IEEE J. Solid-St. Circ.*, vol. 36, pp. 550-558, Mar. 2001.

- [34] D. Markovic, "A power/area optimal approach to VLSI signal processing," *PhD dissertation*, Dept. Elect. Eng., UC Berkeley, Berkeley, CA, May 2006, Chapters 2-3.
- [35] W. H. Butler, X.-G. Zhang, T. C. Schulthess, and J. M. MacLaren, "Spin dependent tunneling conductance of Fe/MgO/Fe sandwiches," *Phys. Rev. B, Condens. Matter*, vol. 63, no. 5, pp. 054416 (12 pages), Jan. 2001.
- [36] F. J. Albert, J. A. Katine, R. A. Buhrman, and D. C. Ralph, "Spin-polarized current switching of a Co thin film nanomagnet," *Appl. Phys. Lett.*, vol. 77, pp. 3809-3811, Dec. 2000.
- [37] K. Konishi, T. Nozaki, H. Kubota, A. Fukushima, S. Yuasa, M. Shiraishi, and Y. Suzuki, "Current-field driven 'spin transistor'," *Appl. Phys. Express*, vol. 2, pp. 063004 (3 pages), 2009.
- [38] D. Halupka, et al., "Negative-Resistance Read and Write Schemes for STT-MRAM in 0.13μm CMOS", *IEEE ISSCC*, 2010, pp. 258-259.
- [39] Wenyao Xu, et al., "A novel placement algorithm for symmetrical FPGA", 7th IEEE International Conference on ASIC (ASICON07), Nov. 2007.
- [40] Kejun Xu, et al., "Task scheduling model and algorithm based on dual-Vdd dynamic reconfigurable FPGA", Journal of Zhejiang University (Engineering Science), Jan., 2010.
- [41] Wenyao Xu, et al., "In-Place FPGA retiming for mitigation of variational single-event transient faults", TCAS-I, Feb., 2011, to appear.
- [42] Nan Guan, et al., "Schedulability analysis of preemptive and nonpreemptive EDF on partial runtime-reconfigurable FPGAs", ACM Transactions on Design Automation and Electronic Systems (TODAES), Vol. 13, Issue 4, September 2008.
- [43] Fang Gong, et al., "A Fast Non-Monte-Carlo Yield Analysis and Optimization by Stochastic Orthogonal Polynomials", ACM Transactions on Design Automation of Electronic Systems (TODAES), 2011, submitted.
- [44] Wenyao Xu, et al., "Retiming for Single Event Transient Mitigation in FPGAs", International Workshop on Logic Synthesis (IWLS), Nov. 2009.
- [45] Fang Gong, et al., "PiCAP: A Parallel and Incremental Capacitance Extraction Considering Stochastic Process Variation", in Proc. 46th IEEE Design Automation Conference (DAC'09), San Francisco, July 26-31, 2009, pp. 764-769.

- [46] C.-H. Yang, T.-H. Yu, and D. Marković, "A 5.8mW 3GPP-LTE Compliant 8×8 MIMO Sphere Decoder Chip with Soft-Outputs", Int. Symposium on VLSI Circuits, Jun., 2010, pp. 209-210.