ISSN:2319-9873

# Research & Reviews: Journal of Engineering and Technology

# Power and Area Minimization of Reconfigurable FFT Processor using Distributed Arithmetic

Anuradha M\*, Vishal R\*

<sup>1</sup>Electronics & Telecommunication Engineering Department, SavitribaiPhule, Pune University, India

# **Research Article**

Received date: 12/12/2015 Accepted date: 19/03/2016 Published date: 28/04/2016

#### \*For Correspondence

Anuradha M, Electronics & Telecommunication Engineering Department, SavitribaiPhule, Pune University, India.

E-mail: mokashianu@gmail.com

**Keywords:** Digital signal processing (DSP), Complex multiplier, Distributed Arithmetic, FFTs.

#### **ABSTRACT**

Fast Fourier transforms is one of the most important frequency analysis in signal processing. It has different application such as image processing, medical field, communication system, spectral analysis etc. Butterfly is the basic elements of FFT. In this work a Distributed arithmetic technique is used to implement the butterfly module. Distributed arithmetic is Multiplierless technique resulted more efficient butterfly element both in terms of power and area. Butterfly element is the most important building block of Reconfigurable FFT processor. Single precision is used to represent the data. IEEE 754 standard is used to represent the floating point numbers.

### INTRODUCTION

Fast Fourier transforms is the computation of DFT which reduces the complexity in terms of operations like multiplications & additions. The DFT and FFT results the same the only difference is that it will remove the unnecessary operations. It uses the periodicity property of twiddle terms which is mentioned after.

The FFT computation for x(n) is given by the following formula

$$X(K) = \sum_{n=0}^{N-1} x(n)e^{-j2\prod nk/N}$$

Where  $W_{N}^{k}=e^{-j2\prod nk/N}$  which is known as twiddle factor.

x(n)= input sequence

x(k)= output sequence

Distributed arithmetic is the Multiplier less technique is used which saves the power & area also<sup>[1]</sup>. It come into existence in 1974.FFT computation using conventional & Vedic method required the high power and area compared to distributed arithmetic technique<sup>[2]</sup>. In DA it takes only adders and shifter for operations instead of using multiplier block. Basic advantages of DA are it enables the functionality in field programmable gate array and ASICs design. It takes the advantages of look up tables. Reconfigurable is the technique in which we can useswitch to select the radix point without disturbing the hardware. It increases the flexibility<sup>[3]</sup>.

A generic low power reconfigurable architecture features high efficiency in terms of area and power which is applicable to most still picture image compression standards, telecommunication protocol and automatic control<sup>[4]</sup>.

# **BACKGOUND STUDY**

Out of two algorithms DIF (discrete in time) and DIF (discrete in frequency), DIF is mostly used for implementation of FFT. This radix2 DIF FFT algorithm is starts by dividing the samples in to two groups<sup>[5]</sup>. It goes till the only computation between two samples. This two point DFT is called as Butterfly block. The butterfly is the basic block of FFT computations<sup>[6]</sup>. The diagram is given below (**Figure 1**).



Figure 1. Butterfly Block.

$$x^{\parallel} = x + y \tag{1}$$

$$\mathbf{y}^{\parallel} = (\mathbf{x} - \mathbf{y}) \mathbf{W}_{\mathbf{N}}^{\mathbf{k}} \tag{2}$$

Where x and y are the input signals. Butterfly implementation is lead saving in terms of silicon area and power. The main objective of this paper is to implement the highly efficient butterfly design which could be used in reconfigurable FFT architecture. Butterfly design using distributed arithmetic technique is implemented<sup>[7]</sup>. In this architecture distributed arithmetic is used only for the butterfly implementation not the whole FFT architecture<sup>[8]</sup>. For the design of butterfly it requires two 32 bit inputs (16+16) and one twiddle factor. These numbers are represented in 16 bit floating point number which is represented in standard IEEE 754 format. We cannot be representing the number in binary form if the number is more or very less. This limitation of binary number representation removed in floating point number representation. Hence we prefer the floating point representation. Instead of using base 10 and power of 10 like in scientific notation, IEEE 754 uses binary fraction and exponent which is considered as a power of 2.The floating point reprentation using IEEE 754 format is given below (Figure 2).

| Sign | Exponent | Fraction/Mantissa |
|------|----------|-------------------|
| 1bit | 5 bits   | 11 bits           |

Figure 2. Floating point representation using IEEE 754 format.

Given number is in normalised form in which sign bit along with 5 bit exponent and 10 bit mantissa including 1 hidden bit (11 bit mantissa) is distinguished. Hidden bit is always taken an account in calculations. Multiplication in conventional FFT requires more power which in case of DA is completely eliminated<sup>[9]</sup>. Multiplication with twiddle factor is the main critical task in FFT computation, here in LUT based approach look up tables are prepared based on the inputs and then after proper bit adjustments and shifting these LUTs are accessed based on bit combination.

#### DESIGN AND IMPLEMENTATION

Working with 16-bit floating point numbers (imaginary and real separately) **Figure 3** shows the bit manipulation happens with the 16 bit imaginary and real and imaginary part of the data. Sign bit chooses the set of LUT to be taken into account.



Figure 3. Flow of operations.

5-bit exponent bits used to equalise the number by proper biasing and then start fetching twiddles from LUTs based on

mantissa bit combinations of imaginary and real numbers. These mantissa bits are obtained after shifting the mantissa bits with lesser exponent by an amount of the difference between exponents. This is explained more elaborately in the example given below<sup>[10]</sup>.

Let DAA is applied to a floating point butterfly module. Let rewrite the butterfly module equation 1 & 2

$$x^{\parallel} = x + y$$

$$y^{\parallel} = (x - y) W_{N}^{k}$$

Where x and y are the input sequence which is complex numbers.

Hence

$$x = xr + jxi \tag{3}$$

$$y = yr + jyi \tag{4}$$

xr, yr = real part

Xi,yi= Imaging part

$$W_N^k = Wr + jWi (5)$$

$$\mathbf{x}^{\parallel} = \mathbf{x}^{\parallel} \mathbf{r} + \mathbf{j} \mathbf{x}^{\parallel} \mathbf{i} \tag{6}$$

$$\mathbf{y}^{\parallel} = \mathbf{y}^{\parallel} \mathbf{r} + \mathbf{i} \mathbf{y}^{\parallel} \mathbf{i} \tag{7}$$

Each and every real part and imaginary part of data is 16 bit IEEE 754 single precision floating point number.

$$\mathbf{x}^{\parallel} = \mathbf{x}^{\parallel} \mathbf{r} + i \mathbf{x}^{\parallel} \mathbf{i}$$

$$x^{\parallel}=(xr+yr)+j(xi+yi)$$
 from 3 and 4

$$\mathbf{y}^{\parallel} = (\mathbf{x} - \mathbf{y}) \mathbf{W}_{N}^{k}$$

Let xr - yr = U1 and xi - yi = U2

Hence  $y^{\parallel} = (U1 + jU2)(Wr + jWi)$ 

$$y^{\parallel} = (U1Wr - U2Wi) + j(U1Wi + U2Wr)$$

$$\mathbf{y}^{\parallel}\mathbf{r} = \mathbf{U}\mathbf{1}\mathbf{W}\mathbf{r} - \mathbf{U}\mathbf{2}\mathbf{W}\mathbf{i} \tag{8}$$

$$\mathbf{y}^{\parallel} \mathbf{i} = \mathbf{U}\mathbf{1}\mathbf{W}\mathbf{i} + \mathbf{U}\mathbf{2}\mathbf{W}\mathbf{r} \tag{9}$$

Equation 8 and 9 are realized using DAA .A 16 bit number in single precision floating point is given by

$$X = (-1)^{S} [2^{E} * 1.M]$$

S=sign bit 1 bit

M=mantissa 10 bits

E=biased exponent 5 bits and its value is given by

$$X = (-1)^{S} \left[ 2^{E} * \sum_{n=0}^{11} Mn 2^{-n} \right]$$
 (10)

 $y^{\parallel}r = U1Wr - U2Wi$  From 8

$$y^{\parallel}r\!=\!\!\left\{\!(-1)^{S1}\!\left[2^{E1}\!*\!\sum\nolimits_{n=0}^{11}M1n2^{-n}\right]\!\right\}Wr\!-\!\left\{\!(-1)^{S2}\!\left[2^{E2}\!*\!\sum\nolimits_{n=0}^{11}M2n2^{-n}\right]\!\right\}Wi$$

Where M1 & M2 be the 11 bit mantissa of U1 & U2 along with hidden bit which is always 1 and E1=E2=E therefore

$$y^{\parallel}r = 2^{E} \left\{ \left\{ \left(-1\right)^{S1} * \sum\nolimits_{n=0}^{11} M1n2^{-n} Wr \right\} - \left\{ \left(-1\right)^{S2} \left[ 2^{E2} * \sum\nolimits_{n=0}^{11} M2n2^{-n} \right] \right\} Wi \right\}$$

Let sign bit is zero.S1=S2=0

$$y^{\parallel}r = 2^{(E)} \left\{ \left\{ M102^{0} + M112^{-1} + ?M1112^{-11} \right\} Wr - \left\{ M202^{-0} + M212^{-1} + ?M2112^{-11} \right\} Wi \right\}$$

$$y^{\parallel}r = 2^{E} \left\{ (M10Wr - M20Wi)2^{0} + (M11Wr - M21Wi)2^{-1} + \dots + (M111Wr - M211Wi)2^{-11} \right\}$$
(11)

Similarly

$$y^{\parallel}i = 2^{E} \left\{ (M10Wi + M20Wr)2^{0} + (M11Wi + M21Wr)2^{-1} + \ldots + (M111Wi + M211Wr)2^{-11} \right\} \tag{12}$$

The value inside the braces is calculated from the look up tables which is shown below (Table 1).

Table 1. LUT for real number calculation (S1&S2 are negative).

| M1n | M2n | LUT value |
|-----|-----|-----------|
| 0   | 0   | 0         |
| 0   | 1   | Wi        |
| 1   | 0   | -Wr       |
| 1   | 1   | -Wr+Wi    |

The value of the LUT is changes according to changes in the sign bits. The real part is calculated by using the scaled accumulator which accumulates the values from the LUT for each bit of the mantissa. Similar method is used for the imaginary part calculations.

# DA BASED BUTTERFLY ARCHITECTURE

The software implementation of Equation 11 & 12 is given below. The implementation is done by using Xilinx software and the coding is done in Verilog (Figure 4).



Figure 4. Simulation of DA butterfly using Xilinx Software.

# **RESULTS & DISCUSSIONS**

DA based butterfly unit which as the only luring feature of tremendous power saving shows area power in Cadence RTL compiler and timing responses are observed in Modelsim (Figure 5 and Table 2).

| Report Power                                                                                                                                                                                                         |       |              |               |            | _ = ×          |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|--------------|---------------|------------|----------------|
| Generated by: Encounter(R) RTL Compiler v09.10-p104_1 (Jun 18 2009) Generated on: Apr 22 2014 09:58:38 Module: top Technology library: tsmc18 1.0 Operating conditions: slow (balanced_tree) Wireload mode: enclosed |       |              |               |            |                |
| Instance                                                                                                                                                                                                             | Cells | Leakage (nW) | Internal (nW) | Net (nW)   | Switching (nW) |
| top                                                                                                                                                                                                                  | 6612  | 2669.29      | 2585051.31    | 2611566.61 | 5196617.92     |
| top/sm                                                                                                                                                                                                               | 3714  | 1619.90      | 1596586.77    | 1420151.41 | 3016738.18     |
| top/sm/im                                                                                                                                                                                                            | 740   | 300.45       | 286827.42     | 256019.41  | 542846.82      |
| top/sm/im/csa_tree_sub+50                                                                                                                                                                                            | 0     | 0.00         | 0.00          | 0.00       | 0.00           |
| top/sm/im/csa_tree_sub_38                                                                                                                                                                                            | 0     | 0.00         | 0.00          | 0.00       | 0.00           |
| top/sm/inc_add_400_13_5                                                                                                                                                                                              | 61    | 36.15        | 6994.11       | 1152.11    | 8146.22        |
| top/sm/re                                                                                                                                                                                                            | 740   | 300.45       | 280693.37     | 248512.97  | 529206.33      |
| top/sm/re/csa_tree_sub450                                                                                                                                                                                            | 0     | 0.00         | 0.00          | 0.00       | 0.00           |
| top/sm/re/csa_tree_sub_38°                                                                                                                                                                                           | 0     | 0.00         | 0.00          | 0.00       | 0.00           |
|                                                                                                                                                                                                                      |       |              |               | 237673.54  | 439388.85      |

Figure 5. Area and power reports from Cadence RTL compiler.

DA based butterfly unit which as the only luring feature of tremendous power saving shows area power and timing responses

are observed. Referring to the contents of **Table 2**, it is quite apparent that for same reference frequency of operation DA is proved to be more power efficient compared to Vedic based approach and conventionally used multipliers. Butterfly unit is operated at 88.33MHz (**Figure 6**).

| Table 2. Comparison | Area and powe | r reports from | Cadence RTL compiler. |
|---------------------|---------------|----------------|-----------------------|
|---------------------|---------------|----------------|-----------------------|

|           | Conventional          | Vedic                  | Da                     |
|-----------|-----------------------|------------------------|------------------------|
| Power     | 14.2 mW               | 7.2 mW                 | 4.8 mW                 |
| Cells     | 16353                 | 12612                  | 6612                   |
| Area      | 207.6 mm <sup>2</sup> | 223.92 mm <sup>2</sup> | 102.42 mm <sup>2</sup> |
| Frequency | 88.33 MHz             | 88.33 MHz              | 88.33 MHz              |



Figure 6. Butterfly output and calculation based on bit combination of mantissa.

#### CONCLUSION

Concluding the purpose of the thesis, Distributed arithmetic based multiplication is 28-30% power efficient than Vedic and compared to conventional multiplication it is beyond comparison level. Adders and shifter are the major part of the hardware and numbers of different cells used are comparatively less. Future scope of the work is to implement the whole architecture of FFT.

#### REFERENCES

- 1. Berkeman A, et al. A Low Logic Depth Complex Multiplier Using Distributed Arithmetic. IEEE. 2000;35:656-659.
- 2. White SA. Applications of distributed arithmetic to digital signal processing a tutorial review. IEEE. 1989;6:4-19.
- 3. Park SK. Efficient FPGA and ASIC realizations of DA -based Reconfigurable FIR Digital Filter. IEEE. 2014;61:511-515.
- 4. Melnikoff SJ, et al. Implementing a Simple Continuous Speech Recognition System on an FPGA. FCCM, IEEE. 2002.
- 5. Das A, et al. Efficient VLSI Architectures of Split-Radix FFT Using New Distributed Arithmetic. IJSCE. 2013;3:264-671.
- 6. Mactaggart IR and Jack MA. Radix-2 FFT butterfly processor using distributed arithmetic. IEEE. 1983; 19:43-44.
- 7. Distributed Arithmetic FIR Filter v9.0. Xilinx Product Specification IEEE. 2005.
- 8. Sanchez MA, et al. Automated Design Space Exploration of FPGA -based FFT architectures based on Area and Power Estimation, IEEE, 2006;127-134.
- 9. Wirthlin MJ. Constant Coefficient Multiplication Using Look-Up Tables. Journal of VLSI Signal Processing IEEE. 2004;36:7-15.
- 10. Rupp CR, et al. The NAPA adaptive processing architecture. IEEE. 1998;28-37.