ABSTRACT- FFT is a highly efficient procedure to reduce computation time and also improves the performance. The Radix $2^2$, $2^3$ and $2^4$ FFT architectures are not efficient because of its low utilization of components. Our proposed design will provide high data throughput and low complexity VLSI structure for Radix $2^5$ FFT architecture. Most of previous architectures were designed using the complex booth multipliers, but our proposed architecture uses canonical signed digit (CSD) multiplier circuit. This entire proposed architecture simulated in Xilinx 12.2 system edition software and implemented in Xilinx Virtex-5 XUP FPGA kit. To optimize the power, area and speed of the signal process, pipelining and parallel processing techniques have to be used in this proposal. In future this Radix $2^5$ FFT architecture will be incorporated in MIMO-OFDMA based software defined radio (SDR) architecture.

KEY WORDS : Fast fourier transform, pipelining, discrete fourier transform, radix

I. INTRODUCTION:

Today’s technology based on hardware and power efficiency for high performance. Application such as digital signal processing , communications etc are based on digital function which requires complex functionalities. For this fast fourier transform is one of the efficient method to implement discrete fourier transform due to its reduced computations. Our proposed design “RADIX 2$^5$” will provide high data throughput and low complexity VLSI structure .Power , area and speed of the signal processing can be optimized using pipelining techniques.

II. FAST FOURIER TRANSFORM:

The Fourier Transform decomposes a wave form basically any real world wave form into sinusoids. It is possible to generalize the Fourier transform on discrete structures such as Finite Groups. The Efficient Computation Of such structures, by fast Fourier transform, is essential for high speed computing. FFT algorithms are commonly employed to compute DFTs, but there is a clear distinction is that “DFT” refers to a Mathematical transformation, regardless of how it is computed, whereas “FFT” refers to a specific families of a algorithms for computing DFTs. Fast Fourier Transform (FFT) is developed by Cooley and Tukey in 1965.

Highly efficient procedure for computing the DFT of a finite series and requires less number of computations than that of direct evaluation of DFT. Fast Fourier transform (FFT) is based on decomposition and breaking the sequence into smaller sequences and combining them to get total sequence. The FFT time domain decomposition is usually carried out by a bit reversal sorting algorithm .There are various pipeline structures using radix-2, radix-4 and split radix algorithm. In 1998, HE and TORKESON suggested radix $2^2$ and $2^3$ FFT algorithm .These algorithms are characterized by the trait that reduces the number of non –trivial multiplications in the radix-2 algorithm architectures. It has same number of non –trivial multiplications at the same positions in the single flow graph as the same butter fly structure as that of the radix-4 algorithm but has the same butterfly structure as that of the radix-2 algorithm.Radix-2 algorithm is characterized according to the merit that it has some multiplicative complexity as the radix-4 algorithm.
III. RADIX-2:

The Radix indicates the size of FFT decomposition. In this paper Radix is 2 which is single-Radix FFT. For single Radix FFTs, transform size must be choose according to the power of Radix. Here we use 32and64 sizes, which is $2^5$ and $2^6$. The Radix-2 Decimation-in-Time FFT (DIT-FFT) is applied to the two Points N/2 DFT's. To find the number of butterfly stages required to compute N length, sequence can be M=$\log_2 N$, and N/2 butterfly operations are computed in each stage. In this paper there are 5 butterfly stages and 16 butterfly operations are computed to produce 32 Point FFT [2].

In DIT-FFT the given input sequence is in shuffled order and the output sequence is in natural order. By using Bit-Reverse input sequence gets shuffled. The Radix-2 decimation in time FFT is the basic form of Cooley-Tukey implementation algorithms [1]. Radix-2 first computes the DFT of the even index inputs and the odd index inputs and then combines the two results to produce the entire DFT sequence.

The basic computation block in the FFT is butterfly in which the two inputs are combined to give two outputs. The FFT operation of butterfly diagram is shown in the below figure, and the powers of the twiddle factors associated in butterflies are in natural order.

The twiddle factor exponent $k$ of each stage is calculated by using below equation:

$$K = N \, t/2m$$

when $t=0, 1, 2 \ldots \ldots 2m-1$

IV. RADIX 2$^2$ ALGORITHM:

Radix 2$^2$ algorithm derivation was derived by considering the first 2 steps of Cascade Decomposition in the radix 2 DIF – FFT together. The linear index mapping converted into 4-dimensional linear map [3].

$$n = (N/2n_1 + N/4n_2 + N/8n_3 + n_4)N$$

$$k = (k_1 + 2k_2 + 4k_3 + 8k_4)N$$

As previously explained, with cascade decomposition Twiddle Factor can be defined as

$$W_N^{(N/2n \cdot N/4n \cdot N/8n \cdot n_4}) = (-1) \cdot \sum_{k_1}^{N/2} \cdot \sum_{k_2}^{N/4} \cdot \sum_{k_3}^{N/8} \cdot \sum_{k_4}^{n_4} W_N^{(k_1 + 2k_2 + 4k_3 + 8k_4)}$$

After substitution of eqn (5),

$$X(k_1 + 2k_2 + 4k_3 + 8k_4) =$$

Hence a Booth Multiplier called a Programmable Multiplier is used instead of a Constant Multiplier.

V. RADIX 2$^3$ ALGORITHM:

Radix 2$^3$ Algorithm was derived by considering the first 3 steps of Cascade decomposition. The linear index mapping transformed into 5-dimensional linear map [3].

$$n = (N/2n_1 + N/4n_2 + N/8n_3 + N/16n_4 + n_5)N$$

$$k = (k_1 + 2k_2 + 4k_3 + 8k_4 + 16k_5)N$$

With cascade decomposition Twiddle Factor can be expressed in the form,

$$W_N^{(N/2n \cdot N/4n \cdot N/8n \cdot N/16n \cdot n_5)} = (-1) \cdot \sum_{k_1}^{N/2} \cdot \sum_{k_2}^{N/4} \cdot \sum_{k_3}^{N/8} \cdot \sum_{k_4}^{N/16} \cdot \sum_{k_5}^{n_5} W_N^{(k_1 + 2k_2 + 4k_3 + 8k_4 + 16k_5)}$$

After substitution of eqn (8),

$$X(k_1 + 2k_2 + 4k_3 + 8k_4 + 16k_5) =$$

M.R. Thansekhar and N. Balaji (Eds.): ICIET'14 1508
Hence the Complex Multiplication for the Twiddle Factors, $W_{16}^{ik}+W_{4}^{2k}$, can be reduced to a Constant Multiplier with some control logics.

VII. RADIX 2' ALGORITHM:

This paper proposes and concentrates on the design of 32 point FFT and its performance analysis. By using VHDL as a design entity the synthesis and simulation is done on Xilinx ISE Design Suite12.2. A DFT Decomposes a sequence of values into components of different frequencies. This operation is useful in many fields but computing it directly from the definition is often too slow to be practical. An FFT is a way to compute the same result more quickly: Computing a DFT of N points, takes $O(N^2)$ Arithmetical operations, while an FFT can compute the same DFT in only $O(N \log N)$ operations. FFTs can decomposed using DFTs of even and odd points, which is called a Decimation in time (DIT) FFT, or they can decomposed using another approach which is called a Decimation-in-frequency (DIF) FFT.

Computation of the end point DFT corresponds of computation of N samples of Fourier transform at N equally spaced frequencies. Consider the input $x(n)$ of length N is a complex data sequence, its DFT $X(k)$ is also complex data sequence of length N which is defined as:

$$X(K)=\sum_{n=0}^{N-1} x(n) W_N^{nk}, K=0, 1, \ldots, N-1.$$  

Where $W_N$ denotes, $\exp\{-j2\pi/N\}$, the Nth primitive root of unity with its exponent being evaluated modulo N the ‘n’ is the time index and the ‘K’ is the frequency index. Twiddle Factor coefficients are used to combine the results from the previous stage to form inputs to the next stage.

VIII. CANONIC SIGNED DIGIT MULTIPLIER

In this paper we are using canonic Signed Digit (CSD) Multiplier instead of Fixed Width Multiplier used in previous architecture. In Fixed Width Multiplier, in the resulting product a significant error will be introduced and it is undesirable for many DSP applications. To reduce the error of the Fixed Width Multiplier, a constant bias is added to the retained cells. However its product is still large [10]. To overcome this, CSD is used in this architecture. CSD Multiplier is used for encoding the floating point numbers in two’s complement representation [4]. The CSD Multiplier has the function to multiply successive input data values by one or more predetermined constant values, when the input data values are in binary format and finally result will make rounded to p-number of bits.

The constant value is in CSD format. Input data values are in the form $d0, d1, d2, \ldots, dM$ for each di for $i=0,1,2,\ldots,M-1$ takes one of the values 0 & +1 and where no consecutive bi are non-zero. The CSD Multiplier result was obtained addition and shift operations [8].

In normal way the multiplication operation involves two major steps, Partial product generation and Accumulation. The speed of multiplication can be improved by reducing number of partial product. Number of partial products depends on the number of non-zero digits. Number of non-zero digits is proportional to number of partial products.
The method of CSD is used for multiply the floating point. The representation of floating point contains a sign, mantissa and exponent. The floating point multiplication involves three steps:[11]

- First, Multiplication process involves in two mantissa numbers: Floating point stores in signed form but multiplier need with unsigned form. Mantissa have p-number of bits, the product will be 2p number of bits and finally result will make rounded to p-number of bits.
- Second, to compute the exponent: the exponent is represented as bias. It differs from various number (i.e) for single the bias is 127 and for double precision the bias is 1023.
- Third, To compute sign bit: by using Ex-or operation for 2 sign bits.

This method has the advantage of decreasing the number of additions/subtractions, needed, as well as handling negative multipliers. Results are obtained by expressing the multiplier in Canonic Signed Digit (CSD) form. CSD representation is useful for the design and implementation of digital filters such as the area-efficient programmable FIR digital filter architecture. It enables the reduction of the number of partial products that must be calculated fast and also low-power consumption and low area structure of a multiplier for DSP applications.

CSD representation is unique and has two main properties:
- the number of nonzero digits is minimal, and
- No two consecutive digits are both nonzero, that is, two nonzero digits are not adjacent.

The CSD representation of an integer number is a signed and unique digit representation that contains no adjacent nonzero digits. Given an n-digit binary unsigned number \(X=x_0, x_1, \ldots, x_{n-1}\) and then \((n+1)\) digit representation \(Y=y_0, \ldots, y_n\).

There are two encoding method used in binary representation of CSD. The encoding method have two variables \(y_i\) is sign bit and \(y_i\) is data bit[12].

<table>
<thead>
<tr>
<th>Encoding 1</th>
<th>Encoding 2</th>
</tr>
</thead>
<tbody>
<tr>
<td>yi</td>
<td>yi</td>
</tr>
<tr>
<td>0</td>
<td>00</td>
</tr>
<tr>
<td>1</td>
<td>01</td>
</tr>
<tr>
<td>-1</td>
<td>11</td>
</tr>
</tbody>
</table>

To convert binary representation to CSD representation is based on:

\[2^{y_{n-1}} + 2^{y_{n-2}} + \ldots + 2^{y_1} + 2^{y_0} - 2^{y_1} - 2^0\]
Radix $2^5$ Fft Architecture Implementation In Xilinx Fpga

IX. CSD Coding:

<table>
<thead>
<tr>
<th>Xi+1</th>
<th>Xi</th>
<th>Cl-1</th>
<th>Yi</th>
<th>Ci</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>Last two bits are zero</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>Last one bit is one</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>Any of the bit is one</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>Last two bits are one</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>Last two bits are zero</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>Any one of the bit is zero</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>Beginning bit is one</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>Last two bits are one</td>
</tr>
</tbody>
</table>

The CSD has 2 advantages features:

- Reduces the critical path by computing it in parallel and it simplifies the algebraic expressions which minimizes the overall hardware implementations.
- Simulations are performed with the proposed circuits and it shows high efficiency in speed and area terms in comparison with other previous counterpart architecture[6].

By using CSD multiplier with common sub expression sharing technique , to reduce the area. If area is reduced delay is reduced and also power consumption is also reduced.

![Fig: Delay comparison versus Booth multiplier & CSD multiplier.](image)

![Fig: Power comparison versus Booth multiplier & CSD multiplier.](image)

X. CONCLUSION:

In this paper we have proposed 32point FFT design using Radix 2 algorithm. It was done by using XILINX synthesis tool on vertex kit. Here XILINX ISE design suite is used and also Vertex kit is named as V5XUPXLX110t hence the CCM is implemented by CSD Multiplier and CSS technique. In addition, the hardware complexity of the proposed CSD Multiplier used for the reduction of area and the power consumption by approximately 33%. The proposed architecture of expected to be incorporated in SDR.

REFERENCES:
