Keywords 
Modified Booth algorithm, Reduced area, Throughput 
INTRODUCTION 
Booth recoding is widely used to reduce the number of partial products in multipliers . The benefit is mainly
an area reduction in multipliers with medium to large operand widths (8 or 16 bits and higher) due to the massively
smaller adder tree, while delays remain roughly in the same range.Different recodings exist resulting in different gatelevel
implementations and performance. In this work the XORbased implementation gives lowest area and delay
numbers in most technologies due to the small selector size and the wellbalanced signal paths.An implementation of a
radix4 butterfly has been developed. The number of stages has been reduced. This reduction comes from the fact that,
to achieve a throughput comparable to that of radix2. Therefore, the implementation of the radix4 butterfly is suitable
for high speed applications, since the hardware cost, the power consumption and the latency are reduced. To reduce the
number of calculation steps for the partial products, MBA algorithm has been applied mostly where Wallace tree has
taken the role of increasing the speed to add the partial products. 
II. RELATED WORK 
Recent research activities in the field of arithmetic optimization have shown that the design of arithmetic
components combining operations which share data, can lead to significant performance improvements. Based on the
observation that an addition can often be subsequent to a multiplication (e.g., in symmetric FIR filters). Most digital
signal processing methods use nonlinear functions such as discrete cosine transform (DCT) or discrete wavelet
transform(DWT). Because they are basically accomplished by repetitive application of multiplication and addition, the
speed of the multiplication and addition arithmetic’s determines the execution speed proceedings and performance of
the entire calculation. Because the multiplier requires the longest delay among the basic operational blocks in digital
system, the critical path is determined by the multiplier, in general. For highspeed multiplication, the modified radix4
Booth’s algorithm (MBA) is commonly used. A signedbit MB recoder which transforms redundant binary inputs to
their MB recoding form. A special expansion of the preprocessing step of the recoder is needed in order to handle
operands in carrysave representation . proposes a twostage recoder which converts a number in carrysave form to its
MB representation. The first stage transforms the carrysave form of the input number in to signeddigit form which is
then recoded in the second stage so that it matches the form that the MB digits request. Recently, the technique of has
been used for the design of high performance flexible coprocessor architectures targeting the computationally intensive
DSP applications. 
The conventional design of the AM operator requires that its inputs are first driven to an adder and then the
input and the sum are driven to a multiplier in order to get output . The drawback of using an adder is that it inserts a
significant delay in the critical path of the AM. As there are carry signals to be propagated inside the adder, the critical
path depends on the bitwidth of the inputs. In order to decrease this delay, a CarryLookAhead (CLA) adder can be
used which, however, increases the area occupation and power dissipation. An optimized design of the AM operator is
based on the fusion of the adder and the MB encoding unit into a single data path block (Fig. 1) by direct recoding of
the sum to its MB representation The fused AddMultiply (FAM) component contains only one adder at the end (final
adder of the parallel multiplier). As a result, significant area savings are observed and the critical path delay of the
recoding process is reduced and decoupled from the bitwidth of its inputs 
The Partial Product Generator has hardware multipliers in that the generation of partial products is a necessary
step in the process known to the art for efficient production of a final product. A way to increase the speed of hardware
multipliers is through the use of the Booth algorithm. The alternate Booth partial product generation for hardware
multipliers of the present invention is directed to a method and apparatus for eliminating the encoding of the bits of the
multiplier prior to entering the partial product generating cell of the present invention which may result in less
hardware and increased speed.In a Wallace tree multiplier the addition of bits with in one column is rearranged still
further. Each column is characterized by the inputs to that column and the outputs from that column. 
The inputs to a column are the bits of the partial product(Booth or nonBooth encoded) the carry bits from one
column to the right . the sum bits that are generated within the column..The carry in comes CSA fashion from the
previous column. The carry out goes CSA fashion to the next column.A carrylook ahead adder (CLA) is a type
of adder used in digital logic. A carrylook ahead adder improves speed by reducing the amount of time required to
determine carry bits. The carrylook ahead adder calculates one or more carry bits before the sum, which reduces the
wait time to calculate the result of the larger value bitsThe generate function, Gi, indicates whether that stage causes a
carryout signal Ci to be generated if no carryin signal exists. This occurs if both the addends contain a 1 in that bit.
The propagate function, Pi, indicates whether a carryin to the stage is passed to the carryout for the stage. This occurs
if either the addends have a 1 in that bit:Pi 
A. SMb Recoder 
A standard approach that might be taken by a novice to perform multiplication is to "shift and add", or normal
"long multiplication". That is, for each column in the multiplier, shift the multiplicand the appropriate number of
columns and multiply it by the value of the digit in that column of the multiplier, to obtain a partial product. The partial
products are then added to obtain the final result:. It is possible to reduce the number of partial products by half, by
using the technique of radix 4 Booth recoding. 
The basic idea is that, instead of shifting and adding for every column of the multiplier term and multiplying
by 1 or 0, we only take every second column, and multiply by ±1, ±2, or 0, to obtain the same results. So, to multiply
by 7, we can multiply the partial product aligned against the least significant bit by 1, and multiply the partial product
aligned with the third column by 2: The advantage of this method is the halving of the number of partial products. This
is important in circuit design as it relates to the propagation delay in the running of the circuit, and the complexity and
power consumption of its implementation. 
B. 4:2compressor 
The characteristics of the 4:2 compressor are: 
• The outputs represent the sum of the five inputs, so it is really a 5 bit adder 
• Both carries are of equal weighting (i.e. add "1" to the next column) 
• To avoid carry propagation, the value of Cout depends only on A, B, C and D. It is independent of Cin. 
• The Cout signal forms the input to the Cin of a 4:2 of the next column. 
4:2 compressors are all of their logical decompositions employ a series of XORs. In this way, the speed 
comparison can be made independent of specific technologies though actual relative delays may be slightly different
depending on specific circuit families and fabrication technologies employed. Many multiplier designs use specially
designed counters and compressors depending on their processing technology. 
V. SUMPRODUCT IMPLEMENTATION 
The direct recoding of the sum of two numbers in its MB form leads to a more efficient implementation of the
fused AddMultiply (FAM) unit compared to the existing. The efficient design of FAM operators are used to attain the
optimization of the recoding for direct shaping of the MB form of the sum of two numbers (Sum to MB). The Sum
Modified Booth algorithm is structured, simple and can be easily modified in order to be applied in signed or unsigned
numbers, which consist of odd or even number of bits. SumModified Booth algorithm using unsigned and signedbit
Full Adders and Half Adders. The recoding technique reveals efficient solutions for the FAM design to obtain the
targeted operator and timing functional for a larger range of frequencies. It obtains a significant improvement in both
area and power consumption. 
The fused AddMultiply (FAM) contains only one adder at the end . As a result, significant area savings are
observed and the critical path delay of the recoding process is reduced and decoupled from the bitwidth of its inputs.
We present a new technique for direct recoding of two numbers in the MB representation of their sum.compare to
radix2 representation radix4 is more efficient . The area saving is the most significant, since in most signal processing
applications, It can be employed to achieve high throughput . Each digit is represented by three bits named s, one and
two. The sign bit shows if the digit is negative (s=1 ) or positive( s=0). 
A. Radix4 Booth Recoding 
To Booth recode the multiplier term, we consider the bits in blocks of three, such that each block overlaps the
previous block by one bit. Grouping starts from the LSB, and the first block only uses two bits of the multiplier since
there is no previous block to overlap. 
The overlap is necessary so that we know what happened in the last block, as the MSB of the block acts like a
sign bit. Since we use the LSB of each block to know what the sign bit was in the previous block, and there are never
any negative products before the least significant block, the LSB of the first block is always assumed to be 0. In the
case where there are not enough bits to obtain a MSB of the last block, as below, we sign extend the multiplier by one
bit. SMB recoding technique is used to recode the sum of two consecutive bits of the input A(a2j,a2j+1 ) with two
consecutive bits of the input B (b2j,b2j+1 ) into one MB digit so three bits are included in forming a MB digit. The most
significant of them is negatively weighted while the two least significant of them have positive weight. in order to
transform the two aforementioned pairs of bits in MB form we need to use signedbit arithmetic. For this purpose we
develop a set of bitlevel signed Half Adders (HA) and Full Adders (FA) considering their inputs and outputs to be
Signed. in this paper use two types of signed HAs referred as HA* and HA**. 
B. SMB1 RECODING 
The recoding scheme is refered to as SMB1 and it is obtained for both odd or even number of bit width of
input numbers .The encoding of the modified booth digits yj
mb,0<j<k1 is based on both digits S2j+1 and S2j are
extracted from the j recoding cell. A conventional full adder with inputs a2j,b2j,b2j1 and produces the carry c2j+1. 
C. SMB2 RECODING 
The recoding technique SMB2 is described for even and odd bitwidth of input numbers. consider the initial
values c0,1=0 and c0,2=0. As in the SMB1 recoding scheme, we use a conventional FA to produce the carry c2j+1 and the
sum s2j .The inputs of the FA are a2j, b2j and c2j,1.The bit c2j,1 is the output carry of a conventional HA which is part of
the (j1)recoding cell and has the bits a2j1,b2j1 , as inputs. The HA* is used in order to produce the negatively signed
sum and its outputs 
D. SMB3 RECODING 
The third recoding technique is SMB3 .we use a conventional FA to produce the carry c2j+1 and the sum s2j
.The bit c2j,1 is now the output carry of a HA*which belongs to the (j1) recoding cell and has the bits a2j1,b2j1 as
inputs. The negatively signed bit s2j+1 is produced by a HA** in which we drive c2j+1 and the output sum (negatively
signed) of the HA* of the recoding cell with the bits , as inputs a2j+1,b2j+1. The carry and sum outputs of the HA** are
obtained 
E. PERFORMANCE ANALYSIS 
A theoretical analysis and comparative study in terms of area complexity and critical delay among three
recoding schemes and the three existing recoding techniques. Our analysis is based on the unit gate model. More
importantly for our quantitative comparisons and the 2input primitive gates (NAND, AND, NOR, OR) count as one
gate equivalent for both area and delay, whereas the 2input XOR, XNOR gates count as two gate equivalent 
VI. RESULTS AND DISCUSSION 
The performance of the three proposed recoding schemes in a fused sumproduct operator and implemented
them using structural Verilog HDL for both cases of even and odd bitwidth of the recoder’s input numbers. We also
synthesized all designs at lower frequencies in order to explore how they behave considering different timing is
constraints in terms of area, timing and power consumption, we simulated the designs using Modelsim for the same set
of pairs of input numbers in 2’s complement representation. The performance of the Fused sum product designs
include the proposed in recoding schemes with respect to the bit width of the input numbers. The lowest area and
power values of each clock period are marked. 
VII. CONCLUSION 
The design of sumproduct are used to implement the direct recoding of the sum of two numbers in its
Modified Booth (MB) form to synthesize each Fused sumproduct design at the highest achievable frequency. This
paper also synthesized all designs at lower frequencies in order to explore how they behave different timing
constraints in terms of area, timing and power consumption. The proposed recoding schemes (SMB1, SMB2 and SMB3)
achieve area reduction . compare to existing recoders the power has been reduced. The performance of the Fused
sumproduct designs that include the proposed recoding schemes is considered with respect to the bit width of the input
numbers. The area and power measurements of all Fused sumproduct units for different clock periods for even (i.e., 8,
12, 16, 24 and 32 bits) and odd (i.e., 7,11, 15, 23 and 31 bits) bitwidth of the recoder. 
Tables at a glance 

Table 1 


Figures at a glance 



Figure 1 
Figure 2 
Figure 3 



Figure 4 
Figure 5 
Figure 6 


References 
 A. Amaricai, M. Vladutiu, and O. Boncalo, Ã¢ÂÂDesign issues and implementations for floatingpoint divideadd fused,Ã¢ÂÂ IEEE Trans. CircuitsSyst. IIÃ¢ÂÂExp. Briefs, vol. 57, no. 4, pp. 295Ã¢ÂÂ299, Apr. 2010.
 E. E. Swartzlander and H. H. M. Saleh, Ã¢ÂÂFFT implementation with fused floatingpoint operations,Ã¢ÂÂ IEEE Trans. Comput., vol. 61, no. 2,pp.284Ã¢ÂÂ288, Feb. 2012.
 J. J. F. Cavanagh,Digital Computer Arithmetic. NewYork:McGrawHill, 1984.
 S. Nikolaidis, E. Karaolis, and E. D. KyriakisBitzaros, Ã¢ÂÂEstimation of signal transition activity in FIR filters implemented by a MACarchitecture,Ã¢ÂÂIEEE Trans. Comput.Aided Des. Integr. Circuits Syst., vol.19, no. 1, pp. 164Ã¢ÂÂ169, Jan. 2000
 O. Kwon, K. Nowka, and E. E. Swartzlander, Ã¢ÂÂA 16bit by 16bitMAC design using fast 5 : 3 compressor cells,Ã¢ÂÂ J. VLSI Signal Process.Syst.,vol. 31, no. 2, pp. 77Ã¢ÂÂ89, Jun. 2002.
 L.H. Chen, O. T.C. Chen, T.Y.Wang, and Y.C. Ma, Ã¢ÂÂA multiplicationaccumulation computation unit with optimized compressors andminimized switching activities,Ã¢ÂÂ in Proc. IEEE Int, Symp. Circuits and Syst., Kobe, Japan, 2005, vol. 6, pp. 6118Ã¢ÂÂ6121.
 Y.H. Seo and D.W. Kim, Ã¢ÂÂA new VLSI architecture of parallel multiplierÃ¢ÂÂaccumulator based on Radix2 modified Booth algorithm,Ã¢ÂÂIEEETrans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 2, pp. 201Ã¢ÂÂ208, Feb. 2010.AddisonWesley, 2010, ch. 11.
