基于FPGA的高性能处理器设计(内含外文翻译)


一种基于FPGA的高性能处理器设计

摘要:论文提出了一种实时FFT处理器的设计方法。通过优化内存映射算法和旋转因子的产生,一个基-4蝶形运算能在一个时钟周期内完成。本文还介绍了一种不用中断计算流水线就能避免溢出的自适应溢出控制方法。该设计在FPGA芯片上实现,工作频率达到127MHz,能够在10.1μs内完成1024点复数FFT运算。

关键词:FFT处理器,FPGA,地址生成,溢出控制

1.引言

离散傅里叶变换(DFT)便于对采样信号在时域和频域之间进行高效转换,是数字信号处理中使用最广泛的算法之一。在这些应用中,长度很大的DFT运算非常费时,因此需要高性能的快速傅里叶变换(FFT)处理器来满足许多实时要求,例如DVB-T和DAB。但是目前大多数商业和学术的FFT处理器需要几十μs才能完成一次1024点复数变换[1],因此高性能FFT处理器的设计成为实时系统的关键问题。目前提高FFT处理器性能的途径有:提高工作频率和增强并行能力[2][3]。有了高度并行的计算单元之后,数据访问的速度就成为系统的瓶颈[4]。

在过去的20年里,现场可编程门阵列(FPGA)和可编程逻辑器件(PLD)发展迅速,现阶段基于FPGA的数字信号处理已应用于信号处理的大多数领域。相比传统的ASIC设计流程,基于FPGA的设计具有灵活性强和上市时间短的优点。

在本文中,我们开发了一个基于FPGA实现FFT运算的软核设计。下一节介绍论文中用到的必要背景知识,包括FFT算法和提升方案。第三节介绍带有溢出控制的定点FFT处理器的结构。通过优化内存映射算法和旋转因子的产生方法,该FFT处理器具有更高的数据吞吐量,可以在一个时钟周期内计算一个基-4蝶形运算。此外,我们提出了一种根据蝶形运算结果、不用中断计算单元流水线的溢出控制方法。第四部分讨论了该FFT处理器在FPGA芯片上的仿真结果以及与其他设计的比较。最后,第五部分对全文进行总结。

2.算法

A.FFT算法

N点输入序列 {x(n)}, n = 0,1,…,N−1 的DFT定义为

$$X(k)=\sum_{n=0}^{N-1}x(n)W_N^{nk},\qquad k=0,1,\ldots,N-1 \qquad (1)$$

其中 $W_N=e^{-j2\pi/N}$,N = 2^n。根据(1),直接计算N点DFT需要 $O(N^2)$ 次复数乘法和复数加法,这是一个大规模的计算。1965年,J. W. Cooley和J. W. Tukey提出了一种具有原位(in-place)内存寻址特性、便于在硬件中实现的快速DFT算法[5];因此,当今大多数DFT处理器都基于该算法。
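作为对式(1)的直观说明,下面给出按定义直接计算DFT的Python示意代码(非本文的硬件实现,仅用于展示 $O(N^2)$ 的直接计算):

```python
import cmath

def dft_direct(x):
    """按式(1)的定义直接计算DFT,复杂度O(N^2),仅作示意。"""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]
```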

有各种各样的算法可以实现FFT,例如基-2、基-4和分裂基。基-2算法是最简单的一种,但它需要的加法和乘法比基-4多。基-4虽然比基-2更高效,但只能处理4^n点的FFT。分裂基算法的运算量比基-2和基-4都少,但其L形蝶形运算单元难以在硬件中实现。考虑到算法复杂度和硬件成本,本文采用DIF(频域抽取)方式实现基-4算法。在N = 4^r点采样时,(1)式按如下方式进行频域抽取:

$$X(k)=\sum_{n=0}^{N/4-1}x(n)W_N^{nk}+\sum_{n=N/4}^{N/2-1}x(n)W_N^{nk}+\sum_{n=N/2}^{3N/4-1}x(n)W_N^{nk}+\sum_{n=3N/4}^{N-1}x(n)W_N^{nk} \qquad (2)$$

分别令k=4m、k=4m+1、k=4m+2和k=4m+3,其中m=0,1,…,N/4−1,可以得到

$$X(4m)=\sum_{n=0}^{N/4-1}\big[(x(n)+x(n+N/2))+(x(n+N/4)+x(n+3N/4))\big]\,W_{N/4}^{nm} \qquad (3a)$$

$$X(4m+2)=\sum_{n=0}^{N/4-1}\big[(x(n)+x(n+N/2))-(x(n+N/4)+x(n+3N/4))\big]\,W_N^{2n}\,W_{N/4}^{nm} \qquad (3b)$$

$$X(4m+1)=\sum_{n=0}^{N/4-1}\big[(x(n)-x(n+N/2))-j(x(n+N/4)-x(n+3N/4))\big]\,W_N^{n}\,W_{N/4}^{nm} \qquad (3c)$$

$$X(4m+3)=\sum_{n=0}^{N/4-1}\big[(x(n)-x(n+N/2))+j(x(n+N/4)-x(n+3N/4))\big]\,W_N^{3n}\,W_{N/4}^{nm} \qquad (3d)$$
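为帮助理解,下面给出式(3a)–(3d)所定义的单个基-4 DIF蝶形运算的Python示意(使用Python内置复数;这只是数学关系的草图,并非本文的硬件实现):

```python
import cmath

def radix4_dif_butterfly(x0, x1, x2, x3, n, N):
    """对操作数 x0=x(n), x1=x(n+N/4), x2=x(n+N/2), x3=x(n+3N/4),
    计算式(3a)-(3d)中方括号内的组合并乘以相应旋转因子(示意)。"""
    W = lambda e: cmath.exp(-2j * cmath.pi * e / N)   # 旋转因子 W_N^e
    a, b = x0 + x2, x1 + x3          # 和项
    c, d = x0 - x2, x1 - x3          # 差项
    y0 = a + b                        # 对应 X(4m) 的被求和项
    y2 = (a - b) * W(2 * n)           # 对应 X(4m+2)
    y1 = (c - 1j * d) * W(n)          # 对应 X(4m+1)
    y3 = (c + 1j * d) * W(3 * n)      # 对应 X(4m+3)
    return y0, y1, y2, y3
```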

式(2)和(3a)–(3d)表示基-4 FFT的基本DIF运算,通常称为蝶形运算。由此,整个算法的计算复杂度降低到 $O(N\log_4 N)$。

B.提升方案

最初,提升方案是作为构造小波和完全重构(PR)滤波器组的一种工具而提出的[6]。即使提升系数经过量化,所得到的变换仍然简单且可逆。由于出现在FFT结构中的所有旋转因子都是模为1的复数,即每一个旋转因子都可以表示为 $e^{j\theta}$,其中θ是实数,因此旋转因子W的复数形式可以表示为

$$W=\cos\theta+j\sin\theta=c+js \qquad (4)$$

通常,一个复数乘法可以分解为四个实数乘法。这里,设另一个复数为 $X=x+jy$,则这两个复数的乘积为

$$P=W\cdot X=(c+js)(x+jy) \qquad (5)$$

为方便起见,写成向量-矩阵形式

$$P=\begin{bmatrix}1 & j\end{bmatrix}\begin{bmatrix}c & -s\\ s & c\end{bmatrix}\begin{bmatrix}x\\ y\end{bmatrix} \qquad (6)$$

假设 $c^2+s^2=1$ 且 $s\neq 0$(若s=0,则W=1或−1,不需要量化),通过分解为三个提升步骤我们可以得到

$$\begin{bmatrix}c & -s\\ s & c\end{bmatrix}=\begin{bmatrix}1 & 0\\ s/c & 1\end{bmatrix}\begin{bmatrix}1 & -sc\\ 0 & 1\end{bmatrix}\begin{bmatrix}c & 0\\ 0 & 1/c\end{bmatrix} \qquad (7)$$

我们也可以在不进行缩放的情况下用三个提升步骤实现(这里同样假设 $s\neq 0$)

$$\begin{bmatrix}c & -s\\ s & c\end{bmatrix}=\begin{bmatrix}1 & (c-1)/s\\ 0 & 1\end{bmatrix}\begin{bmatrix}1 & 0\\ s & 1\end{bmatrix}\begin{bmatrix}1 & (c-1)/s\\ 0 & 1\end{bmatrix} \qquad (8)$$

这对应于几何中众所周知的事实:一个旋转总可以分解为三个剪切。这个分解方案有两个优点[7]:第一,实数乘法的数量由4个减少到3个(尽管实数加法由2个增加到3个);第二,它允许对提升系数以及每次乘法的结果进行量化,而不会破坏PR特性。
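下面用一段Python代码示意式(8)的三个提升(剪切)步骤如何完成一次复数旋转乘法(假设 s≠0;这只是浮点数下的数学草图,实际硬件中系数经过量化并以定点实现):

```python
import math

def lifting_rotate(x, y, theta):
    """用式(8)的三个提升步骤计算 (c+js)*(x+jy),其中 c=cos(theta), s=sin(theta)。"""
    c, s = math.cos(theta), math.sin(theta)
    u = (c - 1.0) / s        # 提升系数 u=(c-1)/s,可预先算好存入ROM
    x = x + u * y            # 第一个剪切: [1 u; 0 1]
    y = s * x + y            # 第二个剪切: [1 0; s 1]
    x = x + u * y            # 第三个剪切: [1 u; 0 1]
    return complex(x, y)     # 结果等于 (c+js)*(x+jy)
```

例如,lifting_rotate(1.0, 0.0, math.pi/4) 约为 0.7071+0.7071j,即把(1, 0)旋转45°;整个过程只用了三个实数乘法。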

3.FFT处理器设计

A.结构

本文的研究目标是通过优化结构实现和简化硬件设计来提高FFT处理器的性能。在这里,我们通过优化内存映射的设计和旋转因子的产生来提高数据通路的吞吐量。除了在该处理器中实现的提升方案之外,我们还提出了一种硬件成本更低、计算精度更高的自适应溢出控制方法。

图1 FFT处理器分层框图

图1显示了该FFT处理器的层次框图,它由以下几个部分组成:用于存储数据的四个2端口Block RAM;每周期为I/O端口输出四个读地址和四个写地址的地址产生器;由提升功能单元和旋转因子产生器组成、每周期可以计算一个基-4蝶形的蝶形运算单元;将数据重新排列成正确顺序的数据交换单元;以及根据蝶形运算结果动态控制定点操作数移位的溢出控制器。各功能单元的具体实现将在接下来的小节中讨论。

B.地址映射方法

基-4 FFT的每个蝶形单元每周期需要4个操作数并产生4个结果,这说明数据的并行访问是系统效率的关键问题。D. Cohen[8]和Y. T. Ma[9]等人提出了几种地址映射方法,但这些方法并不适合单一蝶形单元的结构。在本文中,我们在[2]的基础上说明了一种具有原位(in-place)内存寻址策略的地址映射与产生方法,并在基-4 DIF FFT中实现了该方法。现在讨论该方法的地址映射。

每个操作数的地址可以表示为

$$A=\sum_{i=0}^{r-1}4^{i}A(i),\quad A(i)\in\{0,1,2,3\},\ r=\log_4 N \qquad (9)$$

另一种形式为

$$A=4\sum_{i=1}^{r-1}4^{i-1}A(i)+A(0) \qquad (10)$$

假设

$$B=4\sum_{i=1}^{r-1}4^{i-1}A(i)+\Big[\sum_{i=0}^{r-1}A(i)\Big]\bmod 4 \qquad (11)$$

显然,在区间[0, N)内B是A的一一映射。令 $m=\sum_{i=1}^{r-1}4^{i-1}A(i)$,$b=\big[\sum_{i=0}^{r-1}A(i)\big]\bmod 4$,从而B可以用m和b表示为

$$B=4m+b \qquad (12)$$

可以发现,地址A被映射到4个存储体中,其中b是存储体编号,m是体内地址;也就是说,基-4蝶形运算的四个操作数被映射到不同的存储体,以避免访问冲突。
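式(9)–(12)的映射可以用如下Python草图说明(假设地址A用整数表示,其基-4数字即A(i);仅作示意,并非硬件实现):

```python
def map_address(A, r):
    """按式(9)-(12)把线性地址A映射为(存储体编号b, 体内地址m)。"""
    digits = [(A >> (2 * i)) & 0x3 for i in range(r)]   # 基-4数字 A(0)..A(r-1)
    b = sum(digits) % 4      # 存储体编号 b = [ΣA(i)] mod 4
    m = A >> 2               # 体内地址 m,相当于去掉最低的基-4数字A(0)
    return b, m
```

例如,第p级蝶形的四个操作数地址只在第(r−p−1)个基-4数字上相差0~3,因而它们的数字之和模4互不相同,即被映射到四个不同的存储体。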

图2 地址产生器的实现

C.地址产生

考虑到地址映射方法和基-4 DIF算法的特点,我们提出一种易于在硬件中实现和扩展的地址产生方法。下面简要描述实现细节。对于第p级(p=0,1,…,r−1)的蝶形运算,四个操作数的地址只有第(r−p−1)个基-4数字不同,分别为0、1、2和3。这些操作数在存储体内的地址可表示为:

$$m=\sum_{i=1}^{r-p-2}4^{i-1}A(i)+4^{r-p-2}A(r-p-1)+\sum_{i=r-p}^{r-1}4^{i-1}A(i) \qquad (13)$$

分别令 A(r−p−1)=0,1,2,3,则

$$m_0=\sum_{i=1}^{r-p-2}4^{i-1}A(i)+\sum_{i=r-p}^{r-1}4^{i-1}A(i) \qquad (14a)$$

$$m_1=m_0+1\cdot 4^{r-p-2} \qquad (14b)$$

$$m_2=m_0+2\cdot 4^{r-p-2} \qquad (14c)$$

$$m_3=m_0+3\cdot 4^{r-p-2} \qquad (14d)$$

为了快速计算上述方程中的地址m,在硬件实现中用一个16位宽的计数器来表示

$$\sum_{i=r-p}^{r-1}4^{i-2}A(i)+\sum_{i=1}^{r-p-2}4^{i-1}A(i)$$

然后在计数器的第(r−p−1)位插入A(r−p−1)得到 $m_j$。这需要三个16位寄存器:计数器C、移位寄存器M和结果寄存器R。如图2所示,C初始化为零,每完成一次蝶形计算加一;M也初始化为零,每级运算后右移2位,且高2位填入11;R的初始值为 $m_0$。依照(14),给该位置赋以不同的值即可得到另外三个地址。
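式(14)中由蝶形计数值产生四个体内地址的过程,可以用下面的Python草图说明(这里以纯算术方式在计数值的第(r−p−2)个基-4数字处插入0~3;假设0≤p≤r−2,最后一级的四个操作数m相同、只在存储体编号上不同;寄存器级的C/M/R实现见图2):

```python
def inbank_addresses(C, p, r):
    """式(13)(14)的算术示意:由蝶形计数值C(0..N/4-1)和级数p产生m0..m3。
    假设0 <= p <= r-2;C的低(r-p-2)个基-4数字为A(1)..A(r-p-2),其余高位为A(r-p)..A(r-1)。"""
    w = r - p - 2                           # 插入数字A(r-p-1)所在的基-4位置
    low, high = C % (4 ** w), C // (4 ** w)
    return [low + j * 4 ** w + high * 4 ** (w + 1) for j in range(4)]
```

硬件中则如正文所述,用计数器C、移位寄存器M和结果寄存器R以移位、拼接的方式完成同样的数字插入。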

这里存储体编号b的取模运算可以用半加器来实现;由于基-4 FFT中A(i)的宽度为2位,我们可以用两个半加器来计算b。

D.提升结构的硬件实现

如上一小节所述,一个复数乘法可以分解为4个实数乘法。一般来说,并行执行复数乘法需要更多的硬件资源,例如乘法器和寄存器;另一方面,串行执行模式会消耗更多周期,但能节省一些硬件成本。

图3 提升方案流水线结构

考虑到FFT中旋转因子模为1的事实,我们采用提升方案代替常规方法来实现复数乘法[7]。注意,三个乘法必须按照提升方案(8)所描述的顺序依次执行。

在这里,我们引入一种流水线结构:它按照(8)依次实现提升方案的各个步骤,并采用流水线方式设计,以避免限制时钟频率。图3显示了这个提升方案的流水线结构(此处 u=(c−1)/s)。

上述基于提升方案的设计需要三个乘法器和三个加法器,而直接计算方法需要四个乘法器和两个加法器。还应注意,这里需要在ROM中预先存储(c−1)来代替c,这不会带来额外的时间开销,因为(c−1)/s和c都可以预先计算好并存储在ROM中。

E.自适应溢出控制

一般来说,FFT处理器借助于限制数据通路的位宽来处理溢出问题,即直接截断数据的溢出部分;由于精度较差,这种方法不适合大多数应用。一种动态溢出控制方法是根据第i级的计算结果选择第i+1级的输入模式。这种方法可以保证整个计算过程中不发生溢出,然而它需要中断计算流水线,效率较低。在本文中,我们提出一种不用中断流水线的新方法:根据第i级的回写结果选择第i+1级的回写模式。因此,这种溢出控制方法使整个计算过程中的流水线效率接近100%,且右移操作更少。

图4 FFT处理器流水线图

从图4中可以清楚地看出,整个流水线包括四个阶段:读出数据、蝶形运算、溢出处理和回写数据。注意,每一级的回写阶段与下一级的蝶形计算阶段重叠,如图4中的阴影部分所示。

不同于传统效率较低的溢出处理方法,本文实现的溢出控制方法不需要中断计算流水线。检查回写操作数的高2位:如果下一级的结果可能发生溢出,则在下一级的回写周期将操作数右移1~3位。如果所有操作数高2位的最大值为二进制的00,则需要右移1位以防止提升运算溢出;如果最大值为01,则需要右移2位(一位用于4个操作数相加的溢出,另一位用于提升运算);如果最大值为二进制的10或11,则需要右移3位(其中2位用于加法溢出)。

右移的位数被累加到一个寄存器中,用于确定最终FFT输出的指数值。我们观察到,这种方法易于在硬件中实现,且右移操作更少。
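下面的Python草图示意上述根据回写操作数高2位决定右移位数、并累加指数的规则(这里把定点数简化为带符号整数、以其绝对值的高2位作判断;仅为行为示意,并非硬件实现):

```python
def shift_for_next_level(operands, width=16):
    """按第3.E节的规则,由本级回写操作数决定下一级回写时的右移位数。"""
    top2 = max((abs(v) >> (width - 2)) & 0x3 for v in operands)   # 各操作数高2位的最大值
    if top2 == 0b00:
        return 1      # 防止提升运算溢出
    if top2 == 0b01:
        return 2      # 1位用于4操作数相加的溢出,1位用于提升运算
    return 3          # 高2位为10或11:其中2位用于加法溢出(见正文)

# 每级把右移位数累加到指数寄存器,用于确定最终输出的指数:
# exponent += shift_for_next_level(writeback_operands)
```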

4.仿真和比较

表1 与其他定点FFT处理器的性能比较

通过优化这个FFT处理器的映射结构,我们简化了硬件复杂度并提高了性能。第三部分描述的结构用Verilog代码实现,并使用Xilinx ISE 5.2i在Virtex-II Pro 30 FPGA上综合。我们得到了一个时钟频率约为127MHz的FFT处理器软核,它可以在10.1μs内完成16位复数的1024点FFT计算。表1给出了该FFT处理器的一些仿真结果以及与当前其他定点FFT处理器的比较。

从表1中,我们观察到本文设计的FFT处理器比大多数可用的FFT处理器有更好的性能。

5.总结

为了在FPGA上开发一个高效的FFT处理器设计,我们优化了存储系统的结构,简化了硬件实现,并提高了数据通路的吞吐量。同时,我们提出了一种不用中断流水线运算的自适应溢出控制方法,在保持较高计算精度的同时提高了处理器的效率。

该设计使用ISE 5.2i在Xilinx XCV2P30 FPGA芯片上综合实现。时钟频率达到127MHz,因此这个FFT处理器可以在10.1μs内完成16位宽的1024点复数FFT运算,性能优于大多数现有的FFT处理器。

附录 外文翻译-原文部分

Design of a High Performance FFT Processor Based on FPGA

Abstract—The design method of a real-time FFT processor is presented. By optimizing the algorithm of memory mapping and the generation of twiddle factors, a radix-4 butterfly can be calculated in one clock cycle. An approach to adaptive overflow control is also introduced to avoid overflow without interrupting the computing pipeline. The design is implemented on an FPGA chip and achieves an operating frequency of 127 MHz. It can complete a complex 1024-point FFT within 10.1 μs.

Index Terms —FFT processor, FPGA, address generation, overflow control

I. INTRODUCTION

The Discrete Fourier Transform (DFT), which facilitates the efficient transformation between the time domain and the frequency domain for a sampled signal, is one of the most used algorithms in digital signal processing. Among these applications the long-length DFT computation is rather time-consuming, hence a high performance Fast Fourier Transform (FFT) processor is necessary to meet many real-time requirements, e.g., DVB-T and DAB. But at present most commercial and academic FFT processors take dozens of μs to compute a 1024-point complex transform [1]; therefore the design of a high performance FFT processor becomes a key issue for real-time systems. At present there are two approaches to improve the performance of FFT processors: raising the working frequency and enhancing functional parallelism [2][3]. With highly parallel computing units, the speed of data access becomes the bottleneck of the system [4].

In the past twenty years, Field Programmable Gate Arrays (FPGA) and Programmable Logic Devices (PLD) have developed rapidly, and at the current stage digital signal processors based on FPGA are applied in most areas of signal processing. Compared with the traditional ASIC design flow, design based on FPGA has the advantages of flexibility and shorter time to market.

In this paper, we develop a soft-core design to implement an FFT processor based on FPGA. The next section reviews the necessary background used in the development of the paper, including the FFT algorithm and the lifting scheme. Section III presents the architecture of the fixed-point FFT processor with overflow control. By optimizing the algorithm of memory mapping and the generation of twiddle factors, this FFT processor achieves a better data throughput and can calculate a radix-4 butterfly in one clock cycle. Furthermore, we propose an overflow control approach according to the result of the butterfly calculation without interrupting the pipeline of the calculation unit. Section IV discusses simulation results of the FFT processor on an FPGA chip and a comparison with other designs. Finally, Section V concludes the whole paper.

II. ALGORITHM

A. FFT Algorithm

The DFT of an N-point data sequence {x(n)}, n = 0, 1, ..., N−1, is defined as

$$X(k)=\sum_{n=0}^{N-1}x(n)W_N^{nk},\qquad k=0,1,\ldots,N-1 \qquad (1)$$

where $W_N=e^{-j2\pi/N}$ and N = 2^n.

Pursuant to (1), direct computation of an N-point DFT requires O(N^2) complex multiplications and complex additions, which is a large-scale computation. In 1965, J. W. Cooley and J. W. Tukey proposed a fast DFT algorithm with the property of an in-place memory addressing strategy, which is convenient to implement in hardware; therefore, most current DFT processors are based on this algorithm [5].

There are various algorithms to implement the FFT, such as radix-2, radix-4 and split-radix. The radix-2 algorithm is the simplest one, but it requires more additions and multiplications than radix-4. Although more efficient than radix-2, radix-4 can only process 4^n-point FFTs. Split-radix has the advantage of less computation than radix-2 and radix-4, but its L-shaped butterfly function unit is difficult to implement. Taking account of algorithm complexity and hardware cost, in this paper we implement the radix-4 algorithm in DIF (Decimation In Frequency). Sampled at N = 4^r points, (1) is decimated in frequency according to this scheme:

$$X(k)=\sum_{n=0}^{N/4-1}x(n)W_N^{nk}+\sum_{n=N/4}^{N/2-1}x(n)W_N^{nk}+\sum_{n=N/2}^{3N/4-1}x(n)W_N^{nk}+\sum_{n=3N/4}^{N-1}x(n)W_N^{nk} \qquad (2)$$

Letting k = 4m, k = 4m+1, k = 4m+2 and k = 4m+3, where m = 0, 1, ..., N/4−1, we get

$$X(4m)=\sum_{n=0}^{N/4-1}\big[(x(n)+x(n+N/2))+(x(n+N/4)+x(n+3N/4))\big]\,W_{N/4}^{nm} \qquad (3a)$$

$$X(4m+2)=\sum_{n=0}^{N/4-1}\big[(x(n)+x(n+N/2))-(x(n+N/4)+x(n+3N/4))\big]\,W_N^{2n}\,W_{N/4}^{nm} \qquad (3b)$$

$$X(4m+1)=\sum_{n=0}^{N/4-1}\big[(x(n)-x(n+N/2))-j(x(n+N/4)-x(n+3N/4))\big]\,W_N^{n}\,W_{N/4}^{nm} \qquad (3c)$$

$$X(4m+3)=\sum_{n=0}^{N/4-1}\big[(x(n)-x(n+N/2))+j(x(n+N/4)-x(n+3N/4))\big]\,W_N^{3n}\,W_{N/4}^{nm} \qquad (3d)$$

The above equations denote the basic DIF operation of the radix-4 FFT, which is generally called the butterfly calculation. Hence, the whole computational complexity of this algorithm is reduced to O(N log_4 N).

B. Lifting Scheme

Initially, the lifting scheme was proposed as a tool for constructing wavelets and Perfect Reconstruction (PR) filter banks [6]. The resulting transforms are shown to be simple and invertible, even though the lifting coefficients are quantized and power-adaptable. Since all the twiddle factors appearing in the FFT structure are complex numbers with magnitude one, every factor can be represented as $e^{j\theta}$, where θ is some real number. Then, a twiddle factor W in complex form can be expressed as

$$W=\cos\theta+j\sin\theta=c+js \qquad (4)$$

In general, a complex multiplication is decomposed into four real multiplications. Here, let another complex number be X = x + jy; then the product of these two complex numbers is

$$P=W\cdot X=(c+js)(x+jy) \qquad (5)$$

In vector-matrix form, for convenience,

$$P=\begin{bmatrix}1 & j\end{bmatrix}\begin{bmatrix}c & -s\\ s & c\end{bmatrix}\begin{bmatrix}x\\ y\end{bmatrix} \qquad (6)$$

Assuming that $c^2+s^2=1$ with $s\neq 0$ (if s = 0, then W = 1 or −1, which need not be quantized), by decomposing into three lifting steps we can get

$$\begin{bmatrix}c & -s\\ s & c\end{bmatrix}=\begin{bmatrix}1 & 0\\ s/c & 1\end{bmatrix}\begin{bmatrix}1 & -sc\\ 0 & 1\end{bmatrix}\begin{bmatrix}c & 0\\ 0 & 1/c\end{bmatrix} \qquad (7)$$

We can also do it without scaling, with three lifting steps, as (here assuming s ≠ 0 too)

$$\begin{bmatrix}c & -s\\ s & c\end{bmatrix}=\begin{bmatrix}1 & (c-1)/s\\ 0 & 1\end{bmatrix}\begin{bmatrix}1 & 0\\ s & 1\end{bmatrix}\begin{bmatrix}1 & (c-1)/s\\ 0 & 1\end{bmatrix} \qquad (8)$$

This corresponds to the well-known fact in geometry that a rotation can always be written as three shears. There are two advantages of this factorization scheme [7]. First, the number of real multiplications is reduced from four to three, although the number of real additions is increased from two to three. Second, it allows for quantization of the lifting coefficients and of the result of each multiplication without destroying the PR property.

III. DESIGN OF FFT PROCESSOR

A. Architecture

The research objective of this paper is to enhance the performance of FFT processor by optimizing structure implementation and simplifying hardware design. Here, we improve the throughput of data path by optimizing the design of memory mapping and the generation of twiddle factor. In addition to lifting scheme implemented in this processor, we propose an adaptive overflow control approach with less hardware cost and higher computing precision.

Fig. 1 shows the hierarchical block diagram of this FFT processor, which consists of several parts: four 2-port Block RAMs to store data; an address generator which outputs four read addresses and four write addresses for the I/O ports per cycle; a butterfly calculation unit which comprises the lifting function unit and the generator of twiddle factors and can calculate one radix-4 butterfly per cycle; data interchange units which rearrange data into the right sequence; and an overflow controller which dynamically controls the shift of the fixed-point operands according to the result of the butterfly calculation. The detailed implementations of the function units will be discussed in the next subsections.

B. Method of Address Mapping

Each butterfly unit of the radix-4 FFT needs four operands per cycle and then produces four results, which shows that parallel access to data is a crucial issue for system efficiency. D. Cohen [8] and Y. T. Ma [9], among others, proposed several approaches to address mapping, which are not appropriate to the structure of a single butterfly unit. In this paper, we illustrate a method of address mapping and generation with an in-place memory addressing strategy on the basis of [2] and implement this method in the DIF radix-4 FFT. Now let us discuss the address mapping of this method.

The address of every operand can be expressed as

$$A=\sum_{i=0}^{r-1}4^{i}A(i),\quad A(i)\in\{0,1,2,3\},\ r=\log_4 N \qquad (9)$$

or, in another form,

$$A=4\sum_{i=1}^{r-1}4^{i-1}A(i)+A(0) \qquad (10)$$

Assume that

$$B=4\sum_{i=1}^{r-1}4^{i-1}A(i)+\Big[\sum_{i=0}^{r-1}A(i)\Big]\bmod 4 \qquad (11)$$

It is obvious that address B corresponds to address A one-to-one in the interval [0, N). Let $m=\sum_{i=1}^{r-1}4^{i-1}A(i)$ and $b=\big[\sum_{i=0}^{r-1}A(i)\big]\bmod 4$. Thus B can be replaced by m and b as

$$B=4m+b \qquad (12)$$

It can be found that address A is mapped into 4 memory banks, where b is the bank number and m is the address within the bank; that is to say, the four operands of a radix-4 butterfly are mapped into different memory banks to avoid conflicts.
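As an illustrative self-check (not part of the original design, and assuming addresses are plain integers whose base-4 digits are the A(i)), the following Python snippet verifies on a small example that the four operands of every radix-4 DIF butterfly fall into four distinct banks under the mapping of (9)-(12):

```python
def bank(A, r):
    """Bank number b = [sum of base-4 digits of A] mod 4, as in (11)-(12)."""
    return sum((A >> (2 * i)) & 0x3 for i in range(r)) % 4

N, r = 64, 3                              # a small 64-point example
for p in range(r):                        # stage index
    stride = 4 ** (r - p - 1)             # operands differ in base-4 digit (r-p-1)
    for base in range(N):
        if (base // stride) % 4 == 0:     # base address has that digit equal to 0
            addrs = [base + j * stride for j in range(4)]
            assert len({bank(A, r) for A in addrs}) == 4   # four distinct banks
```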

C. Address Generator


Considering the features of the address mapping method and the radix-4 DIF algorithm, we propose an approach to address generation which is easy to implement and extend in hardware. We briefly describe the details of the implementation now. For the butterfly calculations of level p (p = 0, 1, ..., r−1), only the (r−p−1)-th base-4 digit of the four operand addresses differs, taking the values 0, 1, 2 and 3. The in-bank addresses of these operands are described as:

$$m=\sum_{i=1}^{r-p-2}4^{i-1}A(i)+4^{r-p-2}A(r-p-1)+\sum_{i=r-p}^{r-1}4^{i-1}A(i) \qquad (13)$$

Let A(r−p−1) = 0, 1, 2, 3 respectively; then

$$m_0=\sum_{i=1}^{r-p-2}4^{i-1}A(i)+\sum_{i=r-p}^{r-1}4^{i-1}A(i) \qquad (14a)$$

$$m_1=m_0+1\cdot 4^{r-p-2} \qquad (14b)$$

$$m_2=m_0+2\cdot 4^{r-p-2} \qquad (14c)$$

$$m_3=m_0+3\cdot 4^{r-p-2} \qquad (14d)$$

In order to compute the addresses m of the above equations quickly, a 16-bit-wide counter is implemented in hardware to denote

$$\sum_{i=r-p}^{r-1}4^{i-2}A(i)+\sum_{i=1}^{r-p-2}4^{i-1}A(i)$$

and then we insert A(r−p−1) into the (r−p−1) position of the counter to get $m_j$. This needs three 16-bit-wide registers: counter C, shift control register M and result register R. As shown in Fig. 2, C is set to zero initially and incremented by one after every butterfly calculation; M is also set to zero initially and right-shifted by two bits after every level of operation, with the high two bits filled with 11; R has the initial value $m_0$. Pursuant to (14), we can get the other three addresses by setting the corresponding values at the (r−p−1) position.

Here, the modulus operation for the bank number b can be implemented using half-adders; since A(i) is 2 bits wide for the radix-4 FFT, we can compute b with two half-adders.

D. Hardware Implementation of Lifting Structure

As mentioned in the previous subsection, a complex multiplication can be decomposed into 4 real multiplications. Generally speaking, executing complex multiplications in parallel mode needs more hardware resources, such as multipliers and registers; on the other hand, execution in sequential mode costs more cycles but saves some hardware.

Considering the fact that the twiddle factors in the FFT have magnitude one, we implement complex multiplication in terms of the lifting scheme instead of the conventional method [7]. Notice that the three multiplications must be executed in sequence, as described by the lifting scheme in (8).

Here, we introduce a pipeline structure: it implements the lifting steps of (8) in sequence and is designed in pipeline mode to avoid limiting the clock frequency. Fig. 3 shows the pipeline structure of this lifting scheme (here, u = (c−1)/s).

The above design based on the lifting scheme needs three multipliers and three adders, instead of the four multipliers and two adders of the direct calculation method. It should also be noted that (c−1) needs to be stored in ROM in advance instead of c, which incurs no additional timing cost since both (c−1)/s and c can be pre-computed and stored in ROM.

E. Adaptive Overflow Control

In general, FFT processors deal with the overflow problem by limiting the datapath width, which directly truncates the overflowing part of the data; hence this method is not appropriate for most applications because of its poor precision. A dynamic approach to overflow control can be achieved by selecting the input mode of the (i+1)-th level according to the calculation result of the i-th level. This approach ensures that no overflow takes place in the whole calculation period; however, it needs to interrupt the computing pipeline, resulting in low efficiency. In this paper, we propose a novel approach without interrupting the pipeline: selecting the writeback mode of the (i+1)-th level according to the writeback result of the i-th level. Therefore, this overflow control method keeps the pipeline efficiency close to 100% over the whole calculation period and requires fewer right-shift operations.

It is clear from Fig. 4 that the whole pipeline consists of four stages: readout of data, butterfly calculation, overflow processing and writeback of data. Notice that the writeback stage of every level overlaps with the butterfly calculation stage of the next level, as shown in the hatched area of Fig. 4.

Different from the traditional overflow control methods with lower efficiency, the overflow control method implemented in this paper does not need to interrupt the calculation pipeline. The top 2 bits of the writeback operands are checked: if there is a possibility that the results of the next level overflow, the shifter right-shifts the operands by 1 to 3 bits in the next level's writeback period. If the maximum of the top 2 bits over all operands is equal to 00 in binary, a 1-bit right shift is needed to prevent overflow of the lifting scheme; if the maximum is 01, a 2-bit right shift is needed (one bit for the addition overflow of the 4 operands, another for the lifting scheme); if the maximum is 10 or 11 in binary, a 3-bit right shift is needed (two bits for the addition overflow).

The number of right shifts is accumulated in a register to determine the exponent value of the final FFT output. We observe that this approach is easy to implement in hardware and requires few right-shift operations.

IV. SIMULATION AND COMPARISON

By optimizing the mapping structure of this FFT processor, we simplify the hardware complexity and improve the performance. The architecture described in Section III is implemented in Verilog. The design is synthesized using Xilinx ISE 5.2i on a Virtex-II Pro 30 FPGA. We obtain a soft-core FFT processor with a clock frequency of about 127 MHz, which can complete a 1024-point FFT on 16-bit complex data in 10.1 μs. Some simulation results of the FFT processor and a comparison with other current fixed-point FFT processors are shown in TABLE I.

From TABLE I, we observe that the FFT processor designed in this paper performs better than most of the available FFT processors.

V. CONCLUSION

To develop an efficient design for an FFT processor on FPGAs, we optimize the structure of the memory system, simplify the hardware implementation and improve the throughput of the datapath. At the same time, we propose an adaptive method of overflow control without interrupting the calculation pipeline, which improves the efficiency of the processor while maintaining high calculation precision.

The design is synthesized and implemented on a Xilinx XCV2P30 FPGA chip with ISE 5.2i. The clock frequency reaches 127 MHz; thus this FFT processor can finish a 1024-point complex FFT of 16-bit width in 10.1 μs, which is better than most of the available FFT processors.

ACKNOWLEDGEMENTS

This work was supported by the National High Technology Research & Development Program of China (Grant No: 2001AA111090) and National Natural Foundation of China (Grant No: 60303017).

REFERENCES

[1]B. M. Baas, “An approach to low power, high performance, fast Fourier transform processor design,” [PhD Thesis], Stanford University, 1999.

[2] C. H. Sung, K. B. Lee, and C. W. Jen, “Design and implementation of a scalable fast Fourier transform core,” ASIA-Pacific Conference on ASICs, pp. 295-298, 2002.

[3] L. G. Johnson, “Conflict free memory addressing for dedicated FFT hardware,” IEEE Trans. Circuits and Sys. II, Vol. 39, pp. 312-316, May 1992.

[4]Y. K. Xie and B. Fu, “Design and implementation of high throughput FFT processor,” Journal of Computer Research and Development, Vol. 41, No. 6, pp. 1022-1029, 2004.

[5] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Math. of Comp., Vol. 19, No. 90, pp. 297-301, April 1965.

[6] I. Daubechies and W. Sweldens, “Factoring wavelet transforms into lifting steps,” Bell Labs, Lucent Technol., Red Bank, NJ, 1996.

[7]S. Oraintara, Y. J. Chen, and T. Q. Nguyen, “Integer fast Fourier transform,” IEEE Trans. Acoustics, Speech, Signal Processing, Vol. 50, No. 3, pp. 604-618, March 2002.

[8]D. Cohen, “Simplified control of FFT hardware,” IEEE Trans. Acoustics, Speech, Signal Processing, Vol. ASSP-24, pp. 577-579, 1976.

[9]Y. T. Ma, “An effective memory addressing scheme for FFT processors,” IEEE Trans. Signal Processing, Vol. 47, No. 3, pp. 907-911, 1999.

[10] “FFT MegaCore function user guide,” Version 1.02, Altera Inc., March 2001.

[11] “High-performance 1024-point complex FFT/IFFT,” Version 2.0, Xilinx Inc., July 2000.

[12] B. M. Baas, “FFT Processor Chip Info Page,” available at http://www-star.stanford.edu/~bbaas/fftinfo.html, 2004.
