2_Fast Montgomery Modular Multiplication and RSA

更新时间：2023-06-05 01:21:01 阅读量：实用文档文档下载

说明：文章内容仅供预览，部分内容可能不全。下载后的文档，内容与下面显示的完全一致。下载之前请确认下面内容是否您想要的，是否完整无缺。

2fast4you推荐度：
相关推荐

RSA paper

Fast Montgomery Modular Multiplication and RSA

Cryptographic Processor Architectures

Ciaran McIvor, Máire McLoone, John V McCanny

The Institute of Electronics, Communications and Information Technology

School of Electrical and Electronic Engineering

The Queen’s University of Belfast, Stranmillis Road, Belfast, BT9 5AH

E-mail: c.mcivor@ee.qub.ac.uk, maire.mcloone@ee.qub.ac.uk, j.mccanny@ee.qub.ac.uk

Alan Daly, William Marnane

Department of Electrical and Electronic Engineering

University College Cork

Ireland

E-mail: aland@rennes.ucc.ie, l.marnane@ucc.ie

Abstract - New, generic silicon architectures for implementing Montgomery’s multiplication algorithm are presented. These use Carry Save Adders (CSAs) to perform the large word length additions required by this algorithm when used for RSA encryption and decryption. It is shown that using a four-to-two CSA with two extra registers rather than a five-to-two CSA leads to a useful reduction in the critical path of the multiplier, albeit at the expense of a small increase in circuitry. For operand lengths of 1536-bits and greater, the percentage gain in data throughput rate outweighs the percentage increase in silicon area. Moreover, for a 2048-bit operand length, typical of what is required in many future generation applications, the gain in data throughput is 27.9% compared with a 9.9% increase in area. The practical application of this approach has been demonstrated by applying this to the design of RSA processor architectures with 512-bit and 1024-bit key sizes. The resulting Montgomery multiplier and RSA processor performance results presented are the fastest reported to date in the literature.

I. INTRODUCTION

Public-key cryptosystems are important as they provide confidentiality, authentication, data integrity and non-repudiation [1,2]. RSA [3] is the most widely used public-key cryptosystem. An RSA operation is a modular exponentiation, which requires repeated modular multiplications. For security reasons RSA operand sizes need to be 1024-bits or greater in length meaning that high data throughput rates are difficult to achieve.

The Montgomery multiplication algorithm [4,5] is the most efficient modular multiplication algorithm available. It replaces trial division by the modulus with a series of additions and divisions by a power of 2. Thus, it is well suited to hardware implementations. For this reason, this algorithm has formed the basis of many of the RSA hardware architectures reported to date [6,7,8].

To date, several techniques have been proposed in order to avoid carry propagation during the addition stages of the algorithm as this is a key factor in determining performance. One approach proposed by Elbirt and Paar [7] is to break these additions into x-bit stages, where x is an optimal bit length

chosen to take advantage of the fast carry chains available on modern FPGAs. However, an inevitable drawback of this approach is that architectures developed can be very heavily technology and implementation dependent. For example, it is unlikely that a design developed for a specific FPGA family will show the same speed advantages if migrated to a modern ASIC technology or indeed an alternative type of FPGA or PLD . An alternative approach presented by Blum and Paar [8] is based on the use of FPGA systolic array multiplier architectures with varying processing element sizes (i.e. 4-bits, 8-bits and 16-bits). However, these systems have again been tailored specifically for the Xilinx XC4000 FPGA series, limiting, we believe, the broader application of the architectural ideas presented.

In this paper we introduce new, generic Montgomery multiplier and associated RSA processor architectures, which can be rapidly reconfigured to accommodate varying operand sizes and are suitable for implementation across a wide range of technology (FPGA, PLD, ASIC) options. These architectures are therefore ideally suited to a multitude of applications, including those that require a very high level of security to those, which only need relatively little protection over a shorter time period. As will be discussed, the designs created are highly competitive and we believe produce the best performance results obtained to date.

As will be discussed, long carry propagation can be avoided by using carry save adders (CSAs) to perform the addition stages of Montgomery’s algorithm. The inherently parallel structure of these adders means that a shallow combinational logic depth is guaranteed independent of the target technology onto which the CSAs are implemented. This also makes them ideal for very long integer arithmetic, a key requirement for RSA implementation. As is discussed in Section , we propose and compare two different multiplier architectures, one using a five-to-two CSA and the other a four-to-two CSA with two additional registers. Each multiplier can perform a Montgomery multiplication in only k+1 and k+2 clock cycles respectively, where k is the operand bit length. It is shown that

RSA paper

the latter architecture outperforms the former in terms of data so that they can employ different CSA variants when throughput rate, but vice versa in terms of area, for operand implemented architecturally. The motivation is to produce sizes greater than 512-bits. However, the percentage increase in silicon circuits with a small critical path delay. the throughput rate outweighs the percentage increase in area

for very large operand lengths in excess of 1536-bits. Thus, the B. Five-to-two CSA Architecture

four-to-two CSA architecture will become increasingly The first of these variants is based on a five-to-two CSA, attractive in the future when RSA key sizes of 2048-bits and which is used to calculate (4). This type of adder comprises

three levels of carry save logic, where each level of logic greater in length are needed to protect information.

As will be discussed in Section , an RSA processor using implements (5) and (6).

SUM = X1 xor X2 xor X3 (5) the five-to-two multiplier can carry out encryption in only

CARRY = (X1 and X2) or (X1 and X3) or (X2 and X3) (6) (k+2)(ek+3) clock cycles, where k is the modulus bit length and

ek is the public exponent bit length. Here the Fermat prime, The sum of the bit vectors SUM and CARRY is equal to the F4=216+1, is used as the public exponent. RSA decryption can sum of the three input bit vectors X1, X2 and X3. An outline be performed in (k/2+2)(dk/2+3) clock cycles, where dk is the diagram of the five-to-two CSA operation is shown in Fig. 1. private exponent bit length, also using the five-to-two multiplier. The Chinese Remainder Theorem (CRT) technique [9] is used in this case.

The concepts developed have been verified and demonstrated through FPGA implementations, in this case using the Xilinx Virtex2 family of chips. The performance results of these Montgomery multiplier and RSA processor implementations are, to the authors’ knowledge, the best reported in the literature to date.

II. MONTGOMERY MULTIPLIER ARCHITECTURES

A. Montgomery Multiplication Background

Given an integer a < n, where n is the k-bit modulus, A is said to be its n-residue with respect to r if,

A = a*r (mod n) where r=2k(1)

Likewise, given an integer b < n, B is said to be its n-residue with respect to r if,

B = b*r (mod n) (2)

The Montgomery product of A and B can then be defined as,

Z = A*B*r-1 (mod n) (3)

where r-1

is the inverse of r, modulo n. The radix-2 version of Montgomery’s multiplication algorithm [10], which calculates the Montgomery product of Aand B, is summarised in the pseudo code below.

Algorithm : Montgomery Multiplication (A, B, n) S[0] = 0; for i in 0 to k-1 loop qi = (S[i]0 + Ai * B0) mod 2; S[i+1] = (S[i] + Ai * B + qi * n) div 2; end loop; return S[k]; The critical delay of Algorithm occurs during the calculation of the S values given by the three input addition, S[i+1] = (S[i] + AThe main contributing factor to this delay is the i * B + qi * n) (4) carry propagation resulting from the very large operand additions. However, this can be avoided by using CSAs to calculate (4). This observation led us to re-write Algorithm in alternative formats, which we have defined as Algorithm and Algorithm in Sections .B and .C below. They are explicitly derived

The carry save representation of the five input operands is output from the register after only one clock cycle using this

approach. Algorithm can be re-written as Algorithm , given below. Note that the input operands A and B and the output product S are now represented in carry save format as A1 and A2, B1 and B2, and S1 and S2 respectively. CSR stands for carry save representation.

SUM CARRY Fig. 1. Block Diagram of Five-to-two CSA

Algorithm : Five-to-two CSA Montgomery Multiplication (A1, A2, B1, B2, n)

S1[0] = 0; S2[0] = 0; for i in 0 to k-1 loop qi = (S1[i]0 + S2[i]0) + (Ai * (B10 + B20)) mod 2; S1[i+1], S2[i+1] = CSR (S1[i] + S2[i] + Ai * (B1 + B2) + qi * n) div 2; end loop;

return S1[k], S2[k];

The critical delay of Algorithm occurs during the calculation of the five-to-two carry save addition, S1[i+1], S2[i+1] = CSR (S1[i] + S2[i] + Ai * (B1 + B2) + qi * n) (7)

RSA paper

The determination of the correct Ai values means that a full addition of the input operands A1 and A2 needs to be carried out. This is calculated “on-the-fly” using a barrel register full

A1 A2 B1 B2 n CLKRST

adder component (BRFA) as shown in Fig. 2.

Fig. 2. Barrel Register Full Adder Component

The correct Asignificant bit of the i value is ready when needed by adding the least

A1 and A2 input operands using a single

full adder. Once the current Ai value has been used, the A1 and

A2 operands are barrel shifted one place to the right so that the next Acycles are required to complete a full addition of the i value can be determined. In this way no extra clock A1 and A2input operands. As a result, (7) can be computed in a single clock cycle. Also, an extra one clock cycle is required to reset the signals S1[0] and S2[0] to zero at the beginning of each multiplication. Thus, Algorithm can be executed in only k+1 clock cycles, where k is the bit length of the operands. Also, due to the parallel nature of the carry save adders, a high Montgomery multiplier data throughput rate can be achieved here.

In order to verify this approach, a silicon design based on this was captured generically in VHDL and implemented into the Xilinx Virtex2 series of FPGAs using varying bit lengths. Table provides performance results for these

implementations.

TABLE

PERFORMANCE RESULTS FOR FIVE-TO-TWO CSA MULTIPLIER Xilinx Area Device Length Speed Rate

(k) (MHz) (Slices) (Mb/s)

XC2V1500 512 105.57 4,962 105.36

XC2V6000 2048 55.19 20,984

55.16

It should be noted that an increase in the operand bit length by a factor of 4 from 512-bits to 2048-bits results in an increase of only 1.9 in the critical path delay through the multiplier. On the other hand, the area of the multiplier approximately doubles as the bit length doubles. Thus, the architectures are highly scalable and regular, which is a useful characteristic as it ensures that the multiplier design will be of use in the future when operand sizes will need to be greater than 2048-bits in length. It should also be noted that the area figures presented include all the I/O registers required to store the large bit length operands, as shown in Fig. 3.

S1(k) S2(k) Fig. 3. I/O Interface of Five-to-two CSA Montgomery Multiplier For a 1024-bit operand length, the inputs are registered into the microchip at a rate of 32-bits per clock cycle. Thus, this process is completed after thirty-two clock cycles. Likewise, the outputs are clocked out of the chip 32-bits per clock cycle over thirty-two cycles. This is necessary due to the limited number of I/O pins available on currently available silicon chips. The same I/O interface technique, as shown in Fig. 3, is used for the four-to-two CSA multiplier architectures and the RSA processor architectures, which are detailed in Sections .C and .B respectively.

The approach described can be further extended to improve data throughput rate at the expense of silicon area. This is discussed in the next section.

C. Four-to-two CSA Architecture This architectural approach is based on the use of a four-to-two rather than a five-to-two CSA resulting in a saving of a full level of carry save logic, as shown in Fig. 4. Again, the carry save representation of the four input

operands is output from the register after only one clock cycle.

Therefore, by using the four-to-two CSA to calculate (4), Algorithm can be reformulated as Algorithm . SUM CARRY

Fig. 4. Block Diagram of Four-to-two CSA

RSA paper

Algorithm : Four-to-two CSA Montgomery Multiplication (A1, A2, B1, B2, n) D1, D2 = CSR (B1 + B2 + n + 0) S1[0] = 0; S2[0] = 0;

for i in 0 to k-1 loop

qi = (S1[i]0 + S2[i]0) + (Ai * (B10 + B20)) mod 2; if Ai = 0 and qi = 0 then

S1[i+1], S2[i+1] = CSR (S1[i] + S2[i] + 0 + 0) div 2; elsif Ai = 1 and qi = 0 then

S1[i+1], S2[i+1] =

CSR (S1[i] + S2[i] + B1 + B2) div 2; elsif Ai = 0 and qi = 1 then S1[i+1], S2[i+1] = CSR (S1[i] + S2[i] + n + 0) div 2;

else

S1[i+1], S2[i+1] =

CSR (S1[i] + S2[i] + D1 + D2) div 2; end if; end loop;

return S1[k], S2[k];

The main computation to be performed here is the four-to-two carry save addition,

S1[i+1], S2[i+1] = CSR (S1[i] + S2[i] + y + z) (8) The value of y and z in (8) depends on the value of both the Aand q. Indeed, the determination of i the state of both these signals simultaneously, represents extra i signals from Algorithm control logic in the form of a more complex multiplexer than is required for the five-to-two CSA case. This needs to be taken into account in determining the critical delay. However, as shown below, this delay is still less than that implied by Algorithm . This decrease in the critical path delay comes at the expense of two extra registers, D1 and D2. These are used to store the carry save representation of the input operands B1, B2 and n, which are added at the beginning of each new multiplication. D1 and D2 are possible values of y and z from (8). The calculation of D1 and D2 at the beginning of each multiplication means that an extra clock cycle is required to complete a full Montgomery multiplication. Also, the Arequired in Algorithm must again be computed “on-the-fly” i values using the BRFA component shown in Fig. 2. Thus, Algorithm can be executed in only k+2 clock cycles.

The four-to-two CSA design was also captured generically in VHDL and implemented into the Xilinx Virtex2 series of FPGAs for varying bit lengths. Table gives performance results for these implementations. These results can be directly compared with those given in Table for the five-to-two CSA multiplier. Fig. 5 provides a graphical representation of the comparisons in terms of percentage increase in data throughput rate and area when using the four-to-two CSA architecture rather than the five-to-two design.

For operand lengths greater than 512-bits, the four-to-two architecture has a consistently higher throughput rate than the five-to-two architecture. However, the percentage increase in area is higher than the percentage increase in throughput for operand lengths up to approximately 1536-bits.

TABLE

PERFORMANCE RESULTS FOR FOUR-TO-TWO CSA MULTIPLIER

Xilinx Area Device Length Speed (k) (MHz) (Slices) Rate

(Mb/s) XC2V1500 512 105.56 5,689

105.15 XC2V6000 1536 70.97 17,300 70.88

For operand lengths greater than this, however, the throughput benefits begin to outweigh the negative effects of increased area. In fact, for an operand length of 2048-bits, the percentage increase in throughput is 27.9% in comparison to a relatively small increase in area of 9.9%. The overall trend points towards a constant rise in the percentage throughput rate coupled with a decline in the percentage area increase as operand lengths increase. Therefore, it seems that the four-to-two CSA multiplier is of more use for RSA systems with key lengths in excess of 1536-bits and therefore becomes an increasingly attractive option in future RSA systems. For bit lengths less than this then the five-to-two CSA multiplier is the preferable option.

3025)

20%( tn15ecr10eP50-5

Operand Bit Length (k)

Fig. 5. % Throughput/Area Increase using 4-to-2 CSA instead of 5-to-2 CSA

The Montgomery multiplier performance results reported are, to the authors’ knowledge, the fastest reported in the literature to date. Also, the concepts outlined in this section can easily be applied to other Montgomery multiplier designs, which employ different addition techniques to calculate (4). The next section will demonstrate the practicality of the multiplier architectures through an implementation of the RSA cryptosystem for 512-bit and 1024-bit key sizes. For these key sizes, the five-to-two CSA multiplier is used to calculate the successive modular multiplications.

III. RSA PROCESSOR ARCHITECTURES

A. RSA and Modular Exponentiation

The RSA encryption and decryption functions are given by C=Me(mod n) and M=Cd(mod n) respectively, where M is a plaintext message block, C is a ciphertext block, n is the k-bit modulus, and e and d are the public and private exponents respectively. The equation ed=1(mod(p-1)(q-1)) must also hold, where p and q are two large prime numbers (~ k/2-bits in length) and n=pq.

RSA paper

Thus, an RSA operation is a modular exponentiation with operands satisfying the conditions stated above. A well known and widely used modular exponentiation algorithm is the square and multiply algorithm given as Algorithm V below, which computes M=Cd(mod n). dk is the bit length of the TABLE

CLOCK CYCLES TO COMPUTE ONE RSA EXPONENTIATION

Type of Operation Corresponding Algorithm 5 Number of Clock Cycles Number of Clock Cycles for private exponent d.

Algorithm 2k V: Modular Exponentiation (C, d, n) K = 2 (mod n); (computed externally) P[0] = MontMult (K, C, n); R[0] = MontMult (K, 1, n); for i in 0 to dk loop

P[i+1] = MontMult (P[i], P[i], n); if d[i] = 1 then

R[i+1] = MontMult (R[i], P[i], n) end if; end loop;

M = MontMult (1, R[k], n); return M;

B. The RSA Cryptoprocessor Architectures

By using the five-to-two CSA multiplier architecture to calculate the MontMult (Montgomery Multiplication) stages of Algorithm V, we can re-write this algorithm as Algorithm V, given below. As a result, we have been able to develop RSA processor architectures with very impressive data throughput rates.

Algorithm V: Five-to-two Multiplier Modular Exponentiation

(C, d, n) K = 22k

(mod n); (computed externally)

P1[0], P2[0] = 5to2_MontMult (K, 0, C, 0, n); R1[0], R2[0] = 5to2_MontMult (K, 0, 1, 0, n); for i in 0 to dk loop

P1[i+1], P2[i+1] =

5to2_MontMult (P1[i], P2[i], P1[i], P2[i], n); if d[i] = 1 then

R1[i+1], R2[i+1] =

5to2_MontMult (R1[i], R2[i], P1[i], P2[i], n) end if; end loop;

M1, M2 = 5to2_MontMult (1, 0, R1[k], R2[k], n); M = M1 + M2; return M;

An analysis of the total number of clock cycles required to compute a full RSA encryption and decryption operation is given in Table . Again, k is the bit length of the modulus and ek and dk are the bit lengths of the public and private exponents respectively. An extra clock cycle is now required to complete a full modular multiplication, as shown in Table . This is to allow the results of the intermediate multiplications and squarings in the for loop in Algorithm V to be registered and re-used as inputs to the next iteration of that loop. The pre- and post-processing operations are required to convert the operands

Calculations for Encryption

Decryption

Pre-P1[0],P2[0] processing R1[0],R2[0] k+2 k/2+2 For Loop

R1[i+1],R2[i+1]

ek(k+2) dk/2(k/2+2) Post-processing M1,M2

k+2 k/2+2 Addition

k+2 k/2+2 Total Number Clock Cycles:

(k+2)(ek+3)

(k/2+2)(dk/2+3)

in Algorithm V to and from n-residue format respectively, as required by the Montgomery multiplication algorithm described in Section . The final addition operation makes use of the BRFA component, as shown in Fig. 2, to perform a full addition of the carry save result, M1 and M2, of Algorithm V. By using the CRT technique [9], two half-size decryption operations can be computed in parallel. Hence, the amount of clock cycles required to compute this operation is decreased by approximately a factor of four, as shown in Table , thus increasing the potential data throughput rate of the RSA decryption function significantly.

RSA processor architectures based on the five-to-two CSA multiplier described were captured generically in VHDL and implemented using the Xilinx Virtex2 series of FPGAs (using the highest place and route effort level) for 512-bit and 1024-bit key sizes. Tables V and V provide performance results for these implementations. A public exponent of 216+1 and a k-bit private exponent were used during testing.

TABLE V

RSA ENCRYPTION RESULTS

Xilinx Bit Clock No. Throughput One RSA Device Length Speed CLB (k) (MHz) Slices Rate Encryption (Mb/s) (ms) XC2V3000 512 96.27 11,196 4.79 0.11

The RSA processor performance results reported here are, to the authors’ knowledge, the fastest reported in the literature to date. For instance, our 1024-bit encryption and decryption times are 3.4 and 3.9 times faster respectively than 1024-bit RSA encryption and decryption times recently reported by Blum and Paar [11]. They also reported a 1024-bit radix-16 decryption time of 3.10 ms, which is 1.2 times slower than our best result. Wu, Hong and Wu [12] also recently reported a VLSI RSA implementation with a 512-bit decryption throughput rate of between 328 kb/s and 578 kb/s. This is based on a 0.6 µm CMOS technology. This result is 2.5 and 1.4 times slower than our 512-bit decryption throughput rate, in the worst and best cases respectively.

RSA paper

TABLE V

RSA DECRYPTION RESULTS

Xilinx Device XC2V3000 Length (k) 512 Speed (MHz) 105.98 CLB Slices 13,513 Throughput Rate (kb/s) 812.03 Decryption (ms) 0.63

It should also be noted that both of the aforementioned implementations also applied the CRT technique [9] to RSA decryption and that Blum and Paar [11] also used a public Montgomery’s multiplication algorithm. CSAs were used to perform the large operand additions required when used in an RSA implementation. It has been shown that by using a four-to-two CSA with two extra registers rather than a five-to-two CSA, then a useful reduction in the critical path delay of the multiplier can be achieved albeit at the expense of some additional silicon. However, for very large operand lengths of 1536-bits and greater, the percentage gain in data throughput rate outweighs the percentage increase in area. For a 2048-bit operand length, the gain in data throughput is 27.9% compared with a 9.9% increase in area. Thus, the four-to-two CSA exponent of 216+1.

C. Further Discussions

As discussed in Section , our designs are technology independent and thus can be ported to other technologies without difficulty. Therefore, even higher data rates should be achievable when the designs are migrated to modern ASIC technology. Previous experience would indicate [13] that a further doubling in performance is possible if the architectures are implemented in a 0.18µm CMOS. Also, with some further modifications, our Montgomery multiplier designs can be tailored so that they can be implemented as arithmetic co-processors on modern microprocessor based smart cards. Typical RISC processors for such applications require around 80,000 gates with associated co-processors requiring between 80,000 and 190,000 gates depending on the bit length k, as shown in Table V .

TABLE V

MONTGOMERY MULTIPLIER AREA RESULTS (NUMBER OF GATES)

Five-to-two CSA 512 82,929

The figures in Table V are based on the use of a single Montgomery multiplier. As shown, the gate counts required are easily within the realms of what can be integrated on a single chip. Also, by exploiting certain time-area trade-offs such as reducing the amount of carry save logic used to evaluate (7) and (8), then the silicon area of our architectures can be reduced significantly at the expense of some extra latency. For example, the amount of carry save logic used in the five-to-two CSA multiplier architecture can be reduced by two thirds at the expense of two extra clock cycles required to compute (7). Thus, as discussed in Section , the architectures presented are highly flexible and can be used in a multitude of different industrial applications, including as high-speed network security processors or as low silicon area arithmetic co-processors on embedded smart cards.

IV. CONCLUSION

In this paper we have presented new, generic modular multiplication architectures suitable for implementing

multiplier is of more use for RSA systems with key lengths in excess of 1536-bits, whereas the five-to-two CSA multiplier is preferable otherwise. In order to demonstrate the practicality of our multipliers, RSA processor architectures were developed for 512-bit and 1024-bit key sizes using the five-to-two CSA multiplier. The resulting FPGA-based Montgomery multiplier and RSA processor performance results presented are, we believe, the fastest reported in the literature to date with future significant performance gains achievable through ASIC implementation.

ACKNOWLEDGMENT

Amphion Semiconductor Ltd. and a Northern Ireland Department of Learning postgraduate studentship in the form of a CAST award have funded this research.

REFERENCES

[1] Stallings, W.: “Network and Internetwork Security Principles and Practice”. Prentice Hall International, 1995.

[2] Menezes, A., Oorschot, P., Vanstone, S.: “Handbook of Applied

Cryptography”. CRC Press, 1997.

[3] Rivest, R.L., Shamir, A., Adleman, L.: “A Method for Obtaining Digital

Signatures and Public-Key Cryptosystems”. Communications of the ACM, 21(2): 120-126, February 1978.

[4] Montgomery, P.L.: “Modular Multiplication without Trial Division”.

Math. Computation, Vol. 44, pp. 519-521, 1985.

[5] Koc, C.K., Acar, T., Burton S. Kaliski Jr.: “Analyzing and Comparing

Montgomery Multiplication Algorithms”. IEEE Micro, 16(3): 26–33, June 1996.

[6] Eldridge, S.E., Walter, C.D.: “Hardware Implementation of

Montgomery’s Modular Multiplication Algorithm”. IEEE Transactions on Computers, Vol. 42, pp. 693–699, July 1993.

[7] Elbirt, A.J., Paar, C.: “Towards an FPGA Architecture Optimized for

Public-Key Algorithms”. SPIE Symposium on Voice, Video and Communications, Sept 1999.

[8] Blum, T., Paar, C.: “Montgomery Modular Exponentiation on

Reconfigurable Hardware”. Proceedings 14th Symposium on Computer Arithmetic, pp. 70-77, 1999.

[9] Quisquater, J-J., Couvreur, C.: “Fast Decipherment Algorithm for RSA

Public-Key Cryptosystem”. Electronics Letters, Vol. 18, pp. 905–907, October 1982. [10] Walter, C.D.: “Montgomery Exponentiation Needs No Final

Subtractions”. Electronics Letters, 35(21): 1831-1832, October 1999.

[11] Blum, T., Paar, C.: “High-Radix Montgomery Modular Exponentiation

on Reconfigurable Hardware”. IEEE Transactions on Computers, Vol. 50, pp. 759-764, July 2001.

[12] Wu, C-H., Hong, J-H., Wu, C-W.: “VLSI Design of RSA Cryptosystem

Based on the Chinese Remainder Theorem”. Journal of Information Science and Engineering 17, 967-980, 2001. [13] Amphion Semiconductor Ltd.: “Data Security Data Sheets”,

本文来源：https://www.bwwdw.com/article/vtq1.html

相关文章：

例谈高中生物教学中如何发展学生生物学核心素养03-25

河南省公务员四级联考历年真题 - 图文12-01

社区工作者考试试题库题库（300道）03-27

2021年度幼儿园教师师德心得体会08-04

上一篇：the adventures of tom sawyer 最终版下一篇：安全员工作内容及岗位职责