Area efficient cryptographic ciphers for resource constrained devices

T. Blesslin Sheeba\textsuperscript{1}, Dr. P. Rangarajan\textsuperscript{2}

\textsuperscript{1} Department of ECE, Sathyabama University, Chennai-600087, India
\textsuperscript{2} Department of EEE, RMD Engineering College, Chennai-600087, India
blesslinsheebarmk@gmail.com

Abstract: The upcoming area of pervasive computing will be characterized by many smart devices that have very limited resources in terms of memory, computing power and battery supply. In information technology, Ubiquitous which is widely believed to be the next paradigm. The mass deployment of pervasive devices promises on the one hand many benefits, but on the other hand, many foreseen applications are security sensitive. In order to provide security on resource constrained devices lightweight cryptographic algorithms have been developed. In this paper we propose lightweight cryptography for FPGAs by introducing block cipher independent optimization techniques for Altera Cyclone III FPGAs and applying them to the lightweight cryptographic algorithms HIGHT and Present. Both are less than half the size of the AES implementation without using block RAMs.

1. Introduction
Ubiquitous computing represents the third area of computing devices after mainframes and personal computer for first and second eras. Radio frequency identification (RFID) tags and wireless sensor network (WSN) nodes are a few examples which are being used for automated electronic toll systems, identification tags for food products, pets, clothing and so on. This brings us close to the threshold of pervasive computing. The mass deployment of this device brings serious concerns for security and privacy[18]&[19]. The traditional cryptographic algorithms may not be suitable for these devices as they have limited memory and computational power along with serious power constraints. This led to development of new branch of cryptography called lightweight cryptography. HIGHT and Present were developed specifically for lightweight cryptography AES and Camellia, though not considered lightweight, and are also being used on these devices. Until now, lightweight cryptography is targeted towards application specific integrated circuits (ASICs). ASICs involve high non-recurring engineering cost and long time to market where as Field Programmable Gate Arrays (FPGAs) involve low non-recurring engineering cost and less time to market.

The dominant factor favorable to ASICs is their lower power consumption, which is of primary concern for lightweight cryptographic devices and their lower cost in large volumes. With the advent of low-cost and low-power FPGAs, we expect them to become popular for battery powered applications such as WSN nodes. Hence, they are a targeted for lightweight cryptographic applications. Reconfigurability of FPGAs allows the system to be upgraded if ever the need arises which is not possible with ASICs. Furthermore, lightweight crypto implementations lead to area saving over traditional implementations. This enables a designer to add crypto to an existing design at a minimal cost or to reduce the overall area consumption which might lead to cost saving as the design might now fit into a smaller, cheaper FPGAs. The ciphers considered are of full strength security i.e. 128-bit key length, even though traditional lightweight cryptography considers 80-bit key length to be sufficiently secure.

2. Materials and Methods

Advanced Encryption Standard
The Advanced Encryption Standard (AES) specifies FIPS-approved cryptographic algorithm that can be used to protect electronic data. The algorithm AES is a symmetric block cipher that can encrypt (encipher) and decrypt (decipher) information. In cipher text encryption converts data to an unintelligible form, decrypting the cipher text converts original form of data called plaintext. A cryptographic key of 128, 192, and 256 bits to encrypt and decrypt data in blocks of 128 bits is possible in AES algorithm. The specified standard algorithm may be implemented in software, firmware or hardware. The specific implementation may depend on several factors such as the application, the environment and technology.
The algorithm shall be used in conjunction with a FIPS approved or NIST recommended mode of operation. Object Identifiers (OIDs) and any associated parameters for AES used in these modes are available at the Computer Security Objects Register (CSOR).

HIGHT

HIGHT (HIGH security and light weight) is a 64-bit block cipher with 128-bit key length. It uses generalized Feistel structure with 32-rounds containing simple operations such as XOR, modular addition in the group of 28 elements, and bitwise rotation. The absence of traditional substitution layer, its Feistel structure and byte oriented operations make it suitable for low-cost, low-power and lightweight implementations. The HIGHT algorithm was modified to reduce the critical path in the key schedules which also reduces the area to 2608 gates. The initial security analysis presented in [2] & [4] showed that HIGHT only 19 rounds is secure. It provides low-resource hard-ware implementation, which is proper to ubiquitous computing device such as a sensor in USW or a RFID. HIGHT does not only consist of simple operations to be ultra-light but also has enough security as a good encryption algorithm.

Table 1. Comparision of AES and HIGHT

<table>
<thead>
<tr>
<th>Algorithm</th>
<th>Technology (μm)</th>
<th>Area (GEs)</th>
<th>Throughput (Mbps)</th>
<th>Max frequency (MHz)</th>
</tr>
</thead>
<tbody>
<tr>
<td>AES</td>
<td>0.35</td>
<td>3400</td>
<td>9.3</td>
<td>80</td>
</tr>
<tr>
<td>HIGHT</td>
<td>0.25</td>
<td>3048</td>
<td>150.6</td>
<td>80</td>
</tr>
</tbody>
</table>

Present

Present is an ultra-lightweight block cipher an it is a 31-round Substitution-Permutation (SP) network with a block size of 64-bit and 80-bit or 128-bit key lengths. In this paper a 128-bit key length is considered. Present was designed by incorporating some features of Serpent and Data Encryption Standard (DES). The nonlinear substitution layer, i.e. S-box in Present is similar to that of Serpent and the linear permutation layer to that of DES. The original Present proposal provides a basic security analysis. With the establishment of the AES the need for new block ciphers has been greatly diminished; for almost all block cipher applications the AES is an excellent and preferred choice. However, despite recent implementation advances, the AES is not suitable for extremely constrained environments such as RFID tags and sensor networks [3]. In this paper ultra-lightweight block cipher is described. During design of cipher, both security and hardware efficiency have been equally important [5].

Block ciphers

The entire block of plaintext can be encrypted with the same key [14]. This means that the encryption of any plaintext bit in a given block depends on every other plaintext bit in the same block[15]. In practice, the vast majority of block ciphers either have a block length of 128 bits (16 bytes) such as the advanced encryption standard (AES), or a block length of 64 bits (8 bytes) such as the Data Encrypted Standard (DES) or triple DES (3DES) algorithm.
2.1. Optimization techniques

Designing compact architectures in FPGAs depends on effective use of architectural features provided in the targeted FPGAs. FPGAs have features such as Look Up Table (LUT) based 16-bit shift register (SRL16) and Distributed Random Access Memory (DRAM) which can be employed to improve the performance and decrease the area of a given design by an order of magnitude.

2.2. FPGA Architecture

FPGA Structure

The fundamental logic unit in FPGAs is a slice, which contains mainly two four input LUTs and two flipflops. Half the slices of a chip are called SLICEMs. Their LUTs can be configured as 16 bit shift registers (SRL16) or as 16-bit distributed RAMs Called DRAMs.

Shift Register

The number of slices required for implementing a shift register depends on the number of bits to be stored and the number of taps. Taps are positions of a shift register where data can be written to or read from. Each tap is configured as a flipflop.

Distributed RAM (DRAM)

DRAMs offer fast and localized memory. They can be cascaded for realizing deeper memories with minimal penalty on timing. Distributed RAM supports two types of memories: single-port RAM and dual-port RAM. Both have synchronous write and either synchronous or asynchronous read.

2.3. Plain text and key storage

The most area consuming components of cryptographic algorithms are data and key storage. DRAM and shift register are two options for efficient memory implementation. In order to develop lightweight architectures, the algorithm implementations are scaled down to use either an 8-bit or a 16-bit datapath. Key and data are loaded either 8-bits or 16-bits depending on the implementation. Loading into shift register is simpler as the number of control bits needed is less as compared to DRAM which needs addressing. The size of the address increases with the number of
words to be stored in DRAM. On the other hand, the control bits of shift register are independent of size of the data it stores. However, some ciphers require intermediate data values. In case of HIGHT, plaintext is stored in DRAM due to its generalized Feistel structure. The key scheduling in ciphers Camellia, HIGHT and Present involve shifting of key which make use of shift register more apt as in[4]&[5]. However, the key in HIGHT is need in a different order during initial and final transformations compared to round operations which are difficult to accomplish with a shift register. In this case, a DRAM is used to store the key.

2.4. Control logic

Finite State Machines (FSM) are used for realizing the control logic of complex systems. Traditionally, FSMs are implemented using flipflops and combinational logic. However, this type of FSM implementation is complex and not efficient. The use of RAM blocks for sequential logic led to ROM-based FSM implementations. The control signals for each operation are combined to one control word. These control words are stored in a memory location which can be accessed by an address. DROMs are inferred by holding the write signal low for DRAMs. ROM-based FSMs have additional advantages. The maximum frequency at which a ROM-based FSM operate is independent of the complexity of the circuit. This method is also proved to save power. For our HIGHT and Present architectures, the control signals are generated by a counter and some additional logic. The size of the control word is reduced by removing any control signals which repeat a short sequence of values many times, such as control signals for round operations, from the main controller and assigning them to a sub-controller. We use this technique for camellia where the control signals for round function f are generated by a sub-controller called F-controller.

3. Discussion

The HIGHT algorithm consists of 32-rounds with initial and final transformations before the first and after the last rounds respectively. The plaintext P and cipher text C are split into eight 8-bit blocks P7…P0 and C7…C0 and the original key K into sixteen 8-bit blocks K15….K0. The initial transformation uses the four whitening key bytes W K0, W K1 , W K2 , and W K3 to transform a plain text P into the input of first round function, X0 = X0,7 X0,0. In the final transformation, the data is shifted towards right and transforms X32 = X32,7 X32,0 into cipher text C by with the four whitening keys W K4 , W K5 , W K6 and W K7. Both transformations perform a XOR or modular addition as shown in Figure 1.

3.1 Lightweight architecture of HIGHT

The reduction of area for HIGHT is achieved by scaling the 64-bit algorithm to 8-bit. This reduction does not come at the cost of temporary storage or multiplexers.

Data Storage

The round function involves shifting which suggest that a shift register is the most efficient solution for data storage. However, the generalized Feistel structure of HIGHT leads to a misalignment of data when a shift register is used. Realignment requires additional clock cycles. This reduces the throughput and the more complex logic increases the area consumption. Therefore, we use a DRAM which is more efficient both in terms of area and latency. A dual-port DRAM is used as the round function requires two 8-bit blocks of data for computing one 8-bit block. The shifting involved in round function is accomplished by addressing. The addresses needed for all operations are generated by using three 3-bit multiplexers M6, M7, and M8, 2-bit and 3-bit adders, 12-bit shift register SR and a 3-bit counter C3 as shown in figure 3. The address from M8 is used for both reading and writing while M7 is used only for reading. The control signals for all operations are generated by using a 5-bit counter and some logic functions.

Round Function, Initial and Final Transformation

The addresses for round function are generated by shift register (SR) and a 3-bit adder. The initial values of SR required for first round are computed by using 2 LSB bits of C3. The addresses for the next round are generated by loading the result of 3-bit adder into SR. This way of generating the addresses reduces the complexity of the control logic and avoids extra clock cycles needed for shifting of data. Each round operation requires 4 clock cycles. The initial and the final transformations are performed by using the data path of the round function with use of two extra multiplexers M2 and M3. The addresses for both transformations are generated by C3. The initial transformation is performed while the data is being loaded into the shift register which save clock cycles.
Key Storage and Scheduling
The 128-bit key is stored in a single-port DRAM. The sub-keys and whitening keys are generated by using two 3-bit counters C1 and C2 and a 2x1 multiplexer M5. The MSB bit of the key address which is also used as selection bit for M5 is generated from the output of the 5-bit counter used in control logic. The two 3-bit counters are used for addressing instead of one 4-bit counter to accomplish shifting involved in key scheduling.

3.2. Lightweight architecture of PRESENT
The area is reduced by scaling the 64-bit implementation to a 16-bit implementation and by applying the optimization techniques. Scaling the implementation to 8-bit would decrease the throughput drastically and yield a very small area reduction due to the complexity of the permutation operation. Our implementations of wider datapath led to a significant increase in area consumption.

Data storage
The state $b_{63} \ldots b_0$ is stored in the shift register SR1 for the reason specified in III. It performs a 16-bit circular left-shift per clock cycle. We consider the 64-bit shift register as a combination of sixteen 4-bit block $15, \ldots, 0$. The 16 MSB are tapped out of SR1 for the round operation.

S-box Implementation and Permutation Layer
The round operation starts by XOR the incoming data with the round key RK1 and applying the result to four S-boxes. Present’s 4-bit to 4-bit S-boxes are implemented in a single LUT each. Our architecture uses four S-boxes for round operations and two for key scheduling. The Permutation function is implemented by using the shift register SR2 which performs a shift by 4 bits during round operation and by 16 bits after each round when copying its content into SR1. Furthermore, the 16-bit output from the S-boxes is given as input to the blocks 12, 8, 4, and 0 of SR2. During first clock cycle of the round operation the 4-bits blocks 15, 11, 7 and 3 are computed from the 16 MSB of SR1 and placed in position 12, 8, 4, and 0 of SR2. In the subsequent clock cycle SR2 is shifted by 4 bits and blocks 14, 10, 6 and 2 are computed and placed in the now empty positions 12, 8, 4, and 0 of SR2. These operations repeat for another two clock cycles to complete the round function. This results in a total of 8 clock cycles for each round operation.

Key Storage and Scheduling
The key is stored in a 128-bit shift register which performs a 16-bit circular left shift. The first round key RK1 is obtained during the first four clock cycles by tapping the 16 MSB from the key and passing them to the RKGen function. However, during these four clock cycles the key was shifted by 64-bit. Subsequent round keys require only a shift by 61 bits which is not possible with a 16-bit shift. We overcome this problem by placing three extra taps on the shift register and using two 3-bit registers A and
B along with several multiplexers. The 3 bits that were “lost” during generation of the first round key are stored in register A. For subsequent round keys, the value from register A and the 13 MSB from the key are passed to RKGen. However, when generating the round key RK2 we are facing the same problem of shifting by 64-bit again and compensate by storing three bits in register B. From now on, RKGen gets the value from registers BijAjj11 MSB from key. When we write the round keys back into the shift register we take the 3-bit difference into account. Therefore, no additional registers are required for subsequent rounds. The RKgen function consists of two S-boxes for S-box operation, a 5-bit XOR to compute the XOR with the round counter, and multiplexers to choose the appropriate bits for round key generation. The output of the RKgen function is the round key as shown in figure 4.

![Figure 4. Key scheduling of present](image)

4. Results
The designs of HIGHT and Present were described in Verilog HDL, synthesized for the Altera Cyclone III EP3C16F484C6 device using Altera Quartus II and Simulated with Modelsim. All results are after place and route. Table 1 shows the detailed implementations of HIGHT and Present. The results for AES were obtained by using the VerilogHDL code for the ASIC implementation and synthesizing it in FPGA. Camellia and AES encrypt blocks of 128-bit data, whereas the other algorithms operate on 64-bit data blocks. Therefore, AES and Camellia implementations require more storage. Furthermore, AES and Camellia have 8x8 S-boxes which occupy 64 slices or 16% and 20% respectively of the total design area in our implementations. Present’s 4x4 Sbox occupies only 2 slices Present AES use registers (i.e. flipflops) for data and key storage. Even though the total number of flipflops needed is far smaller than the number of LUTs used, the addressing logic contributes to the area consumption.

The Camellia implementation uses 88 SRL-16 elements, which would be capable of storing a maximum of 1,408 bits, to store its two 128-bit keys. Unfortunately, the round key generation shifts the key by 15 and 17 bits. This irregular shift requires many additional tapings causing the high number of SRL-16 elements. Implementing shifts in multiples of 8 require less area.

<table>
<thead>
<tr>
<th>Block cipher</th>
<th>Flip Flops</th>
<th>LUTs</th>
<th>SRL-16s</th>
<th>Slices</th>
<th>Implementation</th>
</tr>
</thead>
<tbody>
<tr>
<td>Present</td>
<td>114</td>
<td>25</td>
<td>9</td>
<td>13</td>
<td>SRL DRAM</td>
</tr>
<tr>
<td>HIGHT</td>
<td>226</td>
<td>338</td>
<td>42</td>
<td>53</td>
<td>DRAM FF</td>
</tr>
<tr>
<td>Camellia</td>
<td>164</td>
<td>42</td>
<td>1</td>
<td>3</td>
<td>SRL SRL</td>
</tr>
<tr>
<td>TinyXTEA-3</td>
<td>254</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>FF FF</td>
</tr>
<tr>
<td>AES</td>
<td>393</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>FF FF</td>
</tr>
</tbody>
</table>

The same can be observed in Present’s key schedule as its involves 61-bit shifts. HIGHT makes extensive use of DRAM elements for both, data and key storage and uses SRL-16s in its control logic. Present uses SRL-16s for both. DRAM and SRL-1 elements are an ideal choice for storing data and key provided that the algorithm is regular which leads to a simple control logic. Camellia is an example for an algorithm with high irregularity, therefore DRAM and SRL-16 elements cannot be used to full effect. Implementing permutation functions that span more than 8 or 16 bits also increases the area consumption and latency for lightweight implementations in FPGAs. Table 3 compares our implementations with Camellia, TinyXTEA-3, AES and the eSTREAM portfolio ciphers. The stream ciphers outperform all block cipher implementations with respect to the throughput/area metric. However, they are defined for 80-bit keys and only MICKEY and Grain offer 128-bit versions. Stream ciphers are still considered immature and only recently the stream cipher F-FCSR-H was removed from the portfolio. AES has the highest throughput of the block ciphers followed by HIGHT. However, HIGHT has a better throughput to area ratio and consumes only half the size and no block rams. For the throughput to area ratio computation 300 slices are added to the area of AES and 140 to AES to compensate for the block ram usage.
Table 3. Results for Present and HIGHT compared to other block Ciphers and the eSTREAM for TFOLIO Ciphers on FPGA

<table>
<thead>
<tr>
<th>Design</th>
<th>Max. Delay</th>
<th>Clock Cycles per block</th>
<th>Block Size (bits)</th>
<th>Key Size (bits)</th>
<th>Area (Slices)</th>
<th>Block RAMs</th>
<th>Throughput (Mbps)</th>
<th>Throughput area (Mbps/ Slices)</th>
<th>Device</th>
</tr>
</thead>
<tbody>
<tr>
<td>Present</td>
<td>8.78</td>
<td>256</td>
<td>64</td>
<td>128</td>
<td>117</td>
<td>0</td>
<td>28.46</td>
<td>0.24</td>
<td>XC3S50-5</td>
</tr>
<tr>
<td>HIGHT</td>
<td>6.12</td>
<td>160</td>
<td>64</td>
<td>128</td>
<td>91</td>
<td>0</td>
<td>65.48</td>
<td>0.72</td>
<td>XC3S50-5</td>
</tr>
<tr>
<td>Camellia</td>
<td>7.95</td>
<td>875</td>
<td>128</td>
<td>128</td>
<td>318</td>
<td>0</td>
<td>18.41</td>
<td>0.06</td>
<td>XC3S50-5</td>
</tr>
<tr>
<td>AES</td>
<td>14.21</td>
<td>534</td>
<td>128</td>
<td>128</td>
<td>393</td>
<td>0</td>
<td>16.86</td>
<td>0.04</td>
<td>XC3S50-5</td>
</tr>
<tr>
<td>AES-8-bit</td>
<td>14.93</td>
<td>3900</td>
<td>128</td>
<td>128</td>
<td>124</td>
<td>2</td>
<td>2.2</td>
<td>0.01</td>
<td>XC2S15-6</td>
</tr>
<tr>
<td>AES</td>
<td>20.00</td>
<td>46</td>
<td>128</td>
<td>128</td>
<td>222</td>
<td>3</td>
<td>139</td>
<td>0.27</td>
<td>XC2S30-5</td>
</tr>
<tr>
<td>Tiny</td>
<td>15.97</td>
<td>112</td>
<td>64</td>
<td>128</td>
<td>254</td>
<td>0</td>
<td>35.78</td>
<td>0.14</td>
<td>XC3S50-5</td>
</tr>
<tr>
<td>XTEA-3</td>
<td>5.10</td>
<td>1</td>
<td>1</td>
<td>128</td>
<td>0</td>
<td>0</td>
<td>196</td>
<td>4.45</td>
<td>XC3S50-5</td>
</tr>
<tr>
<td>Grain v1</td>
<td>5.10</td>
<td>1</td>
<td>1</td>
<td>128</td>
<td>50</td>
<td>0</td>
<td>196</td>
<td>3.92</td>
<td>XC3S50-5</td>
</tr>
<tr>
<td>128</td>
<td>4.29</td>
<td>1</td>
<td>1</td>
<td>128</td>
<td>115</td>
<td>0</td>
<td>233</td>
<td>2.03</td>
<td>XC3S50-5</td>
</tr>
<tr>
<td>MICKEY v2</td>
<td>4.48</td>
<td>1</td>
<td>1</td>
<td>128</td>
<td>176</td>
<td>0</td>
<td>223</td>
<td>1.27</td>
<td>XC3S50-5</td>
</tr>
<tr>
<td>MICKEY 128</td>
<td>4.17</td>
<td>1</td>
<td>1</td>
<td>128</td>
<td>50</td>
<td>0</td>
<td>240</td>
<td>4.80</td>
<td>XC3S50-5</td>
</tr>
<tr>
<td>Tinvaun</td>
<td>4.74</td>
<td>1</td>
<td>64</td>
<td>128</td>
<td>344</td>
<td>0</td>
<td>13.504</td>
<td>39.26</td>
<td>XC3S400-5</td>
</tr>
</tbody>
</table>

5. Conclusion
Lightweight implementations of cryptographic algorithms for FPGAs are going to become an important research area due to the introduction of FPGAs for battery powered devices. In this paper we introduced the first lightweight implementations of the block ciphers HIGHT and Present on FPGAs. Our implementation of HIGHT consumes less than 100 slices, encrypts data at 65 Mbps and has a better throughput over area ratio than the previously published lightweight implementation of AES. Furthermore, we introduced optimization techniques for lightweight implementations that can also be applied to other algorithms. Investigating the robustness of lightweight implementations against side channel analysis and implementing lightweight asymmetric cryptosystems is future work.

References


6/25/2013