26 Introduction to ARM Processor
Dr Selvi Ravindran
About ARM:
ARM is a popular low-power embedded processor. Its core is based on the RISC architecture. The first prototype was introduced in 1985 under the name Acorn RISC Machine, later renamed Advanced RISC Machine (ARM). Being a RISC machine, its design is simple, which means it can be built with fewer transistors. Its ease of programming, high code density and low power consumption make it suitable for embedded applications.
Learning Objectives:
- To understand the architectural details of the ARM processor and intricacies of its instruction set
- To appreciate the various generations of advancement in ARM families
- To provide a road map of the register sets and processor modes to enable ARM programming
1.1 ARM PROCESSOR
ARM is based on the load-store architecture. This means memory is accessed only through these two kinds of instruction. The load instruction copies data from memory into the register file, whereas the store instruction writes data from the registers to memory. All arithmetic and logical instructions access only the register file, which keeps operand access fast and simple. The core is basically a 32-bit processor; 64-bit support was added with the ARMv8-A architecture.
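The load-store idea can be sketched with a toy model. This is only an illustration: the register list, the `memory` dictionary and the helper names `ldr`, `str_word` and `add` are hypothetical stand-ins, not ARM tooling.

```python
# Toy load-store machine: arithmetic touches registers only,
# memory is reached only through load and store helpers.
registers = [0] * 16                 # r0..r15
memory = {0x100: 7, 0x104: 5}        # hypothetical word-aligned addresses

def ldr(rd, address):
    """Load: copy a word from memory into register rd."""
    registers[rd] = memory.get(address, 0)

def str_word(rs, address):
    """Store: copy register rs out to memory."""
    memory[address] = registers[rs]

def add(rd, rn, rm):
    """ADD works on registers only and wraps at 32 bits."""
    registers[rd] = (registers[rn] + registers[rm]) & 0xFFFFFFFF

def run_example():
    ldr(0, 0x100)        # r0 = mem[0x100]
    ldr(1, 0x104)        # r1 = mem[0x104]
    add(2, 0, 1)         # r2 = r0 + r1
    str_word(2, 0x108)   # mem[0x108] = r2
    return memory[0x108]
```

Calling `run_example()` performs two loads, one register-to-register addition and one store, which mirrors the sequence a compiler would emit for a memory-to-memory sum on a load-store machine.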
1.2 Data Size and Instruction Set
The data processing capability of an instruction basically depends on the register width. The group of registers called the register file typically holds signed or unsigned 32-bit or 64-bit data, depending on the ARM core family. The ARM sign-extend hardware converts a Byte (8 bits) or Halfword (16 bits, or two bytes) into a Word (32 bits, or four bytes) before it is stored in the register file.
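As a rough illustration of what the sign-extend hardware does, the sketch below widens a byte or halfword to a 32-bit word. The function names are hypothetical and the model ignores everything except the bit manipulation.

```python
WORD_MASK = 0xFFFFFFFF

def sign_extend(value, bits):
    """Widen a `bits`-wide signed value to a 32-bit word (two's complement)."""
    value &= (1 << bits) - 1                 # keep only the narrow field
    sign_bit = 1 << (bits - 1)
    signed = (value ^ sign_bit) - sign_bit   # Python int with the right sign
    return signed & WORD_MASK

def zero_extend(value, bits):
    """Widen a `bits`-wide unsigned value to a 32-bit word."""
    return value & ((1 << bits) - 1)
```

A signed byte load of `0x80` would therefore place `0xFFFFFF80` in the register, while an unsigned byte load of the same value would place `0x00000080`.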
ARM implements three categories of instruction sets:
- 32-bit ARM instruction set
- 16-bit Thumb instruction set
- 8-bit Jazelle instruction set (Jazelle cores can also execute Java bytecode)
The Current Program Status Register (CPSR) decides which instruction set is executed. The instruction set typically uses a three-address format with one destination operand Rd and two source operands Rn and Rm. The features of these instruction sets are described in Table 1.1.
Table 1.1 Instruction Set Features
1.3 Functional Blocks of ARM
The data flow diagram of the ARM core is shown in Figure 1.1, where the functional units are connected by data buses. The Arithmetic Logic Unit (ALU) and the Multiply-Accumulate unit (MAC) take their values from data paths A and B and store the result back in the register file. One of the integral parts of the core is the Barrel Shifter (BS). It is used to pre-process the source operand Rm before it enters the ALU. The BS together with the ALU can evaluate many different expressions in a single instruction.
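The way the barrel shifter pre-processes Rm before the ALU sees it can be sketched as follows. The helper names are hypothetical, shift amounts are assumed to lie in the range 0 to 31, and carry-out behaviour is left out.

```python
MASK32 = 0xFFFFFFFF

def lsl(value, amount):
    """Logical shift left."""
    return (value << amount) & MASK32

def lsr(value, amount):
    """Logical shift right (zero fill)."""
    return (value & MASK32) >> amount

def asr(value, amount):
    """Arithmetic shift right (sign fill)."""
    value &= MASK32
    if value & 0x80000000:
        value -= 1 << 32          # reinterpret as a negative Python int
    return (value >> amount) & MASK32

def ror(value, amount):
    """Rotate right within 32 bits."""
    amount %= 32
    value &= MASK32
    return ((value >> amount) | (value << (32 - amount))) & MASK32

def add_with_shift(rn, rm, shift, amount):
    """Model of 'Rd = Rn + (Rm shifted)': shifter first, then ALU."""
    return (rn + shift(rm, amount)) & MASK32
```

For instance, `add_with_shift(base, index, lsl, 2)` models indexing an array of words, where the index is scaled by four in the shifter and added to the base in the ALU in one step.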
The base address for load and store instructions is generated by the ALU and held in the address register. The core can access an array of memory locations with the help of the incrementer, which auto-increments the address to reach the next sequential memory location until an exception or interrupt changes the flow of execution.
1.4 Processor Modes
The ARM has seven basic operating modes: six privileged modes and one non-privileged mode (user). The privileged modes are used to service interrupts and exceptions and to access protected resources.
User Mode: an unprivileged mode under which most tasks run. This mode is used for executing application programs.
System Mode: a privileged mode for running user and system programs. In this mode the program has all access permissions. It uses the same set of registers as the non-privileged user mode.
Supervisor Mode: the mode in which the OS kernel operates. It is entered on reset and when a Software Interrupt instruction is executed.
Two of the privileged modes are allocated for interrupt handling. In general, ARM has two levels of interrupts.
Fast Interrupt Request (FIQ): FIQ mode is entered when a high-priority (fast) interrupt is raised. FIQ supports channel communication for data transfer.
Interrupt Request (IRQ): IRQ mode is entered when a low-priority (normal) interrupt is raised. This is a privileged mode for general-purpose interrupt handling.
Abort: used to handle memory access violations. The abort mode handles data aborts and pre-fetch aborts.
Undefined: used to handle undefined instructions that are not supported by the implementation.
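The seven modes can be summarised in a small lookup. The 5-bit encodings below are the CPSR mode-field values of the classic ARM architecture; they are not listed in the text above and are included here as supporting detail. The helper names are hypothetical.

```python
from enum import IntEnum

class Mode(IntEnum):
    """CPSR mode-field encodings (bits 4:0) of the classic ARM modes."""
    USER = 0b10000
    FIQ = 0b10001
    IRQ = 0b10010
    SUPERVISOR = 0b10011
    ABORT = 0b10111
    UNDEFINED = 0b11011
    SYSTEM = 0b11111

def is_privileged(mode):
    """Every mode except user mode is privileged."""
    return mode != Mode.USER

def mode_from_cpsr(cpsr):
    """Extract the current mode from a 32-bit CPSR value."""
    return Mode(cpsr & 0x1F)
```

Counting the members for which `is_privileged` is true gives the six privileged modes described above.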
1.5 Registers
ARM has 37 registers, each 32 bits long. The registers are broadly classified into Special Purpose Registers (SPR) and General Purpose Registers (GPR). The SPRs are driven by the processor and the GPRs are for the programmer. At any time up to 18 registers are active: the 16 registers r0 to r15, which are visible to the user, and two program status registers. Registers r13, r14 and r15 are used as the stack pointer (sp), link register (lr) and program counter (pc) respectively. The lr stores the return address whenever control shifts to a subroutine. The two program status registers are the CPSR and the SPSR (current program status register and saved program status register). Out of the 37 registers, 20 are banked, that is, hidden from the program at different times and swapped in according to the mode. The processor mode decides which bank is accessible. The complete register set under the different modes is shown in Figure 1.2.
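The count of 37 registers, 20 of them banked, can be checked with a short tally. The per-mode banking below (r8 to r14 plus an SPSR for FIQ; r13, r14 and an SPSR for the other exception modes) follows the classic ARM register model and is an assumption added for this sketch, as are the mode abbreviations.

```python
# Registers shared by user and system mode: r0..r15 plus the CPSR.
USER_VISIBLE = [f"r{i}" for i in range(16)] + ["cpsr"]

# Registers that each exception mode swaps in (banked copies).
BANKED = {
    "fiq": ["r8", "r9", "r10", "r11", "r12", "r13", "r14", "spsr"],
    "irq": ["r13", "r14", "spsr"],
    "svc": ["r13", "r14", "spsr"],
    "abt": ["r13", "r14", "spsr"],
    "und": ["r13", "r14", "spsr"],
}

def banked_count():
    return sum(len(names) for names in BANKED.values())

def total_registers():
    return len(USER_VISIBLE) + banked_count()

def visible_registers(mode):
    """Physical registers seen in `mode` ('usr' and 'sys' share one set)."""
    bank = BANKED.get(mode, [])
    names = [f"{r}_{mode}" if r in bank else r for r in USER_VISIBLE]
    if "spsr" in bank:
        names.append(f"spsr_{mode}")
    return names
```

In user or system mode `visible_registers` returns 17 names, while each exception mode sees 18 because it gains its own SPSR.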
1.6 The ARM Register Set
The shaded region in Figure 1.2 shows how the registers are used in the different privileged and non-privileged modes. On the left is the user and system mode, and on the right the currently visible set of registers is shown for each privileged mode. All instructions can access the GPRs, but only a few instructions can access the SPRs. One of the prominent SPRs is the Program Status Register (PSR). The PSR in ARM represents the outcome of internal operations and helps in monitoring the status of the processor. The PSR is generally referred to as the Current PSR (CPSR). There is another PSR for storage purposes called the Saved PSR (SPSR). The processor switches from one mode to another in response to an interrupt, an exception or a privileged instruction. When an exception or interrupt moves the processor into a privileged mode, the CPSR of the interrupted mode is saved in the SPSR of the mode being entered. After the task is completed, the processor moves back to the mode it came from. This procedure is similar to context switching. At the end, a special return instruction copies the SPSR back into the CPSR.
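The layout of the PSR is shown in Figure 1.3. PSRs can be split into four 8-bit fields that can be written individually: the flags field (bits 31 to 24), the status field (bits 23 to 16), the extension field (bits 15 to 8) and the control field (bits 7 to 0).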
The Condition Code (CC) bits Negative (N), Zero (Z), Carry (C) and Overflow (V) are set when the corresponding result occurs. Most Thumb data processing instructions affect the CC flags, whereas an ARM data processing instruction affects them only if the S suffix is appended to the instruction (comparison instructions such as CMP always affect them). The sticky overflow flag (Q flag) is set either when saturation occurs during a saturating add or subtract or a saturating double add or subtract (QADD, QDADD, QSUB or QDSUB), or when the result of a signed multiply-accumulate (SMLAxy or SMLAWy) overflows 32 bits.
Bits that are reserved for future use should not be modified by current software. Typically, a read-modify-write strategy should be used to update the value of a status register to ensure future compatibility. Note that the T and J bits in the CPSR should never be changed directly by writing to the PSR; the Branch and Exchange or Branch with Link and Exchange (BX/BLX) instructions are used to change the state instead. The processor is in ARM state on reset, following an interrupt, and during exception handling. The mode of the processor can be changed by privileged code writing the mode bits of the CPSR.
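The PSR fields and the read-modify-write strategy can be sketched as below. The bit positions (N, Z, C, V and Q in bits 31 to 27, J in bit 24, I, F and T in bits 7 to 5, mode in bits 4 to 0) follow the classic ARM CPSR layout; the function names are hypothetical.

```python
N_BIT, Z_BIT, C_BIT, V_BIT, Q_BIT = 31, 30, 29, 28, 27
J_BIT = 24
I_BIT, F_BIT, T_BIT = 7, 6, 5
MODE_MASK = 0x1F

def decode_psr(psr):
    """Split a 32-bit PSR value into its named flags and mode field."""
    def bit(position):
        return (psr >> position) & 1
    return {
        "N": bit(N_BIT), "Z": bit(Z_BIT), "C": bit(C_BIT), "V": bit(V_BIT),
        "Q": bit(Q_BIT), "J": bit(J_BIT),
        "I": bit(I_BIT), "F": bit(F_BIT), "T": bit(T_BIT),
        "mode": psr & MODE_MASK,
    }

def set_mode(psr, mode_bits):
    """Read-modify-write: replace only the mode field, keep all other bits."""
    return (psr & ~MODE_MASK & 0xFFFFFFFF) | (mode_bits & MODE_MASK)
```

Because `set_mode` masks out only the five mode bits, reserved bits and condition flags pass through unchanged, which is the point of the read-modify-write approach.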
1.7 Conditional Execution
Most ARM instructions can carry a conditional attribute which decides whether that instruction is executed. These attributes depend on the condition flags set by a previous instruction in the program. The different conditional attributes that can be postfixed to an instruction are shown in Table 1.2. If no condition is appended, the default condition of execute always (AL) is assumed. The processor compares the condition field in the instruction against the NZCV flags to determine whether the instruction should be executed. The use of conditional instructions improves code density and performance by reducing the number of branch instructions.
Following is the code snippet showing instructions with and without conditional execution. Here ADDNE eliminates the necessity of the branch instruction, thus reducing the number of cycles needed to execute the code and in turn reducing the energy consumption.
Table 1.2 Condition Mnemonic
Figure 1.4 shows an example of C code consisting of an if-else statement. Its equivalent ARM mnemonics show that 5 instructions are needed without conditional execution, and that this reduces to 3 instructions with conditional attributes.
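How the condition field is checked against the NZCV flags can be sketched as a lookup. The mnemonic-to-flag rules below are the standard ARM condition codes; the function name `condition_passed` and the table name are hypothetical.

```python
# Each entry maps a condition mnemonic to a test on the N, Z, C, V flags.
CONDITIONS = {
    "EQ": lambda n, z, c, v: z == 1,             # equal
    "NE": lambda n, z, c, v: z == 0,             # not equal
    "CS": lambda n, z, c, v: c == 1,             # carry set / unsigned >=
    "CC": lambda n, z, c, v: c == 0,             # carry clear / unsigned <
    "MI": lambda n, z, c, v: n == 1,             # negative
    "PL": lambda n, z, c, v: n == 0,             # positive or zero
    "VS": lambda n, z, c, v: v == 1,             # overflow
    "VC": lambda n, z, c, v: v == 0,             # no overflow
    "HI": lambda n, z, c, v: c == 1 and z == 0,  # unsigned >
    "LS": lambda n, z, c, v: c == 0 or z == 1,   # unsigned <=
    "GE": lambda n, z, c, v: n == v,             # signed >=
    "LT": lambda n, z, c, v: n != v,             # signed <
    "GT": lambda n, z, c, v: z == 0 and n == v,  # signed >
    "LE": lambda n, z, c, v: z == 1 or n != v,   # signed <=
    "AL": lambda n, z, c, v: True,               # always
}

def condition_passed(cond, n, z, c, v):
    """Return True if an instruction with suffix `cond` would execute."""
    return bool(CONDITIONS[cond or "AL"](n, z, c, v))
```

An instruction such as ADDNE would thus run only when `condition_passed("NE", n, z, c, v)` is true, that is, when the Z flag is clear; an empty suffix falls back to AL.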
1.8 Exceptions, Interrupts and Vector Table
Typically, an external or internal event occurring in the processor is responded to by executing a special handler routine. Interrupts are used to handle these events by switching the context to FIQ or IRQ mode. Exceptions are used to handle instruction faults. Whenever an exception or interrupt occurs, the ARM processor saves the current status and the return address. It copies the CPSR into the respective SPSR_<mode>, sets the appropriate CPSR bits (which disables hardware interrupts) and moves to ARM state. It stores the return address in the link register LR_<mode>. Execution then moves to a fixed memory location called the vector address, that is, the PC is set to the vector address. The vector table normally starts at address 0x00000000; on the ARM720T and on ARM9/10 family devices it can optionally be located at 0xFFFF0000. Figure 1.5 shows the vector table for the different processor modes located in memory. From the figure, it can be understood that each vector entry is four bytes (one word) apart, which is room for a single instruction. That instruction is normally a branch to the handler code written elsewhere in available system memory. Only the FIQ entry, being the last in the table, can have its handler start directly at the vector address.
After servicing the raised interrupt or exception, the handler needs to restore the CPSR from SPSR_<mode> and the PC from LR_<mode>. Interrupt and exception handlers are entered in ARM state, so the processor switches to ARM state before executing them.
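The entry and return steps described above can be sketched as a small state machine. This is a simplified model under stated assumptions: the vector offsets and mode encodings are those of the classic ARM architecture, the vector base defaults to zero, reset is left out, the per-exception adjustment of the return address is ignored, and all names (`enter_exception`, `return_from_exception`, the `state` dictionary) are hypothetical.

```python
# Exception kind -> (vector offset, mode bits entered).
EXCEPTIONS = {
    "undefined":      (0x04, 0x1B),
    "swi":            (0x08, 0x13),
    "prefetch_abort": (0x0C, 0x17),
    "data_abort":     (0x10, 0x17),
    "irq":            (0x18, 0x12),
    "fiq":            (0x1C, 0x11),
}
I_BIT, F_BIT, T_BIT = 1 << 7, 1 << 6, 1 << 5

def enter_exception(state, kind, return_address, vector_base=0x00000000):
    offset, mode = EXCEPTIONS[kind]
    state["spsr"][mode] = state["cpsr"]      # save CPSR in SPSR_<mode>
    state["lr"][mode] = return_address       # save return address in LR_<mode>
    cpsr = (state["cpsr"] & ~0x1F) | mode    # switch mode bits
    cpsr |= I_BIT                            # disable normal interrupts
    if kind == "fiq":
        cpsr |= F_BIT                        # FIQ entry also disables FIQs
    cpsr &= ~T_BIT                           # handlers start in ARM state
    state["cpsr"] = cpsr & 0xFFFFFFFF
    state["pc"] = vector_base + offset       # jump to the vector address

def return_from_exception(state):
    mode = state["cpsr"] & 0x1F
    state["pc"] = state["lr"][mode]          # restore PC from LR_<mode>
    state["cpsr"] = state["spsr"][mode]      # restore CPSR from SPSR_<mode>
```

A typical use is to build `state = {"cpsr": 0x10, "pc": 0x8000, "spsr": {}, "lr": {}}`, call `enter_exception(state, "irq", 0x8004)`, and later `return_from_exception(state)` to get back to user mode at the saved address.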
1.9 Pipeline
Pipelining is the mechanism of overlapping the execution of instructions. Instruction execution is divided into stages such as fetching, decoding and executing. Since every instruction has to pass through these stages during its life cycle in the processor, different instructions can occupy different functional units in parallel, which gives faster completion. The number of stages in a pipeline is called its length. As the pipeline length increases, the work done at each stage is reduced. The pipeline length varies between ARM families: 3 stages (FETCH – DECODE – EXECUTE), 5 stages (FETCH – DECODE – EXECUTE – MEMORY – WRITE), 6 stages (FETCH – ISSUE – DECODE – EXECUTE – MEMORY – WRITE) and 8 stages (FETCH1 – FETCH2 – DECODE – REGISTER-SHIFT – DATA1 – DATA2 – MEMORY – WRITE).
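The benefit of overlapping can be made concrete with the usual idealised cycle count: with no stalls, n instructions on a k-stage pipeline finish in k + n - 1 cycles instead of k × n. The sketch below assumes one cycle per stage and no hazards; the function names are hypothetical.

```python
def unpipelined_cycles(num_instructions, num_stages):
    """Each instruction runs all stages before the next one starts."""
    return num_instructions * num_stages

def ideal_pipelined_cycles(num_instructions, num_stages):
    """First instruction fills the pipe, then one completes per cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1

def stage_at(instruction_index, cycle, stages):
    """Stage occupied by an instruction at a given cycle (both 0-based)."""
    position = cycle - instruction_index
    if 0 <= position < len(stages):
        return stages[position]
    return None
```

With `stages = ["FETCH", "DECODE", "EXECUTE"]`, instruction 0 is in EXECUTE at cycle 2 while instruction 1 is in DECODE and instruction 2 is in FETCH, which is the overlap the paragraph above describes.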
Pipeline Comparison
Figure 1.4 shows the pipeline stages for the ARM7TDMI and ARM9TDMI processors. The ARM7 has a three-stage pipeline and the ARM9 has a five-stage pipeline. Pipelining reduces the number of cycles needed to execute an instruction, which is measured as Cycles Per Instruction (CPI). The CPI is 1.9 for the ARM7 and 1.5 for the ARM9. The operating frequency of the ARM9TDMI is approximately double that of the ARM7TDMI on the same fabrication process. Therefore, at least double the processing power is available. Despite these features, an instruction may take longer to execute and thereby stall the upstream stages. Forwarding paths are provided to minimise this, and with some care when writing code such stalls can almost be eliminated.
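The claim of at least double the processing power follows from execution time = instruction count × CPI / clock frequency. The sketch below uses the CPI figures quoted above; the clock values of 50 MHz and 100 MHz are hypothetical and only their 1:2 ratio matters.

```python
def execution_time(instructions, cpi, clock_hz):
    """Seconds needed to run `instructions` at the given CPI and clock."""
    return instructions * cpi / clock_hz

def speedup(cpi_old, clock_old_hz, cpi_new, clock_new_hz, instructions=1_000_000):
    """How many times faster the new core runs the same instruction count."""
    old = execution_time(instructions, cpi_old, clock_old_hz)
    new = execution_time(instructions, cpi_new, clock_new_hz)
    return old / new

# CPI values from the text; clock frequencies are illustrative assumptions.
ARM9_OVER_ARM7 = speedup(cpi_old=1.9, clock_old_hz=50e6,
                         cpi_new=1.5, clock_new_hz=100e6)
```

Under these assumptions the ratio works out to 2 × 1.9 / 1.5, a little over 2.5, so the doubled clock and the lower CPI together give more than twice the throughput.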
ARM10 adds another stage to the ARM9 pipeline to provide additional time for decoding coprocessor instructions and for branch prediction. The multiplier is now spread over two stages, execute and memory, since the multiplier is also pipelined.
ARM11 is a single-issue processor. Only one instruction can be issued per cycle from the issue stage to one of the three back-end pipelines. While instructions are issued in order, they may complete out of order, depending on the availability of data, the length of execution and memory access times. The ARM10 and ARM11 pipelines are shown in Figure 1.5.
2.0 Core Extension
Core extension refers to additional hardware placed around the ARM core to provide extra functionality. The three hardware extensions that we will consider are:
1. Cache and tightly coupled memory
2. Memory management
3. Coprocessor interface
Cache and tightly coupled memory refer to two kinds of fast memory placed near the core: cache and Tightly Coupled Memory (TCM). A single unified cache follows the Von Neumann style of combining the instruction and data cache. TCM is a fast SRAM located close to the core that gives deterministic behaviour.
Memory management in ARM basically revolves around whether the memory is protected or not. There are three types of memory management hardware: 1. no extension, giving no protection; 2. a memory protection unit (MPU), giving limited protection; and 3. a memory management unit (MMU), giving full protection.
Lastly, the coprocessor interface extends the processing features of the core by extending the instruction set, and improves performance.
2.1 ARM Architecture evolution
The many revisions of the ARM processor bear witness to a demanding embedded market, which has led to many changes in the instruction set architecture (ISA). We introduce the processor nomenclature, which gives the basic information and salient features of each ARM family. Versions mostly refer to the instruction set that the ARM core executes.
ARM{x}{y}{z}{T}{D}{M}{I}{E}{J}{F}{-S}
x: family
y: memory management / protection unit
z: cache
T: Thumb 16 bit decoder
D: JTAG debug
M: fast multiplier
I: embedded ICE macrocell
E: enhanced instruction
J: Jazelle
F: vector floating point
S: synthesizable version
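The nomenclature can be exercised with a small parser. This is a sketch under an assumption: a digit string of three or more characters is read as family followed by one digit each for memory management and cache, and a shorter digit string is read as the family alone. The function name and the returned dictionary keys are hypothetical.

```python
import re

# ARM{x}{y}{z}{T}{D}{M}{I}{E}{J}{F}{-S}, letters in that fixed order.
_CORE_NAME = re.compile(r"^ARM(\d+)(T)?(D)?(M)?(I)?(E)?(J)?(F)?(-S)?$")

def parse_core_name(name):
    """Split an ARM core name into family, memory, cache and feature letters."""
    match = _CORE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised core name: {name!r}")
    digits = match.group(1)
    if len(digits) >= 3:
        # Assumed split: last two digits are {y} and {z}, the rest is {x}.
        family, memory, cache = digits[:-2], digits[-2], digits[-1]
    else:
        family, memory, cache = digits, None, None
    features = [g.lstrip("-") for g in match.groups()[1:] if g]
    return {"family": family, "memory": memory,
            "cache": cache, "features": features}
```

For example, `parse_core_name("ARM926EJ-S")` separates the family 9 from the memory and cache digits and reports the E, J and S attributes, while `parse_core_name("ARM7TDMI")` reports only the family and the T, D, M and I attributes.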
2.1.1 ARM Family
ARMv1 is the first version of the ARM processor, with 26-bit addressing and no multiply or coprocessor support. ARMv2: the ARM2 is the first commercial chip, which included 32-bit multiply instructions and coprocessor support. The refined version ARMv2a, implemented in the ARM3, has an on-chip cache and adds load and store instructions and cache management. ARMv3: ARM6, with 32-bit addressing and virtual memory support.
ARMv4: implementations include the StrongARM and the ARM7TDMI. The StrongARM has a five-stage pipeline, whereas the ARM7TDMI has a three-stage pipeline; ARM7 family members such as the ARM720T add an MMU and a unified 8K cache. The letters in ARM7TDMI stand for T: Thumb, the 16-bit instruction set; D: on-chip Debug support, enabling the processor to halt in response to a debug request; M: enhanced Multiplier, which yields a full 64-bit result with high performance; I: EmbeddedICE hardware. The ARM7, which has long been one of the most often used cores in low-power designs, executes the version 4 instruction set.
ARM9 family (ARMv4T and ARMv5TE): the ARM920T belongs to the ARM9 family and is based on the Harvard architecture. It has separate data and instruction (D+I) caches. The ARM926EJ-S is a Jazelle core with Embedded Trace Macrocell (ETM) technology. Jazelle adds Java bytecode execution, which increases Java performance by 5 to 10 times and also reduces power consumption. Architectural extensions were added in version 5TE to include DSP instructions, such as 16-bit signed MLA instructions and saturation arithmetic.
ARM10 family (v5TE and v5TEJ): has a six-stage pipeline and supports vector floating point, a 32K D+I cache and a 64-bit bus interface. The ARM926EJ-S, ARM1020E and ARM1026EJ-S cores are examples of Version 5 architectures.
ARM11 family (v6): implements an eight-stage pipeline with separate load-store and arithmetic pipelines. The LDREX/STREX instructions improve multi-processing support. This version supports multimedia applications through its SIMD extension; the SIMD instructions provide increased audio and video codec performance. Examples are the ARM1136J-S and the ARM1136JF-S, the latter of which supports vector floating point for fast operation. Version 6 also added instructions for doing byte manipulations and graphics algorithms more efficiently.
ARMv7: the Version 7 architecture is implemented by the Cortex family of cores, such as the Cortex-A8, Cortex-M3 and Cortex-R4. It executes the mixed 16-bit and 32-bit Thumb-2 instruction set and extends the functionality with low-power features and improved security. ARMv8-A implements both 32-bit and 64-bit processing. It executes the 32-bit ARM and Thumb instruction sets as well as a 64-bit instruction set, and it supports virtual memory and rich operating systems.
3. SUMMARY
In this module, we have discussed the popular ARM architecture. We have explored the conditional execution of instructions, which makes the code more compact and saves energy. We had a brief discussion on the pipeline, exceptions and core extensions. We ended by looking through the different revisions that mark the evolution of the ARM architecture.