31 Atom Processor ISA
Dr Selvi Ravindran
About the module:
Atom may not be at the helm of the embedded market, but nevertheless it comes from the Intel family, which compels an embedded programmer to understand its nuances. In this module, we explore the Bonnell micro-architecture of the Atom. We look at interrupt and exception handling, and at system-level architectures for embedded systems built around the Atom processor family.
Learning Outcomes:
- Understand the architectural details of the Atom processor and appreciate the intricacies of Atom programming.
- Understand interrupt and exception handling in the Atom.
1. Atom Processor
Embedded programmers mostly program in assembly or embedded C. Programming in assembly forces the programmer to know the micro-architecture of the processor, and C programs pull architectural details in through header files. Either way, the developer needs to understand the micro-architecture of the embedded processor. Hence, we look at the micro-architecture of the Atom processor.
The Atom is based on the IA-32 ISA, and some processors in the series also support the 64-bit Intel 64 ISA. For example, the N450 supports Intel 64 and integrates the graphics and memory controller to provide higher performance. Atom processors that enable Intel 64 support two modes of operation: 1. 64-bit mode and 2. compatibility mode. The 64-bit mode executes 64-bit code by providing 64-bit registers and up to one terabyte of physical address space. Compatibility mode allows IA-32 protected-mode programs to execute alongside 64-bit applications. A two-level TLB implements memory protection and translates virtual to physical addresses; it supports 4 KB and 4 MB page sizes.
2. Atom Micro-Architecture
The Atom processor executes the Intel x86-compatible instruction set, in which instruction (opcode) length is variable; Atom therefore belongs to the CISC architecture family. Atom is a two-wide, in-order superscalar processor, so it can execute two instructions simultaneously, and it supports vector operations through SIMD instructions. Atom implements the Bonnell micro-architecture, which breaks a complex instruction into smaller internal operations. The micro-architecture deals with three kinds of internal operation: memory load, memory store and ALU operation.
Figure 1.1 shows the microarchitecture of the Atom processor. It consists of five major clusters/components.
a. Front End Cluster (FEC)
b. Memory Execution Cluster (MEC)
c. FP/SIMD Execution Cluster (FPEC)
d. Integer Execution Cluster (IEC)
e. Bus Cluster
The core functional units, where the computation is done, support superscalar processing. The functional units are kept minimal to reduce power consumption. The micro-architecture has two integer ALUs and two floating-point ALUs, which reside in the IEC and FPEC respectively. The other clusters provide the interfaces to these ALUs. The next sections describe the clusters in detail.
2.1 Front End Cluster
The front end is charged with fetching instructions and placing them into one of the three execution clusters: the Memory Execution Cluster, the Integer Execution Cluster and the FP/SIMD Execution Cluster. The computation is done in the integer and floating-point units, while the MEC mainly deals with address calculation. The FEC consists of the branch prediction unit, instruction TLB, instruction cache, prefetch buffer, decode unit and microcode sequencer. Figure 1.2 shows the units of the FEC. The FEC performs instruction fetch and issue; the fetch phase reads the instruction from code memory.
The instruction TLB translates the virtual address to the physical address, locating the instruction in memory. The fetch phase finds the next instruction either by adding the current instruction length to its address (the fall-through address) or, if the current instruction is a branch, by using the target address predicted by the branch prediction unit. The FEC has an instruction cache for quick access to recently executed instructions, and to further improve speed the prefetch buffer holds opcodes ready for decoding. The decode unit interprets the operation; the decode phase works in terms of micro-operations, and the microcode sequencer decodes complex instructions for execution in the pipeline. The FEC decodes two instructions per cycle except in the case of branch and FP operations.
The FEC maintains pre-decode bits to demarcate individual instructions, and the decoder identifies instruction boundaries with their help. The FEC has two queues to hold instructions. Atom is a dual-issue superscalar processor, but it is not perfectly symmetrical: not every pair (combination) of operations can execute simultaneously in the pipeline. The two instruction queues hold instructions temporarily until they are ready to execute in an execution cluster.
2.2 Memory Execution Cluster
The MEC provides the functionality for generating addresses and accessing data. Figure 1.3 shows its five components: the Address Generation Unit (AGU), data TLB, data cache, prefetcher and write-combining buffers. The AGU generates the data address from the base register, scale and offset according to the addressing mode of the instruction. The data TLB translates the virtual address to the physical address, locating the data in memory, and the data cache gives quick access to recently used data. The prefetcher predicts future accesses by analysing the access history. The write-combining buffers club together a series of write-backs destined for the level-two (L2) cache, which improves the utilization of memory bandwidth. The MEC also supports store forwarding, which eliminates potential stalls: a load can obtain a value from a preceding store before that store commits, because the yet-to-be-stored value is forwarded internally to the next instruction.
Dependencies between instructions that rely on the condition flags may cause stalls: a dependent branch instruction introduces a one-cycle stall, while other dependent instructions incur three-cycle stalls. Figure 1.4 shows a three-cycle stall between a subtract and a move instruction; the move can only execute (generate its address) after the subtraction completes.
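A minimal sketch of the kind of dependent pair Figure 1.4 describes (the register choices are illustrative): the load's address depends on the result of the subtract, so address generation has to wait.
subl %ecx, %ebx          # the ALU updates %ebx
movl (%ebx,%edx), %eax   # address generation needs the new %ebx, so the load stalls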
2.3 FP/SIMD Execution Cluster
The FPEC executes x87, SIMD and integer multiply instructions. It contains two FP ALUs, as shown in Figure 1.5. The first FP ALU performs FP multiplication, FP division, FP load, FP store, shuffle and SIMD multiplication; the second ALU performs only FP addition. As mentioned earlier, the Atom processor is not symmetric: not every pair of FP operations can execute simultaneously, and there are asymmetries in execution time as well. For example, FP add takes one cycle in the execute stage, whereas FP multiply takes more than one cycle.
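A small sketch of this asymmetry (the cycle comments simply restate the description above; the register choices are illustrative):
mulps %xmm2, %xmm0     # SIMD multiply: handled by the first FP ALU, multi-cycle execute
addps %xmm3, %xmm1     # FP add: handled by the second FP ALU, single-cycle execute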
2.4 Integer Execution Cluster
The IEC is comparatively more symmetric. It enables many instruction pairs to execute jointly, but it is still limited: not every pair can execute concurrently. It contains two integer ALUs; Figure 1.6 shows the ALUs, the shifter and a joint execution unit.
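As a sketch, two independent simple integer operations of the kind below can typically be issued together to the two ALUs (the registers are illustrative):
addl %ebx, %eax     # first integer ALU
subl %edx, %ecx     # second integer ALU, no dependency on the first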
2.5 Bus Cluster
The bus cluster is the connection between the processor and the memory subsystem. It contains a 512 KB, eight-way L2 cache, a Bus Interface Unit (BIU) and an Advanced Programmable Interrupt Controller (APIC); we discuss the interrupt controller in section 4. Figure 1.7 shows the bus cluster with these three components. The BIU connects the bus cluster to memory, and the peripherals are connected through the front side bus (FSB), which comprises an address bus and a data bus.
3. Programming Tips
Based on the unique features of Atom, certain tradeoffs can be incorporated in the program for better performance. They are discussed below.
Store Forwarding: As already discussed, Atom supports store forwarding, the internal forwarding of a pending store to a subsequent load before the store commits. This benefit has restrictions that must be taken into account when writing the program: store forwarding is limited to integer values of the same size. Consider the following example, where the first movl stores data at the memory location addressed by ebx. The next movl loads that data from the same location with a matching size, so the store is forwarded; the byte-sized movb that follows has a size mismatch, so forwarding fails.
movl %ecx, (%ebx)
movl (%ebx), %eax    # store is forwarded (sizes match)
movb (%ebx), %cl     # store is not forwarded (size mismatch)
Avoid non-zero segment bases: Atom is not optimized for non-zero segment bases, so using them decreases throughput. If required, a non-zero segment base can be used restrictively by ensuring that it is 64-bit aligned.
Matched call and return to obtain the instruction pointer: every call should be matched with a return. The example below calls _get_ip; the subroutine loads the return address (the caller's instruction pointer) from the top of the stack into ecx and returns. Because the ret matches the call, control returns to the instruction after the call, with the instruction pointer value available in ecx.
call _get_ip
# ... code using the instruction pointer in %ecx ...
_get_ip:
movl (%esp), %ecx
ret
Partial register dependencies: writes to part of a register can create false dependencies. Consider the example below. There is no real dependency between the two moves, but the second instruction waits for the first to complete because cl and ch are both subsets of ecx. This assumed dependency stalls the next instruction unnecessarily, so the programmer must be careful about such register usage.
movb %bl, %cl
movb %ch, %dh    # ch is assumed dependent on cl
Flag Updates: instructions that depend on the status register are limited. An instruction that reads the flags cannot execute simultaneously with the instruction that sets them. In the following example, the setc instruction cannot execute in the same cycle as the addl; similarly, jc incurs a one-cycle stall after the addl instruction.
addl %ebx, %ebx
setc %ah        # 1 cycle stall after the addl
addl %ebx, %ebx
jc dest         # executes 1 cycle after the previous addl
3.1 Programming Tips for Effective Math
A few optimizations for better performance of math operations are:
a) An effective address generated from the result of a previous ALU operation is subject to a multi-cycle penalty. To avoid the delay, use the lea instruction for address calculation (see the sketch after this list).
b) Effective scheduling of integer multiply: multiplication takes longer to execute, so it is good to schedule a mul instruction together with another mul or a floating-point operation.
c) Avoid divides: the divide instruction is very slow compared to others, so it is better to choose alternatives where possible (see the sketch after this list).
d) Scalar doubles are faster than packed doubles: packed doubles have lower throughput, so it is better to use multiple scalar operations.
e) SSE instructions execute faster than x87 for floating-point operations: again this is a throughput concern, as x87 has a two-cycle penalty.
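A minimal sketch of points (a) and (c); the register choices and the power-of-two divisor are illustrative assumptions:
leal 8(%ebx,%ecx,4), %edx    # (a) compute %ebx + %ecx*4 + 8 with lea instead of a chain of ALU adds
shrl $3, %eax                # (c) unsigned division of %eax by 8 replaced by a shift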
4. Interrupts and Exceptions
Interrupts and exceptions are events that are orthogonal to normal program execution. The processor is equipped to handle these events by executing a series of instructions called handlers. Handlers route the processor to a different mode and handle the event in a deterministic manner. Every interrupt is assigned a unique ID called its vector number, and interrupts may be maskable or non-maskable. Interrupts are of two types: hardware interrupts, which are raised by peripherals, and software interrupts, typically raised by the instruction INT #vector.
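For example, a software interrupt transfers control to the handler for the vector named in the instruction (the second line assumes a Linux-style environment where vector 0x80 is the legacy system-call gate):
int $3       # invokes the handler for vector 3, the breakpoint exception
int $0x80    # invokes the handler for vector 0x80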
Exceptions are events associated with instruction execution that are detected by the processor. A popular example is the divide-by-zero exception generated when a divide instruction executes with a divisor of zero. Exceptions can also be raised for memory issues, invalid instructions, and hardware problems such as system bus errors; the hardware-detected errors are referred to as machine-check exceptions, with examples including ECC errors, cache errors, TLB errors and parity errors.
Exceptions are categorized into two types: precise exceptions and imprecise exceptions. A precise exception has a definite, known cause that can generally be identified and rectified; the exception triggered by a divide-by-zero instruction is one example. Precise exceptions are further classified into faults and traps. A fault is reported before the instruction completes, so once the cause is rectified the faulting instruction can be restarted and the program continues without hindrance; a page fault is an example. A trap is reported after the trapping instruction completes and is typically handled by the OS in kernel mode without any loss of program continuity; an example is INTO (overflow).
Imprecise exceptions are those that cannot be precisely attributed to a particular instruction. The gap between when the condition arises and when the exception is processed can be large, and the exception may not be related to the instruction currently executing; the processor may have executed many further instructions by the time it is detected. An example is an ECC error in the cache. These exceptions are generally not recoverable, and Intel calls them aborts. Table 1.1 lists a set of exceptions with their sources, types and vector numbers (identifiers).
Figure 1.8 depicts the different sources of interrupts and exceptions, such as the processor cores, cache, bus interface, external peripherals, ALUs and FP units.
The processor has a mechanism to execute the corresponding handler on the occurrence of an interrupt or exception. An interrupt vector table contains the location of each handler along with its vector number and source. To execute a handler, the processor switches context, moving from the current mode to the interrupt/system mode. Figure 1.9 shows the registers involved in executing the handlers. The steps taken to resolve and execute handlers through the vector table and interrupt controller are described below.
The interrupt controller is called the local Advanced Programmable Interrupt Controller (local APIC). The vector table is called the Interrupt Descriptor Table (IDT), and the base address of the IDT is held in the Interrupt Descriptor Table Register (IDTR).
The IDT contains three kinds of gate descriptors that provide access to the handlers: 1. task gates, 2. interrupt gates and 3. trap gates. The difference between these gates lies in how the handlers are invoked; for example, an interrupt gate clears the IF flag, which disables further maskable interrupts while the current interrupt handler executes. Interrupt and trap gates are used for external interrupts and exceptions. An interrupt gate contains the following information:
Segment selector: selects the code segment (from the global or local descriptor table) that supplies the base address
Segment offset: added to the segment base to produce the linear address of the ISR
Privilege level: usually zero, equivalent to kernel mode
To locate the ISR, the processor indexes the IDT (starting from the base address in the IDTR) with the vector number, reads the gate descriptor, and combines its segment selector and offset to form the handler address. Prior to the execution of the ISR, the appropriate stack is used to save the registers.
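As a minimal sketch of how the IDT base gets into the IDTR and what a handler looks like (GNU as, 32-bit protected mode; the labels and the empty descriptor table are illustrative assumptions, not the module's own code):
idt_start:
    .space 2048                     # room for 256 gate descriptors of 8 bytes each
idt_end:
idt_pointer:
    .word idt_end - idt_start - 1   # IDT limit
    .long idt_start                 # IDT base address
install_idt:
    lidt idt_pointer                # load the base and limit into the IDTR
    ret
my_handler:                         # reached through a gate descriptor in the IDT
    pushal                          # save the general-purpose registers on the current stack
    # ... service the interrupt or exception ...
    popal
    iret                            # restore EIP, CS and EFLAGS and resume the interrupted code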
We now look at how system-level architectures are built using the Atom processor.
5. System Level Architecture
The Atom is Intel's smallest and lowest-power processor, running at 1.6 GHz with a TDP of 2.5 W. The Intel Atom processor enables the industry to create pocket-sized, low-power Mobile Internet Devices (MIDs), Internet-focused notebooks (netbooks) and desktops (nettops). Figure 1.10 shows the system-level architecture with the Atom processor: it consists of the CPU, memory and I/O. Supporting these basic components are two interfaces, the System Controller Hub (SCH) and the Front Side Bus (FSB). The SCH provides legacy support for I/O and memory peripherals as well as power-management support. The FSB is the interface between the CPU and everything outside the system package. The basic flow of information in the system goes through memory and I/O: during reset, the BIOS firmware configures the memory and the I/O controller, and the CPU can then perform both memory and I/O operations because they are mapped to separate address ranges.
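A small sketch of the two kinds of access (port 0x60 is the legacy keyboard-controller data port; the memory-mapped symbol ioreg_base is an illustrative assumption):
inb  $0x60, %al         # port-mapped I/O: read a byte from I/O port 0x60
movl ioreg_base, %eax   # memory-mapped I/O: an ordinary load from a device register's address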
5.1 E6xx Series System Level Architecture
The Intel Atom Processor E6xx series is the next-generation Intel architecture (IA) CPU for small-form-factor, ultra-low-power embedded segments, based on a new architecture partitioning. Figure 1.11 shows the system-level architecture with the Atom E6xx series. The partitioning integrates the 3D graphics engine, the memory controller and other blocks with the IA CPU core, and pairs the processor with a customer-defined I/O hub (IOH), ASIC or FPGA. Table 1.2 lists the features of the system architecture.
- 1.3 GHz (mainstream SKU)
- Macro-operation execution support
- 2-wide instruction decode and in-order execution
- On-die 32 kB, 4-way L1 instruction cache and 24 kB, 6-way L1 data cache
- On-die 512 kB, 8-way L2 cache
- Support for the IA 32-bit architecture
- Intel Virtualization Technology (Intel VT-x)
- Intel Hyper-Threading Technology (two threads)
- Enhanced Intel SpeedStep Technology
- TDP of 3.6 W
Summary
In this module, we have discussed the Atom micro-architecture and system-level architecture. We looked into the details of the cluster units in the Atom, explored tips for efficient programming on the Atom, and examined interrupts and exceptions and their management.
References
- Lori M. Matassa and Max Domeika, Break Away with Intel Atom Processors: A Guide to Architecture Migration, Intel Press, 2012.
- Lyla B. Das, The x86 Microprocessors: 8086 to Pentium, Multicores, Atom and the 8051 Microcontroller - Architecture, Programming and Interfacing, 2nd edition, Pearson Education India, 2014.
- en.wikipedia.org/wiki/Intel_Atom
- http://edc.intel.com