28 Code Generator Introduction and Basic Blocks

Rajeswari Sridhar

After discussing the various semantic rules necessary to convert every programming construct into three-address code, in this module, we will discuss the next phase of the compiler – Code generation. We will start this module by understanding the issues in generating code. As a first step in code generation, we will discuss the concept of basic blocks and flow graphs

 

28.1 Code Generation – Introduction

 

The position of the code generation module is shown in figure 28.1. As shown in figure 28.1, the front-end of the compiler delivers intermediate code as output after processing the lexical, syntactic and semantic phases. This intermediate code is used for code generation. The intermediate code could also be optimized and the optimized intermediate code could also be fed to the code generator.

The following are the expectations of the code generator:

 

•      Code produced by compiler must be correct – The conversion from the source to the target language should preserve the semantics of the source program.

•      Code produced by compiler should be of high quality – The code generation algorithm should be aware of the instructions set of the target language. The code should be generated, by effectively using these target machine resources. Normally, heuristic techniques are used to generate good but suboptimal code, as optimal code generation is an undecidable problem.

 

28.2 Issues in the Code Generator

The code generator needs to address so many issues before actually generating the code. The following are some of the issues and let us discuss each one of them in brief.

•      Input to the code generator: The code generator gets the three-address code as input. The format in which this three-address code is fed as input need to be decided. The following are some of the ways of feeding input to the code generator.

•      Linear – The input could be a string of characters where we use postfix notation to express the input.

•      Tables – As discussed in the previous modules, the three-address code is typically represented using Quadruples, Triples or Indirect triples and this table could be served as input

•      Non-linear – Abstract Syntax tree (AST) or Directed Acyclic Graph (DAG) could be used as input to the code generator after converting the input into AST or DAG representation  In addition to the input, symbol table information need to be given to the code generator. The format in which symbol table needs to be available to the code generator needs to be decided.

•      Target Program Code: The back-end code generator of a compiler may generate different forms of code, depending on the requirements. The target program code needs to be specified as assembly language or machine language. The following are some of the forms

•      Absolute machine code – This code is directly executable

•      Relocatable machine code – This is a relocatable machine language and is normally available as an object files

•      Assembly language – This is available in any assembly language which depends on the target machine and later an assembler converts this to machine language.

•      Byte code forms for interpreters – Java virtual machine has the write once and run anywhere concept where the output should be in Byte code.

•      Target Machine: Implementing code generation requires thorough understanding of the target machine architecture and its instruction set. Consider a hypothetical machine with the following features:

 

–  The instructions are byte-addressable where a word is 4 bytes

–  Has n general purpose registers R0, R1, …, Rn-1

–  Instructions are of Two-address, ones of the form

op source, destination where op stands for op-codes.

 

For example

 

MOV (move content of source to destination)

ADD (add content of source to destination)

SUB (subtract content of source from destination)

The following would be the instructions with a total cost of 6 (2+2+2):

 

MOV y,R0

ADD z,R0

MOV R0,x

 

Consider another instruction a:=a+1 and if we adopt the same strategy to convert this instruction to target code, the following would be the result with a cost of 6 (2+2+2)”

 

MOV a,R0

ADD #1,R0

MOV R0,a

 

If we replace this with the following instruction, the cost would be 3 (1 + 1 +1) which is “1” each for ADD, #1, a

 

ADD #1,a

 

On the other hand, if this is replaced by the following instruction the cost would be 2 (1 for INC and 1 for “a”)

 

INC a

 

•      Instruction Selection – The choice of instructions also could change based on the addressing Modes

 

•      Suppose we translate a:=b+c into

 

MOV b,R0

ADD c,R0

MOV R0,a

 

The cost of this would be 6 (2 + 2 +2)

 

•      On the other hand, assuming addresses of a, b, and c are stored in R0, R1, and R2

 

MOV *R1,*R0

ADD *R2,*R0

 

The cost of this would be 2 (1 +1) which is just due to the MOV and ADD instructions

 

•      Consider, R1 and R2 contain values of b and c

 

ADD R2,R1

MOV R1,a

 

Would incur a cost of 3 (1 + 2) for ADD, MOV.

  • Need for Global Code optimization

This is the next issue in code generation. Consider the instruction and its corresponding three-address code as given in the table 28.3(a):

 

Table 28.3 a – Instruction and code

 

Instruction Code
x:=y+z MOVy,R0
ADD z,R0
MOV R0,x

 

Consider the following sequence of instructions and supposing we use the same logic to translate, then we will end up in the table 28.3 (b).

 

Table 28.3 (b) – Sequence of Instructions and code

 

Instruction Code
a:=b+c MOVa,R0
ADD b,R0
MOV R0,a
d:=a+e MOV a,R0  ———— Redundant as R0 is used
ADD e,R0
MOV R0,d

 

As given in the Table 28.3 (b), register R0 already had the value of ‘a’ according to the previous computation and hence this MOV instruction as indicated is redundant and this calls for Global code optimization.

  • Register allocation and Assignment

One of the important issues in code generation is register allocations. The number of registers in any architecture is limited to match the number of variable in a high- level program. Efficient utilization of the limited set of registers is important to generate good code. Registers are assigned by

 

Register allocation to select the set of variables that will reside in registers at a point in the code

–  Register assignment to pick the specific register that a variable will reside in

 

However, finding an optimal register assignment in general is NP-complete. Consider the table 28.3 (c) that has the sequence of instructions and their corresponding three-address code.

28.3 Basic Blocks and Flowgraphs

 

To handle all the issues in the code generation process, we need to construct basic blocks and flow graphs from the sequence of three-address code. Basic blocks are computation sequences and they form the node of a flow graph. Flow graph is a graphical representation of three-address code where nodes are the basic blocks and edges indicate the flow of control between basic blocks. The primary objective of flow graphs is code optimization include optimal register allocation.

 

A flow graph is a graphical depiction of a sequence of instructions. A flow graph can be defined at the intermediate code level or target code level. Consider the following example at the final code level and the corresponding flow graph of Figure 28.2.

 

 

Figure 28.3 Flow graph with basic blocks highlighted

 

In a flow graph, suppose there is an edge from basic block B1 to B2 as BB2, then B1 is a predecessor of B2 and B2 is a successor of B1

 

In figure 28.3, we have grouped the statements into basic blocks 1, 2, 3 and the manner to do it is given algorithm 28.1

 

Algorithm 28.1 Basic block construction

 

Input:  A sequence of three-address statements

Output:   A     list  of  basic  blocks  with each

three-address statementin  exactly one block

1.      Determine the set of leaders, the first statements if basic blocks

a)      The first statement is the leader

b)      Any statement that is the target of a goto is a leader

c)      Any statement that immediately follows a goto is a leader

 

2.   For    each    leader, statements    up    to  its  but  basic not  block including  consist the  of  next  the   leader leader    or  and  the all end  of the program

From algorithm 28.1, it is evident that the first step is to identify the set of leaders and group the statements between a pair of leaders into a basic block. An example of this is given in the next module

 

Summary: In this module, we understood the need for code generation and the various issues that need to be handled in code generation. Basic block and algorithm for basic block creation from three address / final code was discussed.