What is an Intermediate Representation?


Intermediate Representations (IRs) are used by compilers to represent programs throughout the compilation pipeline. To understand what an IR is, let’s look at what compilers do and how they are structured.

Compilers and Transformations

In essence, a compiler takes a program written in some source language and transforms it into an executable program. This transformation has a key property: it preserves some kind of behavior.

For example, a program written in C++ describes what happens in some abstract machine defined by the C++ standard. When the compiler transforms this program into a program for a real world machine, it must ensure that the real world machine will behave like the C++ abstract machine does.

Many Transformations

In reality, the compiler performs many such behaviour-preserving transformations before producing the final program, using different representations along the way. Let’s look at one possible flow that an LLVM-based compiler may follow; don’t worry about the names of each stage, we’re interested in the fact that they exist, and where the IR is located.

We start with the source program, and then we:

  1. Transform it into a Parse Tree.
  2. Transform it into an Abstract Syntax Tree (AST).
  3. Transform it into a semantically valid AST.
  4. Transform it into Intermediate Representation (IR).
  5. Transform it into equivalent IR.
  6. Transform it into a Selection DAG.
  7. Transform it into sequence of machine instructions.
  8. Transform it into the final program.

The list is not comprehensive, especially in the later stages which I am not familiar with; the important observation is the number of transformations, and how all of them must preserve behavior specified by each input representation in the output representation.

While each representation in the list above is “intermediate” in the sense that it is neither the input program nor the final executable, we usually take Intermediate Representation to mean the IR generated in Step 4, “optimized” (transformed) in Step 5, and lowered in Step 6.

Different Languages, Same IR

Steps 1-5 are specific to the source language of the input program, but all other steps are agnostic to the language; the IR is the first such agnostic representation. Using this scheme, one can conceive different compilers that all share the “middle” and “back”-ends of the sequence above:

A side-effect of a language-agnostic IR is that everything required by the language specification must be captured using generic mechanisms provided by the IR: the language specification doesn’t exist in that level. Because of this, one can inspect IR and understand how language concepts map to simpler and low level code abstractions.

Goals of the IR

In the compilation pipeline, the IR sits between representations specific to source languages and representations specific to the target machine.

From where the IR is positioned in the compilation pipeline, we can derive some of its design goals:

LLVM’s IR attempts to achieve these design goals by:

Up Next

The next post in this series will dig into the core concepts of LLVM’s IR!