2022-03-17
Intermediate Representations (IRs) are used by compilers to represent programs throughout the compilation pipeline. To understand what an IR is, let’s look at what compilers do and how they are structured.
In essence, a compiler takes a program written in some source language and transforms it into an executable program; this transformation has a key property: it preserves behavior defined by the source language.
For example, a program written in C++ describes what happens in the abstract machine defined by the C++ standard. When the compiler transforms this program into a program for a real-world machine, it must ensure that the real-world machine behaves as the C++ abstract machine would.
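To make "preserving behavior" concrete, here is a tiny made-up C example; the compiler is free to rewrite the code it was given into something quite different, as long as every outcome observable through the abstract machine stays the same.

```c
/* A minimal sketch: constant folding is one of the simplest
   behavior-preserving transformations a compiler performs. */
int area(void) {
    int w = 6, h = 7;
    return w * h; /* may be compiled as if we had written 'return 42;' */
}
```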
In reality, the compiler performs many behavior-preserving transformations before producing the final program; along the way, it uses many different representations for the program, such as graphs, high-level instruction sequences, and real machine instructions. Let’s look at one possible flow that an LLVM-based compiler may follow; don’t worry about the names of each stage, as we’re only interested in the fact that the stages exist and where the IR is located among them.
We start with the source program, and then we:
The list is not comprehensive, especially in the later stages, which I am not familiar with; the important observations are the sheer number of transformations and the fact that all of them must preserve behavior.
Each representation in the list above is “intermediate” in the sense that it is neither the input program nor the final executable, but we usually use the term Intermediate Representation to mean the representation generated in Step 4, “optimized” (transformed) in Step 5, and lowered in Step 6.
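As a rough sketch of how these three steps can be driven by hand (the file names are made up, and a real build normally runs the whole pipeline through a single driver invocation), LLVM’s command-line tools map onto them roughly like this:

```sh
clang -S -emit-llvm hello.c -o hello.ll   # language-specific steps, ending with IR generation (Step 4)
opt -O2 -S hello.ll -o hello.opt.ll       # "optimize" (transform) the IR (Step 5)
llc hello.opt.ll -o hello.s               # lower the IR to target assembly (Step 6)
```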
Steps 1-4 are specific to the source language of the input program, whereas all later steps are agnostic to the language; the IR produced in Step 4 is the first such agnostic representation. Using this scheme, one can conceive of different compilers that all share the “middle” and “back” ends of the sequence above:
As a consequence of the IR being language-agnostic, the behavior required by the input language’s specification must be captured using generic mechanisms provided by the IR; language-specific rules can’t exist at that level, otherwise the IR would no longer be language-agnostic. Because of this, one can inspect the IR and understand how language concepts are mapped to simpler, lower-level code abstractions.
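As a small example of that mapping (the exact output depends on the clang version, target, and optimization level, so treat this as a sketch rather than literal compiler output), a trivial C function like `int add(int a, int b) { return a + b; }` may be turned into IR along these lines:

```llvm
define i32 @add(i32 %a, i32 %b) {
entry:
  ; C's language-specific rule that signed overflow is undefined survives
  ; only as the generic 'nsw' (no signed wrap) flag on the add instruction.
  %sum = add nsw i32 %a, %b
  ret i32 %sum
}
```

Nothing in that snippet mentions C; the language-specific requirement has been translated into a generic IR mechanism.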
In the compilation pipeline, the IR sits between representations specific to source languages and representations specific to the target machine:
We can derive some of the IR’s design goals from its position in the compilation pipeline. It must be:
LLVM’s IR attempts to achieve these design goals by:
The next post in this series will dig into the core concepts of LLVM’s IR!