2022-03-17
Intermediate Representations (IRs) are used by compilers to represent programs throughout the compilation pipeline. To understand what an IR is, let’s look at what compilers do and how they are structured.
In essence, a compiler takes a program written in some source language and transforms it into an executable program; this transformation has a key property: it preserves behavior defined by the source language.
For example, a program written in C++ describes what happens in the abstract machine defined by the C++ standard. When the compiler transforms this program into a program for a real-world machine, it must ensure that the real-world machine behaves as the C++ abstract machine would.
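To make "preserving behavior" concrete, here is a tiny made-up C example; the compiler is free to rewrite the code it was given into something quite different, as long as every outcome observable through the abstract machine stays the same.

```c
/* A minimal sketch: constant folding is one of the simplest
   behavior-preserving transformations a compiler performs. */
int area(void) {
    int w = 6, h = 7;
    return w * h; /* may be compiled as if we had written 'return 42;' */
}
```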
In reality, the compiler performs many behavior-preserving transformations before producing the final program; along the way, it uses many different representations for the program, such as graphs, high-level instruction sequences, and real machine instructions. Let’s look at one possible flow that an LLVM-based compiler may follow; don’t worry about the names of each stage, as we’re only interested in the fact that the stages exist and where the IR is located among them.
We start with the source program, and then we:
The list is not comprehensive, especially in the later stages, which I am not familiar with; the important observations are the sheer number of transformations and the fact that all of them must preserve behavior.
Each representation in the list above is “intermediate” in the sense that it is neither the input program nor the final executable, but we usually use the term Intermediate Representation to mean the representation generated in Step 4, “optimized” (transformed) in Step 5, and lowered in Step 6.
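As a rough sketch of how these three steps can be driven by hand (the file names are made up, and a real build normally runs the whole pipeline through a single driver invocation), LLVM’s command-line tools map onto them roughly like this:

```sh
clang -S -emit-llvm hello.c -o hello.ll   # language-specific steps, ending with IR generation (Step 4)
opt -O2 -S hello.ll -o hello.opt.ll       # "optimize" (transform) the IR (Step 5)
llc hello.opt.ll -o hello.s               # lower the IR to target assembly (Step 6)
```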
Steps 1-4 are specific to the source language of the input program, whereas all later steps are agnostic to the language; the IR produced in Step 4 is the first such agnostic representation. Using this scheme, one can conceive of different compilers that all share the “middle” and “back” ends of the sequence above:
As a consequence of the IR being language-agnostic, the behavior required by the input language’s specification must be captured using generic mechanisms provided by the IR; language-specific rules can’t exist at that level, otherwise the IR would no longer be language-agnostic. Because of this, one can inspect the IR and understand how language concepts are mapped to simpler, lower-level code abstractions.
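As a small example of that mapping (the exact output depends on the clang version, target, and optimization level, so treat this as a sketch rather than literal compiler output), a trivial C function like `int add(int a, int b) { return a + b; }` may be turned into IR along these lines:

```llvm
define i32 @add(i32 %a, i32 %b) {
entry:
  ; C's language-specific rule that signed overflow is undefined survives
  ; only as the generic 'nsw' (no signed wrap) flag on the add instruction.
  %sum = add nsw i32 %a, %b
  ret i32 %sum
}
```

Nothing in that snippet mentions C; the language-specific requirement has been translated into a generic IR mechanism.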
In the compilation pipeline, the IR sits between representations specific to source languages and representations specific to the target machine:
We can derive some of the IR’s design goals from its position in the compilation pipeline. It must be:
LLVM’s IR attempts to achieve these design goals by:
The next post in this series will dig into the core concepts of LLVM’s IR!