2022-03-20
(This is an early draft)
One LLVM IR file (.ll) represents an LLVM IR Module, a
top-level entity encapsulating all other data structures in the IR.
There are four such data structures; we will focus on global symbols (variables and functions).
Global symbols are top-level Values visible to the
entire Module. Their names always start with the @ symbol,
for example: @x, @__foo and
@main.
Unlike the name of a register, the name of a global symbol may have semantic
meaning in the program; in other words, global symbols have linkage. For
example, a global symbol may have external linkage, which
means its name is visible to other Modules. Renaming such a symbol
would be illegal: doing so could invalidate code in
other Modules.
Global symbols define memory regions allocated at compilation time.
For this reason, the Value of a global symbol has a pointer
type.
For example, if we declare a global variable of type i32
called x, the type of the Value
@x is ptr. To access the underlying integer,
we must first load from that address.
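A rough C++ analogy may help here. In the sketch below, the names x, read_x and address_of_x are made up for illustration. Using x by value corresponds to a load through @x, while taking its address corresponds to using the Value @x directly. The IR in the comments is only an approximation of what a compiler such as clang might emit; the exact output (attributes, alignment, mangled function names) will differ.

```cpp
#include <cassert>

int x = 42;  // roughly: @x = global i32 42

// Reading x goes through memory, roughly: %v = load i32, ptr @x
int read_x() { return x; }

// The address of x is the Value @x itself, roughly: ret ptr @x
int *address_of_x() { return &x; }

// Small self-check tying the two views together.
void check_global_is_an_address() {
    assert(address_of_x() == &x);         // same memory region
    assert(*address_of_x() == read_x());  // dereferencing is the load
}
```

Calling check_global_is_an_address() from any test driver exercises both views of the same global.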
There are two kinds of global symbols: global variables and functions.
As global symbols, global variables have a name and linkage.
Additionally, they require a type and a constant initial
Value:
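@gv1 = external global float 1.0
In this example, we have a global symbol that has external linkage, is named gv1, holds a float Value, and is initialized with the Value float 1.0.
External linkage is the default and can be omitted:
@gv1 = global float 1.0
Note that LLVM's textual IR parser accepts the external keyword only on declarations (globals without an initializer), so in practice a definition is always written in this second form.
From here on, we will be omitting linkage for all global symbols.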
Recall that, because all global symbols define a memory region, the
Value @gv1 has a pointer type. As such, to
read or write the Value in that memory location we use
loads and stores:
%1 = load float, ptr @gv1
store float 2.0, ptr @gv1
There is one other important variation of global variables: we may
replace global with the constant keyword:
@gv1 = constant float 1.0
This means that stores to this memory region are illegal and the optimizer can assume they do not exist.
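The same distinction can be sketched from the C++ side. The names cgv1, read_gv1, write_gv1 and read_cgv1 below are hypothetical, and the IR in the comments is an approximation rather than exact compiler output. The extern keyword on the constant is only there to avoid the internal linkage that C++ gives namespace-scope const variables by default.

```cpp
#include <cassert>

float gv1 = 1.0f;                // roughly: @gv1  = global float 1.0
extern const float cgv1 = 1.0f;  // roughly: @cgv1 = constant float 1.0

float read_gv1() { return gv1; }      // roughly: load float, ptr @gv1
void write_gv1(float v) { gv1 = v; }  // roughly: store float %v, ptr @gv1
float read_cgv1() { return cgv1; }    // an optimizer may fold this to 1.0

// cgv1 = 2.0f;  // would not compile: assignment to a const object

void check_loads_and_stores() {
    assert(read_gv1() == 1.0f);
    write_gv1(2.0f);              // stores to a plain global are fine
    assert(read_gv1() == 2.0f);
    assert(read_cgv1() == 1.0f);  // the constant can never change
}
```

Because no store to the constant can exist, a compiler is free to replace reads of it with the literal value, which is the kind of assumption the constant keyword enables.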
Let’s compile some C++ global declarations and look at the corresponding IR global variables:
int just_int;
// @just_int = dso_local global i32 0, align 4
The keyword dso_local indicates, roughly, that
this variable is not going to be “patched in” at runtime,
as can happen with dynamic libraries. This information is useful for
the optimizer.
Note that, while we didn’t explicitly initialize the C++ variable, it is zero-initialized in IR. Zero initialization is required by C++ in this case, so we see it captured in the C++ to IR translation.
Finally, there is alignment information: the address of this variable is guaranteed to be a multiple of 4.
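Both properties can be observed from C++ itself. The sketch below is only an illustration: the helper names are hypothetical, and it assumes a typical target where alignof(int) is 4, matching the align 4 in the IR above.

```cpp
#include <cassert>
#include <cstdint>

int just_int;  // no initializer: static storage duration, so zero-initialized

// True if the variable's address is a multiple of its required alignment.
bool just_int_is_aligned() {
    auto address = reinterpret_cast<std::uintptr_t>(&just_int);
    return address % alignof(int) == 0;
}

void check_just_int() {
    assert(just_int == 0);          // matches the i32 0 initializer in the IR
    assert(just_int_is_aligned());  // matches the alignment guarantee
}
```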
extern int extern_int;
// @extern_int = external global i32, align 4
If we make our variable extern, a few things change:
- The external linkage is explicitly written out. This is just a quirk of the IR parser/printer. The variable just_int also had external linkage implicitly.
- The variable is no longer dso_local: it could be defined in some shared library that will be linked later.
Let’s look at more examples:
const int const_int = 1;
// @_ZL9const_int = internal constant i32 1
static int static_int = 2;
// @_ZL10static_int = internal global i32 2
static const int static_const_int = 3;
// @_ZL16static_const_int = internal constant i32 3
Compare these static variables to what happens with a class static variable:
class MyClass {
public:
static int static_class_member;
// @_ZN7MyClass19static_class_memberE = external global i32, align 4
static const int static_const_class_member;
// @_ZN7MyClass25static_const_class_memberE = external constant i32, align 4
};
Both class members have external
linkage. This shows the completely different meanings of static in a C++
program: before, we were using static to mean “local to this
translation unit”, and so the variable gets internal linkage; in the
class example, we are essentially providing a namespace for the variable,
and it can still be accessed by other translation units.
You can see these in action in Godbolt.
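The class members above appear as external declarations because the class body only declares them. As a hedged illustration, the sketch below adds an out-of-class definition with a made-up initial value of 5; with such a definition in the translation unit, one would expect the IR to contain a definition of the mangled symbol (roughly global i32 5) rather than an external declaration. The check function is hypothetical.

```cpp
#include <cassert>

static int static_int = 2;  // internal linkage: private to this translation unit

class MyClass {
public:
    static int static_class_member;  // declaration only
};

// Out-of-class definition; the value 5 is an arbitrary choice for this sketch.
int MyClass::static_class_member = 5;

void check_statics() {
    assert(static_int == 2);
    assert(MyClass::static_class_member == 5);
    MyClass::static_class_member = 7;  // writable wherever MyClass is visible
    assert(MyClass::static_class_member == 7);
}
```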
A function declaration in LLVM IR has the following syntax:
declare i64 @foo(i64, ptr)
It consists of the declare keyword, the return type (i64), the function name (foo), and the parameter types (i64,
ptr).
A function definition is very similar to the declaration,
but we use a different keyword (define), provide names for
the parameters and include the body of the function:
define i64 @foo(i64 %val, ptr %myptr) {
%temp = load i64, ptr %myptr
%mul = mul i64 %val, %temp
ret i64 %mul
}
This function loads an i64 Value from
%myptr, multiplies it by %val and returns the
result (ret instruction).
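For comparison, here is a C++ function that should lower to IR of roughly this shape. This is a sketch, not the source the IR above was generated from: a real compiler would mangle the function name and would likely add flags such as nsw to the multiplication, since signed overflow is undefined in C++. The check_foo helper is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

std::int64_t foo(std::int64_t val, const std::int64_t *myptr) {
    std::int64_t temp = *myptr;     // %temp = load i64, ptr %myptr
    std::int64_t mul = val * temp;  // %mul = mul i64 %val, %temp
    return mul;                     // ret i64 %mul
}

void check_foo() {
    std::int64_t seven = 7;
    assert(foo(6, &seven) == 42);  // 6 * 7
}
```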
What is the type of @foo? Like all global symbols, it
defines a memory region and therefore its type is a pointer type
(ptr).
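C++ function pointers give a familiar view of the same idea: the name of a function can be used as an address and passed around like any other pointer. In the sketch below, twice, UnaryFn, apply and check_function_pointer are all hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

std::int64_t twice(std::int64_t v) { return 2 * v; }

using UnaryFn = std::int64_t (*)(std::int64_t);

// Calls whatever function fn points to; in IR, fn would be a plain ptr argument.
std::int64_t apply(UnaryFn fn, std::int64_t v) { return fn(v); }

void check_function_pointer() {
    UnaryFn fn = twice;  // the function name decays to its address
    assert(fn == &twice);
    assert(apply(fn, 21) == 42);
}
```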
It is a useful exercise to read the LLVM documentation on some of the topics discussed.
Now that we understand the core concepts in LLVM, have discussed global symbols, and have explored some basic instructions, we are ready to dig into the biggest piece of the puzzle: function bodies.