2022-03-20
(This is an early draft)
One LLVM IR file (.ll) represents an LLVM IR Module, a
top-level entity encapsulating all other data structures in the IR.
There are four such data structures:
We will focus on global symbols (variables and functions).
Global symbols are top-level Values visible to the
entire Module. Their names always start with the @ symbol,
for example: @x, @__foo and @main.
Unlike registers, the name of a global symbol may have semantic
meaning in the program; in other words, global symbols have linkage. For
example, a global symbol may have external linkage, which
means its name is visible to other Modules. For such a symbol,
it would be illegal to rename it: doing so could invalidate code in
other Modules.
Global symbols define memory regions allocated at compilation time.
For this reason, the Value of a global symbol has a pointer type.
For example, if we declare a global variable of type i32 called x,
the type of the Value @x is ptr. To access the underlying integer,
we must first load from that address.
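As a rough C++ analogy (not LLVM API code), the IR name @x behaves like the address &x rather than like the variable x itself. The sketch below assumes a hypothetical global x and two hypothetical helpers, address_of_x and load_x, introduced only for illustration.

```cpp
#include <cassert>

// Hypothetical global, playing the role of `@x = global i32 7`.
int x = 7;

// In IR, the Value @x is the address of the storage (type ptr).
// The closest C++ spelling of that Value is &x.
int *address_of_x() { return &x; }

// Reading the integer needs an explicit load through that address,
// which is the job of `load i32, ptr @x` in IR.
int load_x() { return *address_of_x(); }

// Small usage check: the address is stable and the load yields the contents.
void check_x() {
  assert(address_of_x() == &x);
  assert(load_x() == 7);
}
```

C++ hides the load behind the plain name x; IR makes it a separate instruction.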
There are two kinds of global symbols: global variables and functions.
As global symbols, global variables have a name and linkage.
Additionally, they require a type and a constant initial Value:
@gv1 = external global float 1.0
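In this example, we have a global symbol that is named gv1, has external linkage, holds a float Value, and has the initial Value float 1.0.
External linkage is the default and can be omitted. In fact, LLVM's textual IR parser treats an explicitly written external as marking a declaration and does not accept an initializer after it, so a definition is normally written without the keyword: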
@gv1 = global float 1.0
From here on, we will be omitting linkage for all global symbols.
Recall that, because all global symbols define a memory region, the
Value @gv1 has a pointer type. As such, to read or write the Value
in that memory location we use loads and stores:
%1 = load float, ptr @gv1
store float 2.0, ptr @gv1
There is one other important variation of global variables: we may
replace the global keyword with the constant keyword:
@gv1 = constant float 1.0
This means that stores to this memory region are illegal and the optimizer can assume they do not exist.
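The difference between global and constant can be sketched with a loose C++ analogy. The names gv, gc, store_then_load and load_constant are hypothetical. The correspondence is only approximate: a namespace-scope const in C++ also implies internal linkage, which the IR constant keyword does not.

```cpp
#include <cassert>

// Mutable global: roughly `@gv = global float 1.0`.
float gv = 1.0f;

// Read-only global: roughly `@gc = constant float 1.0`.
const float gc = 1.0f;

// Corresponds to `store float ..., ptr @gv` followed by `load float, ptr @gv`.
float store_then_load(float new_value) {
  gv = new_value;
  return gv;
}

// Only loads are possible on the constant. Writing `gc = 2.0f;` would not
// compile, mirroring the rule that stores to a `constant` global are illegal.
float load_constant() { return gc; }

// Small usage check for both kinds of global.
void check_globals() {
  assert(store_then_load(2.0f) == 2.0f);
  assert(load_constant() == 1.0f);
}
```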
Let’s compile some C++ global declarations and look at the corresponding IR global variable:
int just_int;
// @just_int = dso_local global i32 0, align 4
The keyword dso_local is used to indicate, roughly, that
this variable is not going to be “patched in” at runtime,
like in the case of dynamic libraries. This information is useful for
the optimizer.
Note that, while we didn’t explicitly initialize the C++ variable, it is zero-initialized in IR. Zero initialization is required by C++ in this case, so we see it captured in the C++ to IR translation.
Finally, there is alignment information: the address of this variable is guaranteed to be a multiple of 4.
extern int extern_int;
// @extern_int = external global i32, align 4
If we make our variable extern, a few things change.
First, external linkage is explicitly written out. This is
just a quirk of the IR parser/printer. The variable just_int
also had external linkage implicitly.
Second, the variable is no longer dso_local: it could be
defined in some shared library that will be linked later.
Let’s look at more examples:
const int const_int = 1;
// @_ZL9const_int = internal constant i32 1
static int static_int = 2;
// @_ZL10static_int = internal global i32 2
static const int static_const_int = 3;
// @_ZL16static_const_int = internal constant i32 3
Compare these static variables to what happens with a class static variable:
class MyClass {
public:
static int static_class_member;
// @_ZN7MyClass19static_class_memberE = external global i32, align 4
static const int static_const_class_member;
// @_ZN7MyClass25static_const_class_memberE = external constant i32, align 4
};
Both class members have external linkage. This shows the completely different meanings of static in a C++
program: before, we were using static to mean “local to this
translation unit”, so the variable got internal linkage; in the
class example, we are essentially providing a namespace to the variable,
but it can still be accessed by other translation units.
You can see these in action in Godbolt.
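To make the contrast concrete, here is a minimal C++ sketch that puts both uses of static side by side. The initial value 5 and the helper sum_of_statics are assumptions added for illustration. The out-of-class definition is what provides storage for the member, which the class body alone only declares.

```cpp
#include <cassert>

// Namespace-scope static: internal linkage, private to this translation unit.
static int static_int = 2;

class MyClass {
public:
  static int static_class_member;  // declaration only, no storage yet
};

// Out-of-class definition: provides the storage with external linkage, so
// other translation units can refer to MyClass::static_class_member.
int MyClass::static_class_member = 5;

// Hypothetical helper that reads both variables.
int sum_of_statics() { return static_int + MyClass::static_class_member; }

// Small usage check.
void check_statics() { assert(sum_of_statics() == 7); }
```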
A function declaration in LLVM IR has the following syntax:
declare i64 @foo(i64, ptr)
A declaration consists of the keyword declare, the return type (i64), the function name (foo) and the parameter types (i64, ptr).
A function definition is very similar to the declaration,
but we use a different keyword (define), provide names to
the parameters and include the body of the function:
define i64 @foo(i64 %val, ptr %myptr) {
  %temp = load i64, ptr %myptr
  %mul = mul i64 %val, %temp
  ret i64 %mul
}
This function loads an i64 Value from %myptr, multiplies it by %val
and returns the result (ret instruction).
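For readers more at home in C++, the function below is one plausible source-level counterpart of the IR definition, with each statement annotated with the instruction it resembles. This is a sketch under assumptions: the C++ signature is a guess at what could produce such IR, compiled as C++ the symbol name would be mangled rather than plain foo, and a plain IR mul wraps on overflow whereas signed overflow in C++ is undefined.

```cpp
#include <cassert>
#include <cstdint>

// Assumed C++ counterpart of `define i64 @foo(i64 %val, ptr %myptr)`.
std::int64_t foo(std::int64_t val, const std::int64_t *myptr) {
  std::int64_t temp = *myptr;     // %temp = load i64, ptr %myptr
  std::int64_t mul = val * temp;  // %mul = mul i64 %val, %temp
  return mul;                     // ret i64 %mul
}

// Small usage check: 7 times the pointed-to 6.
void check_foo() {
  std::int64_t six = 6;
  assert(foo(7, &six) == 42);
}
```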
What is the type of @foo? Like all global symbols, it
defines a memory region and therefore its type is a pointer type (ptr).
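C++ has a similar notion: a function's name converts to a pointer to its code. The sketch below uses a hypothetical function twice and a hypothetical alias UnaryFn to show the function being treated as an address and then called through it.

```cpp
#include <cassert>

// Hypothetical function standing in for a global function symbol.
long twice(long v) { return 2 * v; }

// The function's name converts to a pointer to its code, much like the
// Value of a function symbol has type ptr in IR.
using UnaryFn = long (*)(long);
UnaryFn address_of_twice() { return twice; }

// Calling through the pointer is an indirect call.
long call_through_pointer(long v) { return address_of_twice()(v); }

// Small usage check.
void check_twice() {
  assert(address_of_twice() == &twice);
  assert(call_through_pointer(21) == 42);
}
```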
It is a useful exercise to read the LLVM documentation on some of the topics discussed:
Now that we understand the core concepts in LLVM, have discussed global symbols and have explored some basic instructions, we are ready to dig into the biggest piece of the puzzle: function bodies.