C++ Compilation Process

28 September, 2013 - 7 min read

The C++ compilation process

Compiling a source code file in C++ is a four-step process. For example, if you have a C++ source code file named prog1.cpp and you execute the compile command

   g++ -Wall -ansi -o prog1 prog1.cpp

the compilation process looks like this:

1. The C++ preprocessor copies the contents of the included header files into the source code file, expands macros, and replaces symbolic constants defined using #define with their values.
2. The expanded source code file produced by the C++ preprocessor is compiled into assembly language for the platform.
3. The assembly code generated by the compiler is assembled into object code for the platform.
4. The object code file generated by the assembler is linked together with the object code files for any library functions used, producing an executable file.

By using appropriate compiler options, we can stop this process at any stage.

To stop the process after the preprocessor step, you can use the -E option:

<pre>   g++ -E prog1.cpp</pre>



The expanded source code file will be printed on standard output (the screen by default); you can redirect the output to a file if you wish. Note that the expanded source code file is often incredibly large - a 20 line source code file can easily produce an expanded file of 20,000 lines or more, depending on which header files were included.
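As a rough illustration of what the preprocessor step changes, the sketch below uses one symbolic constant and one function-like macro. The names `BUFFER_SIZE`, `SQUARE` and `total_cells` are invented for this example and are not part of prog1.cpp. Running `g++ -E` on a file like this should show the macro names replaced by their expansions.

```cpp
// Hypothetical example file; all names here are illustrative only.
#define BUFFER_SIZE 16          // symbolic constant
#define SQUARE(x) ((x) * (x))   // function-like macro

// After preprocessing, the return statement reads roughly
//     return ((16) * (16));
// so the compiler proper never sees BUFFER_SIZE or SQUARE.
int total_cells() {
    return SQUARE(BUFFER_SIZE);
}
```

Because the file includes no headers, its expanded form stays small, which makes the substitution easy to spot in the output.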

To stop the process after the compile step, you can use the -S option:

<pre>   g++ -Wall -ansi -S prog1.cpp</pre>



By default, the assembler code for a source file named _filename.cpp_ will be placed in a file named _filename.s_.

To stop the process after the assembly step, you can use the -c option:

<pre>   g++ -Wall -ansi -c prog1.cpp</pre>



By default, the object code for a source file named _filename.cpp_ will be placed in a file named _filename.o_.

We will briefly highlight key features of the C Compilation model here.

Fig. The C Compilation Model

The Preprocessor

We will study this part of the compilation process in greater detail later.

However, we need some basic information about it to understand some C programs.

The preprocessor accepts source code as input and is responsible for:

- removing comments
- interpreting special preprocessor directives denoted by #.

For example

- #include -- includes the contents of a named file, usually called a header file, e.g.
  - #include <math.h> -- standard library maths file
  - #include <stdio.h> -- standard library I/O file
- #define -- defines a symbolic name or constant (macro substitution), e.g.
  - #define MAX_ARRAY_SIZE 100
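The directives listed above can be seen working together in a small sketch. It is written so that it also compiles as C++, and the helper names `last_root` and `describe_size` are hypothetical, chosen only for this illustration.

```cpp
#include <math.h>   // standard library maths declarations (sqrt)
#include <stdio.h>  // standard library I/O declarations (snprintf)

#define MAX_ARRAY_SIZE 100  // every later use is replaced by 100

// Hypothetical helper: fills an array of MAX_ARRAY_SIZE elements with
// square roots and returns the last one, i.e. sqrt(99).
double last_root() {
    double roots[MAX_ARRAY_SIZE];
    for (int i = 0; i < MAX_ARRAY_SIZE; ++i) {
        roots[i] = sqrt(static_cast<double>(i));
    }
    return roots[MAX_ARRAY_SIZE - 1];
}

// Hypothetical helper: writes the array size as text into buffer and
// returns the number of characters needed for it.
int describe_size(char* buffer, size_t capacity) {
    return snprintf(buffer, capacity, "%d", MAX_ARRAY_SIZE);
}
```

The two headers supply only declarations for `sqrt` and `snprintf`; their compiled code is brought in later, at the link step.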

C Compiler

The C compiler translates source to assembly code. The source code is received from the preprocessor.

Assembler

The assembler creates object code. On a UNIX system you may see files with a .o suffix (.OBJ on MSDOS) to indicate object code files.

Link Editor

If a source file references library functions or functions defined in other source files, the link editor combines these functions (with main()) to create an executable file. External variable references are also resolved here.

The stages below happen in this order regardless of the operating system or compiler, and are illustrated graphically in Figure w.1.

Preprocessing is the first pass of any C compilation. It processes include-files, conditional compilation instructions and macros.

Compilation is the second pass. It takes the output of the preprocessor (the expanded source code) and generates assembler source code.

Assembly is the third stage of compilation. It takes the assembly source code and produces an assembly listing with offsets. The assembler output is stored in an object file.

Linking is the final stage of compilation. It takes one or more object files or libraries as input and combines them to produce a single (usually executable) file. In doing so, it resolves references to external symbols, assigns final addresses to procedures/functions and variables, and revises code and data to reflect new addresses (a process called relocation).

                       ANSI C translation phases
                       =========================

          +-------------------------------------------------+
          | map physical characters to source character set |
          |     replace line terminators with newlines      |
          |           decode trigraph sequences             |
          +-------------------------------------------------+
                                   |
                                   V
               +---------------------------------------+
               | join lines along trailing backslashes |
               +---------------------------------------+
                                   |
                                   V
     +-------------------------------------------------------------+
     | decompose into preprocessing tokens and whitespace/comments |
     |                      strip comments                         |
     |                      retain newlines                        |
     +-------------------------------------------------------------+        
                                   |
                                   V
          +------------------------------------------------+
          | execute preprocessing directives/invoke macros |
          |              process included files            |
          +------------------------------------------------+
                                   |
                                   V
   +----------------------------------------------------------------+
   | decode escape sequences in character constants/string literals |
   +----------------------------------------------------------------+
                                   |
                                   V
                +--------------------------------------+
                | concatenate adjacent string literals |
                +--------------------------------------+
                                   |
                                   V
              +------------------------------------------+
              | convert preprocessing tokens to C tokens |
              |       analyze and translate tokens       |
              +------------------------------------------+
                                   |
                                   V
                    +-----------------------------+
                    | resolve external references |
                    |        link libraries       |
                    |      build program image    |
                    +-----------------------------+
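Three of the boxes in the diagram (line joining, escape sequence decoding, and string literal concatenation) can be observed directly in source code. The following is a minimal sketch written in C++; the names `GREETING`, `greeting` and `escaped_length` are made up for the example.

```cpp
#include <cstring>

// Line joining: the trailing backslash splices the next line onto this
// one, so the macro definition below is a single logical line.
#define GREETING "Hello, " \
                 "world"

// Concatenation: the two adjacent string literals produced by the macro
// are merged into the single literal "Hello, world".
const char* greeting() { return GREETING; }

// Escape decoding: the two source characters backslash and n become one
// newline character, so the literal holds 2 characters before the
// terminating null.
std::size_t escaped_length() { return std::strlen("a\n"); }
```

All three effects happen before the "analyze and translate tokens" box, which is why the compiler proper only ever sees the finished literals.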

The compilation of a C++ program involves several steps:

1. Preprocessing: the preprocessor takes a C++ source code file and deals with the #includes, #defines and other preprocessor directives. The output of this step is a "pure" C++ file without preprocessor directives.
2. Compilation: the compiler takes the preprocessor's output and produces an object file from it.
3. Linking: the linker takes the object files produced by the compiler and produces either a library or an executable file.

Preprocessing

The preprocessor handles the preprocessor directives, like #include and #define. It is agnostic of the syntax of C++, which is why it must be used with care.

It works on one C++ source file at a time by replacing #include directives with the content of the respective files (which is usually just declarations), replacing macros (#define), and selecting different portions of text depending on #if, #ifdef and #ifndef directives.
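Conditional selection of text can be sketched as follows. `USE_FAST_PATH` and `step` are hypothetical names; a macro like this would typically be defined on the command line with the compiler's `-D` option rather than in the source file.

```cpp
// Only one of the two definitions of step() survives preprocessing.
// Which one depends on whether USE_FAST_PATH is defined at that point.
#ifdef USE_FAST_PATH
int step() { return 4; }   // kept when USE_FAST_PATH is defined
#else
int step() { return 1; }   // kept otherwise
#endif
```

The discarded branch never reaches the compiler, so it is not checked beyond being broken into valid preprocessing tokens.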

The preprocessor works on a stream of preprocessing tokens, and macro substitution is defined as replacing tokens with other tokens (the ## operator merges two tokens into one when the result is a valid token).
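A small sketch of token merging with `##` is shown here. The macro `MAKE_GETTER` and the variables are invented for illustration.

```cpp
// MAKE_GETTER(width) pastes the tokens get_ and width into the single
// identifier get_width, then emits a function definition that uses it.
#define MAKE_GETTER(name) int get_##name() { return name; }

const int width = 3;
const int height = 5;

MAKE_GETTER(width)    // expands to: int get_width() { return width; }
MAKE_GETTER(height)   // expands to: int get_height() { return height; }
```

The paste works here because `get_width` is a valid identifier; pasting tokens that do not form a single valid token is an error.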

After all this it produces a single output that is a stream of tokens resulting from the transformations described above. It also adds some special markers that tell the compiler where each line came from so that it can use those to produce sensible error messages.

Some errors can be produced at this stage with clever use of the #if and #error directives.
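One common pattern, sketched here under the assumption that the code needs an `int` of at least 32 bits, is to test a limit macro with #if and stop the build with #error when the requirement is not met. The function `big_value` is hypothetical.

```cpp
#include <climits>

// Stop at the preprocessing stage on platforms where int is too narrow,
// instead of failing in a less obvious way later on.
#if INT_MAX < 2147483647
#error "This code assumes int has at least 32 bits"
#endif

// Safe only because the check above guarantees the value fits in an int.
int big_value() { return 2000000000; }
```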

Compilation

The compilation step is performed on each output of the preprocessor. It involves parsing the C++ source code (now without any preprocessor directives) and producing an object file. This object file contains the compiled code (in binary form) of the symbols defined in the input. Symbols in object files are referred to by name.

Object files can refer to symbols that are not defined. This is the case when you use a declaration, and don't provide a definition for it. The compiler doesn't mind this, and will happily produce the object file as long as the source code is well-formed.
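The difference between a declaration and a definition can be sketched in a single file. The names `add` and `add_three` are made up, and the definition is placed in the same file only to keep the example self-contained.

```cpp
// Declaration only: enough for the compiler to type-check calls to add.
// If this file ended after add_three, its object file would list add as
// an undefined symbol for the linker to resolve.
int add(int a, int b);

// Compiles using nothing but the declaration above.
int add_three(int a, int b, int c) {
    return add(add(a, b), c);
}

// Definition. It could just as well live in another source file that is
// compiled separately and joined to this one at link time.
int add(int a, int b) { return a + b; }
```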

Compilers usually let you stop compilation at this point. This is very useful because with it you can compile each source code file separately. The advantage this provides is that you don't need to recompile everything if you only change a single file.

The resulting object files can be collected into special archives called static libraries, for easier reuse later on.

It's at this stage that the "regular" compiler errors, like syntax errors or failed overload resolution errors, are reported.

Linking

The linker produces the final compilation output from the object files the compiler produced. This output can be either an executable or a shared (or dynamic) library. Despite the similar name, shared libraries have little in common with the static libraries mentioned earlier.

It links all the object files by replacing the references to undefined symbols contained within them with the correct addresses. Each of these symbols can be defined in other object files or in libraries. If they are defined in libraries other than the standard library, you need to tell the linker about them.

At this stage the most common errors are missing definitions or duplicate definitions. The former means that either the definitions don't exist (i.e. they are not written), or that the object files or libraries where they reside were not given to the linker. The latter is obvious: the same symbol was defined in two different object files or libraries.
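A usual way to avoid both kinds of error is to declare a symbol wherever it is used and define it exactly once. The sketch below puts both parts in one file for brevity; in a real project the declarations would typically sit in a shared header and the definitions in a single source file. The names `counter` and `next_id` are hypothetical.

```cpp
// Declarations: these may be repeated in every file that needs them.
extern int counter;   // says counter exists somewhere; reserves no storage
int next_id();

// Definitions: these must appear once across all linked object files.
// Repeating them in a second source file would give a duplicate
// definition error; omitting them while still calling next_id() would
// give a missing definition error.
int counter = 0;
int next_id() { return ++counter; }
```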
