About programming

What is a programming language?

A computer cannot directly execute commands in natural languages, such as English or Spanish. In fact, inside the computer's memory and processor (CPU), all instructions are represented as sequences of bits, i.e., values that can be either 0 or 1 (or, electrically, low voltage vs. high voltage). This representation is called binary code or machine code, and it is specific to each processor family.
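As a small illustration of the idea that everything is ultimately stored as bits, the following Python sketch prints the bit patterns of a number and of a character. The helper name `to_bits` is made up for this example. Real machine code also encodes instructions (not only data) in a processor-specific way, which this sketch does not attempt to show.

```python
def to_bits(value: int, width: int = 8) -> str:
    """Return the binary digits of a non-negative integer, padded to `width`."""
    return format(value, f"0{width}b")


# The number 5 stored in one byte (8 bits).
print(to_bits(5))          # 00000101

# The letter "A" is stored as the number 65 (its ASCII/Unicode code point).
print(ord("A"))            # 65
print(to_bits(ord("A")))   # 01000001
```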

Tip

The instructions "understood" by the CPU describe very simple tasks, such as "read value A from memory location X" or "add values B and C and store the result in A". A computer program combines sequences of these very simple instructions to perform complex tasks.

Although it is possible to write instructions for the computer directly in machine code, this task would be extremely difficult and error-prone, and unattainable for people who value their own sanity. Instead, we use programming languages to describe commands using a grammar that is similar to that of a natural language (but not as flexible):

first, we must create a plain text file, called the source code, containing a program written following the grammar and conventions of a specific programming language that we choose (there are many!);
then, we use a special computer program (already in binary form), called a translator, that reads the source file as input, and converts it into machine code that can be directly executed by the CPU (Fig. 1).

Fig. 1 - Source code to binary code translation.

Since the translator itself is a computer program and not a human:

it cannot guess what we mean if we make syntax errors (grammar mistakes) — thus, we must respect the language rules very strictly;
it simply converts the code that we wrote, being unable to guess our intentions — thus, if we make logical errors in our program, there's nothing the translator can do for us;
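The difference between the two kinds of mistakes listed above can be sketched in Python. This is only an illustration: the variable `broken_source` and the function `average_of_three` are invented for the example, and Python's built-in `compile` stands in for "the translator".

```python
# A syntax error: the closing parenthesis is missing, so the translator
# rejects the program before running any of it.
broken_source = "total = (1 + 2"
try:
    compile(broken_source, "<example>", "exec")
    syntax_ok = True
except SyntaxError:
    syntax_ok = False
print(syntax_ok)  # False


# A logical error: the grammar is fine, so the translator accepts it,
# but the result is wrong because we divide by the wrong number.
def average_of_three(a, b, c):
    return (a + b + c) / 2  # bug: should divide by 3


print(average_of_three(3, 3, 3))  # 4.5, although the intended answer is 3.0
```

In the first case the translator tells us something is wrong; in the second it stays silent, and only testing the program reveals the mistake.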

Note

Because programming languages, unlike natural languages, are very precise and unambiguous (there is only one possible interpretation for each command), they are a type of formal language. Other examples of formal languages include mathematical notation and propositional logic.

Types of translators

There are two main types of translators:

interpreters are programs that execute source code on the fly, i.e., every time you run the interpreter, the source code is processed from scratch and each line is validated and executed according to the rules of the language, one by one (Fig. 2-a);
compilers are programs that convert source code to an executable form, i.e., they read the source code once and generate a new file, called an executable file, that contains binary instructions in a format that can be loaded by the operating system (Fig. 2-b STEP 1); in order to run the program, the operating system loads the executable file into memory, and the CPU executes the binary code directly (Fig. 2-b STEP 2); at this point, the compiler is no longer necessary.

Fig. 2 - (a) interpreters process and execute source code directly; (b) compilers generate an optimized binary format that can be efficiently executed by the CPU.
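To make the interpreter strategy more concrete, here is a minimal sketch of an interpreter written in Python. The three-command language it understands (`set`, `add`, `show`) and the function name `interpret` are invented for this example. Real interpreters are far more elaborate, but the overall loop of reading a line, validating it and executing it is the same idea.

```python
def interpret(source: str) -> list[str]:
    """Run a program written in a made-up three-command language.

    Supported commands (all hypothetical, invented for this sketch):
        set NAME NUMBER   store NUMBER in variable NAME
        add NAME NUMBER   add NUMBER to variable NAME
        show NAME         output the current value of NAME
    """
    variables: dict[str, int] = {}
    output: list[str] = []

    for line_number, line in enumerate(source.splitlines(), start=1):
        words = line.split()
        if not words:
            continue  # skip blank lines

        command, arguments = words[0], words[1:]
        if command == "set" and len(arguments) == 2:
            variables[arguments[0]] = int(arguments[1])
        elif command == "add" and len(arguments) == 2:
            variables[arguments[0]] += int(arguments[1])
        elif command == "show" and len(arguments) == 1:
            output.append(str(variables[arguments[0]]))
        else:
            # The line does not follow the grammar: stop with a syntax error.
            raise SyntaxError(f"line {line_number}: cannot understand {line!r}")

    return output


program = """
set x 2
add x 40
show x
"""
print(interpret(program))  # ['42']
```

Every call to `interpret` starts again from the source text, which is why changing the program and re-running it is immediate, and also why the same parsing work is repeated on every run.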

But why use different strategies? Like most things in Computer Science, the choice is a trade-off:

Since interpreters directly process the source code, it is simple to change the program and immediately test it, or to describe complex operations using simple commands; however, interpreting the source code line by line is usually much slower than executing binary code, often by an order of magnitude or more;
Since compilers generate binary instructions that are optimized to be run by the CPU, compiled programs are very fast and efficient; however, every time the source code changes, a new compilation is necessary, which can be a slow process for large and/or complex programs;

Language implementers choose the best strategy based on the characteristics and goals of the language. Typically, languages with an easier-to-use grammar and/or simple instructions that perform very complex operations are interpreted, while languages that have more detailed instructions and allow the programmer to control more aspects of the program execution are compiled.

Table 1 below illustrates the main advantages and disadvantages of each strategy:

Criteria	Interpreted languages	Compiled languages
Execution speed	✖ (Very) Slow ⇒ The source code must be processed from scratch, line by line, every time.	✔ Fast ⇒ After compiling the code once (which can be slow), the executable file is in the optimized binary format accepted by the CPU.
Dynamic behaviour (changing aspects of the program while executing)	✔ Simple ⇒ Since the state of the program is created dynamically as the code is interpreted line by line, it is also simpler to dynamically change it.	✖ Difficult ⇒ Since the state of the program is represented as a memory layout during compilation, any dynamic behaviour must be designed beforehand by the programmer.
Expressivity (ability to express complex operations using simple instructions)	✔ High(er) ⇒ Complex operations may be implemented internally in the interpreter, which then translates simple commands into many instructions on the fly.	✖ Low(er) ⇒ Compiled languages can only be fast if they have commands closer to the machine instructions; for this reason, expressing complex operations usually requires many programming language commands.
Portability (run the program on many different platforms)	✔ Simple ⇒ Since the source code is always processed directly, to support the language in a new environment (e.g., another processor architecture and/or operating system), it is enough to create an interpreter that runs in the target environment.	✖ Difficult ⇒ Since it is necessary to generate machine code that is compatible with the new environment, the new compiler must handle all the specific machine instructions, conventions and behaviour of the new environment, which is a much more complex task.
Optimization (taking advantage of specific aspects of the environment to make the program faster)	✖ Difficult ⇒ Since the source code is read directly line by line, the execution of the program is necessarily slower, at least on the first run; however, modern interpreters often use techniques that generate optimized machine code on the fly to speed up the subsequent executions of different parts of the program (this process is very complex, though).	✔ Simple(r) ⇒ Since the compiler generates executable files with the machine instructions that will be directly executed by the CPU, it can take great advantage of the fastest execution strategies available for each processor and/or operating system.

Table 1: Interpreted vs compiled languages.

Warning

The table above is a generalization. Recent technological advances in both strategies have greatly reduced their shortcomings, as briefly explained in "Other strategies" below.

Tip

A given programming language is not necessarily interpreted or compiled. The programming language itself is just a specification of a formal and precise way to write computer programs. There can be many implementations of the same language, and each of them could theoretically be either in the form of an interpreter or of a compiler. However, most languages are typically implemented using only one of these techniques.

Examples of (typically) interpreted programming languages include JavaScript, PHP, Ruby, Python and Prolog.
Examples of (typically) compiled programming languages include C, C++, Fortran, Rust, Pascal and Zig.

Other strategies [extra]

Besides "pure" compilers and interpreters (i.e., compilers that generate machine code or interpreters that read the source code as-is), there are other "hybrid" strategies that can be used to compensate for the shortcomings of each approach. Some common examples:

Intermediate representation: in this approach, the compiler does not generate machine code directly, but instead generates an intermediate binary code that is much easier and faster to execute and/or convert to machine code on the fly than the original source code. The intermediate code is designed to be more easily portable to other environments, i.e., it is simpler to write a program that interprets the intermediate code than to write a new compiler for every new system that must be supported. Examples of programming languages that use this approach are:
- Java, whose compiler generates bytecode to be processed by the Java Virtual Machine;
- C#, whose compiler generates code in the Common Intermediate Language of the .NET platform;
- Python, whose reference interpreter (CPython) compiles each source file to bytecode when it is first loaded, and caches the bytecode of imported modules so that later runs can skip this step;
Just in Time Compilation: also known by the acronym "JIT", this technique, used in many programming language (and intermediate code) interpreters, consists of generating an optimized machine code representation of small parts of the program when a part is first processed (or once it has run often enough to be worth optimizing); later executions of that part of the program are not processed again, but instead use the previously generated optimized version, making the overall execution much faster in most scenarios;
Transpiling: a transpiler (or source to source translator) is a program that processes code written in a given source programming language and generates code in a different target programming language. Since it is usually easier to generate code in some existing programming language (instead of machine code or bytecode), this technique is useful when we want to test a new programming language idea without writing a complete new compiler, but instead using some existing compiler for a different language, and/or we want to make sure that our programs are compatible with programs written in the pre-existing language. Examples of programming languages that use transpilers are:
- TypeScript, whose compiler generates JavaScript code;
- C++, whose original compiler (cfront) generated C code (current C++ compilers generate machine code directly);
- C++ Syntax 2 (Cpp2), whose cppfront compiler generates C++ code;
- Nim, whose compiler generates C code;
- Dart, whose compiler can generate JavaScript.
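Returning to the intermediate representation strategy described above, Python itself offers a convenient way to observe it. The sketch below uses the built-in `compile` and `eval` functions. The expression and the variable names `price` and `quantity` are made up for the example, and the exact bytes produced differ between Python versions, so they are not shown.

```python
# Step 1: translate source text into a code object holding bytecode.
source = "price * quantity"
code = compile(source, "<example>", "eval")

# The bytecode is a sequence of bytes, not machine code for a real CPU.
print(type(code.co_code))  # <class 'bytes'>

# Step 2: the Python virtual machine executes the same code object
# repeatedly without reading the source text again.
print(eval(code, {"price": 3, "quantity": 4}))   # 12
print(eval(code, {"price": 10, "quantity": 2}))  # 20
```

The standard `dis` module can print a human-readable listing of the bytecode instructions (`dis.dis(code)`), which is a handy way to explore what the intermediate code looks like.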

Domain-specific languages [extra]

A domain-specific language, often referred to by the acronym DSL, is a special type of formal language, often containing some features found in programming languages, that is used in very restricted contexts, as opposed to general-purpose programming languages, which can be used in many different contexts. DSLs are typically used to describe and/or manipulate specific types of data or to control specific systems. They are easier to understand through examples:

The Structured Query Language (SQL) is a language used to manipulate and retrieve data from certain types of databases, allowing users to perform complex logic operations and groupings on the data, among other features;
The HyperText Markup Language (HTML) is used to describe the structure and contents of web pages;
The JavaScript Object Notation (JSON) is used to describe data in a portable format that can be easily and quickly processed by many different systems (originally, JavaScript);
The bison syntax is the notation used to describe programming language grammars and generate syntactic analyzers using the program bison.

DSLs are usually interpreted, and their processing happens inside some bigger context, such as within a database management system or a web browser.
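As a sketch of a DSL being processed "inside some bigger context", the example below hands SQL text to the SQLite database engine through Python's standard `sqlite3` module. The `users` table and its rows are invented for the illustration. The point is only that the general-purpose language (Python) passes the SQL commands along as text, and the database engine is the one that interprets them.

```python
import sqlite3

# An in-memory database; the table and its contents are invented for this sketch.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE users (name TEXT, city TEXT)")
connection.executemany(
    "INSERT INTO users (name, city) VALUES (?, ?)",
    [("ana", "Lisbon"), ("bob", "Madrid"), ("carla", "Lisbon")],
)

# "Get a user by name": the SQL text is handed to the database engine,
# which interprets it and returns the matching row.
row = connection.execute(
    "SELECT name, city FROM users WHERE name = ?", ("bob",)
).fetchone()
print(row)  # ('bob', 'Madrid')

# "Count the users of a given city".
(count,) = connection.execute(
    "SELECT COUNT(*) FROM users WHERE city = ?", ("Lisbon",)
).fetchone()
print(count)  # 2

connection.close()
```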

Abstraction levels

The main reason why there are so many programming languages available is that each language is designed with a certain purpose in mind. Although most languages are general purpose (which means that they can be used to create applications in several different contexts), the design choices made for each language reflect the type of applications for which it should ideally be used.

That said, one of the main characteristics that define a language is its level of abstraction. Although it is a blurry notion, generally speaking, the level of abstraction indicates whether the commands available in the language are closer to machine code (a low level of abstraction) or closer to natural (human) language (a high level of abstraction).

Table 2 below illustrates this concept by giving examples of languages at different levels of abstraction and what kind of commands they typically provide:

Level	Language	Type of commands
HIGH	SQL	"Get a user by name" "Count comments of a given post"
	Java	"Define a type called `Player`, which is a subtype of `GameEntity`" "Create a list of objects of type `Monster`" "Create a variable that holds text" "Set the name of the monster"
	C	"Create a memory location that holds a sequence of characters" "Given a sequence of integers, get element at position `5`" "Do: Make `index` be `0`; while `index` is less than the length of the sequence: show the element at position `index`; increment `index`;"
	Assembly Language (a textual representation of machine code)	"load a 4-byte integer value from memory address `0xFFFFA` into register `R1`;" "add integer value at register `R2` to register `R7` and put the result in register `R1`;" "jump to address `0xABABAB`;"
LOW	Binary code (machine code)	The same as Assembly Language, but with each parameter encoded in binary form (sequences of `0` and `1`)

Table 2: Examples of different levels of abstraction.
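The contrast between abstraction levels can also be seen within a single language. The Python sketch below computes the same sum twice: first in a step-by-step style similar to the C-like description in the table, then with one higher-level command. The list of numbers is arbitrary and was chosen only for the illustration.

```python
numbers = [4, 8, 15, 16, 23, 42]

# Lower-level style: manage the index and the running total by hand,
# much like the C-like description in the table.
total = 0
index = 0
while index < len(numbers):
    total = total + numbers[index]
    index = index + 1
print(total)  # 108

# Higher-level style: say what we want and let the language do the steps.
print(sum(numbers))  # 108
```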

Tip

Although "low-level" languages such as C or C++ are based on very detailed commands (i.e., commands that allow the programmer to control the hardware very precisely), we will see during this tutorial that they also provide mechanisms that allow programmers to group sequences of instructions into single named blocks, called functions (or procedures), or to define their own custom data types that can be manipulated more conveniently. Using such mechanisms, new levels of abstraction can be created within the source code to allow expressing more complex operations. This process is the most fundamental aspect of writing good programs.
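The idea of building new levels of abstraction with functions and custom data types can be sketched as follows. The example is written in Python for brevity (the same idea applies to functions and user-defined types in C and C++), and the names `Monster`, `apply_damage` and `is_defeated` are invented for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Monster:
    """A custom data type grouping the values that describe one monster."""
    name: str
    health: int


def apply_damage(monster: Monster, amount: int) -> None:
    """Reduce the monster's health, never letting it drop below zero."""
    monster.health = max(0, monster.health - amount)


def is_defeated(monster: Monster) -> bool:
    """Tell whether the monster has no health left."""
    return monster.health == 0


# The code below reads almost like a description of the game rules;
# the detailed arithmetic is hidden inside the named functions.
goblin = Monster(name="Goblin", health=30)
apply_damage(goblin, 12)
apply_damage(goblin, 25)
print(goblin.health)        # 0
print(is_defeated(goblin))  # True
```

Once `apply_damage` exists, the rest of the program no longer needs to repeat (or even know about) the rule that health cannot become negative.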