System calls: how programs talk to the Linux kernel

In the previous post we took a look at the kernel, what it is and what it’s responsible for. It is a single file, and among other things it “gives us APIs to interact with the hardware”.

We also built a simple distribution that consisted of the kernel and our init program written in Go.

But if the kernel makes hardware interactions possible for programs, how was init able to print on the screen? The screen is a hardware device too.

And where is that API that makes these interactions possible?

Let’s find out.

Observing our init program

First, let’s install a tool called strace that lets us observe what programs are doing:

~$ sudo apt -y install strace

Now let’s enter the directory where we built our Go program and run our init from the previous post, but this time with strace:

~$ cd linux-inside-out/init
~/linux-inside-out/init$ strace -f -e trace=execve,getpid,write ./init

Note: the -f flag instructs strace to follow any other process that our program might start. The -e trace=execve,getpid,write part filters the output to only show these three system calls. Without this filter, strace output can be very verbose and overwhelming. We will focus on these for now.

You will see something like this:

execve("./init", ["./init"], 0x7ffdd61f6f88 /* 25 vars */) = 0
strace: Process 25165 attached
strace: Process 25166 attached
strace: Process 25167 attached
strace: Process 25168 attached
[pid 25164] write(1, "Hello from Go init!\n", 20Hello from Go init!
) = 20
[pid 25164] getpid()                    = 25164
[pid 25164] write(1, "PID: 25164\n", 11PID: 25164
) = 11
[pid 25164] write(1, "tick 0\n", 7tick 0
)     = 7
[pid 25165] getpid()                    = 25164
[pid 25164] write(1, "tick 1\n", 7)     = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
[pid 25164] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=25164, si_uid=1000} ---
[pid 25164] write(1, "tick 1\n", 7tick 1
)     = 7
[pid 25164] write(1, "tick 2\n", 7tick 2
)     = 7
[pid 25165] getpid( <unfinished ...>
[pid 25164] write(1, "tick 3\n", 7 <unfinished ...>
[pid 25165] <... getpid resumed>)       = 25164
tick 3
[pid 25164] <... write resumed>)        = 7
[pid 25164] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=25164, si_uid=1000} ---
[pid 25165] getpid( <unfinished ...>
[pid 25164] write(1, "tick 4\n", 7 <unfinished ...>
[pid 25165] <... getpid resumed>)       = 25164
tick 4
[pid 25164] <... write resumed>)        = 7
[pid 25164] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=25164, si_uid=1000} ---
^Cstrace: Process 25164 detached
strace: Process 25167 detached
strace: Process 25168 detached
strace: Process 25166 detached
strace: Process 25165 detached

Press Ctrl + C to exit from the process.

Let’s go step by step through what we see here:

execve("./init", ["./init"], 0x7ffdd61f6f88 /* 25 vars */) = 0

The ./init might be familiar. This is the name of our program. With execve we ask the kernel to execute a program.

[pid 25164] write(1, "Hello from Go init!\n", 20Hello from Go init!
) = 20

This is what we print at the beginning of our program (fmt.Println("Hello from Go init!")).

[pid 25164] getpid()                    = 25164
[pid 25164] write(1, "PID: 25164\n", 11PID: 25164
) = 11

We ask the kernel for the process ID of our running program, which is 25164 (getpid() = 25164). Then we print it (write(1, "PID: 25164\n", 11) = 11). That is exactly what we did in our Go code: fmt.Println("PID:", os.Getpid()) // printing the PID (process ID).

Then, in a loop, we print tick [number] and wait two seconds:

[pid 25164] write(1, "tick 1\n", 7tick 1
)     = 7
...
[pid 25164] write(1, "tick 2\n", 7tick 2
)     = 7
[pid 25164] write(1, "tick 3\n", 7 <unfinished ...>
...

fmt.Println("tick", i)
time.Sleep(2 * time.Second)

So what are the execve, getpid, and write that we see here?

System calls

These are system calls, the kernel’s API for userspace programs.

When your Go program calls fmt.Println(), somewhere deep in the Go runtime, it calls write() to ask the kernel to output text.
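A minimal sketch of that last step, assuming a Linux system: Go’s standard syscall package has a thin Write wrapper around the write system call, so we can skip fmt for the first line and ask the kernel directly. The helper name writeStdout is made up for this example. File descriptor 1 is standard output, the same 1 that appears as the first argument in the strace output.

```go
package main

import (
	"fmt"
	"syscall"
)

// writeStdout asks the kernel to write msg to file descriptor 1 (standard
// output) and returns how many bytes the kernel reports as written.
func writeStdout(msg string) (int, error) {
	return syscall.Write(1, []byte(msg))
}

func main() {
	n, err := writeStdout("Hello from Go init!\n")
	if err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Println("bytes written:", n)
}
```

If you run it under strace -e trace=write, you should see a write(1, "Hello from Go init!\n", 20) line, followed by a second write for the fmt.Println call.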

Every time a program needs to interact with the outside world, whether it’s a hardware device, another program, a service on the network, or data on the disk, it makes a system call.

Your program runs in a sandboxed environment. On its own, it can only access the memory the kernel assigned to it and use the CPU for control flow and arithmetic.

The kernel owns all of the resources in the system. If a program needs access to any of them, it needs to ask the kernel.

This design gives us several benefits:

Security: The kernel can check permissions during system calls
Isolation: A misbehaving program can’t crash other programs or the computer
Observability: Because everything goes through system calls, you can trace and audit what any program is doing
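One small way to watch the kernel check a request, sketched in Go under the assumption of a Linux system with /dev/null present: we open a file for reading only and then try to write to it. The kernel should refuse the write with EBADF because the file descriptor was not opened for writing. The helper name writeRejected is invented for this example.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// writeRejected opens path read-only, tries to write to it anyway, and
// reports whether the kernel refused the write with EBADF. The second
// return value is the error from opening or writing, if any.
func writeRejected(path string) (bool, error) {
	f, err := os.Open(path) // os.Open requests read-only access
	if err != nil {
		return false, err
	}
	defer f.Close()

	_, err = f.Write([]byte("hello"))
	return errors.Is(err, syscall.EBADF), err
}

func main() {
	rejected, err := writeRejected("/dev/null")
	fmt.Println("error:", err)
	fmt.Println("rejected by the kernel:", rejected)
}
```

The program never gets a chance to bypass the check: the only way to write is to ask the kernel, and the kernel says no.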

All programs use the same system call interface

Now let’s trace another program: echo, a simple utility written in C that you might be familiar with.

~$ strace echo "Hello from Go init!"

execve("/usr/bin/echo", ["echo", "Hello from Go init!"], 0x7ffea82ea1d8 /* 25 vars */) = 0
...
write(1, "Hello from Go init!\n", 20Hello from Go init!
)   = 20
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++

write(1, "Hello from Go init!\n", 20Hello from Go init!
)   = 20

This line is exactly the same as the one we saw above:

[pid 25164] write(1, "Hello from Go init!\n", 20Hello from Go init!
) = 20

Our Go program and the C program make the same system calls. The language doesn’t matter: they both talk to the kernel using the same API.

Every tool and command that you see in a book or a tutorial is a userspace program that performs a set of system calls to ask the kernel to do things.
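As a sketch of the same idea from the other side, a Go program can start echo itself. This assumes an echo binary is available on the PATH, and runEcho is a name invented for this example. Go’s os/exec package creates a child process, and on Linux that child makes an execve call to become echo.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runEcho starts the echo utility as a child process and returns what it
// printed to its standard output.
func runEcho(msg string) (string, error) {
	out, err := exec.Command("echo", msg).Output()
	return string(out), err
}

func main() {
	out, err := runEcho("Hello from Go init!")
	if err != nil {
		fmt.Println("could not run echo:", err)
		return
	}
	fmt.Print(out)
}
```

Tracing this program with strace -f -e trace=execve should show two execve lines: one for our Go program and one for echo.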

How system calls actually work

Now we know that system calls are the kernel’s API for userspace programs.

Web APIs use HTTP as their communication channel.
But what is the communication channel for system calls? How do programs perform these calls?

The answer lies in the hardware. Modern CPUs have different execution modes:

User mode (restricted): Where your programs run. They have limited access to memory and some CPU instructions. It is a sandboxed mode.
Kernel mode (privileged): Where the kernel runs. It has total access to all memory and all hardware instructions.

When a program wants to perform an operation that needs elevated privileges, it puts the system call number and arguments into CPU registers (small, fast storage locations built into the processor) and executes a dedicated CPU instruction (on the x86_64 architecture, this is SYSCALL).

This instruction causes the CPU to instantly switch from user mode to kernel mode. The program stops here and control is handed over to the kernel.

The kernel looks at the registers, sees what the program wants to perform, checks if it is allowed to, and then performs the action.

The kernel puts the result back into a register and switches the CPU back to user mode. The program resumes its normal operation.

Note: the names of the modes and the syscall instruction vary by CPU architecture, but the concept is pretty much the same.

Your Program (User Mode)            Linux Kernel (Kernel Mode)
────────────────────────            ──────────────────────────

     │
     ├─ Put syscall number in register
     ├─ Put arguments in registers
     │
     └─ Execute: SYSCALL ───────────────────┐
                                            │
                                       [CPU MODE SWITCH]
                                            │
                                            ▼
                                     Kernel receives request
                                            │
                                            ├─ Check permissions
                                            ├─ Perform operation
                                            │
                                            └─ Put result in register
                                            │
                                       [CPU MODE SWITCH]
                                            │
result ◄────────────────────────────────────┘
     │
continue program...

You can even find this instruction in the source code of Go if you start tracking down what happens when you print something: src/internal/runtime/syscall/linux/asm_linux_amd64.s.
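You do not have to read assembly to see the shape of a system call. As a rough sketch, assuming Linux, Go’s syscall package exposes a low-level Syscall function that takes the system call number and up to three arguments, which is what ends up in the registers. The helpers rawWrite and rawGetpid are names invented for this example; everyday programs should use fmt or os instead.

```go
package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

// rawWrite invokes the write system call by number. The three arguments are
// the file descriptor, the address of the bytes, and the number of bytes.
func rawWrite(fd int, msg []byte) (int, error) {
	if len(msg) == 0 {
		return 0, nil
	}
	r1, _, errno := syscall.Syscall(
		syscall.SYS_WRITE,
		uintptr(fd),
		uintptr(unsafe.Pointer(&msg[0])),
		uintptr(len(msg)),
	)
	if errno != 0 {
		return 0, errno
	}
	return int(r1), nil // the kernel's result: bytes written
}

// rawGetpid invokes getpid by number. It takes no arguments, so the three
// argument slots are left at zero.
func rawGetpid() (int, error) {
	r1, _, errno := syscall.Syscall(syscall.SYS_GETPID, 0, 0, 0)
	if errno != 0 {
		return 0, errno
	}
	return int(r1), nil
}

func main() {
	n, err := rawWrite(1, []byte("tick 0\n"))
	fmt.Println("bytes written:", n, "error:", err)

	pid, err := rawGetpid()
	fmt.Println("PID:", pid, "error:", err)
}
```

The value that comes back in r1 is the result the kernel placed in a register before switching back to user mode.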

Troubleshooting with strace

Since all resource access goes through system calls, strace (and similar tracing tools) gives you a troubleshooting superpower.

It works on any program, written in any language, without access to source code. You’re observing the kernel’s API.

You can use it to find:

Files a web application tries to load but does not find

[pid 764640] stat("/app/assets/assets/app-5pEwFF_.js", 0x7ffdfef9af00) = -1 ENOENT (No such file or directory)
[pid 764640] stat("/app/assets/assets/bootstrap-xCO4u8H.js", 0x7ffdfef9af00) = -1 ENOENT (No such file or directory)
[pid 764640] stat("/app/assets/assets/bootstrap.js", 0x7ffdfef9af00) = -1 ENOENT (No such file or directory)
[pid 764640] stat("/app/assets/assets/styles/app-17aMgWE.css", 0x7ffdfef9af00) = -1 ENOENT (No such file or directory)
[pid 764640] stat("/app/assets/assets/styles/app.css", 0x7ffdfef9af00) = -1 ENOENT (No such file or directory)

What your application sends to and receives from other services

[pid 764640] sendto(7, "l\0\0\0\3SELECT sess_data, sess_lifetime FROM sessions WHERE sess_id = '3666497cf012d41afe736f4eaf869e71' FOR UPDATE", 112, MSG_DONTWAIT, NULL, 0) = 112
[pid 764640] poll([{fd=7, events=POLLIN|POLLERR|POLLHUP}], 1, 86400000) = 1 ([{fd=7, revents=POLLIN}])
[pid 764640] recvfrom(7, "\1\0\0\1\2A\0\0\2\3def\twebapp\10sessions\10sessions\tsess_data\tsess_data\f?\0\377\377\377\377\374\221\20\0\0\0I\0\0\3\3def\tangolklub\10sessions\10sessions\rsess_lifetime\rsess_lifetime\f?\0\n\0\0\0\3)P\0\0\0\5\0\0\4\376\0\0\3\0\5\0\0\5\376\0\0\3\0", 32768, MSG_DONTWAIT, NULL, NULL) = 169

Why an application cannot connect to an API

connect(4, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("10.255.255.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
getsockname(4, {sa_family=AF_INET, sin_port=htons(59202), sin_addr=inet_addr("65.108.213.19")}, [128 => 16]) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fb3514ccdf0}, NULL, 8) = 0
poll([{fd=4, events=POLLOUT}, {fd=3, events=POLLIN}], 2, 1000) = 0 (Timeout)
rt_sigaction(SIGPIPE, NULL, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fb3514ccdf0}, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fb3514ccdf0}, NULL, 8) = 0
poll([{fd=4, events=POLLPRI|POLLOUT|POLLWRNORM}], 1, 0) = 0 (Timeout)
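The errors in these traces are the same values the program itself gets back from the kernel. A small Go sketch, assuming Linux: os.Stat asks the kernel about a path, and the ENOENT from the first trace comes back as an error we can test for. The helper name isMissing and the example path are invented for this illustration. Depending on the Go version and architecture, strace may show a variant of stat rather than stat itself.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// isMissing reports whether the kernel answered the request for path with
// ENOENT ("No such file or directory").
func isMissing(path string) bool {
	_, err := os.Stat(path)
	return errors.Is(err, syscall.ENOENT)
}

func main() {
	// Hypothetical path, chosen only because it should not exist.
	fmt.Println("missing asset:", isMissing("/no/such/dir/app.js"))
	// The root directory always exists.
	fmt.Println("missing root: ", isMissing("/"))
}
```

When an application swallows such an error instead of logging it, strace still shows it, which is why tracing helps even without access to the source code.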

Note: strace slows down the programs it traces, and its output can expose sensitive information like secrets and personal data. Always take this into consideration when you use it.

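The output of strace can be overwhelming at first. Start with simple commands that you are familiar with, like ls and cp, and focus on common system calls such as open and the ones we saw in this post: execve, write, close, stat, connect, sendto, and recvfrom.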

They might have variants like openat for open, but the concept is the same.
You can look up any system call with man 2 <name> (you may need to install manpages-dev first):

~$ sudo apt -y install manpages-dev
~$ man 2 write

What we learned

We expanded our mental model: the kernel is a program that has special privileges
The kernel runs in kernel mode, regular programs run in a sandboxed environment called user mode
System calls are the kernel’s API for userspace programs
When a program interacts with the outside world, it does so via system calls
You can trace these system calls with tools like strace