(introduction-to-ebpf)=
# Introduction to eBPF

[eBPF](https://ebpf.io/) is a powerful tool for server and system administrators, often described as a lightweight, sandboxed virtual machine within the kernel. It is commonly used for performance monitoring, security, and network traffic processing without the need to modify or rebuild the kernel.

Since it runs in kernel space, it avoids the context switches between kernel and user-space that user-space solutions incur, which makes it very fast by comparison. It also has access to the kernel data structures, providing more capabilities than tools limited to the interfaces exposed to user-space.

BPF, which stands for "Berkeley Packet Filter", was originally designed to perform network packet filtering. Over time, it has evolved into *extended Berkeley Packet Filter* (eBPF), which adds many capabilities, including more registers, support for 64-bit registers, data stores (Maps), and more. As a result, eBPF has been extended beyond the kernel networking subsystem: it not only enhances the networking experience, but also provides tracing, profiling, observability, and security features. The terms eBPF and BPF are now used interchangeably, and both refer to eBPF.

## How eBPF works

User-space applications can load eBPF programs into the kernel as eBPF bytecode. Although you could write eBPF programs in bytecode, there are several tools which provide abstraction layers on top of eBPF so you do not need to write bytecode manually. These tools generate the bytecode, which is in turn loaded into the kernel.

Once an eBPF program is loaded into the kernel, it is verified by the kernel before it can run.
These checks include:

* verifying that the eBPF program halts and will never get stuck in a loop,
* verifying that the program will not crash by checking the registers and stack state validity throughout the program, and
* ensuring the process loading the eBPF program has all the capabilities required by the eBPF program to run.

After the verification step, the bytecode is Just-In-Time (JIT) compiled into machine code to optimize the program's execution.

## eBPF in Ubuntu Server

Since Ubuntu 24.04, `bpftrace` and `bpfcc-tools` (the BPF Compiler Collection or BCC) are available in every Ubuntu Server installation by default as part of our efforts to [enhance the application developer's and sysadmin's experience in Ubuntu](https://discourse.ubuntu.com/t/spec-include-performance-tooling-in-ubuntu/43134).

In Ubuntu, you can use tools like the BCC to identify bottlenecks, investigate performance degradation, trace specific function calls, and create custom monitoring tools to collect data on specific kernel or user-space processes without disrupting running services.

Both `bpftrace` and `bpfcc-tools` install sets of tools to handle these different functionalities. Apart from the `bpftrace` tool itself, you can fetch a comprehensive list of these tools with the following command:

```bash
$ dpkg -L bpftrace bpfcc-tools | grep -E '/s?bin/.*$' | xargs -n1 basename
```

Most of the tools listed above are quite well documented. Their manpages usually include good examples you can try immediately. The tools ending in `.bt` are installed by `bpftrace` and refer to `bpftrace` scripts (they are text files, so you can read them to understand how they achieve specific tasks). The ones ending in `-bpfcc` are BCC tools (from `bpfcc`) written in Python (you can also inspect those as you would inspect the `.bt` files).

These `bpftrace` scripts often demonstrate how a complex task can be achieved in simple ways.
The `-bpfcc` variants are often a bit more advanced, providing more options and customizations. You will also find several text files describing use-case examples and the output for `bpftrace` tools in `/usr/share/doc/bpftrace/examples/` and for `bpfcc-tools` in `/usr/share/doc/bpfcc-tools/examples/`.

For instance,

```bash
# bashreadline.bt
```

will print bash commands for all running bash shells in your system. Since the information you can get via eBPF may be confidential, you need to run these tools as root. You may notice that there is also a `bashreadline-bpfcc` tool available from `bpfcc-tools`. Both of them provide similar features: `bashreadline-bpfcc` is implemented in Python with BCC, while `bashreadline.bt` is a `bpftrace` script, as described above. You will see that many of these tools have both a `-bpfcc` and a `.bt` version. Do read their manpages (and perhaps the scripts) to choose which suits you best.

## Example - Determine what commands are executed

One tool that is trivial but powerful is `execsnoop-bpfcc`, which allows you to answer common questions that should not be common, like "what other programs is this action eventually calling?". It can be used to determine if a program calls another tool too often or calls something that you would not expect.

It has become a common way to understand what happens when a maintainer script is executed in a .deb package in Ubuntu.
For this task, you'd run `execsnoop-bpfcc` with the following arguments:

* `-Uu root` - to reduce the noisy output to only things done in root context (like the package install here)
* `-T` - to include time information in the log

```bash
# In one console run:
$ sudo execsnoop-bpfcc -Uu root -T

# In another trigger what you want to watch for
$ sudo apt install --reinstall vim-nox

# Execsnoop in the first console will now report probably more than you expected:
TIME     UID   PCOMM            PID     PPID    RET ARGS
10:58:07 1000  sudo             1323101 1322857   0 /usr/bin/sudo apt install --reinstall vim-nox
10:58:10 0     apt              1323107 1323106   0 /usr/bin/apt install --reinstall vim-nox
10:58:10 0     dpkg             1323108 1323107   0 /usr/bin/dpkg --print-foreign-architectures
...
10:58:12 0     sh               1323134 1323107   0 /bin/sh -c /usr/sbin/dpkg-preconfigure --apt || true
...
10:58:13 0     tar              1323155 1323152   0 /usr/bin/tar -x -f --warning=no-timestamp
10:58:14 0     vim-nox.prerm    1323157 1323150   0 /var/lib/dpkg/info/vim-nox.prerm upgrade 2:9.1.0016-1ubuntu7.3
10:58:14 0     dpkg-deb         1323158 1323150   0 /usr/bin/dpkg-deb --fsys-tarfile /var/cache/apt/archives/vim-nox_2%3a9.1.0016-1ubuntu7.3_amd64.deb
...
10:58:14 0     update-alternat  1323171 1323163   0 /usr/bin/update-alternatives --install /usr/bin/vimdiff vimdiff /usr/bin/vim.nox 40
...
10:58:17 0     snap             1323218 1323217   0 /usr/bin/snap advise-snap --from-apt
10:58:17 1000  git              1323224 1323223   0 /usr/bin/git rev-parse --abbrev-ref HEAD
10:58:17 1000  git              1323226 1323225   0 /usr/bin/git status --porcelain
10:58:17 1000  vte-urlencode-c  1323227 1322857   0 /usr/libexec/vte-urlencode-cwd
```

## eBPF can be modified to your needs

Let us look at another practical application of eBPF. This example is meant to show another use-case with eBPF, then evolve it into a more complex one by modifying the tool.

### Example - Find out which files QEMU is loading

Let’s say you want to verify which binary files are loaded when running a particular QEMU command.
QEMU is a truly complex program and sometimes it can be hard to make the connection from a command line to the files used from `/usr/share/qemu`. This is hard enough to determine when you define the QEMU command line yourself, and becomes problematic when further layers of abstraction are involved, like libvirt or LXD, or even things on top of those like OpenStack.

While `strace` could be used, it would add additional overhead to the investigation process since `strace` uses `ptrace`, and context switching may be required. Furthermore, if you need to monitor a system to produce this answer, especially on a host running many VMs, `strace` quickly reaches its limits.

Instead, `opensnoop` could be used to trace `open()` syscalls. In this case, we use `opensnoop-bpfcc` to have more parameters to tune it to our needs. The example will use the following arguments:

* `--full-path` - show the full path for open calls using a relative path.
* `--name qemu-system-x86` - only care about files opened by QEMU. The mindful reader will wonder why this isn't `qemu-system-x86_64`, but you'd see in the unfiltered output of `opensnoop` that the process name is length limited, so only the shorter `qemu-system-x86` can be used.

```bash
# This will collect a log of files opened by QEMU
$ sudo /usr/sbin/opensnoop-bpfcc --full-path --name qemu-system-x86
#
# If you, or anyone else on this system, now runs QEMU in another console,
# this would log the files opened
#
# For example calling LXD for an ephemeral VM
$ lxc launch ubuntu-daily:n n-vm-test --ephemeral --vm
#
# Will in opensnoop deliver a barrage of files opened
PID     COMM               FD ERR PATH
1308728 qemu-system-x86    -1   2 /snap/lxd/current/zfs-2.2/lib/glibc-hwcaps/x86-64-v3/libpixman-1.so.0
1308728 qemu-system-x86    -1   2 /snap/lxd/current/zfs-2.2/lib/glibc-hwcaps/x86-64-v2/libpixman-1.so.0
1308728 qemu-system-x86    -1   2 /snap/lxd/current/zfs-2.2/lib/tls/haswell/x86_64/libpixman-1.so.0
...
1313104 qemu-system-x86    58   0 /sys/dev/block/230:16/queue/zoned
1313104 qemu-system-x86    20   0 /dev/fd/4
```

Of course the QEMU process opens plenty of things: shared libraries, config files, entries in `/{sys,dev,proc}`, and much more. But with `opensnoop-bpfcc`, we can see them all as they happen across the whole system.

### Focusing on a specific file type

Imagine you only wanted to verify which `.bin` files QEMU is loading. Of course, we could just use `grep` on the output, but this whole section is about showing eBPF examples to get you started. So here we make the simplest change: modifying the Python wrapper around the tracing eBPF code. Once you understand how to do this, you can go further in adapting the tools to your own needs by delving into the eBPF code itself, and from there create your very own eBPF solutions from scratch.

So while `opensnoop-bpfcc` as of right now has no option to filter on the file names, it could ...

```bash
$ sudo cp /usr/sbin/opensnoop-bpfcc /usr/sbin/opensnoop-bpfcc.new
$ sudo vim /usr/sbin/opensnoop-bpfcc.new
...
$ diff -Naur /usr/sbin/opensnoop-bpfcc /usr/sbin/opensnoop-bpfcc.new
--- /usr/sbin/opensnoop-bpfcc   2024-11-12 09:15:17.172939237 +0100
+++ /usr/sbin/opensnoop-bpfcc.new       2024-11-12 09:31:48.973939968 +0100
@@ -40,6 +40,7 @@
     ./opensnoop -u 1000   # only trace UID 1000
     ./opensnoop -d 10     # trace for 10 seconds only
     ./opensnoop -n main   # only print process names containing "main"
+    ./opensnoop -c path   # only print paths containing "fname"
     ./opensnoop -e        # show extended fields
     ./opensnoop -f O_WRONLY -f O_RDWR  # only print calls for writing
     ./opensnoop -F        # show full path for an open file with relative path
@@ -71,6 +72,9 @@
 parser.add_argument("-n", "--name",
     type=ArgString,
     help="only print process names containing this name")
+parser.add_argument("-c", "--contains",
+    type=ArgString,
+    help="only print paths containing this string (implies --full-path)")
 parser.add_argument("--ebpf", action="store_true",
     help=argparse.SUPPRESS)
 parser.add_argument("-e", "--extended_fields", action="store_true",
@@ -83,6 +87,8 @@
     help="size of the perf ring buffer "
         "(must be a power of two number of pages and defaults to 64)")
 args = parser.parse_args()
+if args.contains is not None:
+    args.full_path = True
 debug = 0
 if args.duration:
     args.duration = timedelta(seconds=int(args.duration))
@@ -440,6 +446,12 @@
         if args.name and bytes(args.name) not in event.comm:
             skip = True
 
+        paths = entries[event.id]
+        paths.reverse()
+        entire_path = os.path.join(*paths)
+        if args.contains and bytes(args.contains) not in entire_path:
+            skip = True
+
         if not skip:
             if args.timestamp:
                 delta = event.ts - initial_ts
@@ -458,9 +470,7 @@
             if not args.full_path:
                 printb(b"%s" % event.name)
             else:
-                paths = entries[event.id]
-                paths.reverse()
-                printb(b"%s" % os.path.join(*paths))
+                printb(b"%s" % entire_path)
 
         if args.full_path:
             try:
```

Running the modified version now allows you to probe for specific file names, like all the `.bin` files:

```bash
$ sudo /usr/sbin/opensnoop-bpfcc.new --contains '.bin' --name qemu-system-x86
PID     COMM               FD ERR PATH
1316661 qemu-system-x86    21   0 /snap/lxd/current/share/qemu//kvmvapic.bin
1316661 qemu-system-x86    39   0 /snap/lxd/current/share/qemu//vgabios-virtio.bin
1316661 qemu-system-x86    39   0 /snap/lxd/current/share/qemu//vgabios-virtio.bin
```

### Use for other purposes, the limit is your imagination

And just like with all the other tools and examples, the limit is your imagination. Wanted to know which files in `/etc` your complex intertwined apache config is really loading?

```bash
$ sudo /usr/sbin/opensnoop-bpfcc.new --name 'apache2' --contains '/etc'
PID     COMM               FD ERR PATH
1319357 apache2             3   0 /etc/apache2/apache2.conf
1319357 apache2             4   0 /etc/apache2/mods-enabled
1319357 apache2             4   0 /etc/apache2/mods-enabled/access_compat.load
...
1319357 apache2             4   0 /etc/apache2/ports.conf
1319357 apache2             4   0 /etc/apache2/conf-enabled
1319357 apache2             4   0 /etc/apache2/conf-enabled/charset.conf
1319357 apache2             4   0 /etc/apache2/conf-enabled/localized-error-pages.conf
...
1319357 apache2             4   0 /etc/apache2/sites-enabled/000-default.conf
```

## eBPF's limitations

When you read the example code change above, it is worth noticing that, just like the existing `--name` option, the new filter is applied on the reporting side, not on event generation. The kernel side still produces an event for every matching syscall, so you should understand why you might receive eBPF messages like:

```text
Possibly lost 84 samples
```

Since eBPF programs produce events on a ring buffer, if the rate of events exceeds the pace at which the user-space process can consume them, a program will lose some events (overwritten since it's a ring). The "Possibly lost .. samples" message hints that this is happening. This is conceptually the same for almost all kernel tracing facilities. Since they are not allowed to slow down the kernel, you can't tell the tool to "wait until I've consumed". Most of the time this is fine, but advanced users might need to aggregate on the eBPF side to reduce what needs to be picked up by user-space.
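To make the trade-off above more concrete, here is a minimal pure-Python simulation. It does not use BCC or any kernel interface, and every name in it (`produce_events`, `run_ring`, `aggregate`, and the chosen capacities) is hypothetical and only for illustration. It follows the overwrite behaviour described above: a bounded buffer is filled faster than a slow consumer drains it, and the overwritten entries are counted as lost. It then contrasts this with aggregating counts per path, which is roughly what aggregating on the eBPF side in a map would achieve.

```python
from collections import Counter, deque


def produce_events(count):
    """Return `count` made-up (pid, path) events standing in for open() calls."""
    return [(1000 + i % 3, f"/tmp/file{i % 5}") for i in range(count)]


def run_ring(events, capacity, drain_every, drain_batch):
    """Feed events through a bounded ring with a slow consumer.

    The consumer only wakes up every `drain_every` events and reads at most
    `drain_batch` entries each time. Returns (consumed_events, lost_count).
    """
    ring = deque()
    consumed = []
    lost = 0
    for tick, event in enumerate(events, start=1):
        if len(ring) == capacity:
            ring.popleft()  # assumption: the oldest unread event is overwritten
            lost += 1
        ring.append(event)
        if tick % drain_every == 0:
            for _ in range(min(drain_batch, len(ring))):
                consumed.append(ring.popleft())
    consumed.extend(ring)  # final drain once tracing stops
    return consumed, lost


def aggregate(events):
    """Count events per path instead of streaming every single event."""
    return Counter(path for _pid, path in events)


if __name__ == "__main__":
    events = produce_events(100)
    consumed, lost = run_ring(events, capacity=8, drain_every=10, drain_batch=4)
    print(f"Possibly lost {lost} samples ({len(consumed)} delivered)")
    counts = aggregate(events)
    print(f"aggregated into {len(counts)} entries covering {sum(counts.values())} events")
```

In this sketch the streamed variant loses events whenever the consumer falls behind, while the aggregated variant hands over a handful of counters no matter how many events occurred, at the cost of losing per-event detail such as ordering and timestamps.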
Despite having lower overhead, eBPF tools still need to find their balance between buffering, dropping events and consuming CPU. See [this discussion](https://github.com/iovisor/bcc/issues/1033) of the same issue in the context of the tools shown above.

## Conclusion

eBPF offers a vast array of options to monitor, debug, and secure your systems directly in kernel-space (i.e., fast and omniscient), with no need to disrupt running services. It is an invaluable tool for system administrators and software engineers.

## References

* [Introduction to eBPF video, given at the Ubuntu summit 2024](https://www.youtube.com/live/byPpJW5l6pg?t=30314s), eventually presenting an eBPF based framework for Kubernetes called Inspector Gadget.
* For a deeper introduction to eBPF concepts, consider reading [what is eBPF](https://ebpf.io/what-is-ebpf/) by the eBPF community.
* For complete documentation on eBPF, its internals and interfaces, please check the [upstream documentation](https://docs.kernel.org/bpf/).
* The eBPF community also has a [list of eBPF based solutions](https://ebpf.io/applications/), many of which are related to the Kubernetes ecosystem.