Cong Wang and Daniel Borkmann each led a session at the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit about their respective plans to speed up networking in the Linux kernel. Both sessions described ways to remove unnecessary operations in the networking stack, but they focused on different areas. Wang spoke about using BPF to speed up socket operations, while Borkmann spoke about eliminating the overhead of networking for virtual machines.
sk_msg
Wang began by explaining that struct sk_msg is a data structure used internally for socket-layer messaging. He compared it to the more widely used struct sk_buff, but said that sk_msg was much simpler. BPF programs can access sk_msg structures through socket maps; there, they are primarily used to let BPF programs redirect messages between sockets.
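As a rough illustration of the mechanism (a hedged sketch, not code from Wang's session), an sk_msg program attached to a sockhash map can redirect each message to whichever socket is stored under a given key; the fixed key and map layout below are assumptions made for brevity:

    // SPDX-License-Identifier: GPL-2.0
    /* Minimal sk_msg redirection sketch; the map layout and the fixed
     * key are illustrative assumptions. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_SOCKHASH);
        __uint(max_entries, 1024);
        __type(key, __u32);
        __type(value, __u64);
    } sock_map SEC(".maps");

    SEC("sk_msg")
    int msg_redirect(struct sk_msg_md *msg)
    {
        __u32 key = 0;  /* hypothetical: real code would derive the key
                         * from the message's socket metadata */

        /* Queue the message on the ingress path of the socket stored
         * under "key", without traversing the local TCP stack. */
        return bpf_msg_redirect_hash(msg, &sock_map, &key, BPF_F_INGRESS);
    }

    char _license[] SEC("license") = "GPL";

A user-space loader would populate sock_map with established sockets and attach the program to the map with the BPF_SK_MSG_VERDICT attach type.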
There are a few use cases for redirections like this; one example is bypassing the TCP stack when sending messages between a client and a server on the same machine. This can avoid unnecessary overhead, Wang explained, but it is only helpful if forwarding the messages in BPF is actually faster. After questioning from Borkmann, Wang clarified that this use case is speculative; it is not actually being used in production.
Depending on the types of sockets involved, exactly how messages are redirected can vary. When redirecting from a transmitting socket to a receiving socket, for example, the BPF program doesn't need to make any changes to the data at all, resulting in a fast transfer. When redirecting from a receiving socket to a transmitting socket, on the other hand, the BPF program needs to perform a series of conversions to put the received data into the right format for transmission.
The TCP stack has seen a lot of optimization over the years; for example, it efficiently batches short messages. As a result, for short messages, BPF redirection is actually slower than traversing the whole TCP stack. That has been partially addressed by work from Zijian Zhang to add a buffer that batches short messages in sockets. Wang thinks that performance can be improved further, however, by reusing the TCP stack's code for Nagle's algorithm.
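For reference, the rule at the heart of Nagle's algorithm is simple, which is part of what makes reusing the existing TCP code attractive; the following is a simplified sketch of the decision, not the kernel's actual implementation:

    #include <stdbool.h>
    #include <stddef.h>

    /* Simplified sketch of the Nagle decision: hold back a small message
     * while earlier data is still unacknowledged, so that several small
     * writes can be coalesced into one full-sized segment. */
    static bool nagle_should_defer(size_t msg_len, size_t mss,
                                   bool unacked_data, bool nodelay)
    {
        if (nodelay)            /* TCP_NODELAY turns batching off */
            return false;
        if (msg_len >= mss)     /* full-sized segments are sent immediately */
            return false;
        return unacked_data;    /* otherwise, wait until the ACK arrives */
    }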
Wang then presented a number of different, more speculative ideas for improving performance. These included introducing new, more efficient interfaces for BPF programs manipulating socket messages, removing locks where possible, and simplifying the transformations needed for the receiving-socket-to-transmitting-socket case.
There was some discussion of where sk_msg structures are used throughout the kernel and how those areas would be impacted. Wang closed out the session with the observation that TCP sockets are widely used; increasingly, containerized workloads use TCP sockets to communicate within the same physical machine. Any work to speed up local sockets will undoubtedly be generally useful.
Netkit for virtual machines
Virtual machines (VMs) provide comprehensive isolation from the physical hardware, at the cost of additional overhead. Where possible, it would be nice to reduce that overhead. Borkmann spoke about his work to remove some of the overhead of networking in VMs, as part of a larger plan to try to make VM workloads and container workloads use the same underlying tooling in Kubernetes.
Today, a VM running under Kubernetes runs inside a container with QEMU. This odd state of affairs is because Kubernetes started as a container-management engine, so putting the virtual-machine manager inside a container lets Kubernetes reuse many existing tools. Borkmann shared a slide illustrating what this arrangement does to the networking stack.
In short, a network packet destined for an application running in a virtual machine must be received by the physical hardware, handled by the host kernel, forwarded to the virtual container bridge network, given to the host side of QEMU's virtual network device, passed into the virtual machine, and finally handled by the guest kernel.
This is a lot of unnecessary work, Borkmann said. About a year ago, QEMU got a new networking backend based on AF_XDP sockets; he suspected that AF_XDP sockets could be used to bypass the steps above. The change is not trivial because the express data path (XDP) is not supported inside network namespaces (which containers use). Borkmann's idea was to reserve a set of queues on the physical network card, bind those to Cilium's netkit (a kernel driver designed to help reduce the overhead of network namespaces), and dedicate those queues to the network namespace of the container.
This would let traffic go directly from the physical hardware, to QEMU's AF_XDP networking backend, to the VM's kernel. This is about as minimal as the overhead could be, because the host system still needs to be in control of the actual hardware. The design would also let BPF programs running on the host intercept and modify traffic as normal.
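The building block underneath both QEMU's AF_XDP backend and the proposed netkit integration is an AF_XDP socket bound to one specific queue of a device. The following is a hedged sketch of that step using libxdp's xsk helpers; the device name and queue number are assumptions, and error handling is omitted:

    #include <stdlib.h>
    #include <unistd.h>
    #include <xdp/xsk.h>

    #define NUM_FRAMES 4096
    #define FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE

    int main(void)
    {
        struct xsk_ring_prod fill, tx;
        struct xsk_ring_cons comp, rx;
        struct xsk_umem *umem;
        struct xsk_socket *xsk;
        void *buffers;

        /* Packet buffers (the UMEM) shared between kernel and user space. */
        posix_memalign(&buffers, getpagesize(), (size_t)NUM_FRAMES * FRAME_SIZE);
        xsk_umem__create(&umem, buffers, (size_t)NUM_FRAMES * FRAME_SIZE,
                         &fill, &comp, NULL);

        /* Bind to queue 8 of eth0 -- standing in for one of the queues
         * that would be reserved for the VM's traffic. */
        xsk_socket__create(&xsk, "eth0", 8, umem, &rx, &tx, NULL);

        /* ... populate the fill ring, then receive packets from rx ... */

        xsk_socket__delete(xsk);
        xsk_umem__delete(umem);
        free(buffers);
        return 0;
    }

Under the design Borkmann described, QEMU's AF_XDP backend would hold sockets like this for the reserved queues, with netkit making them usable from inside the container's network namespace.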
Just before the summit, Borkmann got a proof-of-concept implementation working. The code is not too complicated, he said, but there are still several APIs that he would like to tweak slightly in order to simplify the design. In particular, the XDP API is fairly limited compared to what a hardware networking device offers; Borkmann wants to extend that API with support for various kinds of hardware offload.
Although that session was not the last in the BPF track, it does mark the completion of LWN's coverage for this year. The last session in the BPF track was already covered in the same article as Mahé Tardy's earlier session.