Date: Thu, 14 Dec 2023 09:40:42 -0800
From: Stephen Hemminger
To: Akihiko Odaki
Cc: Benjamin Tissoires, Alexei Starovoitov, Jason Wang, Alexei Starovoitov,
 Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Yonghong Song,
 John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
 Jonathan Corbet, Willem de Bruijn, "David S. Miller", Eric Dumazet,
 Jakub Kicinski, Paolo Abeni, "Michael S. Tsirkin", Xuan Zhuo,
 Mykola Lysenko, Shuah Khan, Yuri Benditovich, Andrew Melnychenko,
 Benjamin Tissoires, bpf, "open list:DOCUMENTATION", kvm@vger.kernel.org,
 LKML, virtualization@lists.linux-foundation.org,
 "open list:KERNEL SELFTEST FRAMEWORK", Network Development
Subject: Re: Should I add BPF kfuncs for userspace apps? And how?
Message-ID: <20231214094042.75f704f6@hermes.local>
In-Reply-To: <0d68722c-9e29-407b-9ef0-331683c995d2@daynix.com>
References: <2f33be45-fe11-4b69-8e89-4d2824a0bf01@daynix.com>
 <0d68722c-9e29-407b-9ef0-331683c995d2@daynix.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 14 Dec 2023 14:51:12 +0900
Akihiko Odaki wrote:

> On 2023/12/13 19:22, Benjamin Tissoires wrote:
> > On Tue, Dec 12, 2023 at 1:41 PM Akihiko Odaki wrote:
> >>
> >> On 2023/12/12 19:39, Benjamin Tissoires wrote:
> >>> Hi,
> >>>
> >>> On Tue, Dec 12, 2023 at 9:11 AM Akihiko Odaki wrote:
> >>>>
> >>>> Hi,
> >>
> >> Hi,
> >>
> >> Thanks for reply.
> >>
> >>>>
> >>>> It is said eBPF is a safe way to extend kernels and that is very
> >>>> attractive, but we need to use kfuncs to add new usage of eBPF and
> >>>> kfuncs are said as unstable as EXPORT_SYMBOL_GPL. So now I'd like to ask
> >>>> some questions:
> >>>>
> >>>> 1) Which should I choose, BPF kfuncs or ioctl, when adding a new feature
> >>>> for userspace apps?
> >>>> 2) How should I use BPF kfuncs from userspace apps if I add them?
> >>>>
> >>>> Here, a "userspace app" means something not like a system-wide daemon
> >>>> like systemd (particularly, I have QEMU in mind). I'll describe the
> >>>> context more below:
> >>>
> >>> I'm probably not the best person in the world to answer your
> >>> questions, Alexei and others from the BPF core group are, but given
> >>> that you pointed at a thread I was involved in, I feel I can give you
> >>> a few pointers.
> >>>
> >>> But first and foremost, I encourage you to schedule an agenda item in
> >>> the BPF office hour[4]. Being able to talk with the core people
> >>> directly was tremendously helpful to me to understand their point.
> >>
> >> I prefer emails because I'm not very fluent when speaking in English and
> >> may have difficulty listening to other people, but I may try it in
> >> future.
> >>
> >>>
> >>>
> >>>>
> >>>> ---
> >>>>
> >>>> I'm working on a new feature that aids virtio-net implementations using
> >>>> tuntap virtual network device. You can see [1] for details, but
> >>>> basically it's to extend BPF_PROG_TYPE_SOCKET_FILTER to report four more
> >>>> bytes.
> >>>>
> >>>> However, with long discussions we have confirmed extending
> >>>> BPF_PROG_TYPE_SOCKET_FILTER is not going to happen, and adding kfuncs is
> >>>> the way forward.
> >>>> So I decided how to add kfuncs to the kernel and how to
> >>>> use it. There are rich documentations for the kernel side, but I found
> >>>> little about the userspace. The best I could find is a systemd change
> >>>> proposal that is based on WIP kernel changes[2].
> >>>
> >>> Yes, as Alexei already replied, BPF is not adding new stable APIs,
> >>> only kfuncs. The reason being that once it's marked as stable, you
> >>> can't really remove it, even if you think it's badly designed and
> >>> useless.
> >>>
> >>> Kfuncs, OTOH are "unstable" by default meaning that the constraints
> >>> around it are more relaxed.
> >>>
> >>> However, "unstable" doesn't mean "unusable". It just means that the
> >>> kernel might or might not have the function when you load your program
> >>> in userspace. So you have to take that fact into account from day one,
> >>> both from the kernel side and the userspace side. The kernel docs have
> >>> a nice paragraph explaining that situation and makes the distinction
> >>> between relatively unused kfuncs, and well known established ones.
> >>>
> >>> Regarding the systemd discussion you are mentioning ([2]), this is
> >>> something that I have on my plate for a long time. I think I even
> >>> mentioned it to Alexei at Kernel Recipes this year, and he frowned his
> >>> eyebrows when I mentioned it. And looking at the systemd code and the
> >>> benefits over a plain ioctl, it is clearer that in that case, a plain
> >>> ioctl is better, mostly because we already know the API and the
> >>> semantic.
> >>>
> >>> A kfunc would be interesting in cases where you are not sure about the
> >>> overall design, and so you can give a shot at various API solutions
> >>> without having to keep your bad v1 design forever.
> >>>
> >>>>
> >>>> So now I'm wondering how I should use BPF kfuncs from userspace apps if
> >>>> I add them.
> >>>> In the systemd discussion, it is told that Linus said it's
> >>>> fine to use BPF kfuncs in a private infrastructure big companies own, or
> >>>> in systemd as those users know well about the system[3]. Indeed, those
> >>>> users should be able to make more assumptions on the kernel than
> >>>> "normal" userspace applications can.
> >>>>
> >>>> Returning to my proposal, I'm proposing a new feature to be used by QEMU
> >>>> or other VMM applications. QEMU is more like a normal userspace
> >>>> application, and usually does not make much assumptions on the kernel it
> >>>> runs on. For example, it's generally safe to run a Debian container
> >>>> including QEMU installed with apt on Fedora. BPF kfuncs may work even in
> >>>> such a situation thanks to CO-RE, but it sounds like *accidentally*
> >>>> creating UAPIs.
> >>>>
> >>>> Considering all above, how can I integrate BPF kfuncs to the application?
> >>>
> >>> FWIW, I'm not sure you can rely on BPF calls from a container. There
> >>> is a high chance the syscall gets disabled by the runtime.
> >>
> >> Right. Container runtimes will not pass CAP_BPF by default, but that
> >> restriction can be lifted and I think that's a valid scenario.
> >>
> >>>
> >>>>
> >>>> If BPF kfuncs are like EXPORT_SYMBOL_GPL, the natural way to handle them
> >>>> is to think of BPF programs as some sort of kernel modules and
> >>>> incorporate logic that behaves like modprobe. More concretely, I can put
> >>>> eBPF binaries to a directory like:
> >>>> /usr/local/share/qemu/ebpf/$KERNEL_RELEASE
> >>>
> >>> I would advise against that (one program per kernel release). Simply
> >>> because your kfunc may or may not have been backported to kernel
> >>> release v6.X.Y+1 while it was not there when v6.X.Y was out. So
> >>> relying on the kernel number is just going to be a headache.
> >>>
> >>> As I understand it, the way forward is to rely on the kernel, libbpf
> >>> and CO-RE: if the function is not available, the program will simply
> >>> not load, and you'll know that this version of the code is not
> >>> available (or has changed API).
> >>>
> >>> So what I would do if some kfunc API is becoming deprecated, is
> >>> embedding both code paths in the same BPF unit, but marking them as
> >>> not loaded by libbpf. Then I can load the compilation unit, try v2 of
> >>> the API, and if it's not available, try v1, and if not, then mention
> >>> that I can not rely on BPF. Of course, this can also be done with
> >>> separate compilation units.
> >>
> >> Doesn't it mean that the kernel is free to break old versions of QEMU
> >> including BPF programs? That's something I'd like to avoid.
> >
> > Couple of points here:
> > - when you say "the kernel", it feels like you are talking about an
> > external actor tampering with your code. But if you submit a kernel
> > patch with a specific use case and get yourself involved in the
> > community, why would anybody change your kfunc API without you knowing
> > it?
>
> You are right in the practical aspect. I can pay efforts to keep kfunc
> APIs alive and I'm also sure other developers would also try not to
> break them for good.
>
> Nevertheless I'm being careful to evaluate APIs from both of the kernel
> and userspace (QEMU) viewpoints. If I fail to keep kfuncs stable because
> I die in an accident, for example, it's a poor excuse for other QEMU
> developers that I intended to keep them stable with my personal effort.
>
> > - the whole warning about "unstable" policy means that the user space
> > component should not take for granted the capability.
> > So if the kfunc
> > changes/disappears for good reasons (because it was marked as well
> > used and deprecated for quite some time), qemu should not *break*, it
> > should not provide the functionality, or have a secondary plan.
> >
> > But even if you are encountering such issues, in case of a change in
> > the ABI of your kfunc, it should be easy enough to backport the bpf
> > changes to your old QEMUs and ask users to upgrade the user space if
> > they upgrade their kernel.
> >
> > AFAIU, it is as unstable as you want it to be. It's just that we are
> > not in the "we don't break user space" contract, because we are
> > talking about adding a kernel functionality from userspace, which
> > requires knowing the kernel intrinsics.
>
> I must admit I'm still not convinced the proposed BPF program
> functionality needs to know internals of the kernel.
>
> The eBPF program QEMU carries is just to calculate hashes from packets.
> It doesn't need to know the details of how the kernel handles packets.
> It only needs to have an access to the packet content.
>
> It is exactly what BPF_PROG_TYPE_SOCKET_FILTER does, but it lacks a
> mechanism to report hash values so I need to extend it or invent a new
> method. Extending BPF_PROG_TYPE_SOCKET_FILTER is not a way forward since
> CO-RE is superior to the context rewrite it relies on. But apparently
> adopting kfuncs and CO-RE also means to lose the "we don't break user
> space" contract although I have no intention to expose kernel internals
> to the eBPF program.

An example is how one part of DPDK recomputes RSS over TAP.

https://git.dpdk.org/dpdk/tree/drivers/net/tap/bpf/tap_bpf_program.c

This feature is likely to be removed, because it is not actively used
and the changes in BPF program loading broke it on current kernel
releases.

Which brings up the point that since the kernel does not have a stable
API/ABI for BPF program infrastructure, I would avoid it for projects
that don't want to deal with that.
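
For a project that does decide to deal with it, the fallback order
described in the quoted text (try v2 of the kfunc API, then v1, then
run without BPF) could be sketched roughly as below. This is only an
illustration, not code from QEMU, DPDK or libbpf: struct bpf_variant,
pick_variant(), the stub loaders and the "hash_v1"/"hash_v2" names are
all hypothetical. In a real program each try_load callback would wrap
an actual load attempt, for example enabling a single program with
libbpf's bpf_program__set_autoload() and calling bpf_object__load().

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical: one entry per BPF program variant, each built against
 * one version of a kfunc API. try_load() returns true if the running
 * kernel accepted that variant.
 */
struct bpf_variant {
	const char *name;
	bool (*try_load)(void);
};

/*
 * Walk the variants in order (newest API first) and return the first
 * one that loads. NULL means no variant is usable and the caller has
 * to take its non-BPF path.
 */
const struct bpf_variant *pick_variant(const struct bpf_variant *v, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (v[i].try_load())
			return &v[i];
	}
	return NULL;
}

/* Stub loaders standing in for real load attempts (assumption). */
bool load_rejected(void) { return false; }
bool load_accepted(void) { return true; }

/* Example table: the kernel lacks the v2 kfunc but still has v1. */
const struct bpf_variant example_variants[] = {
	{ "hash_v2", load_rejected },
	{ "hash_v1", load_accepted },
};
```

With the example table, pick_variant(example_variants, 2) returns the
"hash_v1" entry. A NULL result is the case where the application
reports that it cannot rely on BPF and uses its secondary plan.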