Subject: Re: [PATCH net-next] tuntap: Fallback to automq on TUNSETSTEERINGEBPF prog negative return
To: Matt Cover
Cc: "Michael S. Tsirkin", davem@davemloft.net, ast@kernel.org, daniel@iogearbox.net, kafai@fb.com, songliubraving@fb.com, yhs@fb.com, Eric Dumazet, Stanislav Fomichev, Matthew Cover, mail@timurcelik.de, pabeni@redhat.com, Nicolas Dichtel, wangli39@baidu.com, lifei.shirley@bytedance.com, tglx@linutronix.de, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, bpf@vger.kernel.org
References: <20190920185843.4096-1-matthew.cover@stackpath.com> <20190922080326-mutt-send-email-mst@kernel.org> <20190922162546-mutt-send-email-mst@kernel.org> <7d3abb5d-c5a7-9fbd-f82e-88b4bf717a0b@redhat.com>
From: Jason Wang
Date: Mon, 23 Sep 2019 13:15:54 +0800
X-Mailing-List: linux-kernel@vger.kernel.org

On 2019/9/23 11:18 AM, Matt Cover wrote:
> On Sun, Sep 22, 2019 at 7:34 PM Jason Wang wrote:
>>
>> On 2019/9/23 9:15 AM, Matt Cover wrote:
>>> On Sun, Sep 22, 2019 at 5:51 PM Jason Wang wrote:
>>>> On 2019/9/23 6:30 AM, Matt Cover wrote:
>>>>> On Sun, Sep 22, 2019 at 1:36 PM Michael S. Tsirkin wrote:
>>>>>> On Sun, Sep 22, 2019 at 10:43:19AM -0700, Matt Cover wrote:
>>>>>>> On Sun, Sep 22, 2019 at 5:37 AM Michael S. Tsirkin wrote:
>>>>>>>> On Fri, Sep 20, 2019 at 11:58:43AM -0700, Matthew Cover wrote:
>>>>>>>>> Treat a negative return from a TUNSETSTEERINGEBPF bpf prog as a signal
>>>>>>>>> to fallback to tun_automq_select_queue() for tx queue selection.
>>>>>>>>>
>>>>>>>>> Compilation of this exact patch was tested.
>>>>>>>>>
>>>>>>>>> For functional testing 3 additional printk()s were added.
>>>>>>>>>
>>>>>>>>> Functional testing results (on 2 txq tap device):
>>>>>>>>>
>>>>>>>>> [Fri Sep 20 18:33:27 2019] ========== tun no prog ==========
>>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '-1'
>>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_automq_select_queue() ran
>>>>>>>>> [Fri Sep 20 18:33:27 2019] ========== tun prog -1 ==========
>>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '-1'
>>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '-1'
>>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_automq_select_queue() ran
>>>>>>>>> [Fri Sep 20 18:33:27 2019] ========== tun prog 0 ==========
>>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '0'
>>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '0'
>>>>>>>>> [Fri Sep 20 18:33:27 2019] ========== tun prog 1 ==========
>>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '1'
>>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '1'
>>>>>>>>> [Fri Sep 20 18:33:27 2019] ========== tun prog 2 ==========
>>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '2'
>>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '0'
>>>>>>>>>
>>>>>>>>> Signed-off-by: Matthew Cover
>>>>>>>> Could you add a bit more motivation data here?
>>>>>>> Thank you for these questions Michael.
>>>>>>>
>>>>>>> I'll plan on adding the below information to the
>>>>>>> commit message and submitting a v2 of this patch
>>>>>>> when net-next reopens. In the meantime, it would
>>>>>>> be very helpful to know if these answers address
>>>>>>> some of your concerns.
>>>>>>>
>>>>>>>> 1. why is this a good idea
>>>>>>> This change allows TUNSETSTEERINGEBPF progs to
>>>>>>> do any of the following.
>>>>>>>  1. implement queue selection for a subset of
>>>>>>>     traffic (e.g. special queue selection logic
>>>>>>>     for ipv4, but return negative and use the
>>>>>>>     default automq logic for ipv6)
>>>>>>>  2. determine there isn't sufficient information
>>>>>>>     to do proper queue selection; return
>>>>>>>     negative and use the default automq logic
>>>>>>>     for the unknown
>>>>>>>  3. implement a noop prog (e.g. do
>>>>>>>     bpf_trace_printk() then return negative and
>>>>>>>     use the default automq logic for everything)
>>>>>>>
>>>>>>>> 2. how do we know existing userspace does not rely on existing behaviour
>>>>>>> Prior to this change a negative return from a
>>>>>>> TUNSETSTEERINGEBPF prog would have been cast
>>>>>>> into a u16 and traversed netdev_cap_txqueue().
>>>>>>>
>>>>>>> In most cases netdev_cap_txqueue() would have
>>>>>>> found this value to exceed real_num_tx_queues
>>>>>>> and queue_index would be updated to 0.
>>>>>>>
>>>>>>> It is possible that a TUNSETSTEERINGEBPF prog
>>>>>>> returns a negative value which when cast into a
>>>>>>> u16 results in a positive queue_index less than
>>>>>>> real_num_tx_queues. For example, on x86_64, a
>>>>>>> return value of -65535 results in a queue_index
>>>>>>> of 1; which is a valid queue for any multiqueue
>>>>>>> device.
>>>>>>>
>>>>>>> It seems unlikely, however as stated above is
>>>>>>> unfortunately possible, that existing
>>>>>>> TUNSETSTEERINGEBPF programs would choose to
>>>>>>> return a negative value rather than return the
>>>>>>> positive value which holds the same meaning.
>>>>>>>
>>>>>>> It seems more likely that future
>>>>>>> TUNSETSTEERINGEBPF programs would leverage a
>>>>>>> negative return and potentially be loaded into
>>>>>>> a kernel with the old behavior.
>>>>>> OK if we are returning a special
>>>>>> value, shouldn't we limit it? How about a special
>>>>>> value with this meaning?
>>>>>> If we are changing an ABI let's at least make it
>>>>>> extensible.
>>>>>>
>>>>> A special value with this meaning sounds
>>>>> good to me. I'll plan on adding a define
>>>>> set to -1 to cause the fallback to automq.
>>>> Can it really return -1?
>>>>
>>>> I see:
>>>>
>>>> static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
>>>>                                         struct sk_buff *skb)
>>>> ...
>>>>
>>>>
>>>>> The way I was initially viewing the old
>>>>> behavior was that returning negative was
>>>>> undefined; it happened to have the
>>>>> outcomes I walked through, but not
>>>>> necessarily by design.
>>>> Having such a fallback may bring extra trouble: it requires the eBPF
>>>> program to know about the existence of a behavior which is not actually
>>>> part of the kernel ABI. And then some eBPF programs may start to rely on
>>>> that, which is pretty dangerous. Note, one important consideration is to
>>>> have macvtap support, which does not have anything like automq.
>>>>
>>>> Thanks
>>>>
>>> How about we call this TUN_SSE_ABORT
>>> instead of TUN_SSE_DO_AUTOMQ?
>>>
>>> TUN_SSE_ABORT could be documented as
>>> falling back to the default queue
>>> selection method in either space
>>> (presumably macvtap has some queue
>>> selection method when there is no prog).
>>
>> This looks like a more complex API, we don't want userspace to differ
>> macvtap from tap too much.
>>
>> Thanks
>>
> This is barely more complex and provides
> something similar to what is done in many places.
> For xdp, an XDP_PASS enacts what the
> kernel would do if there was no bpf prog.
> For tc cls in da mode, TC_ACT_OK enacts
> what the kernel would do if there was
> no bpf prog. For xt_bpf, false enacts
> what the kernel would do if there was
> no bpf prog (as long as negation
> isn't in play in the rule, I believe).


I think this is simply because you can't implement e.g. XDP_PASS/TC_ACT_OK
through eBPF itself, which is not the case for the steering prog here.


>
> I know that this is somewhat of an
> oversimplification and that each of
> these also means something else in
> the respective hookpoint, but I stand by
> seeing value in this change.
>
> macvtap must have some default (i.e. the
> action which it takes when no prog is
> loaded), even if that is just use queue
> 0. We can provide the same TUN_SSE_ABORT
> in userspace which does the same thing;
> enacts the default when returned. Any
> differences left between tap and macvtap
> would be in what the default is, not in
> these changes. And that difference already
> exists today.


I think it's safer to just drop the packet instead of trying to work
around it.

Thanks


>
>>>>> In order to keep the new behavior
>>>>> extensible, how should we state that a
>>>>> negative return other than -1 is
>>>>> undefined and therefore subject to
>>>>> change. Is something like this
>>>>> sufficient?
>>>>>
>>>>> Documentation/networking/tc-actions-env-rules.txt
>>>>>
>>>>> Additionally, what should the new
>>>>> behavior implement when a negative other
>>>>> than -1 is returned? I would like to have
>>>>> it do the same thing as -1 for now, but
>>>>> with the understanding that this behavior
>>>>> is undefined. Does this sound reasonable?
>>>>>
>>>>>>>> 3. why doesn't userspace need a way to figure out whether it runs on a kernel with and
>>>>>>>> without this patch
>>>>>>> There may be some value in exposing this fact
>>>>>>> to the ebpf prog loader. What is the standard
>>>>>>> practice here, a define?
>>>>>> We'll need something at runtime - people move binaries between kernels
>>>>>> without rebuilding them. An ioctl is one option.
>>>>>> A sysfs attribute is another, an ethtool flag yet another.
>>>>>> A combination of these is possible.
>>>>>>
>>>>>> And if we are doing this anyway, maybe let userspace select
>>>>>> the new behaviour? This way we can stay compatible with old
>>>>>> userspace...
>>>>>>
>>>>> Understood. I'll look into adding an
>>>>> ioctl to activate the new behavior. And
>>>>> perhaps a method of checking which
>>>>> behavior is currently active (in case we
>>>>> ever want to change the default, say
>>>>> after some suitably long transition
>>>>> period).
>>>>>
>>>>>>>> thanks,
>>>>>>>> MST
>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>  drivers/net/tun.c | 20 +++++++++++---------
>>>>>>>>>  1 file changed, 11 insertions(+), 9 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>>>>> index aab0be4..173d159 100644
>>>>>>>>> --- a/drivers/net/tun.c
>>>>>>>>> +++ b/drivers/net/tun.c
>>>>>>>>> @@ -583,35 +583,37 @@ static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
>>>>>>>>>          return txq;
>>>>>>>>>  }
>>>>>>>>>
>>>>>>>>> -static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)
>>>>>>>>> +static int tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)
>>>>>>>>>  {
>>>>>>>>>          struct tun_prog *prog;
>>>>>>>>>          u32 numqueues;
>>>>>>>>> -        u16 ret = 0;
>>>>>>>>> +        int ret = -1;
>>>>>>>>>
>>>>>>>>>          numqueues = READ_ONCE(tun->numqueues);
>>>>>>>>>          if (!numqueues)
>>>>>>>>>                  return 0;
>>>>>>>>>
>>>>>>>>> +        rcu_read_lock();
>>>>>>>>>          prog = rcu_dereference(tun->steering_prog);
>>>>>>>>>          if (prog)
>>>>>>>>>                  ret = bpf_prog_run_clear_cb(prog->prog, skb);
>>>>>>>>> +        rcu_read_unlock();
>>>>>>>>>
>>>>>>>>> -        return ret % numqueues;
>>>>>>>>> +        if (ret >= 0)
>>>>>>>>> +                ret %= numqueues;
>>>>>>>>> +
>>>>>>>>> +        return ret;
>>>>>>>>>  }
>>>>>>>>>
>>>>>>>>>  static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
>>>>>>>>>                              struct net_device *sb_dev)
>>>>>>>>>  {
>>>>>>>>>          struct tun_struct *tun = netdev_priv(dev);
>>>>>>>>> -        u16 ret;
>>>>>>>>> +        int ret;
>>>>>>>>>
>>>>>>>>> -        rcu_read_lock();
>>>>>>>>> -        if (rcu_dereference(tun->steering_prog))
>>>>>>>>> -                ret = tun_ebpf_select_queue(tun, skb);
>>>>>>>>> -        else
>>>>>>>>> +        ret = tun_ebpf_select_queue(tun, skb);
>>>>>>>>> +        if (ret < 0)
>>>>>>>>>                  ret = tun_automq_select_queue(tun, skb);
>>>>>>>>> -        rcu_read_unlock();
>>>>>>>>>
>>>>>>>>>          return ret;
>>>>>>>>>  }
>>>>>>>>> --
>>>>>>>>> 1.8.3.1
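
To illustrate the u16 truncation discussed in the quoted text, the selection
logic before and after the patch can be modelled in userspace C. This is only
a sketch under stated assumptions, not kernel code: old_ebpf_select(),
new_ebpf_select() and new_select_queue() are hypothetical names that mirror
the removed and added lines of the quoted diff, the program's return value is
modelled as a signed 32-bit integer, and automq_choice is a stand-in for
whatever tun_automq_select_queue() would have picked.

```c
#include <assert.h>
#include <stdint.h>

/* Pre-patch model (removed lines of the quoted diff): the program's return
 * value is stored in a u16 and then reduced modulo numqueues. */
static uint16_t old_ebpf_select(int32_t prog_ret, uint32_t numqueues)
{
	assert(numqueues > 0);
	uint16_t ret = (uint16_t)prog_ret;	/* keeps only the low 16 bits */
	return (uint16_t)(ret % numqueues);
}

/* Post-patch model (added lines): a negative value is passed through
 * unchanged so that the caller can detect it and fall back. */
static int new_ebpf_select(int32_t prog_ret, uint32_t numqueues)
{
	assert(numqueues > 0);
	int ret = prog_ret;
	if (ret >= 0)
		ret = (int)((uint32_t)ret % numqueues);
	return ret;
}

/* Caller model: automq_choice is a hypothetical stand-in for the result of
 * tun_automq_select_queue(). */
static uint16_t new_select_queue(int32_t prog_ret, uint32_t numqueues,
				 uint16_t automq_choice)
{
	int ret = new_ebpf_select(prog_ret, numqueues);
	if (ret < 0)
		ret = automq_choice;
	return (uint16_t)ret;
}
```

In this model, with two queues, old_ebpf_select(-65535, 2) yields queue 1,
because -65535 truncated to 16 bits is 1, while new_select_queue(-1, 2, q)
returns q, the fallback choice.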