Subject: Re: [PATCH net-next] tuntap: Fallback to automq on TUNSETSTEERINGEBPF prog negative return
From: Jason Wang
To: Matt Cover, "Michael S. Tsirkin"
Cc: davem@davemloft.net, ast@kernel.org, daniel@iogearbox.net, kafai@fb.com,
    songliubraving@fb.com, yhs@fb.com, Eric Dumazet, Stanislav Fomichev,
    Matthew Cover, mail@timurcelik.de, pabeni@redhat.com, Nicolas Dichtel,
    wangli39@baidu.com, lifei.shirley@bytedance.com, tglx@linutronix.de,
    netdev@vger.kernel.org, linux-kernel@vger.kernel.org, bpf@vger.kernel.org
Date: Mon, 23 Sep 2019 08:51:04 +0800
References: <20190920185843.4096-1-matthew.cover@stackpath.com>
    <20190922080326-mutt-send-email-mst@kernel.org>
    <20190922162546-mutt-send-email-mst@kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2019/9/23 6:30 AM, Matt Cover wrote:
> On Sun, Sep 22, 2019 at 1:36 PM Michael S. Tsirkin wrote:
>> On Sun, Sep 22, 2019 at 10:43:19AM -0700, Matt Cover wrote:
>>> On Sun, Sep 22, 2019 at 5:37 AM Michael S. Tsirkin wrote:
>>>> On Fri, Sep 20, 2019 at 11:58:43AM -0700, Matthew Cover wrote:
>>>>> Treat a negative return from a TUNSETSTEERINGEBPF bpf prog as a signal
>>>>> to fall back to tun_automq_select_queue() for tx queue selection.
>>>>>
>>>>> Compilation of this exact patch was tested.
>>>>>
>>>>> For functional testing 3 additional printk()s were added.
>>>>>
>>>>> Functional testing results (on 2 txq tap device):
>>>>>
>>>>> [Fri Sep 20 18:33:27 2019] ========== tun no prog ==========
>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '-1'
>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_automq_select_queue() ran
>>>>> [Fri Sep 20 18:33:27 2019] ========== tun prog -1 ==========
>>>>> [Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '-1'
>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '-1'
>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_automq_select_queue() ran
>>>>> [Fri Sep 20 18:33:27 2019] ========== tun prog 0 ==========
>>>>> [Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '0'
>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '0'
>>>>> [Fri Sep 20 18:33:27 2019] ========== tun prog 1 ==========
>>>>> [Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '1'
>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '1'
>>>>> [Fri Sep 20 18:33:27 2019] ========== tun prog 2 ==========
>>>>> [Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '2'
>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '0'
>>>>>
>>>>> Signed-off-by: Matthew Cover
>>>>
>>>> Could you add a bit more motivation data here?
>>> Thank you for these questions Michael.
>>>
>>> I'll plan on adding the below information to the
>>> commit message and submitting a v2 of this patch
>>> when net-next reopens. In the meantime, it would
>>> be very helpful to know if these answers address
>>> some of your concerns.
>>>
>>>> 1. why is this a good idea
>>> This change allows TUNSETSTEERINGEBPF progs to
>>> do any of the following.
>>> 1. implement queue selection for a subset of
>>>    traffic (e.g. special queue selection logic
>>>    for ipv4, but return negative and use the
>>>    default automq logic for ipv6)
>>> 2. determine there isn't sufficient information
>>>    to do proper queue selection; return
>>>    negative and use the default automq logic
>>>    for the unknown
>>> 3. implement a noop prog (e.g. do
>>>    bpf_trace_printk() then return negative and
>>>    use the default automq logic for everything)
>>>
>>>> 2. how do we know existing userspace does not rely on existing behaviour
>>> Prior to this change a negative return from a
>>> TUNSETSTEERINGEBPF prog would have been cast
>>> into a u16 and traversed netdev_cap_txqueue().
>>>
>>> In most cases netdev_cap_txqueue() would have
>>> found this value to exceed real_num_tx_queues
>>> and queue_index would be updated to 0.
>>>
>>> It is possible that a TUNSETSTEERINGEBPF prog
>>> returns a negative value which when cast into a
>>> u16 results in a positive queue_index less than
>>> real_num_tx_queues. For example, on x86_64, a
>>> return value of -65535 results in a queue_index
>>> of 1; which is a valid queue for any multiqueue
>>> device.
>>>
>>> It seems unlikely, though as stated above it is
>>> unfortunately possible, that existing
>>> TUNSETSTEERINGEBPF programs would choose to
>>> return a negative value rather than return the
>>> positive value which holds the same meaning.
>>>
>>> It seems more likely that future
>>> TUNSETSTEERINGEBPF programs would leverage a
>>> negative return and potentially be loaded into
>>> a kernel with the old behavior.
>> OK if we are returning a special
>> value, shouldn't we limit it? How about a special
>> value with this meaning?
>> If we are changing an ABI let's at least make it
>> extensible.
>>
> A special value with this meaning sounds
> good to me. I'll plan on adding a define
> set to -1 to cause the fallback to automq.


Can it really return -1?

I see:

static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
                                        struct sk_buff *skb)
...
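
To make the integer conversions in this exchange concrete, here is a minimal userspace sketch. It is not kernel code: the helper names prog_ret_as_u16(), prog_ret_as_int() and check_conversions() are hypothetical, and the sketch only models what happens when the u32 returned by the program is stored in a u16 (the pre-patch "u16 ret") or in an int (the patched "int ret"). The signed case is written with explicit arithmetic so it does not rely on implementation-defined narrowing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: models storing the program's u32 return value in a
 * u16, as the pre-patch "u16 ret" did. Only the low 16 bits survive. */
uint16_t prog_ret_as_u16(uint32_t prog_ret)
{
	return (uint16_t)prog_ret;
}

/* Hypothetical helper: models storing the same u32 in an int, as the
 * patched "int ret" does. Uses explicit arithmetic so the result does not
 * depend on how the compiler narrows out-of-range values. */
int32_t prog_ret_as_int(uint32_t prog_ret)
{
	if (prog_ret <= (uint32_t)INT32_MAX)
		return (int32_t)prog_ret;
	return -(int32_t)(UINT32_MAX - prog_ret) - 1;
}

/* Values taken from the discussion above. */
void check_conversions(void)
{
	assert(prog_ret_as_u16((uint32_t)-1) == 65535);  /* out of range for 2 queues */
	assert(prog_ret_as_u16((uint32_t)-65535) == 1);  /* looks like a valid queue */
	assert(prog_ret_as_int((uint32_t)-1) == -1);     /* u32 all-ones reads back as -1 */
	assert(prog_ret_as_int(2u) == 2);
}
```

Under these assumptions a program that returns -1 hands back 0xFFFFFFFF through the u32 return type, and that value reads as -1 again once it is stored in a signed int.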
>
> The way I was initially viewing the old
> behavior was that returning negative was
> undefined; it happened to have the
> outcomes I walked through, but not
> necessarily by design.


Having such a fallback may bring extra trouble: it requires the eBPF
program to know about a behavior which is not actually part of the
kernel ABI. Some eBPF programs may then start to rely on it, which is
pretty dangerous. Note, one important consideration is supporting
macvtap, which does not have anything like automq.

Thanks


>
> In order to keep the new behavior
> extensible, how should we state that a
> negative return other than -1 is
> undefined and therefore subject to
> change? Is something like this
> sufficient?
>
> Documentation/networking/tc-actions-env-rules.txt
>
> Additionally, what should the new
> behavior implement when a negative other
> than -1 is returned? I would like to have
> it do the same thing as -1 for now, but
> with the understanding that this behavior
> is undefined. Does this sound reasonable?
>
>>>> 3. why doesn't userspace need a way to figure out whether it runs on a kernel with and
>>>> without this patch
>>> There may be some value in exposing this fact
>>> to the ebpf prog loader. What is the standard
>>> practice here, a define?
>>
>> We'll need something at runtime - people move binaries between kernels
>> without rebuilding them. An ioctl is one option.
>> A sysfs attribute is another, an ethtool flag yet another.
>> A combination of these is possible.
>>
>> And if we are doing this anyway, maybe let userspace select
>> the new behaviour? This way we can stay compatible with old
>> userspace...
>>
> Understood. I'll look into adding an
> ioctl to activate the new behavior. And
> perhaps a method of checking which
> behavior is currently active (in case we
> ever want to change the default, say
> after some suitably long transition
> period).
>
>>>>
>>>> thanks,
>>>> MST
>>>>
>>>>> ---
>>>>>  drivers/net/tun.c | 20 +++++++++++---------
>>>>>  1 file changed, 11 insertions(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>> index aab0be4..173d159 100644
>>>>> --- a/drivers/net/tun.c
>>>>> +++ b/drivers/net/tun.c
>>>>> @@ -583,35 +583,37 @@ static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
>>>>>  	return txq;
>>>>>  }
>>>>>
>>>>> -static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)
>>>>> +static int tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)
>>>>>  {
>>>>>  	struct tun_prog *prog;
>>>>>  	u32 numqueues;
>>>>> -	u16 ret = 0;
>>>>> +	int ret = -1;
>>>>>
>>>>>  	numqueues = READ_ONCE(tun->numqueues);
>>>>>  	if (!numqueues)
>>>>>  		return 0;
>>>>>
>>>>> +	rcu_read_lock();
>>>>>  	prog = rcu_dereference(tun->steering_prog);
>>>>>  	if (prog)
>>>>>  		ret = bpf_prog_run_clear_cb(prog->prog, skb);
>>>>> +	rcu_read_unlock();
>>>>>
>>>>> -	return ret % numqueues;
>>>>> +	if (ret >= 0)
>>>>> +		ret %= numqueues;
>>>>> +
>>>>> +	return ret;
>>>>>  }
>>>>>
>>>>>  static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
>>>>>  			    struct net_device *sb_dev)
>>>>>  {
>>>>>  	struct tun_struct *tun = netdev_priv(dev);
>>>>> -	u16 ret;
>>>>> +	int ret;
>>>>>
>>>>> -	rcu_read_lock();
>>>>> -	if (rcu_dereference(tun->steering_prog))
>>>>> -		ret = tun_ebpf_select_queue(tun, skb);
>>>>> -	else
>>>>> +	ret = tun_ebpf_select_queue(tun, skb);
>>>>> +	if (ret < 0)
>>>>>  		ret = tun_automq_select_queue(tun, skb);
>>>>> -	rcu_read_unlock();
>>>>>
>>>>>  	return ret;
>>>>>  }
>>>>> --
>>>>> 1.8.3.1
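
The control flow of the quoted patch can be modelled in userspace to check it against the functional testing results on a 2 txq device. This is only a sketch under stated assumptions: steering_prog_fn, model_ebpf_select_queue(), model_select_queue(), the prog_* functions and check_fallback() are hypothetical names, the packet is an opaque pointer, RCU locking is omitted, and the automq result is passed in as a plain argument instead of being computed from the flow hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an attached steering program. It returns a
 * u32, matching the return type of bpf_prog_run_clear_cb() quoted above. */
typedef uint32_t (*steering_prog_fn)(const void *pkt);

/* Portable u32 -> int conversion, so 0xFFFFFFFF reads back as -1. */
static int u32_to_int(uint32_t v)
{
	if (v <= (uint32_t)INT32_MAX)
		return (int)v;
	return -(int)(UINT32_MAX - v) - 1;
}

/* Model of the patched tun_ebpf_select_queue(): a negative result means
 * "no decision", either because no program is attached or it said so. */
int model_ebpf_select_queue(steering_prog_fn prog, const void *pkt,
			    uint32_t numqueues)
{
	int ret = -1;

	if (!numqueues)
		return 0;
	if (prog)
		ret = u32_to_int(prog(pkt));
	if (ret >= 0)
		ret = (int)((uint32_t)ret % numqueues);
	return ret;
}

/* Model of the patched tun_select_queue(); automq_choice stands in for
 * whatever tun_automq_select_queue() would have picked. */
uint16_t model_select_queue(steering_prog_fn prog, const void *pkt,
			    uint32_t numqueues, uint16_t automq_choice)
{
	int ret = model_ebpf_select_queue(prog, pkt, numqueues);

	if (ret < 0)
		ret = automq_choice;
	return (uint16_t)ret;
}

/* Hypothetical steering programs mirroring the "tun prog N" test cases. */
uint32_t prog_minus_one(const void *pkt) { (void)pkt; return (uint32_t)-1; }
uint32_t prog_zero(const void *pkt) { (void)pkt; return 0; }
uint32_t prog_one(const void *pkt) { (void)pkt; return 1; }
uint32_t prog_two(const void *pkt) { (void)pkt; return 2; }

/* Same cases as the functional testing results, with 2 tx queues. */
void check_fallback(void)
{
	assert(model_ebpf_select_queue(NULL, NULL, 2) == -1);
	assert(model_ebpf_select_queue(prog_minus_one, NULL, 2) == -1);
	assert(model_ebpf_select_queue(prog_zero, NULL, 2) == 0);
	assert(model_ebpf_select_queue(prog_one, NULL, 2) == 1);
	assert(model_ebpf_select_queue(prog_two, NULL, 2) == 0);
}
```

In this model any negative return triggers the fallback, not just -1, which is the extensibility question raised earlier in the thread.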