Subject: Re: [PATCH v4] netdev attribute to control xdpgeneric skb linearization
To: Willem de Bruijn, Jakub Kicinski
Cc: Luigi Rizzo, Network Development, Toke Høiland-Jørgensen,
    David Miller, hawk@kernel.org, "Jubran, Samih", linux-kernel,
    ast@kernel.org, bpf@vger.kernel.org
References: <20200228105435.75298-1-lrizzo@google.com>
    <20200228110043.2771fddb@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
From: Daniel Borkmann
Message-ID: <3c27d9c0-eb17-b20f-2d10-01f3bdf8c0d6@iogearbox.net>
Date: Tue, 3 Mar 2020 20:46:55 +0100
X-Mailing-List: linux-kernel@vger.kernel.org

On 2/29/20 12:53 AM, Willem de Bruijn wrote:
> On Fri, Feb 28, 2020 at 2:01 PM Jakub Kicinski wrote:
>> On Fri, 28 Feb 2020 02:54:35 -0800 Luigi Rizzo wrote:
>>> Add a netdevice flag to control skb linearization in generic xdp mode.
>>>
>>> The attribute can be modified through
>>> /sys/class/net//xdpgeneric_linearize
>>> The default is 1 (on)
>>>
>>> Motivation: xdp expects linear skbs with some minimum headroom, and
>>> generic xdp calls skb_linearize() if needed. The linearization is
>>> expensive, and may be unnecessary e.g. when the xdp program does
>>> not need access to the whole payload.
>>> This sysfs entry allows users to opt out of linearization on a
>>> per-device basis (linearization is still performed on cloned skbs).
>>>
>>> On a kernel instrumented to grab timestamps around the linearization
>>> code in netif_receive_generic_xdp, and heavy netperf traffic with 1500b
>>> mtu, I see the following times (nanoseconds/pkt)
>>>
>>> The receiver generally sees larger packets so the difference is more
>>> significant.
>>>
>>> ns/pkt                    RECEIVER                  SENDER
>>>
>>>                     p50     p90     p99       p50     p90     p99
>>>
>>> LINEARIZATION:     600ns  1090ns  4900ns     149ns   249ns   460ns
>>> NO LINEARIZATION:   40ns    59ns    90ns      40ns    50ns   100ns
>>>
>>> v1 --> v2 : added Documentation
>>> v2 --> v3 : adjusted for skb_cloned
>>> v3 --> v4 : renamed to xdpgeneric_linearize, documentation
>>>
>>> Signed-off-by: Luigi Rizzo
>>
>> Just load your program in cls_bpf. No extensions or knobs needed.
>>
>> Making xdpgeneric-only extensions without touching native XDP makes
>> no sense to me. Is this part of some greater vision?
>
> Yes, native xdp has the same issue when handling packets that exceed a
> page (4K+ MTU) or otherwise consist of multiple segments. The issue is
> just more acute in generic xdp. But agreed that both need to be solved
> together.
>
> Many programs need only access to the header. There currently is not a
> way to express this, or for xdp to convey that the buffer covers only
> part of the packet.

Right, the only question I had earlier was: when users ship their
application with /sys/class/net//xdpgeneric_linearize turned off, how
would they know how much of the data is actually pulled in? Afaik, some
drivers might only have a linear section that covers the eth header and
that is it. What should the BPF prog do in such a case? Drop the skb since
it does not have the rest of the data to e.g. make an XDP_PASS decision,
or fall back to tc/BPF altogether? As I hinted earlier, one way to make
this more graceful is to add a skb pointer inside e.g. struct xdp_rxq_info
and then enable a bpf_skb_pull_data()-like helper, e.g. as:

  BPF_CALL_2(bpf_xdp_pull_data, struct xdp_buff *, xdp, u32, len)
  {
          struct sk_buff *skb = xdp->rxq->skb;

          return skb ? bpf_try_make_writable(skb, len ? : skb_headlen(skb)) :
                 -ENOTSUPP;
  }

Thus, when the data/data_end test fails in generic XDP, the user can
call e.g. bpf_xdp_pull_data(xdp, 64) to make sure we pull in as much as
is needed w/o full linearization, and once done the data/data_end test
can be repeated to proceed. Native XDP will leave xdp->rxq->skb as NULL,
but later we could perhaps reuse the same bpf_xdp_pull_data() helper for
native with skb-less backing.

Thoughts?

Thanks,
Daniel
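
To illustrate the check, pull, re-check flow described above, here is a
rough userspace mock in plain C. It is only a sketch: struct mock_pkt,
struct mock_xdp, mock_xdp_pull_data() and mock_prog() are hypothetical
names made up for this example and are not kernel or libbpf APIs, and
dropping the packet when the pull fails is just one of the options raised
above, picked here as an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PKT_MAX 256

enum { MOCK_DROP = 0, MOCK_PASS = 1 };

/* Hypothetical stand-in for an skb: only the first linear_len bytes are
 * directly readable; the remainder would sit in non-linear fragments. */
struct mock_pkt {
	uint8_t bytes[PKT_MAX];
	uint32_t total_len;
	uint32_t linear_len;
};

/* Hypothetical stand-in for xdp_buff; pkt == NULL models native XDP,
 * where no skb is attached. */
struct mock_xdp {
	struct mock_pkt *pkt;
	uint8_t *data;
	uint8_t *data_end;
};

static void mock_xdp_init(struct mock_xdp *xdp, struct mock_pkt *pkt)
{
	assert(pkt->linear_len <= pkt->total_len && pkt->total_len <= PKT_MAX);
	xdp->pkt = pkt;
	xdp->data = pkt->bytes;
	xdp->data_end = pkt->bytes + pkt->linear_len;
}

/* Mock of the proposed helper: grow the linear area to at least len
 * bytes. Returns 0 on success, a negative value if no skb is attached
 * or the packet is shorter than len. */
static int mock_xdp_pull_data(struct mock_xdp *xdp, uint32_t len)
{
	struct mock_pkt *pkt = xdp->pkt;

	if (!pkt || len > pkt->total_len)
		return -1;
	if (len > pkt->linear_len)
		pkt->linear_len = len;
	/* data/data_end must be reloaded after a successful pull. */
	xdp->data = pkt->bytes;
	xdp->data_end = pkt->bytes + pkt->linear_len;
	return 0;
}

/* Program logic: needs `need` header bytes before it can decide. */
static int mock_prog(struct mock_xdp *xdp, uint32_t need)
{
	if (xdp->data + need > xdp->data_end) {
		/* Bounds test failed: pull only what is needed. */
		if (mock_xdp_pull_data(xdp, need) < 0)
			return MOCK_DROP;
		/* Repeat the data/data_end test after the pull. */
		if (xdp->data + need > xdp->data_end)
			return MOCK_DROP;
	}
	return MOCK_PASS;
}
```

With a 128-byte packet whose linear part covers only 14 bytes,
mock_prog(&xdp, 64) pulls the linear area up to 64 bytes and passes,
while a request beyond the packet length, or a context with no packet
attached (the native case in this mock), ends in a drop.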