Received: by 2002:a25:23cc:0:0:0:0:0 with SMTP id j195csp1068550ybj; Thu, 7 May 2020 14:07:54 -0700 (PDT) X-Google-Smtp-Source: APiQypKyXJjKVEnlpHfs2oxu4kJc8LKkuqzYybIlt1xAK57w+swHWl3kLopWEScOROiOzI/1soSI X-Received: by 2002:a17:906:7c42:: with SMTP id g2mr14342537ejp.328.1588885674059; Thu, 07 May 2020 14:07:54 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1588885674; cv=none; d=google.com; s=arc-20160816; b=kSFNLowA58oo9ENYYd6yXkvQ0iwenN0kzpz7lHqMBm82sXK+WhSuf9B/g6yY9nK8G6 P9MaA2xiBeUpiPhqwsBzomDYPOPOVZJspBdYptlEPfpNbrp58ivgMX0mKizgG6iZFKgi Xb43Ky+n/xzZvdrMY37HbSwag16v8U9gQzSCXBr1iy0TJAbQp3b6tnO5fy9d2crBTrBX d6E3rPW7X1uh1zEPsjHqJKhbQrzHokgycoqwyx81Pi35Rk021BSGe3QL/BgK/WU8Xh+w RXiP+VTPLFDTddmyJUxh9NwrCE+u3uv2cdos3a0FR4fLfLx8eA7jvxeFCgm7170okFWV LDzg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:from:references:cc:to:subject; bh=2TDE+k4zb5+R/aL6inREiMUN3aJKEl0dQWiurK8pPnE=; b=DDML9GBwhkXW+gXsReJrq7Gv+jrqmPeRxgh9w1YAroIVpbrJj5WOq9e1oBGBp5mH3n 6mGd2vYQ14jFTpRpZiGL8kU4NUOWih5ovs60g7J2eur6E/5asRkh1w9BnKoYw44h1sC1 JuaL0+15tn/HQ62x9wbdf1YrFbSrtKq05oXtltioJFQQ2PMfd+7oe9iTyPOiTm9MJ00o g7N3cA/dz+PSLBKHZJDkFtYFxgrr9uBySfWSqRA1SSKAff/0HZos6ElBEnNvfrrno6S3 GsQzHVaJavzxpZ4YZkjxBMxppLMpPhm4tymxFDHII0tC68wmOKR0zPRM9jnkiKmxXUkY +upQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id o14si4278707ejm.511.2020.05.07.14.07.30; Thu, 07 May 2020 14:07:54 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726616AbgEGVGI (ORCPT + 99 others); Thu, 7 May 2020 17:06:08 -0400 Received: from www62.your-server.de ([213.133.104.62]:40234 "EHLO www62.your-server.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726218AbgEGVGH (ORCPT ); Thu, 7 May 2020 17:06:07 -0400 Received: from sslproxy01.your-server.de ([78.46.139.224]) by www62.your-server.de with esmtpsa (TLSv1.2:DHE-RSA-AES256-GCM-SHA384:256) (Exim 4.89_1) (envelope-from ) id 1jWniO-0004wi-CU; Thu, 07 May 2020 23:05:56 +0200 Received: from [178.195.186.98] (helo=pc-9.home) by sslproxy01.your-server.de with esmtpsa (TLSv1.3:TLS_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1jWniO-0008Ol-3O; Thu, 07 May 2020 23:05:56 +0200 Subject: Re: [PATCH v3] net: bpf: permit redirect from ingress L3 to egress L2 devices at near max mtu To: =?UTF-8?Q?Maciej_=c5=bbenczykowski?= Cc: Alexei Starovoitov , Linux Network Development Mailing List , Linux Kernel Mailing List , BPF Mailing List , "David S . Miller" , Jakub Kicinski References: <20200507023606.111650-1-zenczykowski@gmail.com> From: Daniel Borkmann Message-ID: Date: Thu, 7 May 2020 23:05:55 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.7.2 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 8bit X-Authenticated-Sender: daniel@iogearbox.net X-Virus-Scanned: Clear (ClamAV 0.102.2/25805/Thu May 7 14:14:46 2020) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 5/7/20 6:46 PM, Maciej Żenczykowski wrote: > (a) not clear why the max is SKB_MAX_ALLOC in the first place (this is > PAGE_SIZE << 2, ie. 16K on x86), while lo mtu is 64k Agreed, tbh, it's not clear to me either atm. :) The SKB_MAX_ALLOC constant itself should be replaced with something more appropriate. Also as a small side note, the !skb->dev check should be removed since in tc ingress/egress the skb->dev is never NULL. (See also tc_cls_act_convert_ctx_access() on struct __sk_buff, ifindex conversion.) > (b) hmm, if we're not redirecting, then exceeding the ingress device's > mtu doesn't seem to be a problem. > > Indeed AFAIK this can already happen, some devices will round mtu up > when they configure the device mru buffers. > (ie. you configure L3 mtu 1500, they treat that as L2 1536 or 1532 [-4 > fcs], simply because 3 * 512 is a nice amount of buffers, or they'll > accept not only 1514 L2, but also 1518 L2 or even 1522 L2 for VLAN and > Q-IN-Q -- even if the packets aren't actually VLAN'ed, so your non > VLAN mru might be 1504 or 1508) > > Indeed my corp dell workstation with some standard 1 gigabit > motherboard nic has a standard default mtu of 1500, and I've seen it > receive L3 mtu 1520 packets (apparently due to misconfiguration in our > hardware [cisco? juniper?] ipv4->ipv6 translator which can take 1500 > mtu ipv4 packets and convert them to 1520 mtu ipv6 packets without > fragmenting or generating icmp too big errors). While it's obviously > wrong, it does just work (the network paths themselves are also > obviously 1520 clean). Right, agree on tc ingress side when skb goes further up the stack. > (c) If we are redirecting we'll eventually (after bpf program returns) > hit dev_queue_xmit(), and shouldn't that be what returns an error? You mean whether the check should be inside __dev_queue_xmit() itself instead? Maybe. From a cursory glance the MTU checks are spread in upper layer protos. As mentioned, we would need to have coverage for BPF progs attached on the tc ingress and egress (sch_handle_{ingress,egress}()) hook where for each case an skb can be just TC_ACT_OK'ed (up to stack or down to driver), redirected via ____dev_forward_skb() or dev_queue_xmit(). The ____dev_forward_skb() has us covered and for TC_ACT_OK on tc ingress, we'd only need a generic upper cap. So for the rest of the cases, we'd need to have some sort of sanity check somewhere. > btw. is_skb_forwardable() actually tests > - device is up && (packet is gso || skb->len < dev->mtu + > dev->hard_header_len + VLAN_HLEN); > > which is also wrong and in 2 ways, cause VLAN_HLEN makes no sense on > non ethernet, and the __bpf_skb_max_len function doesn't account for > VLAN... (which possibly has implications if you try to redirect to a > vlan interface) Yeah, it probably would for QinQ which is probably why noone was running into it so far. At least the skb_vlan_push() would first store the tag via __vlan_hwaccel_put_tag() in the skb itself. > I think having an mtu check is useful, but I think the mtu should be > selectable by the bpf program. Because it might not even be device > mtu at all, it might be path mtu which we should be testing against. > It should also be checked for gso frames, since the max post > segmentation size should be enforced. > > I agree we should expose dev->mtu (and dev->hard_header_len and hatype) > > I'll mull this over a bit more, but I'm not convinced this patch isn't ok as is. > There just is probably more we should do. Ok, makes sense. Thanks for looking into it!