Received: by 2002:a05:6358:3188:b0:123:57c1:9b43 with SMTP id q8csp26045359rwd; Mon, 3 Jul 2023 04:54:41 -0700 (PDT) X-Google-Smtp-Source: APBJJlGUNZBcDMzxZLa1Cu3DwHB/IMU1fu+GRYD+5kzhQbFIGX2/VVZCeSVUbz6oWP7ldjcN+tNh X-Received: by 2002:a05:6359:b9f:b0:134:ccde:596b with SMTP id gf31-20020a0563590b9f00b00134ccde596bmr7454691rwb.12.1688385280785; Mon, 03 Jul 2023 04:54:40 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1688385280; cv=none; d=google.com; s=arc-20160816; b=H2acyP2kz1TWUFP/Sx2UtoWdCUm/DPq0xwDEBGZutNKJoREwMw6veEMq4E9K2rmoaa 7XItJKl+O8j89OtamMzTS24xWhaDbf0cuHYPzJlAiLhOIcmdP73yu+bgTzDONqIrd5Rq qyVSuHEghIajAQwpxfm62Oclr8HzS7FTJuh1T4mz6F8WRxs8HpD3D7gBZ3YYlpXD5uWb EOzvbZaekRHpIBiy5qAX/8DOsr9YSejxRBstZji3GJEUwFKlvf9hj7/sH5xCGYBmpU5w gZ9gh/BwKavzP5WJyQwPuxdikItoc48IKqp72kjSNth+zp9fex44HAg9e8oks7DvbGRp BiBg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:subject:message-id:date:from:in-reply-to :references:mime-version:dkim-signature; bh=3dRFQnX6q2L+s/xMwOL4tS9PRX/lQXRRvVxG4UITLxI=; fh=jwlygcFILYvyxvhYBfDh8RF1KE70m5C8pe92KZFXpgM=; b=LLEUXCU5jKLIVWahpZ5TE7GKxruM5krEzXguOt1N4VcR8zU+mkaBtZiN3dfjrxu0ON 92GybpObg8rBikFDRgZ/RhsCDshnlZfloSKQYIXfF2fjY3+J6mVBiQMrEHeAPrvaQDGO 5xdslPfr0Vf7b01sE50ds/ulkZpT/xwCUNEe8iCmhMBOsyBp/KIE/rNyCvN5nfMihhD3 vMVIV6p2Vmt4EzvEzB1kMX80jf3RHsAJ2jnXuH0Hy+42PFP/xkUiRoEDMIAlol4h6srf YWgaB4B1yCgc9MZUyE5Dnm1wDU7cC9fR7zDW16SzBnsg0SqxowYyP2XSRKk2F8YYT41s pFTQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20221208 header.b=pNqQoyV5; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id az9-20020a056a02004900b0055b731aa9adsi6143896pgb.562.2023.07.03.04.54.27; Mon, 03 Jul 2023 04:54:40 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20221208 header.b=pNqQoyV5; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230070AbjGCLTk (ORCPT + 99 others); Mon, 3 Jul 2023 07:19:40 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43446 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229484AbjGCLTj (ORCPT ); Mon, 3 Jul 2023 07:19:39 -0400 Received: from mail-qv1-xf36.google.com (mail-qv1-xf36.google.com [IPv6:2607:f8b0:4864:20::f36]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 17481E3; Mon, 3 Jul 2023 04:19:38 -0700 (PDT) Received: by mail-qv1-xf36.google.com with SMTP id 6a1803df08f44-635e01eb981so7512476d6.1; Mon, 03 Jul 2023 04:19:38 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1688383177; x=1690975177; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=3dRFQnX6q2L+s/xMwOL4tS9PRX/lQXRRvVxG4UITLxI=; b=pNqQoyV585e+s71TqwI4udzGmKSOwwogfoCDXagF6WFiGPU0+Fk8IKNhxRILls4lCu wK0kczwgU2YBaZd3UfoTQHCfFe2yZFp5Tqhto2yF+zObqsDtkqK/GDRrJFYlyeGX/0HL pUR4aIlGtNLH0rRk7Q+G28NprzkSOyR8bdv/gLjgnT/u+QHZmxLO9DmlseDViGVIygGo Hza5IgR5bCsz2Wo0hGfr0GjVl3sSYRBlJ6MStxXcAhrfNaZu4ZMnD9dvBF9ekldyvmJq E2fB8fjZFIYcMl0LoFydW33RtcNhrNyUTpiiwiA0rYyGzqrNNrUMRliCofnJCJC6MvqM abbg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1688383177; x=1690975177; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=3dRFQnX6q2L+s/xMwOL4tS9PRX/lQXRRvVxG4UITLxI=; b=WDp//YFtxYzFHFtyrBtwrg7bllqfBpyMG9MEgqWiqtcKlE8v803FsZ7QOkmNupqG+4 MTsSOVeqsJSoJgZ5t5rs0Dh+wcxzNLQsB0L3DUl0ToJXLzpa7upL20GXyTTV7lsNm0hh AmvqZGI7Hz1RKz+nicPj4Wu056Wp5qOrYpnKPkuaACjdXrH4sJAHuBvWBpy2S3ZHBoTf B5CGeMLaIosqtCgVVX3UJuQcboys6yIYp9eZnCs8UTf8a87b5vwPTXRSk+I18I0Qk6HV 6pAtsAdIs149yGLYadeaetPzXJJ60JBFVJkapPqmZTN5ENtRamaRDvsj7sJlsonbEyPg VqCg== X-Gm-Message-State: ABy/qLac0Ulm4U5s1wHfVvhXTe6j1R408Z/ZtOjlv98AMG8SKIgo4Itp oRThGx7NWB5bYRFsmjbWXSoDuC8pR2IQOk/hMoF0vjIYBncbEA== X-Received: by 2002:a05:6214:202c:b0:625:aa48:e50f with SMTP id 12-20020a056214202c00b00625aa48e50fmr9574440qvf.6.1688383177097; Mon, 03 Jul 2023 04:19:37 -0700 (PDT) MIME-Version: 1.0 References: <20230630145831.2988845-1-i.maximets@ovn.org> <04ed302e-067e-d372-370b-3fef1cf8c7f2@ovn.org> <297fdd01-f1c6-6733-534c-8ed50b74c3ae@ovn.org> In-Reply-To: <297fdd01-f1c6-6733-534c-8ed50b74c3ae@ovn.org> From: Magnus Karlsson Date: Mon, 3 Jul 2023 13:19:25 +0200 Message-ID: Subject: Re: [RFC bpf-next] xsk: honor SO_BINDTODEVICE on bind To: Ilya Maximets Cc: netdev@vger.kernel.org, bpf@vger.kernel.org, =?UTF-8?B?QmrDtnJuIFTDtnBlbA==?= , Magnus Karlsson , Maciej Fijalkowski , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, Jason Wang , Stefan Hajnoczi Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM, RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 3 Jul 2023 at 13:16, Ilya Maximets wrote: > > On 7/3/23 12:24, Magnus Karlsson wrote: > > On Mon, 3 Jul 2023 at 12:13, Ilya Maximets wrote: > >> > >> On 7/3/23 12:06, Ilya Maximets wrote: > >>> On 7/3/23 11:48, Magnus Karlsson wrote: > >>>> On Fri, 30 Jun 2023 at 16:58, Ilya Maximets wrote: > >>>>> > >>>>> Initial creation of an AF_XDP socket requires CAP_NET_RAW capability. > >>>>> A privileged process might create the socket and pass it to a > >>>>> non-privileged process for later use. However, that process will be > >>>>> able to bind the socket to any network interface. Even though it will > >>>>> not be able to receive any traffic without modification of the BPF map, > >>>>> the situation is not ideal. > >>>>> > >>>>> Sockets already have a mechanism that can be used to restrict what > >>>>> interface they can be attached to. That is SO_BINDTODEVICE. > >>>>> > >>>>> To change the binding the process will need CAP_NET_RAW. > >>>>> > >>>>> Make xsk_bind() honor the SO_BINDTODEVICE in order to allow safer > >>>>> workflow when non-privileged process is using AF_XDP. > >>>> > >>>> Rebinding an AF_XDP socket is not allowed today. Any such attempt will > >>>> return an error from bind. So if I understand the purpose of > >>>> SO_BINDTODEVICE correctly, you could say that this option is always > >>>> set for an AF_XDP socket and it is not possible to toggle it. The only > >>>> way to "rebind" an AF_XDP socket is to close it and open a new one. > >>>> This was a conscious design decision from day one as it would be very > >>>> hard to support this, especially in zero-copy mode. > >>> > >>> Hi, Magnus. > >>> > >>> The purpose of this patch is not to allow re-binding. The use case is > >>> following: > >>> > >>> 1. First process creates a bare socket with socket(AF_XDP, ...). > >>> 2. First process loads the XSK program to the interface. > >>> 3. First process adds the socket fd to a BPF map. > >>> 4. First process sends socket fd to a second process. > >>> 5. Second process allocates UMEM. > >>> 6. Second process binds socket to the interface. > >> > >> 7. Second process sends/receives the traffic. :) > >> > >>> > >>> The idea is that the first process will call SO_BINDTODEVICE before > >>> sending socket fd to a second process, so the second process is limited > >>> in to which interface it can bind the socket. > >>> > >>> Does that make sense? > > > > Thanks for explaining this to me. Yes, that makes sense and seems > > useful. Could you please send a v2 and include the flow (1-7) above in > > your commit message? Would be good to add one step with the setsockopt > > SO_BINDTODEVICE before step #4 just to be clear. With those changes > > please feel free to include my ack: > > > > Acked-by: Magnus Karlsson > > Thanks! I'll update the commit message with the steps above to make it > more clear. > > I was planning to send a non-RFC version of this patch once the tree is > open (in a week). Or are the rules for bpf-next different? Bpf-next is always open I believe. > > > > Thank you! > > > >>> This workflow allows the second process to have no capabilities > >>> as long as it has sufficient RLIMIT_MEMLOCK. > >> > >> Note that steps 1-7 are working just fine today. i.e. the umem > >> registration, bind, ring mapping and traffic send/receive do not > >> require any extra capabilities. > >> > >> We may restrict the bind() call to require CAP_NET_RAW and then > >> allow it for sockets that had SO_BINDTODEVICE as an alternative. > >> But restriction will break the current uAPI. > >> > >>> > >>> Best regards, Ilya Maximets. > >>> > >>>> > >>>>> Signed-off-by: Ilya Maximets > >>>>> --- > >>>>> > >>>>> Posting as an RFC for now to probably get some feedback. > >>>>> Will re-post once the tree is open. > >>>>> > >>>>> Documentation/networking/af_xdp.rst | 9 +++++++++ > >>>>> net/xdp/xsk.c | 6 ++++++ > >>>>> 2 files changed, 15 insertions(+) > >>>>> > >>>>> diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst > >>>>> index 247c6c4127e9..1cc35de336a4 100644 > >>>>> --- a/Documentation/networking/af_xdp.rst > >>>>> +++ b/Documentation/networking/af_xdp.rst > >>>>> @@ -433,6 +433,15 @@ start N bytes into the buffer leaving the first N bytes for the > >>>>> application to use. The final option is the flags field, but it will > >>>>> be dealt with in separate sections for each UMEM flag. > >>>>> > >>>>> +SO_BINDTODEVICE setsockopt > >>>>> +-------------------------- > >>>>> + > >>>>> +This is a generic SOL_SOCKET option that can be used to tie AF_XDP > >>>>> +socket to a particular network interface. It is useful when a socket > >>>>> +is created by a privileged process and passed to a non-privileged one. > >>>>> +Once the option is set, kernel will refuse attempts to bind that socket > >>>>> +to a different interface. Updating the value requires CAP_NET_RAW. > >>>>> + > >>>>> XDP_STATISTICS getsockopt > >>>>> ------------------------- > >>>>> > >>>>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c > >>>>> index 5a8c0dd250af..386ff641db0f 100644 > >>>>> --- a/net/xdp/xsk.c > >>>>> +++ b/net/xdp/xsk.c > >>>>> @@ -886,6 +886,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len) > >>>>> struct sock *sk = sock->sk; > >>>>> struct xdp_sock *xs = xdp_sk(sk); > >>>>> struct net_device *dev; > >>>>> + int bound_dev_if; > >>>>> u32 flags, qid; > >>>>> int err = 0; > >>>>> > >>>>> @@ -899,6 +900,11 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len) > >>>>> XDP_USE_NEED_WAKEUP)) > >>>>> return -EINVAL; > >>>>> > >>>>> + bound_dev_if = READ_ONCE(sk->sk_bound_dev_if); > >>>>> + > >>>>> + if (bound_dev_if && bound_dev_if != sxdp->sxdp_ifindex) > >>>>> + return -EINVAL; > >>>>> + > >>>>> rtnl_lock(); > >>>>> mutex_lock(&xs->mutex); > >>>>> if (xs->state != XSK_READY) { > >>>>> -- > >>>>> 2.40.1 > >>>>> > >>>>> > >>> > >> >