Date: Wed, 13 Apr 2022 14:52:08 -0300
From: Jason Gunthorpe
To: David Hildenbrand
Cc: Sean Christopherson, Andy Lutomirski, Chao Peng, kvm list,
    Linux Kernel Mailing List, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, Linux API, qemu-devel@nongnu.org,
    Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
    Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
    Borislav Petkov, the arch/x86 maintainers, "H. Peter Anvin",
    Hugh Dickins, Jeff Layton, "J . Bruce Fields", Andrew Morton,
    Mike Rapoport, Steven Price, "Maciej S . Szmigiero",
    Vlastimil Babka, Vishal Annapurve, Yu Zhang, "Kirill A. Shutemov",
    "Nakajima, Jun", Dave Hansen, Andi Kleen
Subject: Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
Message-ID: <20220413175208.GI64706@ziepe.ca>
References: <20220310140911.50924-1-chao.p.peng@linux.intel.com>
    <20220310140911.50924-5-chao.p.peng@linux.intel.com>
    <02e18c90-196e-409e-b2ac-822aceea8891@www.fastmail.com>
    <7ab689e7-e04d-5693-f899-d2d785b09892@redhat.com>
    <20220412143636.GG64706@ziepe.ca>
    <1686fd2d-d9c3-ec12-32df-8c4c5ae26b08@redhat.com>
In-Reply-To: <1686fd2d-d9c3-ec12-32df-8c4c5ae26b08@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Apr 13, 2022 at 06:24:56PM +0200, David Hildenbrand wrote:
> On 12.04.22 16:36, Jason Gunthorpe wrote:
> > On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:
> >
> >> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered in the
> >> past already with secretmem, it's not 100% that good of a fit (unmovable
> >> is worse than mlocked). But it gets the job done for now at least.
> >
> > No, it doesn't. There are too many different interpretations of how
> > MEMLOCK is supposed to work
> >
> > eg VFIO accounts per-process so hostile users can just fork to go past
> > it.
> >
> > RDMA is per-process but uses a different counter, so you can double up
> >
> > io_uring is per-user and uses a 3rd counter, so it can triple up on
> > the above two
>
> Thanks for that summary, very helpful.

I kicked off a big discussion when I suggested changing vfio to use
the same accounting as io_uring.

We may still end up trying it, but the major concern is that libvirt
sets the RLIMIT_MEMLOCK and if we touch anything here - including
fixing RDMA, or anything really - it becomes a uAPI break for libvirt.

> >> So I'm open for alternative to limit the amount of unmovable memory we
> >> might allocate for user space, and then we could convert secretmem as well.
> >
> > I think it has to be cgroup based considering where we are now :\
>
> Most probably. I think the important lessons we learned are that
>
> * mlocked != unmovable.
> * RLIMIT_MEMLOCK should most probably never have been abused for
>   unmovable memory (especially, long-term pinning)

The trouble is I'm not sure how anything can correctly/meaningfully
set a limit.

Consider qemu where we might have 3 different things all pinning the
same page (rdma, io_uring, vfio) - should the cgroup give 3x the limit?
What use is that really?

IMHO there are only two meaningful scenarios - either you are unpriv
and limited to a very small number for your user/cgroup - or you are
priv and you can do whatever you want.

The idea that we can fine tune this to exactly the right amount for a
workload does not seem realistic and ends up exporting internal kernel
decisions into a uAPI.

Jason
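
To make the accounting problem above concrete, here is a toy user-space
model, not kernel code. It assumes only what the thread states: vfio
charges a per-process counter, rdma charges a different per-process
counter, and io_uring charges a per-user counter, with each one checked
on its own against the same limit. The names PinAccounting, pin_vfio,
pin_rdma, pin_io_uring and demo are hypothetical and exist only for
this sketch.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # illustrative page size in bytes


class PinAccounting:
    """Toy model: three subsystems, three counters, one shared limit value."""

    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.vfio = defaultdict(int)      # keyed by pid (per-process)
        self.rdma = defaultdict(int)      # keyed by pid, but a separate counter
        self.io_uring = defaultdict(int)  # keyed by uid (per-user)

    def _charge(self, counter, key, nbytes):
        # Each subsystem compares only its own counter against the limit.
        if counter[key] + nbytes > self.limit:
            return False
        counter[key] += nbytes
        return True

    def pin_vfio(self, pid, uid, nbytes):
        return self._charge(self.vfio, pid, nbytes)

    def pin_rdma(self, pid, uid, nbytes):
        return self._charge(self.rdma, pid, nbytes)

    def pin_io_uring(self, pid, uid, nbytes):
        return self._charge(self.io_uring, uid, nbytes)

    def total_charged(self):
        # Sum of all charges; the same page pinned three ways counts three times.
        return sum(sum(c.values()) for c in (self.vfio, self.rdma, self.io_uring))


def demo():
    limit = 16 * PAGE_SIZE
    acct = PinAccounting(limit)

    # One process of one user pins a full limit's worth through each subsystem.
    all_three_allowed = all([
        acct.pin_vfio(pid=1, uid=1000, nbytes=limit),
        acct.pin_rdma(pid=1, uid=1000, nbytes=limit),
        acct.pin_io_uring(pid=1, uid=1000, nbytes=limit),
    ])
    charged_one_process = acct.total_charged()

    # A forked child has a new pid, so the per-process vfio counter starts empty.
    fork_bypasses_vfio = acct.pin_vfio(pid=2, uid=1000, nbytes=limit)

    # The per-user io_uring counter is already full for uid 1000.
    second_io_uring_pin_allowed = acct.pin_io_uring(pid=2, uid=1000, nbytes=PAGE_SIZE)

    return {
        "limit": limit,
        "all_three_allowed": all_three_allowed,
        "charged_one_process": charged_one_process,
        "fork_bypasses_vfio": fork_bypasses_vfio,
        "second_io_uring_pin_allowed": second_io_uring_pin_allowed,
    }
```

Calling demo() in this model shows a single process being charged three
times the nominal limit, and a fork getting past the per-process vfio
check while the per-user io_uring check still refuses. A single cgroup
counter would close those gaps, but it would then charge the same page
three times when all three subsystems pin it, which is the 3x question
raised above.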