From: Linus Torvalds
Date: Mon, 28 Sep 2020 10:54:28 -0700
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned
To: Peter Xu
Cc: Jason Gunthorpe, Leon Romanovsky, John Hubbard, Linux-MM,
    Linux Kernel Mailing List, Andrew Morton, Jan Kara, Michal Hocko,
    Kirill Tkhai, Kirill Shutemov, Hugh Dickins, Christoph Hellwig,
    Andrea Arcangeli, Oleg Nesterov, Jann Horn
In-Reply-To: <20200928172256.GB59869@xz-x1>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Sep 28, 2020 at 10:23 AM Peter Xu wrote:
>
> Yes... Actually I am also thinking about the complete solution to cover
> read-only fast-gups too, but now I start to doubt this, at least for the fork()
> path. E.g.
> if we'd finally like to use pte_protnone() to replace the current
> pte_wrprotect(), we'll be able to also block the read gups, but we'll suffer
> the same degradation on normal fork()s, or even more. Seems unacceptable.

So I think the real question about pinned read gups is what semantics
they should have.

Because honestly, I think we have two options:

 - the current "it gets a shared copy from the page tables"

 - the "this is an exclusive pin, and it _will_ follow the source VM
   changes, and never break"

because honestly, if we get a shared copy at the time of the pinning
(like we do now), then "fork()" is entirely immaterial. The fork() can
have happened ages ago, that page is shared with other processes, and
any process writing to it - including very much the pinning one - will
cause a copy-on-write and get a copy of the page.

IOW, the current - and past - semantics for read pinning are that you
get a copy of the page, but any changes made by the pinning process
may OR MAY NOT show up in your pinned copy.

Again: doing a concurrent fork() is entirely immaterial, because the
page can have been made a read-only COW page by _previous_ fork()
calls (or KSM logic or whatever).

In other words: read pinning gets a page efficiently, but there is
zero guarantee of any future coherence with the process doing
subsequent writes.

That has always been the semantics, and FOLL_PIN didn't change that at
all. You may have had things that worked almost by accident (ie you
had made the page private by writing to it after the fork, so the read
pinning _effectively_ gave you a page that was coherent), but even
that was always accidental rather than anything else. Afaik it could
easily be broken by KSM, for example.

In other words, a read pin isn't really any different from a read GUP.
You get a reference to a page that is valid at the time of the page
lookup, and absolutely nothing more.

Now, the alternative is to make a read pin have the same guarantees as
a write pin, and say "this will stay attached to this MM until unmap
or unpin".

But honestly, that is largely going to _be_ the same as a write pin,
because it absolutely needs to do a page COW at the time of the
pinning to get that initial exclusive guarantee in the first place.
Without that initial exclusivity, you cannot avoid future COW events
breaking the wrong way.

So I think the "you get a reference to the page at the time of the
pin, and the page _may_ or may not change under you if the original
process writes to it" are really the only relevant semantics. Because
if you need those exclusive semantics, you might as well just use a
write pin.

The downside of a write pin is that it not only makes that page
exclusive, it also (a) marks it dirty and (b) requires write access.
That can matter particularly for shared mappings. So if you know
you're doing the pin on a shared mmap, then a read pin is the right
thing, because the page will stay around - not because of the VM it
happens in, but because of the underlying file mapping!

See the difference?

> The other question is, whether we should emphasize and document somewhere that
> MADV_DONTFORK is still (and should always be) the preferred way, because
> changes like this series can potentially encourage the other way.

I really suspect that the concurrent fork() case is fundamentally hard
to handle.

Is it impossible? No. Even without any real locking, we could change
the code to do a seqcount_t, for example. The fastgup code wouldn't
take a lock, but it would just fail and fall back to the slow code if
the sequence count fails.

So the copy_page_range() code would do a write count around the copy:

    write_seqcount_begin(&mm->seq);
    .. do the copy ..
    write_seqcount_end(&mm->seq);

and the fast-gup code would do a

    seq = raw_read_seqcount(&mm->seq);
    if (seq & 1)
        return -EAGAIN;

at the top, and do a

    if (__read_seqcount_t_retry(&mm->seq, seq)) {
        .. Uhhuh, that failed, drop the ref to the page again ..
        return -EAGAIN;
    }

after getting the pin reference.

We could make this conditional on FOLL_PIN, or maybe even a new flag
("FOLL_FORK_CONSISTENT").

So I think we can serialize with fork() without serializing each and
every PTE. If we want to and really need to.

Hmm?

               Linus
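
For illustration only, the sequence-count scheme sketched above can be
modelled in plain userspace C with C11 atomics. This is a rough sketch
under stated assumptions, not kernel code: "struct toy_mm" and the
toy_*() helpers are made-up names, the plain atomic counter stands in
for seqcount_t, and "pin_count" stands in for the page reference that
a real pin would take.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for an mm with the proposed ->seq member. */
struct toy_mm {
    atomic_uint seq;        /* odd while a fork-style copy is running */
    atomic_int  pin_count;  /* stand-in for the page reference of a pin */
};

/* Writer side: what would bracket the copy in copy_page_range(). */
static void toy_fork_begin(struct toy_mm *mm)
{
    unsigned int old = atomic_fetch_add(&mm->seq, 1);
    assert((old & 1) == 0);     /* copies must not nest */
}

static void toy_fork_end(struct toy_mm *mm)
{
    unsigned int old = atomic_fetch_add(&mm->seq, 1);
    assert((old & 1) == 1);     /* must pair with toy_fork_begin() */
}

/* Reader side: lockless pin that backs off if it raced with a copy. */
static int toy_fast_pin(struct toy_mm *mm)
{
    unsigned int seq = atomic_load(&mm->seq);

    if (seq & 1)
        return -EAGAIN;         /* copy in progress: take the slow path */

    atomic_fetch_add(&mm->pin_count, 1);        /* take the reference */

    if (atomic_load(&mm->seq) != seq) {
        /* A copy started or finished meanwhile: drop the ref again. */
        atomic_fetch_sub(&mm->pin_count, 1);
        return -EAGAIN;
    }
    return 0;
}
```

The reader never blocks the writer: it only notices afterwards that a
copy overlapped its pin attempt, undoes the reference, and reports
-EAGAIN so the caller can retry through a slower, locked path.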