From: David Hildenbrand (Red Hat)
Date: Tue, 21 Dec 2021 16:19:33 +0100
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)
To: Jason Gunthorpe
Cc: Linus Torvalds, Nadav Amit, Linux Kernel Mailing List, Andrew Morton,
 Hugh Dickins, David Rientjes, Shakeel Butt, John Hubbard, Mike Kravetz,
 Mike Rapoport, Yang Shi, Kirill A. Shutemov, Matthew Wilcox, Vlastimil Babka,
 Jann Horn, Michal Hocko, Rik van Riel, Roman Gushchin, Andrea Arcangeli,
 Peter Xu, Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
 Linux-MM, open list:KERNEL SELFTEST FRAMEWORK, open list:DOCUMENTATION
Message-ID: <303f21d3-42b4-2f11-3f22-28f89f819080@redhat.com>
In-Reply-To: <20211221142812.GD1432915@nvidia.com>
List-ID: linux-kernel@vger.kernel.org

On 21.12.21 15:28, Jason Gunthorpe wrote:
> On Tue, Dec 21, 2021 at 09:58:32AM +0100, David Hildenbrand wrote:
>>> I'm having a hard time imagining how gup_fast can maintain any sort of
>>> bit - it lacks all forms of locks so how can we do an atomic test and
>>> set between two pieces of data?
>>
>> And exactly that is to be figured out.
>>
>> Note that I am trying to make also any kind of R/O pins on an anonymous
>> page work as expected as well, to fix any kind of GUP after fork() and
>> GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page
>> similarly has to make sure that the page is exclusive -- even if it's
>> mapped R/O (!).
>
> Why? AFAIK we don't have bugs here. If the page is RO and has an
> elevated refcount it cannot be 'PageAnonExclusive' and so any place
> that wants to drop the WP just cannot. What is the issue?

Sure it can:

1. Map page R/W
2. Pin it R/W
3. Swapout
4. Read access

The page is now mapped R/O and *has to be* marked PageAnonExclusive() in
order to properly skip the COW fault. That's literally 60% of the
reproducers we have that need fixing.
But what I think you actually mean is whether we want to get R/O pins right.

>
>> BUT, it would mean that whenever we fork() and there is one additional
>> reference on a page (even if it's from the swapcache), we would slow
>> down fork() even if there was never any GUP. This would apply to any
>> process out there that does a fork() ...
>
> You mean because we'd copy?

Yes.

>
> Is this common? Linus' prior email was talking as though swap is so
> rare we shouldn't optimize for it?

At least in the enterprise segment, having swap enabled is mostly a hard,
documented requirement. On customer installations swap is still common,
and it even gets replaced by zswap, which is enabled automatically in many
installations ... So in the world I live and work in, swap is used
frequently.

>
>> So the idea is to mark a page only exclusive as soon as someone needs
>> the page to be exclusive and stay exclusive (-> e.g., GUP with FOLL_PIN
>> or selected FOLL_GET like O_DIRECT). This can happen in my current
>> approach using two ways:
>>
>> (1) Set the bit when we know we are the only users
>>
>> We can set PageAnonExclusive() in case *we sync against fork* and the
>> page cannot get unmapped (pt lock) when:
>> * The page is mapped writable
>> * The page is mapped readable and page_count == 1
>
> I'm still not sure I see that all this complexity is netting a gain?

Avoiding the copy on fork().

>
>> If we cannot set the page exclusive, we have to trigger a page fault.
>>
>> (2) During pagefaults when FOLL_FAULT_UNSHARE is set.
>
> Why do we need FOLL_FAULT_UNSHARE ? AFAICT that was part of this
> series because of mapcount, once the hugetlb COW is fixed to use
> refcount properly, as Linus showed, the bugs this was trying to fix go
> away.

The purpose of FOLL_FAULT_UNSHARE in the !mapcount version is to cleanly
support R/O pins without the need for FOLL_WRITE. And it's comparatively
easy to add on top. This is not the core of the complexity, really.
>
> And as discussed before it is OK if READ gup becomes incoherent, that
> is its defined semantic.

And that's where I still disagree. But anyhow, this is really more about
FOLL_FAULT_UNSHARE, which is pretty easy and natural to add on top and
just gets this right.

>
>> The above should work fairly reliably with GUP. But indeed,
>> gup-fast-only is the problem. I'm still investigating what kind of
>> lightweight synchronization we could do against fork() such that we
>> wouldn't try setting a page PageAnonExclusive() while fork()
>> concurrently shares the page.
>>
>> We could eventually use the page lock and do a try_lock(), both in
>> fork() and in gup-fast-only. fork() would only clear the bit if the
>> try_lock() succeeded. gup-fast-only would only be able to set the bit
>> and not fall back to the slow path if try_lock() succeeded.
>
> I suspect that is worse than just having fork clear the bit and leave
> GUP as-is. try lock is an atomic, clearing PageAnonExclusive does not
> need to be atomic, it is protected by the PTL.

There are 2 models, leaving FOLL_FAULT_UNSHARE out of the picture for now:

1) Whenever mapping an anonymous page R/W (after COW, during an ordinary
   fault, on swapin), we mark the page exclusive. We must never lose the
   PageAnonExclusive bit, not during migration, not during swapout.
   fork() will process the bit for each and every process, even if there
   was no GUP, and will copy if there are additional references.

2) Whenever GUP wants to pin/ref a page, we try marking it exclusive. We
   can lose the PageAnonExclusive bit during migration and swapout,
   because that can only happen when there are no additional references.
   fork() will process the bit only if there was GUP. Ordinary fork() is
   left unchanged.

Getting R/O pins supported in the same way just means that we have to
check on a R/O pin whether the page is PageAnonExclusive, and if that's
not the case, trigger a FOLL_FAULT_UNSHARE fault.
That's really the only "complexity" on top, which is, without the
mapcount, really easy.

-- 
Thanks,

David / dhildenb