Date: Tue, 21 Dec 2021 09:58:32 +0100
To: Jason Gunthorpe
Cc: Linus Torvalds, Nadav Amit, Linux Kernel Mailing List, Andrew Morton,
 Hugh Dickins, David Rientjes, Shakeel Butt, John Hubbard, Mike Kravetz,
 Mike Rapoport, Yang Shi,
Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , Linux-MM , "open list:KERNEL SELFTEST FRAMEWORK" , "open list:DOCUMENTATION" References: <20211218184233.GB1432915@nvidia.com> <5CA1D89F-9DDB-4F91-8929-FE29BB79A653@vmware.com> <4D97206A-3B32-4818-9980-8F24BC57E289@vmware.com> <5A7D771C-FF95-465E-95F6-CD249FE28381@vmware.com> <20211221010312.GC1432915@nvidia.com> From: David Hildenbrand Organization: Red Hat Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb) In-Reply-To: <20211221010312.GC1432915@nvidia.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 21.12.21 02:03, Jason Gunthorpe wrote: > On Sun, Dec 19, 2021 at 06:59:51PM +0100, David Hildenbrand wrote: > >> handler (COW or unshare). Outside of COW/Unshare code, the bit can only >> be set if page_count() == 1 and we sync against fork(). (and that's the >> problem for gup-fast-only that I'm investigating right now, because it >> would then always have to fallback to the slow variant if the bit isn't >> already set) > [in the meantime I figured out which pageflag we can reuse for anon pages, which is at least one step into the right direction] > I'm having a hard time imagining how gup_fast can maintain any sort of > bit - it lacks all forms of locks so how can we do an atomic test and > set between two pieces of data? And exactly that is to be figured out. Note that I am trying to make also any kind of R/O pins on an anonymous page work as expected as well, to fix any kind of GUP after fork() and GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page similarly has to make sure that the page is exclusive -- even if it's mapped R/O (!). In the pagefault handler we can then always reuse a PageAnonExclusive() page, because we know it's exclusive and it will stay exclusive because concurrent fork() isn't possible. > > I think the point of Linus's plan really is that the new bit is > derived from page_count, we get the set the new bit when we observe > page_count==1 in various situations and we clear the new bit whenever > we write protect with the intent to copy. Here are is one problems I'm fighting with: Assume we set the bit whenever we create a new anon page (either due to COW, ordinary fault, unsharing request, ..., even if it's mapped R/O first). We know the page is exclusive at that point because we created it and fork() could not happen yet. fork() is the only code that can share the page between processes and turn it non-exclusive. We can only clear the bit during fork() -- to turn the share page exclusive and map it R/O into both processes -- when we are sure that *nobody* concurrently takes a reference on the page the would be problematic (-> GUP). So to clear the bit during fork, we have to (1) Check against page_count == 1 (2) Synchronize against GUP (2) is easy using the mmap_lock and the mm->write_protect_seq BUT, it would mean that whenever we fork() and there is one additional reference on a page (even if it's from the swapcache), we would slow down fork() even if there was never any GUP. This would apply to any process out there that does a fork() ... 
So the idea is to mark a page exclusive only once someone needs the
page to be exclusive and stay exclusive (-> e.g., GUP with FOLL_PIN, or
selected FOLL_GET users like O_DIRECT). This can happen in my current
approach in one of two ways:

(1) Set the bit when we know we are the only users.

    We can set PageAnonExclusive() in case *we sync against fork()* and
    the page cannot get unmapped (pt lock) when:
    * The page is mapped writable
    * The page is mapped readable and page_count() == 1

    This should work during ordinary GUP in many cases. If we cannot
    set the page exclusive, we have to trigger a page fault.

(2) During pagefaults when FOLL_FAULT_UNSHARE is set.

    GUP will set FOLL_FAULT_UNSHARE for a pagefault when required
    (again, e.g., for FOLL_PIN or selected FOLL_GET users) and manual
    setting of the bit failed. The page fault will then try once again
    to set the bit if there is a page mapped, and if that fails, do the
    COW/unshare and set the bit.

The above should work fairly reliably with GUP. But indeed,
gup-fast-only is the problem. I'm still investigating what kind of
lightweight synchronization we could do against fork() such that we
wouldn't try setting a page PageAnonExclusive() while fork()
concurrently shares the page.

We could eventually use the page lock and do a try_lock(), both in
fork() and in gup-fast-only. fork() would only clear the bit if the
try_lock() succeeded. gup-fast-only would only be able to set the bit,
and avoid falling back to the slow path, if the try_lock() succeeded.
But I'm still investigating if there are better alternatives ...
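
Just to illustrate that try_lock() idea (again made-up helpers, not
actual kernel code -- this is exactly the part that is still under
investigation):

/*
 * The page lock arbitrates between fork() clearing the bit and
 * gup-fast-only setting it. fork() side: only clear if the trylock
 * succeeds, otherwise be conservative and copy.
 */
static bool fork_try_share_anon_page(struct page *page)
{
	bool shared = false;

	if (!trylock_page(page))
		return false;
	if (page_count(page) == 1) {
		ClearPageAnonExclusive(page);
		shared = true;
	}
	unlock_page(page);
	return shared;
}

/*
 * gup-fast-only side: may only set the bit (never clear it) and must
 * fall back to GUP-slow if the trylock fails. The exclusivity check
 * itself (mapped writable, or readable with page_count() == 1, see (1)
 * above) is omitted here for brevity.
 */
static bool gup_fast_try_mark_exclusive(struct page *page)
{
	bool exclusive = false;

	if (PageAnonExclusive(page))
		return true;	/* already exclusive, fast path is fine */
	if (trylock_page(page)) {
		SetPageAnonExclusive(page);
		exclusive = true;
		unlock_page(page);
	}
	return exclusive;	/* false -> fall back to the slow path */
}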
> GUP doesn't interact with this bit. A writable page would still be the
> second way you can say "you clearly have full ownership of the page",
> so GUP just checks writability and bumps the refcount. The challenge
> of GUP remains to sanely sequence it with things that are doing WP.
>
> The elevated GUP refcount prevents the page from getting the bit set
> again, regardless of what happens to it.
>
> Then on the WP side.. Obviously we clear the bit when applying a WP
> for copy. So all the bad GUP cases are halted now, as with a cleared
> bit and a != 1 refcount COW must happen.
>
> Then, it is also the case that most often a read-only page will have
> this bit cleared, UFFD WP being the exception.
>
> UFFD WP works fine as it will have the bit set in the cases we care
> about and COW will not happen.
>
> If the bit is not set then everything works as is today and you get
> extra COWs. We still have to fix the things that are missing the extra
> COWs to check the page ref to fix this.
>
> It seems this new bit is acting as a 'COW disable', so, the accuracy
> of COW vs GUP & speculative pagerefs now relies on setting the bit as
> aggressively as possible when it is safe and cheap to do so.

But we really want to avoid degrading fork() for everybody that
doesn't do heavy GUP ...

> If I got it right this is why it is not just mapcount reduced to 1
> bit. It is quite different, even though "this VM owns the page
> outright" sure sounds like "mapcount == 1"..
>
> It seems like an interesting direction - the security properties seem
> good as we only have to take care of sites applying WP to decide what
> to do with the extra bit, and all places that set the bit to 1 do so
> after testing the refcount under various locks preventing PTE WP.
>
> That just leaves the THP splitting.. I suppose we get the PTL, then
> compute the current value of the new bit based on the refcount and
> diffuse it to all tail pages, then update the PMD and release the PTL.
> Safe against concurrent WP - don't need DoubleMap horrors because it
> isn't a counter.

>> Not set in stone, just an idea what I'm playing with right now ... and
>> I have to triple-check if
>>  * page is PTE mapped in the page table I'm walking
>>  * page_count() == 1
>> really means "this is the only reference". I do strongly believe
>> so .. :)
>
> AFAIK the only places that can break this are places putting struct
> page memory into special PTEs. Which is horrific and is just bugs, but
> I think I've seen it from time to time :(

As we only care about anon pages, I think that doesn't apply. At least
that's what I hope.

--
Thanks,

David / dhildenb