2022-01-13 13:25:35

by zhangliang (AG)

[permalink] [raw]
Subject: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

In current implementation, process's read requestions will fault in pages
with WP flags in PTEs. Next, if process emit a write requestion will go
into do_wp_page() and copy data to a new allocated page from the old one
due to refcount > 1 (page table mapped and swapcache), which could be
result in performance degradation. In fact, this page is exclusively owned
by this process and the duplication from old to a new allocated page is
really unnecessary.

So In this situation, these unshared pages can be reused by its process.

Signed-off-by: Liang Zhang <[email protected]>
---
This patch has been tested with redis benchmark. Here is the test
result.

Hardware
========
Memory (GB): 512G
CPU (total #): 88
NVMe SSD (GB): 1024

OS
==
kernel 5.10.0

Testcase
========
step 1:
Run 16 VMs (4U8G), each running with redis-server, in a cgroup
limiting memory.limit_in_bytes to 100G.
step 2:
Run memtier_bemchmark in host with params "--threads=1 --clients=1 \
--pipeline=256 --data-size=2048 --requests=allkeys --key-minimum=1 \
--key-maximum=30000000 --key-prefix=memtier-benchmark-prefix-redistests"
to test every VM concurrently.

Workset size
============
cat memory.memsw.usage_in_bytes
125403303936

Result
======
Comparing with Baseline, this patch can achieved 41% more Ops/sec,
41% more Hits/sec, 41% more Misses/sec, 30% less Latency and
41% more KB/sec.

Index(average) Baseline kernel Patched kernel
Ops/sec 109497 155428
Hits/sec 8653 12283
Misses/sec 90889 129014
Latency 2.297 1.603
KB/sec 44569 63186


mm/memory.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 23f2f1300d42..fd4d868b1c2d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3291,10 +3291,16 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
struct page *page = vmf->page;

/* PageKsm() doesn't necessarily raise the page refcount */
- if (PageKsm(page) || page_count(page) != 1)
+ if (PageKsm(page))
goto copy;
if (!trylock_page(page))
goto copy;
+
+ /* reuse the unshared swapcache page */
+ if (PageSwapCache(page) && reuse_swap_page(page, NULL)) {
+ goto reuse;
+ }
+
if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
unlock_page(page);
goto copy;
@@ -3304,6 +3310,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
* page count reference, and the page is locked,
* it's dark out, and we're wearing sunglasses. Hit it.
*/
+reuse:
unlock_page(page);
wp_page_reuse(vmf);
return VM_FAULT_WRITE;
--
2.30.0



2022-01-13 14:39:43

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On Thu, Jan 13, 2022 at 10:03:18PM +0800, Liang Zhang wrote:
> In current implementation, process's read requestions will fault in pages
> with WP flags in PTEs. Next, if process emit a write requestion will go
> into do_wp_page() and copy data to a new allocated page from the old one
> due to refcount > 1 (page table mapped and swapcache), which could be
> result in performance degradation. In fact, this page is exclusively owned
> by this process and the duplication from old to a new allocated page is
> really unnecessary.
>
> So In this situation, these unshared pages can be reused by its process.

Let's bring Linus in on this, but I think this reintroduces all of the
mapcount problems that we've been discussing recently.

How about this as an alternative?

+++ b/mm/memory.c
@@ -3291,11 +3291,11 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
struct page *page = vmf->page;

/* PageKsm() doesn't necessarily raise the page refcount */
- if (PageKsm(page) || page_count(page) != 1)
+ if (PageKsm(page) || page_count(page) != 1 + PageSwapCache(page))
goto copy;
if (!trylock_page(page))
goto copy;
- if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
+ if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1 + PageSwapCache(page)) {
unlock_page(page);
goto copy;
}


> Signed-off-by: Liang Zhang <[email protected]>
> ---
> This patch has been tested with redis benchmark. Here is the test
> result.
>
> Hardware
> ========
> Memory (GB): 512G
> CPU (total #): 88
> NVMe SSD (GB): 1024
>
> OS
> ==
> kernel 5.10.0
>
> Testcase
> ========
> step 1:
> Run 16 VMs (4U8G), each running with redis-server, in a cgroup
> limiting memory.limit_in_bytes to 100G.
> step 2:
> Run memtier_bemchmark in host with params "--threads=1 --clients=1 \
> --pipeline=256 --data-size=2048 --requests=allkeys --key-minimum=1 \
> --key-maximum=30000000 --key-prefix=memtier-benchmark-prefix-redistests"
> to test every VM concurrently.
>
> Workset size
> ============
> cat memory.memsw.usage_in_bytes
> 125403303936
>
> Result
> ======
> Comparing with Baseline, this patch can achieved 41% more Ops/sec,
> 41% more Hits/sec, 41% more Misses/sec, 30% less Latency and
> 41% more KB/sec.
>
> Index(average) Baseline kernel Patched kernel
> Ops/sec 109497 155428
> Hits/sec 8653 12283
> Misses/sec 90889 129014
> Latency 2.297 1.603
> KB/sec 44569 63186
>
>
> mm/memory.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 23f2f1300d42..fd4d868b1c2d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3291,10 +3291,16 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> struct page *page = vmf->page;
>
> /* PageKsm() doesn't necessarily raise the page refcount */
> - if (PageKsm(page) || page_count(page) != 1)
> + if (PageKsm(page))
> goto copy;
> if (!trylock_page(page))
> goto copy;
> +
> + /* reuse the unshared swapcache page */
> + if (PageSwapCache(page) && reuse_swap_page(page, NULL)) {
> + goto reuse;
> + }
> +
> if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
> unlock_page(page);
> goto copy;
> @@ -3304,6 +3310,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> * page count reference, and the page is locked,
> * it's dark out, and we're wearing sunglasses. Hit it.
> */
> +reuse:
> unlock_page(page);
> wp_page_reuse(vmf);
> return VM_FAULT_WRITE;
> --
> 2.30.0
>
>

2022-01-13 14:47:03

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On 13.01.22 15:39, Matthew Wilcox wrote:
> On Thu, Jan 13, 2022 at 10:03:18PM +0800, Liang Zhang wrote:
>> In current implementation, process's read requestions will fault in pages
>> with WP flags in PTEs. Next, if process emit a write requestion will go
>> into do_wp_page() and copy data to a new allocated page from the old one
>> due to refcount > 1 (page table mapped and swapcache), which could be
>> result in performance degradation. In fact, this page is exclusively owned
>> by this process and the duplication from old to a new allocated page is
>> really unnecessary.
>>
>> So In this situation, these unshared pages can be reused by its process.
>
> Let's bring Linus in on this, but I think this reintroduces all of the
> mapcount problems that we've been discussing recently.
>
> How about this as an alternative?
>
> +++ b/mm/memory.c
> @@ -3291,11 +3291,11 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> struct page *page = vmf->page;
>
> /* PageKsm() doesn't necessarily raise the page refcount */
> - if (PageKsm(page) || page_count(page) != 1)
> + if (PageKsm(page) || page_count(page) != 1 + PageSwapCache(page))
> goto copy;
> if (!trylock_page(page))
> goto copy;
> - if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
> + if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1 + PageSwapCache(page)) {
> unlock_page(page);
> goto copy;
> }

Funny, I was staring at swap reuse code as I received this mail ...
because if we're not using reuse_swap_page() here anymore, we shouldn't
really be reusing it anywhere for consistency, most prominently in
do_swap_page() when we handle vmf->flags & FAULT_FLAG_WRITE just
similarly as we do here ...

And that's where things get hairy and I am still trying to figure out
all of the details.

Regarding above: If the page is swapped out in multiple processes but
was only faulted into the current process R/O, and then we try to write:

1. Still in the swapcache: PageSwapCache()
2. Mapped only by one process: page_mapcount(page) == 1
3. Reference from one page table and the swap cache: page_count(page) ==

But other processes could read-fault on the swapcache page, no?

I think we'd really have to check against the swapcount as well ...
essentially reuse_swap_page(), no?

--
Thanks,

David / dhildenb


2022-01-13 15:02:48

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On Thu, Jan 13, 2022 at 03:46:54PM +0100, David Hildenbrand wrote:
> On 13.01.22 15:39, Matthew Wilcox wrote:
> > On Thu, Jan 13, 2022 at 10:03:18PM +0800, Liang Zhang wrote:
> >> In current implementation, process's read requestions will fault in pages
> >> with WP flags in PTEs. Next, if process emit a write requestion will go
> >> into do_wp_page() and copy data to a new allocated page from the old one
> >> due to refcount > 1 (page table mapped and swapcache), which could be
> >> result in performance degradation. In fact, this page is exclusively owned
> >> by this process and the duplication from old to a new allocated page is
> >> really unnecessary.
> >>
> >> So In this situation, these unshared pages can be reused by its process.
> >
> > Let's bring Linus in on this, but I think this reintroduces all of the
> > mapcount problems that we've been discussing recently.
> >
> > How about this as an alternative?
> >
> > +++ b/mm/memory.c
> > @@ -3291,11 +3291,11 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> > struct page *page = vmf->page;
> >
> > /* PageKsm() doesn't necessarily raise the page refcount */
> > - if (PageKsm(page) || page_count(page) != 1)
> > + if (PageKsm(page) || page_count(page) != 1 + PageSwapCache(page))
> > goto copy;
> > if (!trylock_page(page))
> > goto copy;
> > - if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
> > + if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1 + PageSwapCache(page)) {
> > unlock_page(page);
> > goto copy;
> > }
>
> Funny, I was staring at swap reuse code as I received this mail ...
> because if we're not using reuse_swap_page() here anymore, we shouldn't
> really be reusing it anywhere for consistency, most prominently in
> do_swap_page() when we handle vmf->flags & FAULT_FLAG_WRITE just
> similarly as we do here ...
>
> And that's where things get hairy and I am still trying to figure out
> all of the details.
>
> Regarding above: If the page is swapped out in multiple processes but
> was only faulted into the current process R/O, and then we try to write:
>
> 1. Still in the swapcache: PageSwapCache()
> 2. Mapped only by one process: page_mapcount(page) == 1
> 3. Reference from one page table and the swap cache: page_count(page) ==
>
> But other processes could read-fault on the swapcache page, no?
>
> I think we'd really have to check against the swapcount as well ...
> essentially reuse_swap_page(), no?

Unfortunately the last digit is missing from your "3.", but I
think you're absolutely right; we need to check swapcount. So
once reuse_swap_page() checks page_count instead of mapcount, we'll
be good?

2022-01-13 15:04:28

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On 13.01.22 16:02, Matthew Wilcox wrote:
> On Thu, Jan 13, 2022 at 03:46:54PM +0100, David Hildenbrand wrote:
>> On 13.01.22 15:39, Matthew Wilcox wrote:
>>> On Thu, Jan 13, 2022 at 10:03:18PM +0800, Liang Zhang wrote:
>>>> In current implementation, process's read requestions will fault in pages
>>>> with WP flags in PTEs. Next, if process emit a write requestion will go
>>>> into do_wp_page() and copy data to a new allocated page from the old one
>>>> due to refcount > 1 (page table mapped and swapcache), which could be
>>>> result in performance degradation. In fact, this page is exclusively owned
>>>> by this process and the duplication from old to a new allocated page is
>>>> really unnecessary.
>>>>
>>>> So In this situation, these unshared pages can be reused by its process.
>>>
>>> Let's bring Linus in on this, but I think this reintroduces all of the
>>> mapcount problems that we've been discussing recently.
>>>
>>> How about this as an alternative?
>>>
>>> +++ b/mm/memory.c
>>> @@ -3291,11 +3291,11 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>>> struct page *page = vmf->page;
>>>
>>> /* PageKsm() doesn't necessarily raise the page refcount */
>>> - if (PageKsm(page) || page_count(page) != 1)
>>> + if (PageKsm(page) || page_count(page) != 1 + PageSwapCache(page))
>>> goto copy;
>>> if (!trylock_page(page))
>>> goto copy;
>>> - if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
>>> + if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1 + PageSwapCache(page)) {
>>> unlock_page(page);
>>> goto copy;
>>> }
>>
>> Funny, I was staring at swap reuse code as I received this mail ...
>> because if we're not using reuse_swap_page() here anymore, we shouldn't
>> really be reusing it anywhere for consistency, most prominently in
>> do_swap_page() when we handle vmf->flags & FAULT_FLAG_WRITE just
>> similarly as we do here ...
>>
>> And that's where things get hairy and I am still trying to figure out
>> all of the details.
>>
>> Regarding above: If the page is swapped out in multiple processes but
>> was only faulted into the current process R/O, and then we try to write:
>>
>> 1. Still in the swapcache: PageSwapCache()
>> 2. Mapped only by one process: page_mapcount(page) == 1
>> 3. Reference from one page table and the swap cache: page_count(page) ==
>>
>> But other processes could read-fault on the swapcache page, no?
>>
>> I think we'd really have to check against the swapcount as well ...
>> essentially reuse_swap_page(), no?
>
> Unfortunately the last digit is missing from your "3.", but I

Sorry, == 2.

> think you're absolutely right; we need to check swapcount. So
> once reuse_swap_page() checks page_count instead of mapcount, we'll
> be good?
>

That's something I've been thinking of. Either get rid of
reuse_swap_page() completely or make it obey the same rules everywhere.

It's highly inconsistent how we handle COW.

--
Thanks,

David / dhildenb


2022-01-13 16:38:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On Thu, Jan 13, 2022 at 6:39 AM Matthew Wilcox <[email protected]> wrote:
>
> Let's bring Linus in on this, but I think this reintroduces all of the
> mapcount problems that we've been discussing recently.
>
> How about this as an alternative?

No, at that point reuse_swap_page() is the better thing to do.

Don't play games with page_count() (or even worse games with
swap_count). The page count is only stable if it's 1. Any other value
means that it can fluctuate due to concurrent lookups, some of which
can be done locklessly under RCU.

The biggest problem in the COW path used to be that it was completely
incomprehensible. Plus it had that pointless synchronization if the
page was locked for entirely unrelated reasons.

Doing a "trylock()" - and just copying if it fails fixes that
pointless "let's wait because we know somebody else is doing something
to this page".

And doing the

if (PageSwapCache(page) && reuse_swap_page(page, NULL)) {

thing after holding the lock is certainly not incomprehensible.

I just wanted to try to avoid that on the assumption that swap really
isn't all that relevant any more - even swap cache.

The most incomprehensible part in that sequence is actually the KSM
tests. I think they are BS and should be removed (the page_count()
check should be sufficient), but they are so incomprehensible that I
left them when I did my crapectomy.

Linus

2022-01-13 16:48:57

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On 13.01.22 17:37, Linus Torvalds wrote:
> On Thu, Jan 13, 2022 at 6:39 AM Matthew Wilcox <[email protected]> wrote:
>>
>> Let's bring Linus in on this, but I think this reintroduces all of the
>> mapcount problems that we've been discussing recently.
>>
>> How about this as an alternative?
>
> No, at that point reuse_swap_page() is the better thing to do.
>
> Don't play games with page_count() (or even worse games with
> swap_count). The page count is only stable if it's 1. Any other value
> means that it can fluctuate due to concurrent lookups, some of which
> can be done locklessly under RCU.

I'm pretty sure the patch as is will reintroduce the CVE. So I think in
addition to the reuse_swap_page() check we need more.

I'm wondering if we can get rid of the mapcount checks in
reuse_swap_page() and instead check for page_count() and swapcount only.

We don't care if it's unstable in a sense than it will be bigger than
what we expect. In that case we COW as we would already do.

Thoughts?

--
Thanks,

David / dhildenb


2022-01-13 17:15:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On Thu, Jan 13, 2022 at 8:48 AM David Hildenbrand <[email protected]> wrote:
>
> I'm wondering if we can get rid of the mapcount checks in
> reuse_swap_page() and instead check for page_count() and swapcount only.

Honestly, I think even checking page_count() is pointless.

If the page has users, then that's fine. That's irrelevant for whether
it's a swap page or not, no?

So if you want to remove it from the swap cache, all that matters is that

(a) you have it locked so that there can be no new users trying to mix it up

(b) there are no swapcount() users of this page (which don't have a
ref to the page itself, they only have a swap entry), so that you
can't have somebody trying to look it up (whether some racy
"concurrent" lookup _or_ any later one, since we're about to remove
the swap cache association).

Why would "map_count()" matter - it's now many times the *page* is
mapped, it's irrelevant to swap cache status? And for the same reason,
what difference does "page_count()" have?

One big reason I did the COW rewrite was literally that the rules were
pure voodoo and made no sense at all. There was no underlying logic,
it was just a random collection of random tests that didn't have any
logical architecture to them.

Our VM is really really complicated already, so I really want our code
to make sense.

So even if I'm entirely wrong in my swap/map-count arguments above,
I'd like whatever patches in this area to be really well commented and
have some fundamental rules and logic to them so that people can read
the code and go "Ok, makes sense".

Please?

Linus

2022-01-13 17:25:37

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On 13.01.22 18:14, Linus Torvalds wrote:
> On Thu, Jan 13, 2022 at 8:48 AM David Hildenbrand <[email protected]> wrote:
>>
>> I'm wondering if we can get rid of the mapcount checks in
>> reuse_swap_page() and instead check for page_count() and swapcount only.
>
> Honestly, I think even checking page_count() is pointless.
>
> If the page has users, then that's fine. That's irrelevant for whether
> it's a swap page or not, no?
>
> So if you want to remove it from the swap cache, all that matters is that
>
> (a) you have it locked so that there can be no new users trying to mix it up
>
> (b) there are no swapcount() users of this page (which don't have a
> ref to the page itself, they only have a swap entry), so that you
> can't have somebody trying to look it up (whether some racy
> "concurrent" lookup _or_ any later one, since we're about to remove
> the swap cache association).
>
> Why would "map_count()" matter - it's now many times the *page* is
> mapped, it's irrelevant to swap cache status? And for the same reason,
> what difference does "page_count()" have?
>
> One big reason I did the COW rewrite was literally that the rules were
> pure voodoo and made no sense at all. There was no underlying logic,
> it was just a random collection of random tests that didn't have any
> logical architecture to them.
>
> Our VM is really really complicated already, so I really want our code
> to make sense.
>
> So even if I'm entirely wrong in my swap/map-count arguments above,
> I'd like whatever patches in this area to be really well commented and
> have some fundamental rules and logic to them so that people can read
> the code and go "Ok, makes sense".
>
> Please?

I might be missing something, but it's not only about whether we can remove
the page from the swap cache, it's about whether we can reuse the page
exclusively in a process with write access, avoiding a COW. And for that we
have to check if it's mapped somewhere else already (readable).

Here is a sketch (buggy, untested, uncompiled) of how I think reuse_swap_page()
could look like, removing any mapcount logic and incorporating the
refount, leaving the page_trans_huge_swapcount() changes etc. out of the picture.

Would that make any sense?


bool reuse_swap_page(struct page *page, bool mapped)
{
int swapcount = 0, total_swapcount;

VM_BUG_ON_PAGE(!PageLocked(page), page);
if (unlikely(PageKsm(page)))
return false;

if (PageSwapCache(page)) {
swapcount = page_trans_huge_swapcount(page, &total_swapcount);

if (swapcount == 1 && !mapped &&
(likely(!PageTransCompound(page)) ||
/* The remaining swap count will be freed soon */
total_swapcount == page_swapcount(page))) {
if (!PageWriteback(page)) {
page = compound_head(page);
delete_from_swap_cache(page);
SetPageDirty(page);
} else {
swp_entry_t entry;
struct swap_info_struct *p;

entry.val = page_private(page);
p = swap_info_get(entry);
if (p->flags & SWP_STABLE_WRITES) {
spin_unlock(&p->lock);
return false;
}
spin_unlock(&p->lock);
}
}
}

/*
* We expect exactly one ref from our mapped page (if already mapped)
* and one ref from the swapcache (if in the swapcache).
*/
if (!mapped)
return swapcount == 1 && page_count(page) == !!PageSwapCache(page)
return swapcount == 0 && page_count(page) == 1 + !!PageSwapCache(page)
}


--
Thanks,

David / dhildenb


2022-01-13 17:44:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On Thu, Jan 13, 2022 at 9:25 AM David Hildenbrand <[email protected]> wrote:
>
> I might be missing something, but it's not only about whether we can remove
> the page from the swap cache, it's about whether we can reuse the page
> exclusively in a process with write access, avoiding a COW. And for that we
> have to check if it's mapped somewhere else already (readable).

No.

The "try to remove from swap cache" is one thing. That uses the swap count.

The "see if we can reuse this page for COW" is a completely different
test, and that's the "page_count() == 1" one.

The two should not be mixed up with each other. Just don't do it.
There's no reason - except for legacy confusion that should be
actively avoided and removed.

IOW, the COW path would do

trylock - copy if fails
try to remove from swap cache
if page_count() is now 1, we can reuse it

Note how the "try to remove from swap cache" is entirely independent
of whether we then reuse it or not.

And yes, we can have optimistic other tests - like not even bothering
to trylock if we can see that the page-count is so elevated that it
makes no difference and trying to remove from swap cache would be just
pointless extra work (both the removal itself, and then potentially
later re-additions).

But those should be seen for what they are - not important for
semantics, only a "don't bother, this page has so many users that we
already know that removing the swapcache one doesn't make any
difference at all".

Now, it's possible that I'm missing something, but I think this kind
of clarity is very much what we should aim for. Clear rules, no mixing
of "can I COW this" with "can I remove this from the swap cache".

And now I need to start my travel nightmare, so I'll be effectively
offline for the next 24 hours ;(

Linus

2022-01-13 17:55:33

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On 13.01.22 18:44, Linus Torvalds wrote:
> On Thu, Jan 13, 2022 at 9:25 AM David Hildenbrand <[email protected]> wrote:
>>
>> I might be missing something, but it's not only about whether we can remove
>> the page from the swap cache, it's about whether we can reuse the page
>> exclusively in a process with write access, avoiding a COW. And for that we
>> have to check if it's mapped somewhere else already (readable).
>
> No.
>
> The "try to remove from swap cache" is one thing. That uses the swap count.

However, reuse_swap_page() currently does multiple things, and that's part of the issue.

>
> The "see if we can reuse this page for COW" is a completely different
> test, and that's the "page_count() == 1" one.
>
> The two should not be mixed up with each other. Just don't do it.
> There's no reason - except for legacy confusion that should be
> actively avoided and removed.
>
> IOW, the COW path would do
>
> trylock - copy if fails
> try to remove from swap cache
> if page_count() is now 1, we can reuse it

I thought about that exact sequence as well. I remember stumbling over
one of the other users of reuse_swap_page() that would require more thought
-- do_swap_page(). There, we essentially do a COW before having the
page mapped. (nasty)

I'll give it more thought.

>
> Note how the "try to remove from swap cache" is entirely independent
> of whether we then reuse it or not.
>
> And yes, we can have optimistic other tests - like not even bothering
> to trylock if we can see that the page-count is so elevated that it
> makes no difference and trying to remove from swap cache would be just
> pointless extra work (both the removal itself, and then potentially
> later re-additions).
>
> But those should be seen for what they are - not important for
> semantics, only a "don't bother, this page has so many users that we
> already know that removing the swapcache one doesn't make any
> difference at all".

Right.

>
> Now, it's possible that I'm missing something, but I think this kind
> of clarity is very much what we should aim for. Clear rules, no mixing
> of "can I COW this" with "can I remove this from the swap cache".

I consider reuse_swap_page() at this point just absolutely nasty.

While we're at it, is there a real reason we can't simplify to

diff --git a/mm/memory.c b/mm/memory.c
index e8e2144cbfa6..ab114a5862a0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3295,7 +3295,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
goto copy;
if (!trylock_page(page))
goto copy;
- if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
+ if (PageKsm(page) || page_count(page) != 1) {
unlock_page(page);
goto copy;

Our page mapping has to hold a reference, so it seems unnecessary to check both.

>
> And now I need to start my travel nightmare, so I'll be effectively
> offline for the next 24 hours ;(

Happy traveling then :) No worries, I'll be working on all this more
than 24 hours, especially PageAnonExclusive() that makes my head
hurt when it comes to swap, but this discussion already helps a lot
on how to eventually sort it all out.

--
Thanks,

David / dhildenb


2022-01-13 18:55:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On Thu, Jan 13, 2022 at 9:55 AM David Hildenbrand <[email protected]> wrote:
>
> However, reuse_swap_page() currently does multiple things, and that's part of the issue.

Yeah, I think it's a horrible function. The COW path was explicitly
changed not to use it because it's so confusing.

> While we're at it, is there a real reason we can't simplify to
>
> diff --git a/mm/memory.c b/mm/memory.c
> index e8e2144cbfa6..ab114a5862a0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3295,7 +3295,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> - if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
> + if (PageKsm(page) || page_count(page) != 1) {

No, I think both the mapcount and the ksm check are bogus and should
be removed, but they are both examples of "get rid of the nastiest
complexity, leave the stuff that doesn't much matter".

Somebody who really knows the ksm logic should check all the ksm
tests. I really think they are pointless, and came from the old and
horribly broken page_mapcount() logic that did *not* check
page_count().

When you think page_mapcount() matters, suddenly you have to worry
about all the other non-mapping uses of a page, which is why I think
that ksm test still exists. But I didn't mind keeping a couple of
extraneous tests that I didn't see the point of , as long as the core
page-count-based logfc was solid

Linus

2022-01-13 21:07:32

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On Thu, Jan 13, 2022 at 09:44:04AM -0800, Linus Torvalds wrote:
> IOW, the COW path would do
>
> trylock - copy if fails
> try to remove from swap cache
> if page_count() is now 1, we can reuse it
>
> Note how the "try to remove from swap cache" is entirely independent
> of whether we then reuse it or not.
>
> And yes, we can have optimistic other tests - like not even bothering
> to trylock if we can see that the page-count is so elevated that it
> makes no difference and trying to remove from swap cache would be just
> pointless extra work (both the removal itself, and then potentially
> later re-additions).
>
> But those should be seen for what they are - not important for
> semantics, only a "don't bother, this page has so many users that we
> already know that removing the swapcache one doesn't make any
> difference at all".

I think what you mean is something like ...

+++ b/mm/memory.c
@@ -3290,15 +3290,13 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
if (PageAnon(vmf->page)) {
struct page *page = vmf->page;

- /* PageKsm() doesn't necessarily raise the page refcount */
- if (PageKsm(page) || page_count(page) != 1)
+ /* Lots of people are using this page, just copy */
+ if (page_count(page) > 5)
goto copy;
if (!trylock_page(page))
goto copy;
- if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
- unlock_page(page);
- goto copy;
- }
+ if (!reuse_swap_page(page, NULL))
+ goto unlock;
/*
* Ok, we've got the only map reference, and the only
* page count reference, and the page is locked,
@@ -3311,6 +3309,8 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
(VM_WRITE|VM_SHARED))) {
return wp_page_shared(vmf);
}
+unlock:
+ unlock_page(page);
copy:
/*
* Ok, we need to copy. Oh, well..

... with a suitably sensible reuse_swap_page().

2022-01-14 03:29:40

by zhangliang (AG)

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

Thank you very much for your reply, Matthew Wilcox.

On 2022/1/13 22:39, Matthew Wilcox wrote:
> On Thu, Jan 13, 2022 at 10:03:18PM +0800, Liang Zhang wrote:
>> In current implementation, process's read requestions will fault in pages
>> with WP flags in PTEs. Next, if process emit a write requestion will go
>> into do_wp_page() and copy data to a new allocated page from the old one
>> due to refcount > 1 (page table mapped and swapcache), which could be
>> result in performance degradation. In fact, this page is exclusively owned
>> by this process and the duplication from old to a new allocated page is
>> really unnecessary.
>>
>> So In this situation, these unshared pages can be reused by its process.
>
> Let's bring Linus in on this, but I think this reintroduces all of the
> mapcount problems that we've been discussing recently.
>
> How about this as an alternative?
>
> +++ b/mm/memory.c
> @@ -3291,11 +3291,11 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> struct page *page = vmf->page;
>
> /* PageKsm() doesn't necessarily raise the page refcount */
> - if (PageKsm(page) || page_count(page) != 1)
> + if (PageKsm(page) || page_count(page) != 1 + PageSwapCache(page))

I don't think the only PageSwapCache() check can replace reuse_swap_page(), since
the latter also includes a very important check -- stable page requirement,
see Fixes: f05714293a59 (mm: support anonymous stable page) . Without this additional check,
e.g. page is writebacking by zram under us, at this moment we make this page R/W to its
process which may change memory content right now will trigger BUG.

> goto copy;
> if (!trylock_page(page))
> goto copy;
> - if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
> + if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1 + PageSwapCache(page)) {
> unlock_page(page);
> goto copy;
> }
>
>
>> Signed-off-by: Liang Zhang <[email protected]>
>> ---
>> This patch has been tested with redis benchmark. Here is the test
>> result.
>>
>> Hardware
>> ========
>> Memory (GB): 512G
>> CPU (total #): 88
>> NVMe SSD (GB): 1024
>>
>> OS
>> ==
>> kernel 5.10.0
>>
>> Testcase
>> ========
>> step 1:
>> Run 16 VMs (4U8G), each running with redis-server, in a cgroup
>> limiting memory.limit_in_bytes to 100G.
>> step 2:
>> Run memtier_bemchmark in host with params "--threads=1 --clients=1 \
>> --pipeline=256 --data-size=2048 --requests=allkeys --key-minimum=1 \
>> --key-maximum=30000000 --key-prefix=memtier-benchmark-prefix-redistests"
>> to test every VM concurrently.
>>
>> Workset size
>> ============
>> cat memory.memsw.usage_in_bytes
>> 125403303936
>>
>> Result
>> ======
>> Comparing with Baseline, this patch can achieved 41% more Ops/sec,
>> 41% more Hits/sec, 41% more Misses/sec, 30% less Latency and
>> 41% more KB/sec.
>>
>> Index(average) Baseline kernel Patched kernel
>> Ops/sec 109497 155428
>> Hits/sec 8653 12283
>> Misses/sec 90889 129014
>> Latency 2.297 1.603
>> KB/sec 44569 63186
>>
>>
>> mm/memory.c | 9 ++++++++-
>> 1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 23f2f1300d42..fd4d868b1c2d 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3291,10 +3291,16 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>> struct page *page = vmf->page;
>>
>> /* PageKsm() doesn't necessarily raise the page refcount */
>> - if (PageKsm(page) || page_count(page) != 1)
>> + if (PageKsm(page))
>> goto copy;
>> if (!trylock_page(page))
>> goto copy;
>> +
>> + /* reuse the unshared swapcache page */
>> + if (PageSwapCache(page) && reuse_swap_page(page, NULL)) {
>> + goto reuse;
>> + }
>> >> if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
>> unlock_page(page);
>> goto copy;
>> @@ -3304,6 +3310,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>> * page count reference, and the page is locked,
>> * it's dark out, and we're wearing sunglasses. Hit it.
>> */
>> +reuse:
>> unlock_page(page);
>> wp_page_reuse(vmf);
>> return VM_FAULT_WRITE;
>> --
>> 2.30.0
>>
>>
> .

--
Best Regards,
Liang Zhang

2022-01-14 05:06:21

by zhangliang (AG)

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page



On 2022/1/14 0:48, David Hildenbrand wrote:
> On 13.01.22 17:37, Linus Torvalds wrote:
>> On Thu, Jan 13, 2022 at 6:39 AM Matthew Wilcox <[email protected]> wrote:
>>>
>>> Let's bring Linus in on this, but I think this reintroduces all of the
>>> mapcount problems that we've been discussing recently.
>>>
>>> How about this as an alternative?
>>
>> No, at that point reuse_swap_page() is the better thing to do.
>>
>> Don't play games with page_count() (or even worse games with
>> swap_count). The page count is only stable if it's 1. Any other value
>> means that it can fluctuate due to concurrent lookups, some of which
>> can be done locklessly under RCU.
>
> I'm pretty sure the patch as is will reintroduce the CVE. So I think in

Actually, I wonder how reuse_swap_page() in this patch can reintroduce CVE,
I think the invoking logic here is as same as that in do_swap_page().
So, could you give me some hint about this? Thanks :)

> addition to the reuse_swap_page() check we need more.
>
> I'm wondering if we can get rid of the mapcount checks in
> reuse_swap_page() and instead check for page_count() and swapcount only.
>
> We don't care if it's unstable in a sense than it will be bigger than
> what we expect. In that case we COW as we would already do.
>
> Thoughts?
>

--
Best Regards,
Liang Zhang

2022-01-14 11:23:49

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On 14.01.22 06:00, zhangliang (AG) wrote:
>
>
> On 2022/1/14 0:48, David Hildenbrand wrote:
>> On 13.01.22 17:37, Linus Torvalds wrote:
>>> On Thu, Jan 13, 2022 at 6:39 AM Matthew Wilcox <[email protected]> wrote:
>>>>
>>>> Let's bring Linus in on this, but I think this reintroduces all of the
>>>> mapcount problems that we've been discussing recently.
>>>>
>>>> How about this as an alternative?
>>>
>>> No, at that point reuse_swap_page() is the better thing to do.
>>>
>>> Don't play games with page_count() (or even worse games with
>>> swap_count). The page count is only stable if it's 1. Any other value
>>> means that it can fluctuate due to concurrent lookups, some of which
>>> can be done locklessly under RCU.
>>
>> I'm pretty sure the patch as is will reintroduce the CVE. So I think in
>
> Actually, I wonder how reuse_swap_page() in this patch can reintroduce CVE,
> I think the invoking logic here is as same as that in do_swap_page().
> So, could you give me some hint about this? Thanks :)

The CVE essentially is

a) Process A maps an anon page
-> refcount = 1, mapcount = 1
b) Process A forks process B
-> Page mapped R/O
-> refcount = 2, mapcount = 2
c) Process B pins the page R/O (e.g., vmsplice)
-> refcount > 2, mapcount = 2
d) Process B unmaps the page
-> refcount > 1, mapcount = 1
e) Process A writes to the page

If e) ends up reusing the page for process A (e.g., because mapcount == 1),
process B can observe the memory modifications via the
page pinning

As I previously said, our COW logic is completely inconsistent
and should be streamlined. Regarding the CVE, what would have
to happen is that we put the page into the swapcache and unmap
from both processes after c). Then, do_swap_page() would reuse
the page during e) similarly.

We do have a check in place that protects against pinned pages
not being able to enter the swapcache, but IIRC it can race with
GUP, so it might race with c) eventually, meaning that with swap
there might still be a small chance for the CVE in current upstream IIRC.

I think with your patch we could trick do_wp_page() do go down the
reuse_swap_page() path and reuse the page, even though there is a pending
pin on the page from process B.


a) Process A maps an anon page
-> refcount = 1, mapcount = 1
b) Page is added to the swapcache and unmapped
-> refcount = 1, mapcount = 0
c) Process A reads page
-> do_swap_page() will map the page R/O
-> the page might remain in the swapcache
-> refcount = 2, mapcount = 1
d) Process A forks process B
-> Page mapped R/O
-> refcount = 3, mapcount = 2
e) Process B pins the page R/O (e.g., vmsplice)
-> refcount > 3, mapcount = 2
d) Process B unmaps the page
-> refcount > 2, mapcount = 1
e) Process A writes to the page

In e), we would go into reuse_swap_page() and could succeed, being left with
refcount > 1 and mapcount = 1.

But maybe I am missing something :)



There are only two real ways I know to handle the scenario above:
(1) Rely on page_count() == 1 in COW logic to decide whether we can
reuse a page. This comes with false negatives, some being harmful,
some just being a performance problem.
(2) Rely on mapcount(), swapcount() ... when deciding to reuse a page
and when it's safe to take a GUP pin.

(2) was discussed in
https://lkml.kernel.org/r/[email protected]
but was essentially NAKed by Linus in the current form (well, it didn't
consider the swapcount so it was just buggy, fortunately Linus pointed that
out).

To handle problematic false negatives with (1) I'm looking into marking
anon pages as exclusively owned by a single process, such that we can simply
reuse them in the COW path -- PageAnonExclusive() also discussed inside that
posting. That posting also contains a test case which tests all different
variants of the "child can observe MAP_PRIVAT modifications of the parent"
we know of (e.g., THP, hugetlb), which you might use to verify if the issue is
reintroduced by your patch or not.


Yesterday, I played with removing reuse_swap_page() completely
and streamlining our COW logic like so:

(based on https://www.spinics.net/lists/linux-mm/msg279460.html)


commit f1fa31a40b13ade5cd93ceb40daadf1196526145
Author: David Hildenbrand <[email protected]>
Date: Fri Jan 14 09:29:52 2022 +0100

mm: optimize do_wp_page() for exclusive pages in the swapcache

Let's optimize for a page with a single user that has been added to the
swapcache. Try removing the swapcache reference if there is hope that
we're the exclusive user, but keep the page_count(page) == 1 check in
place.

Avoid using reuse_swap_page(), we'll streamline all reuse_swap_page()
users next.

While at it, remove the superfluous page_mapcount() check: it's
implicitly covered by the page_count() for ordinary anon pages.

Signed-off-by: David Hildenbrand <[email protected]>

diff --git a/mm/memory.c b/mm/memory.c
index e8e2144cbfa6..bd2af7a36791 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3290,12 +3290,21 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
if (PageAnon(vmf->page)) {
struct page *page = vmf->page;

- /* PageKsm() doesn't necessarily raise the page refcount */
- if (PageKsm(page) || page_count(page) != 1)
+ /*
+ * PageKsm() doesn't necessarily raise the page refcount.
+ *
+ * These checks are racy as long as we haven't locked the page;
+ * they are a pure optimization to avoid trying to lock the page
+ * and trying to free the swap cache when there is little hope
+ * it will actually result in a refcount of 1.
+ */
+ if (PageKsm(page) || page_count(page) <= 1 + PageSwapCache(page))
goto copy;
if (!trylock_page(page))
goto copy;
- if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
+ if (PageSwapCache(page))
+ try_to_free_swap(page);
+ if (PageKsm(page) || page_count(page) != 1) {
unlock_page(page);
goto copy;
}



Followed by



commit 754aa8f04f7b6d1ffc2b9844be63bbf7b527929f (HEAD)
Author: David Hildenbrand <[email protected]>
Date: Fri Jan 14 09:53:20 2022 +0100

mm: streamline COW logic in do_swap_page()

Let's apply the same COW logic as in do_wp_page(), conditionally trying to
remove the page from the swapcache after freeing the swap entry, however,
before actually mapping our page. We can change the order now that
reuse_swap_page() is no longer used.

Signed-off-by: David Hildenbrand <[email protected]>


diff --git a/mm/memory.c b/mm/memory.c
index bd2af7a36791..a1bd2b5c818a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3489,6 +3489,24 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
return 0;
}

+static inline bool should_try_to_free_swap(struct page *page,
+ struct vm_area_struct *vma,
+ unsigned int fault_flags)
+{
+ if (!PageSwapCache(page))
+ return false;
+ if (mem_cgroup_swap_full(page) || (vma->vm_flags & VM_LOCKED) ||
+ PageMlocked(page))
+ return true;
+ /*
+ * We must only reuse the page if the page_count() is 1. Try to
+ * free the swapcache to get rid of the swapcache reference in case
+ * it's likely that we will succeed.
+ */
+ return (fault_flags & FAULT_FLAG_WRITE) && !PageKsm(page) &&
+ page_count(page) == 2;
+}
+
/*
* We enter with non-exclusive mmap_lock (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
@@ -3612,7 +3630,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}

/*
- * Make sure try_to_free_swap or reuse_swap_page or swapoff did not
+ * Make sure try_to_free_swap or swapoff did not
* release the swapcache from under us. The page pin, and pte_same
* test below, are not enough to exclude that. Even if it is still
* swapcache, we need to check that the page's swap has not changed.
@@ -3644,19 +3662,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}

/*
- * The page isn't present yet, go ahead with the fault.
- *
- * Be careful about the sequence of operations here.
- * To get its accounting right, reuse_swap_page() must be called
- * while the page is counted on swap but not yet in mapcount i.e.
- * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
- * must be called after the swap_free(), or it will never succeed.
+ * Remove the swap entry and conditionally try to free up the swapcache.
+ * We're already holding a reference on the page but haven't mapped it
+ * yet. We won't be waiting on writeback, so if there is concurrent
+ * writeback we won't map the page writable just now.
*/
+ swap_free(entry);
+ if (should_try_to_free_swap(page, vma, vmf->flags))
+ try_to_free_swap(page);

inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
pte = mk_pte(page, vma->vm_page_prot);
- if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
+
+ /* Same logic as in do_wp_page(). */
+ if ((vmf->flags & FAULT_FLAG_WRITE) &&
+ !PageKsm(page) && page_count(page) == 1) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
vmf->flags &= ~FAULT_FLAG_WRITE;
ret |= VM_FAULT_WRITE;
@@ -3681,10 +3702,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
}

- swap_free(entry);
- if (mem_cgroup_swap_full(page) ||
- (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
- try_to_free_swap(page);
unlock_page(page);
if (page != swapcache && swapcache) {
/*




And followed by more patches that rip out reuse_swap_page() completely.


But I still have to understand what's safe to do and not do in do_swap_page(),
and which effect it would have. Especially PTE-mapped THP are a head-scratcher
for me, and the interaction with THP in the swapcache before/after actual swap.

So consider above a WIP prototype that's mostly untested.

--
Thanks,

David / dhildenb


2022-01-17 11:08:33

by zhangliang (AG)

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

Hi, thank you very much for your patient reply, and I completely agree with the viewpoint from you and Linus about COW and reuse_swap_page(). :)

On 2022/1/14 19:23, David Hildenbrand wrote:
> On 14.01.22 06:00, zhangliang (AG) wrote:
>>
>>
>> On 2022/1/14 0:48, David Hildenbrand wrote:
>>> On 13.01.22 17:37, Linus Torvalds wrote:
>>>> On Thu, Jan 13, 2022 at 6:39 AM Matthew Wilcox <[email protected]> wrote:
>>>>>
>>>>> Let's bring Linus in on this, but I think this reintroduces all of the
>>>>> mapcount problems that we've been discussing recently.
>>>>>
>>>>> How about this as an alternative?
>>>>
>>>> No, at that point reuse_swap_page() is the better thing to do.
>>>>
>>>> Don't play games with page_count() (or even worse games with
>>>> swap_count). The page count is only stable if it's 1. Any other value
>>>> means that it can fluctuate due to concurrent lookups, some of which
>>>> can be done locklessly under RCU.
>>>
>>> I'm pretty sure the patch as is will reintroduce the CVE. So I think in
>>
>> Actually, I wonder how reuse_swap_page() in this patch can reintroduce CVE,
>> I think the invoking logic here is as same as that in do_swap_page().
>> So, could you give me some hint about this? Thanks :)
>
> The CVE essentially is
>
> a) Process A maps an anon page
> -> refcount = 1, mapcount = 1
> b) Process A forks process B
> -> Page mapped R/O
> -> refcount = 2, mapcount = 2
> c) Process B pins the page R/O (e.g., vmsplice)
> -> refcount > 2, mapcount = 2
> d) Process B unmaps the page
> -> refcount > 1, mapcount = 1
> e) Process A writes to the page
>
> If e) ends up reusing the page for process A (e.g., because mapcount == 1),
> process B can observe the memory modifications via the
> page pinning
>
> As I previously said, our COW logic is completely inconsistent
> and should be streamlined. Regarding the CVE, what would have
> to happen is that we put the page into the swapcache and unmap
> from both processes after c). Then, do_swap_page() would reuse
> the page during e) similarly.
>
> We do have a check in place that protects against pinned pages
> not being able to enter the swapcache, but IIRC it can race with
> GUP, so it might race with c) eventually, meaning that with swap
> there might still be a small chance for the CVE in current upstream IIRC.
>
> I think with your patch we could trick do_wp_page() do go down the
> reuse_swap_page() path and reuse the page, even though there is a pending
> pin on the page from process B.
>
>
> a) Process A maps an anon page
> -> refcount = 1, mapcount = 1
> b) Page is added to the swapcache and unmapped
> -> refcount = 1, mapcount = 0
> c) Process A reads page
> -> do_swap_page() will map the page R/O
> -> the page might remain in the swapcache
> -> refcount = 2, mapcount = 1
> d) Process A forks process B
> -> Page mapped R/O
> -> refcount = 3, mapcount = 2
> e) Process B pins the page R/O (e.g., vmsplice)
> -> refcount > 3, mapcount = 2
> d) Process B unmaps the page
> -> refcount > 2, mapcount = 1
> e) Process A writes to the page
>
> In e), we would go into reuse_swap_page() and could succeed, being left with
> refcount > 1 and mapcount = 1.
>
> But maybe I am missing something :)
>
>
>
> There are only two real ways I know to handle the scenario above:
> (1) Rely on page_count() == 1 in COW logic to decide whether we can
> reuse a page. This comes with false negatives, some being harmful,
> some just being a performance problem.
> (2) Rely on mapcount(), swapcount() ... when deciding to reuse a page
> and when it's safe to take a GUP pin.
>
> (2) was discussed in
> https://lkml.kernel.org/r/[email protected]
> but was essentially NAKed by Linus in the current form (well, it didn't
> consider the swapcount so it was just buggy, fortunately Linus pointed that
> out).
>
> To handle problematic false negatives with (1) I'm looking into marking
> anon pages as exclusively owned by a single process, such that we can simply
> reuse them in the COW path -- PageAnonExclusive() also discussed inside that
> posting. That posting also contains a test case which tests all different
> variants of the "child can observe MAP_PRIVAT modifications of the parent"
> we know of (e.g., THP, hugetlb), which you might use to verify if the issue is
> reintroduced by your patch or not.
>
>
> Yesterday, I played with removing reuse_swap_page() completely
> and streamlining our COW logic like so:
>
> (based on https://www.spinics.net/lists/linux-mm/msg279460.html)
>
>
> commit f1fa31a40b13ade5cd93ceb40daadf1196526145
> Author: David Hildenbrand <[email protected]>
> Date: Fri Jan 14 09:29:52 2022 +0100
>
> mm: optimize do_wp_page() for exclusive pages in the swapcache
>
> Let's optimize for a page with a single user that has been added to the
> swapcache. Try removing the swapcache reference if there is hope that
> we're the exclusive user, but keep the page_count(page) == 1 check in
> place.
>
> Avoid using reuse_swap_page(), we'll streamline all reuse_swap_page()
> users next.
>
> While at it, remove the superfluous page_mapcount() check: it's
> implicitly covered by the page_count() for ordinary anon pages.
>
> Signed-off-by: David Hildenbrand <[email protected]>
>
> diff --git a/mm/memory.c b/mm/memory.c
> index e8e2144cbfa6..bd2af7a36791 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3290,12 +3290,21 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> if (PageAnon(vmf->page)) {
> struct page *page = vmf->page;
>
> - /* PageKsm() doesn't necessarily raise the page refcount */
> - if (PageKsm(page) || page_count(page) != 1)
> + /*
> + * PageKsm() doesn't necessarily raise the page refcount.
> + *
> + * These checks are racy as long as we haven't locked the page;
> + * they are a pure optimization to avoid trying to lock the page
> + * and trying to free the swap cache when there is little hope
> + * it will actually result in a refcount of 1.
> + */
> + if (PageKsm(page) || page_count(page) <= 1 + PageSwapCache(page))
> goto copy;
> if (!trylock_page(page))
> goto copy;
> - if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
> + if (PageSwapCache(page))
> + try_to_free_swap(page);
> + if (PageKsm(page) || page_count(page) != 1) {
> unlock_page(page);
> goto copy;
> }
>
>
>
> Followed by
>
>
>
> commit 754aa8f04f7b6d1ffc2b9844be63bbf7b527929f (HEAD)
> Author: David Hildenbrand <[email protected]>
> Date: Fri Jan 14 09:53:20 2022 +0100
>
> mm: streamline COW logic in do_swap_page()
>
> Let's apply the same COW logic as in do_wp_page(), conditionally trying to
> remove the page from the swapcache after freeing the swap entry, however,
> before actually mapping our page. We can change the order now that
> reuse_swap_page() is no longer used.
>
> Signed-off-by: David Hildenbrand <[email protected]>
>
>
> diff --git a/mm/memory.c b/mm/memory.c
> index bd2af7a36791..a1bd2b5c818a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3489,6 +3489,24 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> return 0;
> }
>
> +static inline bool should_try_to_free_swap(struct page *page,
> + struct vm_area_struct *vma,
> + unsigned int fault_flags)
> +{
> + if (!PageSwapCache(page))
> + return false;
> + if (mem_cgroup_swap_full(page) || (vma->vm_flags & VM_LOCKED) ||
> + PageMlocked(page))
> + return true;
> + /*
> + * We must only reuse the page if the page_count() is 1. Try to
> + * free the swapcache to get rid of the swapcache reference in case
> + * it's likely that we will succeed.
> + */
> + return (fault_flags & FAULT_FLAG_WRITE) && !PageKsm(page) &&
> + page_count(page) == 2;
> +}
> +
> /*
> * We enter with non-exclusive mmap_lock (to exclude vma changes,
> * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -3612,7 +3630,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> /*
> - * Make sure try_to_free_swap or reuse_swap_page or swapoff did not
> + * Make sure try_to_free_swap or swapoff did not
> * release the swapcache from under us. The page pin, and pte_same
> * test below, are not enough to exclude that. Even if it is still
> * swapcache, we need to check that the page's swap has not changed.
> @@ -3644,19 +3662,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> /*
> - * The page isn't present yet, go ahead with the fault.
> - *
> - * Be careful about the sequence of operations here.
> - * To get its accounting right, reuse_swap_page() must be called
> - * while the page is counted on swap but not yet in mapcount i.e.
> - * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
> - * must be called after the swap_free(), or it will never succeed.
> + * Remove the swap entry and conditionally try to free up the swapcache.
> + * We're already holding a reference on the page but haven't mapped it
> + * yet. We won't be waiting on writeback, so if there is concurrent
> + * writeback we won't map the page writable just now.
> */
> + swap_free(entry);
> + if (should_try_to_free_swap(page, vma, vmf->flags))
> + try_to_free_swap(page);
>
> inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
> dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
> pte = mk_pte(page, vma->vm_page_prot);
> - if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
> +
> + /* Same logic as in do_wp_page(). */
> + if ((vmf->flags & FAULT_FLAG_WRITE) &&
> + !PageKsm(page) && page_count(page) == 1) {
> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> vmf->flags &= ~FAULT_FLAG_WRITE;
> ret |= VM_FAULT_WRITE;
> @@ -3681,10 +3702,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
> }
>
> - swap_free(entry);
> - if (mem_cgroup_swap_full(page) ||
> - (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> - try_to_free_swap(page);
> unlock_page(page);
> if (page != swapcache && swapcache) {
> /*
>
>
>
>
> And followed by more patches that rip out reuse_swap_page() completely.
>
>
> But I still have to understand what's safe to do and not do in do_swap_page(),
> and which effect it would have. Especially PTE-mapped THP are a head-scratcher
> for me, and the interaction with THP in the swapcache before/after actual swap.
>
> So consider above a WIP prototype that's mostly untested.
>

--
Best Regards,
Liang Zhang

2022-01-18 02:26:12

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On 17.01.22 03:11, zhangliang (AG) wrote:
> Hi, thank you very much for your patient reply, and I completely agree with the viewpoint from you and Linus about COW and reuse_swap_page(). :)
>

If you have some sparse cycles, it would be great if you could see if

https://github.com/davidhildenbrand/linux/tree/reuse_swap_page

works as expected as well for your setup.

--
Thanks,

David / dhildenb

2022-01-18 02:28:08

by zhangliang (AG)

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

Sure, I will do that :)

On 2022/1/17 20:58, David Hildenbrand wrote:
> On 17.01.22 03:11, zhangliang (AG) wrote:
>> Hi, thank you very much for your patient reply, and I completely agree with the viewpoint from you and Linus about COW and reuse_swap_page(). :)
>>
>
> If you have some sparse cycles, it would be great if you could see if
>
> https://github.com/davidhildenbrand/linux/tree/reuse_swap_page
>
> works as expected as well for your setup.
>

--
Best Regards,
Liang Zhang

2022-01-21 22:14:08

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On 17.01.22 14:31, zhangliang (AG) wrote:
> Sure, I will do that :)

I'm polishing up / testing the patches and might send something out for discussion shortly.
Just a note that on my branch was a version with a wrong condition that should have been fixed now.

I am still thinking about PTE mapped THP. For these, we'll always
have page_count() > 1, essentially corresponding to the number of still-mapped sub-pages.

So if we end up with a R/O mapped part of a THP, we'll always have to COW and cannot reuse ever,
although it's really just a single process mapping the THP via PTEs.

One approach would be to scan the currently locked page table for entries mapping
this same page. If page_count() corresponds to that value, we know that only we are
mapping the THP and there are no additional references. That would be a special case
if we find an anon THP in do_wp_page(). Hm.

--
Thanks,

David / dhildenb

2022-01-21 22:16:31

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On Thu, Jan 20, 2022 at 03:15:37PM +0100, David Hildenbrand wrote:
> On 17.01.22 14:31, zhangliang (AG) wrote:
> > Sure, I will do that :)
>
> I'm polishing up / testing the patches and might send something out for discussion shortly.
> Just a note that on my branch was a version with a wrong condition that should have been fixed now.
>
> I am still thinking about PTE mapped THP. For these, we'll always
> have page_count() > 1, essentially corresponding to the number of still-mapped sub-pages.
>
> So if we end up with a R/O mapped part of a THP, we'll always have to COW and cannot reuse ever,
> although it's really just a single process mapping the THP via PTEs.
>
> One approach would be to scan the currently locked page table for entries mapping
> this same page. If page_count() corresponds to that value, we know that only we are
> mapping the THP and there are no additional references. That would be a special case
> if we find an anon THP in do_wp_page(). Hm.

You're starting to optimise for some pretty weird cases at that point.
Anon THP is always going to start out aligned (and can be moved by
mremap()). Arguably it should be broken up if it's moved so it can be
reformed into aligned THPs by khugepaged.

This is completely different from file-backed THPs, where misalignment
might be considered normal (if unfortunate).

2022-01-21 22:21:28

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On Thu, Jan 20, 2022 at 04:26:22PM +0100, David Hildenbrand wrote:
> On 20.01.22 15:39, Matthew Wilcox wrote:
> > On Thu, Jan 20, 2022 at 03:15:37PM +0100, David Hildenbrand wrote:
> >> On 17.01.22 14:31, zhangliang (AG) wrote:
> >>> Sure, I will do that :)
> >>
> >> I'm polishing up / testing the patches and might send something out for discussion shortly.
> >> Just a note that on my branch was a version with a wrong condition that should have been fixed now.
> >>
> >> I am still thinking about PTE mapped THP. For these, we'll always
> >> have page_count() > 1, essentially corresponding to the number of still-mapped sub-pages.
> >>
> >> So if we end up with a R/O mapped part of a THP, we'll always have to COW and cannot reuse ever,
> >> although it's really just a single process mapping the THP via PTEs.
> >>
> >> One approach would be to scan the currently locked page table for entries mapping
> >> this same page. If page_count() corresponds to that value, we know that only we are
> >> mapping the THP and there are no additional references. That would be a special case
> >> if we find an anon THP in do_wp_page(). Hm.
> >
> > You're starting to optimise for some pretty weird cases at that point.
>
> So your claim is that read-only, PTE mapped pages are weird? How do you
> come to that conclusion?

Because normally anon THP pages are PMD mapped. That's rather
the point of anon THPs.

> If we adjust the THP reuse logic to split on additional references
> (page_count() == 1) -- similarly as suggested by Linus to fix the CVE --
> we're going to end up with exactly that more frequently.

I don't understand. Are we talking past each other? As I understand
the situation we're talking about here, a process has created a THP,
done something to cause it to be partially mapped (or mapped in a
misaligned way) in its own address space, then forked, and we're
trying to figure out if it's safe to reuse it? I say that situation is
rare enough that it's OK to always allocate an order-0 page and
copy into it.

> > Anon THP is always going to start out aligned (and can be moved by
> > mremap()). Arguably it should be broken up if it's moved so it can be
> > reformed into aligned THPs by khugepaged.
>
> Can you elaborate, I'm missing the point where something gets moved. I
> don't care about mremap() at all here.
>
>
> 1. You have a read-only, PTE mapped THP
> 2. Write fault on the THP
> 3. We PTE-map the THP because we run into a false positive in our COW
> logic to handle COW on PTE
> 4. Write fault on the PTE
> 5. We always have to COW each and every sub-page and can never reuse,
> because page_count() > 1
>
> That's essentially what reuse_swap_page() tried to handle before.
> Eventually optimizing for this is certainly the next step, but I'd like
> to document which effect the removal of reuse_swap_page() will have to THP.

I'm talking about step 0. How do we get a read-only, PTE-mapped THP?
Through mremap() or perhaps through an mprotect()/mmap()/munmap() that
failed to split the THP.

2022-01-21 22:21:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On Thu, Jan 20, 2022 at 5:26 PM David Hildenbrand <[email protected]> wrote:
>
> So your claim is that read-only, PTE mapped pages are weird? How do you
> come to that conclusion?
>
> If we adjust the THP reuse logic to split on additional references
> (page_count() == 1) -- similarly as suggested by Linus to fix the CVE --
> we're going to end up with exactly that more frequently.

If you write to a THP page that has page_count() elevated - presumably
because of a fork() - then that COW is exactly what you want to
happen.

And presumably you want it to happen page-by-page.

So I fail to see what the problem is.

The *normal* THP case is that there hasn't been a fork, and there is
no COW activity. That's the only thing worth trying to optimize for
and worry about.

If you do some kind of fork with huge-pages, and actually write to
those pages (as opposed to just execve() in the child and wait in the
parent), you only have yourself to blame. You *will* take COW faults,
and you have to do it, and at that point spliting the THP in whoever
did the COW is likely the right thing to do just to hope that you
don't have to allocate a whole new hugepage. So it will cause new
(small-page) allocations and copies.

And yes, at that point, writes to the THP page will cause COW's for
both sides as they both end up making that "split it" decision.

Honestly, would anything else ever even make sense?

If you care about THP performance, you make sure that the COW THP case
simply never happens.

Linus

2022-01-21 22:22:05

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On 20.01.22 15:39, Matthew Wilcox wrote:
> On Thu, Jan 20, 2022 at 03:15:37PM +0100, David Hildenbrand wrote:
>> On 17.01.22 14:31, zhangliang (AG) wrote:
>>> Sure, I will do that :)
>>
>> I'm polishing up / testing the patches and might send something out for discussion shortly.
>> Just a note that on my branch was a version with a wrong condition that should have been fixed now.
>>
>> I am still thinking about PTE mapped THP. For these, we'll always
>> have page_count() > 1, essentially corresponding to the number of still-mapped sub-pages.
>>
>> So if we end up with a R/O mapped part of a THP, we'll always have to COW and cannot reuse ever,
>> although it's really just a single process mapping the THP via PTEs.
>>
>> One approach would be to scan the currently locked page table for entries mapping
>> this same page. If page_count() corresponds to that value, we know that only we are
>> mapping the THP and there are no additional references. That would be a special case
>> if we find an anon THP in do_wp_page(). Hm.
>
> You're starting to optimise for some pretty weird cases at that point.

So your claim is that read-only, PTE mapped pages are weird? How do you
come to that conclusion?

If we adjust the THP reuse logic to split on additional references
(page_count() == 1) -- similarly as suggested by Linus to fix the CVE --
we're going to end up with exactly that more frequently.

> Anon THP is always going to start out aligned (and can be moved by
> mremap()). Arguably it should be broken up if it's moved so it can be
> reformed into aligned THPs by khugepaged.

Can you elaborate, I'm missing the point where something gets moved. I
don't care about mremap() at all here.


1. You have a read-only, PTE mapped THP
2. Write fault on the THP
3. We PTE-map the THP because we run into a false positive in our COW
logic to handle COW on PTE
4. Write fault on the PTE
5. We always have to COW each and every sub-page and can never reuse,
because page_count() > 1

That's essentially what reuse_swap_page() tried to handle before.
Eventually optimizing for this is certainly the next step, but I'd like
to document which effect the removal of reuse_swap_page() will have to THP.

--
Thanks,

David / dhildenb

2022-01-21 22:22:14

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On Thu, Jan 20, 2022 at 04:39:55PM +0100, David Hildenbrand wrote:
> On 20.01.22 16:36, Matthew Wilcox wrote:
> > On Thu, Jan 20, 2022 at 04:26:22PM +0100, David Hildenbrand wrote:
> >> On 20.01.22 15:39, Matthew Wilcox wrote:
> >>> On Thu, Jan 20, 2022 at 03:15:37PM +0100, David Hildenbrand wrote:
> >>>> On 17.01.22 14:31, zhangliang (AG) wrote:
> >>>>> Sure, I will do that :)
> >>>>
> >>>> I'm polishing up / testing the patches and might send something out for discussion shortly.
> >>>> Just a note that on my branch was a version with a wrong condition that should have been fixed now.
> >>>>
> >>>> I am still thinking about PTE mapped THP. For these, we'll always
> >>>> have page_count() > 1, essentially corresponding to the number of still-mapped sub-pages.
> >>>>
> >>>> So if we end up with a R/O mapped part of a THP, we'll always have to COW and cannot reuse ever,
> >>>> although it's really just a single process mapping the THP via PTEs.
> >>>>
> >>>> One approach would be to scan the currently locked page table for entries mapping
> >>>> this same page. If page_count() corresponds to that value, we know that only we are
> >>>> mapping the THP and there are no additional references. That would be a special case
> >>>> if we find an anon THP in do_wp_page(). Hm.
> >>>
> >>> You're starting to optimise for some pretty weird cases at that point.
> >>
> >> So your claim is that read-only, PTE mapped pages are weird? How do you
> >> come to that conclusion?
> >
> > Because normally anon THP pages are PMD mapped. That's rather
> > the point of anon THPs.
>
> For example unless we are talking about *drumroll* COW handling.
>
> >
> >> If we adjust the THP reuse logic to split on additional references
> >> (page_count() == 1) -- similarly as suggested by Linus to fix the CVE --
> >> we're going to end up with exactly that more frequently.
> >
> > I don't understand. Are we talking past each other? As I understand
> > the situation we're talking about here, a process has created a THP,
> > done something to cause it to be partially mapped (or mapped in a
> > misaligned way) in its own address space, then forked, and we're
> > trying to figure out if it's safe to reuse it? I say that situation is
> > rare enough that it's OK to always allocate an order-0 page and
> > copy into it.
>
> Yes, we are talking past each other and no, I am talking about fully
> mapped THP, just mapped via PTEs.
>
> Please refer to our THP COW logic: do_huge_pmd_wp_page()

You're going to have to be a bit more explicit. That's clearly handling
the case where there's a PMD mapping. If there is _also_ a PTE mapping,
then obviously the page is mapped more than once and can't be reused!

> >
> >>> Anon THP is always going to start out aligned (and can be moved by
> >>> mremap()). Arguably it should be broken up if it's moved so it can be
> >>> reformed into aligned THPs by khugepaged.
> >>
> >> Can you elaborate, I'm missing the point where something gets moved. I
> >> don't care about mremap() at all here.
> >>
> >>
> >> 1. You have a read-only, PTE mapped THP
> >> 2. Write fault on the THP
> >> 3. We PTE-map the THP because we run into a false positive in our COW
> >> logic to handle COW on PTE
> >> 4. Write fault on the PTE
> >> 5. We always have to COW each and every sub-page and can never reuse,
> >> because page_count() > 1
> >>
> >> That's essentially what reuse_swap_page() tried to handle before.
> >> Eventually optimizing for this is certainly the next step, but I'd like
> >> to document which effect the removal of reuse_swap_page() will have to THP.
> >
> > I'm talking about step 0. How do we get a read-only, PTE-mapped THP?
> > Through mremap() or perhaps through an mprotect()/mmap()/munmap() that
> > failed to split the THP.
>
> do_huge_pmd_wp_page()

I feel you could be a little more verbose about what you think is
going on here. Are you talking about the fallback: path where we
call __split_huge_pmd()?

2022-01-21 22:22:20

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On 20.01.22 16:37, Linus Torvalds wrote:
> On Thu, Jan 20, 2022 at 5:26 PM David Hildenbrand <[email protected]> wrote:
>>
>> So your claim is that read-only, PTE mapped pages are weird? How do you
>> come to that conclusion?
>>
>> If we adjust the THP reuse logic to split on additional references
>> (page_count() == 1) -- similarly as suggested by Linus to fix the CVE --
>> we're going to end up with exactly that more frequently.
>
> If you write to a THP page that has page_count() elevated - presumably
> because of a fork() - then that COW is exactly what you want to
> happen.
>
> And presumably you want it to happen page-by-page.
>
> So I fail to see what the problem is.
>
> The *normal* THP case is that there hasn't been a fork, and there is
> no COW activity. That's the only thing worth trying to optimize for
> and worry about.
>
> If you do some kind of fork with huge-pages, and actually write to
> those pages (as opposed to just execve() in the child and wait in the
> parent), you only have yourself to blame. You *will* take COW faults,
> and you have to do it, and at that point spliting the THP in whoever
> did the COW is likely the right thing to do just to hope that you
> don't have to allocate a whole new hugepage. So it will cause new
> (small-page) allocations and copies.
>
> And yes, at that point, writes to the THP page will cause COW's for
> both sides as they both end up making that "split it" decision.
>
> Honestly, would anything else ever even make sense?
>
> If you care about THP performance, you make sure that the COW THP case
> simply never happens.

I'm, not concerned about fork(), I'm concerned about other false positives.

Here is what I currently have, I hope that makes sense:

commit f8fe5e15e04f7563606d58daee795aa0cc0e454c (HEAD)
Author: David Hildenbrand <[email protected]>
Date: Fri Jan 14 09:53:35 2022 +0100

mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page()

We currently have a different COW logic for THP than we have for
ordinary pages in do_wp_page(): the effect is that the issue reported in
CVE-2020-29374 is currently still possible for THP: a child process can
observe memory modifications of the parent process to MAP_PRIVATE pages.

Let's apply the same logic (page_count() == 1), conditionally trying to
remove the page from the swapcache after freeing the swap entry, however,
before actually mapping our page.

Note that KSM does not apply to THP and that we won't be waiting on
writeback. In the future, we might want to wait for writeback in case
the page is not swapped by any other process. For now, let's keep it
simple.

If we find a page_count() > 1, we'll have to split and fallback to
do_wp_page(), which will copy the page.

This change has the following effects:

(1) Mapping a THP with GUP references read-only and triggering a write
fault will result in a split and disconnect of GUP from the actual page
table just as we know it for ordinary pages already: this will require a
fix for all anonymous pages, for example, by marking anon pages exclusive
to a single process and allowing GUP pins only on exclusive anon pages.

(2) Additional references on the compound page, or concurrent writeback,
will now result in the THP getting mapped via PTEs, whereby each PTE
requires a reference on the compound page. Once we have a PTE-mapped
PMD, do_wp_page() will always copy and never reuse.

Signed-off-by: David Hildenbrand <[email protected]>

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 406a3c28c026..a4f2754142aa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1303,7 +1303,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
page = pmd_page(orig_pmd);
VM_BUG_ON_PAGE(!PageHead(page), page);

- /* Lock page for reuse_swap_page() */
if (!trylock_page(page)) {
get_page(page);
spin_unlock(vmf->ptl);
@@ -1319,10 +1318,14 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
}

/*
- * We can only reuse the page if nobody else maps the huge page or it's
- * part.
+ * See do_wp_page(): we can reuse only if there are no additional
+ * references (page_count() is 1). Try to free the swapcache to get
+ * rid of the swapcache references in case it's likely that we will
+ * succeed.
*/
- if (reuse_swap_page(page)) {
+ if (PageSwapCache(page) && page_count(page) == 1 + thp_nr_pages(page))
+ try_to_free_swap(page);
+ if (page_count(page) == 1) {
pmd_t entry;
entry = pmd_mkyoung(orig_pmd);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);


--
Thanks,

David / dhildenb

2022-01-21 22:22:51

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On 20.01.22 16:36, Matthew Wilcox wrote:
> On Thu, Jan 20, 2022 at 04:26:22PM +0100, David Hildenbrand wrote:
>> On 20.01.22 15:39, Matthew Wilcox wrote:
>>> On Thu, Jan 20, 2022 at 03:15:37PM +0100, David Hildenbrand wrote:
>>>> On 17.01.22 14:31, zhangliang (AG) wrote:
>>>>> Sure, I will do that :)
>>>>
>>>> I'm polishing up / testing the patches and might send something out for discussion shortly.
>>>> Just a note that on my branch was a version with a wrong condition that should have been fixed now.
>>>>
>>>> I am still thinking about PTE mapped THP. For these, we'll always
>>>> have page_count() > 1, essentially corresponding to the number of still-mapped sub-pages.
>>>>
>>>> So if we end up with a R/O mapped part of a THP, we'll always have to COW and cannot reuse ever,
>>>> although it's really just a single process mapping the THP via PTEs.
>>>>
>>>> One approach would be to scan the currently locked page table for entries mapping
>>>> this same page. If page_count() corresponds to that value, we know that only we are
>>>> mapping the THP and there are no additional references. That would be a special case
>>>> if we find an anon THP in do_wp_page(). Hm.
>>>
>>> You're starting to optimise for some pretty weird cases at that point.
>>
>> So your claim is that read-only, PTE mapped pages are weird? How do you
>> come to that conclusion?
>
> Because normally anon THP pages are PMD mapped. That's rather
> the point of anon THPs.

For example unless we are talking about *drumroll* COW handling.

>
>> If we adjust the THP reuse logic to split on additional references
>> (page_count() == 1) -- similarly as suggested by Linus to fix the CVE --
>> we're going to end up with exactly that more frequently.
>
> I don't understand. Are we talking past each other? As I understand
> the situation we're talking about here, a process has created a THP,
> done something to cause it to be partially mapped (or mapped in a
> misaligned way) in its own address space, then forked, and we're
> trying to figure out if it's safe to reuse it? I say that situation is
> rare enough that it's OK to always allocate an order-0 page and
> copy into it.

Yes, we are talking past each other and no, I am talking about fully
mapped THP, just mapped via PTEs.

Please refer to our THP COW logic: do_huge_pmd_wp_page()

>
>>> Anon THP is always going to start out aligned (and can be moved by
>>> mremap()). Arguably it should be broken up if it's moved so it can be
>>> reformed into aligned THPs by khugepaged.
>>
>> Can you elaborate, I'm missing the point where something gets moved. I
>> don't care about mremap() at all here.
>>
>>
>> 1. You have a read-only, PTE mapped THP
>> 2. Write fault on the THP
>> 3. We PTE-map the THP because we run into a false positive in our COW
>> logic to handle COW on PTE
>> 4. Write fault on the PTE
>> 5. We always have to COW each and every sub-page and can never reuse,
>> because page_count() > 1
>>
>> That's essentially what reuse_swap_page() tried to handle before.
>> Eventually optimizing for this is certainly the next step, but I'd like
>> to document which effect the removal of reuse_swap_page() will have to THP.
>
> I'm talking about step 0. How do we get a read-only, PTE-mapped THP?
> Through mremap() or perhaps through an mprotect()/mmap()/munmap() that
> failed to split the THP.

do_huge_pmd_wp_page()

--
Thanks,

David / dhildenb

2022-01-21 22:23:43

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page


>> Yes, we are talking past each other and no, I am talking about fully
>> mapped THP, just mapped via PTEs.
>>
>> Please refer to our THP COW logic: do_huge_pmd_wp_page()
>
> You're going to have to be a bit more explicit. That's clearly handling
> the case where there's a PMD mapping. If there is _also_ a PTE mapping,
> then obviously the page is mapped more than once and can't be reused!
>
>>>
>>>>> Anon THP is always going to start out aligned (and can be moved by
>>>>> mremap()). Arguably it should be broken up if it's moved so it can be
>>>>> reformed into aligned THPs by khugepaged.
>>>>
>>>> Can you elaborate, I'm missing the point where something gets moved. I
>>>> don't care about mremap() at all here.
>>>>
>>>>
>>>> 1. You have a read-only, PTE mapped THP
>>>> 2. Write fault on the THP
>>>> 3. We PTE-map the THP because we run into a false positive in our COW
>>>> logic to handle COW on PTE
>>>> 4. Write fault on the PTE
>>>> 5. We always have to COW each and every sub-page and can never reuse,
>>>> because page_count() > 1
>>>>
>>>> That's essentially what reuse_swap_page() tried to handle before.
>>>> Eventually optimizing for this is certainly the next step, but I'd like
>>>> to document which effect the removal of reuse_swap_page() will have to THP.
>>>
>>> I'm talking about step 0. How do we get a read-only, PTE-mapped THP?
>>> Through mremap() or perhaps through an mprotect()/mmap()/munmap() that
>>> failed to split the THP.
>>
>> do_huge_pmd_wp_page()
>
> I feel you could be a little more verbose about what you think is
> going on here. Are you talking about the fallback: path where we
> call __split_huge_pmd()?

Sorry, I was less verbose because I was just sending out the
patch+description to Linus' reply and was assuming you're going to read
it anyways ;)

Yes, I'm speaking about exactly that fallback path.

--
Thanks,

David / dhildenb

2022-01-21 22:24:42

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On Thu, Jan 20, 2022 at 04:51:47PM +0100, David Hildenbrand wrote:
>
> >> Yes, we are talking past each other and no, I am talking about fully
> >> mapped THP, just mapped via PTEs.
> >>
> >> Please refer to our THP COW logic: do_huge_pmd_wp_page()
> >
> > You're going to have to be a bit more explicit. That's clearly handling
> > the case where there's a PMD mapping. If there is _also_ a PTE mapping,
> > then obviously the page is mapped more than once and can't be reused!
> >
> >>>
> >>>>> Anon THP is always going to start out aligned (and can be moved by
> >>>>> mremap()). Arguably it should be broken up if it's moved so it can be
> >>>>> reformed into aligned THPs by khugepaged.
> >>>>
> >>>> Can you elaborate, I'm missing the point where something gets moved. I
> >>>> don't care about mremap() at all here.
> >>>>
> >>>>
> >>>> 1. You have a read-only, PTE mapped THP
> >>>> 2. Write fault on the THP
> >>>> 3. We PTE-map the THP because we run into a false positive in our COW
> >>>> logic to handle COW on PTE
> >>>> 4. Write fault on the PTE
> >>>> 5. We always have to COW each and every sub-page and can never reuse,
> >>>> because page_count() > 1
> >>>>
> >>>> That's essentially what reuse_swap_page() tried to handle before.
> >>>> Eventually optimizing for this is certainly the next step, but I'd like
> >>>> to document which effect the removal of reuse_swap_page() will have to THP.
> >>>
> >>> I'm talking about step 0. How do we get a read-only, PTE-mapped THP?
> >>> Through mremap() or perhaps through an mprotect()/mmap()/munmap() that
> >>> failed to split the THP.
> >>
> >> do_huge_pmd_wp_page()
> >
> > I feel you could be a little more verbose about what you think is
> > going on here. Are you talking about the fallback: path where we
> > call __split_huge_pmd()?
>
> Sorry, I was less verbose because I was just sending out the
> patch+description to Linus' reply and was assuming you're going to read
> it anyways ;)

This reply arrived before your reply to Linus ;-) Anyway ...

> Yes, I'm speaking about exactly that fallback path.

OK, so in that fallback path, we're already determined the THP has
more than one reference to it (ok, maybe that extra reference was
temporary and now gone), but we've already split the PMD down into
PTEs, and COWed one of the other pages that was in the THP. If
anything, we should be more aggressive about COWing the remaining
pages in the THP, not looking for reasons why we might be able to
avoid COWing this particular page.

2022-01-21 22:26:02

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

>>
>> Sorry, I was less verbose because I was just sending out the
>> patch+description to Linus' reply and was assuming you're going to read
>> it anyways ;)
>
> This reply arrived before your reply to Linus ;-) Anyway ...

Yes, I could just have added that I'm going to reply with the patch to
Linus after Linus' mail flew in and I made up my mind how to proceed,
that would have been nicer :)

>
>> Yes, I'm speaking about exactly that fallback path.
>
> OK, so in that fallback path, we're already determined the THP has
> more than one reference to it (ok, maybe that extra reference was
> temporary and now gone), but we've already split the PMD down into

Yes, any extra references or concurrent writeback. Swapcache was once
example that my patch hopefully handles properly.

> PTEs, and COWed one of the other pages that was in the THP. If
> anything, we should be more aggressive about COWing the remaining
> pages in the THP, not looking for reasons why we might be able to
> avoid COWing this particular page.

At that point in time we didn't COW yet, we only PTE-mapped the THP, but
yes, once we reach do_wp_page() we will always COW, even if there are no
additional references anymore.

Ideas I had what we could do once we're in do_cow_page() and we spot
that we do have a PTE-mapped THP:
* Count the mappings in the process page table and use that as baseline
(instead of 1). Would fail if there are additional references.
* Try to split the compound page immediately. Will also fail if there
are additional references.

COWing more extremely sounds like an interesting idea to free up the
compound page after we fragmented it -- which will succeed once
additional references are gone.

--
Thanks,

David / dhildenb

2022-01-21 22:26:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On Thu, Jan 20, 2022 at 5:46 PM David Hildenbrand <[email protected]> wrote:
>
> I'm, not concerned about fork(), I'm concerned about other false positives.

Without a fork(), you won't have the THP marked for COW, so is it
really an issue?

> Here is what I currently have, I hope that makes sense:

From a quick look, that patch looks fine to me, but there might be
something I'm missing... And who knows what odd usage patterns there
might be in this area. The whole odd Android thing with forking that
zygote process.

Because that zygote thing _does_ use THP, I think, and it's where the
wrong-way COW thing mattered. Obviously doing COW is the right thing
to do, and that case doesn't want any sharing of pages (all copies),
but it might be worth at least checking that it works and there isn't
some odd performance gotcha.

Linus

2022-01-21 22:28:23

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On 20.01.22 18:22, Linus Torvalds wrote:
> On Thu, Jan 20, 2022 at 5:46 PM David Hildenbrand <[email protected]> wrote:
>>
>> I'm, not concerned about fork(), I'm concerned about other false positives.
>
> Without a fork(), you won't have the THP marked for COW, so is it
> really an issue?

Oh sorry, I should have been clearer.

I absolutely agree that there is no way around COW if there are multiple
sharers of a page :) If you fork() and write to a shared page, you have
to expect that THP get split and that you get a copy. No change on that
front.

At this point I'm trying to assess the impact of the change and how to
eventually mitigate it if it turns out to be a real problem.

(I've been spending a lot of time trying to understand all the
complexity, and sometimes my brain is ... a bit overloaded by all of
that. Well, at least I learn a lot ...)

One scenario I have in mind how we can end up easily with R/O mapped
exclusive THP is a simple fork() with an immediate exec(), unmap() or
exit() of the child. Might not be the most efficient way of doing
things, but that doesn't mean that existing user space doesn't rely on
it not happening. Then, of course, there are other ways to read-protect
THP temporarily (mprotect() and friends), where you wouldn't expect to
lose your THP, but it's somewhat a secondary concern most probably .

Obviously, we need another (temporary/speculative) reference on the THP
or writeback to actually split it. I'm *hoping* that it will be so rare
that we really don't care. I decided to always lock the THP in
do_wp_page() to at least handle page migration and swapcache more reliably.

To answer you question: people optimized the reuse case heavily
previously (reuse_swap_page()), and I can see how it might happen. I'm
hoping that it won't matter in practice, but I'd like to at least
document the impact and have some magic solution in sleeve that I can
just pull out when reports start coming in.

>
>> Here is what I currently have, I hope that makes sense:
>
> From a quick look, that patch looks fine to me, but there might be
> something I'm missing... And who knows what odd usage patterns there
> might be in this area. The whole odd Android thing with forking that
> zygote process.

Exactly my thoughts. I'm trying to be very careful.

--
Thanks,

David / dhildenb

2022-01-21 22:29:36

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page


> On Jan 20, 2022, at 6:15 AM, David Hildenbrand <[email protected]> wrote:
>
> On 17.01.22 14:31, zhangliang (AG) wrote:
>> Sure, I will do that :)
>
> I'm polishing up / testing the patches and might send something out for discussion shortly.
> Just a note that on my branch was a version with a wrong condition that should have been fixed now.
>

Sorry for being late for the discussion.

David, does any of it regards the lru_cache_add() reference issue that I
mentioned? [1]

Seems to me that any solution should also regard this problem, or am I
missing something?

Thanks,
Nadav

[1] https://lore.kernel.org/linux-mm/[email protected]/

2022-01-21 22:30:41

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On 20.01.22 18:48, Nadav Amit wrote:
>
>> On Jan 20, 2022, at 6:15 AM, David Hildenbrand <[email protected]> wrote:
>>
>> On 17.01.22 14:31, zhangliang (AG) wrote:
>>> Sure, I will do that :)
>>
>> I'm polishing up / testing the patches and might send something out for discussion shortly.
>> Just a note that on my branch was a version with a wrong condition that should have been fixed now.
>>
>
> Sorry for being late for the discussion.
>
> David, does any of it regards the lru_cache_add() reference issue that I
> mentioned? [1]

No, unfortunately not in that part of my work. *Maybe* we could also try
to handle that reference similarly to the swapcache, but the question is
if we can't wait for PageAnonExclusive.

Right now I have the following in mind to get most parts working as
exptected:

1. Optimize reuse logic for the swapcache as it seems to be easy
2. Streamline COW logic and remove reuse_swap_page() -- fix the CVE for
THP.
3. Introduce PageAnonExclusive and allow FOLL_PIN only on
PageAnonExclusive pages.
4. Convert O_DIRECT to FOLL_PIN

We will never ever have to copy a page PageAnonExclusive page in the COW
handler and can immediately reuse it without even locking the page. The
existing reuse logic is essentially then used to reset PageAnonExclusive
on a page (thus it makes sense to work on it) where the flag is not set
anymore -- or on a fresh page if we have to copy.

That implies that all these additional references won't care if your app
doesn't fork() or KSM isn't active. Consequently, anything that
read-protects anonymous pages will work as expected and should be as
fast as it gets.

Sounds good? At least to me. If only swap/migration entries wouldn't be
harder to handle than I'd wish, that's why it's taking a little and will
take a little longer.

>
> Seems to me that any solution should also regard this problem, or am I
> missing something?

The bigger picture has to handle it, yes!

--
Thanks,

David / dhildenb

2022-01-21 22:31:13

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page



> On Jan 20, 2022, at 10:00 AM, David Hildenbrand <[email protected]> wrote:
>
> On 20.01.22 18:48, Nadav Amit wrote:
>>
>>> On Jan 20, 2022, at 6:15 AM, David Hildenbrand <[email protected]> wrote:
>>>
>>> On 17.01.22 14:31, zhangliang (AG) wrote:
>>>> Sure, I will do that :)
>>>
>>> I'm polishing up / testing the patches and might send something out for discussion shortly.
>>> Just a note that on my branch was a version with a wrong condition that should have been fixed now.
>>>
>>
>> Sorry for being late for the discussion.
>>
>> David, does any of it regards the lru_cache_add() reference issue that I
>> mentioned? [1]
>
> No, unfortunately not in that part of my work. *Maybe* we could also try
> to handle that reference similarly to the swapcache, but the question is
> if we can't wait for PageAnonExclusive.
>
> Right now I have the following in mind to get most parts working as
> exptected:
>
> 1. Optimize reuse logic for the swapcache as it seems to be easy
> 2. Streamline COW logic and remove reuse_swap_page() -- fix the CVE for
> THP.
> 3. Introduce PageAnonExclusive and allow FOLL_PIN only on
> PageAnonExclusive pages.
> 4. Convert O_DIRECT to FOLL_PIN
>
> We will never ever have to copy a page PageAnonExclusive page in the COW
> handler and can immediately reuse it without even locking the page. The
> existing reuse logic is essentially then used to reset PageAnonExclusive
> on a page (thus it makes sense to work on it) where the flag is not set
> anymore -- or on a fresh page if we have to copy.
>
> That implies that all these additional references won't care if your app
> doesn't fork() or KSM isn't active. Consequently, anything that
> read-protects anonymous pages will work as expected and should be as
> fast as it gets.
>
> Sounds good? At least to me. If only swap/migration entries wouldn't be
> harder to handle than I'd wish, that's why it's taking a little and will
> take a little longer.

Thanks for the quick response. I would have to see the logic to set/clear
PageAnonExclusive to fully understand how things are handled.

BTW, I just saw this patch form PeterZ [1] that seems to be related, as
it deals with changing protection on pinned pages.


[1] https://lore.kernel.org/linux-mm/[email protected]/

2022-01-21 22:31:39

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

>> Sounds good? At least to me. If only swap/migration entries wouldn't be
>> harder to handle than I'd wish, that's why it's taking a little and will
>> take a little longer.
>
> Thanks for the quick response. I would have to see the logic to set/clear
> PageAnonExclusive to fully understand how things are handled.
>

I'll let you know as soon as I have something. I'll most probably share
a document explaining the design first.

> BTW, I just saw this patch form PeterZ [1] that seems to be related, as
> it deals with changing protection on pinned pages.

Unfortunately the use of page_maybe_dma_pinned() is racy, as we can race
with GUP fast. It's a problem for vmscan which currently relies on it
for correctness. For migration, it might be good enough as we should
fail later when trying to freeze the refcount.

>
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
>


--
Thanks,

David / dhildenb

2022-01-21 22:35:16

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On 20.01.22 19:11, Nadav Amit wrote:
>
>
>> On Jan 20, 2022, at 10:00 AM, David Hildenbrand <[email protected]> wrote:
>>
>> On 20.01.22 18:48, Nadav Amit wrote:
>>>
>>>> On Jan 20, 2022, at 6:15 AM, David Hildenbrand <[email protected]> wrote:
>>>>
>>>> On 17.01.22 14:31, zhangliang (AG) wrote:
>>>>> Sure, I will do that :)
>>>>
>>>> I'm polishing up / testing the patches and might send something out for discussion shortly.
>>>> Just a note that on my branch was a version with a wrong condition that should have been fixed now.
>>>>
>>>
>>> Sorry for being late for the discussion.
>>>
>>> David, does any of it regards the lru_cache_add() reference issue that I
>>> mentioned? [1]
>>
>> No, unfortunately not in that part of my work. *Maybe* we could also try
>> to handle that reference similarly to the swapcache, but the question is
>> if we can't wait for PageAnonExclusive.
>>
>> Right now I have the following in mind to get most parts working as
>> exptected:
>>
>> 1. Optimize reuse logic for the swapcache as it seems to be easy
>> 2. Streamline COW logic and remove reuse_swap_page() -- fix the CVE for
>> THP.
>> 3. Introduce PageAnonExclusive and allow FOLL_PIN only on
>> PageAnonExclusive pages.
>> 4. Convert O_DIRECT to FOLL_PIN
>>
>> We will never ever have to copy a page PageAnonExclusive page in the COW
>> handler and can immediately reuse it without even locking the page. The
>> existing reuse logic is essentially then used to reset PageAnonExclusive
>> on a page (thus it makes sense to work on it) where the flag is not set
>> anymore -- or on a fresh page if we have to copy.
>>
>> That implies that all these additional references won't care if your app
>> doesn't fork() or KSM isn't active. Consequently, anything that
>> read-protects anonymous pages will work as expected and should be as
>> fast as it gets.
>>
>> Sounds good? At least to me. If only swap/migration entries wouldn't be
>> harder to handle than I'd wish, that's why it's taking a little and will
>> take a little longer.
>
> Thanks for the quick response. I would have to see the logic to set/clear
> PageAnonExclusive to fully understand how things are handled.
>
> BTW, I just saw this patch form PeterZ [1] that seems to be related, as
> it deals with changing protection on pinned pages.

Hi Nadav,

I'm trying to see how effective the following patch is with your forceswap.c [1] reproducer.

commit b08d494deb319a63b7c776636b960258c48775e1
Author: David Hildenbrand <[email protected]>
Date: Fri Jan 14 09:29:52 2022 +0100

mm: optimize do_wp_page() for exclusive pages in the swapcache

Let's optimize for a page with a single user that has been added to the
swapcache. Try removing the swapcache reference if there is hope that
we're the exclusive user, but keep the page_count(page) == 1 check in
place.

Avoid using reuse_swap_page(), we'll streamline all reuse_swap_page()
users next.

While at it, remove the superfluous page_mapcount() check: it's
implicitly covered by the page_count() for ordinary anon pages.

Signed-off-by: David Hildenbrand <[email protected]>

diff --git a/mm/memory.c b/mm/memory.c
index f306e698a1e3..d9186981662a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3291,19 +3291,28 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
if (PageAnon(vmf->page)) {
struct page *page = vmf->page;

- /* PageKsm() doesn't necessarily raise the page refcount */
- if (PageKsm(page) || page_count(page) != 1)
+ /*
+ * PageKsm() doesn't necessarily raise the page refcount.
+ *
+ * These checks are racy as long as we haven't locked the page;
+ * they are a pure optimization to avoid trying to lock the page
+ * and trying to free the swap cache when there is little hope
+ * it will actually result in a refcount of 1.
+ */
+ if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page))
goto copy;
if (!trylock_page(page))
goto copy;
- if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
+ if (PageSwapCache(page))
+ try_to_free_swap(page);
+ if (PageKsm(page) || page_count(page) != 1) {
unlock_page(page);
goto copy;
}
/*
- * Ok, we've got the only map reference, and the only
- * page count reference, and the page is locked,
- * it's dark out, and we're wearing sunglasses. Hit it.
+ * Ok, we've got the only page reference from our mapping
+ * and the page is locked, it's dark out, and we're wearing
+ * sunglasses. Hit it.
*/
unlock_page(page);
wp_page_reuse(vmf);


I added some vmstats that monitor various paths. After one run of
./forceswap 2 1000000 1
I'm left with a rough delta (including some noise) of
anon_wp_copy_count 1799
anon_wp_copy_count_early 1
anon_wp_copy_lock 983396
anon_wp_reuse 0

The relevant part of your reproducer is

for (i = 0; i < nops; i++) {
if (madvise((void *)p, PAGE_SIZE * npages, MADV_PAGEOUT)) {
perror("madvise");
exit(-1);
}

for (j = 0; j < npages; j++) {
c = p[j * PAGE_SIZE];
c++;
time -= rdtscp();
p[j * PAGE_SIZE] = c;
time += rdtscp();
}
}

For this specific reproducer at least, the page lock seems to be the thingy that prohibits
reuse if I interpret the numbers correctly. We pass the initial page_count() check.

Haven't looked into the details, and I would be curious how that performs with actual
workloads, if we can reproduce similar behavior.


[1] https://lkml.kernel.org/r/[email protected]

--
Thanks,

David / dhildenb

2022-01-21 22:35:58

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On Thu, Jan 20, 2022 at 08:55:12PM +0100, David Hildenbrand wrote:
> >>> David, does any of it regards the lru_cache_add() reference issue that I
> >>> mentioned? [1]

> +++ b/mm/memory.c
> @@ -3291,19 +3291,28 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> if (PageAnon(vmf->page)) {
> struct page *page = vmf->page;
>
> - /* PageKsm() doesn't necessarily raise the page refcount */
> - if (PageKsm(page) || page_count(page) != 1)
> + /*
> + * PageKsm() doesn't necessarily raise the page refcount.
> + *
> + * These checks are racy as long as we haven't locked the page;
> + * they are a pure optimization to avoid trying to lock the page
> + * and trying to free the swap cache when there is little hope
> + * it will actually result in a refcount of 1.
> + */
> + if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page))
> goto copy;
> if (!trylock_page(page))
> goto copy;
> - if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
> + if (PageSwapCache(page))
> + try_to_free_swap(page);
> + if (PageKsm(page) || page_count(page) != 1) {
> unlock_page(page);
> goto copy;
> }
> /*
> - * Ok, we've got the only map reference, and the only
> - * page count reference, and the page is locked,
> - * it's dark out, and we're wearing sunglasses. Hit it.
> + * Ok, we've got the only page reference from our mapping
> + * and the page is locked, it's dark out, and we're wearing
> + * sunglasses. Hit it.
> */
> unlock_page(page);
> wp_page_reuse(vmf);
>
>
> I added some vmstats that monitor various paths. After one run of
> ./forceswap 2 1000000 1
> I'm left with a rough delta (including some noise) of
> anon_wp_copy_count 1799
> anon_wp_copy_count_early 1
> anon_wp_copy_lock 983396
> anon_wp_reuse 0
>
> The relevant part of your reproducer is
>
> for (i = 0; i < nops; i++) {
> if (madvise((void *)p, PAGE_SIZE * npages, MADV_PAGEOUT)) {
> perror("madvise");
> exit(-1);
> }
>
> for (j = 0; j < npages; j++) {
> c = p[j * PAGE_SIZE];
> c++;
> time -= rdtscp();
> p[j * PAGE_SIZE] = c;
> time += rdtscp();
> }
> }
>
> For this specific reproducer at least, the page lock seems to be the thingy that prohibits
> reuse if I interpret the numbers correctly. We pass the initial page_count() check.
>
> Haven't looked into the details, and I would be curious how that performs with actual
> workloads, if we can reproduce similar behavior.

I don't see how that patch addresses the lru issue. Wouldn't we need
something like ...

if (!PageLRU(page))
lru_add_drain_all();

2022-01-21 22:36:07

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On 20.01.22 21:07, Matthew Wilcox wrote:
> On Thu, Jan 20, 2022 at 08:55:12PM +0100, David Hildenbrand wrote:
>>>>> David, does any of it regards the lru_cache_add() reference issue that I
>>>>> mentioned? [1]
>
>> +++ b/mm/memory.c
>> @@ -3291,19 +3291,28 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>> if (PageAnon(vmf->page)) {
>> struct page *page = vmf->page;
>>
>> - /* PageKsm() doesn't necessarily raise the page refcount */
>> - if (PageKsm(page) || page_count(page) != 1)
>> + /*
>> + * PageKsm() doesn't necessarily raise the page refcount.
>> + *
>> + * These checks are racy as long as we haven't locked the page;
>> + * they are a pure optimization to avoid trying to lock the page
>> + * and trying to free the swap cache when there is little hope
>> + * it will actually result in a refcount of 1.
>> + */
>> + if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page))
>> goto copy;
>> if (!trylock_page(page))
>> goto copy;
>> - if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
>> + if (PageSwapCache(page))
>> + try_to_free_swap(page);
>> + if (PageKsm(page) || page_count(page) != 1) {
>> unlock_page(page);
>> goto copy;
>> }
>> /*
>> - * Ok, we've got the only map reference, and the only
>> - * page count reference, and the page is locked,
>> - * it's dark out, and we're wearing sunglasses. Hit it.
>> + * Ok, we've got the only page reference from our mapping
>> + * and the page is locked, it's dark out, and we're wearing
>> + * sunglasses. Hit it.
>> */
>> unlock_page(page);
>> wp_page_reuse(vmf);
>>
>>
>> I added some vmstats that monitor various paths. After one run of
>> ./forceswap 2 1000000 1
>> I'm left with a rough delta (including some noise) of
>> anon_wp_copy_count 1799
>> anon_wp_copy_count_early 1
>> anon_wp_copy_lock 983396
>> anon_wp_reuse 0
>>
>> The relevant part of your reproducer is
>>
>> for (i = 0; i < nops; i++) {
>> if (madvise((void *)p, PAGE_SIZE * npages, MADV_PAGEOUT)) {
>> perror("madvise");
>> exit(-1);
>> }
>>
>> for (j = 0; j < npages; j++) {
>> c = p[j * PAGE_SIZE];
>> c++;
>> time -= rdtscp();
>> p[j * PAGE_SIZE] = c;
>> time += rdtscp();
>> }
>> }
>>
>> For this specific reproducer at least, the page lock seems to be the thingy that prohibits
>> reuse if I interpret the numbers correctly. We pass the initial page_count() check.
>>
>> Haven't looked into the details, and I would be curious how that performs with actual
>> workloads, if we can reproduce similar behavior.
>
> I don't see how that patch addresses the lru issue. Wouldn't we need
> something like ...
>
> if (!PageLRU(page))
> lru_add_drain_all();
>

See my other reply "No, unfortunately not in that part of my work.".

Would the lru handling help here where we force swapout of a single
page, reuse code passes the "page_count(page) > 1 + PageSwapCache(page)"
check but fails to lock the page?

--
Thanks,

David / dhildenb

2022-01-21 22:36:11

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On 20.01.22 20:55, David Hildenbrand wrote:
> On 20.01.22 19:11, Nadav Amit wrote:
>>
>>
>>> On Jan 20, 2022, at 10:00 AM, David Hildenbrand <[email protected]> wrote:
>>>
>>> On 20.01.22 18:48, Nadav Amit wrote:
>>>>
>>>>> On Jan 20, 2022, at 6:15 AM, David Hildenbrand <[email protected]> wrote:
>>>>>
>>>>> On 17.01.22 14:31, zhangliang (AG) wrote:
>>>>>> Sure, I will do that :)
>>>>>
>>>>> I'm polishing up / testing the patches and might send something out for discussion shortly.
>>>>> Just a note that on my branch was a version with a wrong condition that should have been fixed now.
>>>>>
>>>>
>>>> Sorry for being late for the discussion.
>>>>
>>>> David, does any of it regards the lru_cache_add() reference issue that I
>>>> mentioned? [1]
>>>
>>> No, unfortunately not in that part of my work. *Maybe* we could also try
>>> to handle that reference similarly to the swapcache, but the question is
>>> if we can't wait for PageAnonExclusive.
>>>
>>> Right now I have the following in mind to get most parts working as
>>> exptected:
>>>
>>> 1. Optimize reuse logic for the swapcache as it seems to be easy
>>> 2. Streamline COW logic and remove reuse_swap_page() -- fix the CVE for
>>> THP.
>>> 3. Introduce PageAnonExclusive and allow FOLL_PIN only on
>>> PageAnonExclusive pages.
>>> 4. Convert O_DIRECT to FOLL_PIN
>>>
>>> We will never ever have to copy a page PageAnonExclusive page in the COW
>>> handler and can immediately reuse it without even locking the page. The
>>> existing reuse logic is essentially then used to reset PageAnonExclusive
>>> on a page (thus it makes sense to work on it) where the flag is not set
>>> anymore -- or on a fresh page if we have to copy.
>>>
>>> That implies that all these additional references won't care if your app
>>> doesn't fork() or KSM isn't active. Consequently, anything that
>>> read-protects anonymous pages will work as expected and should be as
>>> fast as it gets.
>>>
>>> Sounds good? At least to me. If only swap/migration entries wouldn't be
>>> harder to handle than I'd wish, that's why it's taking a little and will
>>> take a little longer.
>>
>> Thanks for the quick response. I would have to see the logic to set/clear
>> PageAnonExclusive to fully understand how things are handled.
>>
>> BTW, I just saw this patch form PeterZ [1] that seems to be related, as
>> it deals with changing protection on pinned pages.
>
> Hi Nadav,
>
> I'm trying to see how effective the following patch is with your forceswap.c [1] reproducer.
>
> commit b08d494deb319a63b7c776636b960258c48775e1
> Author: David Hildenbrand <[email protected]>
> Date: Fri Jan 14 09:29:52 2022 +0100
>
> mm: optimize do_wp_page() for exclusive pages in the swapcache
>
> Let's optimize for a page with a single user that has been added to the
> swapcache. Try removing the swapcache reference if there is hope that
> we're the exclusive user, but keep the page_count(page) == 1 check in
> place.
>
> Avoid using reuse_swap_page(), we'll streamline all reuse_swap_page()
> users next.
>
> While at it, remove the superfluous page_mapcount() check: it's
> implicitly covered by the page_count() for ordinary anon pages.
>
> Signed-off-by: David Hildenbrand <[email protected]>
>
> diff --git a/mm/memory.c b/mm/memory.c
> index f306e698a1e3..d9186981662a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3291,19 +3291,28 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> if (PageAnon(vmf->page)) {
> struct page *page = vmf->page;
>
> - /* PageKsm() doesn't necessarily raise the page refcount */
> - if (PageKsm(page) || page_count(page) != 1)
> + /*
> + * PageKsm() doesn't necessarily raise the page refcount.
> + *
> + * These checks are racy as long as we haven't locked the page;
> + * they are a pure optimization to avoid trying to lock the page
> + * and trying to free the swap cache when there is little hope
> + * it will actually result in a refcount of 1.
> + */
> + if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page))
> goto copy;
> if (!trylock_page(page))
> goto copy;
> - if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
> + if (PageSwapCache(page))
> + try_to_free_swap(page);
> + if (PageKsm(page) || page_count(page) != 1) {
> unlock_page(page);
> goto copy;
> }
> /*
> - * Ok, we've got the only map reference, and the only
> - * page count reference, and the page is locked,
> - * it's dark out, and we're wearing sunglasses. Hit it.
> + * Ok, we've got the only page reference from our mapping
> + * and the page is locked, it's dark out, and we're wearing
> + * sunglasses. Hit it.
> */
> unlock_page(page);
> wp_page_reuse(vmf);
>
>
> I added some vmstats that monitor various paths. After one run of
> ./forceswap 2 1000000 1
> I'm left with a rough delta (including some noise) of
> anon_wp_copy_count 1799
> anon_wp_copy_count_early 1
> anon_wp_copy_lock 983396
> anon_wp_reuse 0
>
> The relevant part of your reproducer is
>
> for (i = 0; i < nops; i++) {
> if (madvise((void *)p, PAGE_SIZE * npages, MADV_PAGEOUT)) {
> perror("madvise");
> exit(-1);
> }
>
> for (j = 0; j < npages; j++) {
> c = p[j * PAGE_SIZE];
> c++;
> time -= rdtscp();
> p[j * PAGE_SIZE] = c;
> time += rdtscp();
> }
> }
>
> For this specific reproducer at least, the page lock seems to be the thingy that prohibits
> reuse if I interpret the numbers correctly. We pass the initial page_count() check.
>
> Haven't looked into the details, and I would be curious how that performs with actual
> workloads, if we can reproduce similar behavior.

I should stop working for today, I messed up the counter names *cries in
German* :(

anon_wp_reuse 1799
anon_wp_copy_count 1
anon_wp_copy_count_early 983396
anon_wp_copy_lock 0

which makes *a lot* more sense and might indicate the PageLRU() issue.

--
Thanks,

David / dhildenb

2022-01-21 22:38:25

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On 20.01.22 21:09, David Hildenbrand wrote:
> On 20.01.22 21:07, Matthew Wilcox wrote:
>> On Thu, Jan 20, 2022 at 08:55:12PM +0100, David Hildenbrand wrote:
>>>>>> David, does any of it regards the lru_cache_add() reference issue that I
>>>>>> mentioned? [1]
>>
>>> +++ b/mm/memory.c
>>> @@ -3291,19 +3291,28 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>>> if (PageAnon(vmf->page)) {
>>> struct page *page = vmf->page;
>>>
>>> - /* PageKsm() doesn't necessarily raise the page refcount */
>>> - if (PageKsm(page) || page_count(page) != 1)
>>> + /*
>>> + * PageKsm() doesn't necessarily raise the page refcount.
>>> + *
>>> + * These checks are racy as long as we haven't locked the page;
>>> + * they are a pure optimization to avoid trying to lock the page
>>> + * and trying to free the swap cache when there is little hope
>>> + * it will actually result in a refcount of 1.
>>> + */
>>> + if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page))
>>> goto copy;
>>> if (!trylock_page(page))
>>> goto copy;
>>> - if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
>>> + if (PageSwapCache(page))
>>> + try_to_free_swap(page);
>>> + if (PageKsm(page) || page_count(page) != 1) {
>>> unlock_page(page);
>>> goto copy;
>>> }
>>> /*
>>> - * Ok, we've got the only map reference, and the only
>>> - * page count reference, and the page is locked,
>>> - * it's dark out, and we're wearing sunglasses. Hit it.
>>> + * Ok, we've got the only page reference from our mapping
>>> + * and the page is locked, it's dark out, and we're wearing
>>> + * sunglasses. Hit it.
>>> */
>>> unlock_page(page);
>>> wp_page_reuse(vmf);
>>>
>>>
>>> I added some vmstats that monitor various paths. After one run of
>>> ./forceswap 2 1000000 1
>>> I'm left with a rough delta (including some noise) of
>>> anon_wp_copy_count 1799
>>> anon_wp_copy_count_early 1
>>> anon_wp_copy_lock 983396
>>> anon_wp_reuse 0
>>>
>>> The relevant part of your reproducer is
>>>
>>> for (i = 0; i < nops; i++) {
>>> if (madvise((void *)p, PAGE_SIZE * npages, MADV_PAGEOUT)) {
>>> perror("madvise");
>>> exit(-1);
>>> }
>>>
>>> for (j = 0; j < npages; j++) {
>>> c = p[j * PAGE_SIZE];
>>> c++;
>>> time -= rdtscp();
>>> p[j * PAGE_SIZE] = c;
>>> time += rdtscp();
>>> }
>>> }
>>>
>>> For this specific reproducer at least, the page lock seems to be the thingy that prohibits
>>> reuse if I interpret the numbers correctly. We pass the initial page_count() check.
>>>
>>> Haven't looked into the details, and I would be curious how that performs with actual
>>> workloads, if we can reproduce similar behavior.
>>
>> I don't see how that patch addresses the lru issue. Wouldn't we need
>> something like ...
>>
>> if (!PageLRU(page))
>> lru_add_drain_all();
>>

lru_add_drain_all() takes a mutex ... best we can do I guess is drain
the local CPU using lru_add_drain(). I'll go play with it and see what
breaks :)

--
Thanks,

David / dhildenb

2022-01-21 22:39:45

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page



> On Jan 20, 2022, at 12:37 PM, David Hildenbrand <[email protected]> wrote:
>
> On 20.01.22 21:09, David Hildenbrand wrote:
>> On 20.01.22 21:07, Matthew Wilcox wrote:
>>> On Thu, Jan 20, 2022 at 08:55:12PM +0100, David Hildenbrand wrote:
>>>>>>> David, does any of it regards the lru_cache_add() reference issue that I
>>>>>>> mentioned? [1]
>>>
>>>> +++ b/mm/memory.c
>>>> @@ -3291,19 +3291,28 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>>>> if (PageAnon(vmf->page)) {
>>>> struct page *page = vmf->page;
>>>>
>>>> - /* PageKsm() doesn't necessarily raise the page refcount */
>>>> - if (PageKsm(page) || page_count(page) != 1)
>>>> + /*
>>>> + * PageKsm() doesn't necessarily raise the page refcount.
>>>> + *
>>>> + * These checks are racy as long as we haven't locked the page;
>>>> + * they are a pure optimization to avoid trying to lock the page
>>>> + * and trying to free the swap cache when there is little hope
>>>> + * it will actually result in a refcount of 1.
>>>> + */
>>>> + if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page))
>>>> goto copy;
>>>> if (!trylock_page(page))
>>>> goto copy;
>>>> - if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
>>>> + if (PageSwapCache(page))
>>>> + try_to_free_swap(page);
>>>> + if (PageKsm(page) || page_count(page) != 1) {
>>>> unlock_page(page);
>>>> goto copy;
>>>> }
>>>> /*
>>>> - * Ok, we've got the only map reference, and the only
>>>> - * page count reference, and the page is locked,
>>>> - * it's dark out, and we're wearing sunglasses. Hit it.
>>>> + * Ok, we've got the only page reference from our mapping
>>>> + * and the page is locked, it's dark out, and we're wearing
>>>> + * sunglasses. Hit it.
>>>> */
>>>> unlock_page(page);
>>>> wp_page_reuse(vmf);
>>>>
>>>>
>>>> I added some vmstats that monitor various paths. After one run of
>>>> ./forceswap 2 1000000 1
>>>> I'm left with a rough delta (including some noise) of
>>>> anon_wp_copy_count 1799
>>>> anon_wp_copy_count_early 1
>>>> anon_wp_copy_lock 983396
>>>> anon_wp_reuse 0
>>>>
>>>> The relevant part of your reproducer is
>>>>
>>>> for (i = 0; i < nops; i++) {
>>>> if (madvise((void *)p, PAGE_SIZE * npages, MADV_PAGEOUT)) {
>>>> perror("madvise");
>>>> exit(-1);
>>>> }
>>>>
>>>> for (j = 0; j < npages; j++) {
>>>> c = p[j * PAGE_SIZE];
>>>> c++;
>>>> time -= rdtscp();
>>>> p[j * PAGE_SIZE] = c;
>>>> time += rdtscp();
>>>> }
>>>> }
>>>>
>>>> For this specific reproducer at least, the page lock seems to be the thingy that prohibits
>>>> reuse if I interpret the numbers correctly. We pass the initial page_count() check.
>>>>
>>>> Haven't looked into the details, and I would be curious how that performs with actual
>>>> workloads, if we can reproduce similar behavior.
>>>
>>> I don't see how that patch addresses the lru issue. Wouldn't we need
>>> something like ...
>>>
>>> if (!PageLRU(page))
>>> lru_add_drain_all();
>>>
>
> lru_add_drain_all() takes a mutex ... best we can do I guess is drain
> the local CPU using lru_add_drain(). I'll go play with it and see what
> breaks :)
>

I did hack something similar and it solved the problem, but I felt it is
a hack. If the thread is scheduled on another core, or if the write fault
is triggered by another thread it wouldn’t work.

If you look for a real-world workload that behaves similarly, you can try
memcached with memory pressure and low latency device (I used
pmem-emulated). This is the workload in which I encountered the issue
first.


2022-01-21 22:40:00

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On 20.01.22 21:46, Nadav Amit wrote:
>
>
>> On Jan 20, 2022, at 12:37 PM, David Hildenbrand <[email protected]> wrote:
>>
>> On 20.01.22 21:09, David Hildenbrand wrote:
>>> On 20.01.22 21:07, Matthew Wilcox wrote:
>>>> On Thu, Jan 20, 2022 at 08:55:12PM +0100, David Hildenbrand wrote:
>>>>>>>> David, does any of it regards the lru_cache_add() reference issue that I
>>>>>>>> mentioned? [1]
>>>>
>>>>> +++ b/mm/memory.c
>>>>> @@ -3291,19 +3291,28 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>>>>> if (PageAnon(vmf->page)) {
>>>>> struct page *page = vmf->page;
>>>>>
>>>>> - /* PageKsm() doesn't necessarily raise the page refcount */
>>>>> - if (PageKsm(page) || page_count(page) != 1)
>>>>> + /*
>>>>> + * PageKsm() doesn't necessarily raise the page refcount.
>>>>> + *
>>>>> + * These checks are racy as long as we haven't locked the page;
>>>>> + * they are a pure optimization to avoid trying to lock the page
>>>>> + * and trying to free the swap cache when there is little hope
>>>>> + * it will actually result in a refcount of 1.
>>>>> + */
>>>>> + if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page))
>>>>> goto copy;
>>>>> if (!trylock_page(page))
>>>>> goto copy;
>>>>> - if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
>>>>> + if (PageSwapCache(page))
>>>>> + try_to_free_swap(page);
>>>>> + if (PageKsm(page) || page_count(page) != 1) {
>>>>> unlock_page(page);
>>>>> goto copy;
>>>>> }
>>>>> /*
>>>>> - * Ok, we've got the only map reference, and the only
>>>>> - * page count reference, and the page is locked,
>>>>> - * it's dark out, and we're wearing sunglasses. Hit it.
>>>>> + * Ok, we've got the only page reference from our mapping
>>>>> + * and the page is locked, it's dark out, and we're wearing
>>>>> + * sunglasses. Hit it.
>>>>> */
>>>>> unlock_page(page);
>>>>> wp_page_reuse(vmf);
>>>>>
>>>>>
>>>>> I added some vmstats that monitor various paths. After one run of
>>>>> ./forceswap 2 1000000 1
>>>>> I'm left with a rough delta (including some noise) of
>>>>> anon_wp_copy_count 1799
>>>>> anon_wp_copy_count_early 1
>>>>> anon_wp_copy_lock 983396
>>>>> anon_wp_reuse 0
>>>>>
>>>>> The relevant part of your reproducer is
>>>>>
>>>>> for (i = 0; i < nops; i++) {
>>>>> if (madvise((void *)p, PAGE_SIZE * npages, MADV_PAGEOUT)) {
>>>>> perror("madvise");
>>>>> exit(-1);
>>>>> }
>>>>>
>>>>> for (j = 0; j < npages; j++) {
>>>>> c = p[j * PAGE_SIZE];
>>>>> c++;
>>>>> time -= rdtscp();
>>>>> p[j * PAGE_SIZE] = c;
>>>>> time += rdtscp();
>>>>> }
>>>>> }
>>>>>
>>>>> For this specific reproducer at least, the page lock seems to be the thingy that prohibits
>>>>> reuse if I interpret the numbers correctly. We pass the initial page_count() check.
>>>>>
>>>>> Haven't looked into the details, and I would be curious how that performs with actual
>>>>> workloads, if we can reproduce similar behavior.
>>>>
>>>> I don't see how that patch addresses the lru issue. Wouldn't we need
>>>> something like ...
>>>>
>>>> if (!PageLRU(page))
>>>> lru_add_drain_all();
>>>>
>>
>> lru_add_drain_all() takes a mutex ... best we can do I guess is drain
>> the local CPU using lru_add_drain(). I'll go play with it and see what
>> breaks :)
>>
>
> I did hack something similar and it solved the problem, but I felt it is
> a hack. If the thread is scheduled on another core, or if the write fault
> is triggered by another thread it wouldn’t work.

Yes, it will not match easily. One question would be how often it would
help in practice and if it would be worth the price.

>
> If you look for a real-world workload that behaves similarly, you can try
> memcached with memory pressure and low latency device (I used
> pmem-emulated). This is the workload in which I encountered the issue
> first.

Yes, I agree. So PageAnonExclusive is our best bet ... hopefully.

--
Thanks,

David / dhildenb

2022-01-22 00:42:34

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

>>
>> I did hack something similar and it solved the problem, but I felt it is
>> a hack. If the thread is scheduled on another core, or if the write fault
>> is triggered by another thread it wouldn’t work.
>
> Yes, it will not match easily. One question would be how often it would
> help in practice and if it would be worth the price.
>


I did some more testing and I have to admit that your reproducer is
really good at finding corner cases.

Assume we try to handle LRU as discussed ... what I get is a delta
during the test: ./forceswap 2 100000 1


anon_wp_reuse 920
-> we were able to reuse
anon_wp_copy_count 0
-> we failed the final page_count() == 1 check
anon_wp_copy_count_early 634
-> we failed the early page_count() check considering swapcache and lru
anon_wp_copy_lock 1
-> we failed trylock
anon_wp_copy_lru 19
-> we failed to clear the lru cache reference
anon_wp_copy_writeback 99974
-> we failed to clear the swapcache reference due to concurrent
writeback
anon_wp_copy_swapcache 0
-> we failed to clear the swapcache reference for other reasons

So, yeah, we mostly always hit writeback in forceswap.c.
reuse_swap_page() would have been able to reuse the page if the swap
backend would have supported concurrent writes during writeback (IIUC,
zswap doesn't).

But I think triggering that case that often really is an oddity about
the test case.

--
Thanks,

David / dhildenb

2022-01-22 02:04:50

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page



> On Jan 21, 2022, at 1:01 AM, David Hildenbrand <[email protected]> wrote:
>
>>>
>>> I did hack something similar and it solved the problem, but I felt it is
>>> a hack. If the thread is scheduled on another core, or if the write fault
>>> is triggered by another thread it wouldn’t work.
>>
>> Yes, it will not match easily. One question would be how often it would
>> help in practice and if it would be worth the price.
>>
>
>
> I did some more testing and I have to admit that your reproducer is
> really good at finding corner cases.
>
> Assume we try to handle LRU as discussed ... what I get is a delta
> during the test: ./forceswap 2 100000 1
>
>
> anon_wp_reuse 920
> -> we were able to reuse
> anon_wp_copy_count 0
> -> we failed the final page_count() == 1 check
> anon_wp_copy_count_early 634
> -> we failed the early page_count() check considering swapcache and lru
> anon_wp_copy_lock 1
> -> we failed trylock
> anon_wp_copy_lru 19
> -> we failed to clear the lru cache reference
> anon_wp_copy_writeback 99974
> -> we failed to clear the swapcache reference due to concurrent
> writeback
> anon_wp_copy_swapcache 0
> -> we failed to clear the swapcache reference for other reasons
>
> So, yeah, we mostly always hit writeback in forceswap.c.
> reuse_swap_page() would have been able to reuse the page if the swap
> backend would have supported concurrent writes during writeback (IIUC,
> zswap doesn't).
>
> But I think triggering that case that often really is an oddity about
> the test case.

I am glad you find it useful (not my greatest piece of work).

IIRC, I encountered the scenario you describe. It happens when you use
a device driver that uses async operations (most of them). If you use
pmem, it does not happen.

This behavior is not intentional, anyhow.