With commit 09854ba94c6a ("mm: do_wp_page() simplification"), after
COW, an idle swap cache page (neither the page nor the corresponding swap
entry is mapped by any process) will be left at its original position
in the LRU list. It may be in the active list or at the head of
the inactive list, so vmscan may take more overhead or time to
reclaim these actually unused pages.
To help page reclaim, this patch tries to move the idle swap cache
page to the tail of the inactive LRU list after COW. To avoid
introducing much overhead in the hot COW code path, all locks are
acquired with trylock only.
To test the patch, we used the pmbench memory accessing benchmark with
a working set larger than available memory on a 2-socket Intel server
with an NVMe SSD as the swap device. Test results show that the pmbench
score increases by up to 21.8% with decreased swap cache size and
swapin throughput.
Signed-off-by: "Huang, Ying" <[email protected]>
Suggested-by: Matthew Wilcox <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Tim Chen <[email protected]>
---
include/linux/memcontrol.h | 10 ++++++++++
include/linux/swap.h | 3 +++
mm/memcontrol.c | 12 ++++++++++++
mm/memory.c | 5 +++++
mm/swapfile.c | 29 +++++++++++++++++++++++++++++
5 files changed, 59 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0ce97eff79e2..68956db13772 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -761,6 +761,7 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
struct lruvec *lock_page_lruvec(struct page *page);
struct lruvec *lock_page_lruvec_irq(struct page *page);
+struct lruvec *trylock_page_lruvec_irq(struct page *page);
struct lruvec *lock_page_lruvec_irqsave(struct page *page,
unsigned long *flags);
@@ -1251,6 +1252,15 @@ static inline struct lruvec *lock_page_lruvec_irq(struct page *page)
return &pgdat->__lruvec;
}
+static inline struct lruvec *trylock_page_lruvec_irq(struct page *page)
+{
+ struct pglist_data *pgdat = page_pgdat(page);
+
+ if (spin_trylock_irq(&pgdat->__lruvec.lru_lock))
+ return &pgdat->__lruvec;
+ return NULL;
+}
+
static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
unsigned long *flagsp)
{
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46d51d058d05..d344b0fa7925 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -504,6 +504,7 @@ extern struct swap_info_struct *page_swap_info(struct page *);
extern struct swap_info_struct *swp_swap_info(swp_entry_t entry);
extern bool reuse_swap_page(struct page *, int *);
extern int try_to_free_swap(struct page *);
+extern void try_to_free_idle_swapcache(struct page *page);
struct backing_dev_info;
extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
extern void exit_swap_address_space(unsigned int type);
@@ -668,6 +669,8 @@ static inline int try_to_free_swap(struct page *page)
return 0;
}
+static inline void try_to_free_idle_swapcache(struct page *page) {}
+
static inline swp_entry_t get_swap_page(struct page *page)
{
swp_entry_t entry;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index db29b96f7311..e3e813bfebe2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1213,6 +1213,18 @@ struct lruvec *lock_page_lruvec_irq(struct page *page)
return lruvec;
}
+struct lruvec *trylock_page_lruvec_irq(struct page *page)
+{
+ struct lruvec *lruvec;
+
+ lruvec = mem_cgroup_page_lruvec(page);
+ if (spin_trylock_irq(&lruvec->lru_lock)) {
+ lruvec_memcg_debug(lruvec, page);
+ return lruvec;
+ }
+ return NULL;
+}
+
struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
{
struct lruvec *lruvec;
diff --git a/mm/memory.c b/mm/memory.c
index b83f734c4e1d..2b6847f4c03e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3012,6 +3012,11 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
munlock_vma_page(old_page);
unlock_page(old_page);
}
+ if (page_copied && PageSwapCache(old_page) &&
+ !page_mapped(old_page) && trylock_page(old_page)) {
+ try_to_free_idle_swapcache(old_page);
+ unlock_page(old_page);
+ }
put_page(old_page);
}
return page_copied ? VM_FAULT_WRITE : 0;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2aad85751991..e0dd8937de4e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -40,6 +40,7 @@
#include <linux/swap_slots.h>
#include <linux/sort.h>
#include <linux/completion.h>
+#include <linux/mm_inline.h>
#include <asm/tlbflush.h>
#include <linux/swapops.h>
@@ -1788,6 +1789,34 @@ int try_to_free_swap(struct page *page)
return 1;
}
+void try_to_free_idle_swapcache(struct page *page)
+{
+ struct lruvec *lruvec;
+ swp_entry_t entry;
+
+ if (!PageSwapCache(page))
+ return;
+ if (PageWriteback(page))
+ return;
+ if (!PageLRU(page))
+ return;
+ if (page_mapped(page))
+ return;
+ entry.val = page_private(page);
+ if (__swap_count(entry))
+ return;
+
+ lruvec = trylock_page_lruvec_irq(page);
+ if (!lruvec)
+ return;
+
+ del_page_from_lru_list(page, lruvec);
+ ClearPageActive(page);
+ ClearPageReferenced(page);
+ add_page_to_lru_list_tail(page, lruvec);
+
+ unlock_page_lruvec_irq(lruvec);
+}
/*
* Free the swap entry like above, but also try to
* free the page cache entry if it is the last user.
--
2.30.2
On Wed, 2021-05-19 at 09:33 +0800, Huang Ying wrote:
> To test the patch, we used the pmbench memory accessing benchmark with
> a working set larger than available memory on a 2-socket Intel server
> with an NVMe SSD as the swap device. Test results show that the pmbench
> score increases by up to 21.8% with decreased swap cache size and
> swapin throughput.
Nice!
> +++ b/mm/memory.c
> @@ -3012,6 +3012,11 @@ static vm_fault_t wp_page_copy(struct vm_fault
> *vmf)
> munlock_vma_page(old_page);
> unlock_page(old_page);
> }
> + if (page_copied && PageSwapCache(old_page) &&
> + !page_mapped(old_page) && trylock_page(old_page)) {
> + try_to_free_idle_swapcache(old_page);
> + unlock_page(old_page);
> + }
That's quite the if condition!
Would it make sense to move some of the tests, as well
as the trylock and unlock into try_to_free_idle_swapcache()
itself?
Especially considering that page_mapped is already tested
in that function, too...
> put_page(old_page);
> }
--
All Rights Reversed.
On Tue, May 18, 2021 at 5:17 PM Linus Torvalds
<[email protected]> wrote:
>
> This looks sensible to me (and numbers talk!), but as Rik says, it
> would probably be a good idea to move the trylock_page()/unlock_page()
> into try_to_free_idle_swapcache(), and that would make the calling
> side a whole lot cleaner and easier to read.
To keep the error handling simple, and keep that "if that didn't work,
just return" logic you had, doing it as two functions like:
static inline void locked_try_to_free_idle_swapcache(struct page *page)
{ .. your current try_to_free_idle_swapcache() .. }

void try_to_free_idle_swapcache(struct page *page)
{
        if (trylock_page(page)) {
                locked_try_to_free_idle_swapcache(page);
                unlock_page(page);
        }
}
would keep that readability and simplicity.
And then the wp_page_copy() code ends up being
        if (page_copied && PageSwapCache(old_page) && !page_mapped(old_page))
                try_to_free_idle_swapcache(old_page);
which looks pretty sensible to me: if we copied the page, and the old
page is a no longer mapped swap cache page, let's try to free it.
That's still a hell of a long conditional, partly because of those
long names. But at least it's conceptually fairly straightforward and
easy to understand what's going on.
No?
Linus
On Tue, May 18, 2021 at 3:33 PM Huang Ying <[email protected]> wrote:
>
> With commit 09854ba94c6a ("mm: do_wp_page() simplification"), after
> COW, an idle swap cache page (neither the page nor the corresponding swap
> entry is mapped by any process) will be left at its original position
> in the LRU list. It may be in the active list or at the head of
> the inactive list, so vmscan may take more overhead or time to
> reclaim these actually unused pages.
This looks sensible to me (and numbers talk!), but as Rik says, it
would probably be a good idea to move the trylock_page()/unlock_page()
into try_to_free_idle_swapcache(), and that would make the calling
side a whole lot cleaner and easier to read.
Linus
Rik van Riel <[email protected]> writes:
> On Wed, 2021-05-19 at 09:33 +0800, Huang Ying wrote:
>
>> To test the patch, we used the pmbench memory accessing benchmark with
>> a working set larger than available memory on a 2-socket Intel server
>> with an NVMe SSD as the swap device. Test results show that the pmbench
>> score increases by up to 21.8% with decreased swap cache size and
>> swapin throughput.
>
> Nice!
>
>> +++ b/mm/memory.c
>> @@ -3012,6 +3012,11 @@ static vm_fault_t wp_page_copy(struct vm_fault
>> *vmf)
>> munlock_vma_page(old_page);
>> unlock_page(old_page);
>> }
>> + if (page_copied && PageSwapCache(old_page) &&
>> + !page_mapped(old_page) && trylock_page(old_page)) {
>> + try_to_free_idle_swapcache(old_page);
>> + unlock_page(old_page);
>> + }
>
> That's quite the if condition!
>
> Would it make sense to move some of the tests, as well
> as the trylock and unlock into try_to_free_idle_swapcache()
> itself?
Sure. Will put trylock/unlock into try_to_free_idle_swapcache() as
suggested by Linus.
> Especially considering that page_mapped is already tested
> in that function, too...
The two page_mapped() tests are intentional. The first is a quick
check with the page unlocked; the second confirms with the page
locked, because while the page is unlocked, the swap count may be
converted into a map count or vice versa.
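Roughly, the intended split looks like the following sketch (a sketch
only, assuming the trylock moves into the callee as Linus suggested;
locked_try_to_free_idle_swapcache() stands for the current function
body in this patch):

        /* wp_page_copy(): quick check while old_page is still unlocked */
        if (page_copied && PageSwapCache(old_page) && !page_mapped(old_page))
                try_to_free_idle_swapcache(old_page);

void try_to_free_idle_swapcache(struct page *page)
{
        if (!trylock_page(page))
                return;
        /*
         * locked_try_to_free_idle_swapcache() re-checks page_mapped()
         * and the swap count under the page lock: the caller's unlocked
         * check may be stale, because the swap count can be converted
         * into a map count (or back) while the page is unlocked.
         */
        locked_try_to_free_idle_swapcache(page);
        unlock_page(page);
}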
Best Regards,
Huang, Ying
On Wed, May 19, 2021 at 09:33:13AM +0800, Huang Ying wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index b83f734c4e1d..2b6847f4c03e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3012,6 +3012,11 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
> munlock_vma_page(old_page);
> unlock_page(old_page);
> }
> + if (page_copied && PageSwapCache(old_page) &&
> + !page_mapped(old_page) && trylock_page(old_page)) {
> + try_to_free_idle_swapcache(old_page);
> + unlock_page(old_page);
If there are no more swap or pte references, can we just attempt to
free the page right away, like we do during regular unmap?
        if (page_copied)
                free_swap_cache(old_page);
        put_page(old_page);
Linus Torvalds <[email protected]> writes:
> On Tue, May 18, 2021 at 5:17 PM Linus Torvalds
> <[email protected]> wrote:
>>
>> This looks sensible to me (and numbers talk!), but as Rik says, it
>> would probably be a good idea to move the trylock_page()/unlock_page()
>> into try_to_free_idle_swapcache(), and that would make the calling
>> side a whole lot cleaner and easier to read.
>
> To keep the error handling simple, and keep that "if that didn't work,
> just return" logic you had, doing it as two functions like:
>
> static inline void locked_try_to_free_idle_swapcache(struct page *page)
> { .. your current try_to_free_idle_swapcache() .. }
>
> void try_to_free_idle_swapcache(struct page *page)
> {
> if (trylock_page(page)) {
> locked_try_to_free_idle_swapcache(page);
> unlock_page(page);
> }
> }
>
> would keep that readability and simplicity.
>
> And then the wp_page_copy() code ends up being
>
> if (page_copied && PageSwapCache(old_page) && !page_mapped(old_page))
> try_to_free_idle_swapcache(old_page);
>
> which looks pretty sensible to me: if we copied the page, and the old
> page is a no longer mapped swap cache page, let's try to free it.
>
> That's still a hell of a long conditional, partly because of those
> long names. But at least it's conceptually fairly straightforward and
> easy to understand what's going on.
Thanks! That looks much better. I will do that in the next version.
Best Regards,
Huang, Ying
Johannes Weiner <[email protected]> writes:
> On Wed, May 19, 2021 at 09:33:13AM +0800, Huang Ying wrote:
>> diff --git a/mm/memory.c b/mm/memory.c
>> index b83f734c4e1d..2b6847f4c03e 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3012,6 +3012,11 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>> munlock_vma_page(old_page);
>> unlock_page(old_page);
>> }
>> + if (page_copied && PageSwapCache(old_page) &&
>> + !page_mapped(old_page) && trylock_page(old_page)) {
>> + try_to_free_idle_swapcache(old_page);
>> + unlock_page(old_page);
>
> If there are no more swap or pte references, can we just attempt to
> free the page right away, like we do during regular unmap?
>
> if (page_copied)
> free_swap_cache(old_page);
> put_page(old_page);
A previous version of the patch does roughly this.
https://lore.kernel.org/lkml/[email protected]/
But Linus has concerns with the overhead introduced in the hot COW path.
Another possibility is to move the idle swap cache page to the tail of
the file LRU list. But the question is how to identify the page.
Best Regards,
Huang, Ying
On Thu, May 20, 2021 at 09:22:45AM +0800, Huang, Ying wrote:
> Johannes Weiner <[email protected]> writes:
>
> > On Wed, May 19, 2021 at 09:33:13AM +0800, Huang Ying wrote:
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index b83f734c4e1d..2b6847f4c03e 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -3012,6 +3012,11 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
> >> munlock_vma_page(old_page);
> >> unlock_page(old_page);
> >> }
> >> + if (page_copied && PageSwapCache(old_page) &&
> >> + !page_mapped(old_page) && trylock_page(old_page)) {
> >> + try_to_free_idle_swapcache(old_page);
> >> + unlock_page(old_page);
> >
> > If there are no more swap or pte references, can we just attempt to
> > free the page right away, like we do during regular unmap?
> >
> > if (page_copied)
> > free_swap_cache(old_page);
> > put_page(old_page);
>
> A previous version of the patch does roughly this.
>
> https://lore.kernel.org/lkml/[email protected]/
>
> But Linus has concerns with the overhead introduced in the hot COW path.
Sorry, I had missed that thread.
It sounds like there were the same concerns about the LRU shuffling
overhead in the COW path. Now we have numbers for that, but not the
free_swap_cache version. Would you be able to run the numbers for that
as well? It would be interesting to see how much the additional code
complexity buys us.
> Another possibility is to move the idle swap cache page to the tail of
> the file LRU list. But the question is how to identify the page.
The LRU type is identified by PG_swapbacked, and we do clear that for
anon pages to implement MADV_FREE. It may work here too. But I'm
honestly a bit skeptical about the ROI on this...
Johannes Weiner <[email protected]> writes:
> On Thu, May 20, 2021 at 09:22:45AM +0800, Huang, Ying wrote:
>> Johannes Weiner <[email protected]> writes:
>>
>> > On Wed, May 19, 2021 at 09:33:13AM +0800, Huang Ying wrote:
>> >> diff --git a/mm/memory.c b/mm/memory.c
>> >> index b83f734c4e1d..2b6847f4c03e 100644
>> >> --- a/mm/memory.c
>> >> +++ b/mm/memory.c
>> >> @@ -3012,6 +3012,11 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>> >> munlock_vma_page(old_page);
>> >> unlock_page(old_page);
>> >> }
>> >> + if (page_copied && PageSwapCache(old_page) &&
>> >> + !page_mapped(old_page) && trylock_page(old_page)) {
>> >> + try_to_free_idle_swapcache(old_page);
>> >> + unlock_page(old_page);
>> >
>> > If there are no more swap or pte references, can we just attempt to
>> > free the page right away, like we do during regular unmap?
>> >
>> > if (page_copied)
>> > free_swap_cache(old_page);
>> > put_page(old_page);
>>
>> A previous version of the patch does roughly this.
>>
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> But Linus has concerns with the overhead introduced in the hot COW path.
>
> Sorry, I had missed that thread.
>
> It sounds like there were the same concerns about the LRU shuffling
> overhead in the COW page. Now we have numbers for that, but not the
> free_swap_cache version. Would you be able to run the numbers for that
> as well? It would be interesting to see how much the additional code
> complexity buys us.
The number for which workload? The workload that is used to evaluate
this patch?
>> Another possibility is to move the idle swap cache page to the tail of
>> the file LRU list. But the question is how to identify the page.
>
> The LRU type is identified by PG_swapbacked, and we do clear that for
> anon pages to implement MADV_FREE. It may work here too. But I'm
> honestly a bit skeptical about the ROI on this...
The definition of PageSwapCache() is
static __always_inline int PageSwapCache(struct page *page)
{
#ifdef CONFIG_THP_SWAP
        page = compound_head(page);
#endif
        return PageSwapBacked(page) && test_bit(PG_swapcache, &page->flags);
}
So we cannot clear PG_swapbacked directly.
Best Regards,
Huang, Ying
On Thu, May 20, 2021 at 09:59:15AM +0800, Huang, Ying wrote:
> Johannes Weiner <[email protected]> writes:
>
> > On Thu, May 20, 2021 at 09:22:45AM +0800, Huang, Ying wrote:
> >> Johannes Weiner <[email protected]> writes:
> >>
> >> > On Wed, May 19, 2021 at 09:33:13AM +0800, Huang Ying wrote:
> >> >> diff --git a/mm/memory.c b/mm/memory.c
> >> >> index b83f734c4e1d..2b6847f4c03e 100644
> >> >> --- a/mm/memory.c
> >> >> +++ b/mm/memory.c
> >> >> @@ -3012,6 +3012,11 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
> >> >> munlock_vma_page(old_page);
> >> >> unlock_page(old_page);
> >> >> }
> >> >> + if (page_copied && PageSwapCache(old_page) &&
> >> >> + !page_mapped(old_page) && trylock_page(old_page)) {
> >> >> + try_to_free_idle_swapcache(old_page);
> >> >> + unlock_page(old_page);
> >> >
> >> > If there are no more swap or pte references, can we just attempt to
> >> > free the page right away, like we do during regular unmap?
> >> >
> >> > if (page_copied)
> >> > free_swap_cache(old_page);
> >> > put_page(old_page);
> >>
> >> A previous version of the patch does roughly this.
> >>
> >> https://lore.kernel.org/lkml/[email protected]/
> >>
> >> But Linus has concerns with the overhead introduced in the hot COW path.
> >
> > Sorry, I had missed that thread.
> >
> > It sounds like there were the same concerns about the LRU shuffling
> > overhead in the COW page. Now we have numbers for that, but not the
> > free_swap_cache version. Would you be able to run the numbers for that
> > as well? It would be interesting to see how much the additional code
> > complexity buys us.
>
> The number for which workload? The workload that is used to evaluate
> this patch?
Yeah, the pmbench one from the changelog.
Johannes Weiner <[email protected]> writes:
> On Thu, May 20, 2021 at 09:59:15AM +0800, Huang, Ying wrote:
>> Johannes Weiner <[email protected]> writes:
>>
>> > On Thu, May 20, 2021 at 09:22:45AM +0800, Huang, Ying wrote:
>> >> Johannes Weiner <[email protected]> writes:
>> >>
>> >> > On Wed, May 19, 2021 at 09:33:13AM +0800, Huang Ying wrote:
>> >> >> diff --git a/mm/memory.c b/mm/memory.c
>> >> >> index b83f734c4e1d..2b6847f4c03e 100644
>> >> >> --- a/mm/memory.c
>> >> >> +++ b/mm/memory.c
>> >> >> @@ -3012,6 +3012,11 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>> >> >> munlock_vma_page(old_page);
>> >> >> unlock_page(old_page);
>> >> >> }
>> >> >> + if (page_copied && PageSwapCache(old_page) &&
>> >> >> + !page_mapped(old_page) && trylock_page(old_page)) {
>> >> >> + try_to_free_idle_swapcache(old_page);
>> >> >> + unlock_page(old_page);
>> >> >
>> >> > If there are no more swap or pte references, can we just attempt to
>> >> > free the page right away, like we do during regular unmap?
>> >> >
>> >> > if (page_copied)
>> >> > free_swap_cache(old_page);
>> >> > put_page(old_page);
>> >>
>> >> A previous version of the patch does roughly this.
>> >>
>> >> https://lore.kernel.org/lkml/[email protected]/
>> >>
>> >> But Linus has concerns with the overhead introduced in the hot COW path.
>> >
>> > Sorry, I had missed that thread.
>> >
>> > It sounds like there were the same concerns about the LRU shuffling
>> > overhead in the COW page. Now we have numbers for that, but not the
>> > free_swap_cache version. Would you be able to run the numbers for that
>> > as well? It would be interesting to see how much the additional code
>> > complexity buys us.
>>
>> The number for which workload? The workload that is used to evaluate
>> this patch?
>
> Yeah, the pmbench one from the changelog.
Sure. I have rebased the original patch that frees the idle swap cache
directly and rerun the test. The results show that the pmbench score of
freeing directly is a little better than that of moving to the tail of
the LRU: the pmbench score increases by about 3.6%. I think this is
expected, because we eventually need to free the page anyway, even if
we move the idle swap cache to the tail of the LRU.
Best Regards,
Huang, Ying