2020-10-29 10:48:25

by Alex Shi

Subject: [PATCH v20 00/20] per memcg lru lock

This version is just a rebase onto v5.10-rc1, and moves the lru_lock position
down after lists[] in lruvec, which resolves a fio.read regression
revealed by Rong Chen -- Intel LKP.

Many thanks to Hugh Dickins and Alexander Duyck for their line-by-line reviews.

This patchset now consists of 3 parts:
1, some code cleanup and minor optimization as preparation.
2, use TestClearPageLRU as the precondition for page isolation.
3, replace the per-node lru_lock with a per-memcg, per-node lru_lock.

The current lru_lock is one per node, pgdat->lru_lock, guarding the lru
lists; but the lru lists were moved into the memcg a long time ago. Still
using a per-node lru_lock is clearly unscalable: pages in different memcgs
have to compete with each other for a single lru_lock. This patchset uses a
per-lruvec/memcg lru_lock in place of the per-node lru lock to guard the lru
lists, making them scalable across memcgs and gaining performance.
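As a rough structural illustration, here is a userspace C sketch of the
change (stand-in "toy" types only, not the kernel definitions; a
pthread_mutex_t stands in for the kernel spinlock):

/*
 * Simplified model of the locking change, NOT the kernel structures:
 * before, every lruvec of a node shared the single pgdat->lru_lock;
 * after, each per-memcg, per-node lruvec carries its own lock,
 * placed after lists[] as noted above.
 */
#include <pthread.h>

struct list_head { struct list_head *prev, *next; };

/* before: one lock per node, contended by all memcgs of that node */
struct toy_pglist_data {
	pthread_mutex_t lru_lock;
};
struct toy_lruvec_old {
	struct list_head lists[5];	/* NR_LRU_LISTS */
	struct toy_pglist_data *pgdat;	/* the lock lives over there */
};

/* after: each lruvec guards its own lists */
struct toy_lruvec_new {
	struct list_head lists[5];
	pthread_mutex_t lru_lock;	/* per lruvec, after lists[] */
};

int main(void) { return 0; }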

Currently lru_lock guards both the lru list and the page's lru bit, which is
fine. But if we want to take a specific lruvec lock for a page, we need to
pin down the page's lruvec/memcg while locking. Just taking the lruvec lock
first can be undermined by the page's memcg charge/migration. To fix this
problem, we pull the clearing of the page's lru bit out from under the lock
and use it as the pin-down action that blocks memcg changes. That is the
reason for the new atomic function TestClearPageLRU. So isolating a page now
needs both actions: TestClearPageLRU and holding the lru_lock.
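A minimal userspace model of that rule (a sketch only: the names, types and
locking primitives below are stand-ins built on C11 atomics and a pthread
mutex, not kernel code): only the thread that wins the atomic clear of the
lru bit goes on to take the lru lock and unlink the page, and while it owns
that win the page's lruvec/memcg cannot be changed under it.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <pthread.h>

#define PG_lru (1u << 0)

struct toy_lruvec { pthread_mutex_t lru_lock; };
struct toy_page {
	atomic_uint flags;
	atomic_int refcount;
	struct toy_lruvec *lruvec;	/* only changed by a path that also
					 * clears PG_lru first, so the winner
					 * of the clear sees it stable */
};

/* models TestClearPageLRU(): atomically clear the bit, report old value */
static bool toy_test_clear_lru(struct toy_page *p)
{
	return atomic_fetch_and(&p->flags, ~PG_lru) & PG_lru;
}

static bool toy_isolate(struct toy_page *p)
{
	atomic_fetch_add(&p->refcount, 1);		/* pin against freeing */
	if (!toy_test_clear_lru(p)) {			/* someone else owns it */
		atomic_fetch_sub(&p->refcount, 1);
		return false;
	}
	pthread_mutex_lock(&p->lruvec->lru_lock);	/* lruvec is stable now */
	/* ... delete the page from the lru list here ... */
	pthread_mutex_unlock(&p->lruvec->lru_lock);
	return true;
}

int main(void)
{
	struct toy_lruvec lv = { .lru_lock = PTHREAD_MUTEX_INITIALIZER };
	struct toy_page page = { .flags = PG_lru, .refcount = 1, .lruvec = &lv };

	printf("first isolate:  %d\n", toy_isolate(&page));	/* 1 */
	printf("second isolate: %d\n", toy_isolate(&page));	/* 0 */
	return 0;
}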

The typical user of this is isolate_migratepages_block() in compaction.c:
we have to take the lru bit before the lru lock, which serializes page
isolation against memcg page charge/migration, since those change the page's
lruvec and therefore which lru_lock applies.

The above solution was suggested by Johannes Weiner and builds on his new
memcg charge path; this patchset follows from that. (Hugh Dickins tested and
contributed much code, from the compaction fix to general code polish,
thanks a lot!)

Daniel Jordan's testing showed a 62% improvement on a modified readtwice
case on his 2P * 10 core * 2 HT Broadwell box with v18, which is not much
different from this v20.
https://lore.kernel.org/lkml/[email protected]/

Thanks to Hugh Dickins and Konstantin Khlebnikov, who both brought up this
idea 8 years ago, and to the others who gave comments as well: Daniel Jordan,
Mel Gorman, Shakeel Butt, Matthew Wilcox, Alexander Duyck etc.

Thanks for the testing support from Intel 0day and Rong Chen, Fengguang Wu,
and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!

Alex Shi (17):
mm/memcg: warning on !memcg after readahead page charged
mm/memcg: bail early from swap accounting if memcg disabled
mm/thp: move lru_add_page_tail func to huge_memory.c
mm/thp: use head for head page in lru_add_page_tail
mm/thp: Simplify lru_add_page_tail()
mm/thp: narrow lru locking
mm/vmscan: remove unnecessary lruvec adding
mm/memcg: add debug checking in lock_page_memcg
mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn
mm/lru: move lock into lru_note_cost
mm/vmscan: remove lruvec reget in move_pages_to_lru
mm/mlock: remove lru_lock on TestClearPageMlocked
mm/mlock: remove __munlock_isolate_lru_page
mm/lru: introduce TestClearPageLRU
mm/compaction: do page isolation first in compaction
mm/swap.c: serialize memcg changes in pagevec_lru_move_fn
mm/lru: replace pgdat lru_lock with lruvec lock

Alexander Duyck (1):
mm/lru: introduce the relock_page_lruvec function

Hugh Dickins (2):
mm: page_idle_get_page() does not need lru_lock
mm/lru: revise the comments of lru_lock

Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +-
Documentation/admin-guide/cgroup-v1/memory.rst | 21 +--
Documentation/trace/events-kmem.rst | 2 +-
Documentation/vm/unevictable-lru.rst | 22 +--
include/linux/memcontrol.h | 110 +++++++++++
include/linux/mm_types.h | 2 +-
include/linux/mmdebug.h | 13 ++
include/linux/mmzone.h | 6 +-
include/linux/page-flags.h | 1 +
include/linux/swap.h | 4 +-
mm/compaction.c | 94 +++++++---
mm/filemap.c | 4 +-
mm/huge_memory.c | 45 +++--
mm/memcontrol.c | 85 ++++++++-
mm/mlock.c | 63 ++-----
mm/mmzone.c | 1 +
mm/page_alloc.c | 1 -
mm/page_idle.c | 4 -
mm/rmap.c | 4 +-
mm/swap.c | 199 ++++++++------------
mm/vmscan.c | 203 +++++++++++----------
mm/workingset.c | 2 -
22 files changed, 523 insertions(+), 378 deletions(-)

--
1.8.3.1


2020-10-29 10:48:25

by Alex Shi

Subject: [PATCH v20 05/20] mm/thp: Simplify lru_add_page_tail()

Simplify lru_add_page_tail(), there are actually only two cases possible:
split_huge_page_to_list(), with list supplied and head isolated from lru
by its caller; or split_huge_page(), with NULL list and head on lru -
because when head is racily isolated from lru, the isolator's reference
will stop the split from getting any further than its page_ref_freeze().

So decide between the two cases by "list", but add VM_WARN_ON()s to
verify that they match our lru expectations.

[Hugh Dickins: rewrite commit log]
Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Mika Penttilä <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/huge_memory.c | 21 ++++++---------------
1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 93c0b73eb8c6..4b72dd7b8b34 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2354,25 +2354,16 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail,
VM_BUG_ON_PAGE(PageLRU(page_tail), head);
lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);

- if (!list)
- SetPageLRU(page_tail);
-
- if (likely(PageLRU(head)))
- list_add_tail(&page_tail->lru, &head->lru);
- else if (list) {
+ if (list) {
/* page reclaim is reclaiming a huge page */
+ VM_WARN_ON(PageLRU(head));
get_page(page_tail);
list_add_tail(&page_tail->lru, list);
} else {
- /*
- * Head page has not yet been counted, as an hpage,
- * so we must account for each subpage individually.
- *
- * Put page_tail on the list at the correct position
- * so they all end up in order.
- */
- add_page_to_lru_list_tail(page_tail, lruvec,
- page_lru(page_tail));
+ /* head is still on lru (and we have it frozen) */
+ VM_WARN_ON(!PageLRU(head));
+ SetPageLRU(page_tail);
+ list_add_tail(&page_tail->lru, &head->lru);
}
}

--
1.8.3.1

2020-10-29 10:48:33

by Alex Shi

Subject: [PATCH v20 14/20] mm/mlock: remove __munlock_isolate_lru_page

The function has only one caller; open-code it there to clean up and
simplify the code.

Signed-off-by: Alex Shi <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/mlock.c | 31 +++++++++----------------------
1 file changed, 9 insertions(+), 22 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 796c726a0407..d487aa864e86 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -106,26 +106,6 @@ void mlock_vma_page(struct page *page)
}

/*
- * Isolate a page from LRU with optional get_page() pin.
- * Assumes lru_lock already held and page already pinned.
- */
-static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
-{
- if (PageLRU(page)) {
- struct lruvec *lruvec;
-
- lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
- if (getpage)
- get_page(page);
- ClearPageLRU(page);
- del_page_from_lru_list(page, lruvec, page_lru(page));
- return true;
- }
-
- return false;
-}
-
-/*
* Finish munlock after successful page isolation
*
* Page must be locked. This is a wrapper for try_to_munlock()
@@ -296,9 +276,16 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
* We already have pin from follow_page_mask()
* so we can spare the get_page() here.
*/
- if (__munlock_isolate_lru_page(page, false))
+ if (PageLRU(page)) {
+ struct lruvec *lruvec;
+
+ ClearPageLRU(page);
+ lruvec = mem_cgroup_page_lruvec(page,
+ page_pgdat(page));
+ del_page_from_lru_list(page, lruvec,
+ page_lru(page));
continue;
- else
+ } else
__munlock_isolation_failed(page);
} else {
delta_munlocked++;
--
1.8.3.1

2020-10-29 10:48:37

by Alex Shi

Subject: [PATCH v20 15/20] mm/lru: introduce TestClearPageLRU

Currently lru_lock guards both the lru list and the page's lru bit, which is
fine. But if we want to take a specific lruvec lock for a page, we need to
pin down the page's lruvec/memcg while locking. Just taking the lruvec lock
first can be undermined by the page's memcg charge/migration. To fix this
problem, we clear the lru bit outside of the lock and use that as the
pin-down action to block page isolation during a memcg change.

So the standard steps of page isolation are now the following:
1, get_page();			#pin the page to avoid it being freed
2, TestClearPageLRU();		#block other isolation, like a memcg change
3, spin_lock on lru_lock;	#serialize lru list access
4, delete page from lru list;
Step 2 could be optimized out or replaced in scenarios where the page is
unlikely to be accessed or to be moved between memcgs.

This patch starts with the first part: TestClearPageLRU, which combines the
PageLRU check and ClearPageLRU into one atomic function, TestClearPageLRU.
This function will be used as the page isolation precondition, to prevent
other isolations happening somewhere else. As a consequence there may be
!PageLRU pages on an lru list, so the corresponding BUG() checks need to be
removed.
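A tiny userspace demonstration of why the check and the clear need to be one
atomic operation (stand-in names, C11 atomics; not kernel code): however many
isolators race, exactly one sees the lru bit set, whereas a separate
"if (PageLRU) ... ClearPageLRU" pair would let several pass the check before
any of them clears the bit.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

#define PG_lru (1u << 0)

static atomic_uint page_flags = PG_lru;
static atomic_int winners;

static void *isolator(void *arg)
{
	(void)arg;
	/* models TestClearPageLRU(): the fetch-and returns the old flags,
	 * so only one of the racing threads can observe PG_lru still set */
	if (atomic_fetch_and(&page_flags, ~PG_lru) & PG_lru)
		atomic_fetch_add(&winners, 1);
	return NULL;
}

int main(void)
{
	pthread_t t[4];

	for (int i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, isolator, NULL);
	for (int i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	printf("winners = %d\n", atomic_load(&winners));	/* always 1 */
	return 0;
}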

There are 2 rules for the lru bit now:
1, the lru bit still indicates whether a page is on an lru list, except
   that for a brief moment (during isolation) the page may have no lru bit
   while it is on an lru list; but the page must always be on an lru list
   when the lru bit is set.
2, the lru bit has to be cleared before the page is deleted from the lru
   list.

As Andrew Morton mentioned, this change dirties the cacheline for a page
that isn't on the LRU. But the loss is acceptable according to Rong Chen's
<[email protected]> report:
https://lore.kernel.org/lkml/20200304090301.GB5972@shao2-debian/

Suggested-by: Johannes Weiner <[email protected]>
Signed-off-by: Alex Shi <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/page-flags.h | 1 +
mm/mlock.c | 3 +--
mm/vmscan.c | 39 +++++++++++++++++++--------------------
3 files changed, 21 insertions(+), 22 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4f6ba9379112..14a0cac9e099 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -335,6 +335,7 @@ static inline void page_init_poison(struct page *page, size_t size)
PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
__CLEARPAGEFLAG(Dirty, dirty, PF_HEAD)
PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
+ TESTCLEARFLAG(LRU, lru, PF_HEAD)
PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
TESTCLEARFLAG(Active, active, PF_HEAD)
PAGEFLAG(Workingset, workingset, PF_HEAD)
diff --git a/mm/mlock.c b/mm/mlock.c
index d487aa864e86..7b0e6334be6f 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -276,10 +276,9 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
* We already have pin from follow_page_mask()
* so we can spare the get_page() here.
*/
- if (PageLRU(page)) {
+ if (TestClearPageLRU(page)) {
struct lruvec *lruvec;

- ClearPageLRU(page);
lruvec = mem_cgroup_page_lruvec(page,
page_pgdat(page));
del_page_from_lru_list(page, lruvec,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 845733afccde..ce4ab932805c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1542,7 +1542,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
*/
int __isolate_lru_page(struct page *page, isolate_mode_t mode)
{
- int ret = -EINVAL;
+ int ret = -EBUSY;

/* Only take pages on the LRU. */
if (!PageLRU(page))
@@ -1552,8 +1552,6 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
return ret;

- ret = -EBUSY;
-
/*
* To minimise LRU disruption, the caller can indicate that it only
* wants to isolate pages it will be able to operate on without
@@ -1600,8 +1598,10 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
* sure the page is not being freed elsewhere -- the
* page release code relies on it.
*/
- ClearPageLRU(page);
- ret = 0;
+ if (TestClearPageLRU(page))
+ ret = 0;
+ else
+ put_page(page);
}

return ret;
@@ -1667,8 +1667,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
page = lru_to_page(src);
prefetchw_prev_lru_page(page, src, flags);

- VM_BUG_ON_PAGE(!PageLRU(page), page);
-
nr_pages = compound_nr(page);
total_scan += nr_pages;

@@ -1765,21 +1763,18 @@ int isolate_lru_page(struct page *page)
VM_BUG_ON_PAGE(!page_count(page), page);
WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");

- if (PageLRU(page)) {
+ if (TestClearPageLRU(page)) {
pg_data_t *pgdat = page_pgdat(page);
struct lruvec *lruvec;

- spin_lock_irq(&pgdat->lru_lock);
+ get_page(page);
lruvec = mem_cgroup_page_lruvec(page, pgdat);
- if (PageLRU(page)) {
- int lru = page_lru(page);
- get_page(page);
- ClearPageLRU(page);
- del_page_from_lru_list(page, lruvec, lru);
- ret = 0;
- }
+ spin_lock_irq(&pgdat->lru_lock);
+ del_page_from_lru_list(page, lruvec, page_lru(page));
spin_unlock_irq(&pgdat->lru_lock);
+ ret = 0;
}
+
return ret;
}

@@ -4289,6 +4284,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
nr_pages = thp_nr_pages(page);
pgscanned += nr_pages;

+ /* block memcg migration during page moving between lru */
+ if (!TestClearPageLRU(page))
+ continue;
+
if (pagepgdat != pgdat) {
if (pgdat)
spin_unlock_irq(&pgdat->lru_lock);
@@ -4297,10 +4296,7 @@ void check_move_unevictable_pages(struct pagevec *pvec)
}
lruvec = mem_cgroup_page_lruvec(page, pgdat);

- if (!PageLRU(page) || !PageUnevictable(page))
- continue;
-
- if (page_evictable(page)) {
+ if (page_evictable(page) && PageUnevictable(page)) {
enum lru_list lru = page_lru_base_type(page);

VM_BUG_ON_PAGE(PageActive(page), page);
@@ -4309,12 +4305,15 @@ void check_move_unevictable_pages(struct pagevec *pvec)
add_page_to_lru_list(page, lruvec, lru);
pgrescued += nr_pages;
}
+ SetPageLRU(page);
}

if (pgdat) {
__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
spin_unlock_irq(&pgdat->lru_lock);
+ } else if (pgscanned) {
+ count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
}
}
EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
--
1.8.3.1

2020-10-29 10:49:15

by Alex Shi

Subject: [PATCH v20 01/20] mm/memcg: warning on !memcg after readahead page charged

Add VM_WARN_ON_ONCE_PAGE() macro.

Since readahead pages are charged to a memcg too now, in theory we no longer
have to check for this exception. Before safely removing all of these
checks, add a warning for the unexpected !memcg case.
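For reference, the macro follows the usual warn-once-and-dump idiom; a rough
userspace approximation (a sketch only: the real macro also places __warned
in the .data.once section and calls the kernel's dump_page() and WARN_ON();
the names below are stand-ins) looks like this:

#include <stdbool.h>
#include <stdio.h>

struct page;					/* opaque stand-in */

static void dump_page(struct page *page, const char *reason)
{
	fprintf(stderr, "page %p dumped because: %s\n", (void *)page, reason);
}

/* one static __warned per use site, so each site warns at most once */
#define WARN_ON_ONCE_PAGE(cond, page) ({				\
	static bool __warned;						\
	bool __ret_warn_once = !!(cond);				\
									\
	if (__ret_warn_once && !__warned) {				\
		__warned = true;					\
		dump_page(page, "WARN_ON_ONCE_PAGE(" #cond ")");	\
	}								\
	__ret_warn_once;						\
})

int main(void)
{
	struct page *pg = (struct page *)0x1234;

	for (int i = 0; i < 3; i++)
		if (WARN_ON_ONCE_PAGE(pg != NULL, pg))
			fprintf(stderr, "condition hit %d\n", i);
	return 0;	/* the dump prints once, the condition fires 3 times */
}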

Signed-off-by: Alex Shi <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/mmdebug.h | 13 +++++++++++++
mm/memcontrol.c | 11 ++++-------
2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/include/linux/mmdebug.h b/include/linux/mmdebug.h
index 2ad72d2c8cc5..5d0767cb424a 100644
--- a/include/linux/mmdebug.h
+++ b/include/linux/mmdebug.h
@@ -37,6 +37,18 @@
BUG(); \
} \
} while (0)
+#define VM_WARN_ON_ONCE_PAGE(cond, page) ({ \
+ static bool __section(".data.once") __warned; \
+ int __ret_warn_once = !!(cond); \
+ \
+ if (unlikely(__ret_warn_once && !__warned)) { \
+ dump_page(page, "VM_WARN_ON_ONCE_PAGE(" __stringify(cond)")");\
+ __warned = true; \
+ WARN_ON(1); \
+ } \
+ unlikely(__ret_warn_once); \
+})
+
#define VM_WARN_ON(cond) (void)WARN_ON(cond)
#define VM_WARN_ON_ONCE(cond) (void)WARN_ON_ONCE(cond)
#define VM_WARN_ONCE(cond, format...) (void)WARN_ONCE(cond, format)
@@ -48,6 +60,7 @@
#define VM_BUG_ON_MM(cond, mm) VM_BUG_ON(cond)
#define VM_WARN_ON(cond) BUILD_BUG_ON_INVALID(cond)
#define VM_WARN_ON_ONCE(cond) BUILD_BUG_ON_INVALID(cond)
+#define VM_WARN_ON_ONCE_PAGE(cond, page) BUILD_BUG_ON_INVALID(cond)
#define VM_WARN_ONCE(cond, format...) BUILD_BUG_ON_INVALID(cond)
#define VM_WARN(cond, format...) BUILD_BUG_ON_INVALID(cond)
#endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3a24e3b619f5..6b67da305958 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1350,10 +1350,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
}

memcg = page->mem_cgroup;
- /*
- * Swapcache readahead pages are added to the LRU - and
- * possibly migrated - before they are charged.
- */
+ VM_WARN_ON_ONCE_PAGE(!memcg, page);
if (!memcg)
memcg = root_mem_cgroup;

@@ -6979,8 +6976,8 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
if (newpage->mem_cgroup)
return;

- /* Swapcache readahead pages can get replaced before being charged */
memcg = oldpage->mem_cgroup;
+ VM_WARN_ON_ONCE_PAGE(!memcg, oldpage);
if (!memcg)
return;

@@ -7177,7 +7174,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)

memcg = page->mem_cgroup;

- /* Readahead page, never charged */
+ VM_WARN_ON_ONCE_PAGE(!memcg, page);
if (!memcg)
return;

@@ -7241,7 +7238,7 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)

memcg = page->mem_cgroup;

- /* Readahead page, never charged */
+ VM_WARN_ON_ONCE_PAGE(!memcg, page);
if (!memcg)
return 0;

--
1.8.3.1

2020-10-29 10:49:18

by Alex Shi

Subject: [PATCH v20 06/20] mm/thp: narrow lru locking

lru_lock and page cache xa_lock have no obvious reason to be taken
one way round or the other: until now, lru_lock has been taken before
page cache xa_lock, when splitting a THP; but nothing else takes them
together. Reverse that ordering: let's narrow the lru locking - but
leave local_irq_disable to block interrupts throughout, like before.

Hugh Dickins' point: split_huge_page_to_list() was already silly to be
using the _irqsave variant: it has just been taking sleeping locks, so
it would already be broken if entered with interrupts disabled. So we
can save passing the flags argument down to __split_huge_page().

Why change the lock ordering here? That was hard to decide. One reason:
when this series reaches per-memcg lru locking, it relies on the THP's
memcg to be stable when taking the lru_lock: that is now done after the
THP's refcount has been frozen, which ensures page memcg cannot change.
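A userspace sketch of that reasoning (the helpers below are local models
built on C11 atomics, named after but not identical to page_ref_freeze()
and get_page_unless_zero()): once the split path has frozen the refcount to
zero, anyone who would move or isolate the page first needs to take a
reference, which now fails, so the memcg read under the later lru_lock stays
stable.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int page_refcount = 1;	/* the splitter holds the only ref */

/* models page_ref_freeze(page, count): succeed only on the expected count */
static bool toy_ref_freeze(int expected)
{
	return atomic_compare_exchange_strong(&page_refcount, &expected, 0);
}

/* models get_page_unless_zero(): take a reference unless the count is 0 */
static bool toy_get_page_unless_zero(void)
{
	int old = atomic_load(&page_refcount);

	while (old != 0)
		if (atomic_compare_exchange_weak(&page_refcount, &old, old + 1))
			return true;
	return false;
}

int main(void)
{
	printf("splitter freezes refcount: %d\n", toy_ref_freeze(1));		/* 1 */
	printf("memcg mover gets a ref:    %d\n", toy_get_page_unless_zero());	/* 0 */
	/* so the splitter can take the lru_lock and trust the page's memcg */
	return 0;
}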

Another reason: previously, lock_page_memcg()'s move_lock was presumed
to nest inside lru_lock; but now lru_lock must nest inside (page cache
lock inside) move_lock, so it becomes possible to use lock_page_memcg()
to stabilize page memcg before taking its lru_lock. That is not the
mechanism used in this series, but it is an option we want to keep open.

[Hugh Dickins: rewrite commit log]
Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/huge_memory.c | 25 +++++++++++++------------
1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4b72dd7b8b34..5fa890e26975 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2433,7 +2433,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
}

static void __split_huge_page(struct page *page, struct list_head *list,
- pgoff_t end, unsigned long flags)
+ pgoff_t end)
{
struct page *head = compound_head(page);
pg_data_t *pgdat = page_pgdat(head);
@@ -2443,8 +2443,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
unsigned int nr = thp_nr_pages(head);
int i;

- lruvec = mem_cgroup_page_lruvec(head, pgdat);
-
/* complete memcg works before add pages to LRU */
mem_cgroup_split_huge_fixup(head);

@@ -2456,6 +2454,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
xa_lock(&swap_cache->i_pages);
}

+ /* prevent PageLRU to go away from under us, and freeze lru stats */
+ spin_lock(&pgdat->lru_lock);
+
+ lruvec = mem_cgroup_page_lruvec(head, pgdat);
+
for (i = nr - 1; i >= 1; i--) {
__split_huge_page_tail(head, i, lruvec, list);
/* Some pages can be beyond i_size: drop them from page cache */
@@ -2475,6 +2478,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
}

ClearPageCompound(head);
+ spin_unlock(&pgdat->lru_lock);
+ /* Caller disabled irqs, so they are still disabled here */

split_page_owner(head, nr);

@@ -2492,8 +2497,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
page_ref_add(head, 2);
xa_unlock(&head->mapping->i_pages);
}
-
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ local_irq_enable();

remap_page(head, nr);

@@ -2639,12 +2643,10 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
int split_huge_page_to_list(struct page *page, struct list_head *list)
{
struct page *head = compound_head(page);
- struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
struct deferred_split *ds_queue = get_deferred_split_queue(head);
struct anon_vma *anon_vma = NULL;
struct address_space *mapping = NULL;
int count, mapcount, extra_pins, ret;
- unsigned long flags;
pgoff_t end;

VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
@@ -2705,9 +2707,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
unmap_page(head);
VM_BUG_ON_PAGE(compound_mapcount(head), head);

- /* prevent PageLRU to go away from under us, and freeze lru stats */
- spin_lock_irqsave(&pgdata->lru_lock, flags);
-
+ /* block interrupt reentry in xa_lock and spinlock */
+ local_irq_disable();
if (mapping) {
XA_STATE(xas, &mapping->i_pages, page_index(head));

@@ -2737,7 +2738,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
__dec_node_page_state(head, NR_FILE_THPS);
}

- __split_huge_page(page, list, end, flags);
+ __split_huge_page(page, list, end);
ret = 0;
} else {
if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
@@ -2751,7 +2752,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
spin_unlock(&ds_queue->split_queue_lock);
fail: if (mapping)
xa_unlock(&mapping->i_pages);
- spin_unlock_irqrestore(&pgdata->lru_lock, flags);
+ local_irq_enable();
remap_page(head, thp_nr_pages(head));
ret = -EBUSY;
}
--
1.8.3.1

2020-10-29 10:49:23

by Alex Shi

Subject: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock

From: Hugh Dickins <[email protected]>

It is necessary for page_idle_get_page() to recheck PageLRU() after
get_page_unless_zero(), but holding lru_lock around that serves no
useful purpose, and adds to lru_lock contention: delete it.

See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
discussion that led to lru_lock there; but __page_set_anon_rmap() now
uses WRITE_ONCE(), and I see no other risk in page_idle_clear_pte_refs()
using rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly
but not entirely prevented by page_count() check in ksm.c's
write_protect_page(): that risk being shared with page_referenced() and
not helped by lru_lock).

Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/page_idle.c | 4 ----
1 file changed, 4 deletions(-)

diff --git a/mm/page_idle.c b/mm/page_idle.c
index 057c61df12db..64e5344a992c 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -32,19 +32,15 @@
static struct page *page_idle_get_page(unsigned long pfn)
{
struct page *page = pfn_to_online_page(pfn);
- pg_data_t *pgdat;

if (!page || !PageLRU(page) ||
!get_page_unless_zero(page))
return NULL;

- pgdat = page_pgdat(page);
- spin_lock_irq(&pgdat->lru_lock);
if (unlikely(!PageLRU(page))) {
put_page(page);
page = NULL;
}
- spin_unlock_irq(&pgdat->lru_lock);
return page;
}

--
1.8.3.1

2020-10-29 10:49:31

by Alex Shi

Subject: [PATCH v20 04/20] mm/thp: use head for head page in lru_add_page_tail

Since the first parameter is only ever used as the head page, it's better
to make that explicit.

Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/huge_memory.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 038db815ebba..93c0b73eb8c6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2346,19 +2346,19 @@ static void remap_page(struct page *page, unsigned int nr)
}
}

-static void lru_add_page_tail(struct page *page, struct page *page_tail,
+static void lru_add_page_tail(struct page *head, struct page *page_tail,
struct lruvec *lruvec, struct list_head *list)
{
- VM_BUG_ON_PAGE(!PageHead(page), page);
- VM_BUG_ON_PAGE(PageCompound(page_tail), page);
- VM_BUG_ON_PAGE(PageLRU(page_tail), page);
+ VM_BUG_ON_PAGE(!PageHead(head), head);
+ VM_BUG_ON_PAGE(PageCompound(page_tail), head);
+ VM_BUG_ON_PAGE(PageLRU(page_tail), head);
lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);

if (!list)
SetPageLRU(page_tail);

- if (likely(PageLRU(page)))
- list_add_tail(&page_tail->lru, &page->lru);
+ if (likely(PageLRU(head)))
+ list_add_tail(&page_tail->lru, &head->lru);
else if (list) {
/* page reclaim is reclaiming a huge page */
get_page(page_tail);
--
1.8.3.1

2020-10-29 10:49:35

by Alex Shi

Subject: [PATCH v20 02/20] mm/memcg: bail early from swap accounting if memcg disabled

If memcg is disabled by cgroup_disable=memory, page->memcg will be NULL
since the charge is skipped, and that triggers a warning like the one
below. Let's return from these functions earlier in that case.

anon flags:0x5005b48008000d(locked|uptodate|dirty|swapbacked)
raw: 005005b48008000d dead000000000100 dead000000000122 ffff8897c7c76ad1
raw: 0000000000000022 0000000000000000 0000000200000000 0000000000000000
page dumped because: VM_WARN_ON_ONCE_PAGE(!memcg)
...
RIP: 0010:vprintk_emit+0x1f7/0x260
Code: 00 84 d2 74 72 0f b6 15 27 58 64 01 48 c7 c0 00 d4 72 82 84 d2 74 09 f3 90 0f b6 10 84 d2 75 f7 e8 de 0d 00 00 4c 89 e7 57 9d <0f> 1f 44 00 00 e9 62 ff ff ff 80 3d 88 c9 3a 01 00 0f 85 54 fe ff
RSP: 0018:ffffc9000faab358 EFLAGS: 00000202
RAX: ffffffff8272d400 RBX: 000000000000005e RCX: ffff88afd80d0040
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000202
RBP: ffffc9000faab3a8 R08: ffffffff8272d440 R09: 0000000000022480
R10: 00120c77be68bfac R11: 0000000000cd7568 R12: 0000000000000202
R13: 0057ffffc0080005 R14: ffffffff820a0130 R15: ffffc9000faab3e8
? vprintk_emit+0x140/0x260
vprintk_default+0x1a/0x20
vprintk_func+0x4f/0xc4
? vprintk_func+0x4f/0xc4
printk+0x53/0x6a
? xas_load+0xc/0x80
__dump_page.cold.6+0xff/0x4ee
? xas_init_marks+0x23/0x50
? xas_store+0x30/0x40
? free_swap_slot+0x43/0xd0
? put_swap_page+0x119/0x320
? update_load_avg+0x82/0x580
dump_page+0x9/0xb
mem_cgroup_try_charge_swap+0x16e/0x1d0
get_swap_page+0x130/0x210
add_to_swap+0x41/0xc0
shrink_page_list+0x99e/0xdf0
shrink_inactive_list+0x199/0x360
shrink_lruvec+0x40d/0x650
? _cond_resched+0x14/0x30
? _cond_resched+0x14/0x30
shrink_node+0x226/0x6e0
do_try_to_free_pages+0xd0/0x400
try_to_free_pages+0xef/0x130
__alloc_pages_slowpath.constprop.127+0x38d/0xbd0
? ___slab_alloc+0x31d/0x6f0
__alloc_pages_nodemask+0x27f/0x2c0
alloc_pages_vma+0x75/0x220
shmem_alloc_page+0x46/0x90
? release_pages+0x1ae/0x410
shmem_alloc_and_acct_page+0x77/0x1c0
shmem_getpage_gfp+0x162/0x910
shmem_fault+0x74/0x210
? filemap_map_pages+0x29c/0x410
__do_fault+0x37/0x190
handle_mm_fault+0x120a/0x1770
exc_page_fault+0x251/0x450
? asm_exc_page_fault+0x8/0x30
asm_exc_page_fault+0x1e/0x30

Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/memcontrol.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6b67da305958..e46b9f9501c2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7169,6 +7169,9 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
VM_BUG_ON_PAGE(PageLRU(page), page);
VM_BUG_ON_PAGE(page_count(page), page);

+ if (mem_cgroup_disabled())
+ return;
+
if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
return;

@@ -7233,6 +7236,9 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
struct mem_cgroup *memcg;
unsigned short oldid;

+ if (mem_cgroup_disabled())
+ return 0;
+
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return 0;

--
1.8.3.1

2020-10-29 10:49:48

by Alex Shi

Subject: [PATCH v20 12/20] mm/vmscan: remove lruvec reget in move_pages_to_lru

An isolated page shouldn't be recharged by memcg, since memcg migration
isn't possible at that time. So remove the unnecessary re-get of the
lruvec.

Thanks to Alexander Duyck for pointing this out.

Signed-off-by: Alex Shi <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Alexander Duyck <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/vmscan.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 42bac12aacb4..845733afccde 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1887,7 +1887,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
continue;
}

- lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ VM_BUG_ON_PAGE(mem_cgroup_page_lruvec(page, page_pgdat(page))
+ != lruvec, page);
lru = page_lru(page);
nr_pages = thp_nr_pages(page);

--
1.8.3.1

2020-10-29 10:50:00

by Alex Shi

Subject: [PATCH v20 11/20] mm/lru: move lock into lru_note_cost

We have to move the lru_lock into lru_note_cost, since it cycles up the
memcg tree, in preparation for the future per-lruvec lru_lock replacement.
It's a bit ugly and may cost a bit more locking, but the benefit from
finer-grained multi-memcg locking should cover the loss.
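A minimal userspace sketch of the resulting pattern (stand-in "toy" types;
parent_lruvec() is modelled as a plain parent pointer, and the lock is shown
as the eventual per-lruvec lock rather than the pgdat lock it still is at
this point in the series): the cost update takes and drops a lock at each
level while walking up the hierarchy.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_lruvec {
	pthread_mutex_t lru_lock;
	long file_cost, anon_cost;
	struct toy_lruvec *parent;	/* models parent_lruvec() */
};

static void toy_lru_note_cost(struct toy_lruvec *lruvec, bool file,
			      unsigned int nr_pages)
{
	do {
		pthread_mutex_lock(&lruvec->lru_lock);	/* lock this level only */
		if (file)
			lruvec->file_cost += nr_pages;
		else
			lruvec->anon_cost += nr_pages;
		/* ... the periodic decay of both costs happens here too ... */
		pthread_mutex_unlock(&lruvec->lru_lock);
	} while ((lruvec = lruvec->parent));
}

int main(void)
{
	struct toy_lruvec root  = { .lru_lock = PTHREAD_MUTEX_INITIALIZER };
	struct toy_lruvec child = { .lru_lock = PTHREAD_MUTEX_INITIALIZER,
				    .parent = &root };

	toy_lru_note_cost(&child, true, 32);
	printf("child=%ld root=%ld\n", child.file_cost, root.file_cost); /* 32 32 */
	return 0;
}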

Signed-off-by: Alex Shi <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/swap.c | 3 +++
mm/vmscan.c | 4 +---
mm/workingset.c | 2 --
3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 31fc3ebc1079..8798bd899db0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -268,7 +268,9 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
{
do {
unsigned long lrusize;
+ struct pglist_data *pgdat = lruvec_pgdat(lruvec);

+ spin_lock_irq(&pgdat->lru_lock);
/* Record cost event */
if (file)
lruvec->file_cost += nr_pages;
@@ -292,6 +294,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
lruvec->file_cost /= 2;
lruvec->anon_cost /= 2;
}
+ spin_unlock_irq(&pgdat->lru_lock);
} while ((lruvec = parent_lruvec(lruvec)));
}

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6c6965cbbdef..42bac12aacb4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1973,19 +1973,17 @@ static int current_may_throttle(void)
&stat, false);

spin_lock_irq(&pgdat->lru_lock);
-
move_pages_to_lru(lruvec, &page_list);

__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
- lru_note_cost(lruvec, file, stat.nr_pageout);
item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
if (!cgroup_reclaim(sc))
__count_vm_events(item, nr_reclaimed);
__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
-
spin_unlock_irq(&pgdat->lru_lock);

+ lru_note_cost(lruvec, file, stat.nr_pageout);
mem_cgroup_uncharge_list(&page_list);
free_unref_page_list(&page_list);

diff --git a/mm/workingset.c b/mm/workingset.c
index 975a4d2dd02e..d8d2fdf70c24 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -381,9 +381,7 @@ void workingset_refault(struct page *page, void *shadow)
if (workingset) {
SetPageWorkingset(page);
/* XXX: Move to lru_cache_add() when it supports new vs putback */
- spin_lock_irq(&page_pgdat(page)->lru_lock);
lru_note_cost_page(page);
- spin_unlock_irq(&page_pgdat(page)->lru_lock);
inc_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file);
}
out:
--
1.8.3.1

2020-10-29 10:50:00

by Alex Shi

Subject: [PATCH v20 16/20] mm/compaction: do page isolation first in compaction

Currently, compaction takes the lru_lock and then does page isolation,
which works fine with pgdat->lru_lock, since any page isolation competes
for that one lru_lock. If we want to change to a memcg lru_lock, we have to
isolate the page before taking the lru_lock, so that isolation blocks the
page's memcg from changing, since a memcg change relies on page isolation
too. Then we can safely use the per-memcg lru_lock later.

The new page isolation uses the previously introduced TestClearPageLRU() +
pgdat lru locking, which will be changed to the memcg lru lock later.

Hugh Dickins <[email protected]> fixed the following bugs in an early
version of this patch:

Fix lots of crashes under compaction load: isolate_migratepages_block()
must clean up appropriately when rejecting a page, setting PageLRU again
if it had been cleared; and a put_page() after get_page_unless_zero()
cannot safely be done while holding locked_lruvec - it may turn out to
be the final put_page(), which will take an lruvec lock when PageLRU.
And move __isolate_lru_page_prepare back after get_page_unless_zero to
make trylock_page() safe: trylock_page() is not safe to use at this point;
its setting of PG_locked can race with the page being freed or allocated
("Bad page"), and can also erase flags being set by one of those "sole
owners" of a freshly allocated page who use a non-atomic __SetPageFlag().
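A userspace sketch of the second fix (stand-in types and locking: a pthread
mutex models the lru lock and the release path is heavily simplified, so
this is an illustration, not the kernel code): the final put of a
still-on-lru page needs the lru lock itself, so the failure path must drop
the lock before calling put, which is what the new isolate_fail_put label
does.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

struct toy_page { atomic_int refcount; };

static void toy_put_page(struct toy_page *page)
{
	if (atomic_fetch_sub(&page->refcount, 1) == 1) {
		/* final put: the release path may need the lru lock itself,
		 * e.g. to take a still-on-lru page off its list */
		pthread_mutex_lock(&lru_lock);
		/* ... remove from lru and free the page ... */
		pthread_mutex_unlock(&lru_lock);
	}
}

/* models the isolate_fail_put handling: never put while holding the lock */
static void toy_isolate_fail_put(struct toy_page *page, bool *locked)
{
	if (*locked) {
		pthread_mutex_unlock(&lru_lock);
		*locked = false;
	}
	toy_put_page(page);	/* safe: a final put can retake lru_lock */
}

int main(void)
{
	struct toy_page page = { .refcount = 1 };
	bool locked = true;

	pthread_mutex_lock(&lru_lock);		/* as compaction held it */
	toy_isolate_fail_put(&page, &locked);
	printf("done without self-deadlock\n");
	return 0;
}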

Suggested-by: Johannes Weiner <[email protected]>
Signed-off-by: Alex Shi <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/swap.h | 2 +-
mm/compaction.c | 42 +++++++++++++++++++++++++++++++++---------
mm/vmscan.c | 43 ++++++++++++++++++++++---------------------
3 files changed, 56 insertions(+), 31 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5e1e967c225f..596bc2f4d9b0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -356,7 +356,7 @@ extern void lru_cache_add_inactive_or_unevictable(struct page *page,
extern unsigned long zone_reclaimable_pages(struct zone *zone);
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
-extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
+extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode);
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
unsigned long nr_pages,
gfp_t gfp_mask,
diff --git a/mm/compaction.c b/mm/compaction.c
index 6e0ee5641788..75f7973605f4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -886,6 +886,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {
if (!cc->ignore_skip_hint && get_pageblock_skip(page)) {
low_pfn = end_pfn;
+ page = NULL;
goto isolate_abort;
}
valid_page = page;
@@ -967,6 +968,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
goto isolate_fail;

+ /*
+ * Be careful not to clear PageLRU until after we're
+ * sure the page is not being freed elsewhere -- the
+ * page release code relies on it.
+ */
+ if (unlikely(!get_page_unless_zero(page)))
+ goto isolate_fail;
+
+ if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
+ goto isolate_fail_put;
+
+ /* Try isolate the page */
+ if (!TestClearPageLRU(page))
+ goto isolate_fail_put;
+
/* If we already hold the lock, we can skip some rechecking */
if (!locked) {
locked = compact_lock_irqsave(&pgdat->lru_lock,
@@ -979,10 +995,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
goto isolate_abort;
}

- /* Recheck PageLRU and PageCompound under lock */
- if (!PageLRU(page))
- goto isolate_fail;
-
/*
* Page become compound since the non-locked check,
* and it's on LRU. It can only be a THP so the order
@@ -990,16 +1002,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
*/
if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
low_pfn += compound_nr(page) - 1;
- goto isolate_fail;
+ SetPageLRU(page);
+ goto isolate_fail_put;
}
}

lruvec = mem_cgroup_page_lruvec(page, pgdat);

- /* Try isolate the page */
- if (__isolate_lru_page(page, isolate_mode) != 0)
- goto isolate_fail;
-
/* The whole page is taken off the LRU; skip the tail pages. */
if (PageCompound(page))
low_pfn += compound_nr(page) - 1;
@@ -1028,6 +1037,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
}

continue;
+
+isolate_fail_put:
+ /* Avoid potential deadlock in freeing page under lru_lock */
+ if (locked) {
+ spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ locked = false;
+ }
+ put_page(page);
+
isolate_fail:
if (!skip_on_failure)
continue;
@@ -1064,9 +1082,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
if (unlikely(low_pfn > end_pfn))
low_pfn = end_pfn;

+ page = NULL;
+
isolate_abort:
if (locked)
spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ if (page) {
+ SetPageLRU(page);
+ put_page(page);
+ }

/*
* Updated the cached scanner pfn once the pageblock has been scanned
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ce4ab932805c..e28df9cb5be3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1540,7 +1540,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
*
* returns 0 on success, -ve errno on failure.
*/
-int __isolate_lru_page(struct page *page, isolate_mode_t mode)
+int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode)
{
int ret = -EBUSY;

@@ -1592,22 +1592,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
return ret;

- if (likely(get_page_unless_zero(page))) {
- /*
- * Be careful not to clear PageLRU until after we're
- * sure the page is not being freed elsewhere -- the
- * page release code relies on it.
- */
- if (TestClearPageLRU(page))
- ret = 0;
- else
- put_page(page);
- }
-
- return ret;
+ return 0;
}

-
/*
* Update LRU sizes after isolating pages. The LRU size updates must
* be complete before mem_cgroup_update_lru_size due to a sanity check.
@@ -1687,20 +1674,34 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
* only when the page is being freed somewhere else.
*/
scan += nr_pages;
- switch (__isolate_lru_page(page, mode)) {
+ switch (__isolate_lru_page_prepare(page, mode)) {
case 0:
+ /*
+ * Be careful not to clear PageLRU until after we're
+ * sure the page is not being freed elsewhere -- the
+ * page release code relies on it.
+ */
+ if (unlikely(!get_page_unless_zero(page)))
+ goto busy;
+
+ if (!TestClearPageLRU(page)) {
+ /*
+ * This page may in other isolation path,
+ * but we still hold lru_lock.
+ */
+ put_page(page);
+ goto busy;
+ }
+
nr_taken += nr_pages;
nr_zone_taken[page_zonenum(page)] += nr_pages;
list_move(&page->lru, dst);
break;

- case -EBUSY:
+ default:
+busy:
/* else it is being freed elsewhere */
list_move(&page->lru, src);
- continue;
-
- default:
- BUG();
}
}

--
1.8.3.1

2020-10-29 10:50:39

by Alex Shi

Subject: [PATCH v20 03/20] mm/thp: move lru_add_page_tail func to huge_memory.c

The function is only used in huge_memory.c; defining it in another file
under a CONFIG_TRANSPARENT_HUGEPAGE guard just looks weird.

Let's move it into the THP code, and make it static as Hugh Dickins
suggested.

Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/swap.h | 2 --
mm/huge_memory.c | 30 ++++++++++++++++++++++++++++++
mm/swap.c | 33 ---------------------------------
3 files changed, 30 insertions(+), 35 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 667935c0dbd4..5e1e967c225f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -338,8 +338,6 @@ extern void lru_note_cost(struct lruvec *lruvec, bool file,
unsigned int nr_pages);
extern void lru_note_cost_page(struct page *);
extern void lru_cache_add(struct page *);
-extern void lru_add_page_tail(struct page *page, struct page *page_tail,
- struct lruvec *lruvec, struct list_head *head);
extern void mark_page_accessed(struct page *);
extern void lru_add_drain(void);
extern void lru_add_drain_cpu(int cpu);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9474dbc150ed..038db815ebba 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2346,6 +2346,36 @@ static void remap_page(struct page *page, unsigned int nr)
}
}

+static void lru_add_page_tail(struct page *page, struct page *page_tail,
+ struct lruvec *lruvec, struct list_head *list)
+{
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+ VM_BUG_ON_PAGE(PageCompound(page_tail), page);
+ VM_BUG_ON_PAGE(PageLRU(page_tail), page);
+ lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+
+ if (!list)
+ SetPageLRU(page_tail);
+
+ if (likely(PageLRU(page)))
+ list_add_tail(&page_tail->lru, &page->lru);
+ else if (list) {
+ /* page reclaim is reclaiming a huge page */
+ get_page(page_tail);
+ list_add_tail(&page_tail->lru, list);
+ } else {
+ /*
+ * Head page has not yet been counted, as an hpage,
+ * so we must account for each subpage individually.
+ *
+ * Put page_tail on the list at the correct position
+ * so they all end up in order.
+ */
+ add_page_to_lru_list_tail(page_tail, lruvec,
+ page_lru(page_tail));
+ }
+}
+
static void __split_huge_page_tail(struct page *head, int tail,
struct lruvec *lruvec, struct list_head *list)
{
diff --git a/mm/swap.c b/mm/swap.c
index 47a47681c86b..05bc9ff6d8c0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -974,39 +974,6 @@ void __pagevec_release(struct pagevec *pvec)
}
EXPORT_SYMBOL(__pagevec_release);

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-/* used by __split_huge_page_refcount() */
-void lru_add_page_tail(struct page *page, struct page *page_tail,
- struct lruvec *lruvec, struct list_head *list)
-{
- VM_BUG_ON_PAGE(!PageHead(page), page);
- VM_BUG_ON_PAGE(PageCompound(page_tail), page);
- VM_BUG_ON_PAGE(PageLRU(page_tail), page);
- lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
-
- if (!list)
- SetPageLRU(page_tail);
-
- if (likely(PageLRU(page)))
- list_add_tail(&page_tail->lru, &page->lru);
- else if (list) {
- /* page reclaim is reclaiming a huge page */
- get_page(page_tail);
- list_add_tail(&page_tail->lru, list);
- } else {
- /*
- * Head page has not yet been counted, as an hpage,
- * so we must account for each subpage individually.
- *
- * Put page_tail on the list at the correct position
- * so they all end up in order.
- */
- add_page_to_lru_list_tail(page_tail, lruvec,
- page_lru(page_tail));
- }
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-
static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
void *arg)
{
--
1.8.3.1

2020-10-29 13:48:38

by Johannes Weiner

Subject: Re: [PATCH v20 11/20] mm/lru: move lock into lru_note_cost

On Thu, Oct 29, 2020 at 06:44:56PM +0800, Alex Shi wrote:
> We have to move lru_lock into lru_note_cost, since it cycle up on memcg
> tree, for future per lruvec lru_lock replace. It's a bit ugly and may
> cost a bit more locking, but benefit from multiple memcg locking could
> cover the lost.
>
> Signed-off-by: Alex Shi <[email protected]>
> Acked-by: Hugh Dickins <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: [email protected]
> Cc: [email protected]

Acked-by: Johannes Weiner <[email protected]>

2020-10-29 13:49:25

by Johannes Weiner

Subject: Re: [PATCH v20 01/20] mm/memcg: warning on !memcg after readahead page charged

On Thu, Oct 29, 2020 at 06:44:46PM +0800, Alex Shi wrote:
> Add VM_WARN_ON_ONCE_PAGE() macro.
>
> Since readahead page is charged on memcg too, in theory we don't have to
> check this exception now. Before safely remove them all, add a warning
> for the unexpected !memcg.
>
> Signed-off-by: Alex Shi <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Acked-by: Hugh Dickins <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Vladimir Davydov <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]

Acked-by: Johannes Weiner <[email protected]>

2020-10-29 13:50:28

by Johannes Weiner

Subject: Re: [PATCH v20 02/20] mm/memcg: bail early from swap accounting if memcg disabled

On Thu, Oct 29, 2020 at 06:44:47PM +0800, Alex Shi wrote:
> If we disabled memcg by cgroup_disable=memory, page->memcg will be NULL
> and so the charge is skipped and that will trigger a warning like below.
> Let's return from the funcs earlier.
>
> anon flags:0x5005b48008000d(locked|uptodate|dirty|swapbacked)
> raw: 005005b48008000d dead000000000100 dead000000000122 ffff8897c7c76ad1
> raw: 0000000000000022 0000000000000000 0000000200000000 0000000000000000
> page dumped because: VM_WARN_ON_ONCE_PAGE(!memcg)
> ...
> RIP: 0010:vprintk_emit+0x1f7/0x260
> Code: 00 84 d2 74 72 0f b6 15 27 58 64 01 48 c7 c0 00 d4 72 82 84 d2 74 09 f3 90 0f b6 10 84 d2 75 f7 e8 de 0d 00 00 4c 89 e7 57 9d <0f> 1f 44 00 00 e9 62 ff ff ff 80 3d 88 c9 3a 01 00 0f 85 54 fe ff
> RSP: 0018:ffffc9000faab358 EFLAGS: 00000202
> RAX: ffffffff8272d400 RBX: 000000000000005e RCX: ffff88afd80d0040
> RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000202
> RBP: ffffc9000faab3a8 R08: ffffffff8272d440 R09: 0000000000022480
> R10: 00120c77be68bfac R11: 0000000000cd7568 R12: 0000000000000202
> R13: 0057ffffc0080005 R14: ffffffff820a0130 R15: ffffc9000faab3e8
> ? vprintk_emit+0x140/0x260
> vprintk_default+0x1a/0x20
> vprintk_func+0x4f/0xc4
> ? vprintk_func+0x4f/0xc4
> printk+0x53/0x6a
> ? xas_load+0xc/0x80
> __dump_page.cold.6+0xff/0x4ee
> ? xas_init_marks+0x23/0x50
> ? xas_store+0x30/0x40
> ? free_swap_slot+0x43/0xd0
> ? put_swap_page+0x119/0x320
> ? update_load_avg+0x82/0x580
> dump_page+0x9/0xb
> mem_cgroup_try_charge_swap+0x16e/0x1d0
> get_swap_page+0x130/0x210
> add_to_swap+0x41/0xc0
> shrink_page_list+0x99e/0xdf0
> shrink_inactive_list+0x199/0x360
> shrink_lruvec+0x40d/0x650
> ? _cond_resched+0x14/0x30
> ? _cond_resched+0x14/0x30
> shrink_node+0x226/0x6e0
> do_try_to_free_pages+0xd0/0x400
> try_to_free_pages+0xef/0x130
> __alloc_pages_slowpath.constprop.127+0x38d/0xbd0
> ? ___slab_alloc+0x31d/0x6f0
> __alloc_pages_nodemask+0x27f/0x2c0
> alloc_pages_vma+0x75/0x220
> shmem_alloc_page+0x46/0x90
> ? release_pages+0x1ae/0x410
> shmem_alloc_and_acct_page+0x77/0x1c0
> shmem_getpage_gfp+0x162/0x910
> shmem_fault+0x74/0x210
> ? filemap_map_pages+0x29c/0x410
> __do_fault+0x37/0x190
> handle_mm_fault+0x120a/0x1770
> exc_page_fault+0x251/0x450
> ? asm_exc_page_fault+0x8/0x30
> asm_exc_page_fault+0x1e/0x30
>
> Signed-off-by: Alex Shi <[email protected]>
> Reviewed-by: Roman Gushchin <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Acked-by: Hugh Dickins <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Vladimir Davydov <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]

Acked-by: Johannes Weiner <[email protected]>

This should go in before the previous patch that adds the WARN for it.

2020-10-29 13:51:21

by Johannes Weiner

Subject: Re: [PATCH v20 03/20] mm/thp: move lru_add_page_tail func to huge_memory.c

On Thu, Oct 29, 2020 at 06:44:48PM +0800, Alex Shi wrote:
> The func is only used in huge_memory.c, defining it in other file with a
> CONFIG_TRANSPARENT_HUGEPAGE macro restrict just looks weird.
>
> Let's move it THP. And make it static as Hugh Dickin suggested.
>
> Signed-off-by: Alex Shi <[email protected]>
> Reviewed-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Hugh Dickins <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: [email protected]
> Cc: [email protected]

Acked-by: Johannes Weiner <[email protected]>

2020-10-29 13:54:14

by Johannes Weiner

Subject: Re: [PATCH v20 04/20] mm/thp: use head for head page in lru_add_page_tail

On Thu, Oct 29, 2020 at 06:44:49PM +0800, Alex Shi wrote:
> Since the first parameter is only used by head page, it's better to make
> it explicit.
>
> Signed-off-by: Alex Shi <[email protected]>
> Reviewed-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Hugh Dickins <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> mm/huge_memory.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 038db815ebba..93c0b73eb8c6 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2346,19 +2346,19 @@ static void remap_page(struct page *page, unsigned int nr)
> }
> }
>
> -static void lru_add_page_tail(struct page *page, struct page *page_tail,
> +static void lru_add_page_tail(struct page *head, struct page *page_tail,

It may be better to pick either
head and tail
or
page_head and page_tail

?

2020-10-29 14:05:25

by Johannes Weiner

Subject: Re: [PATCH v20 05/20] mm/thp: Simplify lru_add_page_tail()

On Thu, Oct 29, 2020 at 06:44:50PM +0800, Alex Shi wrote:
> Simplify lru_add_page_tail(), there are actually only two cases possible:
> split_huge_page_to_list(), with list supplied and head isolated from lru
> by its caller; or split_huge_page(), with NULL list and head on lru -
> because when head is racily isolated from lru, the isolator's reference
> will stop the split from getting any further than its page_ref_freeze().
>
> So decide between the two cases by "list", but add VM_WARN_ON()s to
> verify that they match our lru expectations.
>
> [Hugh Dickins: rewrite commit log]
> Signed-off-by: Alex Shi <[email protected]>
> Reviewed-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Hugh Dickins <[email protected]>
> Cc: Kirill A. Shutemov <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Hugh Dickins <[email protected]>
Cc: Mika Penttilä <[email protected]>
> Cc: [email protected]
> Cc: [email protected]

Acked-by: Johannes Weiner <[email protected]>

2020-10-30 02:31:56

by Alex Shi

Subject: Re: [PATCH v20 02/20] mm/memcg: bail early from swap accounting if memcg disabled



On 2020/10/29 at 9:46 PM, Johannes Weiner wrote:
>> ? release_pages+0x1ae/0x410
>> shmem_alloc_and_acct_page+0x77/0x1c0
>> shmem_getpage_gfp+0x162/0x910
>> shmem_fault+0x74/0x210
>> ? filemap_map_pages+0x29c/0x410
>> __do_fault+0x37/0x190
>> handle_mm_fault+0x120a/0x1770
>> exc_page_fault+0x251/0x450
>> ? asm_exc_page_fault+0x8/0x30
>> asm_exc_page_fault+0x1e/0x30
>>
>> Signed-off-by: Alex Shi <[email protected]>
>> Reviewed-by: Roman Gushchin <[email protected]>
>> Acked-by: Michal Hocko <[email protected]>
>> Acked-by: Hugh Dickins <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Michal Hocko <[email protected]>
>> Cc: Vladimir Davydov <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
> Acked-by: Johannes Weiner <[email protected]>
>
> This should go in before the previous patch that adds the WARN for it.

Right, but then the long oops in the log may look weird. Should I remove the oops and resend the whole patchset?

Which way is convenient for you?

Thanks
Alex

2020-10-30 02:50:47

by Alex Shi

Subject: Re: [PATCH v20 04/20] mm/thp: use head for head page in lru_add_page_tail



On 2020/10/29 at 9:50 PM, Johannes Weiner wrote:
> On Thu, Oct 29, 2020 at 06:44:49PM +0800, Alex Shi wrote:
>> Since the first parameter is only used by head page, it's better to make
>> it explicit.
>>
>> Signed-off-by: Alex Shi <[email protected]>
>> Reviewed-by: Kirill A. Shutemov <[email protected]>
>> Acked-by: Hugh Dickins <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Matthew Wilcox <[email protected]>
>> Cc: Hugh Dickins <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> ---
>> mm/huge_memory.c | 12 ++++++------
>> 1 file changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 038db815ebba..93c0b73eb8c6 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2346,19 +2346,19 @@ static void remap_page(struct page *page, unsigned int nr)
>> }
>> }
>>
>> -static void lru_add_page_tail(struct page *page, struct page *page_tail,
>> +static void lru_add_page_tail(struct page *head, struct page *page_tail,
>
> It may be better to pick either
> head and tail

Hi Johannes,

Thanks for comments!

Right. Considering that the functions in this file mostly use head/tail as
parameter names, I will change to head/tail too. The 04th, 05th, and 18th
patches will then be changed accordingly.

Thanks
Alex

> or
> page_head and page_tail
>
> ?
>

From a9ee63a213f40eb4d5a69b52fbb348ff9cd7cf6c Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Tue, 26 May 2020 16:49:22 +0800
Subject: [PATCH v21 04/20] mm/thp: use head for head page in lru_add_page_tail

Since the first parameter is only ever used as the head page, it's better
to make that explicit.

Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/huge_memory.c | 23 +++++++++++------------
1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 038db815ebba..32a4bf5b80c8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2346,33 +2346,32 @@ static void remap_page(struct page *page, unsigned int nr)
}
}

-static void lru_add_page_tail(struct page *page, struct page *page_tail,
+static void lru_add_page_tail(struct page *head, struct page *tail,
struct lruvec *lruvec, struct list_head *list)
{
- VM_BUG_ON_PAGE(!PageHead(page), page);
- VM_BUG_ON_PAGE(PageCompound(page_tail), page);
- VM_BUG_ON_PAGE(PageLRU(page_tail), page);
+ VM_BUG_ON_PAGE(!PageHead(head), head);
+ VM_BUG_ON_PAGE(PageCompound(tail), head);
+ VM_BUG_ON_PAGE(PageLRU(tail), head);
lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);

if (!list)
- SetPageLRU(page_tail);
+ SetPageLRU(tail);

- if (likely(PageLRU(page)))
- list_add_tail(&page_tail->lru, &page->lru);
+ if (likely(PageLRU(head)))
+ list_add_tail(&tail->lru, &head->lru);
else if (list) {
/* page reclaim is reclaiming a huge page */
- get_page(page_tail);
- list_add_tail(&page_tail->lru, list);
+ get_page(tail);
+ list_add_tail(&tail->lru, list);
} else {
/*
* Head page has not yet been counted, as an hpage,
* so we must account for each subpage individually.
*
- * Put page_tail on the list at the correct position
+ * Put tail on the list at the correct position
* so they all end up in order.
*/
- add_page_to_lru_list_tail(page_tail, lruvec,
- page_lru(page_tail));
+ add_page_to_lru_list_tail(tail, lruvec, page_lru(tail));
}
}

--
1.8.3.1

2020-10-30 02:53:10

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v20 05/20] mm/thp: Simplify lru_add_page_tail()

Patch changed due to the variable rename in the 4th patch:

From 5014c78418284f70be232a37fa3a4660a54e83c0 Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Fri, 10 Jul 2020 12:53:22 +0800
Subject: [PATCH v21 05/20] mm/thp: Simplify lru_add_page_tail()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Simplify lru_add_page_tail(), there are actually only two cases possible:
split_huge_page_to_list(), with list supplied and head isolated from lru
by its caller; or split_huge_page(), with NULL list and head on lru -
because when head is racily isolated from lru, the isolator's reference
will stop the split from getting any further than its page_ref_freeze().

So decide between the two cases by "list", but add VM_WARN_ON()s to
verify that they match our lru expectations.

[Hugh Dickins: rewrite commit log]
Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Mika Penttilä <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/huge_memory.c | 20 ++++++--------------
1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 32a4bf5b80c8..cedcdbeb98b4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2354,24 +2354,16 @@ static void lru_add_page_tail(struct page *head, struct page *tail,
VM_BUG_ON_PAGE(PageLRU(tail), head);
lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);

- if (!list)
- SetPageLRU(tail);
-
- if (likely(PageLRU(head)))
- list_add_tail(&tail->lru, &head->lru);
- else if (list) {
+ if (list) {
/* page reclaim is reclaiming a huge page */
+ VM_WARN_ON(PageLRU(head));
get_page(tail);
list_add_tail(&tail->lru, list);
} else {
- /*
- * Head page has not yet been counted, as an hpage,
- * so we must account for each subpage individually.
- *
- * Put tail on the list at the correct position
- * so they all end up in order.
- */
- add_page_to_lru_list_tail(tail, lruvec, page_lru(tail));
+ /* head is still on lru (and we have it frozen) */
+ VM_WARN_ON(!PageLRU(head));
+ SetPageLRU(tail);
+ list_add_tail(&tail->lru, &head->lru);
}
}

--
1.8.3.1


2020-10-30 13:56:57

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v20 04/20] mm/thp: use head for head page in lru_add_page_tail

On Fri, Oct 30, 2020 at 10:46:54AM +0800, Alex Shi wrote:
> 在 2020/10/29 下午9:50, Johannes Weiner 写道:
> > It may be better to pick either
> > head and tail
>
> Hi Johannes,
>
> Thanks for comments!
>
> Right. Considering that functions in this file mostly use head/tail as parameter
> names, I will change to head/tail too. The 04th, 05th, and 18th patches will then
> be changed accordingly.

That's great, thank you!

> From a9ee63a213f40eb4d5a69b52fbb348ff9cd7cf6c Mon Sep 17 00:00:00 2001
> From: Alex Shi <[email protected]>
> Date: Tue, 26 May 2020 16:49:22 +0800
> Subject: [PATCH v21 04/20] mm/thp: use head for head page in lru_add_page_tail
>
> Since the first parameter is only used by head page, it's better to make
> it explicit.
>
> Signed-off-by: Alex Shi <[email protected]>
> Reviewed-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Hugh Dickins <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: [email protected]
> Cc: [email protected]

Acked-by: Johannes Weiner <[email protected]>

2020-10-30 14:09:14

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v20 02/20] mm/memcg: bail early from swap accounting if memcg disabled

On Fri, Oct 30, 2020 at 10:27:51AM +0800, Alex Shi wrote:
>
>
> 在 2020/10/29 下午9:46, Johannes Weiner 写道:
> >> ? release_pages+0x1ae/0x410
> >> shmem_alloc_and_acct_page+0x77/0x1c0
> >> shmem_getpage_gfp+0x162/0x910
> >> shmem_fault+0x74/0x210
> >> ? filemap_map_pages+0x29c/0x410
> >> __do_fault+0x37/0x190
> >> handle_mm_fault+0x120a/0x1770
> >> exc_page_fault+0x251/0x450
> >> ? asm_exc_page_fault+0x8/0x30
> >> asm_exc_page_fault+0x1e/0x30
> >>
> >> Signed-off-by: Alex Shi <[email protected]>
> >> Reviewed-by: Roman Gushchin <[email protected]>
> >> Acked-by: Michal Hocko <[email protected]>
> >> Acked-by: Hugh Dickins <[email protected]>
> >> Cc: Johannes Weiner <[email protected]>
> >> Cc: Michal Hocko <[email protected]>
> >> Cc: Vladimir Davydov <[email protected]>
> >> Cc: Andrew Morton <[email protected]>
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> > Acked-by: Johannes Weiner <[email protected]>
> >
> > This should go in before the previous patch that adds the WARN for it.
>
> Right, but then the long warning trace in the changelog may look odd. Should I remove it and resend the whole patchset?

You mean the warning in the changelog? I think that's alright. You can
just say that you're about to remove the !page->memcg check in the
next patch because the original reasons for having it are gone, and
memcg being disabled is the only remaining exception, so this patch
makes that check explicit in preparation for the next.

Sorry, it's all a bit of a hassle, I just wouldn't want to introduce a
known warning into the kernel between those two patches (could confuse
bisection runs, complicates partial reverts etc.)

2020-10-31 01:17:55

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v20 02/20] mm/memcg: bail early from swap accounting if memcg disabled



在 2020/10/30 下午10:04, Johannes Weiner 写道:
>>> Acked-by: Johannes Weiner <[email protected]>
>>>
>>> This should go in before the previous patch that adds the WARN for it.
>> Right, but then the long warning trace in the changelog may look odd. Should I remove it and resend the whole patchset?
> You mean the warning in the changelog? I think that's alright. You can
> just say that you're about to remove the !page->memcg check in the
> next patch because the original reasons for having it are gone, and
> memcg being disabled is the only remaining exception, so this patch
> makes that check explicit in preparation for the next.
>
> Sorry, it's all a bit of a hassle, I just wouldn't want to introduce a
> known warning into the kernel between those two patches (could confuse
> bisection runs, complicates partial reverts etc.)

Hi Johannes,

I see. I will swap the order of the 1st and 2nd patches and fold the above explanation
into the commit log. I guess you may have more comments on other patches, so I am going
to wait for those and then send v21 as a whole. :)

Many thanks!
Alex

2020-10-31 01:18:11

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v20 04/20] mm/thp: use head for head page in lru_add_page_tail



在 2020/10/30 下午9:52, Johannes Weiner 写道:
>
>> From a9ee63a213f40eb4d5a69b52fbb348ff9cd7cf6c Mon Sep 17 00:00:00 2001
>> From: Alex Shi <[email protected]>
>> Date: Tue, 26 May 2020 16:49:22 +0800
>> Subject: [PATCH v21 04/20] mm/thp: use head for head page in lru_add_page_tail
>>
>> Since the first parameter is only used by head page, it's better to make
>> it explicit.
>>
>> Signed-off-by: Alex Shi <[email protected]>
>> Reviewed-by: Kirill A. Shutemov <[email protected]>
>> Acked-by: Hugh Dickins <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Matthew Wilcox <[email protected]>
>> Cc: Hugh Dickins <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
> Acked-by: Johannes Weiner <[email protected]>


Thanks a lot!
Alex

2020-11-02 14:45:05

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock

On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote:
> From: Hugh Dickins <[email protected]>
>
> It is necessary for page_idle_get_page() to recheck PageLRU() after
> get_page_unless_zero(), but holding lru_lock around that serves no
> useful purpose, and adds to lru_lock contention: delete it.
>
> See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
> discussion that led to lru_lock there; but __page_set_anon_rmap() now
> uses WRITE_ONCE(),

That doesn't seem to be the case in Linus's or Andrew's tree. Am I
missing a dependent patch series?

> and I see no other risk in page_idle_clear_pte_refs() using
> rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but
> not entirely prevented by page_count() check in ksm.c's
> write_protect_page(): that risk being shared with page_referenced()
> and not helped by lru_lock).

Isn't it possible, as per Minchan's description, for page->mapping to
point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap
thinking it's looking at a struct address_space?

2020-11-02 14:51:52

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock

On Mon, Nov 02, 2020 at 09:41:10AM -0500, Johannes Weiner wrote:
> On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote:
> > From: Hugh Dickins <[email protected]>
> >
> > It is necessary for page_idle_get_page() to recheck PageLRU() after
> > get_page_unless_zero(), but holding lru_lock around that serves no
> > useful purpose, and adds to lru_lock contention: delete it.
> >
> > See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
> > discussion that led to lru_lock there; but __page_set_anon_rmap() now
> > uses WRITE_ONCE(),
>
> That doesn't seem to be the case in Linus's or Andrew's tree. Am I
> missing a dependent patch series?
>
> > and I see no other risk in page_idle_clear_pte_refs() using
> > rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but
> > not entirely prevented by page_count() check in ksm.c's
> > write_protect_page(): that risk being shared with page_referenced()
> > and not helped by lru_lock).
>
> Isn't it possible, as per Minchan's description, for page->mapping to
> point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap
> thinking it's looking at a struct address_space?

I don't think it can point to an anon_vma without the ANON bit set.
Minchan's concern in that email was that it might still be NULL.

2020-11-02 14:55:49

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v20 12/20] mm/vmscan: remove lruvec reget in move_pages_to_lru

On Thu, Oct 29, 2020 at 06:44:57PM +0800, Alex Shi wrote:
> An isolated page shouldn't be recharged by memcg, since memcg migration
> isn't possible at that time.
> So remove the unnecessary lruvec reget.
>
> Thanks to Alexander Duyck for pointing this out.
>
> Signed-off-by: Alex Shi <[email protected]>
> Acked-by: Hugh Dickins <[email protected]>
> Cc: Alexander Duyck <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Konstantin Khlebnikov <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]

Acked-by: Johannes Weiner <[email protected]>

A brief comment in the code could be good: all pages were isolated
from the same lruvec (and isolation inhibits memcg migration).

2020-11-02 15:01:52

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v20 14/20] mm/mlock: remove __munlock_isolate_lru_page

On Thu, Oct 29, 2020 at 06:44:59PM +0800, Alex Shi wrote:
> The func only has one caller, remove it to clean up code and simplify
> code.
>
> Signed-off-by: Alex Shi <[email protected]>
> Acked-by: Hugh Dickins <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Kirill A. Shutemov <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: [email protected]
> Cc: [email protected]

Acked-by: Johannes Weiner <[email protected]>

2020-11-02 15:15:36

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v20 15/20] mm/lru: introduce TestClearPageLRU

On Thu, Oct 29, 2020 at 06:45:00PM +0800, Alex Shi wrote:
> Currently lru_lock still guards both lru list and page's lru bit, that's
> ok. but if we want to use specific lruvec lock on the page, we need to
> pin down the page's lruvec/memcg during locking. Just taking lruvec
> lock first may be undermined by the page's memcg charge/migration. To
> fix this problem, we could clear the lru bit out of locking and use
> it as pin down action to block the page isolation in memcg changing.

Small nit, but the use of "could" in this sentence sounds like you're
describing one possible solution that isn't being taken, when in fact
you are describing the chosen locking mechanism.

Replacing "could" with "will" would make things a bit clearer IMO.

> So now the standard steps of page isolation are as follows:
> 1, get_page(); #pin the page to avoid it being freed
> 2, TestClearPageLRU(); #block other isolation, like memcg change
> 3, spin_lock on lru_lock; #serialize lru list access
> 4, delete page from lru list;
> Step 2 could be optimized/replaced in scenarios where the page is
> unlikely to be accessed or moved between memcgs.

This is a bit ominous. I'd either elaborate / provide an example /
clarify why some sites can deal with races - or just remove that
sentence altogether from this part of the changelog.

> This patch starts with the first part: TestClearPageLRU, which combines
> the PageLRU check and ClearPageLRU into a macro func TestClearPageLRU. This
> function will be used as a page isolation precondition, to prevent other
> isolations somewhere else. There may then be !PageLRU pages on the lru
> list, so the BUG() checks need to be removed accordingly.
>
> There are 2 rules for the lru bit now:
> 1, the lru bit still indicates whether a page is on the lru list, except
> for some temporary moments (isolating) when the page may have no lru bit
> while it is on the lru list. But the page must be on the lru list whenever
> the lru bit is set.
> 2, the lru bit has to be cleared before the page is deleted from the lru list.
>
> As Andrew Morton mentioned, this change would dirty the cacheline for a
> page that isn't on the LRU. But the cost is acceptable according to Rong
> Chen's <[email protected]> report:
> https://lore.kernel.org/lkml/20200304090301.GB5972@shao2-debian/
>
> Suggested-by: Johannes Weiner <[email protected]>
> Signed-off-by: Alex Shi <[email protected]>
> Acked-by: Hugh Dickins <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Vladimir Davydov <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]

Acked-by: Johannes Weiner <[email protected]>
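
The numbered isolation steps in the changelog above are easier to follow in code.
Below is a minimal, hypothetical C sketch of that sequence (not a hunk from the
patchset); it assumes the TestClearPageLRU() and lock_page_lruvec_irq() helpers
that this series introduces:

/*
 * Sketch only: the four isolation steps from the changelog above,
 * assuming the helpers introduced elsewhere in this series.
 */
static bool sketch_isolate_lru_page(struct page *page, struct list_head *dst)
{
	struct lruvec *lruvec;

	/* 1: pin the page so it cannot be freed under us */
	if (!get_page_unless_zero(page))
		return false;

	/* 2: clear the lru bit to block other isolation / memcg change */
	if (!TestClearPageLRU(page)) {
		put_page(page);
		return false;
	}

	/* 3: lru_lock now only serializes the lru list manipulation */
	lruvec = lock_page_lruvec_irq(page);
	/* 4: delete the page from its lru list */
	del_page_from_lru_list(page, lruvec, page_lru(page));
	unlock_page_lruvec_irq(lruvec);

	list_add(&page->lru, dst);
	return true;
}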

2020-11-02 15:22:24

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v20 16/20] mm/compaction: do page isolation first in compaction

On Thu, Oct 29, 2020 at 06:45:01PM +0800, Alex Shi wrote:
> Currently, compaction takes the lru_lock and then does page isolation,
> which works fine with pgdat->lru_lock, since any page isolation would
> compete for the lru_lock. If we want to change to the memcg lru_lock, we
> have to isolate the page before taking the lru_lock; isolation then
> blocks the page's memcg change, which relies on page isolation too. Then we
> can safely use the per memcg lru_lock later.
>
> The new page isolation uses the previously introduced TestClearPageLRU() +
> pgdat lru locking, which will be changed to the memcg lru lock later.
>
> Hugh Dickins <[email protected]> fixed following bugs in this patch's
> early version:
>
> Fix lots of crashes under compaction load: isolate_migratepages_block()
> must clean up appropriately when rejecting a page, setting PageLRU again
> if it had been cleared; and a put_page() after get_page_unless_zero()
> cannot safely be done while holding locked_lruvec - it may turn out to
> be the final put_page(), which will take an lruvec lock when PageLRU.
> And move __isolate_lru_page_prepare back after get_page_unless_zero to
> make trylock_page() safe:
> trylock_page() is not safe to use at this time: its setting PG_locked
> can race with the page being freed or allocated ("Bad page"), and can
> also erase flags being set by one of those "sole owners" of a freshly
> allocated page who use non-atomic __SetPageFlag().
>
> Suggested-by: Johannes Weiner <[email protected]>
> Signed-off-by: Alex Shi <[email protected]>
> Acked-by: Hugh Dickins <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: [email protected]
> Cc: [email protected]

Acked-by: Johannes Weiner <[email protected]>
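
The two fixes Hugh describes above (re-set PageLRU on a rejected page, and only
call put_page() after the lruvec lock has been dropped) are easier to see in code.
This is a hypothetical, much-simplified sketch of the reject path, not the actual
isolate_migratepages_block() diff:

/*
 * Sketch only: reject a page that was pinned with get_page_unless_zero()
 * and whose lru bit was cleared with TestClearPageLRU() before locking.
 */
static void sketch_isolate_fail_put(struct page *page, struct lruvec **locked)
{
	/* Undo the isolation: make the page visible to other isolators again. */
	SetPageLRU(page);

	/*
	 * Drop the lruvec lock before put_page(): if this turns out to be
	 * the final reference, freeing may itself need an lruvec lock.
	 */
	if (*locked) {
		unlock_page_lruvec_irq(*locked);
		*locked = NULL;
	}
	put_page(page);
}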

2020-11-02 16:06:27

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v20 04/20] mm/thp: use head for head page in lru_add_page_tail

On Fri, Oct 30, 2020 at 10:46:54AM +0800, Alex Shi wrote:
> -static void lru_add_page_tail(struct page *page, struct page *page_tail,
> +static void lru_add_page_tail(struct page *head, struct page *tail,
> struct lruvec *lruvec, struct list_head *list)
> {
> - VM_BUG_ON_PAGE(!PageHead(page), page);
> - VM_BUG_ON_PAGE(PageCompound(page_tail), page);
> - VM_BUG_ON_PAGE(PageLRU(page_tail), page);
> + VM_BUG_ON_PAGE(!PageHead(head), head);
> + VM_BUG_ON_PAGE(PageCompound(tail), head);
> + VM_BUG_ON_PAGE(PageLRU(tail), head);

These last two should surely have been
VM_BUG_ON_PAGE(PageCompound(tail), tail);
VM_BUG_ON_PAGE(PageLRU(tail), tail);

Also, what do people think about converting these to VM_BUG_ON_PGFLAGS?

Either way:

Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>

2020-11-02 20:23:49

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock

On Mon, Nov 02, 2020 at 02:49:27PM +0000, Matthew Wilcox wrote:
> On Mon, Nov 02, 2020 at 09:41:10AM -0500, Johannes Weiner wrote:
> > On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote:
> > > From: Hugh Dickins <[email protected]>
> > >
> > > It is necessary for page_idle_get_page() to recheck PageLRU() after
> > > get_page_unless_zero(), but holding lru_lock around that serves no
> > > useful purpose, and adds to lru_lock contention: delete it.
> > >
> > > See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
> > > discussion that led to lru_lock there; but __page_set_anon_rmap() now
> > > uses WRITE_ONCE(),
> >
> > That doesn't seem to be the case in Linus's or Andrew's tree. Am I
> > missing a dependent patch series?
> >
> > > and I see no other risk in page_idle_clear_pte_refs() using
> > > rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but
> > > not entirely prevented by page_count() check in ksm.c's
> > > write_protect_page(): that risk being shared with page_referenced()
> > > and not helped by lru_lock).
> >
> > Isn't it possible, as per Minchan's description, for page->mapping to
> > point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap
> > thinking it's looking at a struct address_space?
>
> I don't think it can point to an anon_vma without the ANON bit set.
> Minchan's concern in that email was that it might still be NULL.

Hm, no, the thread is a lengthy discussion about whether the store
could be split such that page->mapping is actually pointing to
something invalid (anon_vma without the PageAnon bit).

From his email:

CPU 0 CPU 1

do_anonymous_page
__page_set_anon_rmap
/* out of order happened so SetPageLRU is done ahead */
SetPageLRU(page)
/* Compilr changed store operation like below */
page->mapping = (struct address_space *) anon_vma;
/* Big stall happens */
/* idletacking judged it as LRU page so pass the page
in page_reference */
page_refernced
page_rmapping return true because
page->mapping has some vaule but not complete
so it calls rmap_walk_file.
it's okay to pass non-completed anon page in rmap_walk_file?

2020-11-03 02:47:43

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v20 04/20] mm/thp: use head for head page in lru_add_page_tail



在 2020/11/3 上午12:03, Matthew Wilcox 写道:
> On Fri, Oct 30, 2020 at 10:46:54AM +0800, Alex Shi wrote:
>> -static void lru_add_page_tail(struct page *page, struct page *page_tail,
>> +static void lru_add_page_tail(struct page *head, struct page *tail,
>> struct lruvec *lruvec, struct list_head *list)
>> {
>> - VM_BUG_ON_PAGE(!PageHead(page), page);
>> - VM_BUG_ON_PAGE(PageCompound(page_tail), page);
>> - VM_BUG_ON_PAGE(PageLRU(page_tail), page);
>> + VM_BUG_ON_PAGE(!PageHead(head), head);
>> + VM_BUG_ON_PAGE(PageCompound(tail), head);
>> + VM_BUG_ON_PAGE(PageLRU(tail), head);
>
> These last two should surely have been
> VM_BUG_ON_PAGE(PageCompound(tail), tail);
> VM_BUG_ON_PAGE(PageLRU(tail), tail);
>
> Also, what do people think about converting these to VM_BUG_ON_PGFLAGS?

Hi Matthew,

Thanks for the reminder! These changes look worth a separate patch.

>
> Either way:
>
> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
>


I will take this option this time. :)

Thanks!
Alex

2020-11-03 02:57:41

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v20 12/20] mm/vmscan: remove lruvec reget in move_pages_to_lru



在 2020/11/2 下午10:52, Johannes Weiner 写道:
> Acked-by: Johannes Weiner <[email protected]>
>
> A brief comment in the code could be good: all pages were isolated
> from the same lruvec (and isolation inhibits memcg migration).

Yes, I will add that comment in both the code and the commit log.

Thanks

2020-11-03 03:09:19

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v20 15/20] mm/lru: introduce TestClearPageLRU



在 2020/11/2 下午11:10, Johannes Weiner 写道:
> On Thu, Oct 29, 2020 at 06:45:00PM +0800, Alex Shi wrote:
>> Currently lru_lock still guards both lru list and page's lru bit, that's
>> ok. but if we want to use specific lruvec lock on the page, we need to
>> pin down the page's lruvec/memcg during locking. Just taking lruvec
>> lock first may be undermined by the page's memcg charge/migration. To
>> fix this problem, we could clear the lru bit out of locking and use
>> it as pin down action to block the page isolation in memcg changing.
>
> Small nit, but the use of "could" in this sentence sounds like you're
> describing one possible solution that isn't being taken, when in fact
> you are describing the chosen locking mechanism.
>
> Replacing "could" with "will" would make things a bit clearer IMO.
>

Yes, 'will' is better here. Thanks!

>> So now the standard steps of page isolation are as follows:
>> 1, get_page(); #pin the page to avoid it being freed
>> 2, TestClearPageLRU(); #block other isolation, like memcg change
>> 3, spin_lock on lru_lock; #serialize lru list access
>> 4, delete page from lru list;
>> Step 2 could be optimized/replaced in scenarios where the page is
>> unlikely to be accessed or moved between memcgs.
>
> This is a bit ominous. I'd either elaborate / provide an example /
> clarify why some sites can deal with races - or just remove that
> sentence altogether from this part of the changelog.
>

There are a few scenarios here, so examples would either get verbose or not
cover them all. Maybe removing the above 2 lines "Step 2 could be optimized/replaced
in scenarios where the page is unlikely to be accessed or moved between memcgs."
is better.

Thanks!

>> This patch starts with the first part: TestClearPageLRU, which combines
>> the PageLRU check and ClearPageLRU into a macro func TestClearPageLRU. This
>> function will be used as a page isolation precondition, to prevent other
>> isolations somewhere else. There may then be !PageLRU pages on the lru
>> list, so the BUG() checks need to be removed accordingly.
>>
>> There are 2 rules for the lru bit now:
>> 1, the lru bit still indicates whether a page is on the lru list, except
>> for some temporary moments (isolating) when the page may have no lru bit
>> while it is on the lru list. But the page must be on the lru list whenever
>> the lru bit is set.
>> 2, the lru bit has to be cleared before the page is deleted from the lru list.
>>
>> As Andrew Morton mentioned, this change would dirty the cacheline for a
>> page that isn't on the LRU. But the cost is acceptable according to Rong
>> Chen's <[email protected]> report:
>> https://lore.kernel.org/lkml/20200304090301.GB5972@shao2-debian/
>>
>> Suggested-by: Johannes Weiner <[email protected]>
>> Signed-off-by: Alex Shi <[email protected]>
>> Acked-by: Hugh Dickins <[email protected]>
>> Cc: Hugh Dickins <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Michal Hocko <[email protected]>
>> Cc: Vladimir Davydov <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>
> Acked-by: Johannes Weiner <[email protected]>
>

Thanks!
Alex

2020-11-04 11:29:41

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock



在 2020/11/3 上午4:20, Johannes Weiner 写道:
> On Mon, Nov 02, 2020 at 02:49:27PM +0000, Matthew Wilcox wrote:
>> On Mon, Nov 02, 2020 at 09:41:10AM -0500, Johannes Weiner wrote:
>>> On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote:
>>>> From: Hugh Dickins <[email protected]>
>>>>
>>>> It is necessary for page_idle_get_page() to recheck PageLRU() after
>>>> get_page_unless_zero(), but holding lru_lock around that serves no
>>>> useful purpose, and adds to lru_lock contention: delete it.
>>>>
>>>> See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
>>>> discussion that led to lru_lock there; but __page_set_anon_rmap() now
>>>> uses WRITE_ONCE(),
>>>
>>> That doesn't seem to be the case in Linus's or Andrew's tree. Am I
>>> missing a dependent patch series?
>>>
>>>> and I see no other risk in page_idle_clear_pte_refs() using
>>>> rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but
>>>> not entirely prevented by page_count() check in ksm.c's
>>>> write_protect_page(): that risk being shared with page_referenced()
>>>> and not helped by lru_lock).
>>>
>>> Isn't it possible, as per Minchan's description, for page->mapping to
>>> point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap
>>> thinking it's looking at a struct address_space?
>>
>> I don't think it can point to an anon_vma without the ANON bit set.
>> Minchan's concern in that email was that it might still be NULL.
>
> Hm, no, the thread is a lengthy discussion about whether the store
> could be split such that page->mapping is actually pointing to
> something invalid (anon_vma without the PageAnon bit).
>
> From his email:
>
> CPU 0 CPU 1
>
> do_anonymous_page
> __page_set_anon_rmap
> /* out of order happened so SetPageLRU is done ahead */
> SetPageLRU(page)

This SetPageLRU is done in __pagevec_lru_add_fn(), which runs under lru_lock
protection, so the original memory barrier / lock concern isn't
correct. That means the SetPageLRU can't happen here, and so there is
no worry about the 'CPU 1' side of the problem.

> /* Compilr changed store operation like below */
> page->mapping = (struct address_space *) anon_vma;
> /* Big stall happens */
> /* idletacking judged it as LRU page so pass the page
> in page_reference */
> page_refernced
page_referenced should be page_idle_clear_pte_refs_one here?
> page_rmapping return true because
> page->mapping has some vaule but not complete
> so it calls rmap_walk_file.
> it's okay to pass non-completed anon page in rmap_walk_file?
>


For this patch: according to the comments of page_idle_get_page(),
PageLRU is just used to judge whether the page is in use by userspace. For
that purpose an unguarded PageLRU check is good enough, so this patch
should be fine.

Thanks
Alex
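
For reference, the shape of page_idle_get_page() without the lru_lock is roughly
the following (a sketch of the post-patch function as it would look on a
v5.10-era tree, not the exact diff from the series):

static struct page *page_idle_get_page(unsigned long pfn)
{
	struct page *page = pfn_to_online_page(pfn);

	if (!page || !PageLRU(page) ||
	    !get_page_unless_zero(page))
		return NULL;

	/* Recheck after pinning: the page may have left the LRU meanwhile. */
	if (unlikely(!PageLRU(page))) {
		put_page(page);
		page = NULL;
	}
	return page;
}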

2020-11-04 11:59:15

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v20 00/20] per memcg lru lock

Hi Johannes & all,

Thanks for all the comments and suggestions. Here is a patch based on v20, summarizing everything you suggested.
Is this ok?

Many thanks!
Alex

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0c97292834fa..0fe4172c8c14 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -20,6 +20,9 @@
* Lockless page tracking & accounting
* Unified hierarchy configuration model
* Copyright (C) 2015 Red Hat, Inc., Johannes Weiner
+ *
+ * Per memcg lru locking
+ * Copyright (C) 2020 Alibaba, Inc, Alex Shi
*/

#include <linux/page_counter.h>
@@ -1380,6 +1383,14 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
return lruvec;
}

+/**
+ * lock_page_lruvec - return lruvec for the locked page.
+ * @page: the page
+ *
+ * This series functions should be used in either conditions:
+ * PageLRU is cleared or unset
+ * or, page->_refcount is zero
+ */
struct lruvec *lock_page_lruvec(struct page *page)
{
struct lruvec *lruvec;
diff --git a/mm/swap.c b/mm/swap.c
index 9fe5ff9a8111..bcc814de35c4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -264,6 +264,13 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
do {
unsigned long lrusize;

+ /*
+ * Hold lruvec->lru_lock is safe here, since
+ * 1) The pinned lruvec in reclaim, or
+ * 2) From a pre-LRU page during refault (which also holds the
+ * rcu lock, so would be safe even if the page was on the LRU
+ * and could move simultaneously to a new lruvec).
+ */
spin_lock_irq(&lruvec->lru_lock);
/* Record cost event */
if (file)
@@ -355,10 +362,12 @@ static void activate_page(struct page *page)
struct lruvec *lruvec;

page = compound_head(page);
- lruvec = lock_page_lruvec_irq(page);
- if (PageLRU(page))
+ if (TestClearPageLRU(page)) {
+ lruvec = lock_page_lruvec_irq(page);
__activate_page(page, lruvec);
- unlock_page_lruvec_irq(lruvec);
+ unlock_page_lruvec_irq(lruvec);
+ SetPageLRU(page);
+ }
}
#endif

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7ed10ade548d..af03a7f2e1b8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1868,6 +1868,10 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
continue;
}

+ /*
+ * All pages were isolated from the same lruvec (and isolation
+ * inhibits memcg migration).
+ */
VM_BUG_ON_PAGE(!lruvec_holds_page_lru_lock(page, lruvec), page);
lru = page_lru(page);
nr_pages = thp_nr_pages(page);

2020-11-04 17:03:36

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v20 00/20] per memcg lru lock

On Wed, Nov 04, 2020 at 07:55:39PM +0800, Alex Shi wrote:
> @@ -1380,6 +1383,14 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
> return lruvec;
> }
>
> +/**
> + * lock_page_lruvec - return lruvec for the locked page.

I would say "lock and return the lruvec for a given page"

> + * @page: the page
> + *
> + * This series functions should be used in either conditions:
> + * PageLRU is cleared or unset
> + * or, page->_refcount is zero

or page is locked

The other changes look good to me, thanks!
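
Folding both suggestions into the kernel-doc, the comment could read something
like the following (one possible wording, not necessarily the final v21 text):

/**
 * lock_page_lruvec - lock and return the lruvec for a given page.
 * @page: the page
 *
 * These functions are safe to use under any of the following conditions:
 * - page locked
 * - PageLRU cleared
 * - page->_refcount is zero
 */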

2020-11-04 19:00:43

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock

On Wed, Nov 04, 2020 at 07:27:21PM +0800, Alex Shi wrote:
> 在 2020/11/3 上午4:20, Johannes Weiner 写道:
> > On Mon, Nov 02, 2020 at 02:49:27PM +0000, Matthew Wilcox wrote:
> >> On Mon, Nov 02, 2020 at 09:41:10AM -0500, Johannes Weiner wrote:
> >>> On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote:
> >>>> From: Hugh Dickins <[email protected]>
> >>>>
> >>>> It is necessary for page_idle_get_page() to recheck PageLRU() after
> >>>> get_page_unless_zero(), but holding lru_lock around that serves no
> >>>> useful purpose, and adds to lru_lock contention: delete it.
> >>>>
> >>>> See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
> >>>> discussion that led to lru_lock there; but __page_set_anon_rmap() now
> >>>> uses WRITE_ONCE(),
> >>>
> >>> That doesn't seem to be the case in Linus's or Andrew's tree. Am I
> >>> missing a dependent patch series?
> >>>
> >>>> and I see no other risk in page_idle_clear_pte_refs() using
> >>>> rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but
> >>>> not entirely prevented by page_count() check in ksm.c's
> >>>> write_protect_page(): that risk being shared with page_referenced()
> >>>> and not helped by lru_lock).
> >>>
> >>> Isn't it possible, as per Minchan's description, for page->mapping to
> >>> point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap
> >>> thinking it's looking at a struct address_space?
> >>
> >> I don't think it can point to an anon_vma without the ANON bit set.
> >> Minchan's concern in that email was that it might still be NULL.
> >
> > Hm, no, the thread is a lengthy discussion about whether the store
> > could be split such that page->mapping is actually pointing to
> > something invalid (anon_vma without the PageAnon bit).
> >
> > From his email:
> >
> > CPU 0 CPU 1
> >
> > do_anonymous_page
> > __page_set_anon_rmap
> > /* out of order happened so SetPageLRU is done ahead */
> > SetPageLRU(page)
>
> This SetPageLRU is done in __pagevec_lru_add_fn(), which runs under lru_lock
> protection, so the original memory barrier / lock concern isn't
> correct. That means the SetPageLRU can't happen here, and so there is
> no worry about the 'CPU 1' side of the problem.

The SetPageLRU is done under lru_lock, but the store to page->mapping
is not, so what ensures ordering between them? And what prevents the
compiler from tearing the store to page->mapping?

The writer does this:

CPU 0
page_add_new_anon_rmap()
page->mapping = anon_vma + PAGE_MAPPING_ANON
lru_cache_add_inactive_or_unevictable()
spin_lock(lruvec->lock)
SetPageLRU()
spin_unlock(lruvec->lock)

The concern is what CPU 1 will observe at page->mapping after
observing PageLRU set on the page.

1. anon_vma + PAGE_MAPPING_ANON

That's the in-order scenario and is fine.

2. NULL

That's possible if the page->mapping store gets reordered to occur
after SetPageLRU. That's fine too because we check for it.

3. anon_vma without the PAGE_MAPPING_ANON bit

That would be a problem and could lead to all kinds of undesirable
behavior including crashes and data corruption.

Is it possible? AFAICT the compiler is allowed to tear the store to
page->mapping and I don't see anything that would prevent it.

That said, I also don't see how the reader testing PageLRU under the
lru_lock would prevent that in the first place. AFAICT we need that
WRITE_ONCE() around the page->mapping assignment that's referenced in
the changelog of this patch but I cannot find in any tree or patch.
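
A minimal illustration of the store/load pairing being discussed (hypothetical
helper names, not code from the thread): the writer must publish anon_vma and
the type bit as one store, and a lockless reader must load page->mapping once
and validate the type bits before trusting it.

/* Writer: publish anon_vma and PAGE_MAPPING_ANON with a single store. */
static void sketch_set_anon_mapping(struct page *page, struct anon_vma *anon_vma)
{
	unsigned long mapping = (unsigned long)anon_vma | PAGE_MAPPING_ANON;

	/* Without WRITE_ONCE() the compiler may legally emit the store in pieces. */
	WRITE_ONCE(page->mapping, (struct address_space *)mapping);
}

/* Lockless reader: load page->mapping once, check the type bits, then use it. */
static struct anon_vma *sketch_page_anon_vma(struct page *page)
{
	unsigned long mapping = (unsigned long)READ_ONCE(page->mapping);

	if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		return NULL;	/* file, movable, KSM, or still NULL */
	return (struct anon_vma *)(mapping & ~PAGE_MAPPING_FLAGS);
}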

2020-11-05 05:42:27

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock



在 2020/11/5 上午1:46, Johannes Weiner 写道:
> On Wed, Nov 04, 2020 at 07:27:21PM +0800, Alex Shi wrote:
>> 在 2020/11/3 上午4:20, Johannes Weiner 写道:
>>> On Mon, Nov 02, 2020 at 02:49:27PM +0000, Matthew Wilcox wrote:
>>>> On Mon, Nov 02, 2020 at 09:41:10AM -0500, Johannes Weiner wrote:
>>>>> On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote:
>>>>>> From: Hugh Dickins <[email protected]>
>>>>>>
>>>>>> It is necessary for page_idle_get_page() to recheck PageLRU() after
>>>>>> get_page_unless_zero(), but holding lru_lock around that serves no
>>>>>> useful purpose, and adds to lru_lock contention: delete it.
>>>>>>
>>>>>> See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
>>>>>> discussion that led to lru_lock there; but __page_set_anon_rmap() now
>>>>>> uses WRITE_ONCE(),
>>>>>
>>>>> That doesn't seem to be the case in Linus's or Andrew's tree. Am I
>>>>> missing a dependent patch series?
>>>>>
>>>>>> and I see no other risk in page_idle_clear_pte_refs() using
>>>>>> rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but
>>>>>> not entirely prevented by page_count() check in ksm.c's
>>>>>> write_protect_page(): that risk being shared with page_referenced()
>>>>>> and not helped by lru_lock).
>>>>>
>>>>> Isn't it possible, as per Minchan's description, for page->mapping to
>>>>> point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap
>>>>> thinking it's looking at a struct address_space?
>>>>
>>>> I don't think it can point to an anon_vma without the ANON bit set.
>>>> Minchan's concern in that email was that it might still be NULL.
>>>
>>> Hm, no, the thread is a lengthy discussion about whether the store
>>> could be split such that page->mapping is actually pointing to
>>> something invalid (anon_vma without the PageAnon bit).
>>>
>>> From his email:
>>>
>>> CPU 0 CPU 1
>>>
>>> do_anonymous_page
>>> __page_set_anon_rmap
>>> /* out of order happened so SetPageLRU is done ahead */
>>> SetPageLRU(page)
>>
>> This SetPageLRU is done in __pagevec_lru_add_fn(), which runs under lru_lock
>> protection, so the original memory barrier / lock concern isn't
>> correct. That means the SetPageLRU can't happen here, and so there is
>> no worry about the 'CPU 1' side of the problem.
>
> The SetPageLRU is done under lru_lock, but the store to page->mapping
> is not, so what ensures ordering between them? And what prevents the
> compiler from tearing the store to page->mapping?
>

Right, I misunderstood the memory barrier semantics of spin_lock. Thanks a lot
for pointing this out.
So, is the following patch fine to fix the problem?

From 8427121da195a6a386d1ce71abcb41b31211e21f Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Thu, 5 Nov 2020 11:38:24 +0800
Subject: [PATCH] mm/rmap: stop store reordering issue on page->mapping

Hugh Dickins and Minchan Kim observed a long-standing issue, discussed at
https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop/
but the fix mentioned there was actually missed.
The store reordering may cause a problem in the following scenario:

CPU 0                                    CPU 1
do_anonymous_page
  page_add_new_anon_rmap()
    page->mapping = anon_vma + PAGE_MAPPING_ANON
  lru_cache_add_inactive_or_unevictable()
    spin_lock(lruvec->lock)
    SetPageLRU()
    spin_unlock(lruvec->lock)
                                         /* idle tracking judged it as an
                                          * LRU page, so it passes the page
                                          * to page_idle_clear_pte_refs
                                          */
                                         page_idle_clear_pte_refs
                                           rmap_walk
                                             if PageAnon(page)

Johannes gave a detailed example of how the store reordering could cause
trouble: the concern is that the SetPageLRU may be reordered before the
'page->mapping' store, so CPU 1 may observe page->mapping only after
observing PageLRU set on the page.

1. anon_vma + PAGE_MAPPING_ANON

That's the in-order scenario and is fine.

2. NULL

That's possible if the page->mapping store gets reordered to occur
after SetPageLRU. That's fine too because we check for it.

3. anon_vma without the PAGE_MAPPING_ANON bit

That would be a problem and could lead to all kinds of undesirable
behavior including crashes and data corruption.

Is it possible? AFAICT the compiler is allowed to tear the store to
page->mapping and I don't see anything that would prevent it.

That said, I also don't see how the reader testing PageLRU under the
lru_lock would prevent that in the first place. AFAICT we need that
WRITE_ONCE() around the page->mapping assignment.

Signed-off-by: Alex Shi <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/rmap.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index c050dab2ae65..56af18aa57de 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1054,8 +1054,27 @@ static void __page_set_anon_rmap(struct page *page,
if (!exclusive)
anon_vma = anon_vma->root;

+ /*
+ * w/o the WRITE_ONCE here the following scenario may happens due to
+ * store reordering.
+ *
+ * CPU 0 CPU 1
+ *
+ * do_anonymous_page page_idle_clear_pte_refs
+ * __page_set_anon_rmap
+ * page->mapping = anon_vma + PAGE_MAPPING_ANON
+ * lru_cache_add_inactive_or_unevictable()
+ * SetPageLRU(page)
+ * rmap_walk
+ * if PageAnon(page)
+ *
+ * The 'SetPageLRU' may reordered before page->mapping setting, and
+ * page->mapping may set with anon_vma, w/o anon bit, then rmap_walk
+ * may goes to rmap_walk_file() for a anon page.
+ */
+
anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
- page->mapping = (struct address_space *) anon_vma;
+ WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
page->index = linear_page_index(vma, address);
}

--
1.8.3.1


> The writer does this:
>
> CPU 0
> page_add_new_anon_rmap()
> page->mapping = anon_vma + PAGE_MAPPING_ANON
> lru_cache_add_inactive_or_unevictable()
> spin_lock(lruvec->lock)
> SetPageLRU()
> spin_unlock(lruvec->lock)
>
> The concern is what CPU 1 will observe at page->mapping after
> observing PageLRU set on the page.
>
> 1. anon_vma + PAGE_MAPPING_ANON
>
> That's the in-order scenario and is fine.
>
> 2. NULL
>
> That's possible if the page->mapping store gets reordered to occur
> after SetPageLRU. That's fine too because we check for it.
>
> 3. anon_vma without the PAGE_MAPPING_ANON bit
>
> That would be a problem and could lead to all kinds of undesirable
> behavior including crashes and data corruption.
>
> Is it possible? AFAICT the compiler is allowed to tear the store to
> page->mapping and I don't see anything that would prevent it.
>
> That said, I also don't see how the reader testing PageLRU under the
> lru_lock would prevent that in the first place. AFAICT we need that
> WRITE_ONCE() around the page->mapping assignment that's referenced in
> the changelog of this patch but I cannot find in any tree or patch.
>

2020-11-05 05:43:16

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock

On Thu, Nov 05, 2020 at 12:52:05PM +0800, Alex Shi wrote:
> @@ -1054,8 +1054,27 @@ static void __page_set_anon_rmap(struct page *page,
> if (!exclusive)
> anon_vma = anon_vma->root;
>
> + /*
> + * w/o the WRITE_ONCE here the following scenario may happens due to
> + * store reordering.
> + *
> + * CPU 0 CPU 1
> + *
> + * do_anonymous_page page_idle_clear_pte_refs
> + * __page_set_anon_rmap
> + * page->mapping = anon_vma + PAGE_MAPPING_ANON
> + * lru_cache_add_inactive_or_unevictable()
> + * SetPageLRU(page)
> + * rmap_walk
> + * if PageAnon(page)
> + *
> + * The 'SetPageLRU' may reordered before page->mapping setting, and
> + * page->mapping may set with anon_vma, w/o anon bit, then rmap_walk
> + * may goes to rmap_walk_file() for a anon page.
> + */
> +
> anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> - page->mapping = (struct address_space *) anon_vma;
> + WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
> page->index = linear_page_index(vma, address);
> }

I don't like these verbose comments with detailed descriptions in
the source code. They're fine in changelogs, but they clutter the
code, and they get outdated really quickly. My preference is for
something more brief:

/*
* Prevent page->mapping from pointing to an anon_vma without
* the PAGE_MAPPING_ANON bit set. This could happen if the
* compiler stores anon_vma and then adds PAGE_MAPPING_ANON to it.
*/

2020-11-05 05:43:20

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock



在 2020/11/5 下午12:57, Matthew Wilcox 写道:
> On Thu, Nov 05, 2020 at 12:52:05PM +0800, Alex Shi wrote:
>> @@ -1054,8 +1054,27 @@ static void __page_set_anon_rmap(struct page *page,
>> if (!exclusive)
>> anon_vma = anon_vma->root;
>>
>> + /*
>> + * w/o the WRITE_ONCE here the following scenario may happens due to
>> + * store reordering.
>> + *
>> + * CPU 0 CPU 1
>> + *
>> + * do_anonymous_page page_idle_clear_pte_refs
>> + * __page_set_anon_rmap
>> + * page->mapping = anon_vma + PAGE_MAPPING_ANON
>> + * lru_cache_add_inactive_or_unevictable()
>> + * SetPageLRU(page)
>> + * rmap_walk
>> + * if PageAnon(page)
>> + *
>> + * The 'SetPageLRU' may reordered before page->mapping setting, and
>> + * page->mapping may set with anon_vma, w/o anon bit, then rmap_walk
>> + * may goes to rmap_walk_file() for a anon page.
>> + */
>> +
>> anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
>> - page->mapping = (struct address_space *) anon_vma;
>> + WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
>> page->index = linear_page_index(vma, address);
>> }
>
> I don't like these verbose comments with detailed descriptions in
> the source code. They're fine in changelogs, but they clutter the
> code, and they get outdated really quickly. My preference is for
> something more brief:
>
> /*
> * Prevent page->mapping from pointing to an anon_vma without
> * the PAGE_MAPPING_ANON bit set. This could happen if the
> * compiler stores anon_vma and then adds PAGE_MAPPING_ANON to it.
> */
>

Yes, that's reasonable. So is the following fine?

From f166f0d5df350c5eae1218456b9e6e1bd43434e7 Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Thu, 5 Nov 2020 11:38:24 +0800
Subject: [PATCH] mm/rmap: stop store reordering issue on page->mapping

Hugh Dickins and Minchan Kim observed a long-standing issue, discussed at
https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop/
but the fix mentioned there was actually missed.
The store reordering may cause a problem in the following scenario:

CPU 0                                    CPU 1
do_anonymous_page
  page_add_new_anon_rmap()
    page->mapping = anon_vma + PAGE_MAPPING_ANON
  lru_cache_add_inactive_or_unevictable()
    spin_lock(lruvec->lock)
    SetPageLRU()
    spin_unlock(lruvec->lock)
                                         /* idle tracking judged it as an
                                          * LRU page, so it passes the page
                                          * to page_idle_clear_pte_refs
                                          */
                                         page_idle_clear_pte_refs
                                           rmap_walk
                                             if PageAnon(page)

Johannes gave a detailed example of how the store reordering could cause
trouble: the concern is that the SetPageLRU may be reordered before the
'page->mapping' store, so CPU 1 may observe page->mapping only after
observing PageLRU set on the page.

1. anon_vma + PAGE_MAPPING_ANON

That's the in-order scenario and is fine.

2. NULL

That's possible if the page->mapping store gets reordered to occur
after SetPageLRU. That's fine too because we check for it.

3. anon_vma without the PAGE_MAPPING_ANON bit

That would be a problem and could lead to all kinds of undesirable
behavior including crashes and data corruption.

Is it possible? AFAICT the compiler is allowed to tear the store to
page->mapping and I don't see anything that would prevent it.

That said, I also don't see how the reader testing PageLRU under the
lru_lock would prevent that in the first place. AFAICT we need that
WRITE_ONCE() around the page->mapping assignment.

Signed-off-by: Alex Shi <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/rmap.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index c050dab2ae65..73788505aa0a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1054,8 +1054,13 @@ static void __page_set_anon_rmap(struct page *page,
if (!exclusive)
anon_vma = anon_vma->root;

+ /*
+ * Prevent page->mapping from pointing to an anon_vma without
+ * the PAGE_MAPPING_ANON bit set. This could happen if the
+ * compiler stores anon_vma and then adds PAGE_MAPPING_ANON to it.
+ */
anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
- page->mapping = (struct address_space *) anon_vma;
+ WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
page->index = linear_page_index(vma, address);
}

--
1.8.3.1

2020-11-05 05:43:32

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v20 00/20] per memcg lru lock



在 2020/11/5 上午12:59, Johannes Weiner 写道:
> On Wed, Nov 04, 2020 at 07:55:39PM +0800, Alex Shi wrote:
>> @@ -1380,6 +1383,14 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
>> return lruvec;
>> }
>>
>> +/**
>> + * lock_page_lruvec - return lruvec for the locked page.
>
> I would say "lock and return the lruvec for a given page"
>
>> + * @page: the page
>> + *
>> + * This series functions should be used in either conditions:
>> + * PageLRU is cleared or unset
>> + * or, page->_refcount is zero
>
> or page is locked
>
> The other changes look good to me, thanks!
>

Thanks a lot for both comments!
I will pick them up and send them out in v21.

Thanks!
Alex

2020-11-05 15:43:24

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock

On Thu, Nov 05, 2020 at 01:03:18PM +0800, Alex Shi wrote:
>
>
> 在 2020/11/5 下午12:57, Matthew Wilcox 写道:
> > On Thu, Nov 05, 2020 at 12:52:05PM +0800, Alex Shi wrote:
> >> @@ -1054,8 +1054,27 @@ static void __page_set_anon_rmap(struct page *page,
> >> if (!exclusive)
> >> anon_vma = anon_vma->root;
> >>
> >> + /*
> >> + * w/o the WRITE_ONCE here the following scenario may happens due to
> >> + * store reordering.
> >> + *
> >> + * CPU 0 CPU 1
> >> + *
> >> + * do_anonymous_page page_idle_clear_pte_refs
> >> + * __page_set_anon_rmap
> >> + * page->mapping = anon_vma + PAGE_MAPPING_ANON
> >> + * lru_cache_add_inactive_or_unevictable()
> >> + * SetPageLRU(page)
> >> + * rmap_walk
> >> + * if PageAnon(page)
> >> + *
> >> + * The 'SetPageLRU' may reordered before page->mapping setting, and
> >> + * page->mapping may set with anon_vma, w/o anon bit, then rmap_walk
> >> + * may goes to rmap_walk_file() for a anon page.
> >> + */
> >> +
> >> anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> >> - page->mapping = (struct address_space *) anon_vma;
> >> + WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
> >> page->index = linear_page_index(vma, address);
> >> }
> >
> > I don't like these verbose comments with detailed descriptions in
> > the source code. They're fine in changelogs, but they clutter the
> > code, and they get outdated really quickly. My preference is for
> > something more brief:
> >
> > /*
> > * Prevent page->mapping from pointing to an anon_vma without
> > * the PAGE_MAPPING_ANON bit set. This could happen if the
> > * compiler stores anon_vma and then adds PAGE_MAPPING_ANON to it.
> > */
> >

Yeah, I don't think this scenario warrants the full race diagram in
the code itself.

But the code is highly specific - synchronizing one struct page member
for one particular use case. Let's keep at least a reference to what
we are synchronizing against. There is a non-zero chance that if the
comment goes out of date, so does the code. How about this?

/*
* page_idle does a lockless/optimistic rmap scan on page->mapping.
* Make sure the compiler doesn't split the stores of anon_vma and
* the PAGE_MAPPING_ANON type identifier, otherwise the rmap code
* could mistake the mapping for a struct address_space and crash.
*/

2020-11-05 15:46:16

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock

On Thu, Nov 05, 2020 at 10:36:49AM -0500, Johannes Weiner wrote:
> But the code is highly specific - synchronizing one struct page member
> for one particular use case. Let's keep at least a reference to what
> we are synchronizing against. There is a non-zero chance that if the
> comment goes out of date, so does the code. How about this?
>
> /*
> * page_idle does a lockless/optimistic rmap scan on page->mapping.
> * Make sure the compiler doesn't split the stores of anon_vma and
> * the PAGE_MAPPING_ANON type identifier, otherwise the rmap code
> * could mistake the mapping for a struct address_space and crash.
> */

Fine by me! There may be other cases where seeing a split store would
be bad, so I didn't want to call out page_idle explicitly. But if you
want to, I'm happy with this comment.

2020-11-06 01:15:20

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock



在 2020/11/5 下午11:36, Johannes Weiner 写道:
>>> */
>>>
> Yeah, I don't think this scenario warrants the full race diagram in
> the code itself.
>
> But the code is highly specific - synchronizing one struct page member
> for one particular use case. Let's keep at least a reference to what
> we are synchronizing against. There is a non-zero chance that if the
> comment goes out of date, so does the code. How about this?
>
> /*
> * page_idle does a lockless/optimistic rmap scan on page->mapping.
> * Make sure the compiler doesn't split the stores of anon_vma and
> * the PAGE_MAPPING_ANON type identifier, otherwise the rmap code
> * could mistake the mapping for a struct address_space and crash.
> */

Thanks a lot to you all. I will fold this into the v21 patch.

Alex

2020-11-11 07:30:05

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock

On Mon, 2 Nov 2020, Johannes Weiner wrote:
> On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote:
> > From: Hugh Dickins <[email protected]>
> >
> > It is necessary for page_idle_get_page() to recheck PageLRU() after
> > get_page_unless_zero(), but holding lru_lock around that serves no
> > useful purpose, and adds to lru_lock contention: delete it.
> >
> > See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
> > discussion that led to lru_lock there; but __page_set_anon_rmap() now
> > uses WRITE_ONCE(),
>
> That doesn't seem to be the case in Linus's or Andrew's tree. Am I
> missing a dependent patch series?

Sorry, I was out of action, then slower than ever, for a while.

Many thanks for calling out my falsehood there, Johannes.

What led me to write that? It has baffled me, but at last I see:
this patch to page_idle_get_page() was 0002 in my lru_lock patchset
against v5.3 last year, and 0001 was the patch which made it true.
Then when I checked against mainline, I must have got confused by
the similar WRITE_ONCE in page_move_anon_rmap().

Appended below, but not rediffed, and let's not hold up Alex's set
for the rest of it: it is all theoretical until the kernel gets to
be built with a suitably malicious compiler; but I'll follow up
with a fresh version of the below after his set is safely in.

From a1abcbc2aac70c6ba47b8991992bb85b86b4a160 Mon Sep 17 00:00:00 2001
From: Hugh Dickins <[email protected]>
Date: Thu, 22 Aug 2019 15:49:44 -0700
Subject: [PATCH 1/9] mm: more WRITE_ONCE and READ_ONCE on page->mapping

v4.2 commit 414e2fb8ce5a ("rmap: fix theoretical race between do_wp_page
and shrink_active_list") added a WRITE_ONCE() where page_move_anon_rmap()
composes page->mapping from anon_vma pointer and PAGE_MAPPING_ANON.

Now do the same where __page_set_anon_rmap() does the same, and where
compaction.c applies PAGE_MAPPING_MOVABLE, and ksm.c PAGE_MAPPING_KSM.

rmap.c already uses READ_ONCE(page->mapping), but util.c should too:
add READ_ONCE() in page_rmapping(), page_anon_vma() and page_mapping().
Delete the then unused helper __page_rmapping().

I doubt that this commit fixes anything, but it's harmless and
unintrusive, and makes reasoning about page mapping flags easier.

What if a compiler implements "page->mapping = mapping" in other places
by, say, first assigning the odd bits of mapping, then adding in the
even bits? Then we shall not build the kernel with such a compiler.

Signed-off-by: Hugh Dickins <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Alex Shi <[email protected]>
---
mm/compaction.c | 7 ++++---
mm/ksm.c | 2 +-
mm/rmap.c | 7 ++++++-
mm/util.c | 24 ++++++++++--------------
4 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 952dc2fb24e5..c405f4362624 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -113,7 +113,8 @@ void __SetPageMovable(struct page *page, struct address_space *mapping)
{
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE((unsigned long)mapping & PAGE_MAPPING_MOVABLE, page);
- page->mapping = (void *)((unsigned long)mapping | PAGE_MAPPING_MOVABLE);
+ WRITE_ONCE(page->mapping,
+ (unsigned long)mapping | PAGE_MAPPING_MOVABLE);
}
EXPORT_SYMBOL(__SetPageMovable);

@@ -126,8 +127,8 @@ void __ClearPageMovable(struct page *page)
* flag so that VM can catch up released page by driver after isolation.
* With it, VM migration doesn't try to put it back.
*/
- page->mapping = (void *)((unsigned long)page->mapping &
- PAGE_MAPPING_MOVABLE);
+ WRITE_ONCE(page->mapping,
+ (unsigned long)page->mapping & PAGE_MAPPING_MOVABLE);
}
EXPORT_SYMBOL(__ClearPageMovable);

diff --git a/mm/ksm.c b/mm/ksm.c
index 3dc4346411e4..426b6a40ea41 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -865,7 +865,7 @@ static inline struct stable_node *page_stable_node(struct page *page)
static inline void set_page_stable_node(struct page *page,
struct stable_node *stable_node)
{
- page->mapping = (void *)((unsigned long)stable_node | PAGE_MAPPING_KSM);
+ WRITE_ONCE(page->mapping, (unsigned long)stable_node | PAGE_MAPPING_KSM);
}

#ifdef CONFIG_SYSFS
diff --git a/mm/rmap.c b/mm/rmap.c
index 003377e24232..9480df437edc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1044,7 +1044,12 @@ static void __page_set_anon_rmap(struct page *page,
anon_vma = anon_vma->root;

anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
- page->mapping = (struct address_space *) anon_vma;
+ /*
+ * Ensure that anon_vma and the PAGE_MAPPING_ANON bit are written
+ * simultaneously, so a concurrent reader (eg page_referenced()'s
+ * PageAnon()) will not see one without the other.
+ */
+ WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
page->index = linear_page_index(vma, address);
}

diff --git a/mm/util.c b/mm/util.c
index e6351a80f248..09b9fcbedac3 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -489,21 +489,14 @@ void kvfree(const void *addr)
}
EXPORT_SYMBOL(kvfree);

-static inline void *__page_rmapping(struct page *page)
-{
- unsigned long mapping;
-
- mapping = (unsigned long)page->mapping;
- mapping &= ~PAGE_MAPPING_FLAGS;
-
- return (void *)mapping;
-}
-
/* Neutral page->mapping pointer to address_space or anon_vma or other */
void *page_rmapping(struct page *page)
{
+ unsigned long mapping;
+
page = compound_head(page);
- return __page_rmapping(page);
+ mapping = (unsigned long)READ_ONCE(page->mapping);
+ return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
}

/*
@@ -534,10 +527,11 @@ struct anon_vma *page_anon_vma(struct page *page)
unsigned long mapping;

page = compound_head(page);
- mapping = (unsigned long)page->mapping;
+ mapping = (unsigned long)READ_ONCE(page->mapping);
+ /* Return NULL if file or PageMovable or PageKsm */
if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
return NULL;
- return __page_rmapping(page);
+ return (struct anon_vma *)(mapping & ~PAGE_MAPPING_FLAGS);
}

struct address_space *page_mapping(struct page *page)
@@ -557,10 +551,12 @@ struct address_space *page_mapping(struct page *page)
return swap_address_space(entry);
}

- mapping = page->mapping;
+ mapping = READ_ONCE(page->mapping);
+ /* Return NULL if PageAnon (including PageKsm) */
if ((unsigned long)mapping & PAGE_MAPPING_ANON)
return NULL;

+ /* Return struct address_space pointer if file or PageMovable */
return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
}
EXPORT_SYMBOL(page_mapping);
--
2.23.0.187.g17f5b7556c-goog