2019-12-25 09:05:38

by Alex Shi

Subject: [PATCH v7 00/10] per lruvec lru_lock for memcg

Hi all,

Merry Christmas! :)

This patchset moves lru_lock into the lruvec, giving each lruvec its own
lru_lock and thus one lru_lock per memcg per node.

We introduce the function lock_page_lruvec, which locks the page's memcg
and then that memcg's lruvec->lru_lock (thanks to Johannes Weiner,
Hugh Dickins and Konstantin Khlebnikov for the suggestions/reminders),
replacing the old pgdat->lru_lock.
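
For reference, the irq variant introduced in patch 03 looks essentially
like the sketch below (simplified; see patch 03 for the irqsave variant
and the matching unlock helpers):

    struct lruvec *lock_page_lruvec_irq(struct page *page)
    {
    	struct lruvec *lruvec;
    	struct mem_cgroup *memcg;

    	/* pin page->mem_cgroup so it cannot be moved under us */
    	memcg = lock_page_memcg(page);
    	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
    	spin_lock_irq(&lruvec->lru_lock);

    	return lruvec;
    }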

Following Daniel Jordan's suggestion, I ran 208 'dd' tasks in 104
containers on a 2-socket * 26-core * HT box with a modified case:
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice

With this patchset, readtwice performance increases by about 80% with
containers, and there is no performance drop without containers.

Another way to guard move_account is with lru_lock instead of move_lock.
Consider the memcg move-task path:
   mem_cgroup_move_task:
     mem_cgroup_move_charge:
	lru_add_drain_all();
	atomic_inc(&mc.from->moving_account);	//ask lruvec's move_lock
	synchronize_rcu();
	walk_page_range: do charge_walk_ops(mem_cgroup_move_charge_pte_range):
	   isolate_lru_page();
	   mem_cgroup_move_account(page,)
		spin_lock(&from->move_lock)
		page->mem_cgroup = to;
		spin_unlock(&from->move_lock)
	   putback_lru_page(page)

Guarding 'page->mem_cgroup = to' with to_vec->lru_lock has a similar
effect to move_lock, so from a performance point of view both solutions
are equivalent.
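
As an illustration only (not code from this series), the alternative
described above would look roughly like this, where 'to_vec' is the
target memcg's lruvec for the page's node:

	struct lruvec *to_vec = mem_cgroup_lruvec(to, page_pgdat(page));

	spin_lock(&to_vec->lru_lock);
	page->mem_cgroup = to;		/* the store that move_lock guards today */
	spin_unlock(&to_vec->lru_lock);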

Thanks to Hugh Dickins and Konstantin Khlebnikov, who both proposed the
same idea 7 years ago.

Thanks for all the comments from Hugh Dickins, Konstantin Khlebnikov,
Daniel Jordan, Johannes Weiner, Mel Gorman, Shakeel Butt, Rong Chen,
Fengguang Wu, Yun Wang and others, and for the testing support from
Intel 0-Day!

v7,
a, rebase on v5.5-rc3,
b, move the lock_page_lru() cleanup before the lock replacement.

v6,
a, rebase on v5.5-rc2, and do retesting.
b, picked up Johannes' review comments and a lock_page_lru cleanup.

v5,
a, lock the page's memcg, per Johannes Weiner's suggestion
b, use macros for the non-memcg case, per Johannes' and Matthew's suggestions.

v4:
a, fix the page->mem_cgroup dereferencing issue, thanks Johannes Weiner
b, remove the irqsave flags changes, thanks Matthew Wilcox
c, merge/split patches for better understanding and bisection purpose

v3: rebase on linux-next, and fold the relock fix patch into the introducing patch

v2: bypass a performance regression bug and fix some function issues

v1: initial version; aim testing shows a 5% performance increase on a 16-thread box.


Alex Shi (9):
mm/vmscan: remove unnecessary lruvec adding
mm/memcg: fold lru_lock in lock_page_lru
mm/lru: replace pgdat lru_lock with lruvec lock
mm/lru: introduce the relock_page_lruvec function
mm/mlock: optimize munlock_pagevec by relocking
mm/swap: only change the lru_lock iff page's lruvec is different
mm/pgdat: remove pgdat lru_lock
mm/lru: add debug checking for page memcg moving
mm/memcg: add debug checking in lock_page_memcg

Hugh Dickins (1):
mm/lru: revise the comments of lru_lock

Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +---
Documentation/admin-guide/cgroup-v1/memory.rst | 6 +-
Documentation/trace/events-kmem.rst | 2 +-
Documentation/vm/unevictable-lru.rst | 22 ++---
include/linux/memcontrol.h | 63 ++++++++++++++
include/linux/mm_types.h | 2 +-
include/linux/mmzone.h | 5 +-
mm/compaction.c | 59 ++++++++-----
mm/filemap.c | 4 +-
mm/huge_memory.c | 18 ++--
mm/memcontrol.c | 84 +++++++++++++++----
mm/mlock.c | 28 +++----
mm/mmzone.c | 1 +
mm/page_alloc.c | 1 -
mm/page_idle.c | 7 +-
mm/rmap.c | 2 +-
mm/swap.c | 75 +++++++----------
mm/vmscan.c | 98 ++++++++++++----------
18 files changed, 297 insertions(+), 195 deletions(-)

--
1.8.3.1


2019-12-25 09:05:42

by Alex Shi

Subject: [PATCH v7 01/10] mm/vmscan: remove unnecessary lruvec adding

We don't have to add a freeable page to the LRU and then remove it from
there again. This change saves a couple of actions and makes the page
moving clearer.

The SetPageLRU needs to be kept here for list integrity; otherwise:

#0 move_pages_to_lru                    #1 release_pages
if (put_page_testzero())
                                        if (put_page_testzero())
                                            !PageLRU //skip lru_lock
                                            list_add(&page->lru,);
else
    list_add(&page->lru,) //corrupt

Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/vmscan.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 572fb17c6273..8719361b47a0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1852,26 +1852,18 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
while (!list_empty(list)) {
page = lru_to_page(list);
VM_BUG_ON_PAGE(PageLRU(page), page);
+ list_del(&page->lru);
if (unlikely(!page_evictable(page))) {
- list_del(&page->lru);
spin_unlock_irq(&pgdat->lru_lock);
putback_lru_page(page);
spin_lock_irq(&pgdat->lru_lock);
continue;
}
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
-
SetPageLRU(page);
- lru = page_lru(page);
-
- nr_pages = hpage_nr_pages(page);
- update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
- list_move(&page->lru, &lruvec->lists[lru]);

if (put_page_testzero(page)) {
__ClearPageLRU(page);
__ClearPageActive(page);
- del_page_from_lru_list(page, lruvec, lru);

if (unlikely(PageCompound(page))) {
spin_unlock_irq(&pgdat->lru_lock);
@@ -1880,6 +1872,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
} else
list_add(&page->lru, &pages_to_free);
} else {
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ lru = page_lru(page);
+ nr_pages = hpage_nr_pages(page);
+
+ update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
+ list_add(&page->lru, &lruvec->lists[lru]);
nr_moved += nr_pages;
}
}
--
1.8.3.1

2019-12-25 09:05:46

by Alex Shi

Subject: [PATCH v7 06/10] mm/swap: only change the lru_lock iff page's lruvec is different

Since we have introduced relock_page_lruvec, we can use it in more
places to reduce repeated spin_lock/unlock cycles.

Signed-off-by: Alex Shi <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Yafang Shao <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/swap.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 97e108be4f92..84a845968e1d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -196,11 +196,12 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];

- lruvec = lock_page_lruvec_irqsave(page, &flags);
+ lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);

(*move_fn)(page, lruvec, arg);
- unlock_page_lruvec_irqrestore(lruvec, flags);
}
+ if (lruvec)
+ unlock_page_lruvec_irqrestore(lruvec, flags);

release_pages(pvec->pages, pvec->nr);
pagevec_reinit(pvec);
@@ -819,14 +820,11 @@ void release_pages(struct page **pages, int nr)
}

if (PageLRU(page)) {
- struct lruvec *new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+ struct lruvec *pre_lruvec = lruvec;

- if (new_lruvec != lruvec) {
- if (lruvec)
- unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
+ if (pre_lruvec != lruvec)
lock_batch = 0;
- lruvec = lock_page_lruvec_irqsave(page, &flags);
- }

VM_BUG_ON_PAGE(!PageLRU(page), page);
__ClearPageLRU(page);
--
1.8.3.1

2019-12-25 09:05:54

by Alex Shi

Subject: [PATCH v7 07/10] mm/pgdat: remove pgdat lru_lock

Now that pgdat->lru_lock has been replaced by the lruvec lock, it is not
used anymore; remove it.

Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Arun KS <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Alexander Duyck <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/mmzone.h | 1 -
mm/page_alloc.c | 1 -
2 files changed, 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c5455675acf2..7db0cec19aa0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -769,7 +769,6 @@ struct deferred_split {

/* Write-intensive fields used by page reclaim */
ZONE_PADDING(_pad1_)
- spinlock_t lru_lock;

#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4785a8a2040e..352f2a3d67b3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6712,7 +6712,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
init_waitqueue_head(&pgdat->pfmemalloc_wait);

pgdat_page_ext_init(pgdat);
- spin_lock_init(&pgdat->lru_lock);
lruvec_init(&pgdat->__lruvec);
}

--
1.8.3.1

2019-12-25 09:05:55

by Alex Shi

Subject: [PATCH v7 10/10] mm/memcg: add debug checking in lock_page_memcg

This extra irq disable/enable and might_lock checking costs about 5% of
readtwice performance on a 2-socket * 26-core * HT box, so it is only a
temporary debugging aid and needs to be removed eventually.

Signed-off-by: Alex Shi <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/memcontrol.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 33bf086faed0..1512be4d8bde 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2022,6 +2022,11 @@ struct mem_cgroup *lock_page_memcg(struct page *page)
if (unlikely(!memcg))
return NULL;

+ /* temporary lockdep checking, need remove */
+ local_irq_save(flags);
+ might_lock(&memcg->move_lock);
+ local_irq_restore(flags);
+
if (atomic_read(&memcg->moving_account) <= 0)
return memcg;

--
1.8.3.1

2019-12-25 09:06:06

by Alex Shi

Subject: [PATCH v7 03/10] mm/lru: replace pgdat lru_lock with lruvec lock

This patchset moves lru_lock into the lruvec, giving each lruvec its own
lru_lock and thus one lru_lock per memcg per node.

This is the main patch that replaces the per-node lru_lock with the
per-memcg lruvec lock.

It introduces the function lock_page_lruvec, which locks the page's
memcg and then that memcg's lruvec->lru_lock. (Thanks to Johannes
Weiner, Hugh Dickins and Konstantin Khlebnikov for the
suggestions/reminders on this.)

Following Daniel Jordan's suggestion, I ran 208 'dd' tasks in 104
containers on a 2-socket * 26-core * HT box with a modified case:
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice

With this and the later patches, readtwice performance increases by
about 80% with containers.

Signed-off-by: Alex Shi <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Chris Down <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: "Jérôme Glisse" <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: swkhack <[email protected]>
Cc: "Potyra, Stefan" <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Colin Ian King <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Peng Fan <[email protected]>
Cc: Nikolay Borisov <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Yafang Shao <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/memcontrol.h | 27 ++++++++++++++++
include/linux/mmzone.h | 2 ++
mm/compaction.c | 55 ++++++++++++++++++++-----------
mm/huge_memory.c | 18 ++++-------
mm/memcontrol.c | 72 ++++++++++++++++++++++++++++++-----------
mm/mlock.c | 32 +++++++++----------
mm/mmzone.c | 1 +
mm/page_idle.c | 7 ++--
mm/swap.c | 75 +++++++++++++++++--------------------------
mm/vmscan.c | 80 +++++++++++++++++++++++++---------------------
10 files changed, 219 insertions(+), 150 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a7a0a1a5c8d5..8389b9b927ef 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -417,6 +417,10 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
}

struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
+struct lruvec *lock_page_lruvec_irq(struct page *);
+struct lruvec *lock_page_lruvec_irqsave(struct page *, unsigned long*);
+void unlock_page_lruvec_irq(struct lruvec *);
+void unlock_page_lruvec_irqrestore(struct lruvec *, unsigned long);

struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);

@@ -900,6 +904,29 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
{
return &pgdat->__lruvec;
}
+#define lock_page_lruvec_irq(page) \
+({ \
+ struct pglist_data *pgdat = page_pgdat(page); \
+ spin_lock_irq(&pgdat->__lruvec.lru_lock); \
+ &pgdat->__lruvec; \
+})
+
+#define lock_page_lruvec_irqsave(page, flagsp) \
+({ \
+ struct pglist_data *pgdat = page_pgdat(page); \
+ spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp); \
+ &pgdat->__lruvec; \
+})
+
+#define unlock_page_lruvec_irq(lruvec) \
+({ \
+ spin_unlock_irq(&lruvec->lru_lock); \
+})
+
+#define unlock_page_lruvec_irqrestore(lruvec, flags) \
+({ \
+ spin_unlock_irqrestore(&lruvec->lru_lock, flags); \
+})

static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
{
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 89d8ff06c9ce..c5455675acf2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -311,6 +311,8 @@ struct lruvec {
unsigned long refaults;
/* Various lruvec state flags (enum lruvec_flags) */
unsigned long flags;
+ /* per lruvec lru_lock for memcg */
+ spinlock_t lru_lock;
#ifdef CONFIG_MEMCG
struct pglist_data *pgdat;
#endif
diff --git a/mm/compaction.c b/mm/compaction.c
index 672d3c78c6ab..8c0a2da217d8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -786,7 +786,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
unsigned long nr_scanned = 0, nr_isolated = 0;
struct lruvec *lruvec;
unsigned long flags = 0;
- bool locked = false;
+ struct lruvec *locked_lruvec = NULL;
struct page *page = NULL, *valid_page = NULL;
unsigned long start_pfn = low_pfn;
bool skip_on_failure = false;
@@ -846,11 +846,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
* contention, to give chance to IRQs. Abort completely if
* a fatal signal is pending.
*/
- if (!(low_pfn % SWAP_CLUSTER_MAX)
- && compact_unlock_should_abort(&pgdat->lru_lock,
- flags, &locked, cc)) {
- low_pfn = 0;
- goto fatal_pending;
+ if (!(low_pfn % SWAP_CLUSTER_MAX)) {
+ if (locked_lruvec) {
+ unlock_page_lruvec_irqrestore(locked_lruvec, flags);
+ locked_lruvec = NULL;
+ }
+
+ if (fatal_signal_pending(current)) {
+ cc->contended = true;
+
+ low_pfn = 0;
+ goto fatal_pending;
+ }
+
+ cond_resched();
}

if (!pfn_valid_within(low_pfn))
@@ -919,10 +928,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
*/
if (unlikely(__PageMovable(page)) &&
!PageIsolated(page)) {
- if (locked) {
- spin_unlock_irqrestore(&pgdat->lru_lock,
- flags);
- locked = false;
+ if (locked_lruvec) {
+ unlock_page_lruvec_irqrestore(locked_lruvec, flags);
+ locked_lruvec = NULL;
}

if (!isolate_movable_page(page, isolate_mode))
@@ -948,10 +956,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
goto isolate_fail;

+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
/* If we already hold the lock, we can skip some rechecking */
- if (!locked) {
- locked = compact_lock_irqsave(&pgdat->lru_lock,
- &flags, cc);
+ if (lruvec != locked_lruvec) {
+ struct mem_cgroup *memcg = lock_page_memcg(page);
+
+ if (locked_lruvec) {
+ unlock_page_lruvec_irqrestore(locked_lruvec, flags);
+ locked_lruvec = NULL;
+ }
+ /* reget lruvec with a locked memcg */
+ lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
+ compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
+ locked_lruvec = lruvec;

/* Try get exclusive access under lock */
if (!skip_updated) {
@@ -975,7 +993,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
}
}

- lruvec = mem_cgroup_page_lruvec(page, pgdat);

/* Try isolate the page */
if (__isolate_lru_page(page, isolate_mode) != 0)
@@ -1016,9 +1033,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
* page anyway.
*/
if (nr_isolated) {
- if (locked) {
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
- locked = false;
+ if (locked_lruvec) {
+ unlock_page_lruvec_irqrestore(locked_lruvec, flags);
+ locked_lruvec = NULL;
}
putback_movable_pages(&cc->migratepages);
cc->nr_migratepages = 0;
@@ -1043,8 +1060,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
low_pfn = end_pfn;

isolate_abort:
- if (locked)
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ if (locked_lruvec)
+ unlock_page_lruvec_irqrestore(locked_lruvec, flags);

/*
* Updated the cached scanner pfn once the pageblock has been scanned
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 41a0fbddc96b..160c845290cf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2495,17 +2495,13 @@ static void __split_huge_page_tail(struct page *head, int tail,
}

static void __split_huge_page(struct page *page, struct list_head *list,
- pgoff_t end, unsigned long flags)
+ struct lruvec *lruvec, pgoff_t end, unsigned long flags)
{
struct page *head = compound_head(page);
- pg_data_t *pgdat = page_pgdat(head);
- struct lruvec *lruvec;
struct address_space *swap_cache = NULL;
unsigned long offset = 0;
int i;

- lruvec = mem_cgroup_page_lruvec(head, pgdat);
-
/* complete memcg works before add pages to LRU */
mem_cgroup_split_huge_fixup(head);

@@ -2554,7 +2550,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
xa_unlock(&head->mapping->i_pages);
}

- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ unlock_page_lruvec_irqrestore(lruvec, flags);

remap_page(head);

@@ -2693,13 +2689,13 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
int split_huge_page_to_list(struct page *page, struct list_head *list)
{
struct page *head = compound_head(page);
- struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
struct deferred_split *ds_queue = get_deferred_split_queue(page);
struct anon_vma *anon_vma = NULL;
struct address_space *mapping = NULL;
+ struct lruvec *lruvec;
int count, mapcount, extra_pins, ret;
bool mlocked;
- unsigned long flags;
+ unsigned long uninitialized_var(flags);
pgoff_t end;

VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
@@ -2766,7 +2762,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
lru_add_drain();

/* prevent PageLRU to go away from under us, and freeze lru stats */
- spin_lock_irqsave(&pgdata->lru_lock, flags);
+ lruvec = lock_page_lruvec_irqsave(head, &flags);

if (mapping) {
XA_STATE(xas, &mapping->i_pages, page_index(head));
@@ -2797,7 +2793,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
}

spin_unlock(&ds_queue->split_queue_lock);
- __split_huge_page(page, list, end, flags);
+ __split_huge_page(page, list, lruvec, end, flags);
if (PageSwapCache(head)) {
swp_entry_t entry = { .val = page_private(head) };

@@ -2816,7 +2812,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
spin_unlock(&ds_queue->split_queue_lock);
fail: if (mapping)
xa_unlock(&mapping->i_pages);
- spin_unlock_irqrestore(&pgdata->lru_lock, flags);
+ unlock_page_lruvec_irqrestore(lruvec, flags);
remap_page(head);
ret = -EBUSY;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0ad10caabc3d..1b69e27d1b9f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1217,7 +1217,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
goto out;
}

- memcg = page->mem_cgroup;
+ memcg = READ_ONCE(page->mem_cgroup);
/*
* Swapcache readahead pages are added to the LRU - and
* possibly migrated - before they are charged.
@@ -1238,6 +1238,42 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
return lruvec;
}

+struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+ struct lruvec *lruvec;
+ struct mem_cgroup *memcg;
+
+ memcg = lock_page_memcg(page);
+ lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
+ spin_lock_irq(&lruvec->lru_lock);
+
+ return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
+{
+ struct lruvec *lruvec;
+ struct mem_cgroup *memcg;
+
+ memcg = lock_page_memcg(page);
+ lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
+ spin_lock_irqsave(&lruvec->lru_lock, *flags);
+
+ return lruvec;
+}
+
+void unlock_page_lruvec_irq(struct lruvec *lruvec)
+{
+ spin_unlock_irq(&lruvec->lru_lock);
+ __unlock_page_memcg(lruvec_memcg(lruvec));
+}
+
+void unlock_page_lruvec_irqrestore(struct lruvec *lruvec, unsigned long flags)
+{
+ spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+ __unlock_page_memcg(lruvec_memcg(lruvec));
+}
+
/**
* mem_cgroup_update_lru_size - account for adding or removing an lru page
* @lruvec: mem_cgroup per zone lru vector
@@ -2570,40 +2606,40 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
css_put_many(&memcg->css, nr_pages);
}

-static void lock_page_lru(struct page *page, int *isolated)
+static struct lruvec *lock_page_lru(struct page *page)
{
+ struct lruvec *lruvec = NULL;
+
if (PageLRU(page)) {
- pg_data_t *pgdat = page_pgdat(page);
- struct lruvec *lruvec;
+ lruvec = lock_page_lruvec_irq(page);

- spin_lock_irq(&pgdat->lru_lock);
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_lru(page));
- *isolated = 1;
- } else
- *isolated = 0;
+ }
+
+ return lruvec;
}

-static void unlock_page_lru(struct page *page, int isolated)
+static void unlock_page_lru(struct page *page, struct lruvec *locked_lruvec)
{

- if (isolated) {
- pg_data_t *pgdat = page_pgdat(page);
+ if (locked_lruvec) {
struct lruvec *lruvec;

- lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ unlock_page_lruvec_irq(locked_lruvec);
+ lruvec = lock_page_lruvec_irq(page);
+
VM_BUG_ON_PAGE(PageLRU(page), page);
SetPageLRU(page);
add_page_to_lru_list(page, lruvec, page_lru(page));
- spin_unlock_irq(&pgdat->lru_lock);
+ unlock_page_lruvec_irq(lruvec);
}
}

static void commit_charge(struct page *page, struct mem_cgroup *memcg,
bool lrucare)
{
- int isolated;
+ struct lruvec *lruvec;

VM_BUG_ON_PAGE(page->mem_cgroup, page);

@@ -2612,7 +2648,7 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
* may already be on some other mem_cgroup's LRU. Take care of it.
*/
if (lrucare)
- lock_page_lru(page, &isolated);
+ lruvec = lock_page_lru(page);

/*
* Nobody should be changing or seriously looking at
@@ -2631,7 +2667,7 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
page->mem_cgroup = memcg;

if (lrucare)
- unlock_page_lru(page, isolated);
+ unlock_page_lru(page, lruvec);
}

#ifdef CONFIG_MEMCG_KMEM
@@ -2927,7 +2963,7 @@ void __memcg_kmem_uncharge(struct page *page, int order)

/*
* Because tail pages are not marked as "used", set it. We're under
- * pgdat->lru_lock and migration entries setup in all page mappings.
+ * lruvec->lru_lock and migration entries setup in all page mappings.
*/
void mem_cgroup_split_huge_fixup(struct page *head)
{
diff --git a/mm/mlock.c b/mm/mlock.c
index a72c1eeded77..10d15f58b061 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -106,12 +106,10 @@ void mlock_vma_page(struct page *page)
* Isolate a page from LRU with optional get_page() pin.
* Assumes lru_lock already held and page already pinned.
*/
-static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
+static bool __munlock_isolate_lru_page(struct page *page,
+ struct lruvec *lruvec, bool getpage)
{
if (PageLRU(page)) {
- struct lruvec *lruvec;
-
- lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
if (getpage)
get_page(page);
ClearPageLRU(page);
@@ -182,7 +180,7 @@ static void __munlock_isolation_failed(struct page *page)
unsigned int munlock_vma_page(struct page *page)
{
int nr_pages;
- pg_data_t *pgdat = page_pgdat(page);
+ struct lruvec *lruvec;

/* For try_to_munlock() and to serialize with page migration */
BUG_ON(!PageLocked(page));
@@ -194,7 +192,7 @@ unsigned int munlock_vma_page(struct page *page)
* might otherwise copy PageMlocked to part of the tail pages before
* we clear it in the head page. It also stabilizes hpage_nr_pages().
*/
- spin_lock_irq(&pgdat->lru_lock);
+ lruvec = lock_page_lruvec_irq(page);

if (!TestClearPageMlocked(page)) {
/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
@@ -205,15 +203,15 @@ unsigned int munlock_vma_page(struct page *page)
nr_pages = hpage_nr_pages(page);
__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);

- if (__munlock_isolate_lru_page(page, true)) {
- spin_unlock_irq(&pgdat->lru_lock);
+ if (__munlock_isolate_lru_page(page, lruvec, true)) {
+ unlock_page_lruvec_irq(lruvec);
__munlock_isolated_page(page);
goto out;
}
__munlock_isolation_failed(page);

unlock_out:
- spin_unlock_irq(&pgdat->lru_lock);
+ unlock_page_lruvec_irq(lruvec);

out:
return nr_pages - 1;
@@ -291,28 +289,29 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
{
int i;
int nr = pagevec_count(pvec);
- int delta_munlocked = -nr;
struct pagevec pvec_putback;
+ struct lruvec *lruvec = NULL;
int pgrescued = 0;

pagevec_init(&pvec_putback);

/* Phase 1: page isolation */
- spin_lock_irq(&zone->zone_pgdat->lru_lock);
for (i = 0; i < nr; i++) {
struct page *page = pvec->pages[i];

+ lruvec = lock_page_lruvec_irq(page);
+
if (TestClearPageMlocked(page)) {
/*
* We already have pin from follow_page_mask()
* so we can spare the get_page() here.
*/
- if (__munlock_isolate_lru_page(page, false))
+ if (__munlock_isolate_lru_page(page, lruvec, false)) {
+ __mod_zone_page_state(zone, NR_MLOCK, -1);
+ unlock_page_lruvec_irq(lruvec);
continue;
- else
+ } else
__munlock_isolation_failed(page);
- } else {
- delta_munlocked++;
}

/*
@@ -323,9 +322,8 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
*/
pagevec_add(&pvec_putback, pvec->pages[i]);
pvec->pages[i] = NULL;
+ unlock_page_lruvec_irq(lruvec);
}
- __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
- spin_unlock_irq(&zone->zone_pgdat->lru_lock);

/* Now we can release pins of pages that we are not munlocking */
pagevec_release(&pvec_putback);
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 4686fdc23bb9..3750a90ed4a0 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec)
enum lru_list lru;

memset(lruvec, 0, sizeof(struct lruvec));
+ spin_lock_init(&lruvec->lru_lock);

for_each_lru(lru)
INIT_LIST_HEAD(&lruvec->lists[lru]);
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 295512465065..d2d868ca2bf7 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -31,7 +31,7 @@
static struct page *page_idle_get_page(unsigned long pfn)
{
struct page *page;
- pg_data_t *pgdat;
+ struct lruvec *lruvec;

if (!pfn_valid(pfn))
return NULL;
@@ -41,13 +41,12 @@ static struct page *page_idle_get_page(unsigned long pfn)
!get_page_unless_zero(page))
return NULL;

- pgdat = page_pgdat(page);
- spin_lock_irq(&pgdat->lru_lock);
+ lruvec = lock_page_lruvec_irq(page);
if (unlikely(!PageLRU(page))) {
put_page(page);
page = NULL;
}
- spin_unlock_irq(&pgdat->lru_lock);
+ unlock_page_lruvec_irq(lruvec);
return page;
}

diff --git a/mm/swap.c b/mm/swap.c
index 5341ae93861f..97e108be4f92 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -60,16 +60,14 @@
static void __page_cache_release(struct page *page)
{
if (PageLRU(page)) {
- pg_data_t *pgdat = page_pgdat(page);
struct lruvec *lruvec;
- unsigned long flags;
+ unsigned long flags = 0;

- spin_lock_irqsave(&pgdat->lru_lock, flags);
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ lruvec = lock_page_lruvec_irqsave(page, &flags);
VM_BUG_ON_PAGE(!PageLRU(page), page);
__ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_off_lru(page));
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ unlock_page_lruvec_irqrestore(lruvec, flags);
}
__ClearPageWaiters(page);
}
@@ -192,26 +190,18 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
void *arg)
{
int i;
- struct pglist_data *pgdat = NULL;
- struct lruvec *lruvec;
+ struct lruvec *lruvec = NULL;
unsigned long flags = 0;

for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];
- struct pglist_data *pagepgdat = page_pgdat(page);

- if (pagepgdat != pgdat) {
- if (pgdat)
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
- pgdat = pagepgdat;
- spin_lock_irqsave(&pgdat->lru_lock, flags);
- }
+ lruvec = lock_page_lruvec_irqsave(page, &flags);

- lruvec = mem_cgroup_page_lruvec(page, pgdat);
(*move_fn)(page, lruvec, arg);
+ unlock_page_lruvec_irqrestore(lruvec, flags);
}
- if (pgdat)
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+
release_pages(pvec->pages, pvec->nr);
pagevec_reinit(pvec);
}
@@ -324,12 +314,12 @@ static inline void activate_page_drain(int cpu)

void activate_page(struct page *page)
{
- pg_data_t *pgdat = page_pgdat(page);
+ struct lruvec *lruvec;

page = compound_head(page);
- spin_lock_irq(&pgdat->lru_lock);
- __activate_page(page, mem_cgroup_page_lruvec(page, pgdat), NULL);
- spin_unlock_irq(&pgdat->lru_lock);
+ lruvec = lock_page_lruvec_irq(page);
+ __activate_page(page, lruvec, NULL);
+ unlock_page_lruvec_irq(lruvec);
}
#endif

@@ -780,8 +770,7 @@ void release_pages(struct page **pages, int nr)
{
int i;
LIST_HEAD(pages_to_free);
- struct pglist_data *locked_pgdat = NULL;
- struct lruvec *lruvec;
+ struct lruvec *lruvec = NULL;
unsigned long uninitialized_var(flags);
unsigned int uninitialized_var(lock_batch);

@@ -791,21 +780,20 @@ void release_pages(struct page **pages, int nr)
/*
* Make sure the IRQ-safe lock-holding time does not get
* excessive with a continuous string of pages from the
- * same pgdat. The lock is held only if pgdat != NULL.
+ * same lruvec. The lock is held only if lruvec != NULL.
*/
- if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
- spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
- locked_pgdat = NULL;
+ if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
+ unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec = NULL;
}

if (is_huge_zero_page(page))
continue;

if (is_zone_device_page(page)) {
- if (locked_pgdat) {
- spin_unlock_irqrestore(&locked_pgdat->lru_lock,
- flags);
- locked_pgdat = NULL;
+ if (lruvec) {
+ unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec = NULL;
}
/*
* ZONE_DEVICE pages that return 'false' from
@@ -822,27 +810,24 @@ void release_pages(struct page **pages, int nr)
continue;

if (PageCompound(page)) {
- if (locked_pgdat) {
- spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
- locked_pgdat = NULL;
+ if (lruvec) {
+ unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec = NULL;
}
__put_compound_page(page);
continue;
}

if (PageLRU(page)) {
- struct pglist_data *pgdat = page_pgdat(page);
+ struct lruvec *new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));

- if (pgdat != locked_pgdat) {
- if (locked_pgdat)
- spin_unlock_irqrestore(&locked_pgdat->lru_lock,
- flags);
+ if (new_lruvec != lruvec) {
+ if (lruvec)
+ unlock_page_lruvec_irqrestore(lruvec, flags);
lock_batch = 0;
- locked_pgdat = pgdat;
- spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
+ lruvec = lock_page_lruvec_irqsave(page, &flags);
}

- lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
VM_BUG_ON_PAGE(!PageLRU(page), page);
__ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -854,8 +839,8 @@ void release_pages(struct page **pages, int nr)

list_add(&page->lru, &pages_to_free);
}
- if (locked_pgdat)
- spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+ if (lruvec)
+ unlock_page_lruvec_irqrestore(lruvec, flags);

mem_cgroup_uncharge_list(&pages_to_free);
free_unref_page_list(&pages_to_free);
@@ -893,7 +878,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
VM_BUG_ON_PAGE(!PageHead(page), page);
VM_BUG_ON_PAGE(PageCompound(page_tail), page);
VM_BUG_ON_PAGE(PageLRU(page_tail), page);
- lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+ lockdep_assert_held(&lruvec->lru_lock);

if (!list)
SetPageLRU(page_tail);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8719361b47a0..a97e35675286 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1766,11 +1766,9 @@ int isolate_lru_page(struct page *page)
WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");

if (PageLRU(page)) {
- pg_data_t *pgdat = page_pgdat(page);
struct lruvec *lruvec;

- spin_lock_irq(&pgdat->lru_lock);
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ lruvec = lock_page_lruvec_irq(page);
if (PageLRU(page)) {
int lru = page_lru(page);
get_page(page);
@@ -1778,7 +1776,7 @@ int isolate_lru_page(struct page *page)
del_page_from_lru_list(page, lruvec, lru);
ret = 0;
}
- spin_unlock_irq(&pgdat->lru_lock);
+ unlock_page_lruvec_irq(lruvec);
}
return ret;
}
@@ -1843,7 +1841,6 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
struct list_head *list)
{
- struct pglist_data *pgdat = lruvec_pgdat(lruvec);
int nr_pages, nr_moved = 0;
LIST_HEAD(pages_to_free);
struct page *page;
@@ -1854,9 +1851,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
VM_BUG_ON_PAGE(PageLRU(page), page);
list_del(&page->lru);
if (unlikely(!page_evictable(page))) {
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
putback_lru_page(page);
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
continue;
}
SetPageLRU(page);
@@ -1866,19 +1863,35 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
__ClearPageActive(page);

if (unlikely(PageCompound(page))) {
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
(*get_compound_page_dtor(page))(page);
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
} else
list_add(&page->lru, &pages_to_free);
} else {
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ struct mem_cgroup *memcg = lock_page_memcg(page);
+ struct lruvec *plv;
+ bool relocked = false;
+
+ plv = mem_cgroup_lruvec(memcg, page_pgdat(page));
+ /* page's lruvec changed in memcg moving */
+ if (plv != lruvec) {
+ spin_unlock_irq(&lruvec->lru_lock);
+ spin_lock_irq(&plv->lru_lock);
+ relocked = true;
+ }
+
lru = page_lru(page);
nr_pages = hpage_nr_pages(page);
-
- update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
- list_add(&page->lru, &lruvec->lists[lru]);
+ update_lru_size(plv, lru, page_zonenum(page), nr_pages);
+ list_add(&page->lru, &plv->lists[lru]);
nr_moved += nr_pages;
+
+ if (relocked) {
+ spin_unlock_irq(&plv->lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
+ }
+ __unlock_page_memcg(memcg);
}
}

@@ -1937,7 +1950,7 @@ static int current_may_throttle(void)

lru_add_drain();

- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);

nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
&nr_scanned, sc, lru);
@@ -1949,15 +1962,14 @@ static int current_may_throttle(void)
if (!cgroup_reclaim(sc))
__count_vm_events(item, nr_scanned);
__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
- spin_unlock_irq(&pgdat->lru_lock);
-
+ spin_unlock_irq(&lruvec->lru_lock);
if (nr_taken == 0)
return 0;

nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
&stat, false);

- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);

item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
if (!cgroup_reclaim(sc))
@@ -1970,7 +1982,7 @@ static int current_may_throttle(void)

__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);

- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);

mem_cgroup_uncharge_list(&page_list);
free_unref_page_list(&page_list);
@@ -2023,7 +2035,7 @@ static void shrink_active_list(unsigned long nr_to_scan,

lru_add_drain();

- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);

nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
&nr_scanned, sc, lru);
@@ -2034,7 +2046,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
__count_vm_events(PGREFILL, nr_scanned);
__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);

- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);

while (!list_empty(&l_hold)) {
cond_resched();
@@ -2080,7 +2092,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
/*
* Move pages back to the lru list.
*/
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
/*
* Count referenced pages from currently used mappings as rotated,
* even though only some of them are actually re-activated. This
@@ -2098,7 +2110,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);

__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);

mem_cgroup_uncharge_list(&l_active);
free_unref_page_list(&l_active);
@@ -2247,7 +2259,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
u64 fraction[2];
u64 denominator = 0; /* gcc */
- struct pglist_data *pgdat = lruvec_pgdat(lruvec);
unsigned long anon_prio, file_prio;
enum scan_balance scan_balance;
unsigned long anon, file;
@@ -2325,7 +2336,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES) +
lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, MAX_NR_ZONES);

- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
reclaim_stat->recent_scanned[0] /= 2;
reclaim_stat->recent_rotated[0] /= 2;
@@ -2346,7 +2357,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,

fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
fp /= reclaim_stat->recent_rotated[1] + 1;
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);

fraction[0] = ap;
fraction[1] = fp;
@@ -4324,24 +4335,21 @@ int page_evictable(struct page *page)
*/
void check_move_unevictable_pages(struct pagevec *pvec)
{
- struct lruvec *lruvec;
- struct pglist_data *pgdat = NULL;
+ struct lruvec *lruvec = NULL;
int pgscanned = 0;
int pgrescued = 0;
int i;

for (i = 0; i < pvec->nr; i++) {
struct page *page = pvec->pages[i];
- struct pglist_data *pagepgdat = page_pgdat(page);
+ struct lruvec *new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));

pgscanned++;
- if (pagepgdat != pgdat) {
- if (pgdat)
- spin_unlock_irq(&pgdat->lru_lock);
- pgdat = pagepgdat;
- spin_lock_irq(&pgdat->lru_lock);
+ if (lruvec != new_lruvec) {
+ if (lruvec)
+ unlock_page_lruvec_irq(lruvec);
+ lruvec = lock_page_lruvec_irq(page);
}
- lruvec = mem_cgroup_page_lruvec(page, pgdat);

if (!PageLRU(page) || !PageUnevictable(page))
continue;
@@ -4357,10 +4365,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
}
}

- if (pgdat) {
+ if (lruvec) {
__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
- spin_unlock_irq(&pgdat->lru_lock);
+ unlock_page_lruvec_irq(lruvec);
}
}
EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
--
1.8.3.1

2019-12-25 09:06:09

by Alex Shi

Subject: [PATCH v7 08/10] mm/lru: revise the comments of lru_lock

From: Hugh Dickins <[email protected]>

Since we have changed pgdat->lru_lock to lruvec->lru_lock, it is time to
fix the now-incorrect comments in the code. Also fix some stale
zone->lru_lock comments left over from older code.

Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Cc: Souptick Joarder <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: "Tobin C. Harding" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Arun KS <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: "Jérôme Glisse" <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: Yafang Shao <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +++------------
Documentation/admin-guide/cgroup-v1/memory.rst | 6 +++---
Documentation/trace/events-kmem.rst | 2 +-
Documentation/vm/unevictable-lru.rst | 22 ++++++++--------------
include/linux/mm_types.h | 2 +-
include/linux/mmzone.h | 2 +-
mm/filemap.c | 4 ++--
mm/rmap.c | 2 +-
mm/vmscan.c | 12 ++++++++----
9 files changed, 28 insertions(+), 39 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
index 3f7115e07b5d..0b9f91589d3d 100644
--- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst
+++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
@@ -133,18 +133,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.

8. LRU
======
- Each memcg has its own private LRU. Now, its handling is under global
- VM's control (means that it's handled under global pgdat->lru_lock).
- Almost all routines around memcg's LRU is called by global LRU's
- list management functions under pgdat->lru_lock.
-
- A special function is mem_cgroup_isolate_pages(). This scans
- memcg's private LRU and call __isolate_lru_page() to extract a page
- from LRU.
-
- (By __isolate_lru_page(), the page is removed from both of global and
- private LRU.)
-
+ Each memcg has its own vector of LRUs (inactive anon, active anon,
+ inactive file, active file, unevictable) of pages from each node,
+ each LRU handled under a single lru_lock for that memcg and node.

9. Typical Tests.
=================
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 0ae4f564c2d6..60d97e8b7f3c 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -297,13 +297,13 @@ When oom event notifier is registered, event will be delivered.

PG_locked.
mm->page_table_lock
- pgdat->lru_lock
+ lruvec->lru_lock
lock_page_cgroup.

In many cases, just lock_page_cgroup() is called.

- per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
- pgdat->lru_lock, it has no lock of its own.
+ per-node-per-cgroup LRU (cgroup's private LRU) is just guarded by
+ lruvec->lru_lock, it has no lock of its own.

2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
-----------------------------------------------
diff --git a/Documentation/trace/events-kmem.rst b/Documentation/trace/events-kmem.rst
index 555484110e36..68fa75247488 100644
--- a/Documentation/trace/events-kmem.rst
+++ b/Documentation/trace/events-kmem.rst
@@ -69,7 +69,7 @@ When pages are freed in batch, the also mm_page_free_batched is triggered.
Broadly speaking, pages are taken off the LRU lock in bulk and
freed in batch with a page list. Significant amounts of activity here could
indicate that the system is under memory pressure and can also indicate
-contention on the zone->lru_lock.
+contention on the lruvec->lru_lock.

4. Per-CPU Allocator Activity
=============================
diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst
index 17d0861b0f1d..0e1490524f53 100644
--- a/Documentation/vm/unevictable-lru.rst
+++ b/Documentation/vm/unevictable-lru.rst
@@ -33,7 +33,7 @@ reclaim in Linux. The problems have been observed at customer sites on large
memory x86_64 systems.

To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
-main memory will have over 32 million 4k pages in a single zone. When a large
+main memory will have over 32 million 4k pages in a single node. When a large
fraction of these pages are not evictable for any reason [see below], vmscan
will spend a lot of time scanning the LRU lists looking for the small fraction
of pages that are evictable. This can result in a situation where all CPUs are
@@ -55,7 +55,7 @@ unevictable, either by definition or by circumstance, in the future.
The Unevictable Page List
-------------------------

-The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
+The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
called the "unevictable" list and an associated page flag, PG_unevictable, to
indicate that the page is being managed on the unevictable list.

@@ -84,15 +84,9 @@ The unevictable list does not differentiate between file-backed and anonymous,
swap-backed pages. This differentiation is only important while the pages are,
in fact, evictable.

-The unevictable list benefits from the "arrayification" of the per-zone LRU
+The unevictable list benefits from the "arrayification" of the per-node LRU
lists and statistics originally proposed and posted by Christoph Lameter.

-The unevictable list does not use the LRU pagevec mechanism. Rather,
-unevictable pages are placed directly on the page's zone's unevictable list
-under the zone lru_lock. This allows us to prevent the stranding of pages on
-the unevictable list when one task has the page isolated from the LRU and other
-tasks are changing the "evictability" state of the page.
-

Memory Control Group Interaction
--------------------------------
@@ -101,8 +95,8 @@ The unevictable LRU facility interacts with the memory control group [aka
memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by extending the
lru_list enum.

-The memory controller data structure automatically gets a per-zone unevictable
-list as a result of the "arrayification" of the per-zone LRU lists (one per
+The memory controller data structure automatically gets a per-node unevictable
+list as a result of the "arrayification" of the per-node LRU lists (one per
lru_list enum element). The memory controller tracks the movement of pages to
and from the unevictable list.

@@ -196,7 +190,7 @@ for the sake of expediency, to leave a unevictable page on one of the regular
active/inactive LRU lists for vmscan to deal with. vmscan checks for such
pages in all of the shrink_{active|inactive|page}_list() functions and will
"cull" such pages that it encounters: that is, it diverts those pages to the
-unevictable list for the zone being scanned.
+unevictable list for the node being scanned.

There may be situations where a page is mapped into a VM_LOCKED VMA, but the
page is not marked as PG_mlocked. Such pages will make it all the way to
@@ -328,7 +322,7 @@ If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
page from the LRU, as it is likely on the appropriate active or inactive list
at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will put
back the page - by calling putback_lru_page() - which will notice that the page
-is now mlocked and divert the page to the zone's unevictable list. If
+is now mlocked and divert the page to the node's unevictable list. If
mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
it later if and when it attempts to reclaim the page.

@@ -603,7 +597,7 @@ Some examples of these unevictable pages on the LRU lists are:
unevictable list in mlock_vma_page().

shrink_inactive_list() also diverts any unevictable pages that it finds on the
-inactive lists to the appropriate zone's unevictable list.
+inactive lists to the appropriate node's unevictable list.

shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
after shrink_active_list() had moved them to the inactive list, or pages mapped
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 270aa8fd2800..ff08a6a8145c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -78,7 +78,7 @@ struct page {
struct { /* Page cache and anonymous pages */
/**
* @lru: Pageout list, eg. active_list protected by
- * pgdat->lru_lock. Sometimes used as a generic list
+ * lruvec->lru_lock. Sometimes used as a generic list
* by the page owner.
*/
struct list_head lru;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7db0cec19aa0..d73be191e9f8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -159,7 +159,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
struct pglist_data;

/*
- * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
+ * zone->lock and the lru_lock are two of the hottest locks in the kernel.
* So add a wild amount of padding here to ensure that they fall into separate
* cachelines. There are very few zone structures in the machine, so space
* consumption is not a concern here.
diff --git a/mm/filemap.c b/mm/filemap.c
index bf6aa30be58d..6dcdf06660fb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -101,8 +101,8 @@
* ->swap_lock (try_to_unmap_one)
* ->private_lock (try_to_unmap_one)
* ->i_pages lock (try_to_unmap_one)
- * ->pgdat->lru_lock (follow_page->mark_page_accessed)
- * ->pgdat->lru_lock (check_pte_range->isolate_lru_page)
+ * ->lruvec->lru_lock (follow_page->mark_page_accessed)
+ * ->lruvec->lru_lock (check_pte_range->isolate_lru_page)
* ->private_lock (page_remove_rmap->set_page_dirty)
* ->i_pages lock (page_remove_rmap->set_page_dirty)
* bdi.wb->list_lock (page_remove_rmap->set_page_dirty)
diff --git a/mm/rmap.c b/mm/rmap.c
index b3e381919835..39052794cb46 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -27,7 +27,7 @@
* mapping->i_mmap_rwsem
* anon_vma->rwsem
* mm->page_table_lock or pte_lock
- * pgdat->lru_lock (in mark_page_accessed, isolate_lru_page)
+ * lruvec->lru_lock (in mark_page_accessed, isolate_lru_page)
* swap_lock (in swap_duplicate, swap_info_get)
* mmlist_lock (in mmput, drain_mmlist and others)
* mapping->private_lock (in __set_page_dirty_buffers)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a1d34ca1505e..47a3397cca1d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1626,14 +1626,16 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
}

/**
- * pgdat->lru_lock is heavily contended. Some of the functions that
+ * Isolating page from the lruvec to fill in @dst list by nr_to_scan times.
+ *
+ * lruvec->lru_lock is heavily contended. Some of the functions that
* shrink the lists perform better by taking out a batch of pages
* and working on them outside the LRU lock.
*
* For pagecache intensive workloads, this function is the hottest
* spot in the kernel (apart from copy_*_user functions).
*
- * Appropriate locks must be held before calling this function.
+ * Lru_lock must be held before calling this function.
*
* @nr_to_scan: The number of eligible pages to look through on the list.
* @lruvec: The LRU vector to pull pages from.
@@ -1820,14 +1822,16 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,

/*
* This moves pages from @list to corresponding LRU list.
+ * The pages from @list is out of any lruvec, and in the end list reuses as
+ * pages_to_free list.
*
* We move them the other way if the page is referenced by one or more
* processes, from rmap.
*
* If the pages are mostly unmapped, the processing is fast and it is
- * appropriate to hold zone_lru_lock across the whole operation. But if
+ * appropriate to hold lru_lock across the whole operation. But if
* the pages are mapped, the processing is slow (page_referenced()) so we
- * should drop zone_lru_lock around each page. It's impossible to balance
+ * should drop lru_lock around each page. It's impossible to balance
* this, so instead we remove the pages from the LRU while processing them.
* It is safe to rely on PG_active against the non-LRU pages in here because
* nobody will play with that bit on a non-LRU page.
--
1.8.3.1

2019-12-25 09:06:12

by Alex Shi

Subject: [PATCH v7 02/10] mm/memcg: fold lru_lock in lock_page_lru

From commit_charge's explanation and the mem_cgroup_commit_charge
comments, as well as the call path when lrucare is true, the lru_lock is
only there to guard against task migration (which would lead to
move_account). So it isn't needed when !PageLRU, and it is better folded
into the PageLRU branch to reduce lock contention.

Signed-off-by: Alex Shi <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/memcontrol.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c5b5f74cfd4d..0ad10caabc3d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2572,12 +2572,11 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)

static void lock_page_lru(struct page *page, int *isolated)
{
- pg_data_t *pgdat = page_pgdat(page);
-
- spin_lock_irq(&pgdat->lru_lock);
if (PageLRU(page)) {
+ pg_data_t *pgdat = page_pgdat(page);
struct lruvec *lruvec;

+ spin_lock_irq(&pgdat->lru_lock);
lruvec = mem_cgroup_page_lruvec(page, pgdat);
ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_lru(page));
@@ -2588,17 +2587,17 @@ static void lock_page_lru(struct page *page, int *isolated)

static void unlock_page_lru(struct page *page, int isolated)
{
- pg_data_t *pgdat = page_pgdat(page);

if (isolated) {
+ pg_data_t *pgdat = page_pgdat(page);
struct lruvec *lruvec;

lruvec = mem_cgroup_page_lruvec(page, pgdat);
VM_BUG_ON_PAGE(PageLRU(page), page);
SetPageLRU(page);
add_page_to_lru_list(page, lruvec, page_lru(page));
+ spin_unlock_irq(&pgdat->lru_lock);
}
- spin_unlock_irq(&pgdat->lru_lock);
}

static void commit_charge(struct page *page, struct mem_cgroup *memcg,
--
1.8.3.1

2019-12-25 09:06:28

by Alex Shi

Subject: [PATCH v7 04/10] mm/lru: introduce the relock_page_lruvec function

During lruvec locking, a new page's lruvec may be the same as the
previous one. In that case we can save a re-locking and only switch
locks when the lruvec actually differs.

The function is named relock_page_lruvec, following Hugh Dickins' patch.
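
As a usage sketch (essentially what patch 06 later does in
pagevec_lru_move_fn), a caller walking a pagevec only switches locks
when the page's lruvec differs from the one already held, and drops the
lock once at the end:

	struct lruvec *lruvec = NULL;
	unsigned long flags = 0;
	int i;

	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];

		/* relock only if this page's lruvec differs from the held one */
		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
		(*move_fn)(page, lruvec, arg);
	}
	if (lruvec)
		unlock_page_lruvec_irqrestore(lruvec, flags);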

Signed-off-by: Alex Shi <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Chris Down <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: swkhack <[email protected]>
Cc: "Potyra, Stefan" <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Peng Fan <[email protected]>
Cc: Nikolay Borisov <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yafang Shao <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/memcontrol.h | 36 ++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 8 ++------
2 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8389b9b927ef..09e861df48e8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1299,6 +1299,42 @@ static inline void dec_lruvec_page_state(struct page *page,
mod_lruvec_page_state(page, idx, -1);
}

+/* Don't lock again if the page's lruvec is already locked */
+static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
+ struct lruvec *locked_lruvec)
+{
+ struct pglist_data *pgdat = page_pgdat(page);
+ struct lruvec *lruvec;
+
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
+ if (likely(locked_lruvec == lruvec))
+ return lruvec;
+
+ if (unlikely(locked_lruvec))
+ unlock_page_lruvec_irq(locked_lruvec);
+
+ return lock_page_lruvec_irq(page);
+}
+
+/* Don't lock again if the page's lruvec is already locked */
+static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
+ struct lruvec *locked_lruvec, unsigned long *flags)
+{
+ struct pglist_data *pgdat = page_pgdat(page);
+ struct lruvec *lruvec;
+
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
+ if (likely(locked_lruvec == lruvec))
+ return lruvec;
+
+ if (unlikely(locked_lruvec))
+ unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
+
+ return lock_page_lruvec_irqsave(page, flags);
+}
+
#ifdef CONFIG_CGROUP_WRITEBACK

struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a97e35675286..a1d34ca1505e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4342,14 +4342,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)

for (i = 0; i < pvec->nr; i++) {
struct page *page = pvec->pages[i];
- struct lruvec *new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));

pgscanned++;
- if (lruvec != new_lruvec) {
- if (lruvec)
- unlock_page_lruvec_irq(lruvec);
- lruvec = lock_page_lruvec_irq(page);
- }
+
+ lruvec = relock_page_lruvec_irq(page, lruvec);

if (!PageLRU(page) || !PageUnevictable(page))
continue;
--
1.8.3.1

2019-12-25 09:06:40

by Alex Shi

[permalink] [raw]
Subject: [PATCH v7 05/10] mm/mlock: optimize munlock_pagevec by relocking

During pagevec locking, a new page's lruvec may be the same as the
previous one. In that case we can skip the re-locking and only switch
the lock iff the lruvec is different.

Signed-off-by: Alex Shi <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Andrew Morton <[email protected]>
---
mm/mlock.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 10d15f58b061..050f999eadb1 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -289,6 +289,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
{
int i;
int nr = pagevec_count(pvec);
+ int delta_munlocked = -nr;
struct pagevec pvec_putback;
struct lruvec *lruvec = NULL;
int pgrescued = 0;
@@ -299,20 +300,19 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
for (i = 0; i < nr; i++) {
struct page *page = pvec->pages[i];

- lruvec = lock_page_lruvec_irq(page);
+ lruvec = relock_page_lruvec_irq(page, lruvec);

if (TestClearPageMlocked(page)) {
/*
* We already have pin from follow_page_mask()
* so we can spare the get_page() here.
*/
- if (__munlock_isolate_lru_page(page, lruvec, false)) {
- __mod_zone_page_state(zone, NR_MLOCK, -1);
- unlock_page_lruvec_irq(lruvec);
+ if (__munlock_isolate_lru_page(page, lruvec, false))
continue;
- } else
+ else
__munlock_isolation_failed(page);
- }
+ } else
+ delta_munlocked++;

/*
* We won't be munlocking this page in the next phase
@@ -322,8 +322,10 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
*/
pagevec_add(&pvec_putback, pvec->pages[i]);
pvec->pages[i] = NULL;
- unlock_page_lruvec_irq(lruvec);
}
+ __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
+ if (lruvec)
+ unlock_page_lruvec_irq(lruvec);

/* Now we can release pins of pages that we are not munlocking */
pagevec_release(&pvec_putback);
--
1.8.3.1

2019-12-25 09:06:42

by Alex Shi

[permalink] [raw]
Subject: [PATCH v7 09/10] mm/lru: add debug checking for page memcg moving

This debug patch can give some clues if something has been left out of
consideration in the lruvec/memcg locking.

Signed-off-by: Alex Shi <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/compaction.c | 4 ++++
mm/memcontrol.c | 8 ++++++++
2 files changed, 12 insertions(+)

diff --git a/mm/compaction.c b/mm/compaction.c
index 8c0a2da217d8..f47820355b66 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -971,6 +971,10 @@ static bool too_many_isolated(pg_data_t *pgdat)
compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
locked_lruvec = lruvec;

+#ifdef CONFIG_MEMCG
+ if (!mem_cgroup_disabled())
+ VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
+#endif
/* Try get exclusive access under lock */
if (!skip_updated) {
skip_updated = true;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1b69e27d1b9f..33bf086faed0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1247,6 +1247,10 @@ struct lruvec *lock_page_lruvec_irq(struct page *page)
lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
spin_lock_irq(&lruvec->lru_lock);

+#ifdef CONFIG_MEMCG
+ if (!mem_cgroup_disabled())
+ VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
+#endif
return lruvec;
}

@@ -1259,6 +1263,10 @@ struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
spin_lock_irqsave(&lruvec->lru_lock, *flags);

+#ifdef CONFIG_MEMCG
+ if (!mem_cgroup_disabled())
+ VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
+#endif
return lruvec;
}

--
1.8.3.1

2019-12-31 23:07:10

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v7 00/10] per lruvec lru_lock for memcg

On Wed, 25 Dec 2019 17:04:16 +0800 Alex Shi <[email protected]> wrote:

> This patchset move lru_lock into lruvec, give a lru_lock for each of
> lruvec, thus bring a lru_lock for each of memcg per node.

I see that there has been plenty of feedback on previous versions, but
no acked/reviewed tags as yet.

I think I'll take a pass for now, see what the audience feedback looks
like ;)

2020-01-02 10:23:38

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v7 00/10] per lruvec lru_lock for memcg



On 2020/1/1 at 7:05 AM, Andrew Morton wrote:
> On Wed, 25 Dec 2019 17:04:16 +0800 Alex Shi <[email protected]> wrote:
>
>> This patchset move lru_lock into lruvec, give a lru_lock for each of
>> lruvec, thus bring a lru_lock for each of memcg per node.
>
> I see that there has been plenty of feedback on previous versions, but
> no acked/reviewed tags as yet.
>
> I think I'll take a pass for now, see what the audience feedback looks
> like ;)
>


Thanks a lot! Andrew.

Please drop the 10th patch since it's for debugging only and causes a performance drop.

Best regards & Happy new year! :)
Alex

2020-01-10 02:03:56

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v7 00/10] per lruvec lru_lock for memcg



On 2020/1/2 at 6:21 PM, Alex Shi wrote:
>
>
> On 2020/1/1 at 7:05 AM, Andrew Morton wrote:
>> On Wed, 25 Dec 2019 17:04:16 +0800 Alex Shi <[email protected]> wrote:
>>
>>> This patchset move lru_lock into lruvec, give a lru_lock for each of
>>> lruvec, thus bring a lru_lock for each of memcg per node.
>>
>> I see that there has been plenty of feedback on previous versions, but
>> no acked/reviewed tags as yet.
>>
>> I think I'll take a pass for now, see what the audience feedback looks
>> like ;)
>>
>

Hi Johannes,

Any comments of this version? :)

Thanks
Alex

>
> Thanks a lot! Andrew.
>
> Please drop the 10th patch since it's for debug only and cost performance drop.
>
> Best regards & Happy new year! :)
> Alex
>

2020-01-10 08:42:14

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [PATCH v7 01/10] mm/vmscan: remove unnecessary lruvec adding

On 25/12/2019 12.04, Alex Shi wrote:
> We don't have to add a freeable page into lru and then remove from it.
> This change saves a couple of actions and makes the moving more clear.
>
> The SetPageLRU needs to be kept here for list intergrity. Otherwise:
>
> #0 mave_pages_to_lru #1 release_pages
> if (put_page_testzero())
> if (put_page_testzero())
> !PageLRU //skip lru_lock
> list_add(&page->lru,);
> else
> list_add(&page->lru,) //corrupt
>
> Signed-off-by: Alex Shi <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> ---
> mm/vmscan.c | 16 +++++++---------
> 1 file changed, 7 insertions(+), 9 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 572fb17c6273..8719361b47a0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1852,26 +1852,18 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,

Here is another cleanup: pass only pgdat as argument.

This function re-evaluates the lruvec for each page under the lru lock.
Probably this is redundant for now but could be used in the future (or your patchset already uses that).

> while (!list_empty(list)) {
> page = lru_to_page(list);
> VM_BUG_ON_PAGE(PageLRU(page), page);
> + list_del(&page->lru);
> if (unlikely(!page_evictable(page))) {
> - list_del(&page->lru);
> spin_unlock_irq(&pgdat->lru_lock);
> putback_lru_page(page);
> spin_lock_irq(&pgdat->lru_lock);
> continue;
> }
> - lruvec = mem_cgroup_page_lruvec(page, pgdat);
> -

Please leave a comment that we must set PageLRU before dropping our page reference.

> SetPageLRU(page);
> - lru = page_lru(page);
> -
> - nr_pages = hpage_nr_pages(page);
> - update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
> - list_move(&page->lru, &lruvec->lists[lru]);
>
> if (put_page_testzero(page)) {
> __ClearPageLRU(page);
> __ClearPageActive(page);
> - del_page_from_lru_list(page, lruvec, lru);
>
> if (unlikely(PageCompound(page))) {
> spin_unlock_irq(&pgdat->lru_lock);
> @@ -1880,6 +1872,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
> } else
> list_add(&page->lru, &pages_to_free);
> } else {
> + lruvec = mem_cgroup_page_lruvec(page, pgdat);
> + lru = page_lru(page);
> + nr_pages = hpage_nr_pages(page);
> +
> + update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
> + list_add(&page->lru, &lruvec->lists[lru]);
> nr_moved += nr_pages;
> }

IMHO it looks better done this way:

SetPageLRU()

if (unlikely(put_page_testzero())) {
<free>
continue;
}

<add>

> }
>

2020-01-10 08:50:21

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [PATCH v7 02/10] mm/memcg: fold lru_lock in lock_page_lru

On 25/12/2019 12.04, Alex Shi wrote:
> From the commit_charge's explanations and mem_cgroup_commit_charge
> comments, as well as call path when lrucare is ture, The lru_lock is
> just to guard the task migration(which would be lead to move_account)
> So it isn't needed when !PageLRU, and better be fold into PageLRU to
> reduce lock contentions.
>
> Signed-off-by: Alex Shi <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Vladimir Davydov <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> ---
> mm/memcontrol.c | 9 ++++-----
> 1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c5b5f74cfd4d..0ad10caabc3d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2572,12 +2572,11 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
>
> static void lock_page_lru(struct page *page, int *isolated)
> {
> - pg_data_t *pgdat = page_pgdat(page);
> -
> - spin_lock_irq(&pgdat->lru_lock);
> if (PageLRU(page)) {
> + pg_data_t *pgdat = page_pgdat(page);
> struct lruvec *lruvec;
>
> + spin_lock_irq(&pgdat->lru_lock);

That's wrong. Here PageLRU must be checked again under lru_lock.


Also I don't like these functions:
- called lock/unlock but actually also isolates
- used just once
- pgdat evaluated twice

> lruvec = mem_cgroup_page_lruvec(page, pgdat);
> ClearPageLRU(page);
> del_page_from_lru_list(page, lruvec, page_lru(page));
> @@ -2588,17 +2587,17 @@ static void lock_page_lru(struct page *page, int *isolated)
>
> static void unlock_page_lru(struct page *page, int isolated)
> {
> - pg_data_t *pgdat = page_pgdat(page);
>
> if (isolated) {
> + pg_data_t *pgdat = page_pgdat(page);
> struct lruvec *lruvec;
>
> lruvec = mem_cgroup_page_lruvec(page, pgdat);
> VM_BUG_ON_PAGE(PageLRU(page), page);
> SetPageLRU(page);
> add_page_to_lru_list(page, lruvec, page_lru(page));
> + spin_unlock_irq(&pgdat->lru_lock);
> }
> - spin_unlock_irq(&pgdat->lru_lock);
> }
>
> static void commit_charge(struct page *page, struct mem_cgroup *memcg,
>

2020-01-13 07:24:33

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v7 01/10] mm/vmscan: remove unnecessary lruvec adding



On 2020/1/10 at 4:39 PM, Konstantin Khlebnikov wrote:
> On 25/12/2019 12.04, Alex Shi wrote:
>> We don't have to add a freeable page into lru and then remove from it.
>> This change saves a couple of actions and makes the moving more clear.
>>
>> The SetPageLRU needs to be kept here for list intergrity. Otherwise:
>>
>>      #0 mave_pages_to_lru                #1 release_pages
>>                                          if (put_page_testzero())
>>      if (put_page_testzero())
>>                                         !PageLRU //skip lru_lock
>>                                                  list_add(&page->lru,);
>>      else
>>          list_add(&page->lru,) //corrupt
>>
>> Signed-off-by: Alex Shi <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Hugh Dickins <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> ---
>>   mm/vmscan.c | 16 +++++++---------
>>   1 file changed, 7 insertions(+), 9 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 572fb17c6273..8719361b47a0 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1852,26 +1852,18 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>
> Here is another cleanup: pass only pgdat as argument.
>
> This function reavaluates lruvec for each page under lru lock.
> Probably this is redundant for now but could be used in the future (or your patchset already use that).

Thanks a lot for comments, Konstantin!

Yes, we could pass only pgdat here, but since pgdat gets removed from this path later in the series, maybe it's better to skip this change? :)

>
>>       while (!list_empty(list)) {
>>           page = lru_to_page(list);
>>           VM_BUG_ON_PAGE(PageLRU(page), page);
>> +        list_del(&page->lru);
>>           if (unlikely(!page_evictable(page))) {
>> -            list_del(&page->lru);
>>               spin_unlock_irq(&pgdat->lru_lock);
>>               putback_lru_page(page);
>>               spin_lock_irq(&pgdat->lru_lock);
>>               continue;
>>           }
>> -        lruvec = mem_cgroup_page_lruvec(page, pgdat);
>> -
>
> Please leave a comment that we must set PageLRU before dropping our page reference.

Right, I will add a comment here.

>
>>           SetPageLRU(page);
>> -        lru = page_lru(page);
>> -
>> -        nr_pages = hpage_nr_pages(page);
>> -        update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
>> -        list_move(&page->lru, &lruvec->lists[lru]);
>>             if (put_page_testzero(page)) {
>>               __ClearPageLRU(page);
>>               __ClearPageActive(page);
>> -            del_page_from_lru_list(page, lruvec, lru);
>>                 if (unlikely(PageCompound(page))) {
>>                   spin_unlock_irq(&pgdat->lru_lock);
>> @@ -1880,6 +1872,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>>               } else
>>                   list_add(&page->lru, &pages_to_free);
>>           } else {
>> +            lruvec = mem_cgroup_page_lruvec(page, pgdat);
>> +            lru = page_lru(page);
>> +            nr_pages = hpage_nr_pages(page);
>> +
>> +            update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
>> +            list_add(&page->lru, &lruvec->lists[lru]);
>>               nr_moved += nr_pages;
>>           }
>
> IMHO It looks better to in this way:>
> SetPageLRU()
>
> if (unlikely(put_page_testzero())) {
>  <free>
>  continue;
> }
>
> <add>

Yes, this looks better!

Thanks
Alex

>
>>       }
>>

2020-01-13 08:49:37

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v7 00/10] per lruvec lru_lock for memcg

On Fri, 10 Jan 2020, Alex Shi wrote:
> On 2020/1/2 at 6:21 PM, Alex Shi wrote:
> > On 2020/1/1 at 7:05 AM, Andrew Morton wrote:
> >> On Wed, 25 Dec 2019 17:04:16 +0800 Alex Shi <[email protected]> wrote:
> >>
> >>> This patchset move lru_lock into lruvec, give a lru_lock for each of
> >>> lruvec, thus bring a lru_lock for each of memcg per node.
> >>
> >> I see that there has been plenty of feedback on previous versions, but
> >> no acked/reviewed tags as yet.
> >>
> >> I think I'll take a pass for now, see what the audience feedback looks
> >> like ;)
> >>
> >
>
> Hi Johannes,
>
> Any comments of this version? :)

I (Hugh) tried to test it on v5.5-rc5, but did not get very far at all -
perhaps because my particular interest tends towards tmpfs and swap,
and swap always made trouble for lruvec lock - one of the reasons why
our patches were more complicated than you thought necessary.

Booted a smallish kernel in mem=700M with 1.5G of swap, with intention
of running small kernel builds in tmpfs and in ext4-on-loop-on-tmpfs
(losetup was the last command started but I doubt it played much part):

mount -t tmpfs -o size=470M tmpfs /tst
cp /dev/zero /tst
losetup /dev/loop0 /tst/zero

and kernel crashed on the

VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
kernel BUG at mm/memcontrol.c:1268!
lock_page_lruvec_irqsave
relock_page_lruvec_irqsave
pagevec_lru_move_fn
__pagevec_lru_add
lru_add_drain_cpu
lru_add_drain
swap_cluster_readahead
shmem_swapin
shmem_swapin_page
shmem_getpage_gfp
shmem_getpage
shmem_write_begin
generic_perform_write
__generic_file_write_iter
generic_file_write_iter
new_sync_write
__vfs_write
vfs_write
ksys_write
__x86_sys_write
do_syscall_64

Hugh

2020-01-13 09:49:34

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v7 02/10] mm/memcg: fold lru_lock in lock_page_lru



On 2020/1/10 at 4:49 PM, Konstantin Khlebnikov wrote:
> On 25/12/2019 12.04, Alex Shi wrote:
>>  From the commit_charge's explanations and mem_cgroup_commit_charge
>> comments, as well as call path when lrucare is ture, The lru_lock is
>> just to guard the task migration(which would be lead to move_account)
>> So it isn't needed when !PageLRU, and better be fold into PageLRU to
>> reduce lock contentions.
>>
>> Signed-off-by: Alex Shi <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Michal Hocko <[email protected]>
>> Cc: Matthew Wilcox <[email protected]>
>> Cc: Vladimir Davydov <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> ---
>>   mm/memcontrol.c | 9 ++++-----
>>   1 file changed, 4 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index c5b5f74cfd4d..0ad10caabc3d 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2572,12 +2572,11 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
>>     static void lock_page_lru(struct page *page, int *isolated)
>>   {
>> -    pg_data_t *pgdat = page_pgdat(page);
>> -
>> -    spin_lock_irq(&pgdat->lru_lock);
>>       if (PageLRU(page)) {
>> +        pg_data_t *pgdat = page_pgdat(page);
>>           struct lruvec *lruvec;
>>   +        spin_lock_irq(&pgdat->lru_lock);
>
> That's wrong. Here PageLRU must be checked again under lru_lock.
Hi, Konstantin,

To keep the logic equivalent, we could take the lock and then release it for !PageLRU,
but I still can't figure out the problem scenario. Would you like to give more hints?


>
>
> Also I don't like these functions:
> - called lock/unlock but actually also isolates
> - used just once
> - pgdat evaluated twice

That's right. I will fold these functions into commit_charge.

Thanks
Alex

2020-01-13 09:57:25

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [PATCH v7 02/10] mm/memcg: fold lru_lock in lock_page_lru



On 13/01/2020 12.45, Alex Shi wrote:
>
>
> On 2020/1/10 at 4:49 PM, Konstantin Khlebnikov wrote:
>> On 25/12/2019 12.04, Alex Shi wrote:
>>>  From the commit_charge's explanations and mem_cgroup_commit_charge
>>> comments, as well as call path when lrucare is ture, The lru_lock is
>>> just to guard the task migration(which would be lead to move_account)
>>> So it isn't needed when !PageLRU, and better be fold into PageLRU to
>>> reduce lock contentions.
>>>
>>> Signed-off-by: Alex Shi <[email protected]>
>>> Cc: Johannes Weiner <[email protected]>
>>> Cc: Michal Hocko <[email protected]>
>>> Cc: Matthew Wilcox <[email protected]>
>>> Cc: Vladimir Davydov <[email protected]>
>>> Cc: Andrew Morton <[email protected]>
>>> Cc: [email protected]
>>> Cc: [email protected]
>>> Cc: [email protected]
>>> ---
>>>   mm/memcontrol.c | 9 ++++-----
>>>   1 file changed, 4 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index c5b5f74cfd4d..0ad10caabc3d 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -2572,12 +2572,11 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
>>>     static void lock_page_lru(struct page *page, int *isolated)
>>>   {
>>> -    pg_data_t *pgdat = page_pgdat(page);
>>> -
>>> -    spin_lock_irq(&pgdat->lru_lock);
>>>       if (PageLRU(page)) {
>>> +        pg_data_t *pgdat = page_pgdat(page);
>>>           struct lruvec *lruvec;
>>>   +        spin_lock_irq(&pgdat->lru_lock);
>>
>> That's wrong. Here PageLRU must be checked again under lru_lock.
> Hi, Konstantin,
>
> For logical remain, we can get the lock and then release for !PageLRU.
> but I still can figure out the problem scenario. Would like to give more hints?

That's a trivial race: the page could be isolated from the lru between

if (PageLRU(page))
and
spin_lock_irq(&pgdat->lru_lock);
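
In the same style as the race diagram in patch 01, the window looks roughly
like this:

        #0 commit_charge (lrucare)         #1 isolate_lru_page
        if (PageLRU(page))      // true
                                           spin_lock_irq(&pgdat->lru_lock);
                                           ClearPageLRU(page);
                                           del_page_from_lru_list(page, ...);
                                           spin_unlock_irq(&pgdat->lru_lock);
        spin_lock_irq(&pgdat->lru_lock);
        ClearPageLRU(page);                // already clear
        del_page_from_lru_list(page, ...); // page is on no LRU list

so the second del_page_from_lru_list() operates on a page that has already
been taken off its list.
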

>
>
>>
>>
>> Also I don't like these functions:
>> - called lock/unlock but actually also isolates
>> - used just once
>> - pgdat evaluated twice
>
> That's right. I will fold these functions into commit_charge.
>
> Thanks
> Alex
>

2020-01-13 12:48:08

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v7 00/10] per lruvec lru_lock for memcg



On 2020/1/13 at 4:48 PM, Hugh Dickins wrote:
> On Fri, 10 Jan 2020, Alex Shi wrote:
>> On 2020/1/2 at 6:21 PM, Alex Shi wrote:
>>> On 2020/1/1 at 7:05 AM, Andrew Morton wrote:
>>>> On Wed, 25 Dec 2019 17:04:16 +0800 Alex Shi <[email protected]> wrote:
>>>>
>>>>> This patchset move lru_lock into lruvec, give a lru_lock for each of
>>>>> lruvec, thus bring a lru_lock for each of memcg per node.
>>>>
>>>> I see that there has been plenty of feedback on previous versions, but
>>>> no acked/reviewed tags as yet.
>>>>
>>>> I think I'll take a pass for now, see what the audience feedback looks
>>>> like ;)
>>>>
>>>
>>
>> Hi Johannes,
>>
>> Any comments of this version? :)
>
> I (Hugh) tried to test it on v5.5-rc5, but did not get very far at all -
> perhaps because my particular interest tends towards tmpfs and swap,
> and swap always made trouble for lruvec lock - one of the reasons why
> our patches were more complicated than you thought necessary.
>
> Booted a smallish kernel in mem=700M with 1.5G of swap, with intention
> of running small kernel builds in tmpfs and in ext4-on-loop-on-tmpfs
> (losetup was the last command started but I doubt it played much part):
>
> mount -t tmpfs -o size=470M tmpfs /tst
> cp /dev/zero /tst
> losetup /dev/loop0 /tst/zero

Hi Hugh,

Many thanks for the testing!

I am trying to reproduce your testing: I do the above 3 steps and then build a kernel with 'make -j 8' in my qemu guest, but I cannot reproduce the problem with this v7 version or with the v8 version, https://github.com/alexshi/linux/tree/lru-next, which fixed the bug KK mentioned, as in the following.
My qemu VM looks like this:

[root@debug010000002015 ~]# mount -t tmpfs -o size=470M tmpfs /tst
[root@debug010000002015 ~]# cp /dev/zero /tst
cp: error writing ‘/tst/zero’: No space left on device
cp: failed to extend ‘/tst/zero’: No space left on device
[root@debug010000002015 ~]# losetup /dev/loop0 /tst/zero
[root@debug010000002015 ~]# cat /proc/cmdline
earlyprintk=ttyS0 root=/dev/sda1 console=ttyS0 debug crashkernel=128M printk.devkmsg=on

My kernel is configured with MEMCG/MEMCG_SWAP on an xfs root image, and the kernel build runs on ext4. Could you share your kernel config and detailed reproduction steps with me? And would you like to try my new version from the above github link at your convenience?

Thanks a lot!
Alex

static void commit_charge(struct page *page, struct mem_cgroup *memcg,
bool lrucare)
{
- int isolated;
+ struct lruvec *lruvec = NULL;

VM_BUG_ON_PAGE(page->mem_cgroup, page);

@@ -2612,8 +2617,16 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
* In some cases, SwapCache and FUSE(splice_buf->radixtree), the page
* may already be on some other mem_cgroup's LRU. Take care of it.
*/
- if (lrucare)
- lock_page_lru(page, &isolated);
+ if (lrucare) {
+ lruvec = lock_page_lruvec_irq(page);
+ if (likely(PageLRU(page))) {
+ ClearPageLRU(page);
+ del_page_from_lru_list(page, lruvec, page_lru(page));
+ } else {
+ unlock_page_lruvec_irq(lruvec);
+ lruvec = NULL;
+ }
+ }

/*
* Nobody should be changing or seriously looking at
@@ -2631,8 +2644,15 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
*/
page->mem_cgroup = memcg;

- if (lrucare)
- unlock_page_lru(page, isolated);
+ if (lrucare && lruvec) {
+ unlock_page_lruvec_irq(lruvec);
+ lruvec = lock_page_lruvec_irq(page);
+
+ VM_BUG_ON_PAGE(PageLRU(page), page);
+ SetPageLRU(page);
+ add_page_to_lru_list(page, lruvec, page_lru(page));
+ unlock_page_lruvec_irq(lruvec);
+ }
}
>
> and kernel crashed on the
>
> VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
> kernel BUG at mm/memcontrol.c:1268!
> lock_page_lruvec_irqsave
> relock_page_lruvec_irqsave
> pagevec_lru_move_fn
> __pagevec_lru_add
> lru_add_drain_cpu
> lru_add_drain
> swap_cluster_readahead
> shmem_swapin
> shmem_swapin_page
> shmem_getpage_gfp
> shmem_getpage
> shmem_write_begin
> generic_perform_write
> __generic_file_write_iter
> generic_file_write_iter
> new_sync_write
> __vfs_write
> vfs_write
> ksys_write
> __x86_sys_write
> do_syscall_64
>
> Hugh
>

2020-01-13 12:50:10

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v7 02/10] mm/memcg: fold lru_lock in lock_page_lru



On 2020/1/13 at 5:55 PM, Konstantin Khlebnikov wrote:
>>>>
>>>> index c5b5f74cfd4d..0ad10caabc3d 100644
>>>> --- a/mm/memcontrol.c
>>>> +++ b/mm/memcontrol.c
>>>> @@ -2572,12 +2572,11 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
>>>>      static void lock_page_lru(struct page *page, int *isolated)
>>>>    {
>>>> -    pg_data_t *pgdat = page_pgdat(page);
>>>> -
>>>> -    spin_lock_irq(&pgdat->lru_lock);
>>>>        if (PageLRU(page)) {
>>>> +        pg_data_t *pgdat = page_pgdat(page);
>>>>            struct lruvec *lruvec;
>>>>    +        spin_lock_irq(&pgdat->lru_lock);
>>>
>>> That's wrong. Here PageLRU must be checked again under lru_lock.
>> Hi, Konstantin,
>>
>> For logical remain, we can get the lock and then release for !PageLRU.
>> but I still can figure out the problem scenario. Would like to give more hints?
>
> That's trivial race: page could be isolated from lru between
>
> if (PageLRU(page))
> and
> spin_lock_irq(&pgdat->lru_lock);

Yes, it could be a problem. I guess the following change could be helpful;
I will update it in a new version.

Thanks a lot!
Alex

-static void lock_page_lru(struct page *page, int *isolated)
-{
- pg_data_t *pgdat = page_pgdat(page);
-
- spin_lock_irq(&pgdat->lru_lock);
- if (PageLRU(page)) {
- struct lruvec *lruvec;
-
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
- ClearPageLRU(page);
- del_page_from_lru_list(page, lruvec, page_lru(page));
- *isolated = 1;
- } else
- *isolated = 0;
-}
-
-static void unlock_page_lru(struct page *page, int isolated)
-{
- pg_data_t *pgdat = page_pgdat(page);
-
- if (isolated) {
- struct lruvec *lruvec;
-
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
- VM_BUG_ON_PAGE(PageLRU(page), page);
- SetPageLRU(page);
- add_page_to_lru_list(page, lruvec, page_lru(page));
- }
- spin_unlock_irq(&pgdat->lru_lock);
-}
-
static void commit_charge(struct page *page, struct mem_cgroup *memcg,
bool lrucare)
{
- int isolated;
+ struct lruvec *lruvec = NULL;

VM_BUG_ON_PAGE(page->mem_cgroup, page);

@@ -2612,8 +2617,16 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
* In some cases, SwapCache and FUSE(splice_buf->radixtree), the page
* may already be on some other mem_cgroup's LRU. Take care of it.
*/
- if (lrucare)
- lock_page_lru(page, &isolated);
+ if (lrucare) {
+ lruvec = lock_page_lruvec_irq(page);
+ if (likely(PageLRU(page))) {
+ ClearPageLRU(page);
+ del_page_from_lru_list(page, lruvec, page_lru(page));
+ } else {
+ unlock_page_lruvec_irq(lruvec);
+ lruvec = NULL;
+ }
+ }

/*
* Nobody should be changing or seriously looking at
@@ -2631,8 +2644,15 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
*/
page->mem_cgroup = memcg;

- if (lrucare)
- unlock_page_lru(page, isolated);
+ if (lrucare && lruvec) {
+ unlock_page_lruvec_irq(lruvec);
+ lruvec = lock_page_lruvec_irq(page);
+
+ VM_BUG_ON_PAGE(PageLRU(page), page);
+ SetPageLRU(page);
+ add_page_to_lru_list(page, lruvec, page_lru(page));
+ unlock_page_lruvec_irq(lruvec);
+ }
}

2020-01-13 15:44:23

by Daniel Jordan

[permalink] [raw]
Subject: Re: [PATCH v7 03/10] mm/lru: replace pgdat lru_lock with lruvec lock

Hi Alex,

On Wed, Dec 25, 2019 at 05:04:19PM +0800, Alex Shi wrote:
> @@ -900,6 +904,29 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
> {
> return &pgdat->__lruvec;
> }
> +#define lock_page_lruvec_irq(page) \
> +({ \
> + struct pglist_data *pgdat = page_pgdat(page); \
> + spin_lock_irq(&pgdat->__lruvec.lru_lock); \
> + &pgdat->__lruvec; \
> +})
> +
> +#define lock_page_lruvec_irqsave(page, flagsp) \
> +({ \
> + struct pglist_data *pgdat = page_pgdat(page); \
> + spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp); \
> + &pgdat->__lruvec; \
> +})
> +
> +#define unlock_page_lruvec_irq(lruvec) \
> +({ \
> + spin_unlock_irq(&lruvec->lru_lock); \
> +})
> +
> +#define unlock_page_lruvec_irqrestore(lruvec, flags) \
> +({ \
> + spin_unlock_irqrestore(&lruvec->lru_lock, flags); \
> +})

Noticed this while testing your series. These are safe as inline functions, so
I think you may have gotten the wrong impression when Johannes made this point:

https://lore.kernel.org/linux-mm/[email protected]/

2020-01-13 16:36:35

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v7 02/10] mm/memcg: fold lru_lock in lock_page_lru

On Mon, Jan 13, 2020 at 08:47:25PM +0800, Alex Shi wrote:
> On 2020/1/13 at 5:55 PM, Konstantin Khlebnikov wrote:
> >>> That's wrong. Here PageLRU must be checked again under lru_lock.
> >> Hi, Konstantin,
> >>
> >> For logical remain, we can get the lock and then release for !PageLRU.
> >> but I still can figure out the problem scenario. Would like to give more hints?
> >
> > That's trivial race: page could be isolated from lru between
> >
> > if (PageLRU(page))
> > and
> > spin_lock_irq(&pgdat->lru_lock);
>
> yes, it could be a problem. guess the following change could helpful:
> I will update it in new version.

> + if (lrucare) {
> + lruvec = lock_page_lruvec_irq(page);
> + if (likely(PageLRU(page))) {
> + ClearPageLRU(page);
> + del_page_from_lru_list(page, lruvec, page_lru(page));
> + } else {
> + unlock_page_lruvec_irq(lruvec);
> + lruvec = NULL;
> + }

What about a harder race to hit like a page being on LRU list A when you
look up the lruvec, then it's removed and added to LRU list B by the
time you get the lock? At that point, you are holding a lock on the
wrong LRU list. I think you need to check not just that the page
is still PageLRU but also still on the same LRU list.
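
One way to close the lruvec half of that window (just a sketch, not part of
this series; the helper name is made up) is to look the lruvec up again after
taking the lock and retry if the page has moved to a different lruvec. The
active/inactive list part would still need a PageLRU / list re-check by the
caller, and something else (e.g. lock_page_memcg or isolation) has to keep the
result stable afterwards:

static struct lruvec *lock_page_lruvec_irq_stable(struct page *page)
{
        struct pglist_data *pgdat = page_pgdat(page);
        struct lruvec *lruvec;

        for (;;) {
                lruvec = mem_cgroup_page_lruvec(page, pgdat);
                spin_lock_irq(&lruvec->lru_lock);
                /* did the page get moved to another lruvec meanwhile? */
                if (lruvec == mem_cgroup_page_lruvec(page, pgdat))
                        return lruvec;
                spin_unlock_irq(&lruvec->lru_lock);
        }
}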

2020-01-13 20:22:16

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v7 00/10] per lruvec lru_lock for memcg

On Mon, 13 Jan 2020, Alex Shi wrote:
> On 2020/1/13 at 4:48 PM, Hugh Dickins wrote:
> >
> > I (Hugh) tried to test it on v5.5-rc5, but did not get very far at all -
> > perhaps because my particular interest tends towards tmpfs and swap,
> > and swap always made trouble for lruvec lock - one of the reasons why
> > our patches were more complicated than you thought necessary.
> >
> > Booted a smallish kernel in mem=700M with 1.5G of swap, with intention
> > of running small kernel builds in tmpfs and in ext4-on-loop-on-tmpfs
> > (losetup was the last command started but I doubt it played much part):
> >
> > mount -t tmpfs -o size=470M tmpfs /tst
> > cp /dev/zero /tst
> > losetup /dev/loop0 /tst/zero
>
> Hi Hugh,
>
> Many thanks for the testing!
>
> I am trying to reproduce your testing, do above 3 steps, then build kernel with 'make -j 8' on my qemu. but cannot reproduce the problem with this v7 version or with v8 version, https://github.com/alexshi/linux/tree/lru-next, which fixed the bug KK mentioned, like the following.
> my qemu vmm like this:
>
> [root@debug010000002015 ~]# mount -t tmpfs -o size=470M tmpfs /tst
> [root@debug010000002015 ~]# cp /dev/zero /tst
> cp: error writing ‘/tst/zero’: No space left on device
> cp: failed to extend ‘/tst/zero’: No space left on device
> [root@debug010000002015 ~]# losetup /dev/loop0 /tst/zero
> [root@debug010000002015 ~]# cat /proc/cmdline
> earlyprintk=ttyS0 root=/dev/sda1 console=ttyS0 debug crashkernel=128M printk.devkmsg=on
>
> my kernel configed with MEMCG/MEMCG_SWAP with xfs rootimage, and compiling kernel under ext4. Could you like to share your kernel config and detailed reproduce steps with me? And would you like to try my new version from above github link in your convenient?

I tried with the mods you had appended, from the [PATCH v7 02/10]
discussion with Konstantin: no, it still crashes in a similar way.

Does your github tree have other changes too? I see it says "Latest
commit e05d0dd 22 days ago", which doesn't seem to fit. Afraid I
don't have time to test many variations.

It looks like, in my case, systemd was usually jumping in and doing
something with shmem (perhaps via memfd) that read back from swap
and triggered the crash without any further intervention from me.

So please try booting with mem=700M and 1.5G swap,
mount -t tmpfs -o size=470M tmpfs /tst
cp /dev/zero /tst; cp /tst/zero /dev/null

That's enough to crash it for me, without getting into any losetup or
systemd complications. But you might have to adjust the numbers to be
sure of writing out and reading back from swap.

It's swap to SSD in my case, don't think that matters. I happen to
run with swappiness 100 (precisely to help generate swap problems),
but swappiness 60 is good enough to get these crashes.

Hugh

2020-01-14 06:37:11

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v7 03/10] mm/lru: replace pgdat lru_lock with lruvec lock



On 2020/1/13 at 11:41 PM, Daniel Jordan wrote:
> Hi Alex,
>
> On Wed, Dec 25, 2019 at 05:04:19PM +0800, Alex Shi wrote:
>> @@ -900,6 +904,29 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
>> {
>> return &pgdat->__lruvec;
>> }
>> +#define lock_page_lruvec_irq(page) \
>> +({ \
>> + struct pglist_data *pgdat = page_pgdat(page); \
>> + spin_lock_irq(&pgdat->__lruvec.lru_lock); \
>> + &pgdat->__lruvec; \
>> +})
>> +
>> +#define lock_page_lruvec_irqsave(page, flagsp) \
>> +({ \
>> + struct pglist_data *pgdat = page_pgdat(page); \
>> + spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp); \
>> + &pgdat->__lruvec; \
>> +})
>> +
>> +#define unlock_page_lruvec_irq(lruvec) \
>> +({ \
>> + spin_unlock_irq(&lruvec->lru_lock); \
>> +})
>> +
>> +#define unlock_page_lruvec_irqrestore(lruvec, flags) \
>> +({ \
>> + spin_unlock_irqrestore(&lruvec->lru_lock, flags); \
>> +})
>
> Noticed this while testing your series. These are safe as inline functions, so
> I think you may have gotten the wrong impression when Johannes made this point:
>
> https://lore.kernel.org/linux-mm/[email protected]/
>

Thanks for the reminder, Daniel!

I wrote them as macros since I guessed that would be preferred by some people. It's a bit neater than functions, although it costs a bit of readability.
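
For reference, the same !CONFIG_MEMCG helpers written as static inline
functions would be (a sketch equivalent to the macros quoted above):

static inline struct lruvec *lock_page_lruvec_irq(struct page *page)
{
        struct pglist_data *pgdat = page_pgdat(page);

        spin_lock_irq(&pgdat->__lruvec.lru_lock);
        return &pgdat->__lruvec;
}

static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
                                                      unsigned long *flags)
{
        struct pglist_data *pgdat = page_pgdat(page);

        spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flags);
        return &pgdat->__lruvec;
}

static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
{
        spin_unlock_irq(&lruvec->lru_lock);
}

static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
                                                 unsigned long flags)
{
        spin_unlock_irqrestore(&lruvec->lru_lock, flags);
}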

Thanks a lot! :)
Alex

2020-01-14 09:17:37

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v7 00/10] per lruvec lru_lock for memcg


> I tried with the mods you had appended, from [PATCH v7 02/10]
> discussion with Konstantion: no, still crashes in a similar way.>
> Does your github tree have other changes too? I see it says "Latest
> commit e05d0dd 22 days ago", which doesn't seem to fit. Afraid I
> don't have time to test many variations.

Thanks a lot for testing! The github version is the same as what you tested.
The github branches page has a bug; it doesn't show the correct update time at
https://github.com/alexshi/linux/branches, while the detailed page does:
https://github.com/alexshi/linux/tree/lru-next
>
> It looks like, in my case, systemd was usually jumping in and doing
> something with shmem (perhaps via memfd) that read back from swap
> and triggered the crash without any further intervention from me.
>
> So please try booting with mem=700M and 1.5G swap,
> mount -t tmpfs -o size=470M tmpfs /tst
> cp /dev/zero /tst; cp /tst/zero /dev/null
>
> That's enough to crash it for me, without getting into any losetup or
> systemd complications. But you might have to adjust the numbers to be
> sure of writing out and reading back from swap.
>
> It's swap to SSD in my case, don't think that matters. I happen to
> run with swappiness 100 (precisely to help generate swap problems),
> but swappiness 60 is good enough to get these crashes.
>

I did use 700M of memory and 1.5G of swap in my qemu, but with a swapfile,
not a swap disk.
qemu-system-x86_64 -smp 4 -enable-kvm -cpu SandyBridge \
-m 700M -kernel /home/kuiliang.as/linux/qemulru/arch/x86/boot/bzImage \
-append "earlyprintk=ttyS0 root=/dev/sda1 console=ttyS0 debug crashkernel=128M printk.devkmsg=on " \
-hda /home/kuiliang.as/rootimages/CentOS-7-x86_64-Azure-1703.qcow2 \
-hdb /home/kuiliang.as/rootimages/hdb.qcow2 \
--nographic \

Anyway, although I didn't reproduce the crash, I found a bug in my
debug check:
VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);

If !page->mem_cgroup, this BUG can be triggered, so it seems to be a bug
in the debug check itself, not a real issue. The 9th patch should be
replaced by the following new patch.

Many thanks for testing!
Alex

From ac6d3e2bcfba5727d5c03f9655bb0c7443f655eb Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Mon, 23 Dec 2019 13:33:54 +0800
Subject: [PATCH v8 8/9] mm/lru: add debug checking for page memcg moving

This debug patch can give some clues if something has been left out of
consideration in the lruvec/memcg locking.

Signed-off-by: Alex Shi <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/memcontrol.h | 5 +++++
mm/compaction.c | 2 ++
mm/memcontrol.c | 13 +++++++++++++
3 files changed, 20 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 09e861df48e8..ece88bb11d0f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -421,6 +421,7 @@ struct lruvec *lock_page_lruvec_irq(struct page *);
struct lruvec *lock_page_lruvec_irqsave(struct page *, unsigned long*);
void unlock_page_lruvec_irq(struct lruvec *);
void unlock_page_lruvec_irqrestore(struct lruvec *, unsigned long);
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);

struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);

@@ -1183,6 +1184,10 @@ static inline
void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
{
}
+
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
#endif /* CONFIG_MEMCG */

/* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/mm/compaction.c b/mm/compaction.c
index 8c0a2da217d8..151242817bf4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -971,6 +971,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
locked_lruvec = lruvec;

+ lruvec_memcg_debug(lruvec, page);
+
/* Try get exclusive access under lock */
if (!skip_updated) {
skip_updated = true;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 00fef8ddbd08..a473da8d2275 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1238,6 +1238,17 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
return lruvec;
}

+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+ if (mem_cgroup_disabled())
+ return;
+
+ if (!page->mem_cgroup)
+ VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page);
+ else
+ VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
+}
+
struct lruvec *lock_page_lruvec_irq(struct page *page)
{
struct lruvec *lruvec;
@@ -1247,6 +1258,7 @@ struct lruvec *lock_page_lruvec_irq(struct page *page)
lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
spin_lock_irq(&lruvec->lru_lock);

+ lruvec_memcg_debug(lruvec, page);
return lruvec;
}

@@ -1259,6 +1271,7 @@ struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
spin_lock_irqsave(&lruvec->lru_lock, *flags);

+ lruvec_memcg_debug(lruvec, page);
return lruvec;
}

--
2.22.0

2020-01-14 09:23:48

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v7 02/10] mm/memcg: fold lru_lock in lock_page_lru



On 2020/1/14 at 12:34 AM, Matthew Wilcox wrote:
> On Mon, Jan 13, 2020 at 08:47:25PM +0800, Alex Shi wrote:
>> On 2020/1/13 at 5:55 PM, Konstantin Khlebnikov wrote:
>>>>> That's wrong. Here PageLRU must be checked again under lru_lock.
>>>> Hi, Konstantin,
>>>>
>>>> For logical remain, we can get the lock and then release for !PageLRU.
>>>> but I still can figure out the problem scenario. Would like to give more hints?
>>>
>>> That's trivial race: page could be isolated from lru between
>>>
>>> if (PageLRU(page))
>>> and
>>> spin_lock_irq(&pgdat->lru_lock);
>>
>> yes, it could be a problem. guess the following change could helpful:
>> I will update it in new version.
>
>> + if (lrucare) {
>> + lruvec = lock_page_lruvec_irq(page);
>> + if (likely(PageLRU(page))) {
>> + ClearPageLRU(page);
>> + del_page_from_lru_list(page, lruvec, page_lru(page));
>> + } else {
>> + unlock_page_lruvec_irq(lruvec);
>> + lruvec = NULL;
>> + }
>
> What about a harder race to hit like a page being on LRU list A when you
> look up the lruvec, then it's removed and added to LRU list B by the
> time you get the lock? At that point, you are holding a lock on the
> wrong LRU list. I think you need to check not just that the page
> is still PageLRU but also still on the same LRU list.
>

Thanks for the comments, Matthew!

We will check and lock the lruvec after lock_page_memcg, so if that works as
intended, a page won't be moved from one lruvec to another while we hold the
lock. Also, the later debug patch does this check, to see whether the lruvec
changed.

If you mean the lru list rather than the lruvec, it seems there is no way to
figure out which lru list a page is on from the page itself right now.
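
The intended ordering, spelled out (only a sketch of the idea, not a hunk
from the series):

        struct lruvec *lruvec;

        lock_page_memcg(page);  /* pins page->mem_cgroup against move_account */
        lruvec = lock_page_lruvec_irq(page);

        /* the page cannot switch to another memcg's lruvec while both are held */
        if (PageLRU(page)) {
                ClearPageLRU(page);
                del_page_from_lru_list(page, lruvec, page_lru(page));
        }

        unlock_page_lruvec_irq(lruvec);
        unlock_page_memcg(page);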

Thanks
Alex

2020-01-14 09:32:53

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v7 00/10] per lruvec lru_lock for memcg



On 2020/1/14 at 5:14 PM, Alex Shi wrote:
> Anyway, although I didn't reproduced the bug. but I found a bug in my
> debug function:
> VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
>
> if !page->mem_cgroup, the bug could be triggered, so, seems it's a bug
> for debug function, not real issue. The 9th patch should be replaced by
> the following new patch.

If !page->mem_cgroup, the page belongs to root_mem_cgroup, so the lruvec's
memcg is root_mem_cgroup, not NULL; that is what triggers the issue.


Hi Johannes,

So I have a question about lock_page_memcg in this scenario: should we lock
the page against root_mem_cgroup, or is there no need since no task moves
from root to a leaf memcg?

Thanks
Alex