Hi all,
This patchset move lru_lock into lruvec, give a lru_lock for each of
lruvec, thus bring a lru_lock for each of memcg per node.
According to Daniel Jordan's suggestion, I run 64 'dd' with on 32
containers on my 2s* 8 core * HT box with the modefied case:
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
With this change above lru_lock censitive testing improved 17% with multiple
containers scenario. And no performance lose w/o mem_cgroup.
Thanks Hugh Dickins and Konstantin Khlebnikov, they both brought the same idea
7 years ago. Now I believe considering my testing result, and google internal
using fact. This feature is clearly benefit multi-container users.
So I'd like to introduce it here.
Thanks all the comments from Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan,
Johannes Weiner, Mel Gorman, Shakeel Butt, Rong Chen, Fengguang Wu, Yun Wang etc.
v4:
a, fix the page->mem_cgroup dereferencing issue, thanks Johannes Weiner
b, remove the irqsave flags changes, thanks Metthew Wilcox
c, merge/split patches for better understanding and bisection purpose
v3: rebase on linux-next, and fold the relock fix patch into introduceing patch
v2: bypass a performance regression bug and fix some function issues
v1: initial version, aim testing show 5% performance increase
Alex Shi (9):
mm/swap: fix uninitialized compiler warning
mm/huge_memory: fix uninitialized compiler warning
mm/lru: replace pgdat lru_lock with lruvec lock
mm/mlock: only change the lru_lock iff page's lruvec is different
mm/swap: only change the lru_lock iff page's lruvec is different
mm/vmscan: only change the lru_lock iff page's lruvec is different
mm/pgdat: remove pgdat lru_lock
mm/lru: likely enhancement
mm/lru: revise the comments of lru_lock
Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +----
Documentation/admin-guide/cgroup-v1/memory.rst | 6 +-
Documentation/trace/events-kmem.rst | 2 +-
Documentation/vm/unevictable-lru.rst | 22 +++----
include/linux/memcontrol.h | 68 ++++++++++++++++++++
include/linux/mm_types.h | 2 +-
include/linux/mmzone.h | 5 +-
mm/compaction.c | 67 +++++++++++++------
mm/filemap.c | 4 +-
mm/huge_memory.c | 17 ++---
mm/memcontrol.c | 75 +++++++++++++++++-----
mm/mlock.c | 27 ++++----
mm/mmzone.c | 1 +
mm/page_alloc.c | 1 -
mm/page_idle.c | 5 +-
mm/rmap.c | 2 +-
mm/swap.c | 74 +++++++++------------
mm/vmscan.c | 74 ++++++++++-----------
18 files changed, 287 insertions(+), 180 deletions(-)
--
1.8.3.1
During the pagevec locking, a new page's lruvec is may same as
previous one. Thus we could save a re-locking, and only
change lock iff lruvec is new.
Function named relock_page_lruvec following Hugh Dickins patch.
The first version of this patch used rcu_read_lock to guard
lruvec assign and comparsion with locked_lruvev in relock_page_lruvec.
But Rong Chen <[email protected]> report a regression with
PROVE_LOCKING config. The rcu_read locking causes qspinlock waiting to
be locked for too long.
Since we had hold a spinlock, rcu_read locking isn't necessary.
[[email protected]: Fix RCU-related regression reported by LKP robot]
Signed-off-by: Alex Shi <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Chris Down <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: swkhack <[email protected]>
Cc: "Potyra, Stefan" <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Peng Fan <[email protected]>
Cc: Nikolay Borisov <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yafang Shao <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/memcontrol.h | 44 ++++++++++++++++++++++++++++++++++++++++++++
mm/mlock.c | 16 +++++++++-------
2 files changed, 53 insertions(+), 7 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9538253998a6..19ff453e2822 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1291,6 +1291,50 @@ static inline void dec_lruvec_page_state(struct page *page,
mod_lruvec_page_state(page, idx, -1);
}
+/* Don't lock again iff page's lruvec locked */
+static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
+ struct lruvec *locked_lruvec)
+{
+ struct pglist_data *pgdat = page_pgdat(page);
+ struct lruvec *lruvec;
+
+ if (!locked_lruvec)
+ goto lock;
+
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
+ if (locked_lruvec == lruvec)
+ return lruvec;
+
+ spin_unlock_irq(&locked_lruvec->lru_lock);
+
+lock:
+ lruvec = lock_page_lruvec_irq(page, pgdat);
+ return lruvec;
+}
+
+/* Don't lock again iff page's lruvec locked */
+static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
+ struct lruvec *locked_lruvec, unsigned long *flags)
+{
+ struct pglist_data *pgdat = page_pgdat(page);
+ struct lruvec *lruvec;
+
+ if (!locked_lruvec)
+ goto lock;
+
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
+ if (locked_lruvec == lruvec)
+ return lruvec;
+
+ spin_unlock_irqrestore(&locked_lruvec->lru_lock, *flags);
+
+lock:
+ lruvec = lock_page_lruvec_irqsave(page, pgdat, flags);
+ return lruvec;
+}
+
#ifdef CONFIG_CGROUP_WRITEBACK
struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/mm/mlock.c b/mm/mlock.c
index b509b80b8513..8b3a97b62c0a 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -290,6 +290,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
{
int i;
int nr = pagevec_count(pvec);
+ int delta_munlocked = -nr;
struct pagevec pvec_putback;
struct lruvec *lruvec = NULL;
int pgrescued = 0;
@@ -300,20 +301,19 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
for (i = 0; i < nr; i++) {
struct page *page = pvec->pages[i];
- lruvec = lock_page_lruvec_irq(page, page_pgdat(page));
+ lruvec = relock_page_lruvec_irq(page, lruvec);
if (TestClearPageMlocked(page)) {
/*
* We already have pin from follow_page_mask()
* so we can spare the get_page() here.
*/
- if (__munlock_isolate_lru_page(page, lruvec, false)) {
- __mod_zone_page_state(zone, NR_MLOCK, -1);
- spin_unlock_irq(&lruvec->lru_lock);
+ if (__munlock_isolate_lru_page(page, lruvec, false))
continue;
- } else
+ else
__munlock_isolation_failed(page);
- }
+ } else
+ delta_munlocked++;
/*
* We won't be munlocking this page in the next phase
@@ -323,8 +323,10 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
*/
pagevec_add(&pvec_putback, pvec->pages[i]);
pvec->pages[i] = NULL;
- spin_unlock_irq(&lruvec->lru_lock);
}
+ __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
+ if (lruvec)
+ spin_unlock_irq(&lruvec->lru_lock);
/* Now we can release pins of pages that we are not munlocking */
pagevec_release(&pvec_putback);
--
1.8.3.1
Since we changed the pgdat->lru_lock to lruvec->lru_lock, it's time to
fix the incorrect comments in code. Also fixed some zone->lru_lock comment
error from ancient time. etc.
Originally-from: Hugh Dickins <[email protected]>
Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Cc: Souptick Joarder <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: "Tobin C. Harding" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Arun KS <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: "Jérôme Glisse" <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: Yafang Shao <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +++------------
Documentation/admin-guide/cgroup-v1/memory.rst | 6 +++---
Documentation/trace/events-kmem.rst | 2 +-
Documentation/vm/unevictable-lru.rst | 22 ++++++++--------------
include/linux/mm_types.h | 2 +-
include/linux/mmzone.h | 2 +-
mm/filemap.c | 4 ++--
mm/rmap.c | 2 +-
mm/vmscan.c | 12 ++++++++----
9 files changed, 28 insertions(+), 39 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
index 3f7115e07b5d..0b9f91589d3d 100644
--- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst
+++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
@@ -133,18 +133,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
8. LRU
======
- Each memcg has its own private LRU. Now, its handling is under global
- VM's control (means that it's handled under global pgdat->lru_lock).
- Almost all routines around memcg's LRU is called by global LRU's
- list management functions under pgdat->lru_lock.
-
- A special function is mem_cgroup_isolate_pages(). This scans
- memcg's private LRU and call __isolate_lru_page() to extract a page
- from LRU.
-
- (By __isolate_lru_page(), the page is removed from both of global and
- private LRU.)
-
+ Each memcg has its own vector of LRUs (inactive anon, active anon,
+ inactive file, active file, unevictable) of pages from each node,
+ each LRU handled under a single lru_lock for that memcg and node.
9. Typical Tests.
=================
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 0ae4f564c2d6..60d97e8b7f3c 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -297,13 +297,13 @@ When oom event notifier is registered, event will be delivered.
PG_locked.
mm->page_table_lock
- pgdat->lru_lock
+ lruvec->lru_lock
lock_page_cgroup.
In many cases, just lock_page_cgroup() is called.
- per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
- pgdat->lru_lock, it has no lock of its own.
+ per-node-per-cgroup LRU (cgroup's private LRU) is just guarded by
+ lruvec->lru_lock, it has no lock of its own.
2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
-----------------------------------------------
diff --git a/Documentation/trace/events-kmem.rst b/Documentation/trace/events-kmem.rst
index 555484110e36..68fa75247488 100644
--- a/Documentation/trace/events-kmem.rst
+++ b/Documentation/trace/events-kmem.rst
@@ -69,7 +69,7 @@ When pages are freed in batch, the also mm_page_free_batched is triggered.
Broadly speaking, pages are taken off the LRU lock in bulk and
freed in batch with a page list. Significant amounts of activity here could
indicate that the system is under memory pressure and can also indicate
-contention on the zone->lru_lock.
+contention on the lruvec->lru_lock.
4. Per-CPU Allocator Activity
=============================
diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst
index 17d0861b0f1d..0e1490524f53 100644
--- a/Documentation/vm/unevictable-lru.rst
+++ b/Documentation/vm/unevictable-lru.rst
@@ -33,7 +33,7 @@ reclaim in Linux. The problems have been observed at customer sites on large
memory x86_64 systems.
To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
-main memory will have over 32 million 4k pages in a single zone. When a large
+main memory will have over 32 million 4k pages in a single node. When a large
fraction of these pages are not evictable for any reason [see below], vmscan
will spend a lot of time scanning the LRU lists looking for the small fraction
of pages that are evictable. This can result in a situation where all CPUs are
@@ -55,7 +55,7 @@ unevictable, either by definition or by circumstance, in the future.
The Unevictable Page List
-------------------------
-The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
+The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
called the "unevictable" list and an associated page flag, PG_unevictable, to
indicate that the page is being managed on the unevictable list.
@@ -84,15 +84,9 @@ The unevictable list does not differentiate between file-backed and anonymous,
swap-backed pages. This differentiation is only important while the pages are,
in fact, evictable.
-The unevictable list benefits from the "arrayification" of the per-zone LRU
+The unevictable list benefits from the "arrayification" of the per-node LRU
lists and statistics originally proposed and posted by Christoph Lameter.
-The unevictable list does not use the LRU pagevec mechanism. Rather,
-unevictable pages are placed directly on the page's zone's unevictable list
-under the zone lru_lock. This allows us to prevent the stranding of pages on
-the unevictable list when one task has the page isolated from the LRU and other
-tasks are changing the "evictability" state of the page.
-
Memory Control Group Interaction
--------------------------------
@@ -101,8 +95,8 @@ The unevictable LRU facility interacts with the memory control group [aka
memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by extending the
lru_list enum.
-The memory controller data structure automatically gets a per-zone unevictable
-list as a result of the "arrayification" of the per-zone LRU lists (one per
+The memory controller data structure automatically gets a per-node unevictable
+list as a result of the "arrayification" of the per-node LRU lists (one per
lru_list enum element). The memory controller tracks the movement of pages to
and from the unevictable list.
@@ -196,7 +190,7 @@ for the sake of expediency, to leave a unevictable page on one of the regular
active/inactive LRU lists for vmscan to deal with. vmscan checks for such
pages in all of the shrink_{active|inactive|page}_list() functions and will
"cull" such pages that it encounters: that is, it diverts those pages to the
-unevictable list for the zone being scanned.
+unevictable list for the node being scanned.
There may be situations where a page is mapped into a VM_LOCKED VMA, but the
page is not marked as PG_mlocked. Such pages will make it all the way to
@@ -328,7 +322,7 @@ If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
page from the LRU, as it is likely on the appropriate active or inactive list
at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will put
back the page - by calling putback_lru_page() - which will notice that the page
-is now mlocked and divert the page to the zone's unevictable list. If
+is now mlocked and divert the page to the node's unevictable list. If
mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
it later if and when it attempts to reclaim the page.
@@ -603,7 +597,7 @@ Some examples of these unevictable pages on the LRU lists are:
unevictable list in mlock_vma_page().
shrink_inactive_list() also diverts any unevictable pages that it finds on the
-inactive lists to the appropriate zone's unevictable list.
+inactive lists to the appropriate node's unevictable list.
shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
after shrink_active_list() had moved them to the inactive list, or pages mapped
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 270aa8fd2800..ff08a6a8145c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -78,7 +78,7 @@ struct page {
struct { /* Page cache and anonymous pages */
/**
* @lru: Pageout list, eg. active_list protected by
- * pgdat->lru_lock. Sometimes used as a generic list
+ * lruvec->lru_lock. Sometimes used as a generic list
* by the page owner.
*/
struct list_head lru;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a9334c8d42c8..58ce3186df78 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -115,7 +115,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
struct pglist_data;
/*
- * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
+ * zone->lock and the lru_lock are two of the hottest locks in the kernel.
* So add a wild amount of padding here to ensure that they fall into separate
* cachelines. There are very few zone structures in the machine, so space
* consumption is not a concern here.
diff --git a/mm/filemap.c b/mm/filemap.c
index 1f5731768222..65236af31db3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -101,8 +101,8 @@
* ->swap_lock (try_to_unmap_one)
* ->private_lock (try_to_unmap_one)
* ->i_pages lock (try_to_unmap_one)
- * ->pgdat->lru_lock (follow_page->mark_page_accessed)
- * ->pgdat->lru_lock (check_pte_range->isolate_lru_page)
+ * ->lruvec->lru_lock (follow_page->mark_page_accessed)
+ * ->lruvec->lru_lock (check_pte_range->isolate_lru_page)
* ->private_lock (page_remove_rmap->set_page_dirty)
* ->i_pages lock (page_remove_rmap->set_page_dirty)
* bdi.wb->list_lock (page_remove_rmap->set_page_dirty)
diff --git a/mm/rmap.c b/mm/rmap.c
index 95f381bbaf7f..b32a15d34a9e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -27,7 +27,7 @@
* mapping->i_mmap_rwsem
* anon_vma->rwsem
* mm->page_table_lock or pte_lock
- * pgdat->lru_lock (in mark_page_accessed, isolate_lru_page)
+ * lruvec->lru_lock (in mark_page_accessed, isolate_lru_page)
* swap_lock (in swap_duplicate, swap_info_get)
* mmlist_lock (in mmput, drain_mmlist and others)
* mapping->private_lock (in __set_page_dirty_buffers)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ba57c55a6a41..14e09c14977d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1612,14 +1612,16 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
}
/**
- * pgdat->lru_lock is heavily contended. Some of the functions that
+ * Isolating page from the lruvec to fill in @dst list by nr_to_scan times.
+ *
+ * lruvec->lru_lock is heavily contended. Some of the functions that
* shrink the lists perform better by taking out a batch of pages
* and working on them outside the LRU lock.
*
* For pagecache intensive workloads, this function is the hottest
* spot in the kernel (apart from copy_*_user functions).
*
- * Appropriate locks must be held before calling this function.
+ * Lru_lock must be held before calling this function.
*
* @nr_to_scan: The number of eligible pages to look through on the list.
* @lruvec: The LRU vector to pull pages from.
@@ -1807,14 +1809,16 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
/*
* This moves pages from @list to corresponding LRU list.
+ * The pages from @list is out of any lruvec, and in the end list reuses as
+ * pages_to_free list.
*
* We move them the other way if the page is referenced by one or more
* processes, from rmap.
*
* If the pages are mostly unmapped, the processing is fast and it is
- * appropriate to hold zone_lru_lock across the whole operation. But if
+ * appropriate to hold lru_lock across the whole operation. But if
* the pages are mapped, the processing is slow (page_referenced()) so we
- * should drop zone_lru_lock around each page. It's impossible to balance
+ * should drop lru_lock around each page. It's impossible to balance
* this, so instead we remove the pages from the LRU while processing them.
* It is safe to rely on PG_active against the non-LRU pages in here because
* nobody will play with that bit on a non-LRU page.
--
1.8.3.1
Use likely() to remove speculations according to pagevec usage.
Signed-off-by: Alex Shi <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Chris Down <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/memcontrol.h | 8 ++++----
mm/memcontrol.c | 4 ++--
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 19ff453e2822..1787bbc4b059 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1298,12 +1298,12 @@ static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
struct pglist_data *pgdat = page_pgdat(page);
struct lruvec *lruvec;
- if (!locked_lruvec)
+ if (unlikely(!locked_lruvec))
goto lock;
lruvec = mem_cgroup_page_lruvec(page, pgdat);
- if (locked_lruvec == lruvec)
+ if (likely(locked_lruvec == lruvec))
return lruvec;
spin_unlock_irq(&locked_lruvec->lru_lock);
@@ -1320,12 +1320,12 @@ static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
struct pglist_data *pgdat = page_pgdat(page);
struct lruvec *lruvec;
- if (!locked_lruvec)
+ if (unlikely(!locked_lruvec))
goto lock;
lruvec = mem_cgroup_page_lruvec(page, pgdat);
- if (locked_lruvec == lruvec)
+ if (likely(locked_lruvec == lruvec))
return lruvec;
spin_unlock_irqrestore(&locked_lruvec->lru_lock, *flags);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7e6387ad01f0..8c3ba90bdefc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1257,7 +1257,7 @@ struct lruvec *lock_page_lruvec_irq(struct page *page,
rcu_read_unlock();
/* lruvec may changed in commit_charge() */
- if (lruvec != mem_cgroup_page_lruvec(page, pgdat)) {
+ if (unlikely(lruvec != mem_cgroup_page_lruvec(page, pgdat))) {
spin_unlock_irq(&lruvec->lru_lock);
goto again;
}
@@ -1277,7 +1277,7 @@ struct lruvec *lock_page_lruvec_irqsave(struct page *page,
rcu_read_unlock();
/* lruvec may changed in commit_charge() */
- if (lruvec != mem_cgroup_page_lruvec(page, pgdat)) {
+ if (unlikely(lruvec != mem_cgroup_page_lruvec(page, pgdat))) {
spin_unlock_irqrestore(&lruvec->lru_lock, *flags);
goto again;
}
--
1.8.3.1
Since we introduced relock_page_lruvec, we could use it in more place
to reduce spin_locks.
Signed-off-by: Alex Shi <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Chris Down <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: swkhack <[email protected]>
Cc: "Potyra, Stefan" <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Peng Fan <[email protected]>
Cc: Nikolay Borisov <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yafang Shao <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/swap.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c
index 05fee145e382..a023e6095bd9 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -197,11 +197,12 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];
- lruvec = lock_page_lruvec_irqsave(page, page_pgdat(page), &flags);
+ lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
(*move_fn)(page, lruvec, arg);
- spin_unlock_irqrestore(&lruvec->lru_lock, flags);
}
+ if (lruvec)
+ spin_unlock_irqrestore(&lruvec->lru_lock, flags);
release_pages(pvec->pages, pvec->nr);
pagevec_reinit(pvec);
@@ -820,15 +821,12 @@ void release_pages(struct page **pages, int nr)
}
if (PageLRU(page)) {
- struct lruvec *new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+ struct lruvec *pre_lruvec = lruvec;
- if (new_lruvec != lruvec) {
- if (lruvec)
- spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+ lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
+ if (pre_lruvec != lruvec)
lock_batch = 0;
- lruvec = lock_page_lruvec_irqsave(page, page_pgdat(page), &flags);
- }
VM_BUG_ON_PAGE(!PageLRU(page), page);
__ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_off_lru(page));
--
1.8.3.1
../mm/swap.c: In function ‘__page_cache_release’:
../mm/swap.c:67:10: warning: ‘flags’ may be used uninitialized in this
function [-Wmaybe-uninitialized]
lruvec = lock_page_lruvec_irqsave(page, pgdat, flags);
~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/swap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/swap.c b/mm/swap.c
index 5341ae93861f..c36a10244d07 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -62,7 +62,7 @@ static void __page_cache_release(struct page *page)
if (PageLRU(page)) {
pg_data_t *pgdat = page_pgdat(page);
struct lruvec *lruvec;
- unsigned long flags;
+ unsigned long flags = 0;
spin_lock_irqsave(&pgdat->lru_lock, flags);
lruvec = mem_cgroup_page_lruvec(page, pgdat);
--
1.8.3.1
Now pgdat.lru_lock was replaced by lruvec lock. It's not used anymore.
Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Arun KS <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Alexander Duyck <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/mmzone.h | 1 -
mm/page_alloc.c | 1 -
2 files changed, 2 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a13b8a602ee5..a9334c8d42c8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -734,7 +734,6 @@ struct deferred_split {
/* Write-intensive fields used by page reclaim */
ZONE_PADDING(_pad1_)
- spinlock_t lru_lock;
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f5ecbacb0e04..e8f2c8f755f4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6803,7 +6803,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
init_waitqueue_head(&pgdat->pfmemalloc_wait);
pgdat_page_ext_init(pgdat);
- spin_lock_init(&pgdat->lru_lock);
lruvec_init(&pgdat->__lruvec);
}
--
1.8.3.1
Used relock_page_lruvec in more places for spin_lock reducing.
Signed-off-by: Alex Shi <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Chris Down <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: swkhack <[email protected]>
Cc: "Potyra, Stefan" <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Peng Fan <[email protected]>
Cc: Nikolay Borisov <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yafang Shao <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/vmscan.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3cdf343e7a27..ba57c55a6a41 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1825,22 +1825,25 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
* Returns the number of pages moved to the given lruvec.
*/
-static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
+static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *llvec,
struct list_head *list)
{
int nr_pages, nr_moved = 0;
LIST_HEAD(pages_to_free);
struct page *page;
enum lru_list lru;
+ struct lruvec *lruvec = llvec;
while (!list_empty(list)) {
page = lru_to_page(list);
+ lruvec = relock_page_lruvec_irq(page, lruvec);
+
VM_BUG_ON_PAGE(PageLRU(page), page);
if (unlikely(!page_evictable(page))) {
list_del(&page->lru);
spin_unlock_irq(&lruvec->lru_lock);
+ lruvec = NULL;
putback_lru_page(page);
- spin_lock_irq(&lruvec->lru_lock);
continue;
}
@@ -1858,8 +1861,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
if (unlikely(PageCompound(page))) {
spin_unlock_irq(&lruvec->lru_lock);
+ lruvec = NULL;
(*get_compound_page_dtor(page))(page);
- spin_lock_irq(&lruvec->lru_lock);
} else
list_add(&page->lru, &pages_to_free);
} else {
@@ -1867,6 +1870,11 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
}
}
+ if (lruvec != llvec) {
+ if (lruvec)
+ spin_unlock_irq(&lruvec->lru_lock);
+ spin_lock_irq(&llvec->lru_lock);
+ }
/*
* To save our caller's stack, now use input list for pages to free.
*/
@@ -4289,18 +4297,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
for (i = 0; i < pvec->nr; i++) {
struct page *page = pvec->pages[i];
- struct pglist_data *pgdat = page_pgdat(page);
- struct lruvec *new_lruvec = mem_cgroup_page_lruvec(page, pgdat);
-
pgscanned++;
- if (lruvec != new_lruvec) {
- if (lruvec)
- spin_unlock_irq(&lruvec->lru_lock);
- lruvec = new_lruvec;
- spin_lock_irq(&lruvec->lru_lock);
- }
+ lruvec = relock_page_lruvec_irq(page, lruvec);
if (!PageLRU(page) || !PageUnevictable(page))
continue;
--
1.8.3.1
../mm/huge_memory.c: In function ‘split_huge_page_to_list’:
../mm/huge_memory.c:2766:9: warning: ‘flags’ may be used uninitialized
in this function [-Wmaybe-uninitialized]
lruvec = lock_page_lruvec_irqsave(head, pgdata, flags);
~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/huge_memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 13cc93785006..4cdfdbeb6b2b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2699,7 +2699,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
struct address_space *mapping = NULL;
int count, mapcount, extra_pins, ret;
bool mlocked;
- unsigned long flags;
+ unsigned long flags = 0;
pgoff_t end;
VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
--
1.8.3.1
This patchset move lru_lock into lruvec, give a lru_lock for each of
lruvec, thus bring a lru_lock for each of memcg per node.
This is the main patch to replace per node lru_lock with per memcg
lruvec lock.
We introduce function lock_page_lruvec, it's same as vanilla pgdat lock
when memory cgroup unset, w/o memcg, the function will keep repin the
lruvec's lock to guard from page->mem_cgroup changes in page
migrations between memcgs. (Thanks Hugh Dickins and Konstantin
Khlebnikov reminder on this. Than the core logical is same as their
previous patchs)
According to Daniel Jordan's suggestion, I run 64 'dd' with on 32
containers on my 2s* 8 core * HT box with the modefied case:
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
With this and later patches, the dd performance is 144MB/s, the vanilla
kernel performance is 123MB/s. 17% performance increased.
Signed-off-by: Alex Shi <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Chris Down <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: "Jérôme Glisse" <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: swkhack <[email protected]>
Cc: "Potyra, Stefan" <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Colin Ian King <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Peng Fan <[email protected]>
Cc: Nikolay Borisov <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Yafang Shao <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/memcontrol.h | 24 +++++++++++++++
include/linux/mmzone.h | 2 ++
mm/compaction.c | 67 ++++++++++++++++++++++++++++-------------
mm/huge_memory.c | 15 ++++------
mm/memcontrol.c | 75 +++++++++++++++++++++++++++++++++++-----------
mm/mlock.c | 31 ++++++++++---------
mm/mmzone.c | 1 +
mm/page_idle.c | 5 ++--
mm/swap.c | 74 +++++++++++++++++++--------------------------
mm/vmscan.c | 58 +++++++++++++++++------------------
10 files changed, 214 insertions(+), 138 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5b86287fa069..9538253998a6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -418,6 +418,10 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
+struct lruvec *lock_page_lruvec_irq(struct page *, struct pglist_data *);
+struct lruvec *lock_page_lruvec_irqsave(struct page *, struct pglist_data *,
+ unsigned long*);
+
struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
@@ -901,6 +905,26 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
return &pgdat->__lruvec;
}
+static inline struct lruvec *lock_page_lruvec_irq(struct page *page,
+ struct pglist_data *pgdat)
+{
+ struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
+ spin_lock_irq(&lruvec->lru_lock);
+
+ return lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+ struct pglist_data *pgdat, unsigned long *flags)
+{
+ struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
+ spin_lock_irqsave(&lruvec->lru_lock, *flags);
+
+ return lruvec;
+}
+
static inline bool mm_match_cgroup(struct mm_struct *mm,
struct mem_cgroup *memcg)
{
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 07b784ffbb14..a13b8a602ee5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -267,6 +267,8 @@ struct lruvec {
unsigned long refaults;
/* Various lruvec state flags (enum lruvec_flags) */
unsigned long flags;
+ /* per lruvec lru_lock for memcg */
+ spinlock_t lru_lock;
#ifdef CONFIG_MEMCG
struct pglist_data *pgdat;
#endif
diff --git a/mm/compaction.c b/mm/compaction.c
index d20816b21b55..f8018226fdbc 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -787,7 +787,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
unsigned long nr_scanned = 0, nr_isolated = 0;
struct lruvec *lruvec;
unsigned long flags = 0;
- bool locked = false;
+ struct lruvec *locked_lruvec = NULL;
struct page *page = NULL, *valid_page = NULL;
unsigned long start_pfn = low_pfn;
bool skip_on_failure = false;
@@ -847,11 +847,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
* contention, to give chance to IRQs. Abort completely if
* a fatal signal is pending.
*/
- if (!(low_pfn % SWAP_CLUSTER_MAX)
- && compact_unlock_should_abort(&pgdat->lru_lock,
- flags, &locked, cc)) {
- low_pfn = 0;
- goto fatal_pending;
+ if (!(low_pfn % SWAP_CLUSTER_MAX)) {
+ if (locked_lruvec) {
+ spin_unlock_irqrestore(&locked_lruvec->lru_lock,
+ flags);
+ locked_lruvec = NULL;
+ }
+
+ if (fatal_signal_pending(current)) {
+ cc->contended = true;
+
+ low_pfn = 0;
+ goto fatal_pending;
+ }
+
+ cond_resched();
}
if (!pfn_valid_within(low_pfn))
@@ -920,10 +930,10 @@ static bool too_many_isolated(pg_data_t *pgdat)
*/
if (unlikely(__PageMovable(page)) &&
!PageIsolated(page)) {
- if (locked) {
- spin_unlock_irqrestore(&pgdat->lru_lock,
- flags);
- locked = false;
+ if (locked_lruvec) {
+ spin_unlock_irqrestore(&locked_lruvec->lru_lock,
+ flags);
+ locked_lruvec = NULL;
}
if (!isolate_movable_page(page, isolate_mode))
@@ -949,10 +959,26 @@ static bool too_many_isolated(pg_data_t *pgdat)
if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
goto isolate_fail;
+ rcu_read_lock();
+reget_lruvec:
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
/* If we already hold the lock, we can skip some rechecking */
- if (!locked) {
- locked = compact_lock_irqsave(&pgdat->lru_lock,
- &flags, cc);
+ if (lruvec != locked_lruvec) {
+ if (locked_lruvec) {
+ spin_unlock_irqrestore(&locked_lruvec->lru_lock,
+ flags);
+ locked_lruvec = NULL;
+ }
+ if (compact_lock_irqsave(&lruvec->lru_lock,
+ &flags, cc))
+ locked_lruvec = lruvec;
+
+
+ if (lruvec != mem_cgroup_page_lruvec(page, pgdat))
+ goto reget_lruvec;
+
+ rcu_read_unlock();
/* Try get exclusive access under lock */
if (!skip_updated) {
@@ -974,9 +1000,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
low_pfn += compound_nr(page) - 1;
goto isolate_fail;
}
- }
+ } else
+ rcu_read_unlock();
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
/* Try isolate the page */
if (__isolate_lru_page(page, isolate_mode) != 0)
@@ -1017,9 +1043,10 @@ static bool too_many_isolated(pg_data_t *pgdat)
* page anyway.
*/
if (nr_isolated) {
- if (locked) {
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
- locked = false;
+ if (locked_lruvec) {
+ spin_unlock_irqrestore(&locked_lruvec->lru_lock,
+ flags);
+ locked_lruvec = NULL;
}
putback_movable_pages(&cc->migratepages);
cc->nr_migratepages = 0;
@@ -1044,8 +1071,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
low_pfn = end_pfn;
isolate_abort:
- if (locked)
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ if (locked_lruvec)
+ spin_unlock_irqrestore(&locked_lruvec->lru_lock, flags);
/*
* Updated the cached scanner pfn once the pageblock has been scanned
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4cdfdbeb6b2b..cf37c58f0558 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2495,17 +2495,13 @@ static void __split_huge_page_tail(struct page *head, int tail,
}
static void __split_huge_page(struct page *page, struct list_head *list,
- pgoff_t end, unsigned long flags)
+ struct lruvec *lruvec, pgoff_t end, unsigned long flags)
{
struct page *head = compound_head(page);
- pg_data_t *pgdat = page_pgdat(head);
- struct lruvec *lruvec;
struct address_space *swap_cache = NULL;
unsigned long offset = 0;
int i;
- lruvec = mem_cgroup_page_lruvec(head, pgdat);
-
/* complete memcg works before add pages to LRU */
mem_cgroup_split_huge_fixup(head);
@@ -2554,7 +2550,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
xa_unlock(&head->mapping->i_pages);
}
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ spin_unlock_irqrestore(&lruvec->lru_lock, flags);
remap_page(head);
@@ -2697,6 +2693,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
struct deferred_split *ds_queue = get_deferred_split_queue(page);
struct anon_vma *anon_vma = NULL;
struct address_space *mapping = NULL;
+ struct lruvec *lruvec;
int count, mapcount, extra_pins, ret;
bool mlocked;
unsigned long flags = 0;
@@ -2766,7 +2763,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
lru_add_drain();
/* prevent PageLRU to go away from under us, and freeze lru stats */
- spin_lock_irqsave(&pgdata->lru_lock, flags);
+ lruvec = lock_page_lruvec_irqsave(head, pgdata, &flags);
if (mapping) {
XA_STATE(xas, &mapping->i_pages, page_index(head));
@@ -2797,7 +2794,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
}
spin_unlock(&ds_queue->split_queue_lock);
- __split_huge_page(page, list, end, flags);
+ __split_huge_page(page, list, lruvec, end, flags);
if (PageSwapCache(head)) {
swp_entry_t entry = { .val = page_private(head) };
@@ -2816,7 +2813,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
spin_unlock(&ds_queue->split_queue_lock);
fail: if (mapping)
xa_unlock(&mapping->i_pages);
- spin_unlock_irqrestore(&pgdata->lru_lock, flags);
+ spin_unlock_irqrestore(&lruvec->lru_lock, flags);
remap_page(head);
ret = -EBUSY;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 62470325f9bc..7e6387ad01f0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1210,9 +1210,8 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
* @page: the page
* @pgdat: pgdat of the page
*
- * This function is only safe when following the LRU page isolation
- * and putback protocol: the LRU lock must be held, and the page must
- * either be PageLRU() or the caller must have isolated/allocated it.
+ * The caller needs to grantee the page's mem_cgroup is undisturbed during
+ * using. That could be done by lock_page_memcg or lock_page_lruvec.
*/
struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat)
{
@@ -1225,7 +1224,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
goto out;
}
- memcg = page->mem_cgroup;
+ memcg = READ_ONCE(page->mem_cgroup);
/*
* Swapcache readahead pages are added to the LRU - and
* possibly migrated - before they are charged.
@@ -1246,6 +1245,46 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
return lruvec;
}
+struct lruvec *lock_page_lruvec_irq(struct page *page,
+ struct pglist_data *pgdat)
+{
+ struct lruvec *lruvec;
+
+again:
+ rcu_read_lock();
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ spin_lock_irq(&lruvec->lru_lock);
+ rcu_read_unlock();
+
+ /* lruvec may changed in commit_charge() */
+ if (lruvec != mem_cgroup_page_lruvec(page, pgdat)) {
+ spin_unlock_irq(&lruvec->lru_lock);
+ goto again;
+ }
+
+ return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+ struct pglist_data *pgdat, unsigned long *flags)
+{
+ struct lruvec *lruvec;
+
+again:
+ rcu_read_lock();
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ spin_lock_irqsave(&lruvec->lru_lock, *flags);
+ rcu_read_unlock();
+
+ /* lruvec may changed in commit_charge() */
+ if (lruvec != mem_cgroup_page_lruvec(page, pgdat)) {
+ spin_unlock_irqrestore(&lruvec->lru_lock, *flags);
+ goto again;
+ }
+
+ return lruvec;
+}
+
/**
* mem_cgroup_update_lru_size - account for adding or removing an lru page
* @lruvec: mem_cgroup per zone lru vector
@@ -2571,41 +2610,43 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
css_put_many(&memcg->css, nr_pages);
}
-static void lock_page_lru(struct page *page, int *isolated)
+static struct lruvec *lock_page_lru(struct page *page, int *isolated)
{
pg_data_t *pgdat = page_pgdat(page);
+ struct lruvec *lruvec = lock_page_lruvec_irq(page, pgdat);
- spin_lock_irq(&pgdat->lru_lock);
if (PageLRU(page)) {
- struct lruvec *lruvec;
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_lru(page));
*isolated = 1;
} else
*isolated = 0;
+
+ return lruvec;
}
-static void unlock_page_lru(struct page *page, int isolated)
+static void unlock_page_lru(struct page *page, int isolated,
+ struct lruvec *locked_lruvec)
{
- pg_data_t *pgdat = page_pgdat(page);
+ struct lruvec *lruvec;
- if (isolated) {
- struct lruvec *lruvec;
+ spin_unlock_irq(&locked_lruvec->lru_lock);
+ lruvec = lock_page_lruvec_irq(page, page_pgdat(page));
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ if (isolated) {
VM_BUG_ON_PAGE(PageLRU(page), page);
SetPageLRU(page);
add_page_to_lru_list(page, lruvec, page_lru(page));
}
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
}
static void commit_charge(struct page *page, struct mem_cgroup *memcg,
bool lrucare)
{
int isolated;
+ struct lruvec *lruvec;
VM_BUG_ON_PAGE(page->mem_cgroup, page);
@@ -2614,7 +2655,7 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
* may already be on some other mem_cgroup's LRU. Take care of it.
*/
if (lrucare)
- lock_page_lru(page, &isolated);
+ lruvec = lock_page_lru(page, &isolated);
/*
* Nobody should be changing or seriously looking at
@@ -2633,7 +2674,7 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
page->mem_cgroup = memcg;
if (lrucare)
- unlock_page_lru(page, isolated);
+ unlock_page_lru(page, isolated, lruvec);
}
#ifdef CONFIG_MEMCG_KMEM
@@ -2929,7 +2970,7 @@ void __memcg_kmem_uncharge(struct page *page, int order)
/*
* Because tail pages are not marked as "used", set it. We're under
- * pgdat->lru_lock and migration entries setup in all page mappings.
+ * pgdat->lruvec.lru_lock and migration entries setup in all page mappings.
*/
void mem_cgroup_split_huge_fixup(struct page *head)
{
diff --git a/mm/mlock.c b/mm/mlock.c
index a72c1eeded77..b509b80b8513 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -106,12 +106,10 @@ void mlock_vma_page(struct page *page)
* Isolate a page from LRU with optional get_page() pin.
* Assumes lru_lock already held and page already pinned.
*/
-static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
+static bool __munlock_isolate_lru_page(struct page *page,
+ struct lruvec *lruvec, bool getpage)
{
if (PageLRU(page)) {
- struct lruvec *lruvec;
-
- lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
if (getpage)
get_page(page);
ClearPageLRU(page);
@@ -183,6 +181,7 @@ unsigned int munlock_vma_page(struct page *page)
{
int nr_pages;
pg_data_t *pgdat = page_pgdat(page);
+ struct lruvec *lruvec;
/* For try_to_munlock() and to serialize with page migration */
BUG_ON(!PageLocked(page));
@@ -194,7 +193,7 @@ unsigned int munlock_vma_page(struct page *page)
* might otherwise copy PageMlocked to part of the tail pages before
* we clear it in the head page. It also stabilizes hpage_nr_pages().
*/
- spin_lock_irq(&pgdat->lru_lock);
+ lruvec = lock_page_lruvec_irq(page, pgdat);
if (!TestClearPageMlocked(page)) {
/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
@@ -205,15 +204,15 @@ unsigned int munlock_vma_page(struct page *page)
nr_pages = hpage_nr_pages(page);
__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
- if (__munlock_isolate_lru_page(page, true)) {
- spin_unlock_irq(&pgdat->lru_lock);
+ if (__munlock_isolate_lru_page(page, lruvec, true)) {
+ spin_unlock_irq(&lruvec->lru_lock);
__munlock_isolated_page(page);
goto out;
}
__munlock_isolation_failed(page);
unlock_out:
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
out:
return nr_pages - 1;
@@ -291,28 +290,29 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
{
int i;
int nr = pagevec_count(pvec);
- int delta_munlocked = -nr;
struct pagevec pvec_putback;
+ struct lruvec *lruvec = NULL;
int pgrescued = 0;
pagevec_init(&pvec_putback);
/* Phase 1: page isolation */
- spin_lock_irq(&zone->zone_pgdat->lru_lock);
for (i = 0; i < nr; i++) {
struct page *page = pvec->pages[i];
+ lruvec = lock_page_lruvec_irq(page, page_pgdat(page));
+
if (TestClearPageMlocked(page)) {
/*
* We already have pin from follow_page_mask()
* so we can spare the get_page() here.
*/
- if (__munlock_isolate_lru_page(page, false))
+ if (__munlock_isolate_lru_page(page, lruvec, false)) {
+ __mod_zone_page_state(zone, NR_MLOCK, -1);
+ spin_unlock_irq(&lruvec->lru_lock);
continue;
- else
+ } else
__munlock_isolation_failed(page);
- } else {
- delta_munlocked++;
}
/*
@@ -323,9 +323,8 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
*/
pagevec_add(&pvec_putback, pvec->pages[i]);
pvec->pages[i] = NULL;
+ spin_unlock_irq(&lruvec->lru_lock);
}
- __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
- spin_unlock_irq(&zone->zone_pgdat->lru_lock);
/* Now we can release pins of pages that we are not munlocking */
pagevec_release(&pvec_putback);
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 4686fdc23bb9..3750a90ed4a0 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec)
enum lru_list lru;
memset(lruvec, 0, sizeof(struct lruvec));
+ spin_lock_init(&lruvec->lru_lock);
for_each_lru(lru)
INIT_LIST_HEAD(&lruvec->lists[lru]);
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 295512465065..25f4b1cf3e0f 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -32,6 +32,7 @@ static struct page *page_idle_get_page(unsigned long pfn)
{
struct page *page;
pg_data_t *pgdat;
+ struct lruvec *lruvec;
if (!pfn_valid(pfn))
return NULL;
@@ -42,12 +43,12 @@ static struct page *page_idle_get_page(unsigned long pfn)
return NULL;
pgdat = page_pgdat(page);
- spin_lock_irq(&pgdat->lru_lock);
+ lruvec = lock_page_lruvec_irq(page, pgdat);
if (unlikely(!PageLRU(page))) {
put_page(page);
page = NULL;
}
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
return page;
}
diff --git a/mm/swap.c b/mm/swap.c
index c36a10244d07..05fee145e382 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -64,12 +64,11 @@ static void __page_cache_release(struct page *page)
struct lruvec *lruvec;
unsigned long flags = 0;
- spin_lock_irqsave(&pgdat->lru_lock, flags);
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ lruvec = lock_page_lruvec_irqsave(page, pgdat, &flags);
VM_BUG_ON_PAGE(!PageLRU(page), page);
__ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_off_lru(page));
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ spin_unlock_irqrestore(&lruvec->lru_lock, flags);
}
__ClearPageWaiters(page);
}
@@ -192,26 +191,18 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
void *arg)
{
int i;
- struct pglist_data *pgdat = NULL;
- struct lruvec *lruvec;
+ struct lruvec *lruvec = NULL;
unsigned long flags = 0;
for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];
- struct pglist_data *pagepgdat = page_pgdat(page);
- if (pagepgdat != pgdat) {
- if (pgdat)
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
- pgdat = pagepgdat;
- spin_lock_irqsave(&pgdat->lru_lock, flags);
- }
+ lruvec = lock_page_lruvec_irqsave(page, page_pgdat(page), &flags);
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
(*move_fn)(page, lruvec, arg);
+ spin_unlock_irqrestore(&lruvec->lru_lock, flags);
}
- if (pgdat)
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+
release_pages(pvec->pages, pvec->nr);
pagevec_reinit(pvec);
}
@@ -252,7 +243,6 @@ void rotate_reclaimable_page(struct page *page)
!PageUnevictable(page) && PageLRU(page)) {
struct pagevec *pvec;
unsigned long flags;
-
get_page(page);
local_irq_save(flags);
pvec = this_cpu_ptr(&lru_rotate_pvecs);
@@ -325,11 +315,12 @@ static inline void activate_page_drain(int cpu)
void activate_page(struct page *page)
{
pg_data_t *pgdat = page_pgdat(page);
+ struct lruvec *lruvec;
page = compound_head(page);
- spin_lock_irq(&pgdat->lru_lock);
- __activate_page(page, mem_cgroup_page_lruvec(page, pgdat), NULL);
- spin_unlock_irq(&pgdat->lru_lock);
+ lruvec = lock_page_lruvec_irq(page, pgdat);
+ __activate_page(page, lruvec, NULL);
+ spin_unlock_irq(&lruvec->lru_lock);
}
#endif
@@ -780,8 +771,7 @@ void release_pages(struct page **pages, int nr)
{
int i;
LIST_HEAD(pages_to_free);
- struct pglist_data *locked_pgdat = NULL;
- struct lruvec *lruvec;
+ struct lruvec *lruvec = NULL;
unsigned long uninitialized_var(flags);
unsigned int uninitialized_var(lock_batch);
@@ -791,21 +781,20 @@ void release_pages(struct page **pages, int nr)
/*
* Make sure the IRQ-safe lock-holding time does not get
* excessive with a continuous string of pages from the
- * same pgdat. The lock is held only if pgdat != NULL.
+ * same lruvec. The lock is held only if lruvec != NULL.
*/
- if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
- spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
- locked_pgdat = NULL;
+ if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
+ spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+ lruvec = NULL;
}
if (is_huge_zero_page(page))
continue;
if (is_zone_device_page(page)) {
- if (locked_pgdat) {
- spin_unlock_irqrestore(&locked_pgdat->lru_lock,
- flags);
- locked_pgdat = NULL;
+ if (lruvec) {
+ spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+ lruvec = NULL;
}
/*
* ZONE_DEVICE pages that return 'false' from
@@ -822,27 +811,24 @@ void release_pages(struct page **pages, int nr)
continue;
if (PageCompound(page)) {
- if (locked_pgdat) {
- spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
- locked_pgdat = NULL;
+ if (lruvec) {
+ spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+ lruvec = NULL;
}
__put_compound_page(page);
continue;
}
if (PageLRU(page)) {
- struct pglist_data *pgdat = page_pgdat(page);
+ struct lruvec *new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
- if (pgdat != locked_pgdat) {
- if (locked_pgdat)
- spin_unlock_irqrestore(&locked_pgdat->lru_lock,
- flags);
+ if (new_lruvec != lruvec) {
+ if (lruvec)
+ spin_unlock_irqrestore(&lruvec->lru_lock, flags);
lock_batch = 0;
- locked_pgdat = pgdat;
- spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
- }
+ lruvec = lock_page_lruvec_irqsave(page, page_pgdat(page), &flags);
- lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
+ }
VM_BUG_ON_PAGE(!PageLRU(page), page);
__ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -854,8 +840,8 @@ void release_pages(struct page **pages, int nr)
list_add(&page->lru, &pages_to_free);
}
- if (locked_pgdat)
- spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+ if (lruvec)
+ spin_unlock_irqrestore(&lruvec->lru_lock, flags);
mem_cgroup_uncharge_list(&pages_to_free);
free_unref_page_list(&pages_to_free);
@@ -893,7 +879,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
VM_BUG_ON_PAGE(!PageHead(page), page);
VM_BUG_ON_PAGE(PageCompound(page_tail), page);
VM_BUG_ON_PAGE(PageLRU(page_tail), page);
- lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+ lockdep_assert_held(&lruvec->lru_lock);
if (!list)
SetPageLRU(page_tail);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d97985262dda..3cdf343e7a27 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1755,8 +1755,7 @@ int isolate_lru_page(struct page *page)
pg_data_t *pgdat = page_pgdat(page);
struct lruvec *lruvec;
- spin_lock_irq(&pgdat->lru_lock);
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ lruvec = lock_page_lruvec_irq(page, pgdat);
if (PageLRU(page)) {
int lru = page_lru(page);
get_page(page);
@@ -1764,7 +1763,7 @@ int isolate_lru_page(struct page *page)
del_page_from_lru_list(page, lruvec, lru);
ret = 0;
}
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
}
return ret;
}
@@ -1829,7 +1828,6 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
struct list_head *list)
{
- struct pglist_data *pgdat = lruvec_pgdat(lruvec);
int nr_pages, nr_moved = 0;
LIST_HEAD(pages_to_free);
struct page *page;
@@ -1840,12 +1838,11 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
VM_BUG_ON_PAGE(PageLRU(page), page);
if (unlikely(!page_evictable(page))) {
list_del(&page->lru);
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
putback_lru_page(page);
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
continue;
}
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
SetPageLRU(page);
lru = page_lru(page);
@@ -1860,9 +1857,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
del_page_from_lru_list(page, lruvec, lru);
if (unlikely(PageCompound(page))) {
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
(*get_compound_page_dtor(page))(page);
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
} else
list_add(&page->lru, &pages_to_free);
} else {
@@ -1925,7 +1922,7 @@ static int current_may_throttle(void)
lru_add_drain();
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
&nr_scanned, sc, lru);
@@ -1937,7 +1934,7 @@ static int current_may_throttle(void)
if (!cgroup_reclaim(sc))
__count_vm_events(item, nr_scanned);
__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
if (nr_taken == 0)
return 0;
@@ -1945,7 +1942,7 @@ static int current_may_throttle(void)
nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
&stat, false);
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
if (!cgroup_reclaim(sc))
@@ -1958,7 +1955,7 @@ static int current_may_throttle(void)
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
mem_cgroup_uncharge_list(&page_list);
free_unref_page_list(&page_list);
@@ -2011,7 +2008,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
lru_add_drain();
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
&nr_scanned, sc, lru);
@@ -2022,7 +2019,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
__count_vm_events(PGREFILL, nr_scanned);
__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
while (!list_empty(&l_hold)) {
cond_resched();
@@ -2068,7 +2065,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
/*
* Move pages back to the lru list.
*/
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
/*
* Count referenced pages from currently used mappings as rotated,
* even though only some of them are actually re-activated. This
@@ -2086,7 +2083,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
mem_cgroup_uncharge_list(&l_active);
free_unref_page_list(&l_active);
@@ -2371,7 +2368,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES) +
lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, MAX_NR_ZONES);
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
reclaim_stat->recent_scanned[0] /= 2;
reclaim_stat->recent_rotated[0] /= 2;
@@ -2392,7 +2389,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
fp /= reclaim_stat->recent_rotated[1] + 1;
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
fraction[0] = ap;
fraction[1] = fp;
@@ -4285,24 +4282,25 @@ int page_evictable(struct page *page)
*/
void check_move_unevictable_pages(struct pagevec *pvec)
{
- struct lruvec *lruvec;
- struct pglist_data *pgdat = NULL;
+ struct lruvec *lruvec = NULL;
int pgscanned = 0;
int pgrescued = 0;
int i;
for (i = 0; i < pvec->nr; i++) {
struct page *page = pvec->pages[i];
- struct pglist_data *pagepgdat = page_pgdat(page);
+ struct pglist_data *pgdat = page_pgdat(page);
+ struct lruvec *new_lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
pgscanned++;
- if (pagepgdat != pgdat) {
- if (pgdat)
- spin_unlock_irq(&pgdat->lru_lock);
- pgdat = pagepgdat;
- spin_lock_irq(&pgdat->lru_lock);
+
+ if (lruvec != new_lruvec) {
+ if (lruvec)
+ spin_unlock_irq(&lruvec->lru_lock);
+ lruvec = new_lruvec;
+ spin_lock_irq(&lruvec->lru_lock);
}
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
if (!PageLRU(page) || !PageUnevictable(page))
continue;
@@ -4318,10 +4316,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
}
}
- if (pgdat) {
+ if (lruvec) {
__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
}
}
EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
--
1.8.3.1
On Tue, Nov 19, 2019 at 08:23:15PM +0800, Alex Shi wrote:
> ../mm/swap.c: In function ‘__page_cache_release’:
> ../mm/swap.c:67:10: warning: ‘flags’ may be used uninitialized in this
> function [-Wmaybe-uninitialized]
> lruvec = lock_page_lruvec_irqsave(page, pgdat, flags);
> ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
At this point in the series, there is no lock_page_lruvec_irqsave()
yet, so this patch is out of order. Please don't do that. It's very
hard to review patches that depend on later changes to make sense.
> Signed-off-by: Alex Shi <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> mm/swap.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 5341ae93861f..c36a10244d07 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -62,7 +62,7 @@ static void __page_cache_release(struct page *page)
> if (PageLRU(page)) {
> pg_data_t *pgdat = page_pgdat(page);
> struct lruvec *lruvec;
> - unsigned long flags;
> + unsigned long flags = 0;
Use unintialized_var() for these cases to make the intent clear.
On Tue, Nov 19, 2019 at 08:23:16PM +0800, Alex Shi wrote:
> ../mm/huge_memory.c: In function ‘split_huge_page_to_list’:
> ../mm/huge_memory.c:2766:9: warning: ‘flags’ may be used uninitialized
> in this function [-Wmaybe-uninitialized]
> lruvec = lock_page_lruvec_irqsave(head, pgdata, flags);
> ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Like with the previous patch, there is no lock_page_lruvec_irqsave()
at this point in the series.
On Tue, Nov 19, 2019 at 08:23:17PM +0800, Alex Shi wrote:
> +static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
> + struct pglist_data *pgdat, unsigned long *flags)
> +{
> + struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +
> + spin_lock_irqsave(&lruvec->lru_lock, *flags);
> +
> + return lruvec;
> +}
This should be a macro, not a function. You basically can't do this;
spin_lock_irqsave needs to write to a variable which can then be passed
to spin_unlock_irqrestore(). What you're doing here will dereference the
pointer in _this_ function, but won't propagate the modified value back to
the caller. I suppose you could do something like this ...
static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
struct pglist_data *pgdat, unsigned long *flagsp)
{
unsigned long flags;
struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
spin_lock_irqsave(&lruvec->lru_lock, flags);
*flagsp = flags;
return lruvec;
}
Almost certainly easier to write a macro though.
You shouldn't need the two prior patches with this kind of change.
On Tue, Nov 19, 2019 at 08:23:17PM +0800, Alex Shi wrote:
> This patchset move lru_lock into lruvec, give a lru_lock for each of
> lruvec, thus bring a lru_lock for each of memcg per node.
>
> This is the main patch to replace per node lru_lock with per memcg
> lruvec lock.
>
> We introduce function lock_page_lruvec, it's same as vanilla pgdat lock
> when memory cgroup unset, w/o memcg, the function will keep repin the
> lruvec's lock to guard from page->mem_cgroup changes in page
> migrations between memcgs. (Thanks Hugh Dickins and Konstantin
> Khlebnikov reminder on this. Than the core logical is same as their
> previous patchs)
>
> According to Daniel Jordan's suggestion, I run 64 'dd' with on 32
> containers on my 2s* 8 core * HT box with the modefied case:
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
>
> With this and later patches, the dd performance is 144MB/s, the vanilla
> kernel performance is 123MB/s. 17% performance increased.
>
> Signed-off-by: Alex Shi <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Vladimir Davydov <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Roman Gushchin <[email protected]>
> Cc: Shakeel Butt <[email protected]>
> Cc: Chris Down <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Qian Cai <[email protected]>
> Cc: Andrey Ryabinin <[email protected]>
> Cc: "Kirill A. Shutemov" <[email protected]>
> Cc: "J?r?me Glisse" <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Yang Shi <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: "Aneesh Kumar K.V" <[email protected]>
> Cc: swkhack <[email protected]>
> Cc: "Potyra, Stefan" <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Cc: Stephen Rothwell <[email protected]>
> Cc: Colin Ian King <[email protected]>
> Cc: Jason Gunthorpe <[email protected]>
> Cc: Mauro Carvalho Chehab <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Peng Fan <[email protected]>
> Cc: Nikolay Borisov <[email protected]>
> Cc: Ira Weiny <[email protected]>
> Cc: Kirill Tkhai <[email protected]>
> Cc: Yafang Shao <[email protected]>
> Cc: Konstantin Khlebnikov <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> ---
> include/linux/memcontrol.h | 24 +++++++++++++++
> include/linux/mmzone.h | 2 ++
> mm/compaction.c | 67 ++++++++++++++++++++++++++++-------------
> mm/huge_memory.c | 15 ++++------
> mm/memcontrol.c | 75 +++++++++++++++++++++++++++++++++++-----------
> mm/mlock.c | 31 ++++++++++---------
> mm/mmzone.c | 1 +
> mm/page_idle.c | 5 ++--
> mm/swap.c | 74 +++++++++++++++++++--------------------------
> mm/vmscan.c | 58 +++++++++++++++++------------------
> 10 files changed, 214 insertions(+), 138 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5b86287fa069..9538253998a6 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -418,6 +418,10 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
>
> struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
>
> +struct lruvec *lock_page_lruvec_irq(struct page *, struct pglist_data *);
> +struct lruvec *lock_page_lruvec_irqsave(struct page *, struct pglist_data *,
> + unsigned long*);
> +
> struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>
> struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
> @@ -901,6 +905,26 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
> return &pgdat->__lruvec;
> }
>
> +static inline struct lruvec *lock_page_lruvec_irq(struct page *page,
> + struct pglist_data *pgdat)
> +{
> + struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +
> + spin_lock_irq(&lruvec->lru_lock);
> +
> + return lruvec;
While this works in practice, it looks wrong because it doesn't follow
the mem_cgroup_page_lruvec() rules.
Please open-code spin_lock_irq(&pgdat->__lruvec->lru_lock) instead.
> @@ -1246,6 +1245,46 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
> return lruvec;
> }
>
> +struct lruvec *lock_page_lruvec_irq(struct page *page,
> + struct pglist_data *pgdat)
> +{
> + struct lruvec *lruvec;
> +
> +again:
> + rcu_read_lock();
> + lruvec = mem_cgroup_page_lruvec(page, pgdat);
> + spin_lock_irq(&lruvec->lru_lock);
> + rcu_read_unlock();
The spinlock doesn't prevent the lruvec from being freed.
You deleted the rules from the mem_cgroup_page_lruvec() documentation,
but they still apply: if the page is already !PageLRU() by the time
you get here, it could get reclaimed or migrated to another cgroup,
and that can free the memcg/lruvec. Merely having the lru_lock held
does not prevent this.
Either the page needs to be locked, or the page needs to be PageLRU
with the lru_lock held to prevent somebody else from isolating
it. Otherwise, the lruvec is not safe to use.
On Tue, Nov 19, 2019 at 08:23:23PM +0800, Alex Shi wrote:
> Since we changed the pgdat->lru_lock to lruvec->lru_lock, it's time to
> fix the incorrect comments in code. Also fixed some zone->lru_lock comment
> error from ancient time. etc.
>
> Originally-from: Hugh Dickins <[email protected]>
You can resubmit a patch written by somebody else, but you need to
preserve authorship such that git can attribute it properly:
From: Hugh Dickins <[email protected]>
in the patch header, as well as
Signed-off-by: Hugh Dickins <[email protected]>
in the changelog tags. It should look something like this:
---
From: Hugh Dickins <[email protected]>
Subject: [PATCH] mm/memcg: update lru_lock Doc and comments
Update scattered references from "zone_lru_lock" to "lruvec->lru_lock"
(in low-level descriptions) or "lruvec lock" (where that reads better).
In the course of doing so, update lock ordering comments in mm/rmap.c
and mm/filemap.c and Documentation/cgroups/memory.txt - slightly, with
no attempt to be complete (swap_lock, in particular, left out-of-date).
Remove allusions to set_page_dirty under page_remove_rmap: gone in v3.9.
memcg_test.txt looks quite out-of-date: just give LRU a short paragraph.
Replaced zone by node throughout unevictable-lru.txt (except for stats).
Leave Documentation/locking/lockstat.txt untouched this time: it doesn't
matter if that still refers to zone->lru_lock, along with other history.
I struggled to understand the comment above move_pages_to_lru() (surely
it never calls page_referenced()), and eventually realized that most of
it had got separated from shrink_active_list(): move that comment back.
Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Alex Shi <[email protected]>
---
On Tue, Nov 19, 2019 at 07:57:04AM -0800, Matthew Wilcox wrote:
> On Tue, Nov 19, 2019 at 08:23:17PM +0800, Alex Shi wrote:
> > +static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
> > + struct pglist_data *pgdat, unsigned long *flags)
> > +{
> > + struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
> > +
> > + spin_lock_irqsave(&lruvec->lru_lock, *flags);
> > +
> > + return lruvec;
> > +}
>
> This should be a macro, not a function. You basically can't do this;
> spin_lock_irqsave needs to write to a variable which can then be passed
> to spin_unlock_irqrestore(). What you're doing here will dereference the
> pointer in _this_ function, but won't propagate the modified value back to
> the caller. I suppose you could do something like this ...
This works because spin_lock_irqsave and local_irq_save() are
macros. It boils down to '*flags = arch_local_irq_save()' in this
function, and therefor does the right thing.
We exploit that in quite a few places:
$ git grep 'spin_lock_irqsave(.*\*flags' | wc -l
39
On Tue, Nov 19, 2019 at 4:24 AM Alex Shi <[email protected]> wrote:
>
> This patchset move lru_lock into lruvec, give a lru_lock for each of
> lruvec, thus bring a lru_lock for each of memcg per node.
>
> This is the main patch to replace per node lru_lock with per memcg
> lruvec lock.
>
> We introduce function lock_page_lruvec, it's same as vanilla pgdat lock
> when memory cgroup unset, w/o memcg, the function will keep repin the
> lruvec's lock to guard from page->mem_cgroup changes in page
> migrations between memcgs. (Thanks Hugh Dickins and Konstantin
> Khlebnikov reminder on this. Than the core logical is same as their
> previous patchs)
>
> According to Daniel Jordan's suggestion, I run 64 'dd' with on 32
> containers on my 2s* 8 core * HT box with the modefied case:
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
>
> With this and later patches, the dd performance is 144MB/s, the vanilla
> kernel performance is 123MB/s. 17% performance increased.
>
> Signed-off-by: Alex Shi <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Vladimir Davydov <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Roman Gushchin <[email protected]>
> Cc: Shakeel Butt <[email protected]>
> Cc: Chris Down <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Qian Cai <[email protected]>
> Cc: Andrey Ryabinin <[email protected]>
> Cc: "Kirill A. Shutemov" <[email protected]>
> Cc: "Jérôme Glisse" <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Yang Shi <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: "Aneesh Kumar K.V" <[email protected]>
> Cc: swkhack <[email protected]>
> Cc: "Potyra, Stefan" <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Cc: Stephen Rothwell <[email protected]>
> Cc: Colin Ian King <[email protected]>
> Cc: Jason Gunthorpe <[email protected]>
> Cc: Mauro Carvalho Chehab <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Peng Fan <[email protected]>
> Cc: Nikolay Borisov <[email protected]>
> Cc: Ira Weiny <[email protected]>
> Cc: Kirill Tkhai <[email protected]>
> Cc: Yafang Shao <[email protected]>
> Cc: Konstantin Khlebnikov <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
This patch (and series) still have unsafe accesses to lruvec.
Alex, I was hoping that you would drop this series in favor of Hugh's
patches. Anyways I will post Hugh patches for review to be considered
for 5.6. I will run a couple of performance experiments.
Shakeel
?? 2019/11/20 ????12:04, Johannes Weiner д??:
>> +
>> + return lruvec;
> While this works in practice, it looks wrong because it doesn't follow
> the mem_cgroup_page_lruvec() rules.
>
> Please open-code spin_lock_irq(&pgdat->__lruvec->lru_lock) instead.
>
That's right. Thanks for suggestion!
>> @@ -1246,6 +1245,46 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
>> return lruvec;
>> }
>>
>> +struct lruvec *lock_page_lruvec_irq(struct page *page,
>> + struct pglist_data *pgdat)
>> +{
>> + struct lruvec *lruvec;
>> +
>> +again:
>> + rcu_read_lock();
>> + lruvec = mem_cgroup_page_lruvec(page, pgdat);
>> + spin_lock_irq(&lruvec->lru_lock);
>> + rcu_read_unlock();
> The spinlock doesn't prevent the lruvec from being freed
>
> You deleted the rules from the mem_cgroup_page_lruvec() documentation,
> but they still apply: if the page is already !PageLRU() by the time
> you get here, it could get reclaimed or migrated to another cgroup,
> and that can free the memcg/lruvec. Merely having the lru_lock held
> does not prevent this.
Forgive my idiot, I still don't know the details of unsafe lruvec here.
From my shortsight, the spin_lock_irq(embedded a preempt_disable) could block all rcu syncing thus, keep all memcg alive until the preempt_enabled in unspinlock, is this right?
If so even the page->mem_cgroup is migrated to others cgroups, the new and old cgroup should still be alive here.
>
> Either the page needs to be locked, or the page needs to be PageLRU
> with the lru_lock held to prevent somebody else from isolating
> it. Otherwise, the lruvec is not safe to use.
Do you mean that we may get the wrong lruvec->lru_lock if !PageLRU, so, the page may got freed by others? Sorry I got last there.
Thanks
Alex
在 2019/11/19 下午11:41, Johannes Weiner 写道:
> On Tue, Nov 19, 2019 at 08:23:15PM +0800, Alex Shi wrote:
>> ../mm/swap.c: In function ‘__page_cache_release’:
>> ../mm/swap.c:67:10: warning: ‘flags’ may be used uninitialized in this
>> function [-Wmaybe-uninitialized]
>> lruvec = lock_page_lruvec_irqsave(page, pgdat, flags);
>> ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> At this point in the series, there is no lock_page_lruvec_irqsave()
> yet, so this patch is out of order. Please don't do that. It's very
> hard to review patches that depend on later changes to make sense.
>
Sorry, I will be more careful on this.
>> Signed-off-by: Alex Shi <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> ---
>> mm/swap.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/swap.c b/mm/swap.c
>> index 5341ae93861f..c36a10244d07 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -62,7 +62,7 @@ static void __page_cache_release(struct page *page)
>> if (PageLRU(page)) {
>> pg_data_t *pgdat = page_pgdat(page);
>> struct lruvec *lruvec;
>> - unsigned long flags;
>> + unsigned long flags = 0;
> Use unintialized_var() for these cases to make the intent clear.
Thanks for suggestion!
?? 2019/11/20 ????12:44, Johannes Weiner д??:
> On Tue, Nov 19, 2019 at 07:57:04AM -0800, Matthew Wilcox wrote:
>> On Tue, Nov 19, 2019 at 08:23:17PM +0800, Alex Shi wrote:
>>> +static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
>>> + struct pglist_data *pgdat, unsigned long *flags)
>>> +{
>>> + struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
>>> +
>>> + spin_lock_irqsave(&lruvec->lru_lock, *flags);
>>> +
>>> + return lruvec;
>>> +}
>>
>> This should be a macro, not a function. You basically can't do this;
>> spin_lock_irqsave needs to write to a variable which can then be passed
>> to spin_unlock_irqrestore(). What you're doing here will dereference the
>> pointer in _this_ function, but won't propagate the modified value back to
>> the caller. I suppose you could do something like this ...
>
> This works because spin_lock_irqsave and local_irq_save() are
> macros. It boils down to '*flags = arch_local_irq_save()' in this
> function, and therefor does the right thing.
>
> We exploit that in quite a few places:
>
> $ git grep 'spin_lock_irqsave(.*\*flags' | wc -l
> 39
>
Yes, thank you all for point this!
?? 2019/11/20 ????12:19, Johannes Weiner д??:
> On Tue, Nov 19, 2019 at 08:23:23PM +0800, Alex Shi wrote:
>> Since we changed the pgdat->lru_lock to lruvec->lru_lock, it's time to
>> fix the incorrect comments in code. Also fixed some zone->lru_lock comment
>> error from ancient time. etc.
>>
>> Originally-from: Hugh Dickins <[email protected]>
>
> You can resubmit a patch written by somebody else, but you need to
> preserve authorship such that git can attribute it properly:
>
> From: Hugh Dickins <[email protected]>
>
> in the patch header, as well as
>
> Signed-off-by: Hugh Dickins <[email protected]>
>
> in the changelog tags. It should look something like this:
>
> ---
> From: Hugh Dickins <[email protected]>
> Subject: [PATCH] mm/memcg: update lru_lock Doc and comments
>
> Update scattered references from "zone_lru_lock" to "lruvec->lru_lock"
> (in low-level descriptions) or "lruvec lock" (where that reads better).
>
> In the course of doing so, update lock ordering comments in mm/rmap.c
> and mm/filemap.c and Documentation/cgroups/memory.txt - slightly, with
> no attempt to be complete (swap_lock, in particular, left out-of-date).
> Remove allusions to set_page_dirty under page_remove_rmap: gone in v3.9.
>
> memcg_test.txt looks quite out-of-date: just give LRU a short paragraph.
> Replaced zone by node throughout unevictable-lru.txt (except for stats).
> Leave Documentation/locking/lockstat.txt untouched this time: it doesn't
> matter if that still refers to zone->lru_lock, along with other history.
>
> I struggled to understand the comment above move_pages_to_lru() (surely
> it never calls page_referenced()), and eventually realized that most of
> it had got separated from shrink_active_list(): move that comment back.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Signed-off-by: Alex Shi <[email protected]>
> ---
>
Thanks for teaching!
On Wed, Nov 20, 2019 at 07:41:44PM +0800, Alex Shi wrote:
> 在 2019/11/20 上午12:04, Johannes Weiner 写道:
> >> @@ -1246,6 +1245,46 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
> >> return lruvec;
> >> }
> >>
> >> +struct lruvec *lock_page_lruvec_irq(struct page *page,
> >> + struct pglist_data *pgdat)
> >> +{
> >> + struct lruvec *lruvec;
> >> +
> >> +again:
> >> + rcu_read_lock();
> >> + lruvec = mem_cgroup_page_lruvec(page, pgdat);
> >> + spin_lock_irq(&lruvec->lru_lock);
> >> + rcu_read_unlock();
> > The spinlock doesn't prevent the lruvec from being freed
> >
> > You deleted the rules from the mem_cgroup_page_lruvec() documentation,
> > but they still apply: if the page is already !PageLRU() by the time
> > you get here, it could get reclaimed or migrated to another cgroup,
> > and that can free the memcg/lruvec. Merely having the lru_lock held
> > does not prevent this.
>
>
> Forgive my idiot, I still don't know the details of unsafe lruvec here.
> From my shortsight, the spin_lock_irq(embedded a preempt_disable) could block all rcu syncing thus, keep all memcg alive until the preempt_enabled in unspinlock, is this right?
> If so even the page->mem_cgroup is migrated to others cgroups, the new and old cgroup should still be alive here.
You are right about the freeing part, I missed this. And I should have
read this email here before sending out my "fix" to the current code;
thankfully Hugh re-iterated my mistake on that thread. My apologies.
But I still don't understand how the moving part is safe. You look up
the lruvec optimistically, lock it, then verify the lookup. What keeps
page->mem_cgroup from changing after you verified it?
lock_page_lruvec(): mem_cgroup_move_account():
again:
rcu_read_lock()
lruvec = page->mem_cgroup->lruvec
isolate_lru_page()
spin_lock_irq(&lruvec->lru_lock)
rcu_read_unlock()
if page->mem_cgroup->lruvec != lruvec:
spin_unlock_irq(&lruvec->lru_lock)
goto again;
page->mem_cgroup = new cgroup
putback_lru_page() // new lruvec
SetPageLRU()
return lruvec; // old lruvec
The caller assumes page belongs to the returned lruvec and will then
change the page's lru state with a mismatched page and lruvec.
If we could restrict lock_page_lruvec() to working only on PageLRU
pages, we could fix the problem with memory barriers. But this won't
work for split_huge_page(), which is AFAICT the only user that needs
to freeze the lru state of a page that could be isolated elsewhere.
So AFAICS the only option is to lock out mem_cgroup_move_account()
entirely when the lru_lock is held. Which I guess should be fine.
在 2019/11/22 上午6:06, Johannes Weiner 写道:
>>
>> Forgive my idiot, I still don't know the details of unsafe lruvec here.
>> From my shortsight, the spin_lock_irq(embedded a preempt_disable) could block all rcu syncing thus, keep all memcg alive until the preempt_enabled in unspinlock, is this right?
>> If so even the page->mem_cgroup is migrated to others cgroups, the new and old cgroup should still be alive here.
> You are right about the freeing part, I missed this. And I should have
> read this email here before sending out my "fix" to the current code;
> thankfully Hugh re-iterated my mistake on that thread. My apologies.
>
That's all right. You and Hugh do give me a lot of help! :)
> But I still don't understand how the moving part is safe. You look up
> the lruvec optimistically, lock it, then verify the lookup. What keeps
> page->mem_cgroup from changing after you verified it?
>
> lock_page_lruvec(): mem_cgroup_move_account():
> again:
> rcu_read_lock()
> lruvec = page->mem_cgroup->lruvec
> isolate_lru_page()
> spin_lock_irq(&lruvec->lru_lock)
> rcu_read_unlock()
> if page->mem_cgroup->lruvec != lruvec:
> spin_unlock_irq(&lruvec->lru_lock)
> goto again;
> page->mem_cgroup = new cgroup
> putback_lru_page() // new lruvec
> SetPageLRU()
> return lruvec; // old lruvec
>
> The caller assumes page belongs to the returned lruvec and will then
> change the page's lru state with a mismatched page and lruvec.
>
Yes, that's the problem we have to deal.
> If we could restrict lock_page_lruvec() to working only on PageLRU
> pages, we could fix the problem with memory barriers. But this won't
> work for split_huge_page(), which is AFAICT the only user that needs
> to freeze the lru state of a page that could be isolated elsewhere.
>
> So AFAICS the only option is to lock out mem_cgroup_move_account()
> entirely when the lru_lock is held. Which I guess should be fine.
I guess we can try from lock_page_memcg, is that a good start?
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7e6387ad01f0..f4bbbf72c5b8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1224,7 +1224,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
goto out;
}
- memcg = page->mem_cgroup;
+ memcg = lock_page_memcg(page);
/*
* Swapcache readahead pages are added to the LRU - and
* possibly migrated - before they are charged.
Thanks a lot!
Alex
On Fri, Nov 22, 2019 at 10:36:32AM +0800, Alex Shi wrote:
> 在 2019/11/22 上午6:06, Johannes Weiner 写道:
> > If we could restrict lock_page_lruvec() to working only on PageLRU
> > pages, we could fix the problem with memory barriers. But this won't
> > work for split_huge_page(), which is AFAICT the only user that needs
> > to freeze the lru state of a page that could be isolated elsewhere.
> >
> > So AFAICS the only option is to lock out mem_cgroup_move_account()
> > entirely when the lru_lock is held. Which I guess should be fine.
>
> I guess we can try from lock_page_memcg, is that a good start?
Yes.
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7e6387ad01f0..f4bbbf72c5b8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1224,7 +1224,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
> goto out;
> }
>
> - memcg = page->mem_cgroup;
> + memcg = lock_page_memcg(page);
> /*
> * Swapcache readahead pages are added to the LRU - and
> * possibly migrated - before they are charged.
test_clear_page_writeback() calls this function with that lock already
held so that would deadlock. Let's keep locking in lock_page_lruvec().
lock_page_lruvec():
memcg = lock_page_memcg(page);
lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
spin_lock_irqsave(&lruvec->lru_lock, *flags);
return lruvec;
unlock_lruvec();
spin_unlock_irqrestore(&lruvec->lru_lock);
__unlock_page_memcg(lruvec_memcg(lruvec));
The lock ordering should be fine as well. But it might be a good idea
to stick a might_lock(&memcg->move_lock) in lock_page_memcg() before
that atomic_read() and test with lockdep enabled.
But that leaves me with one more worry: compaction. We locked out
charge moving now, so between that and knowing that the page is alive,
we have page->mem_cgroup stable. But compaction doesn't know whether
the page is alive - it comes from a pfn and finds out using PageLRU.
In the current code, pgdat->lru_lock remains the same before and after
the page is charged to a cgroup, so once compaction has that locked
and it observes PageLRU, it can go ahead and isolate the page.
But lruvec->lru_lock changes during charging, and then compaction may
hold the wrong lock during isolation:
compaction: generic_file_buffered_read:
page_cache_alloc()
!PageBuddy()
lock_page_lruvec(page)
lruvec = mem_cgroup_page_lruvec()
spin_lock(&lruvec->lru_lock)
if lruvec != mem_cgroup_page_lruvec()
goto again
add_to_page_cache_lru()
mem_cgroup_commit_charge()
page->mem_cgroup = foo
lru_cache_add()
__pagevec_lru_add()
SetPageLRU()
if PageLRU(page):
__isolate_lru_page()
I don't see what prevents the lruvec from changing under compaction,
neither in your patches nor in Hugh's. Maybe I'm missing something?
On Fri, 22 Nov 2019, Johannes Weiner wrote:
>
> But that leaves me with one more worry: compaction. We locked out
> charge moving now, so between that and knowing that the page is alive,
> we have page->mem_cgroup stable. But compaction doesn't know whether
> the page is alive - it comes from a pfn and finds out using PageLRU.
>
> In the current code, pgdat->lru_lock remains the same before and after
> the page is charged to a cgroup, so once compaction has that locked
> and it observes PageLRU, it can go ahead and isolate the page.
>
> But lruvec->lru_lock changes during charging, and then compaction may
> hold the wrong lock during isolation:
>
> compaction: generic_file_buffered_read:
>
> page_cache_alloc()
>
> !PageBuddy()
>
> lock_page_lruvec(page)
> lruvec = mem_cgroup_page_lruvec()
> spin_lock(&lruvec->lru_lock)
> if lruvec != mem_cgroup_page_lruvec()
> goto again
>
> add_to_page_cache_lru()
> mem_cgroup_commit_charge()
> page->mem_cgroup = foo
> lru_cache_add()
> __pagevec_lru_add()
> SetPageLRU()
>
> if PageLRU(page):
> __isolate_lru_page()
>
> I don't see what prevents the lruvec from changing under compaction,
> neither in your patches nor in Hugh's. Maybe I'm missing something?
Speaking for my patches only: I'm humbled, I think you have caught me,
I cannot find any argument against the race you suggest here.
The race with mem_cgroup_move_account(), which Konstantin pointed out
in 2012's https://lore.kernel.org/lkml/[email protected]/
but I later misunderstood, and came to think I needed no patch against,
until this week coming to perceive the same race in isolate_lru_page():
that one is easily and satisfactorily fixed by holding lruvec lock in
mem_cgroup_move_account() - embarrassing, but not too serious.
Your race here (again, lruvec lock taken then PageLRU observed, but
page->mem_cgroup changed in between) really questions my whole scheme:
I am not going to propose a solution now, I'll have to go back and
recheck my assumptions all over. Certainly isolate_migratepage_block()
has a harder job than any other, but I need to re-review it all.
Maybe we got it right back in the days of PageCgroupUsed, and then I
paid too little attention when rebasing to your welcome simplifications.
I don't think any of us want to bring back PageCgroupUsed! And maybe we
could get it right by always holding lruvec lock in commit_charge(),
lrucare or not; but that's a much hotter path, and not a change I'd
expect anyone to embrace.
I'll go away and re-examine it all; probably start by verifying that
your race actually happens in practice, though we never observed it.
Heavy-hearted thanks, Hannes!
Hugh
在 2019/11/23 上午12:16, Johannes Weiner 写道:
> On Fri, Nov 22, 2019 at 10:36:32AM +0800, Alex Shi wrote:
>> 在 2019/11/22 上午6:06, Johannes Weiner 写道:
>>> If we could restrict lock_page_lruvec() to working only on PageLRU
>>> pages, we could fix the problem with memory barriers. But this won't
>>> work for split_huge_page(), which is AFAICT the only user that needs
>>> to freeze the lru state of a page that could be isolated elsewhere.
>>>
>>> So AFAICS the only option is to lock out mem_cgroup_move_account()
>>> entirely when the lru_lock is held. Which I guess should be fine.
>>
>> I guess we can try from lock_page_memcg, is that a good start?
>
> Yes.
>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 7e6387ad01f0..f4bbbf72c5b8 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1224,7 +1224,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
>> goto out;
>> }
>>
>> - memcg = page->mem_cgroup;
>> + memcg = lock_page_memcg(page);
>> /*
>> * Swapcache readahead pages are added to the LRU - and
>> * possibly migrated - before they are charged.
>
> test_clear_page_writeback() calls this function with that lock already
> held so that would deadlock. Let's keep locking in lock_page_lruvec().
>
> lock_page_lruvec():
>
> memcg = lock_page_memcg(page);
> lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
>
> spin_lock_irqsave(&lruvec->lru_lock, *flags);
> return lruvec;
>
> unlock_lruvec();
>
> spin_unlock_irqrestore(&lruvec->lru_lock);
> __unlock_page_memcg(lruvec_memcg(lruvec));
>
> The lock ordering should be fine as well. But it might be a good idea
> to stick a might_lock(&memcg->move_lock) in lock_page_memcg() before
> that atomic_read() and test with lockdep enabled.
Hi Johannes,
Thanks a lot for the suggestion. I will look into this and try.
>
>
> But that leaves me with one more worry: compaction. We locked out
> charge moving now, so between that and knowing that the page is alive,
> we have page->mem_cgroup stable. But compaction doesn't know whether
> the page is alive - it comes from a pfn and finds out using PageLRU.
>
> In the current code, pgdat->lru_lock remains the same before and after
> the page is charged to a cgroup, so once compaction has that locked
> and it observes PageLRU, it can go ahead and isolate the page.
>
> But lruvec->lru_lock changes during charging, and then compaction may
> hold the wrong lock during isolation:
>
> compaction: generic_file_buffered_read:
>
> page_cache_alloc()
>
> !PageBuddy()
>
> lock_page_lruvec(page)
> lruvec = mem_cgroup_page_lruvec()
> spin_lock(&lruvec->lru_lock)
> if lruvec != mem_cgroup_page_lruvec()
> goto again
>
> add_to_page_cache_lru()
> mem_cgroup_commit_charge()
> page->mem_cgroup = foo
> lru_cache_add()
> __pagevec_lru_add()
> SetPageLRU()
>
> if PageLRU(page):
> __isolate_lru_page()
>
> I don't see what prevents the lruvec from changing under compaction,
> neither in your patches nor in Hugh's. Maybe I'm missing something?
Yes, it's a problem.
Guess we could move the lruvec recheck after PageLRU() test in compaction. Then it could be safe, and add a bit more burden on compaction should be fine. at last we have no disturb to file read.
Thanks
Alex
On 19/11/2019 15.23, Alex Shi wrote:
> Hi all,
>
> This patchset move lru_lock into lruvec, give a lru_lock for each of
> lruvec, thus bring a lru_lock for each of memcg per node.
>
> According to Daniel Jordan's suggestion, I run 64 'dd' with on 32
> containers on my 2s* 8 core * HT box with the modefied case:
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
>
> With this change above lru_lock censitive testing improved 17% with multiple
> containers scenario. And no performance lose w/o mem_cgroup.
Splitting lru_lock isn't only option for solving this lock contention.
Also it doesn't help if all this happens in one cgroup.
I think better batching could solve more problems with less overhead.
Like larger per-cpu vectors or queues for each numa node or even for each lruvec.
This will preliminarily sort and aggregate pages so actual modification under
lru_lock will be much cheaper and fine grained.
>
> Thanks Hugh Dickins and Konstantin Khlebnikov, they both brought the same idea
> 7 years ago. Now I believe considering my testing result, and google internal
> using fact. This feature is clearly benefit multi-container users.
>
> So I'd like to introduce it here.
>
> Thanks all the comments from Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan,
> Johannes Weiner, Mel Gorman, Shakeel Butt, Rong Chen, Fengguang Wu, Yun Wang etc.
>
> v4:
> a, fix the page->mem_cgroup dereferencing issue, thanks Johannes Weiner
> b, remove the irqsave flags changes, thanks Metthew Wilcox
> c, merge/split patches for better understanding and bisection purpose
>
> v3: rebase on linux-next, and fold the relock fix patch into introduceing patch
>
> v2: bypass a performance regression bug and fix some function issues
>
> v1: initial version, aim testing show 5% performance increase
>
>
> Alex Shi (9):
> mm/swap: fix uninitialized compiler warning
> mm/huge_memory: fix uninitialized compiler warning
> mm/lru: replace pgdat lru_lock with lruvec lock
> mm/mlock: only change the lru_lock iff page's lruvec is different
> mm/swap: only change the lru_lock iff page's lruvec is different
> mm/vmscan: only change the lru_lock iff page's lruvec is different
> mm/pgdat: remove pgdat lru_lock
> mm/lru: likely enhancement
> mm/lru: revise the comments of lru_lock
>
> Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +----
> Documentation/admin-guide/cgroup-v1/memory.rst | 6 +-
> Documentation/trace/events-kmem.rst | 2 +-
> Documentation/vm/unevictable-lru.rst | 22 +++----
> include/linux/memcontrol.h | 68 ++++++++++++++++++++
> include/linux/mm_types.h | 2 +-
> include/linux/mmzone.h | 5 +-
> mm/compaction.c | 67 +++++++++++++------
> mm/filemap.c | 4 +-
> mm/huge_memory.c | 17 ++---
> mm/memcontrol.c | 75 +++++++++++++++++-----
> mm/mlock.c | 27 ++++----
> mm/mmzone.c | 1 +
> mm/page_alloc.c | 1 -
> mm/page_idle.c | 5 +-
> mm/rmap.c | 2 +-
> mm/swap.c | 74 +++++++++------------
> mm/vmscan.c | 74 ++++++++++-----------
> 18 files changed, 287 insertions(+), 180 deletions(-)
>
>
> But that leaves me with one more worry: compaction. We locked out
> charge moving now, so between that and knowing that the page is alive,
> we have page->mem_cgroup stable. But compaction doesn't know whether
> the page is alive - it comes from a pfn and finds out using PageLRU.
>
> In the current code, pgdat->lru_lock remains the same before and after
> the page is charged to a cgroup, so once compaction has that locked
> and it observes PageLRU, it can go ahead and isolate the page.
>
> But lruvec->lru_lock changes during charging, and then compaction may
> hold the wrong lock during isolation:
>
> compaction: generic_file_buffered_read:
>
> page_cache_alloc()
>
> !PageBuddy()
>
> lock_page_lruvec(page)
> lruvec = mem_cgroup_page_lruvec()
> spin_lock(&lruvec->lru_lock)
> if lruvec != mem_cgroup_page_lruvec()
> goto again
>
> add_to_page_cache_lru()
> mem_cgroup_commit_charge()
> page->mem_cgroup = foo
> lru_cache_add()
> __pagevec_lru_add()
> SetPageLRU()
>
> if PageLRU(page):
> __isolate_lru_page()
>
> I don't see what prevents the lruvec from changing under compaction,
> neither in your patches nor in Hugh's. Maybe I'm missing something?
>
Hi Johannes,
It looks my patch do the lruvec recheck/relock after PageLRU in compaction.c.
It should be fine for your question. So I will try more testing after all changes.
Thanks
Alex
@@ -949,10 +959,26 @@ static bool too_many_isolated(pg_data_t *pgdat)
if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
goto isolate_fail;
+ rcu_read_lock();
+reget_lruvec:
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
/* If we already hold the lock, we can skip some rechecking */
- if (!locked) {
- locked = compact_lock_irqsave(&pgdat->lru_lock,
- &flags, cc);
+ if (lruvec != locked_lruvec) {
+ if (locked_lruvec) {
+ spin_unlock_irqrestore(&locked_lruvec->lru_lock,
+ flags);
+ locked_lruvec = NULL;
+ }
+ if (compact_lock_irqsave(&lruvec->lru_lock,
+ &flags, cc))
+ locked_lruvec = lruvec;
+
+
+ if (lruvec != mem_cgroup_page_lruvec(page, pgdat))
+ goto reget_lruvec;
+
+ rcu_read_unlock();
/* Try get exclusive access under lock */
if (!skip_updated) {
@@ -974,9 +1000,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
low_pfn += compound_nr(page) - 1;
goto isolate_fail;
}
- }
+ } else
+ rcu_read_unlock();
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
/* Try isolate the page */
if (__isolate_lru_page(page, isolate_mode) != 0)
@@ -1017,9 +1043,10 @@ static bool too_many_isolated(pg_data_t *pgdat)
* page anyway.
*/
if (nr_isolated) {
- if (locked) {
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
- locked = false;
+ if (locked_lruvec) {
+ spin_unlock_irqrestore(&locked_lruvec->lru_lock,
+ flags);
+ locked_lruvec = NULL;
}
putback_movable_pages(&cc->migratepages);
On Mon, Nov 25, 2019 at 1:26 AM Alex Shi <[email protected]> wrote:
>
>
> >
> > But that leaves me with one more worry: compaction. We locked out
> > charge moving now, so between that and knowing that the page is alive,
> > we have page->mem_cgroup stable. But compaction doesn't know whether
> > the page is alive - it comes from a pfn and finds out using PageLRU.
> >
> > In the current code, pgdat->lru_lock remains the same before and after
> > the page is charged to a cgroup, so once compaction has that locked
> > and it observes PageLRU, it can go ahead and isolate the page.
> >
> > But lruvec->lru_lock changes during charging, and then compaction may
> > hold the wrong lock during isolation:
> >
> > compaction: generic_file_buffered_read:
> >
> > page_cache_alloc()
> >
> > !PageBuddy()
> >
> > lock_page_lruvec(page)
> > lruvec = mem_cgroup_page_lruvec()
> > spin_lock(&lruvec->lru_lock)
> > if lruvec != mem_cgroup_page_lruvec()
> > goto again
> >
> > add_to_page_cache_lru()
> > mem_cgroup_commit_charge()
> > page->mem_cgroup = foo
> > lru_cache_add()
> > __pagevec_lru_add()
> > SetPageLRU()
> >
> > if PageLRU(page):
> > __isolate_lru_page()
> >
> > I don't see what prevents the lruvec from changing under compaction,
> > neither in your patches nor in Hugh's. Maybe I'm missing something?
> >
>
> Hi Johannes,
>
> It looks my patch do the lruvec recheck/relock after PageLRU in compaction.c.
> It should be fine for your question. So I will try more testing after all changes.
Actually no, unless PageLRU check and taking lruvec lock are atomic,
the race mentioned by Johannes still exist.
Shakeel