This patchset moves lru_lock into the lruvec, giving each lruvec its own
lru_lock and thus each memcg its own lru_lock.
The per-memcg lru_lock eases lru_lock contention considerably over
this patch series.
In data centers, containers are widely used to deploy different kinds
of services, so multiple memcgs share the per-node pgdat->lru_lock,
which causes heavy lock contention during lru operations.
On my 2-socket * 6-core E5-2630 platform, 24 containers run aim9
simultaneously with the following mmtests config:
# AIM9
export AIM9_TESTTIME=180
export AIM9_TESTLIST=page_test,brk_test
perf lock report shows heavy contention on lru_lock in a 20-second snapshot:
Name acquired contended avg wait (ns) total wait (ns) max wait (ns) min wait (ns)
&(ptlock_ptr(pag... 22 0 0 0 0 0
...
&(&pgdat->lru_lo... 9 7 12728 89096 26656 1597
With this patch series, lruvec->lru_lock shows no contention:
&(&lruvec->lru_l... 8 0 0 0 0 0
and aim9 page_test/brk_test performance increased by 5%~50%.
Detailed results are in aim9-pft.compare.log if needed; all
containers' numbers improved and are quite steady.
$for i in Max Min Hmean Stddev CoeffVar BHmean-50 BHmean-95 BHmean-99; do echo "========= $i page_test ============"; cat aim9-pft.compare.log | grep "^$i.*page_test" | awk 'BEGIN {a=b=0;} { a+=$3; b+=$6 } END { print "5.3-rc4 " a/24; print "5.3-rc4+lru_lock " b/24}' ; done
========= Max page_test ============
5.3-rc4 34729.6
5.3-rc4+lru_lock 36128.3
========= Min page_test ============
5.3-rc4 33644.2
5.3-rc4+lru_lock 35349.7
========= Hmean page_test ============
5.3-rc4 34355.4
5.3-rc4+lru_lock 35810.9
========= Stddev page_test ============
5.3-rc4 319.757
5.3-rc4+lru_lock 223.324
========= CoeffVar page_test ============
5.3-rc4 0.93125
5.3-rc4+lru_lock 0.623333
========= BHmean-50 page_test ============
5.3-rc4 34579.2
5.3-rc4+lru_lock 35977.1
========= BHmean-95 page_test ============
5.3-rc4 34421.7
5.3-rc4+lru_lock 35853.6
========= BHmean-99 page_test ============
5.3-rc4 34421.7
5.3-rc4+lru_lock 35853.6
$for i in Max Min Hmean Stddev CoeffVar BHmean-50 BHmean-95 BHmean-99; do echo "========= $i brk_test ============"; cat aim9-pft.compare.log | grep "^$i.*brk_test" | awk 'BEGIN {a=b=0;} { a+=$3; b+=$6 } END { print "5.3-rc4 " a/24; print "5.3-rc4+lru_lock " b/24}' ; done
========= Max brk_test ============
5.3-rc4 96647.7
5.3-rc4+lru_lock 98960.3
========= Min brk_test ============
5.3-rc4 91800.8
5.3-rc4+lru_lock 96817.6
========= Hmean brk_test ============
5.3-rc4 95470
5.3-rc4+lru_lock 97769.6
========= Stddev brk_test ============
5.3-rc4 1253.52
5.3-rc4+lru_lock 596.593
========= CoeffVar brk_test ============
5.3-rc4 1.31375
5.3-rc4+lru_lock 0.609583
========= BHmean-50 brk_test ============
5.3-rc4 96141.4
5.3-rc4+lru_lock 98194
========= BHmean-95 brk_test ============
5.3-rc4 95818.5
5.3-rc4+lru_lock 97857.2
========= BHmean-99 brk_test ============
5.3-rc4 95818.5
5.3-rc4+lru_lock 97857.2
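At its core, the series converts LRU locking from the shared per-node
pgdat->lru_lock to a per-lruvec (and therefore per-memcg) lock. The
sketch below illustrates the before/after pattern; it is illustrative
only and not a hunk from any patch. mem_cgroup_page_lruvec() and
page_pgdat() are existing kernel helpers, sync_lruvec_pgdat() is the
helper introduced later in this series, and the two wrapper function
names are made up for the example.

/* Before: every memcg on a node serializes on the node-wide lock. */
static void lru_op_before(struct page *page)
{
	pg_data_t *pgdat = page_pgdat(page);

	spin_lock_irq(&pgdat->lru_lock);
	/* ... add/remove the page on whichever memcg LRU it sits on ... */
	spin_unlock_irq(&pgdat->lru_lock);
}

/* After: each memcg/node lruvec carries its own lock. */
static void lru_op_after(struct page *page)
{
	pg_data_t *pgdat = page_pgdat(page);
	struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);

	spin_lock_irq(&lruvec->lru_lock);
	/* keep lruvec->pgdat valid across node offline/online */
	sync_lruvec_pgdat(lruvec, pgdat);
	/* ... add/remove the page on this lruvec's lists ... */
	spin_unlock_irq(&lruvec->lru_lock);
}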
Alex Shi (14):
mm/lru: move pgdat lru_lock into lruvec
lru/memcg: move the lruvec->pgdat sync out lru_lock
lru/memcg: using per lruvec lock in un/lock_page_lru
lru/compaction: use per lruvec lock in isolate_migratepages_block
lru/huge_page: use per lruvec lock in __split_huge_page
lru/mlock: using per lruvec lock in munlock
lru/swap: using per lruvec lock in page_cache_release
lru/swap: use lruvec lock in activate_page
lru/swap: use per lruvec lock in pagevec_lru_move_fn
lru/swap: use per lruvec lock in release_pages
lru/vmscan: using per lruvec lock in lists shrinking.
lru/vmscan: use per lruvec lock in check_move_unevictable_pages
lru/vmscan: using per lruvec lru_lock in get_scan_count
mm/lru: fix the comments of lru_lock
include/linux/memcontrol.h | 24 ++++++++++----
include/linux/mm_types.h | 2 +-
include/linux/mmzone.h | 6 ++--
mm/compaction.c | 48 +++++++++++++++++-----------
mm/filemap.c | 4 +--
mm/huge_memory.c | 9 ++++--
mm/memcontrol.c | 24 ++++++--------
mm/mlock.c | 35 ++++++++++----------
mm/mmzone.c | 1 +
mm/page_alloc.c | 1 -
mm/page_idle.c | 4 +--
mm/rmap.c | 2 +-
mm/swap.c | 79 +++++++++++++++++++++++++---------------------
mm/vmscan.c | 63 ++++++++++++++++++------------------
14 files changed, 166 insertions(+), 136 deletions(-)
--
1.8.3.1
Use the per-lruvec lock in pagevec_lru_move_fn() to replace the
per-pgdat lru_lock.
Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Peng Fan <[email protected]>
Cc: Nikolay Borisov <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/swap.c | 25 +++++++++++++------------
1 file changed, 13 insertions(+), 12 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c
index d2dad08fcfd0..24a2b3456e10 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -192,26 +192,27 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
void *arg)
{
int i;
- struct pglist_data *pgdat = NULL;
- struct lruvec *lruvec;
+ struct lruvec *locked_lruvec = NULL;
unsigned long flags = 0;
for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];
- struct pglist_data *pagepgdat = page_pgdat(page);
-
- if (pagepgdat != pgdat) {
- if (pgdat)
- spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
- pgdat = pagepgdat;
- spin_lock_irqsave(&pgdat->lruvec.lru_lock, flags);
+ struct pglist_data *pgdat = page_pgdat(page);
+ struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
+ if (locked_lruvec != lruvec) {
+ if (locked_lruvec)
+ spin_unlock_irqrestore(&locked_lruvec->lru_lock, flags);
+ locked_lruvec = lruvec;
+ spin_lock_irqsave(&lruvec->lru_lock, flags);
+ sync_lruvec_pgdat(lruvec, pgdat);
}
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
(*move_fn)(page, lruvec, arg);
}
- if (pgdat)
- spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
+ if (locked_lruvec)
+ spin_unlock_irqrestore(&locked_lruvec->lru_lock, flags);
+
release_pages(pvec->pages, pvec->nr);
pagevec_reinit(pvec);
}
--
1.8.3.1
Use the per-lruvec lru_lock in get_scan_count(). The lruvec is already
passed in as a parameter, so no lruvec->pgdat syncing is needed.
Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: Yafang Shao <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/vmscan.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 123447b9beda..ea5c2f3f2567 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2372,7 +2372,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES) +
lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, MAX_NR_ZONES);
- spin_lock_irq(&pgdat->lruvec.lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
reclaim_stat->recent_scanned[0] /= 2;
reclaim_stat->recent_rotated[0] /= 2;
@@ -2393,7 +2393,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
fp /= reclaim_stat->recent_rotated[1] + 1;
- spin_unlock_irq(&pgdat->lruvec.lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
fraction[0] = ap;
fraction[1] = fp;
--
1.8.3.1
This patch moves lru_lock into the lruvec, giving each lruvec its own
lru_lock and thus each memcg its own lru_lock.
The per-memcg lru_lock eases lru_lock contention considerably over
this patch series.
In data centers, containers are widely used to deploy different kinds
of services, so multiple memcgs share the per-node pgdat->lru_lock,
which causes heavy lock contention during lru operations.
On my 2-socket * 6-core E5-2630 platform, 24 containers run aim9
simultaneously with the following mmtests config:
# AIM9
export AIM9_TESTTIME=180
export AIM9_TESTLIST=page_test,brk_test
perf lock report shows heavy contention on lru_lock in a 20-second snapshot:
Name acquired contended avg wait (ns) total wait (ns) max wait (ns) min wait (ns)
&(ptlock_ptr(pag... 22 0 0 0 0 0
...
&(&pgdat->lru_lo... 9 7 12728 89096 26656 1597
With this patch series, lruvec->lru_lock shows no contention:
&(&lruvec->lru_l... 8 0 0 0 0 0
and aim9 page_test/brk_test performance increased by 5%~50%.
This patch itself still uses a single per-pgdat lock (now located at
pgdat->lruvec.lru_lock), so there are no functional changes yet.
Signed-off-by: Alex Shi <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Arun KS <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: "Jérôme Glisse" <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Souptick Joarder <[email protected]>
Cc: swkhack <[email protected]>
Cc: "Potyra, Stefan" <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Alexander Duyck <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Colin Ian King <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Peng Fan <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: Yafang Shao <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/mmzone.h | 4 +++-
mm/compaction.c | 10 +++++-----
mm/huge_memory.c | 6 +++---
mm/memcontrol.c | 6 +++---
mm/mlock.c | 10 +++++-----
mm/mmzone.c | 1 +
mm/page_alloc.c | 1 -
mm/page_idle.c | 4 ++--
mm/swap.c | 28 ++++++++++++++--------------
mm/vmscan.c | 38 +++++++++++++++++++-------------------
10 files changed, 55 insertions(+), 53 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d77d717c620c..8d0076d084be 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -295,6 +295,9 @@ struct zone_reclaim_stat {
struct lruvec {
struct list_head lists[NR_LRU_LISTS];
+ /* move lru_lock to per lruvec for memcg */
+ spinlock_t lru_lock;
+
struct zone_reclaim_stat reclaim_stat;
/* Evictions & activations on the inactive file list */
atomic_long_t inactive_age;
@@ -744,7 +747,6 @@ struct zonelist {
/* Write-intensive fields used by page reclaim */
ZONE_PADDING(_pad1_)
- spinlock_t lru_lock;
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
/*
diff --git a/mm/compaction.c b/mm/compaction.c
index 952dc2fb24e5..9a737f343183 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -846,7 +846,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
* a fatal signal is pending.
*/
if (!(low_pfn % SWAP_CLUSTER_MAX)
- && compact_unlock_should_abort(&pgdat->lru_lock,
+ && compact_unlock_should_abort(&pgdat->lruvec.lru_lock,
flags, &locked, cc)) {
low_pfn = 0;
goto fatal_pending;
@@ -919,7 +919,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
if (unlikely(__PageMovable(page)) &&
!PageIsolated(page)) {
if (locked) {
- spin_unlock_irqrestore(&pgdat->lru_lock,
+ spin_unlock_irqrestore(&pgdat->lruvec.lru_lock,
flags);
locked = false;
}
@@ -949,7 +949,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
/* If we already hold the lock, we can skip some rechecking */
if (!locked) {
- locked = compact_lock_irqsave(&pgdat->lru_lock,
+ locked = compact_lock_irqsave(&pgdat->lruvec.lru_lock,
&flags, cc);
/* Try get exclusive access under lock */
@@ -1016,7 +1016,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
*/
if (nr_isolated) {
if (locked) {
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
locked = false;
}
putback_movable_pages(&cc->migratepages);
@@ -1043,7 +1043,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
isolate_abort:
if (locked)
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
/*
* Updated the cached scanner pfn once the pageblock has been scanned
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 738065f765ab..3a483deee807 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2529,7 +2529,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
xa_unlock(&head->mapping->i_pages);
}
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
remap_page(head);
@@ -2740,7 +2740,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
lru_add_drain();
/* prevent PageLRU to go away from under us, and freeze lru stats */
- spin_lock_irqsave(&pgdata->lru_lock, flags);
+ spin_lock_irqsave(&pgdata->lruvec.lru_lock, flags);
if (mapping) {
XA_STATE(xas, &mapping->i_pages, page_index(head));
@@ -2785,7 +2785,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
spin_unlock(&pgdata->split_queue_lock);
fail: if (mapping)
xa_unlock(&mapping->i_pages);
- spin_unlock_irqrestore(&pgdata->lru_lock, flags);
+ spin_unlock_irqrestore(&pgdata->lruvec.lru_lock, flags);
remap_page(head);
ret = -EBUSY;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6f5c0c517c49..2792b8ed405f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2557,7 +2557,7 @@ static void lock_page_lru(struct page *page, int *isolated)
{
pg_data_t *pgdat = page_pgdat(page);
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&pgdat->lruvec.lru_lock);
if (PageLRU(page)) {
struct lruvec *lruvec;
@@ -2581,7 +2581,7 @@ static void unlock_page_lru(struct page *page, int isolated)
SetPageLRU(page);
add_page_to_lru_list(page, lruvec, page_lru(page));
}
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&pgdat->lruvec.lru_lock);
}
static void commit_charge(struct page *page, struct mem_cgroup *memcg,
@@ -2901,7 +2901,7 @@ void __memcg_kmem_uncharge(struct page *page, int order)
/*
* Because tail pages are not marked as "used", set it. We're under
- * pgdat->lru_lock and migration entries setup in all page mappings.
+ * pgdat->lruvec.lru_lock and migration entries setup in all page mappings.
*/
void mem_cgroup_split_huge_fixup(struct page *head)
{
diff --git a/mm/mlock.c b/mm/mlock.c
index a90099da4fb4..1279684bada0 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -194,7 +194,7 @@ unsigned int munlock_vma_page(struct page *page)
* might otherwise copy PageMlocked to part of the tail pages before
* we clear it in the head page. It also stabilizes hpage_nr_pages().
*/
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&pgdat->lruvec.lru_lock);
if (!TestClearPageMlocked(page)) {
/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
@@ -206,14 +206,14 @@ unsigned int munlock_vma_page(struct page *page)
__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
if (__munlock_isolate_lru_page(page, true)) {
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&pgdat->lruvec.lru_lock);
__munlock_isolated_page(page);
goto out;
}
__munlock_isolation_failed(page);
unlock_out:
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&pgdat->lruvec.lru_lock);
out:
return nr_pages - 1;
@@ -298,7 +298,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
pagevec_init(&pvec_putback);
/* Phase 1: page isolation */
- spin_lock_irq(&zone->zone_pgdat->lru_lock);
+ spin_lock_irq(&zone->zone_pgdat->lruvec.lru_lock);
for (i = 0; i < nr; i++) {
struct page *page = pvec->pages[i];
@@ -325,7 +325,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
pvec->pages[i] = NULL;
}
__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
- spin_unlock_irq(&zone->zone_pgdat->lru_lock);
+ spin_unlock_irq(&zone->zone_pgdat->lruvec.lru_lock);
/* Now we can release pins of pages that we are not munlocking */
pagevec_release(&pvec_putback);
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 4686fdc23bb9..3750a90ed4a0 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec)
enum lru_list lru;
memset(lruvec, 0, sizeof(struct lruvec));
+ spin_lock_init(&lruvec->lru_lock);
for_each_lru(lru)
INIT_LIST_HEAD(&lruvec->lists[lru]);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 272c6de1bf4e..1b07dcaabbd7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6677,7 +6677,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
init_waitqueue_head(&pgdat->pfmemalloc_wait);
pgdat_page_ext_init(pgdat);
- spin_lock_init(&pgdat->lru_lock);
lruvec_init(node_lruvec(pgdat));
}
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 295512465065..420bc0ac8c1e 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -42,12 +42,12 @@ static struct page *page_idle_get_page(unsigned long pfn)
return NULL;
pgdat = page_pgdat(page);
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&pgdat->lruvec.lru_lock);
if (unlikely(!PageLRU(page))) {
put_page(page);
page = NULL;
}
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&pgdat->lruvec.lru_lock);
return page;
}
diff --git a/mm/swap.c b/mm/swap.c
index ae300397dfda..63f4782af57a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -63,12 +63,12 @@ static void __page_cache_release(struct page *page)
struct lruvec *lruvec;
unsigned long flags;
- spin_lock_irqsave(&pgdat->lru_lock, flags);
+ spin_lock_irqsave(&pgdat->lruvec.lru_lock, flags);
lruvec = mem_cgroup_page_lruvec(page, pgdat);
VM_BUG_ON_PAGE(!PageLRU(page), page);
__ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_off_lru(page));
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
}
__ClearPageWaiters(page);
mem_cgroup_uncharge(page);
@@ -201,16 +201,16 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
if (pagepgdat != pgdat) {
if (pgdat)
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
pgdat = pagepgdat;
- spin_lock_irqsave(&pgdat->lru_lock, flags);
+ spin_lock_irqsave(&pgdat->lruvec.lru_lock, flags);
}
lruvec = mem_cgroup_page_lruvec(page, pgdat);
(*move_fn)(page, lruvec, arg);
}
if (pgdat)
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
release_pages(pvec->pages, pvec->nr);
pagevec_reinit(pvec);
}
@@ -326,9 +326,9 @@ void activate_page(struct page *page)
pg_data_t *pgdat = page_pgdat(page);
page = compound_head(page);
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&pgdat->lruvec.lru_lock);
__activate_page(page, mem_cgroup_page_lruvec(page, pgdat), NULL);
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&pgdat->lruvec.lru_lock);
}
#endif
@@ -733,7 +733,7 @@ void release_pages(struct page **pages, int nr)
* same pgdat. The lock is held only if pgdat != NULL.
*/
if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
- spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+ spin_unlock_irqrestore(&locked_pgdat->lruvec.lru_lock, flags);
locked_pgdat = NULL;
}
@@ -742,7 +742,7 @@ void release_pages(struct page **pages, int nr)
if (is_zone_device_page(page)) {
if (locked_pgdat) {
- spin_unlock_irqrestore(&locked_pgdat->lru_lock,
+ spin_unlock_irqrestore(&locked_pgdat->lruvec.lru_lock,
flags);
locked_pgdat = NULL;
}
@@ -762,7 +762,7 @@ void release_pages(struct page **pages, int nr)
if (PageCompound(page)) {
if (locked_pgdat) {
- spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+ spin_unlock_irqrestore(&locked_pgdat->lruvec.lru_lock, flags);
locked_pgdat = NULL;
}
__put_compound_page(page);
@@ -774,11 +774,11 @@ void release_pages(struct page **pages, int nr)
if (pgdat != locked_pgdat) {
if (locked_pgdat)
- spin_unlock_irqrestore(&locked_pgdat->lru_lock,
+ spin_unlock_irqrestore(&locked_pgdat->lruvec.lru_lock,
flags);
lock_batch = 0;
locked_pgdat = pgdat;
- spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
+ spin_lock_irqsave(&locked_pgdat->lruvec.lru_lock, flags);
}
lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
@@ -794,7 +794,7 @@ void release_pages(struct page **pages, int nr)
list_add(&page->lru, &pages_to_free);
}
if (locked_pgdat)
- spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+ spin_unlock_irqrestore(&locked_pgdat->lruvec.lru_lock, flags);
mem_cgroup_uncharge_list(&pages_to_free);
free_unref_page_list(&pages_to_free);
@@ -832,7 +832,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
VM_BUG_ON_PAGE(!PageHead(page), page);
VM_BUG_ON_PAGE(PageCompound(page_tail), page);
VM_BUG_ON_PAGE(PageLRU(page_tail), page);
- lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+ lockdep_assert_held(&lruvec->lru_lock);
if (!list)
SetPageLRU(page_tail);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c77d1e3761a7..c7a228525df0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1805,7 +1805,7 @@ int isolate_lru_page(struct page *page)
pg_data_t *pgdat = page_pgdat(page);
struct lruvec *lruvec;
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&pgdat->lruvec.lru_lock);
lruvec = mem_cgroup_page_lruvec(page, pgdat);
if (PageLRU(page)) {
int lru = page_lru(page);
@@ -1814,7 +1814,7 @@ int isolate_lru_page(struct page *page)
del_page_from_lru_list(page, lruvec, lru);
ret = 0;
}
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&pgdat->lruvec.lru_lock);
}
return ret;
}
@@ -1890,9 +1890,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
VM_BUG_ON_PAGE(PageLRU(page), page);
if (unlikely(!page_evictable(page))) {
list_del(&page->lru);
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&pgdat->lruvec.lru_lock);
putback_lru_page(page);
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&pgdat->lruvec.lru_lock);
continue;
}
lruvec = mem_cgroup_page_lruvec(page, pgdat);
@@ -1910,10 +1910,10 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
del_page_from_lru_list(page, lruvec, lru);
if (unlikely(PageCompound(page))) {
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&pgdat->lruvec.lru_lock);
mem_cgroup_uncharge(page);
(*get_compound_page_dtor(page))(page);
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&pgdat->lruvec.lru_lock);
} else
list_add(&page->lru, &pages_to_free);
} else {
@@ -1976,7 +1976,7 @@ static int current_may_throttle(void)
lru_add_drain();
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&pgdat->lruvec.lru_lock);
nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
&nr_scanned, sc, lru);
@@ -1988,7 +1988,7 @@ static int current_may_throttle(void)
if (global_reclaim(sc))
__count_vm_events(item, nr_scanned);
__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&pgdat->lruvec.lru_lock);
if (nr_taken == 0)
return 0;
@@ -1996,7 +1996,7 @@ static int current_may_throttle(void)
nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
&stat, false);
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&pgdat->lruvec.lru_lock);
item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
if (global_reclaim(sc))
@@ -2009,7 +2009,7 @@ static int current_may_throttle(void)
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&pgdat->lruvec.lru_lock);
mem_cgroup_uncharge_list(&page_list);
free_unref_page_list(&page_list);
@@ -2062,7 +2062,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
lru_add_drain();
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&pgdat->lruvec.lru_lock);
nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
&nr_scanned, sc, lru);
@@ -2073,7 +2073,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
__count_vm_events(PGREFILL, nr_scanned);
__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&pgdat->lruvec.lru_lock);
while (!list_empty(&l_hold)) {
cond_resched();
@@ -2119,7 +2119,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
/*
* Move pages back to the lru list.
*/
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&pgdat->lruvec.lru_lock);
/*
* Count referenced pages from currently used mappings as rotated,
* even though only some of them are actually re-activated. This
@@ -2137,7 +2137,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&pgdat->lruvec.lru_lock);
mem_cgroup_uncharge_list(&l_active);
free_unref_page_list(&l_active);
@@ -2373,7 +2373,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES) +
lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, MAX_NR_ZONES);
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&pgdat->lruvec.lru_lock);
if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
reclaim_stat->recent_scanned[0] /= 2;
reclaim_stat->recent_rotated[0] /= 2;
@@ -2394,7 +2394,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
fp /= reclaim_stat->recent_rotated[1] + 1;
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&pgdat->lruvec.lru_lock);
fraction[0] = ap;
fraction[1] = fp;
@@ -4263,9 +4263,9 @@ void check_move_unevictable_pages(struct pagevec *pvec)
pgscanned++;
if (pagepgdat != pgdat) {
if (pgdat)
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&pgdat->lruvec.lru_lock);
pgdat = pagepgdat;
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&pgdat->lruvec.lru_lock);
}
lruvec = mem_cgroup_page_lruvec(page, pgdat);
@@ -4286,7 +4286,7 @@ void check_move_unevictable_pages(struct pagevec *pvec)
if (pgdat) {
__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&pgdat->lruvec.lru_lock);
}
}
EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
--
1.8.3.1
The involved functions include isolate_lru_page(), move_pages_to_lru()
and shrink_inactive_list()/shrink_active_list(). The now-unnecessary
pgdat references are removed accordingly.
Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: Yafang Shao <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/vmscan.c | 31 +++++++++++++++----------------
1 file changed, 15 insertions(+), 16 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c7a228525df0..defc2c4778eb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1805,8 +1805,9 @@ int isolate_lru_page(struct page *page)
pg_data_t *pgdat = page_pgdat(page);
struct lruvec *lruvec;
- spin_lock_irq(&pgdat->lruvec.lru_lock);
lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ spin_lock_irq(&lruvec->lru_lock);
+ sync_lruvec_pgdat(lruvec, pgdat);
if (PageLRU(page)) {
int lru = page_lru(page);
get_page(page);
@@ -1814,7 +1815,7 @@ int isolate_lru_page(struct page *page)
del_page_from_lru_list(page, lruvec, lru);
ret = 0;
}
- spin_unlock_irq(&pgdat->lruvec.lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
}
return ret;
}
@@ -1879,7 +1880,6 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
struct list_head *list)
{
- struct pglist_data *pgdat = lruvec_pgdat(lruvec);
int nr_pages, nr_moved = 0;
LIST_HEAD(pages_to_free);
struct page *page;
@@ -1890,12 +1890,11 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
VM_BUG_ON_PAGE(PageLRU(page), page);
if (unlikely(!page_evictable(page))) {
list_del(&page->lru);
- spin_unlock_irq(&pgdat->lruvec.lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
putback_lru_page(page);
- spin_lock_irq(&pgdat->lruvec.lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
continue;
}
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
SetPageLRU(page);
lru = page_lru(page);
@@ -1910,10 +1909,10 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
del_page_from_lru_list(page, lruvec, lru);
if (unlikely(PageCompound(page))) {
- spin_unlock_irq(&pgdat->lruvec.lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
mem_cgroup_uncharge(page);
(*get_compound_page_dtor(page))(page);
- spin_lock_irq(&pgdat->lruvec.lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
} else
list_add(&page->lru, &pages_to_free);
} else {
@@ -1976,7 +1975,7 @@ static int current_may_throttle(void)
lru_add_drain();
- spin_lock_irq(&pgdat->lruvec.lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
&nr_scanned, sc, lru);
@@ -1988,7 +1987,7 @@ static int current_may_throttle(void)
if (global_reclaim(sc))
__count_vm_events(item, nr_scanned);
__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
- spin_unlock_irq(&pgdat->lruvec.lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
if (nr_taken == 0)
return 0;
@@ -1996,7 +1995,7 @@ static int current_may_throttle(void)
nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
&stat, false);
- spin_lock_irq(&pgdat->lruvec.lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
if (global_reclaim(sc))
@@ -2009,7 +2008,7 @@ static int current_may_throttle(void)
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
- spin_unlock_irq(&pgdat->lruvec.lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
mem_cgroup_uncharge_list(&page_list);
free_unref_page_list(&page_list);
@@ -2062,7 +2061,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
lru_add_drain();
- spin_lock_irq(&pgdat->lruvec.lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
&nr_scanned, sc, lru);
@@ -2073,7 +2072,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
__count_vm_events(PGREFILL, nr_scanned);
__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
- spin_unlock_irq(&pgdat->lruvec.lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
while (!list_empty(&l_hold)) {
cond_resched();
@@ -2119,7 +2118,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
/*
* Move pages back to the lru list.
*/
- spin_lock_irq(&pgdat->lruvec.lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
/*
* Count referenced pages from currently used mappings as rotated,
* even though only some of them are actually re-activated. This
@@ -2137,7 +2136,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
- spin_unlock_irq(&pgdat->lruvec.lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
mem_cgroup_uncharge_list(&l_active);
free_unref_page_list(&l_active);
--
1.8.3.1
Use lruvec locking to replace the pgdat lru_lock, and open-code
compact_unlock_should_abort() to fit the replacement.
Signed-off-by: Alex Shi <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/compaction.c | 48 ++++++++++++++++++++++++++++++------------------
1 file changed, 30 insertions(+), 18 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 9a737f343183..8877f38410d8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -785,7 +785,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
unsigned long nr_scanned = 0, nr_isolated = 0;
struct lruvec *lruvec;
unsigned long flags = 0;
- bool locked = false;
+ struct lruvec *locked_lruvec = NULL;
struct page *page = NULL, *valid_page = NULL;
unsigned long start_pfn = low_pfn;
bool skip_on_failure = false;
@@ -845,11 +845,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
* contention, to give chance to IRQs. Abort completely if
* a fatal signal is pending.
*/
- if (!(low_pfn % SWAP_CLUSTER_MAX)
- && compact_unlock_should_abort(&pgdat->lruvec.lru_lock,
- flags, &locked, cc)) {
- low_pfn = 0;
- goto fatal_pending;
+ if (!(low_pfn % SWAP_CLUSTER_MAX)) {
+ if (locked_lruvec) {
+ spin_unlock_irqrestore(&locked_lruvec->lru_lock, flags);
+ locked_lruvec = NULL;
+ }
+
+ if (fatal_signal_pending(current)) {
+ cc->contended = true;
+
+ low_pfn = 0;
+ goto fatal_pending;
+ }
+
+ cond_resched();
}
if (!pfn_valid_within(low_pfn))
@@ -918,10 +927,10 @@ static bool too_many_isolated(pg_data_t *pgdat)
*/
if (unlikely(__PageMovable(page)) &&
!PageIsolated(page)) {
- if (locked) {
- spin_unlock_irqrestore(&pgdat->lruvec.lru_lock,
+ if (locked_lruvec) {
+ spin_unlock_irqrestore(&locked_lruvec->lru_lock,
flags);
- locked = false;
+ locked_lruvec = NULL;
}
if (!isolate_movable_page(page, isolate_mode))
@@ -947,10 +956,14 @@ static bool too_many_isolated(pg_data_t *pgdat)
if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
goto isolate_fail;
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
/* If we already hold the lock, we can skip some rechecking */
- if (!locked) {
- locked = compact_lock_irqsave(&pgdat->lruvec.lru_lock,
- &flags, cc);
+ if (lruvec != locked_lruvec) {
+ if (compact_lock_irqsave(&lruvec->lru_lock, &flags, cc))
+ locked_lruvec = lruvec;
+
+ sync_lruvec_pgdat(lruvec, pgdat);
/* Try get exclusive access under lock */
if (!skip_updated) {
@@ -974,7 +987,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
}
}
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
/* Try isolate the page */
if (__isolate_lru_page(page, isolate_mode) != 0)
@@ -1015,9 +1027,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
* page anyway.
*/
if (nr_isolated) {
- if (locked) {
- spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
- locked = false;
+ if (locked_lruvec) {
+ spin_unlock_irqrestore(&locked_lruvec->lru_lock, flags);
+ locked_lruvec = NULL;
}
putback_movable_pages(&cc->migratepages);
cc->nr_migratepages = 0;
@@ -1042,8 +1054,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
low_pfn = end_pfn;
isolate_abort:
- if (locked)
- spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
+ if (locked_lruvec)
+ spin_unlock_irqrestore(&locked_lruvec->lru_lock, flags);
/*
* Updated the cached scanner pfn once the pageblock has been scanned
--
1.8.3.1
We are going to move the lruvec lookup out from under the lru_lock; the
only unsafe part of doing so is the lruvec->pgdat syncing on memory
node hot-plug. Split the lruvec->pgdat assignment out into a helper
now, so later patches can put it under lruvec lru_lock protection.
No functional changes in this patch.
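For illustration, the locking pattern the later patches converge on
looks roughly like the sketch below. It is not a hunk from this patch,
the wrapper name is hypothetical, and lru_lock only moves into the
lruvec in the first patch of this series:

static struct lruvec *lruvec_lock_irq_example(struct page *page)
{
	pg_data_t *pgdat = page_pgdat(page);
	struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);

	spin_lock_irq(&lruvec->lru_lock);
	/*
	 * Re-check lruvec->pgdat under the lock: a node may have been
	 * offlined and re-onlined since the memcg was created.
	 */
	sync_lruvec_pgdat(lruvec, pgdat);
	return lruvec;
}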
Signed-off-by: Alex Shi <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Chris Down <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/memcontrol.h | 24 +++++++++++++++++-------
mm/memcontrol.c | 8 +-------
2 files changed, 18 insertions(+), 14 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 2cd4359cb38c..95b3d9885ab6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -359,6 +359,17 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
return memcg->nodeinfo[nid];
}
+static void sync_lruvec_pgdat(struct lruvec *lruvec, struct pglist_data *pgdat)
+{
+ /*
+ * Since a node can be onlined after the mem_cgroup was created,
+ * we have to be prepared to initialize lruvec->pgdat here;
+ * and if offlined then reonlined, we need to reinitialize it.
+ */
+ if (!mem_cgroup_disabled() && unlikely(lruvec->pgdat != pgdat))
+ lruvec->pgdat = pgdat;
+}
+
/**
* mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone
* @node: node of the wanted lruvec
@@ -382,13 +393,7 @@ static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
lruvec = &mz->lruvec;
out:
- /*
- * Since a node can be onlined after the mem_cgroup was created,
- * we have to be prepared to initialize lruvec->pgdat here;
- * and if offlined then reonlined, we need to reinitialize it.
- */
- if (unlikely(lruvec->pgdat != pgdat))
- lruvec->pgdat = pgdat;
+ sync_lruvec_pgdat(lruvec, pgdat);
return lruvec;
}
@@ -857,6 +862,11 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
{
}
+static inline void sync_lruvec_pgdat(struct lruvec *lruvec,
+ struct pglist_data *pgdat)
+{
+}
+
static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
struct mem_cgroup *memcg)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2792b8ed405f..e8a1b0d95ba8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1257,13 +1257,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
mz = mem_cgroup_page_nodeinfo(memcg, page);
lruvec = &mz->lruvec;
out:
- /*
- * Since a node can be onlined after the mem_cgroup was created,
- * we have to be prepared to initialize lruvec->zone here;
- * and if offlined then reonlined, we need to reinitialize it.
- */
- if (unlikely(lruvec->pgdat != pgdat))
- lruvec->pgdat = pgdat;
+ sync_lruvec_pgdat(lruvec, pgdat);
return lruvec;
}
--
1.8.3.1
Use the per-lruvec lock to replace the pgdat lru_lock in
__split_huge_page() and split_huge_page_to_list().
Signed-off-by: Alex Shi <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: "Jérôme Glisse" <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Souptick Joarder <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/huge_memory.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3a483deee807..9a96c0944b4d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2529,7 +2529,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
xa_unlock(&head->mapping->i_pages);
}
- spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
+ spin_unlock_irqrestore(&lruvec->lru_lock, flags);
remap_page(head);
@@ -2671,6 +2671,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
struct anon_vma *anon_vma = NULL;
struct address_space *mapping = NULL;
+ struct lruvec *lruvec;
int count, mapcount, extra_pins, ret;
bool mlocked;
unsigned long flags;
@@ -2739,8 +2740,10 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
if (mlocked)
lru_add_drain();
+ lruvec = mem_cgroup_page_lruvec(head, pgdata);
/* prevent PageLRU to go away from under us, and freeze lru stats */
- spin_lock_irqsave(&pgdata->lruvec.lru_lock, flags);
+ spin_lock_irqsave(&lruvec->lru_lock, flags);
+ sync_lruvec_pgdat(lruvec, pgdata);
if (mapping) {
XA_STATE(xas, &mapping->i_pages, page_index(head));
@@ -2785,7 +2788,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
spin_unlock(&pgdata->split_queue_lock);
fail: if (mapping)
xa_unlock(&mapping->i_pages);
- spin_unlock_irqrestore(&pgdata->lruvec.lru_lock, flags);
+ spin_unlock_irqrestore(&lruvec->lru_lock, flags);
remap_page(head);
ret = -EBUSY;
}
--
1.8.3.1
Use the per-lruvec lock in lock_page_lru() and unlock_page_lru(). For
now we still repeatedly assign lruvec->pgdat in memcg; the assignment
in the lruvec lookup function will be removed once every such point is
protected.
Signed-off-by: Alex Shi <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/memcontrol.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e8a1b0d95ba8..19fd911e8098 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2550,12 +2550,12 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
static void lock_page_lru(struct page *page, int *isolated)
{
pg_data_t *pgdat = page_pgdat(page);
+ struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
- spin_lock_irq(&pgdat->lruvec.lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
+ sync_lruvec_pgdat(lruvec, pgdat);
if (PageLRU(page)) {
- struct lruvec *lruvec;
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_lru(page));
*isolated = 1;
@@ -2566,16 +2566,14 @@ static void lock_page_lru(struct page *page, int *isolated)
static void unlock_page_lru(struct page *page, int isolated)
{
pg_data_t *pgdat = page_pgdat(page);
+ struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
if (isolated) {
- struct lruvec *lruvec;
-
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
VM_BUG_ON_PAGE(PageLRU(page), page);
SetPageLRU(page);
add_page_to_lru_list(page, lruvec, page_lru(page));
}
- spin_unlock_irq(&pgdat->lruvec.lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
}
static void commit_charge(struct page *page, struct mem_cgroup *memcg,
--
1.8.3.1
Use the per-lruvec lock in __page_cache_release(), and also take care
of the lruvec->pgdat syncing.
Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Peng Fan <[email protected]>
Cc: Nikolay Borisov <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/swap.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c
index 63f4782af57a..2a8fe6df08fc 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -63,12 +63,13 @@ static void __page_cache_release(struct page *page)
struct lruvec *lruvec;
unsigned long flags;
- spin_lock_irqsave(&pgdat->lruvec.lru_lock, flags);
lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ spin_lock_irqsave(&lruvec->lru_lock, flags);
+ sync_lruvec_pgdat(lruvec, pgdat);
VM_BUG_ON_PAGE(!PageLRU(page), page);
__ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_off_lru(page));
- spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
+ spin_unlock_irqrestore(&lruvec->lru_lock, flags);
}
__ClearPageWaiters(page);
mem_cgroup_uncharge(page);
--
1.8.3.1
Since pgdat->lru_lock has been changed to lruvec->lru_lock, the
now-incorrect comments in the code need fixing. Also fix some stale
zone->lru_lock comments left over from ancient times.
Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Cc: Souptick Joarder <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: "Tobin C. Harding" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Arun KS <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: "Jérôme Glisse" <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: Yafang Shao <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/mm_types.h | 2 +-
include/linux/mmzone.h | 4 ++--
mm/filemap.c | 4 ++--
mm/rmap.c | 2 +-
mm/vmscan.c | 6 +++---
5 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6a7a1083b6fb..f9f990d8f08f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -79,7 +79,7 @@ struct page {
struct { /* Page cache and anonymous pages */
/**
* @lru: Pageout list, eg. active_list protected by
- * pgdat->lru_lock. Sometimes used as a generic list
+ * lruvec->lru_lock. Sometimes used as a generic list
* by the page owner.
*/
struct list_head lru;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8d0076d084be..d2f782263e42 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -159,7 +159,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
struct pglist_data;
/*
- * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
+ * zone->lock and the lru_lock are two of the hottest locks in the kernel.
* So add a wild amount of padding here to ensure that they fall into separate
* cachelines. There are very few zone structures in the machine, so space
* consumption is not a concern here.
@@ -295,7 +295,7 @@ struct zone_reclaim_stat {
struct lruvec {
struct list_head lists[NR_LRU_LISTS];
- /* move lru_lock to per lruvec for memcg */
+ /* per lruvec lru_lock for memcg */
spinlock_t lru_lock;
struct zone_reclaim_stat reclaim_stat;
diff --git a/mm/filemap.c b/mm/filemap.c
index d0cf700bf201..0a604c8284f2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -100,8 +100,8 @@
* ->swap_lock (try_to_unmap_one)
* ->private_lock (try_to_unmap_one)
* ->i_pages lock (try_to_unmap_one)
- * ->pgdat->lru_lock (follow_page->mark_page_accessed)
- * ->pgdat->lru_lock (check_pte_range->isolate_lru_page)
+ * ->lruvec->lru_lock (follow_page->mark_page_accessed)
+ * ->lruvec->lru_lock (check_pte_range->isolate_lru_page)
* ->private_lock (page_remove_rmap->set_page_dirty)
* ->i_pages lock (page_remove_rmap->set_page_dirty)
* bdi.wb->list_lock (page_remove_rmap->set_page_dirty)
diff --git a/mm/rmap.c b/mm/rmap.c
index 003377e24232..6bee4aebced6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -27,7 +27,7 @@
* mapping->i_mmap_rwsem
* anon_vma->rwsem
* mm->page_table_lock or pte_lock
- * pgdat->lru_lock (in mark_page_accessed, isolate_lru_page)
+ * lruvec->lru_lock (in mark_page_accessed, isolate_lru_page)
* swap_lock (in swap_duplicate, swap_info_get)
* mmlist_lock (in mmput, drain_mmlist and others)
* mapping->private_lock (in __set_page_dirty_buffers)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ea5c2f3f2567..1328eb182a3e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1662,7 +1662,7 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
}
/**
- * pgdat->lru_lock is heavily contended. Some of the functions that
+ * lruvec->lru_lock is heavily contended. Some of the functions that
* shrink the lists perform better by taking out a batch of pages
* and working on them outside the LRU lock.
*
@@ -1864,9 +1864,9 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
* processes, from rmap.
*
* If the pages are mostly unmapped, the processing is fast and it is
- * appropriate to hold zone_lru_lock across the whole operation. But if
+ * appropriate to hold lru_lock across the whole operation. But if
* the pages are mapped, the processing is slow (page_referenced()) so we
- * should drop zone_lru_lock around each page. It's impossible to balance
+ * should drop lru_lock around each page. It's impossible to balance
* this, so instead we remove the pages from the LRU while processing them.
* It is safe to rely on PG_active against the non-LRU pages in here because
* nobody will play with that bit on a non-LRU page.
--
1.8.3.1
Use the per-lruvec lock in check_move_unevictable_pages() to replace
the per-pgdat lru_lock.
Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: Yafang Shao <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/vmscan.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index defc2c4778eb..123447b9beda 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4249,24 +4249,24 @@ int page_evictable(struct page *page)
*/
void check_move_unevictable_pages(struct pagevec *pvec)
{
- struct lruvec *lruvec;
- struct pglist_data *pgdat = NULL;
+ struct lruvec *locked_lruvec = NULL;
int pgscanned = 0;
int pgrescued = 0;
int i;
for (i = 0; i < pvec->nr; i++) {
struct page *page = pvec->pages[i];
- struct pglist_data *pagepgdat = page_pgdat(page);
+ struct pglist_data *pgdat = page_pgdat(page);
+ struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
pgscanned++;
- if (pagepgdat != pgdat) {
- if (pgdat)
- spin_unlock_irq(&pgdat->lruvec.lru_lock);
- pgdat = pagepgdat;
- spin_lock_irq(&pgdat->lruvec.lru_lock);
+ if (lruvec != locked_lruvec) {
+ if (locked_lruvec)
+ spin_unlock_irq(&locked_lruvec->lru_lock);
+ locked_lruvec = lruvec;
+ spin_lock_irq(&lruvec->lru_lock);
+ sync_lruvec_pgdat(lruvec, pgdat);
}
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
if (!PageLRU(page) || !PageUnevictable(page))
continue;
@@ -4282,10 +4282,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
}
}
- if (pgdat) {
+ if (locked_lruvec) {
__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
- spin_unlock_irq(&pgdat->lruvec.lru_lock);
+ spin_unlock_irq(&locked_lruvec->lru_lock);
}
}
EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
--
1.8.3.1
Replace pgdat lru_lock with lruvec lru_lock.
Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Peng Fan <[email protected]>
Cc: Nikolay Borisov <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/swap.c | 40 +++++++++++++++++++++-------------------
1 file changed, 21 insertions(+), 19 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c
index 24a2b3456e10..798bffe7875d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -724,8 +724,7 @@ void release_pages(struct page **pages, int nr)
{
int i;
LIST_HEAD(pages_to_free);
- struct pglist_data *locked_pgdat = NULL;
- struct lruvec *lruvec;
+ struct lruvec *locked_lruvec = NULL;
unsigned long uninitialized_var(flags);
unsigned int uninitialized_var(lock_batch);
@@ -737,19 +736,19 @@ void release_pages(struct page **pages, int nr)
* excessive with a continuous string of pages from the
* same pgdat. The lock is held only if pgdat != NULL.
*/
- if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
- spin_unlock_irqrestore(&locked_pgdat->lruvec.lru_lock, flags);
- locked_pgdat = NULL;
+ if (locked_lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
+ spin_unlock_irqrestore(&locked_lruvec->lru_lock, flags);
+ locked_lruvec = NULL;
}
if (is_huge_zero_page(page))
continue;
if (is_zone_device_page(page)) {
- if (locked_pgdat) {
- spin_unlock_irqrestore(&locked_pgdat->lruvec.lru_lock,
+ if (locked_lruvec) {
+ spin_unlock_irqrestore(&locked_lruvec->lru_lock,
flags);
- locked_pgdat = NULL;
+ locked_lruvec = NULL;
}
/*
* ZONE_DEVICE pages that return 'false' from
@@ -766,27 +765,30 @@ void release_pages(struct page **pages, int nr)
continue;
if (PageCompound(page)) {
- if (locked_pgdat) {
- spin_unlock_irqrestore(&locked_pgdat->lruvec.lru_lock, flags);
- locked_pgdat = NULL;
+ if (locked_lruvec) {
+ spin_unlock_irqrestore(&locked_lruvec->lru_lock, flags);
+ locked_lruvec = NULL;
}
__put_compound_page(page);
continue;
}
if (PageLRU(page)) {
+ struct lruvec *lruvec;
struct pglist_data *pgdat = page_pgdat(page);
- if (pgdat != locked_pgdat) {
- if (locked_pgdat)
- spin_unlock_irqrestore(&locked_pgdat->lruvec.lru_lock,
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
+ if (lruvec != locked_lruvec) {
+ if (locked_lruvec)
+ spin_unlock_irqrestore(&locked_lruvec->lru_lock,
flags);
lock_batch = 0;
- locked_pgdat = pgdat;
- spin_lock_irqsave(&locked_pgdat->lruvec.lru_lock, flags);
+ locked_lruvec = lruvec;
+ spin_lock_irqsave(&locked_lruvec->lru_lock, flags);
+ sync_lruvec_pgdat(lruvec, pgdat);
}
- lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
VM_BUG_ON_PAGE(!PageLRU(page), page);
__ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -798,8 +800,8 @@ void release_pages(struct page **pages, int nr)
list_add(&page->lru, &pages_to_free);
}
- if (locked_pgdat)
- spin_unlock_irqrestore(&locked_pgdat->lruvec.lru_lock, flags);
+ if (locked_lruvec)
+ spin_unlock_irqrestore(&locked_lruvec->lru_lock, flags);
mem_cgroup_uncharge_list(&pages_to_free);
free_unref_page_list(&pages_to_free);
--
1.8.3.1
This patch uses a separate lruvec lock for each page in
__munlock_pagevec().
It also passes the lruvec to __munlock_isolate_lru_page() as a
parameter to avoid a repeated lruvec lookup.
Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: swkhack <[email protected]>
Cc: "Potyra, Stefan" <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/mlock.c | 35 +++++++++++++++++++----------------
1 file changed, 19 insertions(+), 16 deletions(-)
diff --git a/mm/mlock.c b/mm/mlock.c
index 1279684bada0..9915968d490a 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -106,12 +106,10 @@ void mlock_vma_page(struct page *page)
* Isolate a page from LRU with optional get_page() pin.
* Assumes lru_lock already held and page already pinned.
*/
-static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
+static bool __munlock_isolate_lru_page(struct page *page,
+ struct lruvec *lruvec, bool getpage)
{
if (PageLRU(page)) {
- struct lruvec *lruvec;
-
- lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
if (getpage)
get_page(page);
ClearPageLRU(page);
@@ -183,6 +181,9 @@ unsigned int munlock_vma_page(struct page *page)
{
int nr_pages;
pg_data_t *pgdat = page_pgdat(page);
+ struct lruvec *lruvec;
+
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
/* For try_to_munlock() and to serialize with page migration */
BUG_ON(!PageLocked(page));
@@ -194,7 +195,8 @@ unsigned int munlock_vma_page(struct page *page)
* might otherwise copy PageMlocked to part of the tail pages before
* we clear it in the head page. It also stabilizes hpage_nr_pages().
*/
- spin_lock_irq(&pgdat->lruvec.lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
+ sync_lruvec_pgdat(lruvec, pgdat);
if (!TestClearPageMlocked(page)) {
/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
@@ -205,15 +207,15 @@ unsigned int munlock_vma_page(struct page *page)
nr_pages = hpage_nr_pages(page);
__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
- if (__munlock_isolate_lru_page(page, true)) {
- spin_unlock_irq(&pgdat->lruvec.lru_lock);
+ if (__munlock_isolate_lru_page(page, lruvec, true)) {
+ spin_unlock_irq(&lruvec->lru_lock);
__munlock_isolated_page(page);
goto out;
}
__munlock_isolation_failed(page);
unlock_out:
- spin_unlock_irq(&pgdat->lruvec.lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
out:
return nr_pages - 1;
@@ -291,28 +293,30 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
{
int i;
int nr = pagevec_count(pvec);
- int delta_munlocked = -nr;
struct pagevec pvec_putback;
int pgrescued = 0;
pagevec_init(&pvec_putback);
/* Phase 1: page isolation */
- spin_lock_irq(&zone->zone_pgdat->lruvec.lru_lock);
for (i = 0; i < nr; i++) {
struct page *page = pvec->pages[i];
+ pg_data_t *pgdat = page_pgdat(page);
+ struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ spin_lock_irq(&lruvec->lru_lock);
+ sync_lruvec_pgdat(lruvec, pgdat);
if (TestClearPageMlocked(page)) {
/*
* We already have pin from follow_page_mask()
* so we can spare the get_page() here.
*/
- if (__munlock_isolate_lru_page(page, false))
+ if (__munlock_isolate_lru_page(page, lruvec, false)) {
+ __mod_zone_page_state(zone, NR_MLOCK, -1);
+ spin_unlock_irq(&lruvec->lru_lock);
continue;
- else
+ } else
__munlock_isolation_failed(page);
- } else {
- delta_munlocked++;
}
/*
@@ -323,9 +327,8 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
*/
pagevec_add(&pvec_putback, pvec->pages[i]);
pvec->pages[i] = NULL;
+ spin_unlock_irq(&lruvec->lru_lock);
}
- __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
- spin_unlock_irq(&zone->zone_pgdat->lruvec.lru_lock);
/* Now we can release pins of pages that we are not munlocking */
pagevec_release(&pvec_putback);
--
1.8.3.1
Use the per-lruvec lock in activate_page() to replace the pgdat lru_lock.
Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Peng Fan <[email protected]>
Cc: Nikolay Borisov <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/swap.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c
index 2a8fe6df08fc..d2dad08fcfd0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -325,11 +325,14 @@ static inline void activate_page_drain(int cpu)
void activate_page(struct page *page)
{
pg_data_t *pgdat = page_pgdat(page);
+ struct lruvec *lruvec;
page = compound_head(page);
- spin_lock_irq(&pgdat->lruvec.lru_lock);
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ spin_lock_irq(&lruvec->lru_lock);
+ sync_lruvec_pgdat(lruvec, pgdat);
__activate_page(page, mem_cgroup_page_lruvec(page, pgdat), NULL);
- spin_unlock_irq(&pgdat->lruvec.lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
}
#endif
--
1.8.3.1
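A note for readers of the hunks above: sync_lruvec_pgdat() comes from patch 2 of
this series ("lru/memcg: move the lruvec->pgdat sync out lru_lock") and its body is
not shown here. Presumably it only re-establishes the lruvec->pgdat back pointer
that mem_cgroup_page_lruvec() used to maintain, roughly along these lines (an
editor's sketch, not the patch's actual code):

static inline void sync_lruvec_pgdat(struct lruvec *lruvec,
				     struct pglist_data *pgdat)
{
	/* memcg lruvecs need their node back pointer kept up to date */
	if (!mem_cgroup_disabled())
		lruvec->pgdat = pgdat;
}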
On Tue 20-08-19 17:48:23, Alex Shi wrote:
> This patchset move lru_lock into lruvec, give a lru_lock for each of
> lruvec, thus bring a lru_lock for each of memcg.
>
> Per memcg lru_lock would ease the lru_lock contention a lot in
> this patch series.
>
> In some data center, containers are used widely to deploy different kind
> of services, then multiple memcgs share per node pgdat->lru_lock which
> cause heavy lock contentions when doing lru operation.
Having some real world workload numbers would be more than useful
for a non trivial change like this. I believe googlers have tried
something like this in the past but then didn't really have a good
example of workloads that benefit. I might misremember though. Cc Hugh.
--
Michal Hocko
SUSE Labs
On Tue, Aug 20, 2019 at 05:48:24PM +0800, Alex Shi wrote:
> +++ b/include/linux/mmzone.h
> @@ -295,6 +295,9 @@ struct zone_reclaim_stat {
>
> struct lruvec {
> struct list_head lists[NR_LRU_LISTS];
> + /* move lru_lock to per lruvec for memcg */
> + spinlock_t lru_lock;
This comment makes no sense outside the context of this patch.
On Tue, Aug 20, 2019 at 05:48:37PM +0800, Alex Shi wrote:
> @@ -159,7 +159,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
> struct pglist_data;
>
> /*
> - * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
> + * zone->lock and the lru_lock are two of the hottest locks in the kernel.
> * So add a wild amount of padding here to ensure that they fall into separate
> * cachelines. There are very few zone structures in the machine, so space
> * consumption is not a concern here.
But after this patch series, the lru lock is no longer stored in the zone.
So this comment makes no sense.
> @@ -295,7 +295,7 @@ struct zone_reclaim_stat {
>
> struct lruvec {
> struct list_head lists[NR_LRU_LISTS];
> - /* move lru_lock to per lruvec for memcg */
> + /* perf lruvec lru_lock for memcg */
What does the word 'perf' mean here?
On 2019/8/20 9:40, Matthew Wilcox wrote:
> On Tue, Aug 20, 2019 at 05:48:24PM +0800, Alex Shi wrote:
>> +++ b/include/linux/mmzone.h
>> @@ -295,6 +295,9 @@ struct zone_reclaim_stat {
>>
>> struct lruvec {
>> struct list_head lists[NR_LRU_LISTS];
>> + /* move lru_lock to per lruvec for memcg */
>> + spinlock_t lru_lock;
>
> This comment makes no sense outside the context of this patch.
>
Right, thanks for pointing this out. Will remove it in v2.
Thanks
Alex
On 2019/8/20 10:00, Matthew Wilcox wrote:
> On Tue, Aug 20, 2019 at 05:48:37PM +0800, Alex Shi wrote:
>> @@ -159,7 +159,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
>> struct pglist_data;
>>
>> /*
>> - * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
>> + * zone->lock and the lru_lock are two of the hottest locks in the kernel.
>> * So add a wild amount of padding here to ensure that they fall into separate
>> * cachelines. There are very few zone structures in the machine, so space
>> * consumption is not a concern here.
>
> But after this patch series, the lru lock is no longer stored in the zone.
> So this comment makes no sense.
Yes, this needs to be reconsidered. Thanks for pointing it out.
>
>> @@ -295,7 +295,7 @@ struct zone_reclaim_stat {
>>
>> struct lruvec {
>> struct list_head lists[NR_LRU_LISTS];
>> - /* move lru_lock to per lruvec for memcg */
>> + /* perf lruvec lru_lock for memcg */
>
> What does the word 'perf' mean here?
Sorry for the typo; it should be s/perf/per/ here.
Thanks
Alex
On Tue, Aug 20, 2019 at 3:45 AM Michal Hocko <[email protected]> wrote:
>
> On Tue 20-08-19 17:48:23, Alex Shi wrote:
> > This patchset move lru_lock into lruvec, give a lru_lock for each of
> > lruvec, thus bring a lru_lock for each of memcg.
> >
> > Per memcg lru_lock would ease the lru_lock contention a lot in
> > this patch series.
> >
> > In some data center, containers are used widely to deploy different kind
> > of services, then multiple memcgs share per node pgdat->lru_lock which
> > cause heavy lock contentions when doing lru operation.
>
> Having some real world workload numbers would be more than useful
> for a non trivial change like this. I believe googlers have tried
> something like this in the past but then didn't really have a good
> example of workloads that benefit. I might misremember though. Cc Hugh.
>
We, at Google, have been using per-memcg lru locks for more than 7
years. Per-memcg lru locks are really beneficial for providing
performance isolation if there are multiple distinct jobs/memcgs
running on large machines. We are planning to upstream our internal
implementation. I will let Hugh comment on that.
thanks,
Shakeel
On Tue, 20 Aug 2019, Shakeel Butt wrote:
> On Tue, Aug 20, 2019 at 3:45 AM Michal Hocko <[email protected]> wrote:
> > On Tue 20-08-19 17:48:23, Alex Shi wrote:
> > > This patchset move lru_lock into lruvec, give a lru_lock for each of
> > > lruvec, thus bring a lru_lock for each of memcg.
> > >
> > > Per memcg lru_lock would ease the lru_lock contention a lot in
> > > this patch series.
> > >
> > > In some data center, containers are used widely to deploy different kind
> > > of services, then multiple memcgs share per node pgdat->lru_lock which
> > > cause heavy lock contentions when doing lru operation.
> >
> > Having some real world workload numbers would be more than useful
> > for a non trivial change like this. I believe googlers have tried
> > something like this in the past but then didn't really have a good
> > example of workloads that benefit. I might misremember though. Cc Hugh.
> >
>
> We, at Google, have been using per-memcg lru locks for more than 7
> years. Per-memcg lru locks are really beneficial for providing
> performance isolation if there are multiple distinct jobs/memcgs
> running on large machines. We are planning to upstream our internal
> implementation. I will let Hugh comment on that.
Thanks for the Cc Michal. As Shakeel says, Google prodkernel has been
using our per-memcg lru locks for 7 years or so. Yes, we did not come
up with supporting performance data at the time of posting, nor since:
I see Alex has done much better on that (though I haven't even glanced
to see if +s are persuasive).
https://lkml.org/lkml/2012/2/20/434
was how ours was back then; some parts of that went in, then attached
lrulock417.tar is how it was the last time I rebased, to v4.17.
I'll set aside what I'm doing, and switch to rebasing ours to v5.3-rc
and/or mmotm. Then compare with what Alex has, to see if there's any
good reason to prefer one to the other: if no good reason to prefer ours,
I doubt we shall bother to repost, but just use it as basis for helping
to review or improve Alex's.
Hugh
>
> Thanks for the Cc Michal. As Shakeel says, Google prodkernel has been
> using our per-memcg lru locks for 7 years or so. Yes, we did not come
> up with supporting performance data at the time of posting, nor since:
> I see Alex has done much better on that (though I haven't even glanced
> to see if +s are persuasive).
>
> https://lkml.org/lkml/2012/2/20/434
> was how ours was back then; some parts of that went in, then attached
> lrulock417.tar is how it was the last time I rebased, to v4.17.
>
> I'll set aside what I'm doing, and switch to rebasing ours to v5.3-rc
> and/or mmotm. Then compare with what Alex has, to see if there's any
> good reason to prefer one to the other: if no good reason to prefer ours,
> I doubt we shall bother to repost, but just use it as basis for helping
> to review or improve Alex's.
>
Thanks to you all! Very glad to see we are working toward the same goal. :)
Beyond the per-memcg lru_lock, there is still much room for tuning in the lru
and page replacement code. Anyway, hope to see your updates and more review comments soon.
Thanks
Alex
On 2019/8/21 2:24 AM, Hugh Dickins wrote:
> I'll set aside what I'm doing, and switch to rebasing ours to v5.3-rc
> and/or mmotm. Then compare with what Alex has, to see if there's any
> good reason to prefer one to the other: if no good reason to prefer ours,
> I doubt we shall bother to repost, but just use it as basis for helping
> to review or improve Alex's.
For your review, my patchset is pretty straightforward and simple. It just uses the per-lruvec lru_lock to replace the pgdat lru_lock where necessary, nothing more.
We can talk more after I am back at work. :)
Thanks a lot!
Alex
Hi Alex,
On 8/20/19 5:48 AM, Alex Shi wrote:
> In some data center, containers are used widely to deploy different kind
> of services, then multiple memcgs share per node pgdat->lru_lock which
> cause heavy lock contentions when doing lru operation.
>
> On my 2 socket * 6 cores E5-2630 platform, 24 containers run aim9
> simultaneously with mmtests' config:
> # AIM9
> export AIM9_TESTTIME=180
> export AIM9_TESTLIST=page_test,brk_test
>
> perf lock report show much contentions on lru_lock in 20 second snapshot:
> Name acquired contended avg wait (ns) total wait (ns) max wait (ns) min wait (ns)
> &(ptlock_ptr(pag... 22 0 0 0 0 0
> ...
> &(&pgdat->lru_lo... 9 7 12728 89096 26656 1597
This is system-wide right, not per container? Even per container, 89 usec isn't much contention over 20 seconds. You may want to give this a try:
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
It's also synthetic but it stresses lru_lock more than just anon alloc/free. It hits the page activate path, which is where we see this lock in our database, and if enough memory is configured lru_lock also gets stressed during reclaim, similar to [1].
It'd be better though, as Michal suggests, to use the real workload that's causing problems. Where are you seeing contention?
> With this patch series, lruvec->lru_lock show no contentions
> &(&lruvec->lru_l... 8 0 0 0 0 0
>
> and aim9 page_test/brk_test performance increased 5%~50%.
Where does the 50% number come in? The numbers below seem to only show ~4% boost.
> BTW, Detailed results in aim9-pft.compare.log if needed,
> All containers data are increased and pretty steady.
>
> $for i in Max Min Hmean Stddev CoeffVar BHmean-50 BHmean-95 BHmean-99; do echo "========= $i page_test ============"; cat aim9-pft.compare.log | grep "^$i.*page_test" | awk 'BEGIN {a=b=0;} { a+=$3; b+=$6 } END { print "5.3-rc4 " a/24; print "5.3-rc4+lru_lock " b/24}' ; done
> ========= Max page_test ============
> 5.3-rc4 34729.6
> 5.3-rc4+lru_lock 36128.3
> ========= Min page_test ============
> 5.3-rc4 33644.2
> 5.3-rc4+lru_lock 35349.7
> ========= Hmean page_test ============
> 5.3-rc4 34355.4
> 5.3-rc4+lru_lock 35810.9
> ========= Stddev page_test ============
> 5.3-rc4 319.757
> 5.3-rc4+lru_lock 223.324
> ========= CoeffVar page_test ============
> 5.3-rc4 0.93125
> 5.3-rc4+lru_lock 0.623333
> ========= BHmean-50 page_test ============
> 5.3-rc4 34579.2
> 5.3-rc4+lru_lock 35977.1
> ========= BHmean-95 page_test ============
> 5.3-rc4 34421.7
> 5.3-rc4+lru_lock 35853.6
> ========= BHmean-99 page_test ============
> 5.3-rc4 34421.7
> 5.3-rc4+lru_lock 35853.6
>
> $for i in Max Min Hmean Stddev CoeffVar BHmean-50 BHmean-95 BHmean-99; do echo "========= $i brk_test ============"; cat aim9-pft.compare.log | grep "^$i.*brk_test" | awk 'BEGIN {a=b=0;} { a+=$3; b+=$6 } END { print "5.3-rc4 " a/24; print "5.3-rc4+lru_lock " b/24}' ; done
> ========= Max brk_test ============
> 5.3-rc4 96647.7
> 5.3-rc4+lru_lock 98960.3
> ========= Min brk_test ============
> 5.3-rc4 91800.8
> 5.3-rc4+lru_lock 96817.6
> ========= Hmean brk_test ============
> 5.3-rc4 95470
> 5.3-rc4+lru_lock 97769.6
> ========= Stddev brk_test ============
> 5.3-rc4 1253.52
> 5.3-rc4+lru_lock 596.593
> ========= CoeffVar brk_test ============
> 5.3-rc4 1.31375
> 5.3-rc4+lru_lock 0.609583
> ========= BHmean-50 brk_test ============
> 5.3-rc4 96141.4
> 5.3-rc4+lru_lock 98194
> ========= BHmean-95 brk_test ============
> 5.3-rc4 95818.5
> 5.3-rc4+lru_lock 97857.2
> ========= BHmean-99 brk_test ============
> 5.3-rc4 95818.5
> 5.3-rc4+lru_lock 97857.2
[1] https://lore.kernel.org/linux-mm/CABdVr8R2y9B+2zzSAT_Ve=BQCa+F+E9_kVH+C28DGpkeQitiog@mail.gmail.com/
On 2019/8/22 2:00 AM, Daniel Jordan wrote:
>>
>
> This is system-wide right, not per container? Even per container, 89 usec isn't much contention over 20 seconds. You may want to give this a try:
Yes, perf lock shows the host-wide info.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
> It's also synthetic but it stresses lru_lock more than just anon alloc/free. It hits the page activate path, which is where we see this lock in our database, and if enough memory is configured lru_lock also gets stressed during reclaim, similar to [1].
Thanks for sharing. This patchset cannot help the [1] case, since it only relieves the per-container lock contention for now. Yes, the readtwice case could be more sensitive to these lru_lock changes in containers. I may try to run it in containers with some tuning. But anyway, aim9 is also pretty good at showing the problem and the solution. :)
>
> It'd be better though, as Michal suggests, to use the real workload that's causing problems. Where are you seeing contention?
We repeatedly create and delete a lot of different containers according to the servers' load/usage, so a normal workload can cause lots of page allocation/removal. aim9 reflects part of those scenarios. I don't know the DB scenario yet.
>
>> With this patch series, lruvec->lru_lock show no contentions
>> &(&lruvec->lru_l... 8 0 0 0 0 0
>>
>> and aim9 page_test/brk_test performance increased 5%~50%.
>
> Where does the 50% number come in? The numbers below seem to only show ~4% boost.
The Stddev/CoeffVar cases improve by about 50%. One container's mmtests result is as follows:
Stddev page_test 245.15 ( 0.00%) 189.29 ( 22.79%)
Stddev brk_test 1258.60 ( 0.00%) 629.16 ( 50.01%)
CoeffVar page_test 0.71 ( 0.00%) 0.53 ( 26.05%)
CoeffVar brk_test 1.32 ( 0.00%) 0.64 ( 51.14%)
On 8/22/19 7:56 AM, Alex Shi wrote:
> On 2019/8/22 2:00 AM, Daniel Jordan wrote:
>> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
>> It's also synthetic but it stresses lru_lock more than just anon alloc/free. It hits the page activate path, which is where we see this lock in our database, and if enough memory is configured lru_lock also gets stressed during reclaim, similar to [1].
>
> Thanks for sharing. This patchset cannot help the [1] case, since it only relieves the per-container lock contention for now.
I should've been clearer. [1] is meant as an example of someone suffering from lru_lock during reclaim. Wouldn't your series help per-memcg reclaim?
> Yes, the readtwice case could be more sensitive to these lru_lock changes in containers. I may try to run it in containers with some tuning. But anyway, aim9 is also pretty good at showing the problem and the solution. :)
>>
>> It'd be better though, as Michal suggests, to use the real workload that's causing problems. Where are you seeing contention?
>
> We repeatedly create and delete a lot of different containers according to the servers' load/usage, so a normal workload can cause lots of page allocation/removal.
I think numbers from that scenario would help your case.
> aim9 reflects part of those scenarios. I don't know the DB scenario yet.
We see it during DB shutdown when each DB process frees its memory (zap_pte_range -> mark_page_accessed). But that's a different thing, clearly Not This Series.
>>> With this patch series, lruvec->lru_lock show no contentions
>>> &(&lruvec->lru_l... 8 0 0 0 0 0
>>>
>>> and aim9 page_test/brk_test performance increased 5%~50%.
>>
>> Where does the 50% number come in? The numbers below seem to only show ~4% boost.
>
> The Stddev/CoeffVar cases improve by about 50%. One container's mmtests result is as follows:
>
> Stddev page_test 245.15 ( 0.00%) 189.29 ( 22.79%)
> Stddev brk_test 1258.60 ( 0.00%) 629.16 ( 50.01%)
> CoeffVar page_test 0.71 ( 0.00%) 0.53 ( 26.05%)
> CoeffVar brk_test 1.32 ( 0.00%) 0.64 ( 51.14%)
Aha. 50% decrease in stdev.
On Wed, 21 Aug 2019, Alex Shi wrote:
> On 2019/8/21 2:24 AM, Hugh Dickins wrote:
> > I'll set aside what I'm doing, and switch to rebasing ours to v5.3-rc
> > and/or mmotm. Then compare with what Alex has, to see if there's any
> > good reason to prefer one to the other: if no good reason to prefer ours,
> > I doubt we shall bother to repost, but just use it as basis for helping
> > to review or improve Alex's.
>
> For your review, my patchset is pretty straightforward and simple.
> It just uses the per-lruvec lru_lock to replace the pgdat lru_lock where necessary,
> nothing more. We can talk more after I am back at work. :)
Sorry to be bearer of bad news, Alex, but when you said "straight and
simple", I feared that your patchset would turn out to be fundamentally
too simple.
And that is so. I have only to see the
lruvec = mem_cgroup_page_lruvec(page, pgdat);
line in isolate_migratepages_block() in mm/compaction.c, and check
that mem_cgroup_page_lruvec() is little changed in mm/mempolicy.c.
The central problem with per-memcg lru_lock is that you do not know
for sure what lock to take (which memcg a page belongs to) until you
have taken the expected lock, and then checked whether page->memcg
is still the same - backing out and trying again if not.
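A minimal sketch of the relock-and-recheck pattern described here, assuming a
hypothetical lock_page_lruvec_irq() helper (an editor's illustration, not code from
either patchset):

static struct lruvec *lock_page_lruvec_irq(struct page *page)
{
	struct lruvec *lruvec;

	for (;;) {
		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
		spin_lock_irq(&lruvec->lru_lock);
		/* recheck under the lock: the page may have changed memcg */
		if (lruvec == mem_cgroup_page_lruvec(page, page_pgdat(page)))
			return lruvec;	/* correct lock is now held */
		/* raced with a memcg change; back out and try again */
		spin_unlock_irq(&lruvec->lru_lock);
	}
}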
Fix that central problem, and you end up with a more complicated
patchset, much like ours. It's true that when ours was first developed,
the memcg situation was more complicated in several ways, and perhaps
some aspects of our patchset could be simplified now (though I've not
identified any). Johannes in particular has done a great deal of
simplifying work in memcg over the last few years, but there are still
situations in which a page's memcg can change (move_charge_at_immigrate
and swapin readahead spring to mind - or perhaps the latter is only an
issue when MEMCG_SWAP is not enabled, I forget; and I often wonder if
reparenting will be brought back one day).
I did not review your patchset in detail, and wasn't able to get very
far in testing it. At first I was put off by set_task_reclaim_state
warnings from mm/vmscan.c, but those turned out to be in v5.3-rc5
itself, not from your patchset or mine (but I've not yet investigated
what's responsible for them). Within a minute of starting swapping
load, kcompactd compact_lock_irqsave() in isolate_migratepages_block()
would deadlock, and I did not get further. (Though I did also notice
that booting the CONFIG_MEMCG=y kernel with "cgroup_disable=memory"
froze in booting - tiresomely, one has to keep both the memcg and
no-memcg locking to cope with that case, and I guess you had not.)
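One way the memcg and no-memcg cases can share a single call site is a small
dispatch helper; a hedged sketch (hypothetical helper, relying on the lruvec that
patch 1 of this series embeds in the pgdat):

static struct lruvec *page_lruvec(struct page *page)
{
	pg_data_t *pgdat = page_pgdat(page);

	/* "cgroup_disable=memory": fall back to the node-level lruvec */
	if (mem_cgroup_disabled())
		return &pgdat->lruvec;
	return mem_cgroup_page_lruvec(page, pgdat);
}

so that spin_lock_irq(&page_lruvec(page)->lru_lock) works in both configurations.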
Rather than duplicating effort, I would advise you to give our patchset
a try, and if it works for you, help towards getting that one merged:
but of course, it's up to you.
I've attached a tarfile of it rebased to v5.3-rc5: I do not want to
spam the list with patches yet, because I do not have any stats or
argument in support of the series, as Andrew asked for years ago and
Michal asks again now. But aside from that I consider it ready, and
will let Shakeel take it over from here, while I get back to what I
diverted from (but of course I'll try to answer questions on it).
Thanks,
Hugh
On 22/08/2019 18.20, Daniel Jordan wrote:
> On 8/22/19 7:56 AM, Alex Shi wrote:
>> On 2019/8/22 2:00 AM, Daniel Jordan wrote:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
>>> It's also synthetic but it stresses lru_lock more than just anon alloc/free. It hits the page activate path, which is where we see this
>>> lock in our database, and if enough memory is configured lru_lock also gets stressed during reclaim, similar to [1].
>>
>> Thanks for sharing. This patchset cannot help the [1] case, since it only relieves the per-container lock contention for now.
>
> I should've been clearer. [1] is meant as an example of someone suffering from lru_lock during reclaim. Wouldn't your series help
> per-memcg reclaim?
>
>> Yes, the readtwice case could be more sensitive to these lru_lock changes in containers. I may try to run it in containers with some tuning.
>> But anyway, aim9 is also pretty good at showing the problem and the solution. :)
>>>
>>> It'd be better though, as Michal suggests, to use the real workload that's causing problems. Where are you seeing contention?
>>
>> We repeatedly create and delete a lot of different containers according to the servers' load/usage, so a normal workload can cause lots of page
>> allocation/removal.
>
> I think numbers from that scenario would help your case.
>
>> aim9 reflects part of those scenarios. I don't know the DB scenario yet.
>
> We see it during DB shutdown when each DB process frees its memory (zap_pte_range -> mark_page_accessed). But that's a different thing,
> clearly Not This Series.
>
>>>> With this patch series, lruvec->lru_lock show no contentions
>>>> &(&lruvec->lru_l... 8 0 0 0 0 0
>>>>
>>>> and aim9 page_test/brk_test performance increased 5%~50%.
>>>
>>> Where does the 50% number come in? The numbers below seem to only show ~4% boost.
>> The Stddev/CoeffVar cases improve by about 50%. One container's mmtests result is as follows:
>>
>> Stddev page_test 245.15 ( 0.00%) 189.29 ( 22.79%)
>> Stddev brk_test 1258.60 ( 0.00%) 629.16 ( 50.01%)
>> CoeffVar page_test 0.71 ( 0.00%) 0.53 ( 26.05%)
>> CoeffVar brk_test 1.32 ( 0.00%) 0.64 ( 51.14%)
>
> Aha. 50% decrease in stdev.
>
After splitting the lru locks, the present per-cpu page vectors do not work
so well because they mix pages from different cgroups.
pagevec_lru_move_fn and friends need a better implementation:
either sorting pages or splitting the vectors on a per-lruvec basis.
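One possible shape of the "splitting the vectors on a per-lruvec basis" idea, as an
editor's sketch with a hypothetical helper name (it also omits the page->memcg
re-check discussed elsewhere in this thread): hold the current lru_lock across
consecutive pages that belong to the same lruvec, and only switch locks when the
lruvec changes.

static void pagevec_lru_move_fn_per_lruvec(struct pagevec *pvec,
		void (*move_fn)(struct page *page, struct lruvec *lruvec))
{
	struct lruvec *locked = NULL;
	unsigned long flags;
	int i;

	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];
		struct lruvec *lruvec = mem_cgroup_page_lruvec(page,
							page_pgdat(page));

		/* drop/retake the lock only when the lruvec changes */
		if (lruvec != locked) {
			if (locked)
				spin_unlock_irqrestore(&locked->lru_lock,
						       flags);
			locked = lruvec;
			spin_lock_irqsave(&locked->lru_lock, flags);
		}
		move_fn(page, lruvec);
	}
	if (locked)
		spin_unlock_irqrestore(&locked->lru_lock, flags);
	release_pages(pvec->pages, pagevec_count(pvec));
	pagevec_reinit(pvec);
}

Sorting the pagevec by lruvec up front would keep the batching effective even when
pages from many cgroups are interleaved.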
On 20/08/2019 12.48, Alex Shi wrote:
> Now we repeatedly assign lruvec->pgdat in memcg. Will remove the
> assignment in the lruvec getting function after every point is protected.
>
> Signed-off-by: Alex Shi <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Vladimir Davydov <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> ---
> mm/memcontrol.c | 12 +++++-------
> 1 file changed, 5 insertions(+), 7 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e8a1b0d95ba8..19fd911e8098 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2550,12 +2550,12 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
> static void lock_page_lru(struct page *page, int *isolated)
> {
> pg_data_t *pgdat = page_pgdat(page);
> + struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
What protects the lruvec from being freed at this point?
After the lruvec is resolved, the page could be moved and the cgroup deleted.
In these old patches I used RCU for that: https://lkml.org/lkml/2012/2/20/276
The pointer to the lruvec should be resolved with irqs disabled.
Not sure that still works these days. (A sketch of where such protection would
sit follows the quoted hunk below.)
> - spin_lock_irq(&pgdat->lruvec.lru_lock);
> + spin_lock_irq(&lruvec->lru_lock);
> + sync_lruvec_pgdat(lruvec, pgdat);
> if (PageLRU(page)) {
> - struct lruvec *lruvec;
>
> - lruvec = mem_cgroup_page_lruvec(page, pgdat);
> ClearPageLRU(page);
> del_page_from_lru_list(page, lruvec, page_lru(page));
> *isolated = 1;
> @@ -2566,16 +2566,14 @@ static void lock_page_lru(struct page *page, int *isolated)
> static void unlock_page_lru(struct page *page, int isolated)
> {
> pg_data_t *pgdat = page_pgdat(page);
> + struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
> if (isolated) {
> - struct lruvec *lruvec;
> -
> - lruvec = mem_cgroup_page_lruvec(page, pgdat);
> VM_BUG_ON_PAGE(PageLRU(page), page);
> SetPageLRU(page);
> add_page_to_lru_list(page, lruvec, page_lru(page));
> }
> - spin_unlock_irq(&pgdat->lruvec.lru_lock);
> + spin_unlock_irq(&lruvec->lru_lock);
> }
>
> static void commit_charge(struct page *page, struct mem_cgroup *memcg,
>
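Picking up the lifetime question raised above: an editor's sketch (hypothetical
helper, not from any posted patch) of how the lookup could be pinned with RCU and
then verified once the lock is held, so the memcg backing the lruvec cannot go away
in between:

static struct lruvec *lock_page_lruvec_irqsave(struct page *page,
					       unsigned long *flagsp)
{
	struct lruvec *lruvec;
	unsigned long flags;

	rcu_read_lock();
	for (;;) {
		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
		spin_lock_irqsave(&lruvec->lru_lock, flags);
		/* the page may have moved to another memcg before we locked */
		if (likely(lruvec == mem_cgroup_page_lruvec(page,
							    page_pgdat(page))))
			break;
		spin_unlock_irqrestore(&lruvec->lru_lock, flags);
	}
	rcu_read_unlock();
	*flagsp = flags;
	return lruvec;
}

Whether RCU alone keeps the lruvec alive here is exactly the open question above;
the sketch only shows where the protection has to sit.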
On 2019/8/26 4:30 PM, Konstantin Khlebnikov wrote:
>>
>>
>
> What protects lruvec from freeing at this point?
> After reading resolving lruvec page could be moved and cgroup deleted.
>
> In this old patches I've used RCU for that: https://lkml.org/lkml/2012/2/20/276
> Pointer to lruvec should be resolved under disabled irq.
> Not sure this works these days.
Thanks for the reminder! I will reconsider this point and come up with changes.
Thanks
Alex
On 2019/8/24 9:59 AM, Hugh Dickins wrote:
> On Wed, 21 Aug 2019, Alex Shi wrote:
>> On 2019/8/21 2:24 AM, Hugh Dickins wrote:
>>> I'll set aside what I'm doing, and switch to rebasing ours to v5.3-rc
>>> and/or mmotm. Then compare with what Alex has, to see if there's any
>>> good reason to prefer one to the other: if no good reason to prefer ours,
>>> I doubt we shall bother to repost, but just use it as basis for helping
>>> to review or improve Alex's.
>>
>> For your review, my patchset is pretty straightforward and simple.
>> It just uses the per-lruvec lru_lock to replace the pgdat lru_lock where necessary,
>> nothing more. We can talk more after I am back at work. :)
>
> Sorry to be bearer of bad news, Alex, but when you said "straight and
> simple", I feared that your patchset would turn out to be fundamentally
> too simple.
>
> And that is so. I have only to see the
> lruvec = mem_cgroup_page_lruvec(page, pgdat);
> line in isolate_migratepages_block() in mm/compaction.c, and check
> that mem_cgroup_page_lruvec() is little changed in mm/mempolicy.c.
>
> The central problem with per-memcg lru_lock is that you do not know
> for sure what lock to take (which memcg a page belongs to) until you
> have taken the expected lock, and then checked whether page->memcg
> is still the same - backing out and trying again if not.
>
> Fix that central problem, and you end up with a more complicated
> patchset, much like ours. It's true that when ours was first developed,
> the memcg situation was more complicated in several ways, and perhaps
> some aspects of our patchset could be simplified now (though I've not
> identified any). Johannes in particular has done a great deal of
> simplifying work in memcg over the last few years, but there are still
> situations in which a page's memcg can change (move_charge_at_immigrate
> and swapin readahead spring to mind - or perhaps the latter is only an
> issue when MEMCG_SWAP is not enabled, I forget; and I often wonder if
> reparenting will be brought back one day).
>
> I did not review your patchset in detail, and wasn't able to get very
> far in testing it. At first I was put off by set_task_reclaim_state
> warnings from mm/vmscan.c, but those turned out to be in v5.3-rc5
> itself, not from your patchset or mine (but I've not yet investigated
> what's responsible for them). Within a minute of starting swapping
> load, kcompactd compact_lock_irqsave() in isolate_migratepages_block()
> would deadlock, and I did not get further. (Though I did also notice
> that booting the CONFIG_MEMCG=y kernel with "cgroup_disable=memory"
> froze in booting - tiresomely, one has to keep both the memcg and
> no-memcg locking to cope with that case, and I guess you had not.)
>
> Rather than duplicating effort, I would advise you to give our patchset
> a try, and if it works for you, help towards getting that one merged:
> but of course, it's up to you.
Thanks a lot for all the info and reminders! Yes, the page->memcg change would be a problem. I will study your patchset and try to merge it.
>
> I've attached a tarfile of it rebased to v5.3-rc5: I do not want to
> spam the list with patches yet, because I do not have any stats or
> argument in support of the series, as Andrew asked for years ago and
> Michal asks again now. But aside from that I consider it ready, and
> will let Shakeel take it over from here, while I get back to what I
> diverted from (but of course I'll try to answer questions on it).
>
I will try to look into them. Thanks for your kind offer. :)
Thanks!
Alex
On 2019/8/26 4:39 PM, Konstantin Khlebnikov wrote:
>>>
> because they mix pages from different cgroups.
>
> pagevec_lru_move_fn and friends need a better implementation:
> either sorting pages or splitting the vectors on a per-lruvec basis.
Right, that should be the next step to improve. Maybe we could try per-lruvec pagevecs?
Thanks
Alex
On 2019/8/22 11:20 PM, Daniel Jordan wrote:
>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
>>> It's also synthetic but it stresses lru_lock more than just anon alloc/free. It hits the page activate path, which is where we see this lock in our database, and if enough memory is configured lru_lock also gets stressed during reclaim, similar to [1].
>>
>> Thanks for sharing. This patchset cannot help the [1] case, since it only relieves the per-container lock contention for now.
>
> I should've been clearer. [1] is meant as an example of someone suffering from lru_lock during reclaim. Wouldn't your series help per-memcg reclaim?
Yes, I got your point. Since aim9 doesn't show much improvement, I am trying this case in containers.
Thanks
Alex