2018-01-31 23:24:34

by Daniel Jordan

Subject: [RFC PATCH v1 00/13] lru_lock scalability

lru_lock, a per-node* spinlock that protects an LRU list, is one of the
hottest locks in the kernel. On some workloads on large machines, it
shows up at the top of lock_stat.

One way to improve lru_lock scalability is to introduce an array of locks,
with each lock protecting certain batches of LRU pages.

*ooooooooooo**ooooooooooo**ooooooooooo**oooo ...
|           ||           ||           ||
 \ batch 1 /  \ batch 2 /  \ batch 3 /

In this ASCII depiction of an LRU, a page is represented with either '*'
or 'o'. An asterisk indicates a sentinel page, which is a page at the
edge of a batch. An 'o' indicates a non-sentinel page.

To remove a non-sentinel LRU page, only one lock from the array is
required. This allows multiple threads to remove pages from different
batches simultaneously. A sentinel page requires lru_lock in addition to
a lock from the array.
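As a rough sketch of the removal-side rule (illustrative only: remove_page_batched() is a made-up helper, and the real interfaces added later in the series also remember which locks they already hold across loop iterations):

/*
 * Remove @page from its LRU under the batch locking scheme.  A
 * non-sentinel page needs only its batch's spinlock; a sentinel page
 * additionally needs the node's lru_lock.
 */
static void remove_page_batched(struct pglist_data *pgdat, struct page *page)
{
        spinlock_t *batch = &pgdat->lru_batch_locks[page->lru_batch].lock;

        spin_lock_irq(batch);
        if (page->lru_sentinel)
                spin_lock(&pgdat->lru_lock);

        del_page_from_lru_list(page, mem_cgroup_page_lruvec(page, pgdat),
                               page_lru(page));

        if (page->lru_sentinel)
                spin_unlock(&pgdat->lru_lock);
        spin_unlock_irq(batch);
}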

Full performance numbers appear in the last patch in this series, but this
prototype allows a microbenchmark to do up to 28% more page faults per
second with 16 or more concurrent processes.

This work was developed in collaboration with Steve Sistare.

Note: This is an early prototype. I'm submitting it now to support my
request to attend LSF/MM, as well as get early feedback on the idea. Any
comments appreciated.


* lru_lock is actually per-memcg, but without memcg's in the picture it
becomes per-node.


Aaron Lu (1):
mm: add a percpu_pagelist_batch sysctl interface

Daniel Jordan (12):
mm: allow compaction to be disabled
mm: add lock array to pgdat and batch fields to struct page
mm: introduce struct lru_list_head in lruvec to hold per-LRU batch
info
mm: add batching logic to add/delete/move API's
mm: add lru_[un]lock_all APIs
mm: convert to-be-refactored lru_lock callsites to lock-all API
mm: temporarily convert lru_lock callsites to lock-all API
mm: introduce add-only version of pagevec_lru_move_fn
mm: add LRU batch lock API's
mm: use lru_batch locking in release_pages
mm: split up release_pages into non-sentinel and sentinel passes
mm: splice local lists onto the front of the LRU

include/linux/mm_inline.h | 209 +++++++++++++++++++++++++++++++++++++++++++++-
include/linux/mm_types.h | 5 ++
include/linux/mmzone.h | 25 +++++-
kernel/sysctl.c | 9 ++
mm/Kconfig | 1 -
mm/huge_memory.c | 6 +-
mm/memcontrol.c | 5 +-
mm/mlock.c | 11 +--
mm/mmzone.c | 7 +-
mm/page_alloc.c | 43 +++++++++-
mm/page_idle.c | 4 +-
mm/swap.c | 208 ++++++++++++++++++++++++++++++++++++---------
mm/vmscan.c | 49 +++++------
13 files changed, 500 insertions(+), 82 deletions(-)

--
2.16.1



2018-01-31 23:25:14

by Daniel Jordan

Subject: [RFC PATCH v1 04/13] mm: introduce struct lru_list_head in lruvec to hold per-LRU batch info

Add information about the first and last LRU batches in struct lruvec.

lruvec's list_head is replaced with a pseudo struct page to avoid
special-casing LRU batch handling at the front or back of the LRU. This
pseudo page has its own lru_batch and lru_sentinel fields so that the
same code that deals with "inner" LRU pages (i.e. neither the first nor
the last page) can deal with the first and last pages.
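For illustration, the delete path that the batching logic (added later in this series) uses looks roughly like the sketch below; because the pseudo page is permanently a sentinel, the same neighbor inspection works even when the page being removed is the first or last real page on the list:

/*
 * Simplified sketch of the batched delete: both list neighbors are
 * treated as struct pages, which is only valid at the list edges
 * because the head is itself embedded in a pseudo struct page.
 */
static void del_page_sketch(struct page *page)
{
        struct page *left  = container_of(page->lru.prev, struct page, lru);
        struct page *right = container_of(page->lru.next, struct page, lru);

        if (page->lru_sentinel) {
                /* Keep both edges of the affected batches marked. */
                left->lru_sentinel = true;
                right->lru_sentinel = true;
        }
        list_del(&page->lru);
}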

Signed-off-by: Daniel Jordan <[email protected]>
---
include/linux/mm_inline.h | 4 ++--
include/linux/mmzone.h | 13 ++++++++++++-
mm/mmzone.c | 7 +++++--
mm/swap.c | 2 +-
mm/vmscan.c | 4 ++--
5 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index c30b32e3c862..d7fc46ebc33b 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -48,14 +48,14 @@ static __always_inline void add_page_to_lru_list(struct page *page,
struct lruvec *lruvec, enum lru_list lru)
{
update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
- list_add(&page->lru, &lruvec->lists[lru]);
+ list_add(&page->lru, lru_head(&lruvec->lists[lru]));
}

static __always_inline void add_page_to_lru_list_tail(struct page *page,
struct lruvec *lruvec, enum lru_list lru)
{
update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
- list_add_tail(&page->lru, &lruvec->lists[lru]);
+ list_add_tail(&page->lru, lru_head(&lruvec->lists[lru]));
}

static __always_inline void del_page_from_lru_list(struct page *page,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5ffb36b3f665..feca75b8f492 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -18,6 +18,7 @@
#include <linux/pageblock-flags.h>
#include <linux/page-flags-layout.h>
#include <linux/atomic.h>
+#include <linux/mm_types.h>
#include <asm/page.h>

/* Free memory management - zoned buddy allocator. */
@@ -232,8 +233,18 @@ struct zone_reclaim_stat {
unsigned long recent_scanned[2];
};

+#define lru_head(lru_list_head) (&(lru_list_head)->pseudo_page.lru)
+
+struct lru_list_head {
+ struct page pseudo_page;
+ unsigned first_batch_npages;
+ unsigned first_batch_tag;
+ unsigned last_batch_npages;
+ unsigned last_batch_tag;
+};
+
struct lruvec {
- struct list_head lists[NR_LRU_LISTS];
+ struct lru_list_head lists[NR_LRU_LISTS];
struct zone_reclaim_stat reclaim_stat;
/* Evictions & activations on the inactive file list */
atomic_long_t inactive_age;
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 4686fdc23bb9..c39fc6af3f13 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -92,8 +92,11 @@ void lruvec_init(struct lruvec *lruvec)

memset(lruvec, 0, sizeof(struct lruvec));

- for_each_lru(lru)
- INIT_LIST_HEAD(&lruvec->lists[lru]);
+ for_each_lru(lru) {
+ INIT_LIST_HEAD(lru_head(&lruvec->lists[lru]));
+ lruvec->lists[lru].pseudo_page.lru_sentinel = true;
+ lruvec->lists[lru].pseudo_page.lru_batch = NUM_LRU_BATCH_LOCKS;
+ }
}

#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS)
diff --git a/mm/swap.c b/mm/swap.c
index 38e1b6374a97..286636bb6a4f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -561,7 +561,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
* The page's writeback ends up during pagevec
* We moves tha page into tail of inactive.
*/
- list_move_tail(&page->lru, &lruvec->lists[lru]);
+ list_move_tail(&page->lru, lru_head(&lruvec->lists[lru]));
__count_vm_event(PGROTATED);
}

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 47d5ced51f2d..aa629c4720dd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1511,7 +1511,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
unsigned long *nr_scanned, struct scan_control *sc,
isolate_mode_t mode, enum lru_list lru)
{
- struct list_head *src = &lruvec->lists[lru];
+ struct list_head *src = lru_head(&lruvec->lists[lru]);
unsigned long nr_taken = 0;
unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
unsigned long nr_skipped[MAX_NR_ZONES] = { 0, };
@@ -1943,7 +1943,7 @@ static unsigned move_active_pages_to_lru(struct lruvec *lruvec,

nr_pages = hpage_nr_pages(page);
update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
- list_move(&page->lru, &lruvec->lists[lru]);
+ list_move(&page->lru, lru_head(&lruvec->lists[lru]));

if (put_page_testzero(page)) {
__ClearPageLRU(page);
--
2.16.1


2018-01-31 23:26:10

by Daniel Jordan

Subject: [RFC PATCH v1 12/13] mm: split up release_pages into non-sentinel and sentinel passes

A common case in release_pages is for the pages in the 'pages' array to
be in roughly the same order as they appear on their LRU. With LRU batch
locking, when a sentinel page is removed, an adjacent non-sentinel page
must be promoted to a sentinel page to follow the locking scheme. Since
the next page to be removed is then often the newly promoted sentinel,
nearly every page in the 'pages' array can end up being treated as a
sentinel page, hurting the scalability of this approach.

To address this, split up release_pages into non-sentinel and sentinel
passes so that the non-sentinel pages can be locked with an LRU batch
lock before the sentinel pages are removed.

For the prototype, just use a bitmap and a temporary outer loop to
implement this.
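The intended control flow, stripped of the other page handling in
release_pages, is roughly the outline below (illustrative only;
release_one_page() stands in for the existing per-page body):

/*
 * Pass 0: handle non-sentinel LRU pages (and non-LRU pages) under batch
 *         locks only, noting sentinel pages in a bitmap.
 * Pass 1: handle the deferred sentinel pages, which also take lru_lock.
 */
static void release_pages_two_pass(struct page **pages, int nr)
{
        DECLARE_BITMAP(lru_bitmap, LRU_BITMAP_SIZE);
        int pass, i;

        bitmap_zero(lru_bitmap, nr);

        for (pass = 0; pass < 2; pass++) {
                for (i = 0; i < nr; i++) {
                        struct page *page = pages[i];

                        if (pass == 0 && PageLRU(page) && page->lru_sentinel) {
                                bitmap_set(lru_bitmap, i, 1);
                                continue;
                        }
                        if (pass == 1 && !test_bit(i, lru_bitmap))
                                continue;

                        release_one_page(page);
                }
        }
}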

Performance numbers from a single microbenchmark at this point in the
series are included in the next patch.

Signed-off-by: Daniel Jordan <[email protected]>
---
mm/swap.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/mm/swap.c b/mm/swap.c
index fae766e035a4..a302224293ad 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -731,6 +731,7 @@ void lru_add_drain_all(void)
put_online_cpus();
}

+#define LRU_BITMAP_SIZE 512
/**
* release_pages - batched put_page()
* @pages: array of pages to release
@@ -742,16 +743,32 @@ void lru_add_drain_all(void)
*/
void release_pages(struct page **pages, int nr)
{
- int i;
+ int h, i;
LIST_HEAD(pages_to_free);
struct pglist_data *locked_pgdat = NULL;
spinlock_t *locked_lru_batch = NULL;
struct lruvec *lruvec;
unsigned long uninitialized_var(flags);
+ DECLARE_BITMAP(lru_bitmap, LRU_BITMAP_SIZE);
+
+ VM_BUG_ON(nr > LRU_BITMAP_SIZE);

+ bitmap_zero(lru_bitmap, nr);
+
+ for (h = 0; h < 2; h++) {
for (i = 0; i < nr; i++) {
struct page *page = pages[i];

+ if (h == 0) {
+ if (PageLRU(page) && page->lru_sentinel) {
+ bitmap_set(lru_bitmap, i, 1);
+ continue;
+ }
+ } else {
+ if (!test_bit(i, lru_bitmap))
+ continue;
+ }
+
if (is_huge_zero_page(page))
continue;

@@ -798,6 +815,7 @@ void release_pages(struct page **pages, int nr)

list_add(&page->lru, &pages_to_free);
}
+ }
if (locked_lru_batch) {
lru_batch_unlock(NULL, &locked_lru_batch, &locked_pgdat,
&flags);
--
2.16.1


2018-01-31 23:27:56

by Daniel Jordan

Subject: [RFC PATCH v1 13/13] mm: splice local lists onto the front of the LRU

Now that release_pages is scaling better with concurrent removals from
the LRU, the performance results (included below) showed increased
contention on lru_lock in the add-to-LRU path.

To alleviate some of this contention, do more work outside the LRU lock.
Prepare a local list of pages to be spliced onto the front of the LRU,
including setting PageLRU in each page, before taking lru_lock. Since
other threads use this page flag in certain checks outside lru_lock,
ensure each page's LRU links have been properly initialized before
setting the flag, and use memory barriers accordingly.
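The ordering requirement amounts to the following pairing (condensed
sketch; CPU0 is __pagevec_lru_add(), CPU1 is any path that tests PageLRU
before touching page->lru, e.g. pagevec_move_tail_fn()):

/*
 * CPU0 (__pagevec_lru_add)            CPU1 (e.g. pagevec_move_tail_fn)
 *
 * list_add(&page->lru, &local_list);
 * smp_wmb();                          if (PageLRU(page)) {
 * SetPageLRU(page);                           smp_rmb();
 *                                             del_page_from_lru_list(page, ...);
 *                                     }
 */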

Performance Results

This is a will-it-scale run of page_fault1 using 4 different kernels.

kernel            kern #
4.15-rc2               1
large-zone-batch       2
lru-lock-base          3
lru-lock-splice        4

Each kernel builds on the last. The first is a baseline, the second
makes zone->lock more scalable by increasing an order-0 per-cpu
pagelist's 'batch' and 'high' values to 310 and 1860 respectively
(courtesy of Aaron Lu's patch), the third scales lru_lock without
splicing pages (the previous patch in this series), and the fourth adds
page splicing (this patch).

N tasks mmap, fault, and munmap anonymous pages in a loop until the test
time has elapsed.

The process case generally does better than the thread case most likely
because of mmap_sem acting as a bottleneck. There's ongoing work
upstream[*] to scale this lock, however, and once it goes in, my
hypothesis is the thread numbers here will improve.

kern #  ntask      proc       thr        proc     stdev         thr    stdev
                speedup   speedup       pgf/s                 pgf/s
     1      1                         705,533     1,644     705,227    1,122
     2      1      2.5%      2.8%     722,912       453     724,807      728
     3      1      2.6%      2.6%     724,215       653     723,213      941
     4      1      2.3%      2.8%     721,746       272     724,944      728

kern #  ntask      proc       thr        proc     stdev         thr    stdev
                speedup   speedup       pgf/s                 pgf/s
     1      4                       2,525,487     7,428   1,973,616   12,568
     2      4      2.6%      7.6%   2,590,699     6,968   2,123,570   10,350
     3      4      2.3%      4.4%   2,584,668    12,833   2,059,822   10,748
     4      4      4.7%      5.2%   2,643,251    13,297   2,076,808    9,506

kern #  ntask      proc       thr        proc     stdev         thr    stdev
                speedup   speedup       pgf/s                 pgf/s
     1     16                       6,444,656    20,528   3,226,356   32,874
     2     16      1.9%     10.4%   6,566,846    20,803   3,560,437   64,019
     3     16     18.3%      6.8%   7,624,749    58,497   3,447,109   67,734
     4     16     28.2%      2.5%   8,264,125    31,677   3,306,679   69,443

kern #  ntask      proc       thr        proc     stdev         thr    stdev
                speedup   speedup       pgf/s                 pgf/s
     1     32                      11,564,988    32,211   2,456,507   38,898
     2     32      1.8%      1.5%  11,777,119    45,418   2,494,064   27,964
     3     32     16.1%     -2.7%  13,426,746    94,057   2,389,934   40,186
     4     32     26.2%      1.2%  14,593,745    28,121   2,486,059   42,004

kern #  ntask      proc       thr        proc     stdev         thr    stdev
                speedup   speedup       pgf/s                 pgf/s
     1     64                      12,080,629    33,676   2,443,043   61,973
     2     64      3.9%      9.9%  12,551,136   206,202   2,684,632   69,483
     3     64     15.0%     -3.8%  13,892,933   351,657   2,351,232   67,875
     4     64     21.9%      1.8%  14,728,765    64,945   2,485,940   66,839

[*] https://lwn.net/Articles/724502/ Range reader/writer locks
https://lwn.net/Articles/744188/ Speculative page faults

Signed-off-by: Daniel Jordan <[email protected]>
---
mm/memcontrol.c | 1 +
mm/mlock.c | 1 +
mm/swap.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
mm/vmscan.c | 1 +
4 files changed, 112 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 99a54df760e3..6911626f29b2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2077,6 +2077,7 @@ static void lock_page_lru(struct page *page, int *isolated)

lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
ClearPageLRU(page);
+ smp_rmb(); /* Pairs with smp_wmb in __pagevec_lru_add */
del_page_from_lru_list(page, lruvec, page_lru(page));
*isolated = 1;
} else
diff --git a/mm/mlock.c b/mm/mlock.c
index 6ba6a5887aeb..da294c5bbc2c 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -109,6 +109,7 @@ static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
if (getpage)
get_page(page);
ClearPageLRU(page);
+ smp_rmb(); /* Pairs with smp_wmb in __pagevec_lru_add */
del_page_from_lru_list(page, lruvec, page_lru(page));
return true;
}
diff --git a/mm/swap.c b/mm/swap.c
index a302224293ad..46a98dc8e9ad 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -220,6 +220,7 @@ static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
int *pgmoved = arg;

if (PageLRU(page) && !PageUnevictable(page)) {
+ smp_rmb(); /* Pairs with smp_wmb in __pagevec_lru_add */
del_page_from_lru_list(page, lruvec, page_lru(page));
ClearPageActive(page);
add_page_to_lru_list_tail(page, lruvec, page_lru(page));
@@ -277,6 +278,7 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
int file = page_is_file_cache(page);
int lru = page_lru_base_type(page);

+ smp_rmb(); /* Pairs with smp_wmb in __pagevec_lru_add */
del_page_from_lru_list(page, lruvec, lru);
SetPageActive(page);
lru += LRU_ACTIVE;
@@ -544,6 +546,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
file = page_is_file_cache(page);
lru = page_lru_base_type(page);

+ smp_rmb(); /* Pairs with smp_wmb in __pagevec_lru_add */
del_page_from_lru_list(page, lruvec, lru + active);
ClearPageActive(page);
ClearPageReferenced(page);
@@ -578,6 +581,7 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
!PageSwapCache(page) && !PageUnevictable(page)) {
bool active = PageActive(page);

+ smp_rmb(); /* Pairs with smp_wmb in __pagevec_lru_add */
del_page_from_lru_list(page, lruvec,
LRU_INACTIVE_ANON + active);
ClearPageActive(page);
@@ -903,6 +907,60 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
trace_mm_lru_insertion(page, lru);
}

+#define MAX_LRU_SPLICES 4
+
+struct lru_splice {
+ struct list_head list;
+ struct lruvec *lruvec;
+ enum lru_list lru;
+ int nid;
+ int zid;
+ size_t nr_pages;
+};
+
+/*
+ * Adds a page to a local list for splicing, or else to the singletons
+ * list for individual processing.
+ *
+ * Returns the new number of splices in the splices list.
+ */
+size_t add_page_to_lru_splice(struct lru_splice *splices, size_t nr_splices,
+ struct list_head *singletons, struct page *page)
+{
+ int i;
+ enum lru_list lru = page_lru(page);
+ enum zone_type zid = page_zonenum(page);
+ int nid = page_to_nid(page);
+ struct lruvec *lruvec;
+
+ VM_BUG_ON_PAGE(PageLRU(page), page);
+
+ lruvec = mem_cgroup_page_lruvec(page, NODE_DATA(nid));
+
+ for (i = 0; i < nr_splices; ++i) {
+ if (splices[i].lruvec == lruvec && splices[i].zid == zid) {
+ list_add(&page->lru, &splices[i].list);
+ splices[i].nr_pages += hpage_nr_pages(page);
+ return nr_splices;
+ }
+ }
+
+ if (nr_splices < MAX_LRU_SPLICES) {
+ INIT_LIST_HEAD(&splices[nr_splices].list);
+ splices[nr_splices].lruvec = lruvec;
+ splices[nr_splices].lru = lru;
+ splices[nr_splices].nid = nid;
+ splices[nr_splices].zid = zid;
+ splices[nr_splices].nr_pages = hpage_nr_pages(page);
+ list_add(&page->lru, &splices[nr_splices].list);
+ ++nr_splices;
+ } else {
+ list_add(&page->lru, singletons);
+ }
+
+ return nr_splices;
+}
+
/*
* Add the passed pages to the LRU, then drop the caller's refcount
* on them. Reinitialises the caller's pagevec.
@@ -911,12 +969,59 @@ void __pagevec_lru_add(struct pagevec *pvec)
{
int i;
struct pglist_data *pgdat = NULL;
- struct lruvec *lruvec;
unsigned long flags = 0;
+ struct lru_splice splices[MAX_LRU_SPLICES];
+ size_t nr_splices = 0;
+ LIST_HEAD(singletons);
+ struct page *page, *next;

- for (i = 0; i < pagevec_count(pvec); i++) {
- struct page *page = pvec->pages[i];
- struct pglist_data *pagepgdat = page_pgdat(page);
+ /*
+ * Sort the pages into local lists to splice onto the LRU once we
+ * hold lru_lock. In the common case there should be few of these
+ * local lists.
+ */
+ for (i = 0; i < pagevec_count(pvec); ++i) {
+ page = pvec->pages[i];
+ nr_splices = add_page_to_lru_splice(splices, nr_splices,
+ &singletons, page);
+ }
+
+ /*
+ * Paired with read barriers where we check PageLRU and modify
+ * page->lru, for example pagevec_move_tail_fn.
+ */
+ smp_wmb();
+
+ for (i = 0; i < pagevec_count(pvec); i++)
+ SetPageLRU(pvec->pages[i]);
+
+ for (i = 0; i < nr_splices; ++i) {
+ struct lru_splice *s = &splices[i];
+ struct pglist_data *splice_pgdat = NODE_DATA(s->nid);
+
+ if (splice_pgdat != pgdat) {
+ if (pgdat)
+ spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ pgdat = splice_pgdat;
+ spin_lock_irqsave(&pgdat->lru_lock, flags);
+ }
+
+ update_lru_size(s->lruvec, s->lru, s->zid, s->nr_pages);
+ list_splice(&s->list, lru_head(&s->lruvec->lists[s->lru]));
+ update_page_reclaim_stat(s->lruvec, is_file_lru(s->lru),
+ is_active_lru(s->lru));
+ /* XXX add splice tracepoint */
+ }
+
+ while (!list_empty(&singletons)) {
+ struct pglist_data *pagepgdat;
+ struct lruvec *lruvec;
+ struct list_head *list;
+
+ list = singletons.next;
+ page = list_entry(list, struct page, lru);
+ list_del(list);
+ pagepgdat = page_pgdat(page);

if (pagepgdat != pgdat) {
if (pgdat)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7f5ff0bb133f..338850ad03a6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1629,6 +1629,7 @@ int isolate_lru_page(struct page *page)
int lru = page_lru(page);
get_page(page);
ClearPageLRU(page);
+ smp_rmb(); /* Pairs with smp_wmb in __pagevec_lru_add */
del_page_from_lru_list(page, lruvec, lru);
ret = 0;
}
--
2.16.1


2018-01-31 23:28:17

by Daniel Jordan

Subject: [RFC PATCH v1 09/13] mm: introduce add-only version of pagevec_lru_move_fn

For the purposes of this prototype, copy the body of pagevec_lru_move_fn
into __pagevec_lru_add so that it can be modified to use the batch
locking API while leaving all other callers of pagevec_lru_move_fn
unaffected.

Signed-off-by: Daniel Jordan <[email protected]>
---
mm/swap.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/mm/swap.c b/mm/swap.c
index cf6a59f2cad6..2bb28fcb7cc0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -902,7 +902,29 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
*/
void __pagevec_lru_add(struct pagevec *pvec)
{
- pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL);
+ int i;
+ struct pglist_data *pgdat = NULL;
+ struct lruvec *lruvec;
+ unsigned long flags = 0;
+
+ for (i = 0; i < pagevec_count(pvec); i++) {
+ struct page *page = pvec->pages[i];
+ struct pglist_data *pagepgdat = page_pgdat(page);
+
+ if (pagepgdat != pgdat) {
+ if (pgdat)
+ spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ pgdat = pagepgdat;
+ spin_lock_irqsave(&pgdat->lru_lock, flags);
+ }
+
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ __pagevec_lru_add_fn(page, lruvec, NULL);
+ }
+ if (pgdat)
+ spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ release_pages(pvec->pages, pvec->nr);
+ pagevec_reinit(pvec);
}
EXPORT_SYMBOL(__pagevec_lru_add);

--
2.16.1


2018-01-31 23:52:06

by Daniel Jordan

Subject: [RFC PATCH v1 10/13] mm: add LRU batch lock API's

Add the LRU batch locking API's themselves. This adds the final piece
of infrastructure necessary for locking batches on an LRU list.

The API's lock a specific page on the LRU list, taking only the
appropriate LRU batch lock for a non-sentinel page and taking the
node's/memcg's lru_lock in addition for a sentinel page.

These interfaces are designed for performance: they minimize the number
of times we needlessly drop and then reacquire the same lock(s) when
used in a loop. They're difficult to use but will do for a prototype.
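For reference, the intended usage pattern when looping over an array of
pages looks roughly like this (simplified sketch of what a later patch
does in release_pages(); non-LRU pages and the final accounting are
omitted, and drop_pages_from_lru() is a made-up name):

static void drop_pages_from_lru(struct page **pages, int nr)
{
        struct pglist_data *locked_pgdat = NULL;
        spinlock_t *locked_lru_batch = NULL;
        unsigned long flags;
        int i;

        for (i = 0; i < nr; i++) {
                struct page *page = pages[i];
                struct lruvec *lruvec;

                /* Drops only what the next page's batch doesn't need. */
                if (locked_lru_batch)
                        lru_batch_unlock(page, &locked_lru_batch,
                                         &locked_pgdat, &flags);
                lru_batch_lock(page, &locked_lru_batch, &locked_pgdat,
                               &flags);

                lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
                __ClearPageLRU(page);
                del_page_from_lru_list(page, lruvec, page_off_lru(page));
        }
        if (locked_lru_batch)
                lru_batch_unlock(NULL, &locked_lru_batch, &locked_pgdat,
                                 &flags);
}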

Signed-off-by: Daniel Jordan <[email protected]>
---
include/linux/mm_inline.h | 58 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 58 insertions(+)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 1f1657c75b1b..11d9fcf93f2b 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -210,6 +210,64 @@ static __always_inline void lru_unlock_all(struct pglist_data *pgdat,
local_irq_enable();
}

+static __always_inline spinlock_t *page_lru_batch_lock(struct page *page)
+{
+ return &page_pgdat(page)->lru_batch_locks[page->lru_batch].lock;
+}
+
+/**
+ * lru_batch_lock - lock an LRU list batch
+ */
+static __always_inline void lru_batch_lock(struct page *page,
+ spinlock_t **locked_lru_batch,
+ struct pglist_data **locked_pgdat,
+ unsigned long *flags)
+{
+ spinlock_t *lru_batch = page_lru_batch_lock(page);
+ struct pglist_data *pgdat = page_pgdat(page);
+
+ VM_BUG_ON(*locked_pgdat && !page->lru_sentinel);
+
+ if (lru_batch != *locked_lru_batch) {
+ VM_BUG_ON(*locked_pgdat);
+ VM_BUG_ON(*locked_lru_batch);
+ spin_lock_irqsave(lru_batch, *flags);
+ *locked_lru_batch = lru_batch;
+ if (page->lru_sentinel) {
+ spin_lock(&pgdat->lru_lock);
+ *locked_pgdat = pgdat;
+ }
+ } else if (!*locked_pgdat && page->lru_sentinel) {
+ spin_lock(&pgdat->lru_lock);
+ *locked_pgdat = pgdat;
+ }
+}
+
+/**
+ * lru_batch_unlock - unlock an LRU list batch
+ */
+static __always_inline void lru_batch_unlock(struct page *page,
+ spinlock_t **locked_lru_batch,
+ struct pglist_data **locked_pgdat,
+ unsigned long *flags)
+{
+ spinlock_t *lru_batch = (page) ? page_lru_batch_lock(page) : NULL;
+
+ VM_BUG_ON(!*locked_lru_batch);
+
+ if (lru_batch != *locked_lru_batch) {
+ if (*locked_pgdat) {
+ spin_unlock(&(*locked_pgdat)->lru_lock);
+ *locked_pgdat = NULL;
+ }
+ spin_unlock_irqrestore(*locked_lru_batch, *flags);
+ *locked_lru_batch = NULL;
+ } else if (*locked_pgdat && !page->lru_sentinel) {
+ spin_unlock(&(*locked_pgdat)->lru_lock);
+ *locked_pgdat = NULL;
+ }
+}
+
/**
* page_lru_base_type - which LRU list type should a page be on?
* @page: the page to test
--
2.16.1


2018-01-31 23:52:08

by Daniel Jordan

Subject: [RFC PATCH v1 08/13] mm: temporarily convert lru_lock callsites to lock-all API

These callsites will be converted to use the lru_batch_locks in a later
series; for now, temporarily switch them to the heavy lock-all API.

Signed-off-by: Daniel Jordan <[email protected]>
---
mm/swap.c | 18 ++++++++----------
mm/vmscan.c | 4 ++--
2 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index c4ca7e1c7c03..cf6a59f2cad6 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -62,12 +62,12 @@ static void __page_cache_release(struct page *page)
struct lruvec *lruvec;
unsigned long flags;

- spin_lock_irqsave(zone_lru_lock(zone), flags);
+ lru_lock_all(zone->zone_pgdat, &flags);
lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
VM_BUG_ON_PAGE(!PageLRU(page), page);
__ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_off_lru(page));
- spin_unlock_irqrestore(zone_lru_lock(zone), flags);
+ lru_unlock_all(zone->zone_pgdat, &flags);
}
__ClearPageWaiters(page);
mem_cgroup_uncharge(page);
@@ -758,7 +758,7 @@ void release_pages(struct page **pages, int nr)
* same pgdat. The lock is held only if pgdat != NULL.
*/
if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
- spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+ lru_unlock_all(locked_pgdat, &flags);
locked_pgdat = NULL;
}

@@ -768,8 +768,7 @@ void release_pages(struct page **pages, int nr)
/* Device public page can not be huge page */
if (is_device_public_page(page)) {
if (locked_pgdat) {
- spin_unlock_irqrestore(&locked_pgdat->lru_lock,
- flags);
+ lru_unlock_all(locked_pgdat, &flags);
locked_pgdat = NULL;
}
put_zone_device_private_or_public_page(page);
@@ -782,7 +781,7 @@ void release_pages(struct page **pages, int nr)

if (PageCompound(page)) {
if (locked_pgdat) {
- spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+ lru_unlock_all(locked_pgdat, &flags);
locked_pgdat = NULL;
}
__put_compound_page(page);
@@ -794,11 +793,10 @@ void release_pages(struct page **pages, int nr)

if (pgdat != locked_pgdat) {
if (locked_pgdat)
- spin_unlock_irqrestore(&locked_pgdat->lru_lock,
- flags);
+ lru_unlock_all(locked_pgdat, &flags);
lock_batch = 0;
locked_pgdat = pgdat;
- spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
+ lru_lock_all(locked_pgdat, &flags);
}

lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
@@ -814,7 +812,7 @@ void release_pages(struct page **pages, int nr)
list_add(&page->lru, &pages_to_free);
}
if (locked_pgdat)
- spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+ lru_unlock_all(locked_pgdat, &flags);

mem_cgroup_uncharge_list(&pages_to_free);
free_unref_page_list(&pages_to_free);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b893200a397d..7f5ff0bb133f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1623,7 +1623,7 @@ int isolate_lru_page(struct page *page)
struct zone *zone = page_zone(page);
struct lruvec *lruvec;

- spin_lock_irq(zone_lru_lock(zone));
+ lru_lock_all(zone->zone_pgdat, NULL);
lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
if (PageLRU(page)) {
int lru = page_lru(page);
@@ -1632,7 +1632,7 @@ int isolate_lru_page(struct page *page)
del_page_from_lru_list(page, lruvec, lru);
ret = 0;
}
- spin_unlock_irq(zone_lru_lock(zone));
+ lru_unlock_all(zone->zone_pgdat, NULL);
}
return ret;
}
--
2.16.1


2018-01-31 23:53:11

by Daniel Jordan

Subject: [RFC PATCH v1 11/13] mm: use lru_batch locking in release_pages

Introduce LRU batch locking in release_pages. This is the code path
where I see lru_lock contention most often, so this is the one I used in
this prototype.

Signed-off-by: Daniel Jordan <[email protected]>
---
mm/swap.c | 45 +++++++++++++++++----------------------------
1 file changed, 17 insertions(+), 28 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 2bb28fcb7cc0..fae766e035a4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -745,31 +745,21 @@ void release_pages(struct page **pages, int nr)
int i;
LIST_HEAD(pages_to_free);
struct pglist_data *locked_pgdat = NULL;
+ spinlock_t *locked_lru_batch = NULL;
struct lruvec *lruvec;
unsigned long uninitialized_var(flags);
- unsigned int uninitialized_var(lock_batch);

for (i = 0; i < nr; i++) {
struct page *page = pages[i];

- /*
- * Make sure the IRQ-safe lock-holding time does not get
- * excessive with a continuous string of pages from the
- * same pgdat. The lock is held only if pgdat != NULL.
- */
- if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
- lru_unlock_all(locked_pgdat, &flags);
- locked_pgdat = NULL;
- }
-
if (is_huge_zero_page(page))
continue;

/* Device public page can not be huge page */
if (is_device_public_page(page)) {
- if (locked_pgdat) {
- lru_unlock_all(locked_pgdat, &flags);
- locked_pgdat = NULL;
+ if (locked_lru_batch) {
+ lru_batch_unlock(NULL, &locked_lru_batch,
+ &locked_pgdat, &flags);
}
put_zone_device_private_or_public_page(page);
continue;
@@ -780,26 +770,23 @@ void release_pages(struct page **pages, int nr)
continue;

if (PageCompound(page)) {
- if (locked_pgdat) {
- lru_unlock_all(locked_pgdat, &flags);
- locked_pgdat = NULL;
+ if (locked_lru_batch) {
+ lru_batch_unlock(NULL, &locked_lru_batch,
+ &locked_pgdat, &flags);
}
__put_compound_page(page);
continue;
}

if (PageLRU(page)) {
- struct pglist_data *pgdat = page_pgdat(page);
-
- if (pgdat != locked_pgdat) {
- if (locked_pgdat)
- lru_unlock_all(locked_pgdat, &flags);
- lock_batch = 0;
- locked_pgdat = pgdat;
- lru_lock_all(locked_pgdat, &flags);
+ if (locked_lru_batch) {
+ lru_batch_unlock(page, &locked_lru_batch,
+ &locked_pgdat, &flags);
}
+ lru_batch_lock(page, &locked_lru_batch, &locked_pgdat,
+ &flags);

- lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
+ lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
VM_BUG_ON_PAGE(!PageLRU(page), page);
__ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -811,8 +798,10 @@ void release_pages(struct page **pages, int nr)

list_add(&page->lru, &pages_to_free);
}
- if (locked_pgdat)
- lru_unlock_all(locked_pgdat, &flags);
+ if (locked_lru_batch) {
+ lru_batch_unlock(NULL, &locked_lru_batch, &locked_pgdat,
+ &flags);
+ }

mem_cgroup_uncharge_list(&pages_to_free);
free_unref_page_list(&pages_to_free);
--
2.16.1


2018-01-31 23:53:13

by Daniel Jordan

Subject: [RFC PATCH v1 03/13] mm: add lock array to pgdat and batch fields to struct page

This patch simply adds the array of locks and struct page fields.
Ignore for now where the struct page fields are: we need to find a place
to put them that doesn't enlarge the struct.

Signed-off-by: Daniel Jordan <[email protected]>
---
include/linux/mm_types.h | 5 +++++
include/linux/mmzone.h | 7 +++++++
mm/page_alloc.c | 3 +++
3 files changed, 15 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index cfd0ac4e5e0e..6e9d26f0cecf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -190,6 +190,11 @@ struct page {
struct kmem_cache *slab_cache; /* SL[AU]B: Pointer to slab */
};

+ struct {
+ unsigned lru_batch;
+ bool lru_sentinel;
+ };
+
#ifdef CONFIG_MEMCG
struct mem_cgroup *mem_cgroup;
#endif
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c05529473b80..5ffb36b3f665 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -249,6 +249,11 @@ struct lruvec {
#define LRU_ALL_ANON (BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON))
#define LRU_ALL ((1 << NR_LRU_LISTS) - 1)

+#define NUM_LRU_BATCH_LOCKS 32
+struct lru_batch_lock {
+ spinlock_t lock;
+} ____cacheline_aligned_in_smp;
+
/* Isolate unmapped file */
#define ISOLATE_UNMAPPED ((__force isolate_mode_t)0x2)
/* Isolate for asynchronous migration */
@@ -715,6 +720,8 @@ typedef struct pglist_data {

unsigned long flags;

+ struct lru_batch_lock lru_batch_locks[NUM_LRU_BATCH_LOCKS];
+
ZONE_PADDING(_pad2_)

/* Per-node vmstats */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d7078ed68b01..3248b48e11ca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6070,6 +6070,7 @@ static unsigned long __paginginit calc_memmap_size(unsigned long spanned_pages,
*/
static void __paginginit free_area_init_core(struct pglist_data *pgdat)
{
+ size_t i;
enum zone_type j;
int nid = pgdat->node_id;

@@ -6092,6 +6093,8 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
pgdat_page_ext_init(pgdat);
spin_lock_init(&pgdat->lru_lock);
lruvec_init(node_lruvec(pgdat));
+ for (i = 0; i < NUM_LRU_BATCH_LOCKS; ++i)
+ spin_lock_init(&pgdat->lru_batch_locks[i].lock);

pgdat->per_cpu_nodestats = &boot_nodestats;

--
2.16.1


2018-01-31 23:53:20

by Daniel Jordan

Subject: [RFC PATCH v1 07/13] mm: convert to-be-refactored lru_lock callsites to lock-all API

Use the heavy locking API for now so we can focus on the path being
measured to prove the concept: the release_pages path. In that path,
LRU batch locking will be used; everywhere else keeps the heavy locking.

For now, exclude compaction since this would be a nontrivial
refactoring. We can deal with that in a future series.

Signed-off-by: Daniel Jordan <[email protected]>
---
mm/huge_memory.c | 6 +++---
mm/memcontrol.c | 4 ++--
mm/mlock.c | 10 +++++-----
mm/page_idle.c | 4 ++--
mm/swap.c | 10 +++++-----
mm/vmscan.c | 38 +++++++++++++++++++-------------------
6 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0e7ded98d114..787ad5ba55bb 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2461,7 +2461,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
spin_unlock(&head->mapping->tree_lock);
}

- spin_unlock_irqrestore(zone_lru_lock(page_zone(head)), flags);
+ lru_unlock_all(page_zone(head)->zone_pgdat, &flags);

unfreeze_page(head);

@@ -2661,7 +2661,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
lru_add_drain();

/* prevent PageLRU to go away from under us, and freeze lru stats */
- spin_lock_irqsave(zone_lru_lock(page_zone(head)), flags);
+ lru_lock_all(page_zone(head)->zone_pgdat, &flags);

if (mapping) {
void **pslot;
@@ -2709,7 +2709,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
spin_unlock(&pgdata->split_queue_lock);
fail: if (mapping)
spin_unlock(&mapping->tree_lock);
- spin_unlock_irqrestore(zone_lru_lock(page_zone(head)), flags);
+ lru_unlock_all(page_zone(head)->zone_pgdat, &flags);
unfreeze_page(head);
ret = -EBUSY;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ac2ffd5e02b9..99a54df760e3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2071,7 +2071,7 @@ static void lock_page_lru(struct page *page, int *isolated)
{
struct zone *zone = page_zone(page);

- spin_lock_irq(zone_lru_lock(zone));
+ lru_lock_all(zone->zone_pgdat, NULL);
if (PageLRU(page)) {
struct lruvec *lruvec;

@@ -2095,7 +2095,7 @@ static void unlock_page_lru(struct page *page, int isolated)
SetPageLRU(page);
add_page_to_lru_list(page, lruvec, page_lru(page));
}
- spin_unlock_irq(zone_lru_lock(zone));
+ lru_unlock_all(zone->zone_pgdat, NULL);
}

static void commit_charge(struct page *page, struct mem_cgroup *memcg,
diff --git a/mm/mlock.c b/mm/mlock.c
index 30472d438794..6ba6a5887aeb 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -188,7 +188,7 @@ unsigned int munlock_vma_page(struct page *page)
* might otherwise copy PageMlocked to part of the tail pages before
* we clear it in the head page. It also stabilizes hpage_nr_pages().
*/
- spin_lock_irq(zone_lru_lock(zone));
+ lru_lock_all(zone->zone_pgdat, NULL);

if (!TestClearPageMlocked(page)) {
/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
@@ -200,14 +200,14 @@ unsigned int munlock_vma_page(struct page *page)
__mod_zone_page_state(zone, NR_MLOCK, -nr_pages);

if (__munlock_isolate_lru_page(page, true)) {
- spin_unlock_irq(zone_lru_lock(zone));
+ lru_unlock_all(zone->zone_pgdat, NULL);
__munlock_isolated_page(page);
goto out;
}
__munlock_isolation_failed(page);

unlock_out:
- spin_unlock_irq(zone_lru_lock(zone));
+ lru_unlock_all(zone->zone_pgdat, NULL);

out:
return nr_pages - 1;
@@ -292,7 +292,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
pagevec_init(&pvec_putback);

/* Phase 1: page isolation */
- spin_lock_irq(zone_lru_lock(zone));
+ lru_lock_all(zone->zone_pgdat, NULL);
for (i = 0; i < nr; i++) {
struct page *page = pvec->pages[i];

@@ -319,7 +319,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
pvec->pages[i] = NULL;
}
__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
- spin_unlock_irq(zone_lru_lock(zone));
+ lru_unlock_all(zone->zone_pgdat, NULL);

/* Now we can release pins of pages that we are not munlocking */
pagevec_release(&pvec_putback);
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 0a49374e6931..3324527c1c34 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -42,12 +42,12 @@ static struct page *page_idle_get_page(unsigned long pfn)
return NULL;

zone = page_zone(page);
- spin_lock_irq(zone_lru_lock(zone));
+ lru_lock_all(zone->zone_pgdat, NULL);
if (unlikely(!PageLRU(page))) {
put_page(page);
page = NULL;
}
- spin_unlock_irq(zone_lru_lock(zone));
+ lru_unlock_all(zone->zone_pgdat, NULL);
return page;
}

diff --git a/mm/swap.c b/mm/swap.c
index 67eb89fc9435..c4ca7e1c7c03 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -200,16 +200,16 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,

if (pagepgdat != pgdat) {
if (pgdat)
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ lru_unlock_all(pgdat, &flags);
pgdat = pagepgdat;
- spin_lock_irqsave(&pgdat->lru_lock, flags);
+ lru_lock_all(pgdat, &flags);
}

lruvec = mem_cgroup_page_lruvec(page, pgdat);
(*move_fn)(page, lruvec, arg);
}
if (pgdat)
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ lru_unlock_all(pgdat, &flags);
release_pages(pvec->pages, pvec->nr);
pagevec_reinit(pvec);
}
@@ -330,9 +330,9 @@ void activate_page(struct page *page)
struct zone *zone = page_zone(page);

page = compound_head(page);
- spin_lock_irq(zone_lru_lock(zone));
+ lru_lock_all(zone->zone_pgdat, NULL);
__activate_page(page, mem_cgroup_page_lruvec(page, zone->zone_pgdat), NULL);
- spin_unlock_irq(zone_lru_lock(zone));
+ lru_unlock_all(zone->zone_pgdat, NULL);
}
#endif

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b4c32a65a40f..b893200a397d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1691,9 +1691,9 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
VM_BUG_ON_PAGE(PageLRU(page), page);
list_del(&page->lru);
if (unlikely(!page_evictable(page))) {
- spin_unlock_irq(&pgdat->lru_lock);
+ lru_unlock_all(pgdat, NULL);
putback_lru_page(page);
- spin_lock_irq(&pgdat->lru_lock);
+ lru_lock_all(pgdat, NULL);
continue;
}

@@ -1714,10 +1714,10 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
del_page_from_lru_list(page, lruvec, lru);

if (unlikely(PageCompound(page))) {
- spin_unlock_irq(&pgdat->lru_lock);
+ lru_unlock_all(pgdat, NULL);
mem_cgroup_uncharge(page);
(*get_compound_page_dtor(page))(page);
- spin_lock_irq(&pgdat->lru_lock);
+ lru_lock_all(pgdat, NULL);
} else
list_add(&page->lru, &pages_to_free);
}
@@ -1779,7 +1779,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
if (!sc->may_unmap)
isolate_mode |= ISOLATE_UNMAPPED;

- spin_lock_irq(&pgdat->lru_lock);
+ lru_lock_all(pgdat, NULL);

nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
&nr_scanned, sc, isolate_mode, lru);
@@ -1798,7 +1798,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
count_memcg_events(lruvec_memcg(lruvec), PGSCAN_DIRECT,
nr_scanned);
}
- spin_unlock_irq(&pgdat->lru_lock);
+ lru_unlock_all(pgdat, NULL);

if (nr_taken == 0)
return 0;
@@ -1806,7 +1806,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
&stat, false);

- spin_lock_irq(&pgdat->lru_lock);
+ lru_lock_all(pgdat, NULL);

if (current_is_kswapd()) {
if (global_reclaim(sc))
@@ -1824,7 +1824,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,

__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);

- spin_unlock_irq(&pgdat->lru_lock);
+ lru_unlock_all(pgdat, NULL);

mem_cgroup_uncharge_list(&page_list);
free_unref_page_list(&page_list);
@@ -1951,10 +1951,10 @@ static unsigned move_active_pages_to_lru(struct lruvec *lruvec,
del_page_from_lru_list(page, lruvec, lru);

if (unlikely(PageCompound(page))) {
- spin_unlock_irq(&pgdat->lru_lock);
+ lru_unlock_all(pgdat, NULL);
mem_cgroup_uncharge(page);
(*get_compound_page_dtor(page))(page);
- spin_lock_irq(&pgdat->lru_lock);
+ lru_lock_all(pgdat, NULL);
} else
list_add(&page->lru, pages_to_free);
} else {
@@ -1995,7 +1995,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
if (!sc->may_unmap)
isolate_mode |= ISOLATE_UNMAPPED;

- spin_lock_irq(&pgdat->lru_lock);
+ lru_lock_all(pgdat, NULL);

nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
&nr_scanned, sc, isolate_mode, lru);
@@ -2006,7 +2006,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
__count_vm_events(PGREFILL, nr_scanned);
count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);

- spin_unlock_irq(&pgdat->lru_lock);
+ lru_unlock_all(pgdat, NULL);

while (!list_empty(&l_hold)) {
cond_resched();
@@ -2051,7 +2051,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
/*
* Move pages back to the lru list.
*/
- spin_lock_irq(&pgdat->lru_lock);
+ lru_lock_all(pgdat, NULL);
/*
* Count referenced pages from currently used mappings as rotated,
* even though only some of them are actually re-activated. This
@@ -2063,7 +2063,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
nr_activate = move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
nr_deactivate = move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
- spin_unlock_irq(&pgdat->lru_lock);
+ lru_unlock_all(pgdat, NULL);

mem_cgroup_uncharge_list(&l_hold);
free_unref_page_list(&l_hold);
@@ -2306,7 +2306,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES) +
lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, MAX_NR_ZONES);

- spin_lock_irq(&pgdat->lru_lock);
+ lru_lock_all(pgdat, NULL);
if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
reclaim_stat->recent_scanned[0] /= 2;
reclaim_stat->recent_rotated[0] /= 2;
@@ -2327,7 +2327,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,

fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
fp /= reclaim_stat->recent_rotated[1] + 1;
- spin_unlock_irq(&pgdat->lru_lock);
+ lru_unlock_all(pgdat, NULL);

fraction[0] = ap;
fraction[1] = fp;
@@ -3978,9 +3978,9 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages)
pgscanned++;
if (pagepgdat != pgdat) {
if (pgdat)
- spin_unlock_irq(&pgdat->lru_lock);
+ lru_unlock_all(pgdat, NULL);
pgdat = pagepgdat;
- spin_lock_irq(&pgdat->lru_lock);
+ lru_lock_all(pgdat, NULL);
}
lruvec = mem_cgroup_page_lruvec(page, pgdat);

@@ -4001,7 +4001,7 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages)
if (pgdat) {
__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
- spin_unlock_irq(&pgdat->lru_lock);
+ lru_unlock_all(pgdat, NULL);
}
}
#endif /* CONFIG_SHMEM */
--
2.16.1


2018-01-31 23:53:23

by Daniel Jordan

[permalink] [raw]
Subject: [RFC PATCH v1 06/13] mm: add lru_[un]lock_all APIs

Add heavy locking API's for the few cases that a thread needs exclusive
access to an LRU list. This locks lru_lock as well as every lock in
lru_batch_locks.

This API will be used often at first, in scaffolding code, to ease the
transition from using lru_lock to the batch locking scheme. Later it
will be rarely needed.

Signed-off-by: Daniel Jordan <[email protected]>
---
include/linux/mm_inline.h | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index ec8b966a1c76..1f1657c75b1b 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -178,6 +178,38 @@ static __always_inline void move_page_to_lru_list_tail(struct page *page,
__add_page_to_lru_list_tail(page, lruvec, lru);
}

+static __always_inline void lru_lock_all(struct pglist_data *pgdat,
+ unsigned long *flags)
+{
+ size_t i;
+
+ if (flags)
+ local_irq_save(*flags);
+ else
+ local_irq_disable();
+
+ for (i = 0; i < NUM_LRU_BATCH_LOCKS; ++i)
+ spin_lock(&pgdat->lru_batch_locks[i].lock);
+
+ spin_lock(&pgdat->lru_lock);
+}
+
+static __always_inline void lru_unlock_all(struct pglist_data *pgdat,
+ unsigned long *flags)
+{
+ int i;
+
+ spin_unlock(&pgdat->lru_lock);
+
+ for (i = NUM_LRU_BATCH_LOCKS - 1; i >= 0; --i)
+ spin_unlock(&pgdat->lru_batch_locks[i].lock);
+
+ if (flags)
+ local_irq_restore(*flags);
+ else
+ local_irq_enable();
+}
+
/**
* page_lru_base_type - which LRU list type should a page be on?
* @page: the page to test
--
2.16.1


2018-01-31 23:53:34

by Daniel Jordan

Subject: [RFC PATCH v1 05/13] mm: add batching logic to add/delete/move API's

Change the add/delete/move LRU API's in mm_inline.h to account for LRU
batching. Now when a page is added to the front of the LRU, it's
assigned a batch number that's used to decide which spinlock in the
lru_batch_lock array to take when removing that page from the LRU. Each
newly-added page is also unconditionally made a sentinel page.

As more pages are added to the front of an LRU, the same batch number is
used for each until a threshold is reached, at which point a batch is
ready and the sentinel bits are unset in all but the first and last pages
of the batch. This allows those inner pages to be removed with a batch
lock rather than the heavier lru_lock.
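Pictorially, additions at the front of an LRU evolve like this
(illustration only; '*' is a sentinel page, 'o' a non-sentinel page):

/*
 * While the current batch fills, every new page gets the same tag and is
 * provisionally a sentinel:
 *
 *     front -> *******  | older, sealed batches ...
 *
 * Once LRU_BATCH_MAX pages carry that tag, the inner sentinel bits are
 * cleared, sealing the batch, and a new random tag is chosen for the
 * next batch:
 *
 *     front -> *ooooo*  | older, sealed batches ...
 */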

Signed-off-by: Daniel Jordan <[email protected]>
---
include/linux/mm_inline.h | 119 ++++++++++++++++++++++++++++++++++++++++++++--
include/linux/mmzone.h | 3 ++
mm/swap.c | 2 +-
mm/vmscan.c | 4 +-
4 files changed, 122 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index d7fc46ebc33b..ec8b966a1c76 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -3,6 +3,7 @@
#define LINUX_MM_INLINE_H

#include <linux/huge_mm.h>
+#include <linux/random.h>
#include <linux/swap.h>

/**
@@ -44,27 +45,139 @@ static __always_inline void update_lru_size(struct lruvec *lruvec,
#endif
}

+static __always_inline void __add_page_to_lru_list(struct page *page,
+ struct lruvec *lruvec, enum lru_list lru)
+{
+ int tag;
+ struct page *cur, *next, *second_page;
+ struct lru_list_head *head = &lruvec->lists[lru];
+
+ list_add(&page->lru, lru_head(head));
+ /* Set sentinel unconditionally until batch is full. */
+ page->lru_sentinel = true;
+
+ second_page = container_of(page->lru.next, struct page, lru);
+ VM_BUG_ON_PAGE(!second_page->lru_sentinel, second_page);
+
+ page->lru_batch = head->first_batch_tag;
+ ++head->first_batch_npages;
+
+ if (head->first_batch_npages < LRU_BATCH_MAX)
+ return;
+
+ tag = head->first_batch_tag;
+ if (likely(second_page->lru_batch == tag)) {
+ /* Unset sentinel bit in all non-sentinel nodes. */
+ cur = second_page;
+ list_for_each_entry_from(cur, lru_head(head), lru) {
+ next = list_next_entry(cur, lru);
+ if (next->lru_batch != tag)
+ break;
+ cur->lru_sentinel = false;
+ }
+ }
+
+ tag = prandom_u32_max(NUM_LRU_BATCH_LOCKS);
+ if (unlikely(tag == head->first_batch_tag))
+ tag = (tag + 1) % NUM_LRU_BATCH_LOCKS;
+ head->first_batch_tag = tag;
+ head->first_batch_npages = 0;
+}
+
static __always_inline void add_page_to_lru_list(struct page *page,
struct lruvec *lruvec, enum lru_list lru)
{
update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
- list_add(&page->lru, lru_head(&lruvec->lists[lru]));
+ __add_page_to_lru_list(page, lruvec, lru);
+}
+
+static __always_inline void __add_page_to_lru_list_tail(struct page *page,
+ struct lruvec *lruvec, enum lru_list lru)
+{
+ int tag;
+ struct page *cur, *prev, *second_page;
+ struct lru_list_head *head = &lruvec->lists[lru];
+
+ list_add_tail(&page->lru, lru_head(head));
+ /* Set sentinel unconditionally until batch is full. */
+ page->lru_sentinel = true;
+
+ second_page = container_of(page->lru.prev, struct page, lru);
+ VM_BUG_ON_PAGE(!second_page->lru_sentinel, second_page);
+
+ page->lru_batch = head->last_batch_tag;
+ ++head->last_batch_npages;
+
+ if (head->last_batch_npages < LRU_BATCH_MAX)
+ return;
+
+ tag = head->last_batch_tag;
+ if (likely(second_page->lru_batch == tag)) {
+ /* Unset sentinel bit in all non-sentinel nodes. */
+ cur = second_page;
+ list_for_each_entry_from_reverse(cur, lru_head(head), lru) {
+ prev = list_prev_entry(cur, lru);
+ if (prev->lru_batch != tag)
+ break;
+ cur->lru_sentinel = false;
+ }
+ }
+
+ tag = prandom_u32_max(NUM_LRU_BATCH_LOCKS);
+ if (unlikely(tag == head->last_batch_tag))
+ tag = (tag + 1) % NUM_LRU_BATCH_LOCKS;
+ head->last_batch_tag = tag;
+ head->last_batch_npages = 0;
}

static __always_inline void add_page_to_lru_list_tail(struct page *page,
struct lruvec *lruvec, enum lru_list lru)
{
+
update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
- list_add_tail(&page->lru, lru_head(&lruvec->lists[lru]));
+ __add_page_to_lru_list_tail(page, lruvec, lru);
}

-static __always_inline void del_page_from_lru_list(struct page *page,
+static __always_inline void __del_page_from_lru_list(struct page *page,
struct lruvec *lruvec, enum lru_list lru)
{
+ struct page *left, *right;
+
+ left = container_of(page->lru.prev, struct page, lru);
+ right = container_of(page->lru.next, struct page, lru);
+
+ if (page->lru_sentinel) {
+ VM_BUG_ON(!left->lru_sentinel && !right->lru_sentinel);
+ left->lru_sentinel = true;
+ right->lru_sentinel = true;
+ }
+
list_del(&page->lru);
+}
+
+static __always_inline void del_page_from_lru_list(struct page *page,
+ struct lruvec *lruvec, enum lru_list lru)
+{
+ __del_page_from_lru_list(page, lruvec, lru);
update_lru_size(lruvec, lru, page_zonenum(page), -hpage_nr_pages(page));
}

+static __always_inline void move_page_to_lru_list(struct page *page,
+ struct lruvec *lruvec,
+ enum lru_list lru)
+{
+ __del_page_from_lru_list(page, lruvec, lru);
+ __add_page_to_lru_list(page, lruvec, lru);
+}
+
+static __always_inline void move_page_to_lru_list_tail(struct page *page,
+ struct lruvec *lruvec,
+ enum lru_list lru)
+{
+ __del_page_from_lru_list(page, lruvec, lru);
+ __add_page_to_lru_list_tail(page, lruvec, lru);
+}
+
/**
* page_lru_base_type - which LRU list type should a page be on?
* @page: the page to test
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index feca75b8f492..492f86cdb346 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -19,6 +19,7 @@
#include <linux/page-flags-layout.h>
#include <linux/atomic.h>
#include <linux/mm_types.h>
+#include <linux/pagevec.h>
#include <asm/page.h>

/* Free memory management - zoned buddy allocator. */
@@ -260,6 +261,8 @@ struct lruvec {
#define LRU_ALL_ANON (BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON))
#define LRU_ALL ((1 << NR_LRU_LISTS) - 1)

+#define LRU_BATCH_MAX PAGEVEC_SIZE
+
#define NUM_LRU_BATCH_LOCKS 32
struct lru_batch_lock {
spinlock_t lock;
diff --git a/mm/swap.c b/mm/swap.c
index 286636bb6a4f..67eb89fc9435 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -561,7 +561,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
* The page's writeback ends up during pagevec
* We moves tha page into tail of inactive.
*/
- list_move_tail(&page->lru, lru_head(&lruvec->lists[lru]));
+ move_page_to_lru_list_tail(page, lruvec, lru);
__count_vm_event(PGROTATED);
}

diff --git a/mm/vmscan.c b/mm/vmscan.c
index aa629c4720dd..b4c32a65a40f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1553,7 +1553,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,

case -EBUSY:
/* else it is being freed elsewhere */
- list_move(&page->lru, src);
+ move_page_to_lru_list(page, lruvec, lru);
continue;

default:
@@ -1943,7 +1943,7 @@ static unsigned move_active_pages_to_lru(struct lruvec *lruvec,

nr_pages = hpage_nr_pages(page);
update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
- list_move(&page->lru, lru_head(&lruvec->lists[lru]));
+ move_page_to_lru_list(page, lruvec, lru);

if (put_page_testzero(page)) {
__ClearPageLRU(page);
--
2.16.1


2018-01-31 23:54:17

by Daniel Jordan

Subject: [RFC PATCH v1 02/13] mm: allow compaction to be disabled

This is a temporary hack: it avoids, for this prototype, the non-trivial
refactoring of the compaction code that takes lru_lock. That refactoring
can be done later.

Signed-off-by: Daniel Jordan <[email protected]>
---
mm/Kconfig | 1 -
1 file changed, 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 03ff7703d322..96412c3939c5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -231,7 +231,6 @@ config BALLOON_COMPACTION
# support for memory compaction
config COMPACTION
bool "Allow for memory compaction"
- def_bool y
select MIGRATION
depends on MMU
help
--
2.16.1


2018-01-31 23:54:17

by Daniel Jordan

Subject: [RFC PATCH v1 01/13] mm: add a percpu_pagelist_batch sysctl interface

From: Aaron Lu <[email protected]>

---
include/linux/mmzone.h | 2 ++
kernel/sysctl.c | 9 +++++++++
mm/page_alloc.c | 40 +++++++++++++++++++++++++++++++++++++++-
3 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 67f2e3c38939..c05529473b80 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -891,6 +891,8 @@ int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
+int percpu_pagelist_batch_sysctl_handler(struct ctl_table *, int,
+ void __user *, size_t *, loff_t *);
int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 557d46728577..1602bc14bf0d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -108,6 +108,7 @@ extern unsigned int core_pipe_limit;
extern int pid_max;
extern int pid_max_min, pid_max_max;
extern int percpu_pagelist_fraction;
+extern int percpu_pagelist_batch;
extern int latencytop_enabled;
extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max;
#ifndef CONFIG_MMU
@@ -1458,6 +1459,14 @@ static struct ctl_table vm_table[] = {
.proc_handler = percpu_pagelist_fraction_sysctl_handler,
.extra1 = &zero,
},
+ {
+ .procname = "percpu_pagelist_batch",
+ .data = &percpu_pagelist_batch,
+ .maxlen = sizeof(percpu_pagelist_batch),
+ .mode = 0644,
+ .proc_handler = percpu_pagelist_batch_sysctl_handler,
+ .extra1 = &zero,
+ },
#ifdef CONFIG_MMU
{
.procname = "max_map_count",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 76c9688b6a0a..d7078ed68b01 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -130,6 +130,7 @@ unsigned long totalreserve_pages __read_mostly;
unsigned long totalcma_pages __read_mostly;

int percpu_pagelist_fraction;
+int percpu_pagelist_batch;
gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;

/*
@@ -5544,7 +5545,8 @@ static void pageset_set_high_and_batch(struct zone *zone,
(zone->managed_pages /
percpu_pagelist_fraction));
else
- pageset_set_batch(pcp, zone_batchsize(zone));
+ pageset_set_batch(pcp, percpu_pagelist_batch ?
+ percpu_pagelist_batch : zone_batchsize(zone));
}

static void __meminit zone_pageset_init(struct zone *zone, int cpu)
@@ -7266,6 +7268,42 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
return ret;
}

+int percpu_pagelist_batch_sysctl_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos)
+{
+ struct zone *zone;
+ int old_percpu_pagelist_batch;
+ int ret;
+
+ mutex_lock(&pcp_batch_high_lock);
+ old_percpu_pagelist_batch = percpu_pagelist_batch;
+
+ ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+ if (!write || ret < 0)
+ goto out;
+
+ /* Sanity checking to avoid pcp imbalance */
+ if (percpu_pagelist_batch <= 0) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* No change? */
+ if (percpu_pagelist_batch == old_percpu_pagelist_batch)
+ goto out;
+
+ for_each_populated_zone(zone) {
+ unsigned int cpu;
+
+ for_each_possible_cpu(cpu)
+ pageset_set_high_and_batch(zone,
+ per_cpu_ptr(zone->pageset, cpu));
+ }
+out:
+ mutex_unlock(&pcp_batch_high_lock);
+ return ret;
+}
+
#ifdef CONFIG_NUMA
int hashdist = HASHDIST_DEFAULT;

--
2.16.1


2018-02-01 15:55:28

by Steven Whitehouse

Subject: Re: [RFC PATCH v1 00/13] lru_lock scalability

Hi,


On 31/01/18 23:04, [email protected] wrote:
> lru_lock, a per-node* spinlock that protects an LRU list, is one of the
> hottest locks in the kernel. On some workloads on large machines, it
> shows up at the top of lock_stat.
>
> One way to improve lru_lock scalability is to introduce an array of locks,
> with each lock protecting certain batches of LRU pages.
>
> *ooooooooooo**ooooooooooo**ooooooooooo**oooo ...
> |           ||           ||           ||
>  \ batch 1 /  \ batch 2 /  \ batch 3 /
>
> In this ASCII depiction of an LRU, a page is represented with either '*'
> or 'o'. An asterisk indicates a sentinel page, which is a page at the
> edge of a batch. An 'o' indicates a non-sentinel page.
>
> To remove a non-sentinel LRU page, only one lock from the array is
> required. This allows multiple threads to remove pages from different
> batches simultaneously. A sentinel page requires lru_lock in addition to
> a lock from the array.
>
> Full performance numbers appear in the last patch in this series, but this
> prototype allows a microbenchmark to do up to 28% more page faults per
> second with 16 or more concurrent processes.
>
> This work was developed in collaboration with Steve Sistare.
>
> Note: This is an early prototype. I'm submitting it now to support my
> request to attend LSF/MM, as well as get early feedback on the idea. Any
> comments appreciated.
>
>
> * lru_lock is actually per-memcg, but without memcg's in the picture it
> becomes per-node.
GFS2 has an lru list for glocks, which can be contended under certain
workloads. Work is still ongoing to figure out exactly why, but this
looks like it might be a good approach to that issue too. The main
purpose of GFS2's lru list is to allow shrinking of the glocks under
memory pressure via the gfs2_scan_glock_lru() function, and it looks
like this type of approach could be used there to improve the scalability.

Steve.

>
> Aaron Lu (1):
> mm: add a percpu_pagelist_batch sysctl interface
>
> Daniel Jordan (12):
> mm: allow compaction to be disabled
> mm: add lock array to pgdat and batch fields to struct page
> mm: introduce struct lru_list_head in lruvec to hold per-LRU batch
> info
> mm: add batching logic to add/delete/move API's
> mm: add lru_[un]lock_all APIs
> mm: convert to-be-refactored lru_lock callsites to lock-all API
> mm: temporarily convert lru_lock callsites to lock-all API
> mm: introduce add-only version of pagevec_lru_move_fn
> mm: add LRU batch lock API's
> mm: use lru_batch locking in release_pages
> mm: split up release_pages into non-sentinel and sentinel passes
> mm: splice local lists onto the front of the LRU
>
> include/linux/mm_inline.h | 209 +++++++++++++++++++++++++++++++++++++++++++++-
> include/linux/mm_types.h | 5 ++
> include/linux/mmzone.h | 25 +++++-
> kernel/sysctl.c | 9 ++
> mm/Kconfig | 1 -
> mm/huge_memory.c | 6 +-
> mm/memcontrol.c | 5 +-
> mm/mlock.c | 11 +--
> mm/mmzone.c | 7 +-
> mm/page_alloc.c | 43 +++++++++-
> mm/page_idle.c | 4 +-
> mm/swap.c | 208 ++++++++++++++++++++++++++++++++++++---------
> mm/vmscan.c | 49 +++++------
> 13 files changed, 500 insertions(+), 82 deletions(-)
>


2018-02-01 22:52:57

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v1 03/13] mm: add lock array to pgdat and batch fields to struct page

On 01/31/2018 03:04 PM, [email protected] wrote:
> This patch simply adds the array of locks and struct page fields.
> Ignore for now where the struct page fields are: we need to find a place
> to put them that doesn't enlarge the struct.
>
> Signed-off-by: Daniel Jordan <[email protected]>
> ---
> include/linux/mm_types.h | 5 +++++
> include/linux/mmzone.h | 7 +++++++
> mm/page_alloc.c | 3 +++
> 3 files changed, 15 insertions(+)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index cfd0ac4e5e0e..6e9d26f0cecf 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -190,6 +190,11 @@ struct page {
> struct kmem_cache *slab_cache; /* SL[AU]B: Pointer to slab */
> };
>
> + struct {
> + unsigned lru_batch;
> + bool lru_sentinel;

The above declaration adds at least 5 bytes to struct page.
It adds a lot of extra memory overhead when multiplied
by the number of pages in the system.

We can move the sentinel bool to a page flag, at least for 64-bit systems.
And 8 bits are probably enough for the lru_batch id, giving a max
lru_batch number of 256 to break the locks into 256 smaller ones.
The max used in the patchset is 32 and that already gives a
pretty good spread of the locking.
It would be better if we can find some unused space in struct page
to squeeze it in.

Tim


2018-02-01 23:33:03

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v1 13/13] mm: splice local lists onto the front of the LRU

On 01/31/2018 03:04 PM, [email protected] wrote:
> Now that release_pages is scaling better with concurrent removals from
> the LRU, the performance results (included below) showed increased
> contention on lru_lock in the add-to-LRU path.
>
> To alleviate some of this contention, do more work outside the LRU lock.
> Prepare a local list of pages to be spliced onto the front of the LRU,
> including setting PageLRU in each page, before taking lru_lock. Since
> other threads use this page flag in certain checks outside lru_lock,
> ensure each page's LRU links have been properly initialized before
> setting the flag, and use memory barriers accordingly.
>
> Performance Results
>
> This is a will-it-scale run of page_fault1 using 4 different kernels.
>
> kernel kern #
>
> 4.15-rc2 1
> large-zone-batch 2
> lru-lock-base 3
> lru-lock-splice 4
>
> Each kernel builds on the last. The first is a baseline, the second
> makes zone->lock more scalable by increasing an order-0 per-cpu
> pagelist's 'batch' and 'high' values to 310 and 1860 respectively
> (courtesy of Aaron Lu's patch), the third scales lru_lock without
> splicing pages (the previous patch in this series), and the fourth adds
> page splicing (this patch).
>
> N tasks mmap, fault, and munmap anonymous pages in a loop until the test
> time has elapsed.
>
> The process case generally does better than the thread case most likely
> because of mmap_sem acting as a bottleneck. There's ongoing work
> upstream[*] to scale this lock, however, and once it goes in, my
> hypothesis is the thread numbers here will improve.
>
> kern # ntask proc thr proc stdev thr stdev
> speedup speedup pgf/s pgf/s
> 1 1 705,533 1,644 705,227 1,122
> 2 1 2.5% 2.8% 722,912 453 724,807 728
> 3 1 2.6% 2.6% 724,215 653 723,213 941
> 4 1 2.3% 2.8% 721,746 272 724,944 728
>
> kern # ntask proc thr proc stdev thr stdev
> speedup speedup pgf/s pgf/s
> 1 4 2,525,487 7,428 1,973,616 12,568
> 2 4 2.6% 7.6% 2,590,699 6,968 2,123,570 10,350
> 3 4 2.3% 4.4% 2,584,668 12,833 2,059,822 10,748
> 4 4 4.7% 5.2% 2,643,251 13,297 2,076,808 9,506
>
> kern # ntask proc thr proc stdev thr stdev
> speedup speedup pgf/s pgf/s
> 1 16 6,444,656 20,528 3,226,356 32,874
> 2 16 1.9% 10.4% 6,566,846 20,803 3,560,437 64,019
> 3 16 18.3% 6.8% 7,624,749 58,497 3,447,109 67,734
> 4 16 28.2% 2.5% 8,264,125 31,677 3,306,679 69,443
>
> kern # ntask proc thr proc stdev thr stdev
> speedup speedup pgf/s pgf/s
> 1 32 11,564,988 32,211 2,456,507 38,898
> 2 32 1.8% 1.5% 11,777,119 45,418 2,494,064 27,964
> 3 32 16.1% -2.7% 13,426,746 94,057 2,389,934 40,186
> 4 32 26.2% 1.2% 14,593,745 28,121 2,486,059 42,004
>
> kern # ntask proc thr proc stdev thr stdev
> speedup speedup pgf/s pgf/s
> 1 64 12,080,629 33,676 2,443,043 61,973
> 2 64 3.9% 9.9% 12,551,136 206,202 2,684,632 69,483
> 3 64 15.0% -3.8% 13,892,933 351,657 2,351,232 67,875
> 4 64 21.9% 1.8% 14,728,765 64,945 2,485,940 66,839
>
> [*] https://lwn.net/Articles/724502/ Range reader/writer locks
> https://lwn.net/Articles/744188/ Speculative page faults
>

The speedup looks pretty nice and seems to peak at 16 tasks. Do you have an explanation of what
causes the drop from 28.2% to 21.9% going from 16 to 64 tasks? Was
the loss in performance due to increased contention on the LRU lock, since more tasks running
results in a higher likelihood of hitting a sentinel? If I understand
your patchset correctly, you will need to acquire the LRU lock for sentinel pages. Perhaps an increase
in batch size could help?

Thanks.

Tim

2018-02-02 04:20:01

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/13] lru_lock scalability



On 02/01/2018 10:54 AM, Steven Whitehouse wrote:
> Hi,
>
>
> On 31/01/18 23:04, [email protected] wrote:
>> lru_lock, a per-node* spinlock that protects an LRU list, is one of the
>> hottest locks in the kernel.  On some workloads on large machines, it
>> shows up at the top of lock_stat.
>>
>> One way to improve lru_lock scalability is to introduce an array of locks,
>> with each lock protecting certain batches of LRU pages.
>>
>>          *ooooooooooo**ooooooooooo**ooooooooooo**oooo ...
>>          |           ||           ||           ||
>>           \ batch 1 /  \ batch 2 /  \ batch 3 /
>>
>> In this ASCII depiction of an LRU, a page is represented with either '*'
>> or 'o'.  An asterisk indicates a sentinel page, which is a page at the
>> edge of a batch.  An 'o' indicates a non-sentinel page.
>>
>> To remove a non-sentinel LRU page, only one lock from the array is
>> required.  This allows multiple threads to remove pages from different
>> batches simultaneously.  A sentinel page requires lru_lock in addition to
>> a lock from the array.
>>
>> Full performance numbers appear in the last patch in this series, but this
>> prototype allows a microbenchmark to do up to 28% more page faults per
>> second with 16 or more concurrent processes.
>>
>> This work was developed in collaboration with Steve Sistare.
>>
>> Note: This is an early prototype.  I'm submitting it now to support my
>> request to attend LSF/MM, as well as get early feedback on the idea.  Any
>> comments appreciated.
>>
>>
>> * lru_lock is actually per-memcg, but without memcg's in the picture it
>>    becomes per-node.
> GFS2 has an lru list for glocks, which can be contended under certain workloads. Work is still ongoing to figure out exactly why, but this looks like it might be a good approach to that issue too. The main purpose of GFS2's lru list is to allow shrinking of the glocks under memory pressure via the gfs2_scan_glock_lru() function, and it looks like this type of approach could be used there to improve the scalability,

Glad to hear that this could help in gfs2 as well.

Hopefully struct gfs2_glock is less space constrained than struct page for storing the few bits of metadata that this approach requires.

Daniel

>
> Steve.
>
>>
>> Aaron Lu (1):
>>    mm: add a percpu_pagelist_batch sysctl interface
>>
>> Daniel Jordan (12):
>>    mm: allow compaction to be disabled
>>    mm: add lock array to pgdat and batch fields to struct page
>>    mm: introduce struct lru_list_head in lruvec to hold per-LRU batch
>>      info
>>    mm: add batching logic to add/delete/move API's
>>    mm: add lru_[un]lock_all APIs
>>    mm: convert to-be-refactored lru_lock callsites to lock-all API
>>    mm: temporarily convert lru_lock callsites to lock-all API
>>    mm: introduce add-only version of pagevec_lru_move_fn
>>    mm: add LRU batch lock API's
>>    mm: use lru_batch locking in release_pages
>>    mm: split up release_pages into non-sentinel and sentinel passes
>>    mm: splice local lists onto the front of the LRU
>>
>>   include/linux/mm_inline.h | 209 +++++++++++++++++++++++++++++++++++++++++++++-
>>   include/linux/mm_types.h  |   5 ++
>>   include/linux/mmzone.h    |  25 +++++-
>>   kernel/sysctl.c           |   9 ++
>>   mm/Kconfig                |   1 -
>>   mm/huge_memory.c          |   6 +-
>>   mm/memcontrol.c           |   5 +-
>>   mm/mlock.c                |  11 +--
>>   mm/mmzone.c               |   7 +-
>>   mm/page_alloc.c           |  43 +++++++++-
>>   mm/page_idle.c            |   4 +-
>>   mm/swap.c                 | 208 ++++++++++++++++++++++++++++++++++++---------
>>   mm/vmscan.c               |  49 +++++------
>>   13 files changed, 500 insertions(+), 82 deletions(-)
>>
>

2018-02-02 04:30:56

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v1 03/13] mm: add lock array to pgdat and batch fields to struct page



On 02/01/2018 05:50 PM, Tim Chen wrote:
> On 01/31/2018 03:04 PM, [email protected] wrote:
>> This patch simply adds the array of locks and struct page fields.
>> Ignore for now where the struct page fields are: we need to find a place
>> to put them that doesn't enlarge the struct.
>>
>> Signed-off-by: Daniel Jordan <[email protected]>
>> ---
>> include/linux/mm_types.h | 5 +++++
>> include/linux/mmzone.h | 7 +++++++
>> mm/page_alloc.c | 3 +++
>> 3 files changed, 15 insertions(+)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index cfd0ac4e5e0e..6e9d26f0cecf 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -190,6 +190,11 @@ struct page {
>> struct kmem_cache *slab_cache; /* SL[AU]B: Pointer to slab */
>> };
>>
>> + struct {
>> + unsigned lru_batch;
>> + bool lru_sentinel;
>
> The above declaration adds at least 5 bytes to struct page.
> It adds a lot of extra memory overhead when multiplied
> by the number of pages in the system.

Yes, I completely agree, enlarging struct page won't cut it for the final solution.

> We can move sentinel bool to page flag, at least for 64 bit system.

There did seem to be room for one more bit the way my kernel was configured (without losing a component in page->flags), but I'd have to look again.

> And 8 bit is probably enough for lru_batch id to give a max
> lru_batch number of 256 to break the locks into 256 smaller ones.
> The max used in the patchset is 32 and that is already giving
> pretty good spread of the locking.
> It will be better if we can find some unused space in struct page
> to squeeze it in.

One idea we'd had was to store the batch id in the lower bits of the mem_cgroup pointer. CONFIG_MEMCG seems to be pretty ubiquitous these days, and it's a large enough struct (1048 bytes on one machine) to have room in the lower bits.
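
Roughly, that tagging could look like the sketch below (illustrative only,
not from the series: the helper names are made up, and it assumes struct
mem_cgroup allocations are aligned to at least 1 << LRU_BATCH_BITS bytes so
the pointer's low bits are otherwise zero):

struct mem_cgroup;

#define LRU_BATCH_BITS	5				/* up to 32 batch ids */
#define LRU_BATCH_MASK	((1UL << LRU_BATCH_BITS) - 1)

static inline unsigned long lru_tag_memcg(struct mem_cgroup *memcg,
					  unsigned int batch)
{
	return (unsigned long)memcg | (batch & LRU_BATCH_MASK);
}

static inline struct mem_cgroup *lru_tagged_memcg(unsigned long tagged)
{
	return (struct mem_cgroup *)(tagged & ~LRU_BATCH_MASK);
}

static inline unsigned int lru_tagged_batch(unsigned long tagged)
{
	return tagged & LRU_BATCH_MASK;
}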

Another way might be to encode the previous and next lru page pointers as pfn's instead of struct list_head *'s, shrinking the footprint of struct page's lru field to allow room for the batch id.

2018-02-02 05:22:19

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v1 13/13] mm: splice local lists onto the front of the LRU

On 02/01/2018 06:30 PM, Tim Chen wrote:
> On 01/31/2018 03:04 PM, [email protected] wrote:
>> Now that release_pages is scaling better with concurrent removals from
>> the LRU, the performance results (included below) showed increased
>> contention on lru_lock in the add-to-LRU path.
>>
>> To alleviate some of this contention, do more work outside the LRU lock.
>> Prepare a local list of pages to be spliced onto the front of the LRU,
>> including setting PageLRU in each page, before taking lru_lock. Since
>> other threads use this page flag in certain checks outside lru_lock,
>> ensure each page's LRU links have been properly initialized before
>> setting the flag, and use memory barriers accordingly.
>>
>> Performance Results
>>
>> This is a will-it-scale run of page_fault1 using 4 different kernels.
>>
>> kernel kern #
>>
>> 4.15-rc2 1
>> large-zone-batch 2
>> lru-lock-base 3
>> lru-lock-splice 4
>>
>> Each kernel builds on the last. The first is a baseline, the second
>> makes zone->lock more scalable by increasing an order-0 per-cpu
>> pagelist's 'batch' and 'high' values to 310 and 1860 respectively
>> (courtesy of Aaron Lu's patch), the third scales lru_lock without
>> splicing pages (the previous patch in this series), and the fourth adds
>> page splicing (this patch).
>>
>> N tasks mmap, fault, and munmap anonymous pages in a loop until the test
>> time has elapsed.
>>
>> The process case generally does better than the thread case most likely
>> because of mmap_sem acting as a bottleneck. There's ongoing work
>> upstream[*] to scale this lock, however, and once it goes in, my
>> hypothesis is the thread numbers here will improve.

Neglected to mention my hardware:
2-socket system, 44 cores, 503G memory, Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz

>>
>> kern # ntask proc thr proc stdev thr stdev
>> speedup speedup pgf/s pgf/s
>> 1 1 705,533 1,644 705,227 1,122
>> 2 1 2.5% 2.8% 722,912 453 724,807 728
>> 3 1 2.6% 2.6% 724,215 653 723,213 941
>> 4 1 2.3% 2.8% 721,746 272 724,944 728
>>
>> kern # ntask proc thr proc stdev thr stdev
>> speedup speedup pgf/s pgf/s
>> 1 4 2,525,487 7,428 1,973,616 12,568
>> 2 4 2.6% 7.6% 2,590,699 6,968 2,123,570 10,350
>> 3 4 2.3% 4.4% 2,584,668 12,833 2,059,822 10,748
>> 4 4 4.7% 5.2% 2,643,251 13,297 2,076,808 9,506
>>
>> kern # ntask proc thr proc stdev thr stdev
>> speedup speedup pgf/s pgf/s
>> 1 16 6,444,656 20,528 3,226,356 32,874
>> 2 16 1.9% 10.4% 6,566,846 20,803 3,560,437 64,019
>> 3 16 18.3% 6.8% 7,624,749 58,497 3,447,109 67,734
>> 4 16 28.2% 2.5% 8,264,125 31,677 3,306,679 69,443
>>
>> kern # ntask proc thr proc stdev thr stdev
>> speedup speedup pgf/s pgf/s
>> 1 32 11,564,988 32,211 2,456,507 38,898
>> 2 32 1.8% 1.5% 11,777,119 45,418 2,494,064 27,964
>> 3 32 16.1% -2.7% 13,426,746 94,057 2,389,934 40,186
>> 4 32 26.2% 1.2% 14,593,745 28,121 2,486,059 42,004
>>
>> kern # ntask proc thr proc stdev thr stdev
>> speedup speedup pgf/s pgf/s
>> 1 64 12,080,629 33,676 2,443,043 61,973
>> 2 64 3.9% 9.9% 12,551,136 206,202 2,684,632 69,483
>> 3 64 15.0% -3.8% 13,892,933 351,657 2,351,232 67,875
>> 4 64 21.9% 1.8% 14,728,765 64,945 2,485,940 66,839
>>
>> [*] https://lwn.net/Articles/724502/ Range reader/writer locks
>> https://lwn.net/Articles/744188/ Speculative page faults
>>
>
> The speedup looks pretty nice and seems to peak at 16 tasks. Do you have an explanation of what
> causes the drop from 28.2% to 21.9% going from 16 to 64 tasks?

The system I was testing on had 44 cores, so part of the decrease in % speedup is just saturating the hardware (e.g. memory bandwidth). At 64 processes, we start having to share cores. Page faults per second did continue to increase each time we added more processes, though, so there's no anti-scaling going on.

> Was
> the loss in performance due to increased contention on LRU lock when more tasks running
> results in a higher likelihood of hitting the sentinel?

That seems to be another factor, yes. I used lock_stat to measure it, and it showed that wait time on lru_lock nearly tripled when going from 32 to 64 processes, but I also take lock_stat with a grain of salt as it changes the timing/interaction between processes.

> If I understand
> your patchset correctly, you will need to acquire LRU lock for sentinel page. Perhaps an increase
> in batch size could help?

Actually, I did try doing that. In this series the batch size is PAGEVEC_SIZE (14). When I did a run with PAGEVEC_SIZE*4, the performance stayed nearly the same for all but the 64-process case, where it dropped by ~10%. One explanation is that, as a process runs through a larger batch, it holds the batch lock longer before it has to switch batches, creating more opportunity for contention.


By the way, we're also working on another approach to scaling this lock:
https://marc.info/?l=linux-mm&m=151746028405581

We plan to implement that idea and see how it compares performance-wise and diffstat-wise with this.

2018-02-02 05:22:49

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v1 13/13] mm: splice local lists onto the front of the LRU

On Wed, Jan 31, 2018 at 06:04:13PM -0500, [email protected] wrote:
> Now that release_pages is scaling better with concurrent removals from
> the LRU, the performance results (included below) showed increased
> contention on lru_lock in the add-to-LRU path.
>
> To alleviate some of this contention, do more work outside the LRU lock.
> Prepare a local list of pages to be spliced onto the front of the LRU,
> including setting PageLRU in each page, before taking lru_lock. Since
> other threads use this page flag in certain checks outside lru_lock,
> ensure each page's LRU links have been properly initialized before
> setting the flag, and use memory barriers accordingly.
>
> Performance Results
>
> This is a will-it-scale run of page_fault1 using 4 different kernels.
>
> kernel kern #
>
> 4.15-rc2 1
> large-zone-batch 2
> lru-lock-base 3
> lru-lock-splice 4
>
> Each kernel builds on the last. The first is a baseline, the second
> makes zone->lock more scalable by increasing an order-0 per-cpu
> pagelist's 'batch' and 'high' values to 310 and 1860 respectively

Since the purpose of the patchset is to optimize lru_lock, you may
consider adjusting pcp->high to be >= 32768(page_fault1's test size is
128M = 32768 pages). That should eliminate zone->lock contention
entirely.

> (courtesy of Aaron Lu's patch), the third scales lru_lock without
> splicing pages (the previous patch in this series), and the fourth adds
> page splicing (this patch).
>
> N tasks mmap, fault, and munmap anonymous pages in a loop until the test
> time has elapsed.

2018-02-02 10:53:46

by Steven Whitehouse

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/13] lru_lock scalability

Hi,


On 02/02/18 04:18, Daniel Jordan wrote:
>
>
> On 02/01/2018 10:54 AM, Steven Whitehouse wrote:
>> Hi,
>>
>>
>> On 31/01/18 23:04, [email protected] wrote:
>>> lru_lock, a per-node* spinlock that protects an LRU list, is one of the
>>> hottest locks in the kernel.  On some workloads on large machines, it
>>> shows up at the top of lock_stat.
>>>
>>> One way to improve lru_lock scalability is to introduce an array of
>>> locks,
>>> with each lock protecting certain batches of LRU pages.
>>>
>>>          *ooooooooooo**ooooooooooo**ooooooooooo**oooo ...
>>>          |           ||           ||           ||
>>>           \ batch 1 /  \ batch 2 /  \ batch 3 /
>>>
>>> In this ASCII depiction of an LRU, a page is represented with either
>>> '*'
>>> or 'o'.  An asterisk indicates a sentinel page, which is a page at the
>>> edge of a batch.  An 'o' indicates a non-sentinel page.
>>>
>>> To remove a non-sentinel LRU page, only one lock from the array is
>>> required.  This allows multiple threads to remove pages from different
>>> batches simultaneously.  A sentinel page requires lru_lock in
>>> addition to
>>> a lock from the array.
>>>
>>> Full performance numbers appear in the last patch in this series,
>>> but this
>>> prototype allows a microbenchmark to do up to 28% more page faults per
>>> second with 16 or more concurrent processes.
>>>
>>> This work was developed in collaboration with Steve Sistare.
>>>
>>> Note: This is an early prototype.  I'm submitting it now to support my
>>> request to attend LSF/MM, as well as get early feedback on the
>>> idea.  Any
>>> comments appreciated.
>>>
>>>
>>> * lru_lock is actually per-memcg, but without memcg's in the picture it
>>>    becomes per-node.
>> GFS2 has an lru list for glocks, which can be contended under certain
>> workloads. Work is still ongoing to figure out exactly why, but this
>> looks like it might be a good approach to that issue too. The main
>> purpose of GFS2's lru list is to allow shrinking of the glocks under
>> memory pressure via the gfs2_scan_glock_lru() function, and it looks
>> like this type of approach could be used there to improve the
>> scalability,
>
> Glad to hear that this could help in gfs2 as well.
>
> Hopefully struct gfs2_glock is less space constrained than struct page
> for storing the few bits of metadata that this approach requires.
>
> Daniel
>
We obviously want to keep gfs2_glock small; within reason, though,
we can add some additional fields as required. The use case is
pretty much a standard LRU list: items are added and removed, mostly
at the active end of the list, and the inactive end of the list is
scanned periodically by gfs2_scan_glock_lru().

Steve.


2018-02-02 14:41:30

by Laurent Dufour

[permalink] [raw]
Subject: Re: [RFC PATCH v1 12/13] mm: split up release_pages into non-sentinel and sentinel passes



On 01/02/2018 00:04, [email protected] wrote:
> A common case in release_pages is for the 'pages' list to be in roughly
> the same order as they are in their LRU. With LRU batch locking, when a
> sentinel page is removed, an adjacent non-sentinel page must be promoted
> to a sentinel page to follow the locking scheme. So we can get behavior
> where nearly every page in the 'pages' array is treated as a sentinel
> page, hurting the scalability of this approach.
>
> To address this, split up release_pages into non-sentinel and sentinel
> passes so that the non-sentinel pages can be locked with an LRU batch
> lock before the sentinel pages are removed.
>
> For the prototype, just use a bitmap and a temporary outer loop to
> implement this.
>
> Performance numbers from a single microbenchmark at this point in the
> series are included in the next patch.
>
> Signed-off-by: Daniel Jordan <[email protected]>
> ---
> mm/swap.c | 20 +++++++++++++++++++-
> 1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index fae766e035a4..a302224293ad 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -731,6 +731,7 @@ void lru_add_drain_all(void)
> put_online_cpus();
> }
>
> +#define LRU_BITMAP_SIZE 512
> /**
> * release_pages - batched put_page()
> * @pages: array of pages to release
> @@ -742,16 +743,32 @@ void lru_add_drain_all(void)
> */
> void release_pages(struct page **pages, int nr)
> {
> - int i;
> + int h, i;
> LIST_HEAD(pages_to_free);
> struct pglist_data *locked_pgdat = NULL;
> spinlock_t *locked_lru_batch = NULL;
> struct lruvec *lruvec;
> unsigned long uninitialized_var(flags);
> + DECLARE_BITMAP(lru_bitmap, LRU_BITMAP_SIZE);
> +
> + VM_BUG_ON(nr > LRU_BITMAP_SIZE);

While running your series rebased on v4.15-mmotm-2018-01-31-16-51, I'm
hitting this VM_BUG sometimes on a ppc64 system where page size is set to 64K.
In my case, nr=537 while LRU_BITMAP_SIZE is 512. Here is the stack trace
displayed :

kernel BUG at /local/laurent/work/glinux/mm/swap.c:728!
Oops: Exception in kernel mode, sig: 5 [#1]
LE SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: pseries_rng rng_core vmx_crypto virtio_balloon ip_tables
x_tables autofs4 virtio_net virtio_blk virtio_pci virtio_ring virtio
CPU: 41 PID: 3485 Comm: cc1 Not tainted 4.15.0-mm1-lru+ #2
NIP: c0000000002b0784 LR: c0000000002b0780 CTR: c0000000007bab20
REGS: c0000005e126b740 TRAP: 0700 Not tainted (4.15.0-mm1-lru+)
MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 28002422 XER: 20000000
CFAR: c000000000192ae4 SOFTE: 0
GPR00: c0000000002b0780 c0000005e126b9c0 c00000000103c100 000000000000001c
GPR04: c0000005ffc4ce38 c0000005ffc63d00 0000000000000000 0000000000000001
GPR08: 0000000000000007 c000000000ec3a4c 00000005fed90000 0000000000000000
GPR12: 0000000000002200 c00000000fd8cd00 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR24: c0000005e11ab980 0000000000000000 0000000000000000 c0000005e126ba60
GPR28: 0000000000000219 c0000005e126bc40 0000000000000000 c0000005ec5f0000
NIP [c0000000002b0784] release_pages+0x864/0x880
LR [c0000000002b0780] release_pages+0x860/0x880
Call Trace:
[c0000005e126b9c0] [c0000000002b0780] release_pages+0x860/0x880 (unreliable)
[c0000005e126bb30] [c00000000031da3c] free_pages_and_swap_cache+0x11c/0x150
[c0000005e126bb80] [c0000000002ef5f8] tlb_flush_mmu_free+0x68/0xa0
[c0000005e126bbc0] [c0000000002f1568] arch_tlb_finish_mmu+0x58/0xf0
[c0000005e126bbf0] [c0000000002f19d4] tlb_finish_mmu+0x34/0x60
[c0000005e126bc20] [c0000000003031e8] exit_mmap+0xd8/0x1d0
[c0000005e126bce0] [c0000000000f3188] mmput+0x78/0x160
[c0000005e126bd10] [c0000000000ff568] do_exit+0x348/0xd00
[c0000005e126bdd0] [c0000000000fffd8] do_group_exit+0x58/0xd0
[c0000005e126be10] [c00000000010006c] SyS_exit_group+0x1c/0x20
[c0000005e126be30] [c00000000000ba60] system_call+0x58/0x6c
Instruction dump:
3949ffff 4bfffdc8 3c62ffce 38a00200 f9c100e0 f9e100e8 386345e8 fa0100f0
fa2100f8 fa410100 4bee2329 60000000 <0fe00000> 3b400001 4bfff868 7d5d5378
---[ end trace 55b1651f9d92f14f ]---

>
> + bitmap_zero(lru_bitmap, nr);
> +
> + for (h = 0; h < 2; h++) {
> for (i = 0; i < nr; i++) {
> struct page *page = pages[i];
>
> + if (h == 0) {
> + if (PageLRU(page) && page->lru_sentinel) {
> + bitmap_set(lru_bitmap, i, 1);
> + continue;
> + }
> + } else {
> + if (!test_bit(i, lru_bitmap))
> + continue;
> + }
> +
> if (is_huge_zero_page(page))
> continue;
>
> @@ -798,6 +815,7 @@ void release_pages(struct page **pages, int nr)
>
> list_add(&page->lru, &pages_to_free);
> }
> + }
> if (locked_lru_batch) {
> lru_batch_unlock(NULL, &locked_lru_batch, &locked_pgdat,
> &flags);
>


2018-02-02 15:24:54

by Laurent Dufour

[permalink] [raw]
Subject: Re: [RFC PATCH v1 13/13] mm: splice local lists onto the front of the LRU

On 01/02/2018 00:04, [email protected] wrote:
> Now that release_pages is scaling better with concurrent removals from
> the LRU, the performance results (included below) showed increased
> contention on lru_lock in the add-to-LRU path.
>
> To alleviate some of this contention, do more work outside the LRU lock.
> Prepare a local list of pages to be spliced onto the front of the LRU,
> including setting PageLRU in each page, before taking lru_lock. Since
> other threads use this page flag in certain checks outside lru_lock,
> ensure each page's LRU links have been properly initialized before
> setting the flag, and use memory barriers accordingly.
>
> Performance Results
>
> This is a will-it-scale run of page_fault1 using 4 different kernels.
>
> kernel kern #
>
> 4.15-rc2 1
> large-zone-batch 2
> lru-lock-base 3
> lru-lock-splice 4
>
> Each kernel builds on the last. The first is a baseline, the second
> makes zone->lock more scalable by increasing an order-0 per-cpu
> pagelist's 'batch' and 'high' values to 310 and 1860 respectively
> (courtesy of Aaron Lu's patch), the third scales lru_lock without
> splicing pages (the previous patch in this series), and the fourth adds
> page splicing (this patch).
>
> N tasks mmap, fault, and munmap anonymous pages in a loop until the test
> time has elapsed.
>
> The process case generally does better than the thread case most likely
> because of mmap_sem acting as a bottleneck. There's ongoing work
> upstream[*] to scale this lock, however, and once it goes in, my
> hypothesis is the thread numbers here will improve.
>
> kern # ntask proc thr proc stdev thr stdev
> speedup speedup pgf/s pgf/s
> 1 1 705,533 1,644 705,227 1,122
> 2 1 2.5% 2.8% 722,912 453 724,807 728
> 3 1 2.6% 2.6% 724,215 653 723,213 941
> 4 1 2.3% 2.8% 721,746 272 724,944 728
>
> kern # ntask proc thr proc stdev thr stdev
> speedup speedup pgf/s pgf/s
> 1 4 2,525,487 7,428 1,973,616 12,568
> 2 4 2.6% 7.6% 2,590,699 6,968 2,123,570 10,350
> 3 4 2.3% 4.4% 2,584,668 12,833 2,059,822 10,748
> 4 4 4.7% 5.2% 2,643,251 13,297 2,076,808 9,506
>
> kern # ntask proc thr proc stdev thr stdev
> speedup speedup pgf/s pgf/s
> 1 16 6,444,656 20,528 3,226,356 32,874
> 2 16 1.9% 10.4% 6,566,846 20,803 3,560,437 64,019
> 3 16 18.3% 6.8% 7,624,749 58,497 3,447,109 67,734
> 4 16 28.2% 2.5% 8,264,125 31,677 3,306,679 69,443
>
> kern # ntask proc thr proc stdev thr stdev
> speedup speedup pgf/s pgf/s
> 1 32 11,564,988 32,211 2,456,507 38,898
> 2 32 1.8% 1.5% 11,777,119 45,418 2,494,064 27,964
> 3 32 16.1% -2.7% 13,426,746 94,057 2,389,934 40,186
> 4 32 26.2% 1.2% 14,593,745 28,121 2,486,059 42,004
>
> kern # ntask proc thr proc stdev thr stdev
> speedup speedup pgf/s pgf/s
> 1 64 12,080,629 33,676 2,443,043 61,973
> 2 64 3.9% 9.9% 12,551,136 206,202 2,684,632 69,483
> 3 64 15.0% -3.8% 13,892,933 351,657 2,351,232 67,875
> 4 64 21.9% 1.8% 14,728,765 64,945 2,485,940 66,839
>
> [*] https://lwn.net/Articles/724502/ Range reader/writer locks
> https://lwn.net/Articles/744188/ Speculative page faults
>
> Signed-off-by: Daniel Jordan <[email protected]>
> ---
> mm/memcontrol.c | 1 +
> mm/mlock.c | 1 +
> mm/swap.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> mm/vmscan.c | 1 +
> 4 files changed, 112 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 99a54df760e3..6911626f29b2 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2077,6 +2077,7 @@ static void lock_page_lru(struct page *page, int *isolated)
>
> lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
> ClearPageLRU(page);
> + smp_rmb(); /* Pairs with smp_wmb in __pagevec_lru_add */

Why not include the call to smp_rmb() in del_page_from_lru_list() instead
of spreading smp_rmb() before calls to del_page_from_lru_list() ?

> del_page_from_lru_list(page, lruvec, page_lru(page));
> *isolated = 1;
> } else
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 6ba6a5887aeb..da294c5bbc2c 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -109,6 +109,7 @@ static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
> if (getpage)
> get_page(page);
> ClearPageLRU(page);
> + smp_rmb(); /* Pairs with smp_wmb in __pagevec_lru_add */
> del_page_from_lru_list(page, lruvec, page_lru(page));
> return true;
> }
> diff --git a/mm/swap.c b/mm/swap.c
> index a302224293ad..46a98dc8e9ad 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -220,6 +220,7 @@ static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
> int *pgmoved = arg;
>
> if (PageLRU(page) && !PageUnevictable(page)) {
> + smp_rmb(); /* Pairs with smp_wmb in __pagevec_lru_add */
> del_page_from_lru_list(page, lruvec, page_lru(page));
> ClearPageActive(page);
> add_page_to_lru_list_tail(page, lruvec, page_lru(page));
> @@ -277,6 +278,7 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
> int file = page_is_file_cache(page);
> int lru = page_lru_base_type(page);
>
> + smp_rmb(); /* Pairs with smp_wmb in __pagevec_lru_add */
> del_page_from_lru_list(page, lruvec, lru);
> SetPageActive(page);
> lru += LRU_ACTIVE;
> @@ -544,6 +546,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
> file = page_is_file_cache(page);
> lru = page_lru_base_type(page);
>
> + smp_rmb(); /* Pairs with smp_wmb in __pagevec_lru_add */
> del_page_from_lru_list(page, lruvec, lru + active);
> ClearPageActive(page);
> ClearPageReferenced(page);
> @@ -578,6 +581,7 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
> !PageSwapCache(page) && !PageUnevictable(page)) {
> bool active = PageActive(page);
>
> + smp_rmb(); /* Pairs with smp_wmb in __pagevec_lru_add */
> del_page_from_lru_list(page, lruvec,
> LRU_INACTIVE_ANON + active);
> ClearPageActive(page);
> @@ -903,6 +907,60 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
> trace_mm_lru_insertion(page, lru);
> }
>
> +#define MAX_LRU_SPLICES 4
> +
> +struct lru_splice {
> + struct list_head list;
> + struct lruvec *lruvec;
> + enum lru_list lru;
> + int nid;
> + int zid;
> + size_t nr_pages;
> +};
> +
> +/*
> + * Adds a page to a local list for splicing, or else to the singletons
> + * list for individual processing.
> + *
> + * Returns the new number of splices in the splices list.
> + */
> +size_t add_page_to_lru_splice(struct lru_splice *splices, size_t nr_splices,
> + struct list_head *singletons, struct page *page)
> +{
> + int i;
> + enum lru_list lru = page_lru(page);
> + enum zone_type zid = page_zonenum(page);
> + int nid = page_to_nid(page);
> + struct lruvec *lruvec;
> +
> + VM_BUG_ON_PAGE(PageLRU(page), page);
> +
> + lruvec = mem_cgroup_page_lruvec(page, NODE_DATA(nid));
> +
> + for (i = 0; i < nr_splices; ++i) {
> + if (splices[i].lruvec == lruvec && splices[i].zid == zid) {
> + list_add(&page->lru, &splices[i].list);
> + splices[nr_splices].nr_pages += hpage_nr_pages(page);
> + return nr_splices;
> + }
> + }
> +
> + if (nr_splices < MAX_LRU_SPLICES) {
> + INIT_LIST_HEAD(&splices[nr_splices].list);
> + splices[nr_splices].lruvec = lruvec;
> + splices[nr_splices].lru = lru;
> + splices[nr_splices].nid = nid;
> + splices[nr_splices].zid = zid;
> + splices[nr_splices].nr_pages = hpage_nr_pages(page);
> + list_add(&page->lru, &splices[nr_splices].list);
> + ++nr_splices;
> + } else {
> + list_add(&page->lru, singletons);
> + }
> +
> + return nr_splices;
> +}
> +
> /*
> * Add the passed pages to the LRU, then drop the caller's refcount
> * on them. Reinitialises the caller's pagevec.
> @@ -911,12 +969,59 @@ void __pagevec_lru_add(struct pagevec *pvec)
> {
> int i;
> struct pglist_data *pgdat = NULL;
> - struct lruvec *lruvec;
> unsigned long flags = 0;
> + struct lru_splice splices[MAX_LRU_SPLICES];
> + size_t nr_splices = 0;
> + LIST_HEAD(singletons);
> + struct page *page, *next;
>
> - for (i = 0; i < pagevec_count(pvec); i++) {
> - struct page *page = pvec->pages[i];
> - struct pglist_data *pagepgdat = page_pgdat(page);
> + /*
> + * Sort the pages into local lists to splice onto the LRU once we
> + * hold lru_lock. In the common case there should be few of these
> + * local lists.
> + */
> + for (i = 0; i < pagevec_count(pvec); ++i) {
> + page = pvec->pages[i];
> + nr_splices = add_page_to_lru_splice(splices, nr_splices,
> + &singletons, page);
> + }
> +
> + /*
> + * Paired with read barriers where we check PageLRU and modify
> + * page->lru, for example pagevec_move_tail_fn.
> + */
> + smp_wmb();
> +
> + for (i = 0; i < pagevec_count(pvec); i++)
> + SetPageLRU(pvec->pages[i]);
> +
> + for (i = 0; i < nr_splices; ++i) {
> + struct lru_splice *s = &splices[i];
> + struct pglist_data *splice_pgdat = NODE_DATA(s->nid);
> +
> + if (splice_pgdat != pgdat) {
> + if (pgdat)
> + spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> + pgdat = splice_pgdat;
> + spin_lock_irqsave(&pgdat->lru_lock, flags);
> + }
> +
> + update_lru_size(s->lruvec, s->lru, s->zid, s->nr_pages);
> + list_splice(&s->list, lru_head(&s->lruvec->lists[s->lru]));
> + update_page_reclaim_stat(s->lruvec, is_file_lru(s->lru),
> + is_active_lru(s->lru));
> + /* XXX add splice tracepoint */
> + }
> +
> + while (!list_empty(&singletons)) {
> + struct pglist_data *pagepgdat;
> + struct lruvec *lruvec;
> + struct list_head *list;
> +
> + list = singletons.next;
> + page = list_entry(list, struct page, lru);
> + list_del(list);
> + pagepgdat = page_pgdat(page);
>
> if (pagepgdat != pgdat) {
> if (pgdat)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7f5ff0bb133f..338850ad03a6 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1629,6 +1629,7 @@ int isolate_lru_page(struct page *page)
> int lru = page_lru(page);
> get_page(page);
> ClearPageLRU(page);
> + smp_rmb(); /* Pairs with smp_wmb in __pagevec_lru_add */
> del_page_from_lru_list(page, lruvec, lru);
> ret = 0;
> }
>


2018-02-02 18:32:48

by Laurent Dufour

[permalink] [raw]
Subject: Re: [RFC PATCH v1 12/13] mm: split up release_pages into non-sentinel and sentinel passes

On 02/02/2018 15:40, Laurent Dufour wrote:
>
>
> On 01/02/2018 00:04, [email protected] wrote:
>> A common case in release_pages is for the 'pages' list to be in roughly
>> the same order as they are in their LRU. With LRU batch locking, when a
>> sentinel page is removed, an adjacent non-sentinel page must be promoted
>> to a sentinel page to follow the locking scheme. So we can get behavior
>> where nearly every page in the 'pages' array is treated as a sentinel
>> page, hurting the scalability of this approach.
>>
>> To address this, split up release_pages into non-sentinel and sentinel
>> passes so that the non-sentinel pages can be locked with an LRU batch
>> lock before the sentinel pages are removed.
>>
>> For the prototype, just use a bitmap and a temporary outer loop to
>> implement this.
>>
>> Performance numbers from a single microbenchmark at this point in the
>> series are included in the next patch.
>>
>> Signed-off-by: Daniel Jordan <[email protected]>
>> ---
>> mm/swap.c | 20 +++++++++++++++++++-
>> 1 file changed, 19 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/swap.c b/mm/swap.c
>> index fae766e035a4..a302224293ad 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -731,6 +731,7 @@ void lru_add_drain_all(void)
>> put_online_cpus();
>> }
>>
>> +#define LRU_BITMAP_SIZE 512
>> /**
>> * release_pages - batched put_page()
>> * @pages: array of pages to release
>> @@ -742,16 +743,32 @@ void lru_add_drain_all(void)
>> */
>> void release_pages(struct page **pages, int nr)
>> {
>> - int i;
>> + int h, i;
>> LIST_HEAD(pages_to_free);
>> struct pglist_data *locked_pgdat = NULL;
>> spinlock_t *locked_lru_batch = NULL;
>> struct lruvec *lruvec;
>> unsigned long uninitialized_var(flags);
>> + DECLARE_BITMAP(lru_bitmap, LRU_BITMAP_SIZE);
>> +
>> + VM_BUG_ON(nr > LRU_BITMAP_SIZE);
>
> While running your series rebased on v4.15-mmotm-2018-01-31-16-51, I'm
> hitting this VM_BUG sometimes on a ppc64 system where page size is set to 64K.

I can't see any link between nr and LRU_BITMAP_SIZE; the caller may pass a
larger list of pages, which has no relation to the LRU list.

To move forward and see the benefit of this series combined with the SPF one, I
declared the bitmap based on nr. This is still not a valid option, but it
at least allows all the passed pages to be processed.

> In my case, nr=537 while LRU_BITMAP_SIZE is 512. Here is the stack trace
> displayed :
>
> kernel BUG at /local/laurent/work/glinux/mm/swap.c:728!
> Oops: Exception in kernel mode, sig: 5 [#1]
> LE SMP NR_CPUS=2048 NUMA pSeries
> Modules linked in: pseries_rng rng_core vmx_crypto virtio_balloon ip_tables
> x_tables autofs4 virtio_net virtio_blk virtio_pci virtio_ring virtio
> CPU: 41 PID: 3485 Comm: cc1 Not tainted 4.15.0-mm1-lru+ #2
> NIP: c0000000002b0784 LR: c0000000002b0780 CTR: c0000000007bab20
> REGS: c0000005e126b740 TRAP: 0700 Not tainted (4.15.0-mm1-lru+)
> MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 28002422 XER: 20000000
> CFAR: c000000000192ae4 SOFTE: 0
> GPR00: c0000000002b0780 c0000005e126b9c0 c00000000103c100 000000000000001c
> GPR04: c0000005ffc4ce38 c0000005ffc63d00 0000000000000000 0000000000000001
> GPR08: 0000000000000007 c000000000ec3a4c 00000005fed90000 0000000000000000
> GPR12: 0000000000002200 c00000000fd8cd00 0000000000000000 0000000000000000
> GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> GPR24: c0000005e11ab980 0000000000000000 0000000000000000 c0000005e126ba60
> GPR28: 0000000000000219 c0000005e126bc40 0000000000000000 c0000005ec5f0000
> NIP [c0000000002b0784] release_pages+0x864/0x880
> LR [c0000000002b0780] release_pages+0x860/0x880
> Call Trace:
> [c0000005e126b9c0] [c0000000002b0780] release_pages+0x860/0x880 (unreliable)
> [c0000005e126bb30] [c00000000031da3c] free_pages_and_swap_cache+0x11c/0x150
> [c0000005e126bb80] [c0000000002ef5f8] tlb_flush_mmu_free+0x68/0xa0
> [c0000005e126bbc0] [c0000000002f1568] arch_tlb_finish_mmu+0x58/0xf0
> [c0000005e126bbf0] [c0000000002f19d4] tlb_finish_mmu+0x34/0x60
> [c0000005e126bc20] [c0000000003031e8] exit_mmap+0xd8/0x1d0
> [c0000005e126bce0] [c0000000000f3188] mmput+0x78/0x160
> [c0000005e126bd10] [c0000000000ff568] do_exit+0x348/0xd00
> [c0000005e126bdd0] [c0000000000fffd8] do_group_exit+0x58/0xd0
> [c0000005e126be10] [c00000000010006c] SyS_exit_group+0x1c/0x20
> [c0000005e126be30] [c00000000000ba60] system_call+0x58/0x6c
> Instruction dump:
> 3949ffff 4bfffdc8 3c62ffce 38a00200 f9c100e0 f9e100e8 386345e8 fa0100f0
> fa2100f8 fa410100 4bee2329 60000000 <0fe00000> 3b400001 4bfff868 7d5d5378
> ---[ end trace 55b1651f9d92f14f ]---
>
>>
>> + bitmap_zero(lru_bitmap, nr);
>> +
>> + for (h = 0; h < 2; h++) {
>> for (i = 0; i < nr; i++) {
>> struct page *page = pages[i];
>>
>> + if (h == 0) {
>> + if (PageLRU(page) && page->lru_sentinel) {
>> + bitmap_set(lru_bitmap, i, 1);
>> + continue;
>> + }
>> + } else {
>> + if (!test_bit(i, lru_bitmap))
>> + continue;
>> + }
>> +
>> if (is_huge_zero_page(page))
>> continue;
>>
>> @@ -798,6 +815,7 @@ void release_pages(struct page **pages, int nr)
>>
>> list_add(&page->lru, &pages_to_free);
>> }
>> + }
>> if (locked_lru_batch) {
>> lru_batch_unlock(NULL, &locked_lru_batch, &locked_pgdat,
>> &flags);
>>
>


2018-02-05 05:01:04

by kernel test robot

[permalink] [raw]
Subject: [lkp-robot] [mm] 44b163e12f: kernel_BUG_at_mm/swap.c


FYI, we noticed the following commit (built with gcc-7):

commit: 44b163e12fd4a133016482d94ad11d8f3365ddd2 ("mm: split up release_pages into non-sentinel and sentinel passes")
url: https://github.com/0day-ci/linux/commits/daniel-m-jordan-oracle-com/mm-add-a-percpu_pagelist_batch-sysctl-interface/20180202-131129


in testcase: boot

on test machine: qemu-system-i386 -enable-kvm -m 360M

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


+-----------------------------------------------------+------------+------------+
| | 6fe15c1d7a | 44b163e12f |
+-----------------------------------------------------+------------+------------+
| boot_successes | 0 | 0 |
| boot_failures | 46 | 12 |
| WARNING:possible_recursive_locking_detected | 46 | 12 |
| WARNING:at_arch/x86/mm/dump_pagetables.c:#note_page | 8 | 2 |
| EIP:note_page | 8 | 2 |
| kernel_BUG_at_mm/swap.c | 0 | 12 |
| invalid_opcode:#[##] | 0 | 12 |
| EIP:release_pages | 0 | 12 |
| Kernel_panic-not_syncing:Fatal_exception | 0 | 12 |
+-----------------------------------------------------+------------+------------+



[ 245.413373] kernel BUG at mm/swap.c:754!
[ 245.424199] invalid opcode: 0000 [#1] SMP
[ 245.432437] CPU: 0 PID: 164 Comm: sh Not tainted 4.15.0-00012-g44b163e #153
[ 245.445522] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 245.461052] EIP: release_pages+0x26/0x3ab
[ 245.468947] EFLAGS: 00010202 CPU: 0
[ 245.476401] EAX: c9c6200c EBX: c9c62000 ECX: c9c6dd80 EDX: 00000297
[ 245.490767] ESI: 00000000 EDI: c9c6de3c EBP: c9c6ddd8 ESP: c9c6dd64
[ 245.502693] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 245.513095] CR0: 80050033 CR2: 08138000 CR3: 0c9c0220 CR4: 000006b0
[ 245.524953] Call Trace:
[ 245.530908] ? cpumask_next+0x21/0x24
[ 245.537234] ? cpumask_any_but+0x1d/0x2d
[ 245.544004] ? flush_tlb_mm_range+0xcc/0x103
[ 245.552467] tlb_flush_mmu_free+0x17/0x33
[ 245.560820] tlb_flush_mmu+0x12/0x15
[ 245.568370] arch_tlb_finish_mmu+0x28/0x47
[ 245.575761] tlb_finish_mmu+0x1d/0x2c
[ 245.582080] exit_mmap+0xbc/0x10c
[ 245.588629] ? trace_hardirqs_off_caller+0x1b/0x99
[ 245.598128] mmput+0x53/0xc1
[ 245.604470] flush_old_exec+0x59f/0x60e
[ 245.612514] load_elf_binary+0x238/0x9d4
[ 245.620644] ? search_binary_handler+0x5c/0xbe
[ 245.629747] ? search_binary_handler+0x5c/0xbe
[ 245.638823] search_binary_handler+0x50/0xbe
[ 245.647474] do_execveat_common+0x545/0x7af
[ 245.656070] do_execve+0x14/0x16
[ 245.663265] SyS_execve+0x16/0x18
[ 245.670448] do_fast_syscall_32+0x11b/0x222
[ 245.679075] entry_SYSENTER_32+0x53/0x86
[ 245.687212] EIP: 0xb7eecbe5
[ 245.693652] EFLAGS: 00000292 CPU: 0
[ 245.701007] EAX: ffffffda EBX: 08138028 ECX: 081382a8 EDX: 08136008
[ 245.712423] ESI: 081382a8 EDI: b7ebbff4 EBP: 00000000 ESP: bfb82ed4
[ 245.723085] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[ 245.733522] Code: 7c f1 ff 5d c3 55 89 e5 57 56 53 83 ec 68 8d 4d a8 65 8b 35 14 00 00 00 89 75 f0 31 f6 81 fa 00 02 00 00 89 4d a8 89 4d ac 7e 02 <0f> 0b 8d 4a 1f c1 e9 05 c1 e1 02 83 f9 40 89 55 94 89 45 8c 76
[ 245.767993] EIP: release_pages+0x26/0x3ab SS:ESP: 0068:c9c6dd64
[ 245.779532] ---[ end trace 9116e5f455646a7b ]---


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email



Thanks,
Xiaolong


Attachments:
config-4.15.0-00012-g44b163e (121.29 kB)
job-script (4.05 kB)
dmesg.xz (17.48 kB)

2018-02-06 17:40:55

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v1 13/13] mm: splice local lists onto the front of the LRU

On 02/02/2018 12:21 AM, Aaron Lu wrote:
> On Wed, Jan 31, 2018 at 06:04:13PM -0500, [email protected] wrote:
>> Now that release_pages is scaling better with concurrent removals from
>> the LRU, the performance results (included below) showed increased
>> contention on lru_lock in the add-to-LRU path.
>>
>> To alleviate some of this contention, do more work outside the LRU lock.
>> Prepare a local list of pages to be spliced onto the front of the LRU,
>> including setting PageLRU in each page, before taking lru_lock. Since
>> other threads use this page flag in certain checks outside lru_lock,
>> ensure each page's LRU links have been properly initialized before
>> setting the flag, and use memory barriers accordingly.
>>
>> Performance Results
>>
>> This is a will-it-scale run of page_fault1 using 4 different kernels.
>>
>> kernel kern #
>>
>> 4.15-rc2 1
>> large-zone-batch 2
>> lru-lock-base 3
>> lru-lock-splice 4
>>
>> Each kernel builds on the last. The first is a baseline, the second
>> makes zone->lock more scalable by increasing an order-0 per-cpu
>> pagelist's 'batch' and 'high' values to 310 and 1860 respectively
>
> Since the purpose of the patchset is to optimize lru_lock, you may
> consider adjusting pcp->high to be >= 32768(page_fault1's test size is
> 128M = 32768 pages). That should eliminate zone->lock contention
> entirely.

Interesting, hadn't thought about taking zone->lock completely out of
the equation. Will try this next time I test this series.


While we're on this topic, it does seem from the performance of kernel
#2, and the numbers Aaron posted in a previous thread[*], that the
default 'batch' and 'high' values should be bigger on large systems.

The code to control these two values last changed in 2005[**], so we hit
the largest values with just a 512M zone:

zone 4k_pages batch high high/4k_pages
64M 16,384 3 18 0.10986%
128M 32,768 7 42 0.12817%
256M 65,536 15 90 0.13733%
512M 131,072 31 186 0.14191%
1G 262,144 31 186 0.07095%
2G 524,288 31 186 0.03548%
4G 1,048,576 31 186 0.01774%
8G 2,097,152 31 186 0.00887%
16G 4,194,304 31 186 0.00443%
32G 8,388,608 31 186 0.00222%
64G 16,777,216 31 186 0.00111%
128G 33,554,432 31 186 0.00055%
256G 67,108,864 31 186 0.00028%
512G 134,217,728 31 186 0.00014%
1024G 268,435,456 31 186 0.00007%


[*] https://marc.info/?l=linux-netdev&m=150572010919327
[**] ba56e91c9401 ("[PATCH] mm: page_alloc: increase size of per-cpu-pages")
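
For reference, the arithmetic behind those numbers, paraphrased from
zone_batchsize() of that era (a sketch assuming 4K pages, not the exact
source; pageset_set_batch() then uses high = 6 * batch):

static int zone_batchsize_sketch(unsigned long managed_pages)
{
	int batch;

	batch = managed_pages / 1024;		/* ~0.1% of the zone...    */
	if (batch * PAGE_SIZE > 512 * 1024)	/* ...capped at 512K worth */
		batch = (512 * 1024) / PAGE_SIZE;
	batch /= 4;
	if (batch < 1)
		batch = 1;

	/* clamp to a 2^n - 1 value to avoid cache aliasing effects */
	return rounddown_pow_of_two(batch + batch / 2) - 1;
}

/*
 * 512M zone: 131072/1024 = 128, under the cap, /4 = 32, clamped to 31,
 * so high = 6 * 31 = 186.  Every larger zone hits the 512K cap
 * (batch = 128) and lands on the same 31/186, which is why the table
 * flattens out.
 */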

>
>> (courtesy of Aaron Lu's patch), the third scales lru_lock without
>> splicing pages (the previous patch in this series), and the fourth adds
>> page splicing (this patch).
>>
>> N tasks mmap, fault, and munmap anonymous pages in a loop until the test
>> time has elapsed.

2018-02-06 17:49:45

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v1 12/13] mm: split up release_pages into non-sentinel and sentinel passes

On 02/02/2018 12:00 PM, Laurent Dufour wrote:
> On 02/02/2018 15:40, Laurent Dufour wrote:
>>
>>
>> On 01/02/2018 00:04, [email protected] wrote:
>>> A common case in release_pages is for the 'pages' list to be in roughly
>>> the same order as they are in their LRU. With LRU batch locking, when a
>>> sentinel page is removed, an adjacent non-sentinel page must be promoted
>>> to a sentinel page to follow the locking scheme. So we can get behavior
>>> where nearly every page in the 'pages' array is treated as a sentinel
>>> page, hurting the scalability of this approach.
>>>
>>> To address this, split up release_pages into non-sentinel and sentinel
>>> passes so that the non-sentinel pages can be locked with an LRU batch
>>> lock before the sentinel pages are removed.
>>>
>>> For the prototype, just use a bitmap and a temporary outer loop to
>>> implement this.
>>>
>>> Performance numbers from a single microbenchmark at this point in the
>>> series are included in the next patch.
>>>
>>> Signed-off-by: Daniel Jordan <[email protected]>
>>> ---
>>> mm/swap.c | 20 +++++++++++++++++++-
>>> 1 file changed, 19 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/swap.c b/mm/swap.c
>>> index fae766e035a4..a302224293ad 100644
>>> --- a/mm/swap.c
>>> +++ b/mm/swap.c
>>> @@ -731,6 +731,7 @@ void lru_add_drain_all(void)
>>> put_online_cpus();
>>> }
>>>
>>> +#define LRU_BITMAP_SIZE 512
>>> /**
>>> * release_pages - batched put_page()
>>> * @pages: array of pages to release
>>> @@ -742,16 +743,32 @@ void lru_add_drain_all(void)
>>> */
>>> void release_pages(struct page **pages, int nr)
>>> {
>>> - int i;
>>> + int h, i;
>>> LIST_HEAD(pages_to_free);
>>> struct pglist_data *locked_pgdat = NULL;
>>> spinlock_t *locked_lru_batch = NULL;
>>> struct lruvec *lruvec;
>>> unsigned long uninitialized_var(flags);
>>> + DECLARE_BITMAP(lru_bitmap, LRU_BITMAP_SIZE);
>>> +
>>> + VM_BUG_ON(nr > LRU_BITMAP_SIZE);
>>
>> While running your series rebased on v4.15-mmotm-2018-01-31-16-51, I'm
>> hitting this VM_BUG sometimes on a ppc64 system where page size is set to 64K.
>
> I can't see any link between nr and LRU_BITMAP_SIZE, caller may pass a
> larger list of pages which is not relative to the LRU list.

You're correct, I used the hard-coded size to prototype quickly, just to
see how this approach performs. It's unfortunate that it bit you.
> To move forward seeing the benefit of this series with the SPF one, I
> declared the bit map based on nr. This is still not a valid option but this
> at least allows to process all the passed pages.

Yes, the bitmap's not for the final version.
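
For what it's worth, until the bitmap goes away, one stopgap would be to
walk the array in LRU_BITMAP_SIZE-sized chunks so any nr is safe, something
like the sketch below (release_pages_chunk() is a hypothetical name for the
current two-pass body):

void release_pages(struct page **pages, int nr)
{
	while (nr > 0) {
		int chunk = min(nr, LRU_BITMAP_SIZE);

		/* the existing bitmap-based, two-pass loop over 'chunk' pages */
		release_pages_chunk(pages, chunk);
		pages += chunk;
		nr -= chunk;
	}
}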

2018-02-06 18:22:48

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v1 13/13] mm: splice local lists onto the front of the LRU

On 02/02/2018 10:22 AM, Laurent Dufour wrote:
> On 01/02/2018 00:04, [email protected] wrote:
...snip...
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 99a54df760e3..6911626f29b2 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2077,6 +2077,7 @@ static void lock_page_lru(struct page *page, int *isolated)
>>
>> lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>> ClearPageLRU(page);
>> + smp_rmb(); /* Pairs with smp_wmb in __pagevec_lru_add */
>
> Why not include the call to smp_rmb() in del_page_from_lru_list() instead
> of spreading smp_rmb() before calls to del_page_from_lru_list() ?

Yes, this is what I should have done. The memory barriers came from
another patch I squashed in and I didn't look back to see how obvious
the encapsulation was.
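
Something like the sketch below, roughly the 4.15-era helper from
include/linux/mm_inline.h with the barrier folded in (untested):

static __always_inline void del_page_from_lru_list(struct page *page,
				struct lruvec *lruvec, enum lru_list lru)
{
	/*
	 * Pairs with the smp_wmb() in __pagevec_lru_add(): ensure the
	 * adder's initialization of page->lru is visible before unlinking.
	 */
	smp_rmb();
	list_del(&page->lru);
	update_lru_size(lruvec, lru, page_zonenum(page), -hpage_nr_pages(page));
}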

2018-02-08 23:38:19

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/13] lru_lock scalability

On Wed, 31 Jan 2018 18:04:00 -0500 [email protected] wrote:

> lru_lock, a per-node* spinlock that protects an LRU list, is one of the
> hottest locks in the kernel. On some workloads on large machines, it
> shows up at the top of lock_stat.

Do you have details on which callsites are causing the problem? That
would permit us to consider other approaches, perhaps.


2018-02-13 21:08:55

by Daniel Jordan

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/13] lru_lock scalability

On 02/08/2018 06:36 PM, Andrew Morton wrote:
> On Wed, 31 Jan 2018 18:04:00 -0500 [email protected] wrote:
>
>> lru_lock, a per-node* spinlock that protects an LRU list, is one of the
>> hottest locks in the kernel. On some workloads on large machines, it
>> shows up at the top of lock_stat.
>
> Do you have details on which callsites are causing the problem? That
> would permit us to consider other approaches, perhaps.

Sure, there are two paths where we're seeing contention.

In the first one, a pagevec's worth of anonymous pages are added to
various LRUs when the per-cpu pagevec fills up:

/* take an anonymous page fault, eventually end up at... */
handle_pte_fault
do_anonymous_page
lru_cache_add_active_or_unevictable
lru_cache_add
__lru_cache_add
__pagevec_lru_add
pagevec_lru_move_fn
/* contend on lru_lock */


In the second, one or more pages are removed from an LRU under one hold
of lru_lock:

// userland calls munmap or exit, eventually end up at...
zap_pte_range
__tlb_remove_page // returns true because we eventually hit
// MAX_GATHER_BATCH_COUNT in tlb_next_batch
tlb_flush_mmu_free
free_pages_and_swap_cache
release_pages
/* contend on lru_lock */


For a broader context, we've run decision support benchmarks where
lru_lock (and zone->lock) show long wait times. But we're not the only
ones according to certain kernel comments:

mm/vmscan.c:
* zone_lru_lock is heavily contended. Some of the functions that
* shrink the lists perform better by taking out a batch of pages
* and working on them outside the LRU lock.
*
* For pagecache intensive workloads, this function is the hottest
* spot in the kernel (apart from copy_*_user functions).
...
static unsigned long isolate_lru_pages(unsigned long nr_to_scan,


include/linux/mmzone.h:
 * zone->lock and the [pgdat->lru_lock] are two of the hottest locks in the kernel.
 * So add a wild amount of padding here to ensure that they fall into separate
 * cachelines. ...


Anyway, if you're seeing this lock in your workloads, I'm interested in
hearing what you're running so we can get more real world data on this.