2021-03-10 16:16:01

by Minchan Kim

Subject: [PATCH v3 1/3] mm: replace migrate_prep with lru_add_drain_all

Currently, migrate_prep is merely a wrapper of lru_add_drain_all.
There is not much to gain from having additional abstraction.

Use lru_add_drain_all instead of migrate_prep, which would be more
descriptive.

note: migrate_prep_local in compaction.c is changed to lru_add_drain
to avoid the CPU scheduling cost of involving many other CPUs, which
keeps the old behavior.
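
For reference, a rough sketch of the difference between the two drain
helpers (simplified from mm/swap.c around this series; illustrative,
not part of the patch):

	/* Drain only the calling CPU's pagevecs: no cross-CPU work. */
	void lru_add_drain(void)
	{
		local_lock(&lru_pvecs.lock);
		lru_add_drain_cpu(smp_processor_id());
		local_unlock(&lru_pvecs.lock);
	}

	/*
	 * lru_add_drain_all() instead queues lru_add_drain_per_cpu() work
	 * on every online CPU with pending pagevecs and waits for it to
	 * finish, which is the scheduling cost compaction avoids above.
	 */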

Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/migrate.h | 5 -----
mm/compaction.c | 3 ++-
mm/mempolicy.c | 4 ++--
mm/migrate.c | 24 +-----------------------
mm/page_alloc.c | 2 +-
mm/swap.c | 5 +++++
6 files changed, 11 insertions(+), 32 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 3a389633b68f..6155d97ec76c 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -45,8 +45,6 @@ extern struct page *alloc_migration_target(struct page *page, unsigned long priv
extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
extern void putback_movable_page(struct page *page);

-extern void migrate_prep(void);
-extern void migrate_prep_local(void);
extern void migrate_page_states(struct page *newpage, struct page *page);
extern void migrate_page_copy(struct page *newpage, struct page *page);
extern int migrate_huge_page_move_mapping(struct address_space *mapping,
@@ -66,9 +64,6 @@ static inline struct page *alloc_migration_target(struct page *page,
static inline int isolate_movable_page(struct page *page, isolate_mode_t mode)
{ return -EBUSY; }

-static inline int migrate_prep(void) { return -ENOSYS; }
-static inline int migrate_prep_local(void) { return -ENOSYS; }
-
static inline void migrate_page_states(struct page *newpage, struct page *page)
{
}
diff --git a/mm/compaction.c b/mm/compaction.c
index e04f4476e68e..3be017ececc0 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2319,7 +2319,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
cc->free_pfn, end_pfn, sync);

- migrate_prep_local();
+ /* lru_add_drain_all could be expensive with involving other CPUs */
+ lru_add_drain();

while ((ret = compact_finished(cc)) == COMPACT_CONTINUE) {
int err;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index ab51132547b8..fc024e97be37 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1124,7 +1124,7 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
int err = 0;
nodemask_t tmp;

- migrate_prep();
+ lru_add_drain_all();

mmap_read_lock(mm);

@@ -1323,7 +1323,7 @@ static long do_mbind(unsigned long start, unsigned long len,

if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {

- migrate_prep();
+ lru_add_drain_all();
}
{
NODEMASK_SCRATCH(scratch);
diff --git a/mm/migrate.c b/mm/migrate.c
index 62b81d5257aa..45f925e10f5a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -57,28 +57,6 @@

#include "internal.h"

-/*
- * migrate_prep() needs to be called before we start compiling a list of pages
- * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
- * undesirable, use migrate_prep_local()
- */
-void migrate_prep(void)
-{
- /*
- * Clear the LRU lists so pages can be isolated.
- * Note that pages may be moved off the LRU after we have
- * drained them. Those pages will fail to migrate like other
- * pages that may be busy.
- */
- lru_add_drain_all();
-}
-
-/* Do the necessary work of migrate_prep but not if it involves other CPUs */
-void migrate_prep_local(void)
-{
- lru_add_drain();
-}
-
int isolate_movable_page(struct page *page, isolate_mode_t mode)
{
struct address_space *mapping;
@@ -1769,7 +1747,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
int start, i;
int err = 0, err1;

- migrate_prep();
+ lru_add_drain_all();

for (i = start = 0; i < nr_pages; i++) {
const void __user *p;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e8348936df8..f05a8db741ca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8467,7 +8467,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
.gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
};

- migrate_prep();
+ lru_add_drain_all();

while (pfn < end || !list_empty(&cc->migratepages)) {
if (fatal_signal_pending(current)) {
diff --git a/mm/swap.c b/mm/swap.c
index 31b844d4ed94..441d1ae1f285 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -729,6 +729,11 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
}

/*
+ * lru_add_drain_all() usually needs to be called before we start compiling
+ * a list of pages to be migrated using isolate_lru_page(). Note that pages
+ * may be moved off the LRU after we have drained them. Those pages will
+ * fail to migrate like other pages that may be busy.
+ *
* Doesn't need any cpu hotplug locking because we do rely on per-cpu
* kworkers being shut down before our page_alloc_cpu_dead callback is
* executed on the offlined cpu.
--
2.30.1.766.gb4fecdf3b7-goog


2021-03-10 16:16:29

by Minchan Kim

Subject: [PATCH v3 3/3] mm: fs: Invalidate BH LRU during page migration

Pages containing buffer_heads that are in one of the per-CPU
buffer_head LRU caches will be pinned and thus cannot be migrated.
This can prevent CMA allocations from succeeding, which are often used
on platforms with co-processors (such as a DSP) that can only use
physically contiguous memory. It can also prevent memory
hot-unplugging from succeeding, which involves migrating at least
MIN_MEMORY_BLOCK_SIZE bytes of memory, which ranges from 8 MiB to 1
GiB based on the architecture in use.

Correspondingly, invalidate the BH LRU caches before a migration
starts and stop any buffer_head from being cached in the LRU caches,
until migration has finished.
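
For context, a simplified sketch of how a cached buffer_head blocks
migration (the call chain below follows the existing code and is only
illustrative, not part of this patch):

	bh_lru_install()
		get_bh(bh);		/* per-CPU bh_lrus takes a reference */

	/* later, while migrating the page that owns bh */
	try_to_release_page()
		try_to_free_buffers()
			drop_buffers()
				buffer_busy(bh)	/* b_count > 0, buffers stay */

The page keeps its buffers and the extra reference, so the expected
refcount check in migration fails and the page cannot be migrated.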

Signed-off-by: Chris Goldsworthy <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
fs/buffer.c | 12 ++++++++++--
include/linux/buffer_head.h | 4 ++++
mm/swap.c | 5 ++++-
3 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 0cb7ffd4977c..ca9dd736bcb8 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1264,6 +1264,14 @@ static void bh_lru_install(struct buffer_head *bh)
int i;

check_irqs_on();
+ /*
+ * buffer_head in bh_lru could increase refcount of the page
+ * until it will be invalidated. It causes page migraion failure.
+ * Skip putting upcoming bh into bh_lru until migration is done.
+ */
+ if (lru_cache_disabled())
+ return;
+
bh_lru_lock();

b = this_cpu_ptr(&bh_lrus);
@@ -1409,7 +1417,7 @@ EXPORT_SYMBOL(__bread_gfp);
* This doesn't race because it runs in each cpu either in irq
* or with preempt disabled.
*/
-static void invalidate_bh_lru(void *arg)
+void invalidate_bh_lru(void *arg)
{
struct bh_lru *b = &get_cpu_var(bh_lrus);
int i;
@@ -1421,7 +1429,7 @@ static void invalidate_bh_lru(void *arg)
put_cpu_var(bh_lrus);
}

-static bool has_bh_in_lru(int cpu, void *dummy)
+bool has_bh_in_lru(int cpu, void *dummy)
{
struct bh_lru *b = per_cpu_ptr(&bh_lrus, cpu);
int i;
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 6b47f94378c5..05998b5947a2 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -194,6 +194,8 @@ void __breadahead_gfp(struct block_device *, sector_t block, unsigned int size,
struct buffer_head *__bread_gfp(struct block_device *,
sector_t block, unsigned size, gfp_t gfp);
void invalidate_bh_lrus(void);
+void invalidate_bh_lru(void *arg);
+bool has_bh_in_lru(int cpu, void *dummy);
struct buffer_head *alloc_buffer_head(gfp_t gfp_flags);
void free_buffer_head(struct buffer_head * bh);
void unlock_buffer(struct buffer_head *bh);
@@ -406,6 +408,8 @@ static inline int inode_has_buffers(struct inode *inode) { return 0; }
static inline void invalidate_inode_buffers(struct inode *inode) {}
static inline int remove_inode_buffers(struct inode *inode) { return 1; }
static inline int sync_mapping_buffers(struct address_space *mapping) { return 0; }
+static inline void invalidate_bh_lru(void *arg) {}
+static inline bool has_bh_in_lru(int cpu, void *dummy) { return 0; }
#define buffer_heads_over_limit 0

#endif /* CONFIG_BLOCK */
diff --git a/mm/swap.c b/mm/swap.c
index fbdf6ac05aec..2a431959a45d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -36,6 +36,7 @@
#include <linux/hugetlb.h>
#include <linux/page_idle.h>
#include <linux/local_lock.h>
+#include <linux/buffer_head.h>

#include "internal.h"

@@ -641,6 +642,7 @@ void lru_add_drain_cpu(int cpu)
pagevec_lru_move_fn(pvec, lru_lazyfree_fn);

activate_page_drain(cpu);
+ invalidate_bh_lru(NULL);
}

/**
@@ -828,7 +830,8 @@ static void __lru_add_drain_all(bool force_all_cpus)
pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
pagevec_count(&per_cpu(lru_pvecs.lru_lazyfree, cpu)) ||
- need_activate_page_drain(cpu)) {
+ need_activate_page_drain(cpu) ||
+ has_bh_in_lru(cpu, NULL)) {
INIT_WORK(work, lru_add_drain_per_cpu);
queue_work_on(cpu, mm_percpu_wq, work);
__cpumask_set_cpu(cpu, &has_work);
--
2.30.1.766.gb4fecdf3b7-goog

2021-03-10 16:18:13

by Minchan Kim

Subject: [PATCH v3 2/3] mm: disable LRU pagevec during the migration temporarily

An LRU pagevec holds a refcount on its pages until the pagevec is
drained. That can prevent migration, since the refcount of such a page
is greater than what the migration logic expects. To mitigate the
issue, callers of migrate_pages drain the LRU pagevecs via migrate_prep
or lru_add_drain_all before calling migrate_pages.

However, that is not enough: pages entering a pagevec after the
draining call can still sit in the pagevec and keep preventing page
migration. Since some callers of migrate_pages have retry logic with
LRU draining, the page would migrate on the next trial, but this is
fragile because it does not close the fundamental race between pages
entering the pagevec and migration, so the migration failure can still
cause contiguous memory allocation failure in the end.

To close the race, this patch disables the LRU caches (i.e., pagevecs)
while migration is ongoing, until migration is done.
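
A minimal sketch of the caller pattern this patch introduces (the two
helpers are added below; the migration step is illustrative):

	lru_cache_disable();	/* drain pagevecs on all CPUs, keep caching off */
	...
	/* isolate pages and call migrate_pages() as callers already do */
	...
	lru_cache_enable();	/* re-enable per-CPU LRU caching */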

Since the race is really hard to reproduce, I measured how many times
migrate_pages retried with force mode (i.e., a fallback to sync
migration) using the debug code below.

int migrate_pages(struct list_head *from, new_page_t get_new_page,
..
..

if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
printk(KERN_ERR "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
dump_page(page, "fail to migrate");
}

The test repeatedly launched Android apps with CMA allocation running
in the background every five seconds. The total CMA allocation count
was about 500 during the test. With this patch, the dump_page count
was reduced from 400 to 30.

The new interface is also useful for memory hotplug, which currently
drains the LRU pcp caches after each migration failure. This is rather
suboptimal, as it has to disrupt other workloads running during the
operation. With the new interface the operation happens only once.
This is also in line with the pcp allocator caches, which are disabled
for the offlining as well.
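
With the hunks below, the offlining path roughly becomes (a sketch
based on the mm/memory_hotplug.c changes in this patch):

	zone_pcp_disable(zone);
	lru_cache_disable();	/* replaces per-iteration lru_add_drain_all() */
	... isolate the range and loop over do_migrate_range() ...
	lru_cache_enable();
	zone_pcp_enable(zone);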

Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/swap.h | 3 ++
mm/memory_hotplug.c | 3 +-
mm/mempolicy.c | 4 ++-
mm/migrate.c | 3 +-
mm/swap.c | 79 ++++++++++++++++++++++++++++++++++++--------
5 files changed, 75 insertions(+), 17 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 32f665b1ee85..a3e258335a7f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -339,6 +339,9 @@ extern void lru_note_cost(struct lruvec *lruvec, bool file,
extern void lru_note_cost_page(struct page *);
extern void lru_cache_add(struct page *);
extern void mark_page_accessed(struct page *);
+extern void lru_cache_disable(void);
+extern void lru_cache_enable(void);
+extern bool lru_cache_disabled(void);
extern void lru_add_drain(void);
extern void lru_add_drain_cpu(int cpu);
extern void lru_add_drain_cpu_zone(struct zone *zone);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 5ba51a8bdaeb..959f659ef085 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1611,6 +1611,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
* in a way that pages from isolated pageblock are left on pcplists.
*/
zone_pcp_disable(zone);
+ lru_cache_disable();

/* set above range as isolated */
ret = start_isolate_page_range(start_pfn, end_pfn,
@@ -1642,7 +1643,6 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
}

cond_resched();
- lru_add_drain_all();

ret = scan_movable_pages(pfn, end_pfn, &pfn);
if (!ret) {
@@ -1687,6 +1687,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
spin_unlock_irqrestore(&zone->lock, flags);

+ lru_cache_enable();
zone_pcp_enable(zone);

/* removal success */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index fc024e97be37..658238e69551 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1323,7 +1323,7 @@ static long do_mbind(unsigned long start, unsigned long len,

if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {

- lru_add_drain_all();
+ lru_cache_disable();
}
{
NODEMASK_SCRATCH(scratch);
@@ -1371,6 +1371,8 @@ static long do_mbind(unsigned long start, unsigned long len,
mmap_write_unlock(mm);
mpol_out:
mpol_put(new);
+ if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+ lru_cache_enable();
return err;
}

diff --git a/mm/migrate.c b/mm/migrate.c
index 45f925e10f5a..acc9913e4303 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1747,7 +1747,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
int start, i;
int err = 0, err1;

- lru_add_drain_all();
+ lru_cache_disable();

for (i = start = 0; i < nr_pages; i++) {
const void __user *p;
@@ -1816,6 +1816,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
if (err >= 0)
err = err1;
out:
+ lru_cache_enable();
return err;
}

diff --git a/mm/swap.c b/mm/swap.c
index 441d1ae1f285..fbdf6ac05aec 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -235,6 +235,18 @@ static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
}
}

+/* return true if pagevec needs to drain */
+static bool pagevec_add_and_need_flush(struct pagevec *pvec, struct page *page)
+{
+ bool ret = false;
+
+ if (!pagevec_add(pvec, page) || PageCompound(page) ||
+ lru_cache_disabled())
+ ret = true;
+
+ return ret;
+}
+
/*
* Writeback is about to end against a page which has been marked for immediate
* reclaim. If it still appears to be reclaimable, move it to the tail of the
@@ -252,7 +264,7 @@ void rotate_reclaimable_page(struct page *page)
get_page(page);
local_lock_irqsave(&lru_rotate.lock, flags);
pvec = this_cpu_ptr(&lru_rotate.pvec);
- if (!pagevec_add(pvec, page) || PageCompound(page))
+ if (pagevec_add_and_need_flush(pvec, page))
pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
local_unlock_irqrestore(&lru_rotate.lock, flags);
}
@@ -343,7 +355,7 @@ static void activate_page(struct page *page)
local_lock(&lru_pvecs.lock);
pvec = this_cpu_ptr(&lru_pvecs.activate_page);
get_page(page);
- if (!pagevec_add(pvec, page) || PageCompound(page))
+ if (pagevec_add_and_need_flush(pvec, page))
pagevec_lru_move_fn(pvec, __activate_page);
local_unlock(&lru_pvecs.lock);
}
@@ -458,7 +470,7 @@ void lru_cache_add(struct page *page)
get_page(page);
local_lock(&lru_pvecs.lock);
pvec = this_cpu_ptr(&lru_pvecs.lru_add);
- if (!pagevec_add(pvec, page) || PageCompound(page))
+ if (pagevec_add_and_need_flush(pvec, page))
__pagevec_lru_add(pvec);
local_unlock(&lru_pvecs.lock);
}
@@ -654,7 +666,7 @@ void deactivate_file_page(struct page *page)
local_lock(&lru_pvecs.lock);
pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);

- if (!pagevec_add(pvec, page) || PageCompound(page))
+ if (pagevec_add_and_need_flush(pvec, page))
pagevec_lru_move_fn(pvec, lru_deactivate_file_fn);
local_unlock(&lru_pvecs.lock);
}
@@ -676,7 +688,7 @@ void deactivate_page(struct page *page)
local_lock(&lru_pvecs.lock);
pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate);
get_page(page);
- if (!pagevec_add(pvec, page) || PageCompound(page))
+ if (pagevec_add_and_need_flush(pvec, page))
pagevec_lru_move_fn(pvec, lru_deactivate_fn);
local_unlock(&lru_pvecs.lock);
}
@@ -698,7 +710,7 @@ void mark_page_lazyfree(struct page *page)
local_lock(&lru_pvecs.lock);
pvec = this_cpu_ptr(&lru_pvecs.lru_lazyfree);
get_page(page);
- if (!pagevec_add(pvec, page) || PageCompound(page))
+ if (pagevec_add_and_need_flush(pvec, page))
pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
local_unlock(&lru_pvecs.lock);
}
@@ -729,18 +741,13 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
}

/*
- * lru_add_drain_all() usually needs to be called before we start compiling
- * a list of pages to be migrated using isolate_lru_page(). Note that pages
- * may be moved off the LRU after we have drained them. Those pages will
- * fail to migrate like other pages that may be busy.
- *
* Doesn't need any cpu hotplug locking because we do rely on per-cpu
* kworkers being shut down before our page_alloc_cpu_dead callback is
* executed on the offlined cpu.
* Calling this function with cpu hotplug locks held can actually lead
* to obscure indirect dependencies via WQ context.
*/
-void lru_add_drain_all(void)
+static void __lru_add_drain_all(bool force_all_cpus)
{
/*
* lru_drain_gen - Global pages generation number
@@ -785,7 +792,7 @@ void lru_add_drain_all(void)
* (C) Exit the draining operation if a newer generation, from another
* lru_add_drain_all(), was already scheduled for draining. Check (A).
*/
- if (unlikely(this_gen != lru_drain_gen))
+ if (unlikely(this_gen != lru_drain_gen && !force_all_cpus))
goto done;

/*
@@ -815,7 +822,8 @@ void lru_add_drain_all(void)
for_each_online_cpu(cpu) {
struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

- if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
+ if (force_all_cpus ||
+ pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
@@ -833,6 +841,11 @@ void lru_add_drain_all(void)
done:
mutex_unlock(&lock);
}
+
+void lru_add_drain_all(void)
+{
+ __lru_add_drain_all(false);
+}
#else
void lru_add_drain_all(void)
{
@@ -840,6 +853,44 @@ void lru_add_drain_all(void)
}
#endif /* CONFIG_SMP */

+static atomic_t lru_disable_count = ATOMIC_INIT(0);
+
+bool lru_cache_disabled(void)
+{
+ return atomic_read(&lru_disable_count);
+}
+
+void lru_cache_enable(void)
+{
+ atomic_dec(&lru_disable_count);
+}
+
+/*
+ * lru_cache_disable() needs to be called before we start compiling
+ * a list of pages to be migrated using isolate_lru_page().
+ * It drains pages in the LRU caches and then disables them on all
+ * CPUs until lru_cache_enable() is called.
+ *
+ * Must be paired with a call to lru_cache_enable().
+ */
+void lru_cache_disable(void)
+{
+ atomic_inc(&lru_disable_count);
+#ifdef CONFIG_SMP
+ /*
+ * lru_add_drain_all in the force mode will schedule draining on
+ * all online CPUs so any calls of lru_cache_disabled wrapped by
+ * local_lock or preemption disabled would be ordered by that.
+ * The atomic operation doesn't need to have stronger ordering
+ * requirements because that is enforced by the scheduling
+ * guarantees.
+ */
+ __lru_add_drain_all(true);
+#else
+ lru_add_drain();
+#endif
+}
+
/**
* release_pages - batched put_page()
* @pages: array of pages to release
--
2.30.1.766.gb4fecdf3b7-goog

2021-03-11 22:43:56

by Chris Goldsworthy

Subject: Re: [PATCH v3 2/3] mm: disable LRU pagevec during the migration temporarily

On 2021-03-10 08:14, Minchan Kim wrote:
> LRU pagevec holds refcount of pages until the pagevec are drained.
> It could prevent migration since the refcount of the page is greater
> than the expection in migration logic. To mitigate the issue,
> callers of migrate_pages drains LRU pagevec via migrate_prep or
> lru_add_drain_all before migrate_pages call.
>
> However, it's not enough because pages coming into pagevec after the
> draining call still could stay at the pagevec so it could keep
> preventing page migration. Since some callers of migrate_pages have
> retrial logic with LRU draining, the page would migrate at next trail
> but it is still fragile in that it doesn't close the fundamental race
> between upcoming LRU pages into pagvec and migration so the migration
> failure could cause contiguous memory allocation failure in the end.
>
> To close the race, this patch disables lru caches(i.e, pagevec)
> during ongoing migration until migrate is done.
>
> Since it's really hard to reproduce, I measured how many times
> migrate_pages retried with force mode(it is about a fallback to a
> sync migration) with below debug code.
>
> int migrate_pages(struct list_head *from, new_page_t get_new_page,
> ..
> ..
>
> if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
> printk(KERN_ERR, "pfn 0x%lx reason %d\n", page_to_pfn(page),
> rc);
> dump_page(page, "fail to migrate");
> }
>
> The test was repeating android apps launching with cma allocation
> in background every five seconds. Total cma allocation count was
> about 500 during the testing. With this patch, the dump_page count
> was reduced from 400 to 30.
>
> The new interface is also useful for memory hotplug which currently
> drains lru pcp caches after each migration failure. This is rather
> suboptimal as it has to disrupt others running during the operation.
> With the new interface the operation happens only once. This is also in
> line with pcp allocator cache which are disabled for the offlining as
> well.
>
> Signed-off-by: Minchan Kim <[email protected]>
> ---

Hi Minchan,

This all looks good to me - feel free to add a Reviewed-by from me.

Thanks,

Chris.

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

2021-03-12 08:24:17

by Michal Hocko

Subject: Re: [PATCH v3 1/3] mm: replace migrate_prep with lru_add_drain_all

On Wed 10-03-21 08:14:27, Minchan Kim wrote:
> Currently, migrate_prep is merely a wrapper of lru_cache_add_all.
> There is not much to gain from having additional abstraction.
>
> Use lru_add_drain_all instead of migrate_prep, which would be more
> descriptive.
>
> note: migrate_prep_local in compaction.c changed into lru_add_drain
> to avoid CPU schedule cost with involving many other CPUs to keep
> keep old behavior.
>
> Signed-off-by: Minchan Kim <[email protected]>

Acked-by: Michal Hocko <[email protected]>

Btw. that migrate_prep_local likely needs revisiting. I really fail to
see why it is useful. It looks like a just-in-case thing to me. If it is
needed then the comment should be describing why. Something for a
separate patch though.

> ---
> include/linux/migrate.h | 5 -----
> mm/compaction.c | 3 ++-
> mm/mempolicy.c | 4 ++--
> mm/migrate.c | 24 +-----------------------
> mm/page_alloc.c | 2 +-
> mm/swap.c | 5 +++++
> 6 files changed, 11 insertions(+), 32 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 3a389633b68f..6155d97ec76c 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -45,8 +45,6 @@ extern struct page *alloc_migration_target(struct page *page, unsigned long priv
> extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
> extern void putback_movable_page(struct page *page);
>
> -extern void migrate_prep(void);
> -extern void migrate_prep_local(void);
> extern void migrate_page_states(struct page *newpage, struct page *page);
> extern void migrate_page_copy(struct page *newpage, struct page *page);
> extern int migrate_huge_page_move_mapping(struct address_space *mapping,
> @@ -66,9 +64,6 @@ static inline struct page *alloc_migration_target(struct page *page,
> static inline int isolate_movable_page(struct page *page, isolate_mode_t mode)
> { return -EBUSY; }
>
> -static inline int migrate_prep(void) { return -ENOSYS; }
> -static inline int migrate_prep_local(void) { return -ENOSYS; }
> -
> static inline void migrate_page_states(struct page *newpage, struct page *page)
> {
> }
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e04f4476e68e..3be017ececc0 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2319,7 +2319,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
> trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
> cc->free_pfn, end_pfn, sync);
>
> - migrate_prep_local();
> + /* lru_add_drain_all could be expensive with involving other CPUs */
> + lru_add_drain();
>
> while ((ret = compact_finished(cc)) == COMPACT_CONTINUE) {
> int err;
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index ab51132547b8..fc024e97be37 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1124,7 +1124,7 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
> int err = 0;
> nodemask_t tmp;
>
> - migrate_prep();
> + lru_add_drain_all();
>
> mmap_read_lock(mm);
>
> @@ -1323,7 +1323,7 @@ static long do_mbind(unsigned long start, unsigned long len,
>
> if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
>
> - migrate_prep();
> + lru_add_drain_all();
> }
> {
> NODEMASK_SCRATCH(scratch);
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 62b81d5257aa..45f925e10f5a 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -57,28 +57,6 @@
>
> #include "internal.h"
>
> -/*
> - * migrate_prep() needs to be called before we start compiling a list of pages
> - * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
> - * undesirable, use migrate_prep_local()
> - */
> -void migrate_prep(void)
> -{
> - /*
> - * Clear the LRU lists so pages can be isolated.
> - * Note that pages may be moved off the LRU after we have
> - * drained them. Those pages will fail to migrate like other
> - * pages that may be busy.
> - */
> - lru_add_drain_all();
> -}
> -
> -/* Do the necessary work of migrate_prep but not if it involves other CPUs */
> -void migrate_prep_local(void)
> -{
> - lru_add_drain();
> -}
> -
> int isolate_movable_page(struct page *page, isolate_mode_t mode)
> {
> struct address_space *mapping;
> @@ -1769,7 +1747,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
> int start, i;
> int err = 0, err1;
>
> - migrate_prep();
> + lru_add_drain_all();
>
> for (i = start = 0; i < nr_pages; i++) {
> const void __user *p;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2e8348936df8..f05a8db741ca 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8467,7 +8467,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
> .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
> };
>
> - migrate_prep();
> + lru_add_drain_all();
>
> while (pfn < end || !list_empty(&cc->migratepages)) {
> if (fatal_signal_pending(current)) {
> diff --git a/mm/swap.c b/mm/swap.c
> index 31b844d4ed94..441d1ae1f285 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -729,6 +729,11 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
> }
>
> /*
> + * lru_add_drain_all() usually needs to be called before we start compiling
> + * a list of pages to be migrated using isolate_lru_page(). Note that pages
> + * may be moved off the LRU after we have drained them. Those pages will
> + * fail to migrate like other pages that may be busy.
> + *
> * Doesn't need any cpu hotplug locking because we do rely on per-cpu
> * kworkers being shut down before our page_alloc_cpu_dead callback is
> * executed on the offlined cpu.
> --
> 2.30.1.766.gb4fecdf3b7-goog

--
Michal Hocko
SUSE Labs

2021-03-12 08:24:41

by Michal Hocko

Subject: Re: [PATCH v3 2/3] mm: disable LRU pagevec during the migration temporarily

On Wed 10-03-21 08:14:28, Minchan Kim wrote:
> LRU pagevec holds refcount of pages until the pagevec are drained.
> It could prevent migration since the refcount of the page is greater
> than the expection in migration logic. To mitigate the issue,
> callers of migrate_pages drains LRU pagevec via migrate_prep or
> lru_add_drain_all before migrate_pages call.
>
> However, it's not enough because pages coming into pagevec after the
> draining call still could stay at the pagevec so it could keep
> preventing page migration. Since some callers of migrate_pages have
> retrial logic with LRU draining, the page would migrate at next trail
> but it is still fragile in that it doesn't close the fundamental race
> between upcoming LRU pages into pagvec and migration so the migration
> failure could cause contiguous memory allocation failure in the end.
>
> To close the race, this patch disables lru caches(i.e, pagevec)
> during ongoing migration until migrate is done.
>
> Since it's really hard to reproduce, I measured how many times
> migrate_pages retried with force mode(it is about a fallback to a
> sync migration) with below debug code.
>
> int migrate_pages(struct list_head *from, new_page_t get_new_page,
> ..
> ..
>
> if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
> printk(KERN_ERR, "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
> dump_page(page, "fail to migrate");
> }
>
> The test was repeating android apps launching with cma allocation
> in background every five seconds. Total cma allocation count was
> about 500 during the testing. With this patch, the dump_page count
> was reduced from 400 to 30.
>
> The new interface is also useful for memory hotplug which currently
> drains lru pcp caches after each migration failure. This is rather
> suboptimal as it has to disrupt others running during the operation.
> With the new interface the operation happens only once. This is also in
> line with pcp allocator cache which are disabled for the offlining as
> well.
>
> Signed-off-by: Minchan Kim <[email protected]>

Looks good to me.
Acked-by: Michal Hocko <[email protected]>

Thanks

--
Michal Hocko
SUSE Labs

2021-03-12 08:56:22

by David Hildenbrand

Subject: Re: [PATCH v3 1/3] mm: replace migrate_prep with lru_add_drain_all

On 10.03.21 17:14, Minchan Kim wrote:
> Currently, migrate_prep is merely a wrapper of lru_cache_add_all.
> There is not much to gain from having additional abstraction.
>
> Use lru_add_drain_all instead of migrate_prep, which would be more
> descriptive.
>
> note: migrate_prep_local in compaction.c changed into lru_add_drain
> to avoid CPU schedule cost with involving many other CPUs to keep
> keep old behavior.
>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
> include/linux/migrate.h | 5 -----
> mm/compaction.c | 3 ++-
> mm/mempolicy.c | 4 ++--
> mm/migrate.c | 24 +-----------------------
> mm/page_alloc.c | 2 +-
> mm/swap.c | 5 +++++
> 6 files changed, 11 insertions(+), 32 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 3a389633b68f..6155d97ec76c 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -45,8 +45,6 @@ extern struct page *alloc_migration_target(struct page *page, unsigned long priv
> extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
> extern void putback_movable_page(struct page *page);
>
> -extern void migrate_prep(void);
> -extern void migrate_prep_local(void);
> extern void migrate_page_states(struct page *newpage, struct page *page);
> extern void migrate_page_copy(struct page *newpage, struct page *page);
> extern int migrate_huge_page_move_mapping(struct address_space *mapping,
> @@ -66,9 +64,6 @@ static inline struct page *alloc_migration_target(struct page *page,
> static inline int isolate_movable_page(struct page *page, isolate_mode_t mode)
> { return -EBUSY; }
>
> -static inline int migrate_prep(void) { return -ENOSYS; }
> -static inline int migrate_prep_local(void) { return -ENOSYS; }
> -
> static inline void migrate_page_states(struct page *newpage, struct page *page)
> {
> }
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e04f4476e68e..3be017ececc0 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2319,7 +2319,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
> trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
> cc->free_pfn, end_pfn, sync);
>
> - migrate_prep_local();
> + /* lru_add_drain_all could be expensive with involving other CPUs */
> + lru_add_drain();
>
> while ((ret = compact_finished(cc)) == COMPACT_CONTINUE) {
> int err;
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index ab51132547b8..fc024e97be37 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1124,7 +1124,7 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
> int err = 0;
> nodemask_t tmp;
>
> - migrate_prep();
> + lru_add_drain_all();
>
> mmap_read_lock(mm);
>
> @@ -1323,7 +1323,7 @@ static long do_mbind(unsigned long start, unsigned long len,
>
> if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
>
> - migrate_prep();
> + lru_add_drain_all();
> }
> {
> NODEMASK_SCRATCH(scratch);
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 62b81d5257aa..45f925e10f5a 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -57,28 +57,6 @@
>
> #include "internal.h"
>
> -/*
> - * migrate_prep() needs to be called before we start compiling a list of pages
> - * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
> - * undesirable, use migrate_prep_local()
> - */
> -void migrate_prep(void)
> -{
> - /*
> - * Clear the LRU lists so pages can be isolated.
> - * Note that pages may be moved off the LRU after we have
> - * drained them. Those pages will fail to migrate like other
> - * pages that may be busy.
> - */
> - lru_add_drain_all();
> -}
> -
> -/* Do the necessary work of migrate_prep but not if it involves other CPUs */
> -void migrate_prep_local(void)
> -{
> - lru_add_drain();
> -}
> -
> int isolate_movable_page(struct page *page, isolate_mode_t mode)
> {
> struct address_space *mapping;
> @@ -1769,7 +1747,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
> int start, i;
> int err = 0, err1;
>
> - migrate_prep();
> + lru_add_drain_all();
>
> for (i = start = 0; i < nr_pages; i++) {
> const void __user *p;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2e8348936df8..f05a8db741ca 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8467,7 +8467,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
> .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
> };
>
> - migrate_prep();
> + lru_add_drain_all();
>
> while (pfn < end || !list_empty(&cc->migratepages)) {
> if (fatal_signal_pending(current)) {
> diff --git a/mm/swap.c b/mm/swap.c
> index 31b844d4ed94..441d1ae1f285 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -729,6 +729,11 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
> }
>
> /*
> + * lru_add_drain_all() usually needs to be called before we start compiling
> + * a list of pages to be migrated using isolate_lru_page(). Note that pages
> + * may be moved off the LRU after we have drained them. Those pages will
> + * fail to migrate like other pages that may be busy.
> + *
> * Doesn't need any cpu hotplug locking because we do rely on per-cpu
> * kworkers being shut down before our page_alloc_cpu_dead callback is
> * executed on the offlined cpu.
>

Reviewed-by: David Hildenbrand <[email protected]>

--
Thanks,

David / dhildenb

2021-03-12 09:04:23

by David Hildenbrand

Subject: Re: [PATCH v3 2/3] mm: disable LRU pagevec during the migration temporarily

On 10.03.21 17:14, Minchan Kim wrote:
> LRU pagevec holds refcount of pages until the pagevec are drained.
> It could prevent migration since the refcount of the page is greater
> than the expection in migration logic. To mitigate the issue,
> callers of migrate_pages drains LRU pagevec via migrate_prep or
> lru_add_drain_all before migrate_pages call.
>
> However, it's not enough because pages coming into pagevec after the
> draining call still could stay at the pagevec so it could keep
> preventing page migration. Since some callers of migrate_pages have
> retrial logic with LRU draining, the page would migrate at next trail
> but it is still fragile in that it doesn't close the fundamental race
> between upcoming LRU pages into pagvec and migration so the migration
> failure could cause contiguous memory allocation failure in the end.
>
> To close the race, this patch disables lru caches(i.e, pagevec)
> during ongoing migration until migrate is done.
>
> Since it's really hard to reproduce, I measured how many times
> migrate_pages retried with force mode(it is about a fallback to a
> sync migration) with below debug code.
>
> int migrate_pages(struct list_head *from, new_page_t get_new_page,
> ..
> ..
>
> if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
> printk(KERN_ERR, "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
> dump_page(page, "fail to migrate");
> }
>
> The test was repeating android apps launching with cma allocation
> in background every five seconds. Total cma allocation count was
> about 500 during the testing. With this patch, the dump_page count
> was reduced from 400 to 30.
>
> The new interface is also useful for memory hotplug which currently
> drains lru pcp caches after each migration failure. This is rather
> suboptimal as it has to disrupt others running during the operation.
> With the new interface the operation happens only once. This is also in
> line with pcp allocator cache which are disabled for the offlining as
> well.
>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
> include/linux/swap.h | 3 ++
> mm/memory_hotplug.c | 3 +-
> mm/mempolicy.c | 4 ++-
> mm/migrate.c | 3 +-
> mm/swap.c | 79 ++++++++++++++++++++++++++++++++++++--------
> 5 files changed, 75 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 32f665b1ee85..a3e258335a7f 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -339,6 +339,9 @@ extern void lru_note_cost(struct lruvec *lruvec, bool file,
> extern void lru_note_cost_page(struct page *);
> extern void lru_cache_add(struct page *);
> extern void mark_page_accessed(struct page *);
> +extern void lru_cache_disable(void);
> +extern void lru_cache_enable(void);
> +extern bool lru_cache_disabled(void);
> extern void lru_add_drain(void);
> extern void lru_add_drain_cpu(int cpu);
> extern void lru_add_drain_cpu_zone(struct zone *zone);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 5ba51a8bdaeb..959f659ef085 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1611,6 +1611,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> * in a way that pages from isolated pageblock are left on pcplists.
> */
> zone_pcp_disable(zone);
> + lru_cache_disable();

Did you also experiment with what effect zone_pcp_disable() might have
on alloc_contig_range()?

Feels like both calls could be abstracted somehow and used in both
(memory offlining/alloc_contig_range) cases. It's essentially disabling
some kind of caching.


Looks sane to me, but I am not experienced enough with the migration
code to give this a real RB.

--
Thanks,

David / dhildenb

2021-03-12 09:06:48

by David Hildenbrand

Subject: Re: [PATCH v3 3/3] mm: fs: Invalidate BH LRU during page migration

On 10.03.21 17:14, Minchan Kim wrote:
> Pages containing buffer_heads that are in one of the per-CPU
> buffer_head LRU caches will be pinned and thus cannot be migrated.
> This can prevent CMA allocations from succeeding, which are often used
> on platforms with co-processors (such as a DSP) that can only use
> physically contiguous memory. It can also prevent memory
> hot-unplugging from succeeding, which involves migrating at least
> MIN_MEMORY_BLOCK_SIZE bytes of memory, which ranges from 8 MiB to 1
> GiB based on the architecture in use.

Actually, it's memory_block_size_bytes(), which can be even bigger
(IIRC, 128MiB..2 GiB on x86-64) that fails to get offlined. But that
will prevent bigger granularity (e.g., a whole DIMM) from getting unplugged.

>
> Correspondingly, invalidate the BH LRU caches before a migration
> starts and stop any buffer_head from being cached in the LRU caches,
> until migration has finished.

Sounds sane to me.

--
Thanks,

David / dhildenb

2021-03-12 09:35:32

by David Hildenbrand

Subject: Re: [PATCH v3 3/3] mm: fs: Invalidate BH LRU during page migration

On 12.03.21 10:03, David Hildenbrand wrote:
> On 10.03.21 17:14, Minchan Kim wrote:
>> Pages containing buffer_heads that are in one of the per-CPU
>> buffer_head LRU caches will be pinned and thus cannot be migrated.
>> This can prevent CMA allocations from succeeding, which are often used
>> on platforms with co-processors (such as a DSP) that can only use
>> physically contiguous memory. It can also prevent memory
>> hot-unplugging from succeeding, which involves migrating at least
>> MIN_MEMORY_BLOCK_SIZE bytes of memory, which ranges from 8 MiB to 1
>> GiB based on the architecture in use.
>
> Actually, it's memory_block_size_bytes(), which can be even bigger
> (IIRC, 128MiB..2 GiB on x86-64) that fails to get offlined. But that
> will prevent bigger granularity (e.g., a whole DIMM) from getting unplugged.
>
>>
>> Correspondingly, invalidate the BH LRU caches before a migration
>> starts and stop any buffer_head from being cached in the LRU caches,
>> until migration has finished.
>
> Sounds sane to me.
>

Diving a bit into the code, I am wondering:


a) Are these buffer head pages marked as movable?

IOW, are they either PageLRU() or __PageMovable()?


b) How do these pages end up on ZONE_MOVABLE or MIGRATE_CMA?

I assume these pages come via
alloc_page_buffers()->alloc_buffer_head()->kmem_cache_zalloc(GFP_NOFS |
__GFP_ACCOUNT)



--
Thanks,

David / dhildenb

2021-03-12 17:21:15

by Minchan Kim

Subject: Re: [PATCH v3 3/3] mm: fs: Invalidate BH LRU during page migration

On Fri, Mar 12, 2021 at 10:33:48AM +0100, David Hildenbrand wrote:
> On 12.03.21 10:03, David Hildenbrand wrote:
> > On 10.03.21 17:14, Minchan Kim wrote:
> > > Pages containing buffer_heads that are in one of the per-CPU
> > > buffer_head LRU caches will be pinned and thus cannot be migrated.
> > > This can prevent CMA allocations from succeeding, which are often used
> > > on platforms with co-processors (such as a DSP) that can only use
> > > physically contiguous memory. It can also prevent memory
> > > hot-unplugging from succeeding, which involves migrating at least
> > > MIN_MEMORY_BLOCK_SIZE bytes of memory, which ranges from 8 MiB to 1
> > > GiB based on the architecture in use.
> >
> > Actually, it's memory_block_size_bytes(), which can be even bigger
> > (IIRC, 128MiB..2 GiB on x86-64) that fails to get offlined. But that
> > will prevent bigger granularity (e.g., a whole DIMM) from getting unplugged.
> >
> > >
> > > Correspondingly, invalidate the BH LRU caches before a migration
> > > starts and stop any buffer_head from being cached in the LRU caches,
> > > until migration has finished.
> >
> > Sounds sane to me.
> >
>
> Diving a bit into the code, I am wondering:
>
>
> a) Are these buffer head pages marked as movable?
>
> IOW, are they either PageLRU() or __PageMovable()?
>
>
> b) How do these pages end up on ZONE_MOVABLE or MIGRATE_CMA?
>
> I assume these pages come via
> alloc_page_buffers()->alloc_buffer_head()->kmem_cache_zalloc(GFP_NOFS |
> __GFP_ACCOUNT)
>

It's indirect, so it was not clear from the code:

try_to_release_page
try_to_free_buffers
buffer_busy
failed
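
For reference, the check that fails is roughly (fs/buffer.c):

	static int buffer_busy(struct buffer_head *bh)
	{
		return atomic_read(&bh->b_count) |
			(bh->b_state & ((1 << BH_Dirty) | (1 << BH_Lock)));
	}

so the reference taken by bh_lru_install() keeps b_count elevated and
the buffers (and the page that owns them) cannot be dropped.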

Yeah, comment is misleading. This one would be better.

/*
* the refcount of buffer_head in bh_lru prevents dropping the
* attached page (i.e., try_to_free_buffers) so it could cause
* failing page migration.
* Skip putting upcoming bh into bh_lru until migration is done.
*/

2021-03-14 05:20:50

by Chris Goldsworthy

Subject: Re: [PATCH v3 2/3] mm: disable LRU pagevec during the migration temporarily

On 2021-03-11 14:41, Chris Goldsworthy wrote:
> On 2021-03-10 08:14, Minchan Kim wrote:
>> LRU pagevec holds refcount of pages until the pagevec are drained.
>> It could prevent migration since the refcount of the page is greater
>> than the expection in migration logic. To mitigate the issue,
>> callers of migrate_pages drains LRU pagevec via migrate_prep or
>> lru_add_drain_all before migrate_pages call.
>>
>> However, it's not enough because pages coming into pagevec after the
>> draining call still could stay at the pagevec so it could keep
>> preventing page migration. Since some callers of migrate_pages have
>> retrial logic with LRU draining, the page would migrate at next trail
>> but it is still fragile in that it doesn't close the fundamental race
>> between upcoming LRU pages into pagvec and migration so the migration
>> failure could cause contiguous memory allocation failure in the end.
>>
>> To close the race, this patch disables lru caches(i.e, pagevec)
>> during ongoing migration until migrate is done.
>>
>> Since it's really hard to reproduce, I measured how many times
>> migrate_pages retried with force mode(it is about a fallback to a
>> sync migration) with below debug code.
>>
>> int migrate_pages(struct list_head *from, new_page_t get_new_page,
>> ..
>> ..
>>
>> if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
>> printk(KERN_ERR, "pfn 0x%lx reason %d\n", page_to_pfn(page),
>> rc);
>> dump_page(page, "fail to migrate");
>> }
>>
>> The test was repeating android apps launching with cma allocation
>> in background every five seconds. Total cma allocation count was
>> about 500 during the testing. With this patch, the dump_page count
>> was reduced from 400 to 30.
>>
>> The new interface is also useful for memory hotplug which currently
>> drains lru pcp caches after each migration failure. This is rather
>> suboptimal as it has to disrupt others running during the operation.
>> With the new interface the operation happens only once. This is also
>> in
>> line with pcp allocator cache which are disabled for the offlining as
>> well.
>>
>> Signed-off-by: Minchan Kim <[email protected]>
>> ---
>
> Hi Minchan,
>
> This all looks good to me - feel free to add a Reviewed-by from me.
>
> Thanks,
>
> Chris.
Should have added:

Reviewed-by: Chris Goldsworthy <[email protected]>
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

2021-03-16 21:20:36

by Minchan Kim

Subject: Re: [PATCH v3 3/3] mm: fs: Invalidate BH LRU during page migration

On Fri, Mar 12, 2021 at 09:17:23AM -0800, Minchan Kim wrote:
> On Fri, Mar 12, 2021 at 10:33:48AM +0100, David Hildenbrand wrote:
> > On 12.03.21 10:03, David Hildenbrand wrote:
> > > On 10.03.21 17:14, Minchan Kim wrote:
> > > > Pages containing buffer_heads that are in one of the per-CPU
> > > > buffer_head LRU caches will be pinned and thus cannot be migrated.
> > > > This can prevent CMA allocations from succeeding, which are often used
> > > > on platforms with co-processors (such as a DSP) that can only use
> > > > physically contiguous memory. It can also prevent memory
> > > > hot-unplugging from succeeding, which involves migrating at least
> > > > MIN_MEMORY_BLOCK_SIZE bytes of memory, which ranges from 8 MiB to 1
> > > > GiB based on the architecture in use.
> > >
> > > Actually, it's memory_block_size_bytes(), which can be even bigger
> > > (IIRC, 128MiB..2 GiB on x86-64) that fails to get offlined. But that
> > > will prevent bigger granularity (e.g., a whole DIMM) from getting unplugged.
> > >
> > > >
> > > > Correspondingly, invalidate the BH LRU caches before a migration
> > > > starts and stop any buffer_head from being cached in the LRU caches,
> > > > until migration has finished.
> > >
> > > Sounds sane to me.
> > >
> >
> > Diving a bit into the code, I am wondering:
> >
> >
> > a) Are these buffer head pages marked as movable?
> >
> > IOW, are they either PageLRU() or __PageMovable()?
> >
> >
> > b) How do these pages end up on ZONE_MOVABLE or MIGRATE_CMA?
> >
> > I assume these pages come via
> > alloc_page_buffers()->alloc_buffer_head()->kmem_cache_zalloc(GFP_NOFS |
> > __GFP_ACCOUNT)
> >
>
> It's indirect it was not clear
>
> try_to_release_page
> try_to_free_buffers
> buffer_busy
> failed
>
> Yeah, comment is misleading. This one would be better.
>
> /*
> * the refcount of buffer_head in bh_lru prevents dropping the
> * attached page(i.e., try_to_free_buffers) so it could cause
> * failing page migrationn.
> * Skip putting upcoming bh into bh_lru until migration is done.
> */

Hi Andrew,

Could you fold in this comment fix patch? If you prefer a formal patch,
let me know and I will resend it.

Thank you.

From 0774f21e2dc8220fc2be80c25f711cb061363519 Mon Sep 17 00:00:00 2001
From: Minchan Kim <[email protected]>
Date: Fri, 12 Mar 2021 09:17:34 -0800
Subject: [PATCH] comment fix

Signed-off-by: Minchan Kim <[email protected]>
---
fs/buffer.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index ca9dd736bcb8..8602dcbe0327 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1265,8 +1265,9 @@ static void bh_lru_install(struct buffer_head *bh)

check_irqs_on();
/*
- * buffer_head in bh_lru could increase refcount of the page
- * until it will be invalidated. It causes page migraion failure.
+ * the refcount of buffer_head in bh_lru prevents dropping the
+ * attached page (i.e., try_to_free_buffers) so it could cause
+ * failing page migration.
* Skip putting upcoming bh into bh_lru until migration is done.
*/
if (lru_cache_disabled())
--
2.31.0.rc2.261.g7f71774620-goog

2021-03-17 02:45:08

by kernel test robot

Subject: [mm] 8fd8d23ab1: WARNING:at_fs/buffer.c:#__brelse



Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: 8fd8d23ab10cc2fceeac25ea7b0e2eaf98e78d64 ("[PATCH v3 3/3] mm: fs: Invalidate BH LRU during page migration")
url: https://github.com/0day-ci/linux/commits/Minchan-Kim/mm-replace-migrate_prep-with-lru_add_drain_all/20210311-001714
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 144c79ef33536b4ecb4951e07dbc1f2b7fa99d32

in testcase: blktests
version: blktests-x86_64-a210761-1_20210124
with following parameters:

test: nbd-group-01
ucode: 0xe2



on test machine: 4 threads Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz with 32G memory

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):



If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


[ 40.465061] WARNING: CPU: 2 PID: 5207 at fs/buffer.c:1177 __brelse (kbuild/src/consumer/fs/buffer.c:1177 kbuild/src/consumer/fs/buffer.c:1171)
[ 40.465066] Modules linked in: nbd loop xfs libcrc32c dm_multipath dm_mod ipmi_devintf ipmi_msghandler sd_mod t10_pi sg intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel rapl i915 mei_wdt intel_cstate wmi_bmof intel_gtt drm_kms_helper syscopyarea ahci intel_uncore sysfillrect sysimgblt libahci fb_sys_fops drm libata mei_me mei intel_pch_thermal wmi video intel_pmc_core acpi_pad ip_tables
[ 40.465086] CPU: 2 PID: 5207 Comm: mount_clear_soc Tainted: G I 5.12.0-rc2-00062-g8fd8d23ab10c #1
[ 40.465088] Hardware name: Dell Inc. OptiPlex 7040/0Y7WYT, BIOS 1.1.1 10/07/2015
[ 40.465089] RIP: 0010:__brelse (kbuild/src/consumer/fs/buffer.c:1177 kbuild/src/consumer/fs/buffer.c:1171)
[ 40.465091] Code: 00 00 00 00 00 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 8b 47 60 85 c0 74 05 f0 ff 4f 60 c3 48 c7 c7 d8 99 57 82 e8 02 5d 80 00 <0f> 0b c3 0f 1f 44 00 00 55 65 ff 05 13 79 c8 7e 53 48 c7 c3 c0 89
All code
========
0: 00 00 add %al,(%rax)
2: 00 00 add %al,(%rax)
4: 00 0f add %cl,(%rdi)
6: 1f (bad)
7: 84 00 test %al,(%rax)
9: 00 00 add %al,(%rax)
b: 00 00 add %al,(%rax)
d: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
12: 8b 47 60 mov 0x60(%rdi),%eax
15: 85 c0 test %eax,%eax
17: 74 05 je 0x1e
19: f0 ff 4f 60 lock decl 0x60(%rdi)
1d: c3 retq
1e: 48 c7 c7 d8 99 57 82 mov $0xffffffff825799d8,%rdi
25: e8 02 5d 80 00 callq 0x805d2c
2a:* 0f 0b ud2 <-- trapping instruction
2c: c3 retq
2d: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
32: 55 push %rbp
33: 65 ff 05 13 79 c8 7e incl %gs:0x7ec87913(%rip) # 0x7ec8794d
3a: 53 push %rbx
3b: 48 rex.W
3c: c7 .byte 0xc7
3d: c3 retq
3e: c0 .byte 0xc0
3f: 89 .byte 0x89

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: c3 retq
3: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
8: 55 push %rbp
9: 65 ff 05 13 79 c8 7e incl %gs:0x7ec87913(%rip) # 0x7ec87923
10: 53 push %rbx
11: 48 rex.W
12: c7 .byte 0xc7
13: c3 retq
14: c0 .byte 0xc0
15: 89 .byte 0x89
[ 40.465092] RSP: 0018:ffffc90000138fa8 EFLAGS: 00010086
[ 40.465094] RAX: 0000000000000000 RBX: ffff888871d289c0 RCX: 0000000000000027
[ 40.465095] RDX: 0000000000000027 RSI: 0000000000000002 RDI: ffff888871d177f8
[ 40.465096] RBP: ffff888871d28a40 R08: ffff888871d177f0 R09: ffffc90000138f40
[ 40.465096] R10: 0000000000000001 R11: ffffc90000138ff8 R12: ffff888871d312c0
[ 40.465097] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 40.465098] FS: 0000000000000000(0000) GS:ffff888871d00000(0000) knlGS:0000000000000000
[ 40.465099] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 40.465100] CR2: 00007f99690c38a0 CR3: 000000087020a001 CR4: 00000000003706e0
[ 40.465101] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 40.465102] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 40.465102] Call Trace:
[ 40.465104] <IRQ>
[ 40.465104] invalidate_bh_lru (kbuild/src/consumer/fs/buffer.c:1427)
[ 40.465107] flush_smp_call_function_queue (kbuild/src/consumer/kernel/smp.c:247 kbuild/src/consumer/kernel/smp.c:396)
[ 40.465111] __sysvec_call_function (kbuild/src/consumer/arch/x86/include/asm/jump_label.h:25 kbuild/src/consumer/include/linux/jump_label.h:200 kbuild/src/consumer/arch/x86/include/asm/trace/irq_vectors.h:93 kbuild/src/consumer/arch/x86/kernel/smp.c:240)
[ 40.465113] sysvec_call_function (kbuild/src/consumer/arch/x86/kernel/smp.c:234 (discriminator 14))
[ 40.465116] </IRQ>
[ 40.465116] asm_sysvec_call_function (kbuild/src/consumer/arch/x86/include/asm/idtentry.h:641)
[ 40.465119] RIP: 0010:console_unlock (kbuild/src/consumer/kernel/printk/printk.c:2586)
[ 40.465121] Code: ff 85 c0 0f 85 3a ff ff ff e8 04 f6 ff ff 85 c0 0f 85 06 fd ff ff e9 28 ff ff ff e8 92 2b 00 00 48 85 ed 74 01 fb 8b 44 24 14 <85> c0 0f 84 39 fd ff ff 31 d2 be 1b 0a 00 00 48 c7 c7 e3 41 55 82
All code
========
0: ff 85 c0 0f 85 3a incl 0x3a850fc0(%rbp)
6: ff (bad)
7: ff (bad)
8: ff (bad)
9: e8 04 f6 ff ff callq 0xfffffffffffff612
e: 85 c0 test %eax,%eax
10: 0f 85 06 fd ff ff jne 0xfffffffffffffd1c
16: e9 28 ff ff ff jmpq 0xffffffffffffff43
1b: e8 92 2b 00 00 callq 0x2bb2
20: 48 85 ed test %rbp,%rbp
23: 74 01 je 0x26
25: fb sti
26: 8b 44 24 14 mov 0x14(%rsp),%eax
2a:* 85 c0 test %eax,%eax <-- trapping instruction
2c: 0f 84 39 fd ff ff je 0xfffffffffffffd6b
32: 31 d2 xor %edx,%edx
34: be 1b 0a 00 00 mov $0xa1b,%esi
39: 48 c7 c7 e3 41 55 82 mov $0xffffffff825541e3,%rdi

Code starting with the faulting instruction
===========================================
0: 85 c0 test %eax,%eax
2: 0f 84 39 fd ff ff je 0xfffffffffffffd41
8: 31 d2 xor %edx,%edx
a: be 1b 0a 00 00 mov $0xa1b,%esi
f: 48 c7 c7 e3 41 55 82 mov $0xffffffff825541e3,%rdi
[ 40.465122] RSP: 0018:ffffc900077d7be0 EFLAGS: 00000206
[ 40.465124] RAX: 0000000000000000 RBX: ffffffff8356c2ac RCX: 0000000000000000
[ 40.465124] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffffffff8356e738
[ 40.465125] RBP: 0000000000000200 R08: 0000000000000001 R09: 0000000000000019
[ 40.465126] R10: 000000000000072d R11: 000000002d2d2d2d R12: ffffffff8356e738
[ 40.465126] R13: 0000000000000000 R14: 0000000000000034 R15: 0000000000000000
[ 40.465129] vprintk_emit (kbuild/src/consumer/arch/x86/include/asm/preempt.h:85 kbuild/src/consumer/kernel/printk/printk.c:2099)
[ 40.465130] ? release_pages (kbuild/src/consumer/mm/swap.c:983)
[ 40.465132] printk (kbuild/src/consumer/kernel/printk/printk.c:2150)
[ 40.465135] __warn_printk (kbuild/src/consumer/kernel/panic.c:646)
[ 40.465137] __brelse (kbuild/src/consumer/fs/buffer.c:1177 kbuild/src/consumer/fs/buffer.c:1171)
[ 40.465138] invalidate_bh_lru (kbuild/src/consumer/fs/buffer.c:1427)
[ 40.465140] lru_add_drain (kbuild/src/consumer/mm/swap.c:725)
[ 40.465141] exit_mmap (kbuild/src/consumer/mm/mmap.c:3215)
[ 40.465144] ? filename_lookup (kbuild/src/consumer/fs/namei.c:2465)
[ 40.465146] ? __cond_resched (kbuild/src/consumer/kernel/sched/core.c:6990)
[ 40.465148] mmput (kbuild/src/consumer/kernel/fork.c:1083 kbuild/src/consumer/kernel/fork.c:1103)
[ 40.465149] do_exit (kbuild/src/consumer/arch/x86/include/asm/bitops.h:207 kbuild/src/consumer/include/asm-generic/bitops/instrumented-non-atomic.h:135 kbuild/src/consumer/include/linux/thread_info.h:104 kbuild/src/consumer/kernel/exit.c:502 kbuild/src/consumer/kernel/exit.c:812)
[ 40.465152] ? do_user_addr_fault (kbuild/src/consumer/arch/x86/include/asm/jump_label.h:25 kbuild/src/consumer/include/linux/jump_label.h:200 kbuild/src/consumer/include/linux/mmap_lock.h:41 kbuild/src/consumer/include/linux/mmap_lock.h:144 kbuild/src/consumer/arch/x86/mm/fault.c:1414)
[ 40.465154] do_group_exit (kbuild/src/consumer/include/linux/list.h:282 kbuild/src/consumer/include/linux/sched/signal.h:684 kbuild/src/consumer/kernel/exit.c:907)
[ 40.465156] __x64_sys_exit_group (kbuild/src/consumer/kernel/exit.c:933)
[ 40.465159] do_syscall_64 (kbuild/src/consumer/arch/x86/entry/common.c:46)
[ 40.465160] entry_SYSCALL_64_after_hwframe (kbuild/src/consumer/arch/x86/entry/entry_64.S:112)
[ 40.465162] RIP: 0033:0x7f9968fcc9d6
[ 40.465164] Code: Unable to access opcode bytes at RIP 0x7f9968fcc9ac.

Code starting with the faulting instruction
===========================================


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml
bin/lkp run compatible-job.yaml



---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation

Thanks,
Oliver Sang


Attachments:
(No filename) (9.85 kB)
config-5.12.0-rc2-00062-g8fd8d23ab10c (175.56 kB)
job-script (5.84 kB)
dmesg.xz (34.66 kB)
blktests (1.40 kB)
job.yaml (4.89 kB)
reproduce (86.00 B)

2021-03-17 11:20:29

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] mm: fs: Invalidate BH LRU during page migration

On 16.03.21 19:26, Minchan Kim wrote:
> On Fri, Mar 12, 2021 at 09:17:23AM -0800, Minchan Kim wrote:
>> On Fri, Mar 12, 2021 at 10:33:48AM +0100, David Hildenbrand wrote:
>>> On 12.03.21 10:03, David Hildenbrand wrote:
>>>> On 10.03.21 17:14, Minchan Kim wrote:
>>>>> Pages containing buffer_heads that are in one of the per-CPU
>>>>> buffer_head LRU caches will be pinned and thus cannot be migrated.
>>>>> This can prevent CMA allocations from succeeding, which are often used
>>>>> on platforms with co-processors (such as a DSP) that can only use
>>>>> physically contiguous memory. It can also prevent memory
>>>>> hot-unplugging from succeeding, which involves migrating at least
>>>>> MIN_MEMORY_BLOCK_SIZE bytes of memory, which ranges from 8 MiB to 1
>>>>> GiB based on the architecture in use.
>>>>
>>>> Actually, it's memory_block_size_bytes(), which can be even bigger
>>>> (IIRC, 128MiB..2 GiB on x86-64) that fails to get offlined. But that
>>>> will prevent bigger granularity (e.g., a whole DIMM) from getting unplugged.
>>>>
>>>>>
>>>>> Correspondingly, invalidate the BH LRU caches before a migration
>>>>> starts and stop any buffer_head from being cached in the LRU caches,
>>>>> until migration has finished.
>>>>
>>>> Sounds sane to me.
>>>>
>>>
>>> Diving a bit into the code, I am wondering:
>>>
>>>
>>> a) Are these buffer head pages marked as movable?
>>>
>>> IOW, are they either PageLRU() or __PageMovable()?
>>>
>>>
>>> b) How do these pages end up on ZONE_MOVABLE or MIGRATE_CMA?
>>>
>>> I assume these pages come via
>>> alloc_page_buffers()->alloc_buffer_head()->kmem_cache_zalloc(GFP_NOFS |
>>> __GFP_ACCOUNT)
>>>
>>
>> It's indirect, so it was not clear:
>>
>> try_to_release_page
>> try_to_free_buffers
>> buffer_busy
>> failed
>>
>> Yeah, comment is misleading. This one would be better.
>>
>> /*
>> * the refcount of buffer_head in bh_lru prevents dropping the
>> * attached page(i.e., try_to_free_buffers) so it could cause
>> * failing page migrationn.
>> * Skip putting upcoming bh into bh_lru until migration is done.
>> */
>

Thanks, that makes more sense to me now :)
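
(For reference, the path being described boils down to the busy check
below; this is a simplified sketch of the fs/buffer.c logic of that era,
not a verbatim copy. The extra b_count reference held by a per-CPU
bh_lru entry keeps buffer_busy() true, so drop_buffers() and therefore
try_to_free_buffers() give up, and the page cannot shed its buffers for
migration:)

static inline int buffer_busy(struct buffer_head *bh)
{
	/* a bh parked in a per-CPU bh_lru has b_count elevated */
	return atomic_read(&bh->b_count) |
		(bh->b_state & ((1 << BH_Dirty) | (1 << BH_Lock)));
}

static int drop_buffers(struct page *page)
{
	struct buffer_head *bh, *head = page_buffers(page);

	bh = head;
	do {
		if (buffer_busy(bh))
			return 0;	/* try_to_free_buffers() fails here */
		bh = bh->b_this_page;
	} while (bh != head);

	/* all buffers idle: they can be detached and the page migrated */
	return 1;
}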

> Hi Andrew,
>
> Could you fold this comment fix patch? If you prefer formal patch,
> let me know. I will resend it.
>
> Thank you.
>
> From 0774f21e2dc8220fc2be80c25f711cb061363519 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <[email protected]>
> Date: Fri, 12 Mar 2021 09:17:34 -0800
> Subject: [PATCH] comment fix
>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
> fs/buffer.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index ca9dd736bcb8..8602dcbe0327 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1265,8 +1265,9 @@ static void bh_lru_install(struct buffer_head *bh)
>
> check_irqs_on();
> /*
> - * buffer_head in bh_lru could increase refcount of the page
> - * until it will be invalidated. It causes page migraion failure.
> + * the refcount of buffer_head in bh_lru prevents dropping the
> + * attached page(i.e., try_to_free_buffers) so it could cause
> + * failing page migratoin.

s/migratoin/migration/

> * Skip putting upcoming bh into bh_lru until migration is done.
> */
> if (lru_cache_disabled())
>

Acked-by: David Hildenbrand <[email protected]>

--
Thanks,

David / dhildenb

2021-03-17 16:32:02

by Minchan Kim

[permalink] [raw]
Subject: Re: [mm] 8fd8d23ab1: WARNING:at_fs/buffer.c:#__brelse

On Wed, Mar 17, 2021 at 10:37:57AM +0800, kernel test robot wrote:
>
>
> Greeting,
>
> FYI, we noticed the following commit (built with gcc-9):
>
> commit: 8fd8d23ab10cc2fceeac25ea7b0e2eaf98e78d64 ("[PATCH v3 3/3] mm: fs: Invalidate BH LRU during page migration")
> url: https://github.com/0day-ci/linux/commits/Minchan-Kim/mm-replace-migrate_prep-with-lru_add_drain_all/20210311-001714
> base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 144c79ef33536b4ecb4951e07dbc1f2b7fa99d32
>
> in testcase: blktests
> version: blktests-x86_64-a210761-1_20210124
> with following parameters:
>
> test: nbd-group-01
> ucode: 0xe2
>
>
>
> on test machine: 4 threads Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz with 32G memory
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
>
>
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot <[email protected]>
>
>
> [ 40.465061] WARNING: CPU: 2 PID: 5207 at fs/buffer.c:1177 __brelse (kbuild/src/consumer/fs/buffer.c:1177 kbuild/src/consumer/fs/buffer.c:1171)
> [ 40.465066] Modules linked in: nbd loop xfs libcrc32c dm_multipath dm_mod ipmi_devintf ipmi_msghandler sd_mod t10_pi sg intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel rapl i915 mei_wdt intel_cstate wmi_bmof intel_gtt drm_kms_helper syscopyarea ahci intel_uncore sysfillrect sysimgblt libahci fb_sys_fops drm libata mei_me mei intel_pch_thermal wmi video intel_pmc_core acpi_pad ip_tables
> [ 40.465086] CPU: 2 PID: 5207 Comm: mount_clear_soc Tainted: G I 5.12.0-rc2-00062-g8fd8d23ab10c #1
> [ 40.465088] Hardware name: Dell Inc. OptiPlex 7040/0Y7WYT, BIOS 1.1.1 10/07/2015
> [ 40.465089] RIP: 0010:__brelse (kbuild/src/consumer/fs/buffer.c:1177 kbuild/src/consumer/fs/buffer.c:1171)
> [ 40.465091] Code: 00 00 00 00 00 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 8b 47 60 85 c0 74 05 f0 ff 4f 60 c3 48 c7 c7 d8 99 57 82 e8 02 5d 80 00 <0f> 0b c3 0f 1f 44 00 00 55 65 ff 05 13 79 c8 7e 53 48 c7 c3 c0 89

Hi,

Unfortunately, I couldn't set up the lkp test on my local machine
since installation failed (I guess my Linux distribution is a
very minor one).

Do you mind testing this patch? (Please replace the original
patch with this one)

From 23cfe5a8e939e2c077223e009887af8a0b5d6381 Mon Sep 17 00:00:00 2001
From: Minchan Kim <[email protected]>
Date: Tue, 2 Mar 2021 12:05:08 -0800
Subject: [PATCH] mm: fs: Invalidate BH LRU during page migration

Pages containing buffer_heads that are in one of the per-CPU
buffer_head LRU caches will be pinned and thus cannot be migrated.
This can prevent CMA allocations from succeeding, which are often used
on platforms with co-processors (such as a DSP) that can only use
physically contiguous memory. It can also prevent memory
hot-unplugging from succeeding, which involves migrating at least
MIN_MEMORY_BLOCK_SIZE bytes of memory, which ranges from 8 MiB to 1
GiB based on the architecture in use.

Correspondingly, invalidate the BH LRU caches before a migration
starts and stop any buffer_head from being cached in the LRU caches,
until migration has finished.

Signed-off-by: Chris Goldsworthy <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
fs/buffer.c | 36 ++++++++++++++++++++++++++++++------
include/linux/buffer_head.h | 4 ++++
mm/swap.c | 5 ++++-
3 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 0cb7ffd4977c..e9872d0dcbf1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1264,6 +1264,15 @@ static void bh_lru_install(struct buffer_head *bh)
int i;

check_irqs_on();
+ /*
+ * the refcount of buffer_head in bh_lru prevents dropping the
+ * attached page(i.e., try_to_free_buffers) so it could cause
+ * failing page migration.
+ * Skip putting upcoming bh into bh_lru until migration is done.
+ */
+ if (lru_cache_disabled())
+ return;
+
bh_lru_lock();

b = this_cpu_ptr(&bh_lrus);
@@ -1404,6 +1413,15 @@ __bread_gfp(struct block_device *bdev, sector_t block,
}
EXPORT_SYMBOL(__bread_gfp);

+static void __invalidate_bh_lrus(struct bh_lru *b)
+{
+ int i;
+
+ for (i = 0; i < BH_LRU_SIZE; i++) {
+ brelse(b->bhs[i]);
+ b->bhs[i] = NULL;
+ }
+}
/*
* invalidate_bh_lrus() is called rarely - but not only at unmount.
* This doesn't race because it runs in each cpu either in irq
@@ -1412,16 +1430,12 @@ EXPORT_SYMBOL(__bread_gfp);
static void invalidate_bh_lru(void *arg)
{
struct bh_lru *b = &get_cpu_var(bh_lrus);
- int i;

- for (i = 0; i < BH_LRU_SIZE; i++) {
- brelse(b->bhs[i]);
- b->bhs[i] = NULL;
- }
+ __invalidate_bh_lrus(b);
put_cpu_var(bh_lrus);
}

-static bool has_bh_in_lru(int cpu, void *dummy)
+bool has_bh_in_lru(int cpu, void *dummy)
{
struct bh_lru *b = per_cpu_ptr(&bh_lrus, cpu);
int i;
@@ -1440,6 +1454,16 @@ void invalidate_bh_lrus(void)
}
EXPORT_SYMBOL_GPL(invalidate_bh_lrus);

+void invalidate_bh_lrus_cpu(int cpu)
+{
+ struct bh_lru *b;
+
+ bh_lru_lock();
+ b = per_cpu_ptr(&bh_lrus, cpu);
+ __invalidate_bh_lrus(b);
+ bh_lru_unlock();
+}
+
void set_bh_page(struct buffer_head *bh,
struct page *page, unsigned long offset)
{
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 6b47f94378c5..e7e99da31349 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -194,6 +194,8 @@ void __breadahead_gfp(struct block_device *, sector_t block, unsigned int size,
struct buffer_head *__bread_gfp(struct block_device *,
sector_t block, unsigned size, gfp_t gfp);
void invalidate_bh_lrus(void);
+void invalidate_bh_lrus_cpu(int cpu);
+bool has_bh_in_lru(int cpu, void *dummy);
struct buffer_head *alloc_buffer_head(gfp_t gfp_flags);
void free_buffer_head(struct buffer_head * bh);
void unlock_buffer(struct buffer_head *bh);
@@ -406,6 +408,8 @@ static inline int inode_has_buffers(struct inode *inode) { return 0; }
static inline void invalidate_inode_buffers(struct inode *inode) {}
static inline int remove_inode_buffers(struct inode *inode) { return 1; }
static inline int sync_mapping_buffers(struct address_space *mapping) { return 0; }
+static inline void invalidate_bh_lrus_cpu(int cpu) {}
+static inline bool has_bh_in_lru(int cpu, void *dummy) { return 0; }
#define buffer_heads_over_limit 0

#endif /* CONFIG_BLOCK */
diff --git a/mm/swap.c b/mm/swap.c
index fbdf6ac05aec..b962fe45bc02 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -36,6 +36,7 @@
#include <linux/hugetlb.h>
#include <linux/page_idle.h>
#include <linux/local_lock.h>
+#include <linux/buffer_head.h>

#include "internal.h"

@@ -641,6 +642,7 @@ void lru_add_drain_cpu(int cpu)
pagevec_lru_move_fn(pvec, lru_lazyfree_fn);

activate_page_drain(cpu);
+ invalidate_bh_lrus_cpu(cpu);
}

/**
@@ -828,7 +830,8 @@ static void __lru_add_drain_all(bool force_all_cpus)
pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
pagevec_count(&per_cpu(lru_pvecs.lru_lazyfree, cpu)) ||
- need_activate_page_drain(cpu)) {
+ need_activate_page_drain(cpu) ||
+ has_bh_in_lru(cpu, NULL)) {
INIT_WORK(work, lru_add_drain_per_cpu);
queue_work_on(cpu, mm_percpu_wq, work);
__cpumask_set_cpu(cpu, &has_work);
--
2.31.0.rc2.261.g7f71774620-goog
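
(Putting the pieces together for anyone skimming the thread: patch 2/3
has callers bracket the migration with the new disable/enable pair, and
the hunks above keep the per-CPU bh LRUs empty while it is held. A rough
sketch of the caller pattern follows, where pagelist and mtc stand in
for the caller's own state, cf. __alloc_contig_migrate_range:)

	lru_cache_disable();	/* drain LRU pagevecs and bh LRUs, keep them off */
	ret = migrate_pages(&pagelist, alloc_migration_target, NULL,
			    (unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE);
	if (ret)
		putback_movable_pages(&pagelist);	/* return leftovers to LRU */
	lru_cache_enable();	/* per-CPU caching resumes */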


2021-03-18 00:15:12

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] mm: disable LRU pagevec during the migration temporarily

On Wed, 10 Mar 2021 08:14:28 -0800 Minchan Kim <[email protected]> wrote:

> LRU pagevec holds refcount of pages until the pagevec are drained.
> It could prevent migration since the refcount of the page is greater
> than the expectation in migration logic. To mitigate the issue,
> callers of migrate_pages drains LRU pagevec via migrate_prep or
> lru_add_drain_all before migrate_pages call.
>
> However, it's not enough because pages coming into pagevec after the
> draining call still could stay at the pagevec so it could keep
> preventing page migration. Since some callers of migrate_pages have
> retrial logic with LRU draining, the page would migrate at the next trial
> but it is still fragile in that it doesn't close the fundamental race
> between upcoming LRU pages into pagevec and migration so the migration
> failure could cause contiguous memory allocation failure in the end.
>
> To close the race, this patch disables lru caches(i.e, pagevec)
> during ongoing migration until migrate is done.
>
> Since it's really hard to reproduce, I measured how many times
> migrate_pages retried with force mode(it is about a fallback to a
> sync migration) with below debug code.
>
> int migrate_pages(struct list_head *from, new_page_t get_new_page,
> ..
> ..
>
> if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
> printk(KERN_ERR, "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
> dump_page(page, "fail to migrate");
> }
>
> The test was repeating android apps launching with cma allocation
> in background every five seconds. Total cma allocation count was
> about 500 during the testing. With this patch, the dump_page count
> was reduced from 400 to 30.
>
> The new interface is also useful for memory hotplug which currently
> drains lru pcp caches after each migration failure. This is rather
> suboptimal as it has to disrupt others running during the operation.
> With the new interface the operation happens only once. This is also in
> line with pcp allocator cache which are disabled for the offlining as
> well.
>

This is really a rather ugly thing, particularly from a maintainability
point of view. Are you sure you found all the sites which need the
enable/disable? How do we prevent new ones from creeping in which need
the same treatment? Is there some way of adding a runtime check which
will trip if a conversion was missed?

> ...
>
> +bool lru_cache_disabled(void)
> +{
> + return atomic_read(&lru_disable_count);
> +}
> +
> +void lru_cache_enable(void)
> +{
> + atomic_dec(&lru_disable_count);
> +}
> +
> +/*
> + * lru_cache_disable() needs to be called before we start compiling
> + * a list of pages to be migrated using isolate_lru_page().
> + * It drains pages on LRU cache and then disable on all cpus until
> + * lru_cache_enable is called.
> + *
> + * Must be paired with a call to lru_cache_enable().
> + */
> +void lru_cache_disable(void)
> +{
> + atomic_inc(&lru_disable_count);
> +#ifdef CONFIG_SMP
> + /*
> + * lru_add_drain_all in the force mode will schedule draining on
> + * all online CPUs so any calls of lru_cache_disabled wrapped by
> + * local_lock or preemption disabled would be ordered by that.
> + * The atomic operation doesn't need to have stronger ordering
> + * requirements because that is enforeced by the scheduling
> + * guarantees.
> + */
> + __lru_add_drain_all(true);
> +#else
> + lru_add_drain();
> +#endif
> +}

I guess at least the first two of these functions should be inlined.
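
(One possible shape of that, assuming the counter is exposed in a header
such as include/linux/swap.h; illustrative only, the exact placement is
up to the author:)

extern atomic_t lru_disable_count;

static inline bool lru_cache_disabled(void)
{
	return atomic_read(&lru_disable_count) != 0;
}

static inline void lru_cache_enable(void)
{
	atomic_dec(&lru_disable_count);
}

/* lru_cache_disable() stays out of line since it has to drain all CPUs. */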

2021-03-18 01:16:16

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] mm: disable LRU pagevec during the migration temporarily

On Wed, Mar 17, 2021 at 05:13:16PM -0700, Andrew Morton wrote:
> On Wed, 10 Mar 2021 08:14:28 -0800 Minchan Kim <[email protected]> wrote:
>
> > LRU pagevec holds refcount of pages until the pagevec are drained.
> > It could prevent migration since the refcount of the page is greater
> > than the expectation in migration logic. To mitigate the issue,
> > callers of migrate_pages drains LRU pagevec via migrate_prep or
> > lru_add_drain_all before migrate_pages call.
> >
> > However, it's not enough because pages coming into pagevec after the
> > draining call still could stay at the pagevec so it could keep
> > preventing page migration. Since some callers of migrate_pages have
> > retrial logic with LRU draining, the page would migrate at the next trial
> > but it is still fragile in that it doesn't close the fundamental race
> > between upcoming LRU pages into pagevec and migration so the migration
> > failure could cause contiguous memory allocation failure in the end.
> >
> > To close the race, this patch disables lru caches(i.e, pagevec)
> > during ongoing migration until migrate is done.
> >
> > Since it's really hard to reproduce, I measured how many times
> > migrate_pages retried with force mode(it is about a fallback to a
> > sync migration) with below debug code.
> >
> > int migrate_pages(struct list_head *from, new_page_t get_new_page,
> > ..
> > ..
> >
> > if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
> > printk(KERN_ERR, "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
> > dump_page(page, "fail to migrate");
> > }
> >
> > The test was repeating android apps launching with cma allocation
> > in background every five seconds. Total cma allocation count was
> > about 500 during the testing. With this patch, the dump_page count
> > was reduced from 400 to 30.
> >
> > The new interface is also useful for memory hotplug which currently
> > drains lru pcp caches after each migration failure. This is rather
> > suboptimal as it has to disrupt others running during the operation.
> > With the new interface the operation happens only once. This is also in
> > line with pcp allocator cache which are disabled for the offlining as
> > well.
> >
>
> This is really a rather ugly thing, particularly from a maintainability
> point of view. Are you sure you found all the sites which need the

If you meant the maintainability concern as "need a pair but might miss
one", we have lots of examples of such APIs (zone_pcp_disable,
inc_tlb_flush, kmap_atomic and so on), so I don't think you meant that.

If you meant how a user could decide whether they should use
lru_add_drain_all or the lru_cache_disable/enable pair, we had already
carried that concept with migrate_prep. IOW, if someone wants to
increase the migration success ratio at the cost of draining overhead,
they could use lru_cache_disable instead of lru_add_drain_all.

Personally, I preferred migrate_prep/finish since it could include
other stuff (e.g., zone_pcp_disable) as well as lru_cache_disable,
but reviewers didn't like the wrapper.

Your comment made me realize that during the transition from v2 to v3,
I missed a site, which was the most important site for me. :(

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f1f0ee08628f..39775c8f8c90 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8470,7 +8470,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
.gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
};

- lru_add_drain_all();
+ lru_cache_disable();

while (pfn < end || !list_empty(&cc->migratepages)) {
if (fatal_signal_pending(current)) {
@@ -8498,6 +8498,9 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
ret = migrate_pages(&cc->migratepages, alloc_migration_target,
NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
}
+
+ lru_cache_enable();
+
if (ret < 0) {
putback_movable_pages(&cc->migratepages);
return ret;

However, it was just my mistake during patch stacking, not a problem
with the semantics.

Do you still see any concern? Otherwise, I will submit the fix again.

> enable/disable? How do we prevent new ones from creeping in which need

> the same treatment? Is there some way of adding a runtime check which
> will trip if a conversion was missed?

Are you concerned about losing the pair, or about places where we
should use lru_cache_disable rather than lru_add_drain_all?
As I mentioned, I just replaced all of the migrate_prep sites with
lru_cache_disable, except for the mistake above.
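
(If the ask is a runtime check, one conceivable shape, purely
illustrative and not part of this series, would be a helper that trips
when a contig-range migration runs while the per-CPU LRU caches are
still live:)

static inline void check_lru_cache_disabled(int reason)
{
	/* trip if a migrate_prep-style conversion was missed */
	VM_WARN_ON_ONCE(reason == MR_CONTIG_RANGE && !lru_cache_disabled());
}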

>
> > ...
> >
> > +bool lru_cache_disabled(void)
> > +{
> > + return atomic_read(&lru_disable_count);
> > +}
> > +
> > +void lru_cache_enable(void)
> > +{
> > + atomic_dec(&lru_disable_count);
> > +}
> > +
> > +/*
> > + * lru_cache_disable() needs to be called before we start compiling
> > + * a list of pages to be migrated using isolate_lru_page().
> > + * It drains pages on LRU cache and then disable on all cpus until
> > + * lru_cache_enable is called.
> > + *
> > + * Must be paired with a call to lru_cache_enable().
> > + */
> > +void lru_cache_disable(void)
> > +{
> > + atomic_inc(&lru_disable_count);
> > +#ifdef CONFIG_SMP
> > + /*
> > + * lru_add_drain_all in the force mode will schedule draining on
> > + * all online CPUs so any calls of lru_cache_disabled wrapped by
> > + * local_lock or preemption disabled would be ordered by that.
> > + * The atomic operation doesn't need to have stronger ordering
> > + * requirements because that is enforeced by the scheduling
> > + * guarantees.
> > + */
> > + __lru_add_drain_all(true);
> > +#else
> > + lru_add_drain();
> > +#endif
> > +}
>
> I guess at least the first two of these functions should be inlined.

Sure. Let me respin with a fix for the missing piece above once we get
some direction.

2021-03-18 08:11:06

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] mm: disable LRU pagevec during the migration temporarily

On Wed 17-03-21 17:13:16, Andrew Morton wrote:
> On Wed, 10 Mar 2021 08:14:28 -0800 Minchan Kim <[email protected]> wrote:
>
> > LRU pagevec holds refcount of pages until the pagevec are drained.
> > It could prevent migration since the refcount of the page is greater
> > than the expectation in migration logic. To mitigate the issue,
> > callers of migrate_pages drains LRU pagevec via migrate_prep or
> > lru_add_drain_all before migrate_pages call.
> >
> > However, it's not enough because pages coming into pagevec after the
> > draining call still could stay at the pagevec so it could keep
> > preventing page migration. Since some callers of migrate_pages have
> > retrial logic with LRU draining, the page would migrate at the next trial
> > but it is still fragile in that it doesn't close the fundamental race
> > between upcoming LRU pages into pagevec and migration so the migration
> > failure could cause contiguous memory allocation failure in the end.
> >
> > To close the race, this patch disables lru caches(i.e, pagevec)
> > during ongoing migration until migrate is done.
> >
> > Since it's really hard to reproduce, I measured how many times
> > migrate_pages retried with force mode(it is about a fallback to a
> > sync migration) with below debug code.
> >
> > int migrate_pages(struct list_head *from, new_page_t get_new_page,
> > ..
> > ..
> >
> > if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
> > printk(KERN_ERR, "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
> > dump_page(page, "fail to migrate");
> > }
> >
> > The test was repeating android apps launching with cma allocation
> > in background every five seconds. Total cma allocation count was
> > about 500 during the testing. With this patch, the dump_page count
> > was reduced from 400 to 30.
> >
> > The new interface is also useful for memory hotplug which currently
> > drains lru pcp caches after each migration failure. This is rather
> > suboptimal as it has to disrupt others running during the operation.
> > With the new interface the operation happens only once. This is also in
> > line with pcp allocator cache which are disabled for the offlining as
> > well.
> >
>
> This is really a rather ugly thing, particularly from a maintainability
> point of view. Are you sure you found all the sites which need the
> enable/disable? How do we prevent new ones from creeping in which need
> the same treatment? Is there some way of adding a runtime check which
> will trip if a conversion was missed?

I am not sure I am following. What is your concern here? This is a
lock-like interface to disable a certain optimization because it stands
in the way. Not using the interface is not a correctness problem.

If you refer to the disable/enable interface and one potentially missing
enable for some reason, then again this will not become a correctness
problem. It will result in suboptimal behavior. So in the end this is
much less of a problem than leaving a lock behind.

The functionality is not exported to modules and I would agree that this
is not something to be used by code outside of core/MM. We can hide it
in an internal mm header if you want?
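
(Concretely, that could be as small as moving the declarations into
mm/internal.h; a sketch, not a requirement:)

/* mm/internal.h: keep the lru_cache_disable() interface private to core MM */
extern void lru_cache_disable(void);
extern void lru_cache_enable(void);
extern bool lru_cache_disabled(void);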
--
Michal Hocko
SUSE Labs

2021-03-19 14:11:06

by kernel test robot

[permalink] [raw]
Subject: Re: [mm] 8fd8d23ab1: WARNING:at_fs/buffer.c:#__brelse

Hi Minchan,

On Wed, Mar 17, 2021 at 09:29:38AM -0700, Minchan Kim wrote:
> On Wed, Mar 17, 2021 at 10:37:57AM +0800, kernel test robot wrote:
> >
> >
> > Greeting,
> >
> > FYI, we noticed the following commit (built with gcc-9):
> >
> > commit: 8fd8d23ab10cc2fceeac25ea7b0e2eaf98e78d64 ("[PATCH v3 3/3] mm: fs: Invalidate BH LRU during page migration")
> > url: https://github.com/0day-ci/linux/commits/Minchan-Kim/mm-replace-migrate_prep-with-lru_add_drain_all/20210311-001714
> > base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 144c79ef33536b4ecb4951e07dbc1f2b7fa99d32
> >
> > in testcase: blktests
> > version: blktests-x86_64-a210761-1_20210124
> > with following parameters:
> >
> > test: nbd-group-01
> > ucode: 0xe2
> >
> >
> >
> > on test machine: 4 threads Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz with 32G memory
> >
> > caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
> >
> >
> >
> > If you fix the issue, kindly add following tag
> > Reported-by: kernel test robot <[email protected]>
> >
> >
> > [ 40.465061] WARNING: CPU: 2 PID: 5207 at fs/buffer.c:1177 __brelse (kbuild/src/consumer/fs/buffer.c:1177 kbuild/src/consumer/fs/buffer.c:1171)
> > [ 40.465066] Modules linked in: nbd loop xfs libcrc32c dm_multipath dm_mod ipmi_devintf ipmi_msghandler sd_mod t10_pi sg intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel rapl i915 mei_wdt intel_cstate wmi_bmof intel_gtt drm_kms_helper syscopyarea ahci intel_uncore sysfillrect sysimgblt libahci fb_sys_fops drm libata mei_me mei intel_pch_thermal wmi video intel_pmc_core acpi_pad ip_tables
> > [ 40.465086] CPU: 2 PID: 5207 Comm: mount_clear_soc Tainted: G I 5.12.0-rc2-00062-g8fd8d23ab10c #1
> > [ 40.465088] Hardware name: Dell Inc. OptiPlex 7040/0Y7WYT, BIOS 1.1.1 10/07/2015
> > [ 40.465089] RIP: 0010:__brelse (kbuild/src/consumer/fs/buffer.c:1177 kbuild/src/consumer/fs/buffer.c:1171)
> > [ 40.465091] Code: 00 00 00 00 00 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 8b 47 60 85 c0 74 05 f0 ff 4f 60 c3 48 c7 c7 d8 99 57 82 e8 02 5d 80 00 <0f> 0b c3 0f 1f 44 00 00 55 65 ff 05 13 79 c8 7e 53 48 c7 c3 c0 89
>
> Hi,
>
> Unfortunately, I couldn't set up the lkp test on my local machine
> since installation failed (I guess my Linux distribution is a
> very minor one).
>
> Do you mind testing this patch? (Please replace the original
> patch with this one)

By replacing the original patch with the one below, we confirmed the issue is fixed. Thanks

>
> From 23cfe5a8e939e2c077223e009887af8a0b5d6381 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <[email protected]>
> Date: Tue, 2 Mar 2021 12:05:08 -0800
> Subject: [PATCH] mm: fs: Invalidate BH LRU during page migration
>
> Pages containing buffer_heads that are in one of the per-CPU
> buffer_head LRU caches will be pinned and thus cannot be migrated.
> This can prevent CMA allocations from succeeding, which are often used
> on platforms with co-processors (such as a DSP) that can only use
> physically contiguous memory. It can also prevent memory
> hot-unplugging from succeeding, which involves migrating at least
> MIN_MEMORY_BLOCK_SIZE bytes of memory, which ranges from 8 MiB to 1
> GiB based on the architecture in use.
>
> Correspondingly, invalidate the BH LRU caches before a migration
> starts and stop any buffer_head from being cached in the LRU caches,
> until migration has finished.
>
> Signed-off-by: Chris Goldsworthy <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
> fs/buffer.c | 36 ++++++++++++++++++++++++++++++------
> include/linux/buffer_head.h | 4 ++++
> mm/swap.c | 5 ++++-
> 3 files changed, 38 insertions(+), 7 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 0cb7ffd4977c..e9872d0dcbf1 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1264,6 +1264,15 @@ static void bh_lru_install(struct buffer_head *bh)
> int i;
>
> check_irqs_on();
> + /*
> + * the refcount of buffer_head in bh_lru prevents dropping the
> + * attached page(i.e., try_to_free_buffers) so it could cause
> + * failing page migration.
> + * Skip putting upcoming bh into bh_lru until migration is done.
> + */
> + if (lru_cache_disabled())
> + return;
> +
> bh_lru_lock();
>
> b = this_cpu_ptr(&bh_lrus);
> @@ -1404,6 +1413,15 @@ __bread_gfp(struct block_device *bdev, sector_t block,
> }
> EXPORT_SYMBOL(__bread_gfp);
>
> +static void __invalidate_bh_lrus(struct bh_lru *b)
> +{
> + int i;
> +
> + for (i = 0; i < BH_LRU_SIZE; i++) {
> + brelse(b->bhs[i]);
> + b->bhs[i] = NULL;
> + }
> +}
> /*
> * invalidate_bh_lrus() is called rarely - but not only at unmount.
> * This doesn't race because it runs in each cpu either in irq
> @@ -1412,16 +1430,12 @@ EXPORT_SYMBOL(__bread_gfp);
> static void invalidate_bh_lru(void *arg)
> {
> struct bh_lru *b = &get_cpu_var(bh_lrus);
> - int i;
>
> - for (i = 0; i < BH_LRU_SIZE; i++) {
> - brelse(b->bhs[i]);
> - b->bhs[i] = NULL;
> - }
> + __invalidate_bh_lrus(b);
> put_cpu_var(bh_lrus);
> }
>
> -static bool has_bh_in_lru(int cpu, void *dummy)
> +bool has_bh_in_lru(int cpu, void *dummy)
> {
> struct bh_lru *b = per_cpu_ptr(&bh_lrus, cpu);
> int i;
> @@ -1440,6 +1454,16 @@ void invalidate_bh_lrus(void)
> }
> EXPORT_SYMBOL_GPL(invalidate_bh_lrus);
>
> +void invalidate_bh_lrus_cpu(int cpu)
> +{
> + struct bh_lru *b;
> +
> + bh_lru_lock();
> + b = per_cpu_ptr(&bh_lrus, cpu);
> + __invalidate_bh_lrus(b);
> + bh_lru_unlock();
> +}
> +
> void set_bh_page(struct buffer_head *bh,
> struct page *page, unsigned long offset)
> {
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index 6b47f94378c5..e7e99da31349 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -194,6 +194,8 @@ void __breadahead_gfp(struct block_device *, sector_t block, unsigned int size,
> struct buffer_head *__bread_gfp(struct block_device *,
> sector_t block, unsigned size, gfp_t gfp);
> void invalidate_bh_lrus(void);
> +void invalidate_bh_lrus_cpu(int cpu);
> +bool has_bh_in_lru(int cpu, void *dummy);
> struct buffer_head *alloc_buffer_head(gfp_t gfp_flags);
> void free_buffer_head(struct buffer_head * bh);
> void unlock_buffer(struct buffer_head *bh);
> @@ -406,6 +408,8 @@ static inline int inode_has_buffers(struct inode *inode) { return 0; }
> static inline void invalidate_inode_buffers(struct inode *inode) {}
> static inline int remove_inode_buffers(struct inode *inode) { return 1; }
> static inline int sync_mapping_buffers(struct address_space *mapping) { return 0; }
> +static inline void invalidate_bh_lrus_cpu(int cpu) {}
> +static inline bool has_bh_in_lru(int cpu, void *dummy) { return 0; }
> #define buffer_heads_over_limit 0
>
> #endif /* CONFIG_BLOCK */
> diff --git a/mm/swap.c b/mm/swap.c
> index fbdf6ac05aec..b962fe45bc02 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -36,6 +36,7 @@
> #include <linux/hugetlb.h>
> #include <linux/page_idle.h>
> #include <linux/local_lock.h>
> +#include <linux/buffer_head.h>
>
> #include "internal.h"
>
> @@ -641,6 +642,7 @@ void lru_add_drain_cpu(int cpu)
> pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
>
> activate_page_drain(cpu);
> + invalidate_bh_lrus_cpu(cpu);
> }
>
> /**
> @@ -828,7 +830,8 @@ static void __lru_add_drain_all(bool force_all_cpus)
> pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
> pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
> pagevec_count(&per_cpu(lru_pvecs.lru_lazyfree, cpu)) ||
> - need_activate_page_drain(cpu)) {
> + need_activate_page_drain(cpu) ||
> + has_bh_in_lru(cpu, NULL)) {
> INIT_WORK(work, lru_add_drain_per_cpu);
> queue_work_on(cpu, mm_percpu_wq, work);
> __cpumask_set_cpu(cpu, &has_work);
> --
> 2.31.0.rc2.261.g7f71774620-goog
>
>

2021-03-19 16:48:56

by Minchan Kim

[permalink] [raw]
Subject: Re: [mm] 8fd8d23ab1: WARNING:at_fs/buffer.c:#__brelse

On Fri, Mar 19, 2021 at 10:05:28PM +0800, Oliver Sang wrote:
> Hi Minchan,
>
> On Wed, Mar 17, 2021 at 09:29:38AM -0700, Minchan Kim wrote:
> > On Wed, Mar 17, 2021 at 10:37:57AM +0800, kernel test robot wrote:
> > >
> > >
> > > Greeting,
> > >
> > > FYI, we noticed the following commit (built with gcc-9):
> > >
> > > commit: 8fd8d23ab10cc2fceeac25ea7b0e2eaf98e78d64 ("[PATCH v3 3/3] mm: fs: Invalidate BH LRU during page migration")
> > > url: https://github.com/0day-ci/linux/commits/Minchan-Kim/mm-replace-migrate_prep-with-lru_add_drain_all/20210311-001714
> > > base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 144c79ef33536b4ecb4951e07dbc1f2b7fa99d32
> > >
> > > in testcase: blktests
> > > version: blktests-x86_64-a210761-1_20210124
> > > with following parameters:
> > >
> > > test: nbd-group-01
> > > ucode: 0xe2
> > >
> > >
> > >
> > > on test machine: 4 threads Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz with 32G memory
> > >
> > > caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
> > >
> > >
> > >
> > > If you fix the issue, kindly add following tag
> > > Reported-by: kernel test robot <[email protected]>
> > >
> > >
> > > [ 40.465061] WARNING: CPU: 2 PID: 5207 at fs/buffer.c:1177 __brelse (kbuild/src/consumer/fs/buffer.c:1177 kbuild/src/consumer/fs/buffer.c:1171)
> > > [ 40.465066] Modules linked in: nbd loop xfs libcrc32c dm_multipath dm_mod ipmi_devintf ipmi_msghandler sd_mod t10_pi sg intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel rapl i915 mei_wdt intel_cstate wmi_bmof intel_gtt drm_kms_helper syscopyarea ahci intel_uncore sysfillrect sysimgblt libahci fb_sys_fops drm libata mei_me mei intel_pch_thermal wmi video intel_pmc_core acpi_pad ip_tables
> > > [ 40.465086] CPU: 2 PID: 5207 Comm: mount_clear_soc Tainted: G I 5.12.0-rc2-00062-g8fd8d23ab10c #1
> > > [ 40.465088] Hardware name: Dell Inc. OptiPlex 7040/0Y7WYT, BIOS 1.1.1 10/07/2015
> > > [ 40.465089] RIP: 0010:__brelse (kbuild/src/consumer/fs/buffer.c:1177 kbuild/src/consumer/fs/buffer.c:1171)
> > > [ 40.465091] Code: 00 00 00 00 00 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 8b 47 60 85 c0 74 05 f0 ff 4f 60 c3 48 c7 c7 d8 99 57 82 e8 02 5d 80 00 <0f> 0b c3 0f 1f 44 00 00 55 65 ff 05 13 79 c8 7e 53 48 c7 c3 c0 89
> >
> > Hi,
> >
> > Unfortunately, I couldn't set up the lkp test on my local machine
> > since installation failed (I guess my Linux distribution is a
> > very minor one).
> >
> > Do you mind testing this patch? (Please replace the original
> > patch with this one)
>
> By replacing the original patch with the one below, we confirmed the issue is fixed. Thanks

Hi Oliver,

Let me resend a formal patch with the following tags:

Reported-by: kernel test robot <[email protected]>
Tested-by: Oliver Sang <[email protected]>

Thanks for the testing!