2023-02-13 12:35:20

by Huang, Ying

Subject: [PATCH -v5 0/9] migrate_pages(): batch TLB flushing

From: "Huang, Ying" <[email protected]>

Currently, migrate_pages() migrates folios one by one, as in the
following pseudo-code:

for each folio
  unmap
  flush TLB
  copy
  restore map

If multiple folios are passed to migrate_pages(), there are
opportunities to batch the TLB flushing and the copying. That is, we
can change the code to something like the following:

for each folio
  unmap
for each folio
  flush TLB
for each folio
  copy
for each folio
  restore map

This reduces the total number of TLB flushing IPIs considerably, and a
hardware accelerator such as DSA may be used to accelerate the folio
copying.

So in this patchset, we refactor the migrate_pages() implementation
and implement batched TLB flushing. Based on this, hardware
accelerated folio copying can be implemented.

If too many folios are passed to migrate_pages(), a naive batched
implementation may unmap too many folios at the same time. That
increases the chance that a task has to wait for the migrated folios
to be mapped again, which hurts latency. To deal with this issue, the
maximum number of folios unmapped in one batch is restricted to
HPAGE_PMD_NR base pages, so the influence is at the same level as
that of THP migration.
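
For illustration only, the batching in migrate_pages() can be sketched
roughly as below (simplified: retries, error handling and hugetlb
folios are ignored, and migrate_pages_batch() stands for the batching
helper introduced later in this series):

    while (!list_empty(from)) {
            LIST_HEAD(batch);
            int nr_pages = 0;
            struct folio *folio, *folio2;

            /* take folios until the batch would exceed HPAGE_PMD_NR base pages */
            list_for_each_entry_safe(folio, folio2, from, lru) {
                    nr_pages += folio_nr_pages(folio);
                    if (nr_pages > HPAGE_PMD_NR)
                            break;
                    list_move_tail(&folio->lru, &batch);
            }

            /* unmap all folios in the batch, flush TLB once, copy, restore maps */
            migrate_pages_batch(&batch, ...);
    }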

We use the following test to measure the performance impact of the
patchset:

On a 2-socket Intel server,

- Run pmbench memory accessing benchmark

- Run `migratepages` to migrate pages of pmbench between node 0 and
node 1 back and forth.

With the patchset, the number of TLB flushing IPIs is reduced by 99.1%
during the test, and the number of pages migrated successfully per
second increases by 291.7%.

Xin Hao helped to test the patchset on an ARM64 server with 128 cores
and 2 NUMA nodes. The test results show that the page migration
performance increases by up to 78%.

This patchset is based on mm-unstable 2023-02-10.

Changes:

v5:

- Added Xin's test results on ARM64. Thanks Xin!

- Fixed many comments. Thanks Zi and Xin!

- Collected reviewed-by and tested-by.

v4:

- Fixed another bug about non-LRU folio migration. Thanks Hyeonggon!

v3:

- Rebased on v6.2-rc4

- Fixed a bug about non-LRU folio migration. Thanks Mike!

- Fixed some comments. Thanks Baolin!

- Collected reviewed-by.

v2:

- Rebased on v6.2-rc3

- Fixed type force cast warning. Thanks Kees!

- Added more comments and cleaned up the code. Thanks Andrew, Zi, Alistair, Dan!

- Collected reviewed-by.

from rfc to v1:

- Rebased on v6.2-rc1

- Fixed the deadlock issue caused by locking multiple pages synchronously,
per Alistair's comments. Thanks!

- Fixed the autonumabench panic per Rao's comments and fix. Thanks!

- Other minor fixes per comments. Thanks!

Best Regards,
Huang, Ying


2023-02-13 12:35:26

by Huang, Ying

Subject: [PATCH -v5 1/9] migrate_pages: organize stats with struct migrate_pages_stats

Define struct migrate_pages_stats to organize the various statistics
in migrate_pages(). This makes it easier to collect and consume the
statistics in multiple functions. This will be needed in the
following patches in the series.

Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Reviewed-by: Xin Hao <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
---
mm/migrate.c | 60 +++++++++++++++++++++++++++++-----------------------
1 file changed, 34 insertions(+), 26 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 5b40b9040ba6..1a9cfcf857d2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1414,6 +1414,16 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
return rc;
}

+struct migrate_pages_stats {
+ int nr_succeeded; /* Normal and large folios migrated successfully, in
+ units of base pages */
+ int nr_failed_pages; /* Normal and large folios failed to be migrated, in
+ units of base pages. Untried folios aren't counted */
+ int nr_thp_succeeded; /* THP migrated successfully */
+ int nr_thp_failed; /* THP failed to be migrated */
+ int nr_thp_split; /* THP split before migrating */
+};
+
/*
* migrate_pages - migrate the folios specified in a list, to the free folios
* supplied as the target for the page migration
@@ -1448,13 +1458,8 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
int large_retry = 1;
int thp_retry = 1;
int nr_failed = 0;
- int nr_failed_pages = 0;
int nr_retry_pages = 0;
- int nr_succeeded = 0;
- int nr_thp_succeeded = 0;
int nr_large_failed = 0;
- int nr_thp_failed = 0;
- int nr_thp_split = 0;
int pass = 0;
bool is_large = false;
bool is_thp = false;
@@ -1464,9 +1469,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
LIST_HEAD(split_folios);
bool nosplit = (reason == MR_NUMA_MISPLACED);
bool no_split_folio_counting = false;
+ struct migrate_pages_stats stats;

trace_mm_migrate_pages_start(mode, reason);

+ memset(&stats, 0, sizeof(stats));
split_folio_migration:
for (pass = 0; pass < 10 && (retry || large_retry); pass++) {
retry = 0;
@@ -1520,9 +1527,9 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
/* Large folio migration is unsupported */
if (is_large) {
nr_large_failed++;
- nr_thp_failed += is_thp;
+ stats.nr_thp_failed += is_thp;
if (!try_split_folio(folio, &split_folios)) {
- nr_thp_split += is_thp;
+ stats.nr_thp_split += is_thp;
break;
}
/* Hugetlb migration is unsupported */
@@ -1530,7 +1537,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
nr_failed++;
}

- nr_failed_pages += nr_pages;
+ stats.nr_failed_pages += nr_pages;
list_move_tail(&folio->lru, &ret_folios);
break;
case -ENOMEM:
@@ -1540,13 +1547,13 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
*/
if (is_large) {
nr_large_failed++;
- nr_thp_failed += is_thp;
+ stats.nr_thp_failed += is_thp;
/* Large folio NUMA faulting doesn't split to retry. */
if (!nosplit) {
int ret = try_split_folio(folio, &split_folios);

if (!ret) {
- nr_thp_split += is_thp;
+ stats.nr_thp_split += is_thp;
break;
} else if (reason == MR_LONGTERM_PIN &&
ret == -EAGAIN) {
@@ -1564,7 +1571,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
nr_failed++;
}

- nr_failed_pages += nr_pages + nr_retry_pages;
+ stats.nr_failed_pages += nr_pages + nr_retry_pages;
/*
* There might be some split folios of fail-to-migrate large
* folios left in split_folios list. Move them back to migration
@@ -1574,7 +1581,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
list_splice_init(&split_folios, from);
/* nr_failed isn't updated for not used */
nr_large_failed += large_retry;
- nr_thp_failed += thp_retry;
+ stats.nr_thp_failed += thp_retry;
goto out;
case -EAGAIN:
if (is_large) {
@@ -1586,8 +1593,8 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
nr_retry_pages += nr_pages;
break;
case MIGRATEPAGE_SUCCESS:
- nr_succeeded += nr_pages;
- nr_thp_succeeded += is_thp;
+ stats.nr_succeeded += nr_pages;
+ stats.nr_thp_succeeded += is_thp;
break;
default:
/*
@@ -1598,20 +1605,20 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
*/
if (is_large) {
nr_large_failed++;
- nr_thp_failed += is_thp;
+ stats.nr_thp_failed += is_thp;
} else if (!no_split_folio_counting) {
nr_failed++;
}

- nr_failed_pages += nr_pages;
+ stats.nr_failed_pages += nr_pages;
break;
}
}
}
nr_failed += retry;
nr_large_failed += large_retry;
- nr_thp_failed += thp_retry;
- nr_failed_pages += nr_retry_pages;
+ stats.nr_thp_failed += thp_retry;
+ stats.nr_failed_pages += nr_retry_pages;
/*
* Try to migrate split folios of fail-to-migrate large folios, no
* nr_failed counting in this round, since all split folios of a
@@ -1644,16 +1651,17 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
if (list_empty(from))
rc = 0;

- count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
- count_vm_events(PGMIGRATE_FAIL, nr_failed_pages);
- count_vm_events(THP_MIGRATION_SUCCESS, nr_thp_succeeded);
- count_vm_events(THP_MIGRATION_FAIL, nr_thp_failed);
- count_vm_events(THP_MIGRATION_SPLIT, nr_thp_split);
- trace_mm_migrate_pages(nr_succeeded, nr_failed_pages, nr_thp_succeeded,
- nr_thp_failed, nr_thp_split, mode, reason);
+ count_vm_events(PGMIGRATE_SUCCESS, stats.nr_succeeded);
+ count_vm_events(PGMIGRATE_FAIL, stats.nr_failed_pages);
+ count_vm_events(THP_MIGRATION_SUCCESS, stats.nr_thp_succeeded);
+ count_vm_events(THP_MIGRATION_FAIL, stats.nr_thp_failed);
+ count_vm_events(THP_MIGRATION_SPLIT, stats.nr_thp_split);
+ trace_mm_migrate_pages(stats.nr_succeeded, stats.nr_failed_pages,
+ stats.nr_thp_succeeded, stats.nr_thp_failed,
+ stats.nr_thp_split, mode, reason);

if (ret_succeeded)
- *ret_succeeded = nr_succeeded;
+ *ret_succeeded = stats.nr_succeeded;

return rc;
}
--
2.35.1


2023-02-13 12:35:31

by Huang, Ying

Subject: [PATCH -v5 2/9] migrate_pages: separate hugetlb folios migration

This is a preparation patch to batch the folio unmapping and moving
for the non-hugetlb folios. Based on that we can batch the TLB
shootdown during the folio migration and make it possible to use some
hardware accelerator for the folio copying.

In this patch, the migration of hugetlb folios and that of non-hugetlb
folios are separated in migrate_pages(), to make it easy to change the
implementation of non-hugetlb folio migration.

Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Reviewed-by: Xin Hao <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
---
mm/migrate.c | 141 +++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 119 insertions(+), 22 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 1a9cfcf857d2..586a32bdaa71 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1414,6 +1414,8 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
return rc;
}

+#define NR_MAX_MIGRATE_PAGES_RETRY 10
+
struct migrate_pages_stats {
int nr_succeeded; /* Normal and large folios migrated successfully, in
units of base pages */
@@ -1424,6 +1426,95 @@ struct migrate_pages_stats {
int nr_thp_split; /* THP split before migrating */
};

+/*
+ * Returns the number of hugetlb folios that were not migrated, or an error code
+ * after NR_MAX_MIGRATE_PAGES_RETRY attempts or if no hugetlb folios are movable
+ * any more because the list has become empty or no retryable hugetlb folios
+ * exist any more. It is caller's responsibility to call putback_movable_pages()
+ * only if ret != 0.
+ */
+static int migrate_hugetlbs(struct list_head *from, new_page_t get_new_page,
+ free_page_t put_new_page, unsigned long private,
+ enum migrate_mode mode, int reason,
+ struct migrate_pages_stats *stats,
+ struct list_head *ret_folios)
+{
+ int retry = 1;
+ int nr_failed = 0;
+ int nr_retry_pages = 0;
+ int pass = 0;
+ struct folio *folio, *folio2;
+ int rc, nr_pages;
+
+ for (pass = 0; pass < NR_MAX_MIGRATE_PAGES_RETRY && retry; pass++) {
+ retry = 0;
+ nr_retry_pages = 0;
+
+ list_for_each_entry_safe(folio, folio2, from, lru) {
+ if (!folio_test_hugetlb(folio))
+ continue;
+
+ nr_pages = folio_nr_pages(folio);
+
+ cond_resched();
+
+ rc = unmap_and_move_huge_page(get_new_page,
+ put_new_page, private,
+ &folio->page, pass > 2, mode,
+ reason, ret_folios);
+ /*
+ * The rules are:
+ * Success: hugetlb folio will be put back
+ * -EAGAIN: stay on the from list
+ * -ENOMEM: stay on the from list
+ * -ENOSYS: stay on the from list
+ * Other errno: put on ret_folios list
+ */
+ switch(rc) {
+ case -ENOSYS:
+ /* Hugetlb migration is unsupported */
+ nr_failed++;
+ stats->nr_failed_pages += nr_pages;
+ list_move_tail(&folio->lru, ret_folios);
+ break;
+ case -ENOMEM:
+ /*
+ * When memory is low, don't bother to try to migrate
+ * other folios, just exit.
+ */
+ stats->nr_failed_pages += nr_pages + nr_retry_pages;
+ return -ENOMEM;
+ case -EAGAIN:
+ retry++;
+ nr_retry_pages += nr_pages;
+ break;
+ case MIGRATEPAGE_SUCCESS:
+ stats->nr_succeeded += nr_pages;
+ break;
+ default:
+ /*
+ * Permanent failure (-EBUSY, etc.):
+ * unlike -EAGAIN case, the failed folio is
+ * removed from migration folio list and not
+ * retried in the next outer loop.
+ */
+ nr_failed++;
+ stats->nr_failed_pages += nr_pages;
+ break;
+ }
+ }
+ }
+ /*
+ * nr_failed is number of hugetlb folios failed to be migrated. After
+ * NR_MAX_MIGRATE_PAGES_RETRY attempts, give up and count retried hugetlb
+ * folios as failed.
+ */
+ nr_failed += retry;
+ stats->nr_failed_pages += nr_retry_pages;
+
+ return nr_failed;
+}
+
/*
* migrate_pages - migrate the folios specified in a list, to the free folios
* supplied as the target for the page migration
@@ -1440,10 +1531,10 @@ struct migrate_pages_stats {
* @ret_succeeded: Set to the number of folios migrated successfully if
* the caller passes a non-NULL pointer.
*
- * The function returns after 10 attempts or if no folios are movable any more
- * because the list has become empty or no retryable folios exist any more.
- * It is caller's responsibility to call putback_movable_pages() to return folios
- * to the LRU or free list only if ret != 0.
+ * The function returns after NR_MAX_MIGRATE_PAGES_RETRY attempts or if no folios
+ * are movable any more because the list has become empty or no retryable folios
+ * exist any more. It is caller's responsibility to call putback_movable_pages()
+ * only if ret != 0.
*
* Returns the number of {normal folio, large folio, hugetlb} that were not
* migrated, or an error code. The number of large folio splits will be
@@ -1457,7 +1548,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
int retry = 1;
int large_retry = 1;
int thp_retry = 1;
- int nr_failed = 0;
+ int nr_failed;
int nr_retry_pages = 0;
int nr_large_failed = 0;
int pass = 0;
@@ -1474,38 +1565,45 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
trace_mm_migrate_pages_start(mode, reason);

memset(&stats, 0, sizeof(stats));
+ rc = migrate_hugetlbs(from, get_new_page, put_new_page, private, mode, reason,
+ &stats, &ret_folios);
+ if (rc < 0)
+ goto out;
+ nr_failed = rc;
+
split_folio_migration:
- for (pass = 0; pass < 10 && (retry || large_retry); pass++) {
+ for (pass = 0;
+ pass < NR_MAX_MIGRATE_PAGES_RETRY && (retry || large_retry);
+ pass++) {
retry = 0;
large_retry = 0;
thp_retry = 0;
nr_retry_pages = 0;

list_for_each_entry_safe(folio, folio2, from, lru) {
+ /* Retried hugetlb folios will be kept in list */
+ if (folio_test_hugetlb(folio)) {
+ list_move_tail(&folio->lru, &ret_folios);
+ continue;
+ }
+
/*
* Large folio statistics is based on the source large
* folio. Capture required information that might get
* lost during migration.
*/
- is_large = folio_test_large(folio) && !folio_test_hugetlb(folio);
+ is_large = folio_test_large(folio);
is_thp = is_large && folio_test_pmd_mappable(folio);
nr_pages = folio_nr_pages(folio);
+
cond_resched();

- if (folio_test_hugetlb(folio))
- rc = unmap_and_move_huge_page(get_new_page,
- put_new_page, private,
- &folio->page, pass > 2, mode,
- reason,
- &ret_folios);
- else
- rc = unmap_and_move(get_new_page, put_new_page,
- private, folio, pass > 2, mode,
- reason, &ret_folios);
+ rc = unmap_and_move(get_new_page, put_new_page,
+ private, folio, pass > 2, mode,
+ reason, &ret_folios);
/*
* The rules are:
- * Success: non hugetlb folio will be freed, hugetlb
- * folio will be put back
+ * Success: folio will be freed
* -EAGAIN: stay on the from list
* -ENOMEM: stay on the from list
* -ENOSYS: stay on the from list
@@ -1532,7 +1630,6 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
stats.nr_thp_split += is_thp;
break;
}
- /* Hugetlb migration is unsupported */
} else if (!no_split_folio_counting) {
nr_failed++;
}
@@ -1626,8 +1723,8 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
*/
if (!list_empty(&split_folios)) {
/*
- * Move non-migrated folios (after 10 retries) to ret_folios
- * to avoid migrating them again.
+ * Move non-migrated folios (after NR_MAX_MIGRATE_PAGES_RETRY
+ * retries) to ret_folios to avoid migrating them again.
*/
list_splice_init(from, &ret_folios);
list_splice_init(&split_folios, from);
--
2.35.1


2023-02-13 12:35:39

by Huang, Ying

Subject: [PATCH -v5 3/9] migrate_pages: restrict number of pages to migrate in batch

This is a preparation patch to batch the folio unmapping and moving
for non-hugetlb folios.

Once the folio unmapping is batched, all folios to be migrated will be
unmapped before their contents and flags are copied. If the folios
passed to migrate_pages() cover too many pages, the affected processes
could be stopped for too long a time, resulting in long latency. For
example, the migrate_pages() syscall will call migrate_pages() with
all folios of a process. To avoid this possible issue, in this patch,
we restrict the number of pages to be migrated in one batch to no more
than HPAGE_PMD_NR. That is, the influence is at the same level as that
of THP migration.

Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Xin Hao <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
---
mm/migrate.c | 174 +++++++++++++++++++++++++++++++--------------------
1 file changed, 106 insertions(+), 68 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 586a32bdaa71..d436f35fa145 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1414,6 +1414,11 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
return rc;
}

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define NR_MAX_BATCHED_MIGRATION HPAGE_PMD_NR
+#else
+#define NR_MAX_BATCHED_MIGRATION 512
+#endif
#define NR_MAX_MIGRATE_PAGES_RETRY 10

struct migrate_pages_stats {
@@ -1515,40 +1520,15 @@ static int migrate_hugetlbs(struct list_head *from, new_page_t get_new_page,
return nr_failed;
}

-/*
- * migrate_pages - migrate the folios specified in a list, to the free folios
- * supplied as the target for the page migration
- *
- * @from: The list of folios to be migrated.
- * @get_new_page: The function used to allocate free folios to be used
- * as the target of the folio migration.
- * @put_new_page: The function used to free target folios if migration
- * fails, or NULL if no special handling is necessary.
- * @private: Private data to be passed on to get_new_page()
- * @mode: The migration mode that specifies the constraints for
- * folio migration, if any.
- * @reason: The reason for folio migration.
- * @ret_succeeded: Set to the number of folios migrated successfully if
- * the caller passes a non-NULL pointer.
- *
- * The function returns after NR_MAX_MIGRATE_PAGES_RETRY attempts or if no folios
- * are movable any more because the list has become empty or no retryable folios
- * exist any more. It is caller's responsibility to call putback_movable_pages()
- * only if ret != 0.
- *
- * Returns the number of {normal folio, large folio, hugetlb} that were not
- * migrated, or an error code. The number of large folio splits will be
- * considered as the number of non-migrated large folio, no matter how many
- * split folios of the large folio are migrated successfully.
- */
-int migrate_pages(struct list_head *from, new_page_t get_new_page,
+static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
free_page_t put_new_page, unsigned long private,
- enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
+ enum migrate_mode mode, int reason, struct list_head *ret_folios,
+ struct migrate_pages_stats *stats)
{
int retry = 1;
int large_retry = 1;
int thp_retry = 1;
- int nr_failed;
+ int nr_failed = 0;
int nr_retry_pages = 0;
int nr_large_failed = 0;
int pass = 0;
@@ -1556,20 +1536,9 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
bool is_thp = false;
struct folio *folio, *folio2;
int rc, nr_pages;
- LIST_HEAD(ret_folios);
LIST_HEAD(split_folios);
bool nosplit = (reason == MR_NUMA_MISPLACED);
bool no_split_folio_counting = false;
- struct migrate_pages_stats stats;
-
- trace_mm_migrate_pages_start(mode, reason);
-
- memset(&stats, 0, sizeof(stats));
- rc = migrate_hugetlbs(from, get_new_page, put_new_page, private, mode, reason,
- &stats, &ret_folios);
- if (rc < 0)
- goto out;
- nr_failed = rc;

split_folio_migration:
for (pass = 0;
@@ -1581,12 +1550,6 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
nr_retry_pages = 0;

list_for_each_entry_safe(folio, folio2, from, lru) {
- /* Retried hugetlb folios will be kept in list */
- if (folio_test_hugetlb(folio)) {
- list_move_tail(&folio->lru, &ret_folios);
- continue;
- }
-
/*
* Large folio statistics is based on the source large
* folio. Capture required information that might get
@@ -1600,15 +1563,14 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,

rc = unmap_and_move(get_new_page, put_new_page,
private, folio, pass > 2, mode,
- reason, &ret_folios);
+ reason, ret_folios);
/*
* The rules are:
* Success: folio will be freed
* -EAGAIN: stay on the from list
* -ENOMEM: stay on the from list
* -ENOSYS: stay on the from list
- * Other errno: put on ret_folios list then splice to
- * from list
+ * Other errno: put on ret_folios list
*/
switch(rc) {
/*
@@ -1625,17 +1587,17 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
/* Large folio migration is unsupported */
if (is_large) {
nr_large_failed++;
- stats.nr_thp_failed += is_thp;
+ stats->nr_thp_failed += is_thp;
if (!try_split_folio(folio, &split_folios)) {
- stats.nr_thp_split += is_thp;
+ stats->nr_thp_split += is_thp;
break;
}
} else if (!no_split_folio_counting) {
nr_failed++;
}

- stats.nr_failed_pages += nr_pages;
- list_move_tail(&folio->lru, &ret_folios);
+ stats->nr_failed_pages += nr_pages;
+ list_move_tail(&folio->lru, ret_folios);
break;
case -ENOMEM:
/*
@@ -1644,13 +1606,13 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
*/
if (is_large) {
nr_large_failed++;
- stats.nr_thp_failed += is_thp;
+ stats->nr_thp_failed += is_thp;
/* Large folio NUMA faulting doesn't split to retry. */
if (!nosplit) {
int ret = try_split_folio(folio, &split_folios);

if (!ret) {
- stats.nr_thp_split += is_thp;
+ stats->nr_thp_split += is_thp;
break;
} else if (reason == MR_LONGTERM_PIN &&
ret == -EAGAIN) {
@@ -1668,17 +1630,17 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
nr_failed++;
}

- stats.nr_failed_pages += nr_pages + nr_retry_pages;
+ stats->nr_failed_pages += nr_pages + nr_retry_pages;
/*
* There might be some split folios of fail-to-migrate large
- * folios left in split_folios list. Move them back to migration
+ * folios left in split_folios list. Move them to ret_folios
* list so that they could be put back to the right list by
* the caller otherwise the folio refcnt will be leaked.
*/
- list_splice_init(&split_folios, from);
+ list_splice_init(&split_folios, ret_folios);
/* nr_failed isn't updated for not used */
nr_large_failed += large_retry;
- stats.nr_thp_failed += thp_retry;
+ stats->nr_thp_failed += thp_retry;
goto out;
case -EAGAIN:
if (is_large) {
@@ -1690,8 +1652,8 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
nr_retry_pages += nr_pages;
break;
case MIGRATEPAGE_SUCCESS:
- stats.nr_succeeded += nr_pages;
- stats.nr_thp_succeeded += is_thp;
+ stats->nr_succeeded += nr_pages;
+ stats->nr_thp_succeeded += is_thp;
break;
default:
/*
@@ -1702,20 +1664,20 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
*/
if (is_large) {
nr_large_failed++;
- stats.nr_thp_failed += is_thp;
+ stats->nr_thp_failed += is_thp;
} else if (!no_split_folio_counting) {
nr_failed++;
}

- stats.nr_failed_pages += nr_pages;
+ stats->nr_failed_pages += nr_pages;
break;
}
}
}
nr_failed += retry;
nr_large_failed += large_retry;
- stats.nr_thp_failed += thp_retry;
- stats.nr_failed_pages += nr_retry_pages;
+ stats->nr_thp_failed += thp_retry;
+ stats->nr_failed_pages += nr_retry_pages;
/*
* Try to migrate split folios of fail-to-migrate large folios, no
* nr_failed counting in this round, since all split folios of a
@@ -1726,7 +1688,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
* Move non-migrated folios (after NR_MAX_MIGRATE_PAGES_RETRY
* retries) to ret_folios to avoid migrating them again.
*/
- list_splice_init(from, &ret_folios);
+ list_splice_init(from, ret_folios);
list_splice_init(&split_folios, from);
no_split_folio_counting = true;
retry = 1;
@@ -1734,6 +1696,82 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
}

rc = nr_failed + nr_large_failed;
+out:
+ return rc;
+}
+
+/*
+ * migrate_pages - migrate the folios specified in a list, to the free folios
+ * supplied as the target for the page migration
+ *
+ * @from: The list of folios to be migrated.
+ * @get_new_page: The function used to allocate free folios to be used
+ * as the target of the folio migration.
+ * @put_new_page: The function used to free target folios if migration
+ * fails, or NULL if no special handling is necessary.
+ * @private: Private data to be passed on to get_new_page()
+ * @mode: The migration mode that specifies the constraints for
+ * folio migration, if any.
+ * @reason: The reason for folio migration.
+ * @ret_succeeded: Set to the number of folios migrated successfully if
+ * the caller passes a non-NULL pointer.
+ *
+ * The function returns after NR_MAX_MIGRATE_PAGES_RETRY attempts or if no folios
+ * are movable any more because the list has become empty or no retryable folios
+ * exist any more. It is caller's responsibility to call putback_movable_pages()
+ * only if ret != 0.
+ *
+ * Returns the number of {normal folio, large folio, hugetlb} that were not
+ * migrated, or an error code. The number of large folio splits will be
+ * considered as the number of non-migrated large folio, no matter how many
+ * split folios of the large folio are migrated successfully.
+ */
+int migrate_pages(struct list_head *from, new_page_t get_new_page,
+ free_page_t put_new_page, unsigned long private,
+ enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
+{
+ int rc, rc_gather;
+ int nr_pages;
+ struct folio *folio, *folio2;
+ LIST_HEAD(folios);
+ LIST_HEAD(ret_folios);
+ struct migrate_pages_stats stats;
+
+ trace_mm_migrate_pages_start(mode, reason);
+
+ memset(&stats, 0, sizeof(stats));
+
+ rc_gather = migrate_hugetlbs(from, get_new_page, put_new_page, private,
+ mode, reason, &stats, &ret_folios);
+ if (rc_gather < 0)
+ goto out;
+again:
+ nr_pages = 0;
+ list_for_each_entry_safe(folio, folio2, from, lru) {
+ /* Retried hugetlb folios will be kept in list */
+ if (folio_test_hugetlb(folio)) {
+ list_move_tail(&folio->lru, &ret_folios);
+ continue;
+ }
+
+ nr_pages += folio_nr_pages(folio);
+ if (nr_pages > NR_MAX_BATCHED_MIGRATION)
+ break;
+ }
+ if (nr_pages > NR_MAX_BATCHED_MIGRATION)
+ list_cut_before(&folios, from, &folio->lru);
+ else
+ list_splice_init(from, &folios);
+ rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
+ mode, reason, &ret_folios, &stats);
+ list_splice_tail_init(&folios, &ret_folios);
+ if (rc < 0) {
+ rc_gather = rc;
+ goto out;
+ }
+ rc_gather += rc;
+ if (!list_empty(from))
+ goto again;
out:
/*
* Put the permanent failure folio back to migration list, they
@@ -1746,7 +1784,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
* are migrated successfully.
*/
if (list_empty(from))
- rc = 0;
+ rc_gather = 0;

count_vm_events(PGMIGRATE_SUCCESS, stats.nr_succeeded);
count_vm_events(PGMIGRATE_FAIL, stats.nr_failed_pages);
@@ -1760,7 +1798,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
if (ret_succeeded)
*ret_succeeded = stats.nr_succeeded;

- return rc;
+ return rc_gather;
}

struct page *alloc_migration_target(struct page *page, unsigned long private)
--
2.35.1


2023-02-13 12:35:41

by Huang, Ying

Subject: [PATCH -v5 4/9] migrate_pages: split unmap_and_move() to _unmap() and _move()

This is a preparation patch to batch the folio unmapping and moving.

In this patch, unmap_and_move() is split into migrate_folio_unmap()
and migrate_folio_move(), so that _unmap() and _move() can be batched
in different loops later. To pass some information between the unmap
and move stages, the otherwise unused dst->mapping and dst->private
fields are used, as sketched below.
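
As a rough illustration of the idea (the actual helpers,
__migrate_folio_record() and __migrate_folio_extract(), are added in
the diff below), the state needed by the move stage is stashed in the
destination folio at the end of the unmap stage and read back at the
beginning of the move stage:

    /* end of the unmap stage: remember state in the unused dst fields */
    dst->mapping = (void *)anon_vma;
    dst->private = (void *)page_was_mapped;

    /* beginning of the move stage: read the state back and clear it */
    anon_vma = (void *)dst->mapping;
    page_was_mapped = (unsigned long)dst->private;
    dst->mapping = NULL;
    dst->private = NULL;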

Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Reviewed-by: Xin Hao <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
---
include/linux/migrate.h | 1 +
mm/migrate.c | 169 ++++++++++++++++++++++++++++++----------
2 files changed, 129 insertions(+), 41 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index bdff950a8bb4..c88b96b48be7 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -18,6 +18,7 @@ struct migration_target_control;
* - zero on page migration success;
*/
#define MIGRATEPAGE_SUCCESS 0
+#define MIGRATEPAGE_UNMAP 1

/**
* struct movable_operations - Driver page migration
diff --git a/mm/migrate.c b/mm/migrate.c
index d436f35fa145..5fd18a7cce62 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1027,11 +1027,53 @@ static int move_to_new_folio(struct folio *dst, struct folio *src,
return rc;
}

-static int __unmap_and_move(struct folio *src, struct folio *dst,
+/*
+ * To record some information during migration, we use some unused
+ * fields (mapping and private) of struct folio of the newly allocated
+ * destination folio. This is safe because nobody is using them
+ * except us.
+ */
+static void __migrate_folio_record(struct folio *dst,
+ unsigned long page_was_mapped,
+ struct anon_vma *anon_vma)
+{
+ dst->mapping = (void *)anon_vma;
+ dst->private = (void *)page_was_mapped;
+}
+
+static void __migrate_folio_extract(struct folio *dst,
+ int *page_was_mappedp,
+ struct anon_vma **anon_vmap)
+{
+ *anon_vmap = (void *)dst->mapping;
+ *page_was_mappedp = (unsigned long)dst->private;
+ dst->mapping = NULL;
+ dst->private = NULL;
+}
+
+/* Cleanup src folio upon migration success */
+static void migrate_folio_done(struct folio *src,
+ enum migrate_reason reason)
+{
+ /*
+ * Compaction can migrate also non-LRU pages which are
+ * not accounted to NR_ISOLATED_*. They can be recognized
+ * as __PageMovable
+ */
+ if (likely(!__folio_test_movable(src)))
+ mod_node_page_state(folio_pgdat(src), NR_ISOLATED_ANON +
+ folio_is_file_lru(src), -folio_nr_pages(src));
+
+ if (reason != MR_MEMORY_FAILURE)
+ /* We release the page in page_handle_poison. */
+ folio_put(src);
+}
+
+static int __migrate_folio_unmap(struct folio *src, struct folio *dst,
int force, enum migrate_mode mode)
{
int rc = -EAGAIN;
- bool page_was_mapped = false;
+ int page_was_mapped = 0;
struct anon_vma *anon_vma = NULL;
bool is_lru = !__PageMovable(&src->page);

@@ -1107,8 +1149,8 @@ static int __unmap_and_move(struct folio *src, struct folio *dst,
goto out_unlock;

if (unlikely(!is_lru)) {
- rc = move_to_new_folio(dst, src, mode);
- goto out_unlock_both;
+ __migrate_folio_record(dst, page_was_mapped, anon_vma);
+ return MIGRATEPAGE_UNMAP;
}

/*
@@ -1133,11 +1175,42 @@ static int __unmap_and_move(struct folio *src, struct folio *dst,
VM_BUG_ON_FOLIO(folio_test_anon(src) &&
!folio_test_ksm(src) && !anon_vma, src);
try_to_migrate(src, 0);
- page_was_mapped = true;
+ page_was_mapped = 1;
}

- if (!folio_mapped(src))
- rc = move_to_new_folio(dst, src, mode);
+ if (!folio_mapped(src)) {
+ __migrate_folio_record(dst, page_was_mapped, anon_vma);
+ return MIGRATEPAGE_UNMAP;
+ }
+
+ if (page_was_mapped)
+ remove_migration_ptes(src, src, false);
+
+out_unlock_both:
+ folio_unlock(dst);
+out_unlock:
+ /* Drop an anon_vma reference if we took one */
+ if (anon_vma)
+ put_anon_vma(anon_vma);
+ folio_unlock(src);
+out:
+
+ return rc;
+}
+
+static int __migrate_folio_move(struct folio *src, struct folio *dst,
+ enum migrate_mode mode)
+{
+ int rc;
+ int page_was_mapped = 0;
+ struct anon_vma *anon_vma = NULL;
+ bool is_lru = !__PageMovable(&src->page);
+
+ __migrate_folio_extract(dst, &page_was_mapped, &anon_vma);
+
+ rc = move_to_new_folio(dst, src, mode);
+ if (unlikely(!is_lru))
+ goto out_unlock_both;

/*
* When successful, push dst to LRU immediately: so that if it
@@ -1160,12 +1233,10 @@ static int __unmap_and_move(struct folio *src, struct folio *dst,

out_unlock_both:
folio_unlock(dst);
-out_unlock:
/* Drop an anon_vma reference if we took one */
if (anon_vma)
put_anon_vma(anon_vma);
folio_unlock(src);
-out:
/*
* If migration is successful, decrease refcount of dst,
* which will not free the page because new page owner increased
@@ -1177,19 +1248,15 @@ static int __unmap_and_move(struct folio *src, struct folio *dst,
return rc;
}

-/*
- * Obtain the lock on folio, remove all ptes and migrate the folio
- * to the newly allocated folio in dst.
- */
-static int unmap_and_move(new_page_t get_new_page,
- free_page_t put_new_page,
- unsigned long private, struct folio *src,
- int force, enum migrate_mode mode,
- enum migrate_reason reason,
- struct list_head *ret)
+/* Obtain the lock on page, remove all ptes. */
+static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page,
+ unsigned long private, struct folio *src,
+ struct folio **dstp, int force,
+ enum migrate_mode mode, enum migrate_reason reason,
+ struct list_head *ret)
{
struct folio *dst;
- int rc = MIGRATEPAGE_SUCCESS;
+ int rc = MIGRATEPAGE_UNMAP;
struct page *newpage = NULL;

if (!thp_migration_supported() && folio_test_transhuge(src))
@@ -1200,20 +1267,49 @@ static int unmap_and_move(new_page_t get_new_page,
folio_clear_active(src);
folio_clear_unevictable(src);
/* free_pages_prepare() will clear PG_isolated. */
- goto out;
+ list_del(&src->lru);
+ migrate_folio_done(src, reason);
+ return MIGRATEPAGE_SUCCESS;
}

newpage = get_new_page(&src->page, private);
if (!newpage)
return -ENOMEM;
dst = page_folio(newpage);
+ *dstp = dst;

dst->private = NULL;
- rc = __unmap_and_move(src, dst, force, mode);
+ rc = __migrate_folio_unmap(src, dst, force, mode);
+ if (rc == MIGRATEPAGE_UNMAP)
+ return rc;
+
+ /*
+ * A folio that has not been unmapped will be restored to
+ * right list unless we want to retry.
+ */
+ if (rc != -EAGAIN)
+ list_move_tail(&src->lru, ret);
+
+ if (put_new_page)
+ put_new_page(&dst->page, private);
+ else
+ folio_put(dst);
+
+ return rc;
+}
+
+/* Migrate the folio to the newly allocated folio in dst. */
+static int migrate_folio_move(free_page_t put_new_page, unsigned long private,
+ struct folio *src, struct folio *dst,
+ enum migrate_mode mode, enum migrate_reason reason,
+ struct list_head *ret)
+{
+ int rc;
+
+ rc = __migrate_folio_move(src, dst, mode);
if (rc == MIGRATEPAGE_SUCCESS)
set_page_owner_migrate_reason(&dst->page, reason);

-out:
if (rc != -EAGAIN) {
/*
* A folio that has been migrated has all references
@@ -1229,20 +1325,7 @@ static int unmap_and_move(new_page_t get_new_page,
* we want to retry.
*/
if (rc == MIGRATEPAGE_SUCCESS) {
- /*
- * Compaction can migrate also non-LRU folios which are
- * not accounted to NR_ISOLATED_*. They can be recognized
- * as __folio_test_movable
- */
- if (likely(!__folio_test_movable(src)))
- mod_node_page_state(folio_pgdat(src), NR_ISOLATED_ANON +
- folio_is_file_lru(src), -folio_nr_pages(src));
-
- if (reason != MR_MEMORY_FAILURE)
- /*
- * We release the folio in page_handle_poison.
- */
- folio_put(src);
+ migrate_folio_done(src, reason);
} else {
if (rc != -EAGAIN)
list_add_tail(&src->lru, ret);
@@ -1534,7 +1617,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
int pass = 0;
bool is_large = false;
bool is_thp = false;
- struct folio *folio, *folio2;
+ struct folio *folio, *folio2, *dst = NULL;
int rc, nr_pages;
LIST_HEAD(split_folios);
bool nosplit = (reason == MR_NUMA_MISPLACED);
@@ -1561,9 +1644,13 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,

cond_resched();

- rc = unmap_and_move(get_new_page, put_new_page,
- private, folio, pass > 2, mode,
- reason, ret_folios);
+ rc = migrate_folio_unmap(get_new_page, put_new_page, private,
+ folio, &dst, pass > 2, mode,
+ reason, ret_folios);
+ if (rc == MIGRATEPAGE_UNMAP)
+ rc = migrate_folio_move(put_new_page, private,
+ folio, dst, mode,
+ reason, ret_folios);
/*
* The rules are:
* Success: folio will be freed
--
2.35.1


2023-02-13 12:35:59

by Huang, Ying

[permalink] [raw]
Subject: [PATCH -v5 5/9] migrate_pages: batch _unmap and _move

In this patch, the _unmap and _move stages of the folio migration are
batched. That is, previously the flow was:

for each folio
  _unmap()
  _move()

Now, it is:

for each folio
  _unmap()
for each folio
  _move()

Based on this, we can batch the TLB flushing and use some hardware
accelerator to copy folios between the batched _unmap and batched
_move stages, roughly as sketched below.
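
Illustrative only (the batched TLB flush is added by later patches in
this series; hardware-assisted copying is a possible future extension,
not part of this patch):

for each folio
  _unmap()          /* defer the TLB flush instead of doing it now */
batched TLB flush
(optional) batched folio copy, possibly offloaded to an accelerator
for each folio
  _move()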

Signed-off-by: "Huang, Ying" <[email protected]>
Tested-by: Hyeonggon Yoo <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Xin Hao <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mike Kravetz <[email protected]>
---
mm/migrate.c | 214 +++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 189 insertions(+), 25 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 5fd18a7cce62..ee3e21f1061c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1051,6 +1051,33 @@ static void __migrate_folio_extract(struct folio *dst,
dst->private = NULL;
}

+/* Restore the source folio to the original state upon failure */
+static void migrate_folio_undo_src(struct folio *src,
+ int page_was_mapped,
+ struct anon_vma *anon_vma,
+ struct list_head *ret)
+{
+ if (page_was_mapped)
+ remove_migration_ptes(src, src, false);
+ /* Drop an anon_vma reference if we took one */
+ if (anon_vma)
+ put_anon_vma(anon_vma);
+ folio_unlock(src);
+ list_move_tail(&src->lru, ret);
+}
+
+/* Restore the destination folio to the original state upon failure */
+static void migrate_folio_undo_dst(struct folio *dst,
+ free_page_t put_new_page,
+ unsigned long private)
+{
+ folio_unlock(dst);
+ if (put_new_page)
+ put_new_page(&dst->page, private);
+ else
+ folio_put(dst);
+}
+
/* Cleanup src folio upon migration success */
static void migrate_folio_done(struct folio *src,
enum migrate_reason reason)
@@ -1069,8 +1096,8 @@ static void migrate_folio_done(struct folio *src,
folio_put(src);
}

-static int __migrate_folio_unmap(struct folio *src, struct folio *dst,
- int force, enum migrate_mode mode)
+static int __migrate_folio_unmap(struct folio *src, struct folio *dst, int force,
+ bool avoid_force_lock, enum migrate_mode mode)
{
int rc = -EAGAIN;
int page_was_mapped = 0;
@@ -1097,6 +1124,17 @@ static int __migrate_folio_unmap(struct folio *src, struct folio *dst,
if (current->flags & PF_MEMALLOC)
goto out;

+ /*
+ * We have locked some folios and are going to wait to lock
+ * this folio. To avoid a potential deadlock, let's bail
+ * out and not do that. The locked folios will be moved and
+ * unlocked, then we can wait to lock this folio.
+ */
+ if (avoid_force_lock) {
+ rc = -EDEADLOCK;
+ goto out;
+ }
+
folio_lock(src);
}

@@ -1205,10 +1243,20 @@ static int __migrate_folio_move(struct folio *src, struct folio *dst,
int page_was_mapped = 0;
struct anon_vma *anon_vma = NULL;
bool is_lru = !__PageMovable(&src->page);
+ struct list_head *prev;

__migrate_folio_extract(dst, &page_was_mapped, &anon_vma);
+ prev = dst->lru.prev;
+ list_del(&dst->lru);

rc = move_to_new_folio(dst, src, mode);
+
+ if (rc == -EAGAIN) {
+ list_add(&dst->lru, prev);
+ __migrate_folio_record(dst, page_was_mapped, anon_vma);
+ return rc;
+ }
+
if (unlikely(!is_lru))
goto out_unlock_both;

@@ -1251,7 +1299,7 @@ static int __migrate_folio_move(struct folio *src, struct folio *dst,
/* Obtain the lock on page, remove all ptes. */
static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page,
unsigned long private, struct folio *src,
- struct folio **dstp, int force,
+ struct folio **dstp, int force, bool avoid_force_lock,
enum migrate_mode mode, enum migrate_reason reason,
struct list_head *ret)
{
@@ -1279,7 +1327,7 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
*dstp = dst;

dst->private = NULL;
- rc = __migrate_folio_unmap(src, dst, force, mode);
+ rc = __migrate_folio_unmap(src, dst, force, avoid_force_lock, mode);
if (rc == MIGRATEPAGE_UNMAP)
return rc;

@@ -1287,7 +1335,7 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
* A folio that has not been unmapped will be restored to
* right list unless we want to retry.
*/
- if (rc != -EAGAIN)
+ if (rc != -EAGAIN && rc != -EDEADLOCK)
list_move_tail(&src->lru, ret);

if (put_new_page)
@@ -1326,9 +1374,8 @@ static int migrate_folio_move(free_page_t put_new_page, unsigned long private,
*/
if (rc == MIGRATEPAGE_SUCCESS) {
migrate_folio_done(src, reason);
- } else {
- if (rc != -EAGAIN)
- list_add_tail(&src->lru, ret);
+ } else if (rc != -EAGAIN) {
+ list_add_tail(&src->lru, ret);

if (put_new_page)
put_new_page(&dst->page, private);
@@ -1603,12 +1650,16 @@ static int migrate_hugetlbs(struct list_head *from, new_page_t get_new_page,
return nr_failed;
}

+/*
+ * migrate_pages_batch() first unmaps folios in the from list as many as
+ * possible, then move the unmapped folios.
+ */
static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
free_page_t put_new_page, unsigned long private,
enum migrate_mode mode, int reason, struct list_head *ret_folios,
struct migrate_pages_stats *stats)
{
- int retry = 1;
+ int retry;
int large_retry = 1;
int thp_retry = 1;
int nr_failed = 0;
@@ -1617,13 +1668,19 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
int pass = 0;
bool is_large = false;
bool is_thp = false;
- struct folio *folio, *folio2, *dst = NULL;
- int rc, nr_pages;
+ struct folio *folio, *folio2, *dst = NULL, *dst2;
+ int rc, rc_saved, nr_pages;
LIST_HEAD(split_folios);
+ LIST_HEAD(unmap_folios);
+ LIST_HEAD(dst_folios);
bool nosplit = (reason == MR_NUMA_MISPLACED);
bool no_split_folio_counting = false;
+ bool avoid_force_lock;

-split_folio_migration:
+retry:
+ rc_saved = 0;
+ avoid_force_lock = false;
+ retry = 1;
for (pass = 0;
pass < NR_MAX_MIGRATE_PAGES_RETRY && (retry || large_retry);
pass++) {
@@ -1645,16 +1702,15 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
cond_resched();

rc = migrate_folio_unmap(get_new_page, put_new_page, private,
- folio, &dst, pass > 2, mode,
- reason, ret_folios);
- if (rc == MIGRATEPAGE_UNMAP)
- rc = migrate_folio_move(put_new_page, private,
- folio, dst, mode,
- reason, ret_folios);
+ folio, &dst, pass > 2, avoid_force_lock,
+ mode, reason, ret_folios);
/*
* The rules are:
* Success: folio will be freed
+ * Unmap: folio will be put on unmap_folios list,
+ * dst folio put on dst_folios list
* -EAGAIN: stay on the from list
+ * -EDEADLOCK: stay on the from list
* -ENOMEM: stay on the from list
* -ENOSYS: stay on the from list
* Other errno: put on ret_folios list
@@ -1689,7 +1745,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
case -ENOMEM:
/*
* When memory is low, don't bother to try to migrate
- * other folios, just exit.
+ * other folios, move unmapped folios, then exit.
*/
if (is_large) {
nr_large_failed++;
@@ -1728,7 +1784,19 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
/* nr_failed isn't updated for not used */
nr_large_failed += large_retry;
stats->nr_thp_failed += thp_retry;
- goto out;
+ rc_saved = rc;
+ if (list_empty(&unmap_folios))
+ goto out;
+ else
+ goto move;
+ case -EDEADLOCK:
+ /*
+ * The folio cannot be locked for potential deadlock.
+ * Go move (and unlock) all locked folios. Then we can
+ * try again.
+ */
+ rc_saved = rc;
+ goto move;
case -EAGAIN:
if (is_large) {
large_retry++;
@@ -1742,6 +1810,15 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
stats->nr_succeeded += nr_pages;
stats->nr_thp_succeeded += is_thp;
break;
+ case MIGRATEPAGE_UNMAP:
+ /*
+ * We have locked some folios, don't force lock
+ * to avoid deadlock.
+ */
+ avoid_force_lock = true;
+ list_move_tail(&folio->lru, &unmap_folios);
+ list_add_tail(&dst->lru, &dst_folios);
+ break;
default:
/*
* Permanent failure (-EBUSY, etc.):
@@ -1765,12 +1842,95 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
nr_large_failed += large_retry;
stats->nr_thp_failed += thp_retry;
stats->nr_failed_pages += nr_retry_pages;
+move:
+ retry = 1;
+ for (pass = 0;
+ pass < NR_MAX_MIGRATE_PAGES_RETRY && (retry || large_retry);
+ pass++) {
+ retry = 0;
+ large_retry = 0;
+ thp_retry = 0;
+ nr_retry_pages = 0;
+
+ dst = list_first_entry(&dst_folios, struct folio, lru);
+ dst2 = list_next_entry(dst, lru);
+ list_for_each_entry_safe(folio, folio2, &unmap_folios, lru) {
+ is_large = folio_test_large(folio);
+ is_thp = is_large && folio_test_pmd_mappable(folio);
+ nr_pages = folio_nr_pages(folio);
+
+ cond_resched();
+
+ rc = migrate_folio_move(put_new_page, private,
+ folio, dst, mode,
+ reason, ret_folios);
+ /*
+ * The rules are:
+ * Success: folio will be freed
+ * -EAGAIN: stay on the unmap_folios list
+ * Other errno: put on ret_folios list
+ */
+ switch(rc) {
+ case -EAGAIN:
+ if (is_large) {
+ large_retry++;
+ thp_retry += is_thp;
+ } else if (!no_split_folio_counting) {
+ retry++;
+ }
+ nr_retry_pages += nr_pages;
+ break;
+ case MIGRATEPAGE_SUCCESS:
+ stats->nr_succeeded += nr_pages;
+ stats->nr_thp_succeeded += is_thp;
+ break;
+ default:
+ if (is_large) {
+ nr_large_failed++;
+ stats->nr_thp_failed += is_thp;
+ } else if (!no_split_folio_counting) {
+ nr_failed++;
+ }
+
+ stats->nr_failed_pages += nr_pages;
+ break;
+ }
+ dst = dst2;
+ dst2 = list_next_entry(dst, lru);
+ }
+ }
+ nr_failed += retry;
+ nr_large_failed += large_retry;
+ stats->nr_thp_failed += thp_retry;
+ stats->nr_failed_pages += nr_retry_pages;
+
+ if (rc_saved)
+ rc = rc_saved;
+ else
+ rc = nr_failed + nr_large_failed;
+out:
+ /* Cleanup remaining folios */
+ dst = list_first_entry(&dst_folios, struct folio, lru);
+ dst2 = list_next_entry(dst, lru);
+ list_for_each_entry_safe(folio, folio2, &unmap_folios, lru) {
+ int page_was_mapped = 0;
+ struct anon_vma *anon_vma = NULL;
+
+ __migrate_folio_extract(dst, &page_was_mapped, &anon_vma);
+ migrate_folio_undo_src(folio, page_was_mapped, anon_vma,
+ ret_folios);
+ list_del(&dst->lru);
+ migrate_folio_undo_dst(dst, put_new_page, private);
+ dst = dst2;
+ dst2 = list_next_entry(dst, lru);
+ }
+
/*
* Try to migrate split folios of fail-to-migrate large folios, no
* nr_failed counting in this round, since all split folios of a
* large folio is counted as 1 failure in the first round.
*/
- if (!list_empty(&split_folios)) {
+ if (rc >= 0 && !list_empty(&split_folios)) {
/*
* Move non-migrated folios (after NR_MAX_MIGRATE_PAGES_RETRY
* retries) to ret_folios to avoid migrating them again.
@@ -1778,12 +1938,16 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
list_splice_init(from, ret_folios);
list_splice_init(&split_folios, from);
no_split_folio_counting = true;
- retry = 1;
- goto split_folio_migration;
+ goto retry;
}

- rc = nr_failed + nr_large_failed;
-out:
+ /*
+ * We have unlocked all locked folios, so we can force lock now, let's
+ * try again.
+ */
+ if (rc == -EDEADLOCK)
+ goto retry;
+
return rc;
}

--
2.35.1


2023-02-13 12:36:02

by Huang, Ying

Subject: [PATCH -v5 6/9] migrate_pages: move migrate_folio_unmap()

Just move the position of the functions; there is no functionality
change. This makes it easier to review the next patch by putting the
code near its position in the next patch.

Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Xin Hao <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
---
mm/migrate.c | 100 +++++++++++++++++++++++++--------------------------
1 file changed, 50 insertions(+), 50 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index ee3e21f1061c..0c7488ebe248 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1236,6 +1236,56 @@ static int __migrate_folio_unmap(struct folio *src, struct folio *dst, int force
return rc;
}

+/* Obtain the lock on page, remove all ptes. */
+static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page,
+ unsigned long private, struct folio *src,
+ struct folio **dstp, int force, bool avoid_force_lock,
+ enum migrate_mode mode, enum migrate_reason reason,
+ struct list_head *ret)
+{
+ struct folio *dst;
+ int rc = MIGRATEPAGE_UNMAP;
+ struct page *newpage = NULL;
+
+ if (!thp_migration_supported() && folio_test_transhuge(src))
+ return -ENOSYS;
+
+ if (folio_ref_count(src) == 1) {
+ /* Folio was freed from under us. So we are done. */
+ folio_clear_active(src);
+ folio_clear_unevictable(src);
+ /* free_pages_prepare() will clear PG_isolated. */
+ list_del(&src->lru);
+ migrate_folio_done(src, reason);
+ return MIGRATEPAGE_SUCCESS;
+ }
+
+ newpage = get_new_page(&src->page, private);
+ if (!newpage)
+ return -ENOMEM;
+ dst = page_folio(newpage);
+ *dstp = dst;
+
+ dst->private = NULL;
+ rc = __migrate_folio_unmap(src, dst, force, avoid_force_lock, mode);
+ if (rc == MIGRATEPAGE_UNMAP)
+ return rc;
+
+ /*
+ * A folio that has not been unmapped will be restored to
+ * right list unless we want to retry.
+ */
+ if (rc != -EAGAIN && rc != -EDEADLOCK)
+ list_move_tail(&src->lru, ret);
+
+ if (put_new_page)
+ put_new_page(&dst->page, private);
+ else
+ folio_put(dst);
+
+ return rc;
+}
+
static int __migrate_folio_move(struct folio *src, struct folio *dst,
enum migrate_mode mode)
{
@@ -1296,56 +1346,6 @@ static int __migrate_folio_move(struct folio *src, struct folio *dst,
return rc;
}

-/* Obtain the lock on page, remove all ptes. */
-static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page,
- unsigned long private, struct folio *src,
- struct folio **dstp, int force, bool avoid_force_lock,
- enum migrate_mode mode, enum migrate_reason reason,
- struct list_head *ret)
-{
- struct folio *dst;
- int rc = MIGRATEPAGE_UNMAP;
- struct page *newpage = NULL;
-
- if (!thp_migration_supported() && folio_test_transhuge(src))
- return -ENOSYS;
-
- if (folio_ref_count(src) == 1) {
- /* Folio was freed from under us. So we are done. */
- folio_clear_active(src);
- folio_clear_unevictable(src);
- /* free_pages_prepare() will clear PG_isolated. */
- list_del(&src->lru);
- migrate_folio_done(src, reason);
- return MIGRATEPAGE_SUCCESS;
- }
-
- newpage = get_new_page(&src->page, private);
- if (!newpage)
- return -ENOMEM;
- dst = page_folio(newpage);
- *dstp = dst;
-
- dst->private = NULL;
- rc = __migrate_folio_unmap(src, dst, force, avoid_force_lock, mode);
- if (rc == MIGRATEPAGE_UNMAP)
- return rc;
-
- /*
- * A folio that has not been unmapped will be restored to
- * right list unless we want to retry.
- */
- if (rc != -EAGAIN && rc != -EDEADLOCK)
- list_move_tail(&src->lru, ret);
-
- if (put_new_page)
- put_new_page(&dst->page, private);
- else
- folio_put(dst);
-
- return rc;
-}
-
/* Migrate the folio to the newly allocated folio in dst. */
static int migrate_folio_move(free_page_t put_new_page, unsigned long private,
struct folio *src, struct folio *dst,
--
2.35.1


2023-02-13 12:36:10

by Huang, Ying

Subject: [PATCH -v5 7/9] migrate_pages: share more code between _unmap and _move

This is a code cleanup patch to reduce the duplicated code between the
_unmap and _move stages of migrate_pages(). No functionality change
is expected.

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Xin Hao <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
---
mm/migrate.c | 207 +++++++++++++++++++++------------------------------
1 file changed, 85 insertions(+), 122 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 0c7488ebe248..00713ccb6643 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1055,6 +1055,7 @@ static void __migrate_folio_extract(struct folio *dst,
static void migrate_folio_undo_src(struct folio *src,
int page_was_mapped,
struct anon_vma *anon_vma,
+ bool locked,
struct list_head *ret)
{
if (page_was_mapped)
@@ -1062,16 +1063,20 @@ static void migrate_folio_undo_src(struct folio *src,
/* Drop an anon_vma reference if we took one */
if (anon_vma)
put_anon_vma(anon_vma);
- folio_unlock(src);
- list_move_tail(&src->lru, ret);
+ if (locked)
+ folio_unlock(src);
+ if (ret)
+ list_move_tail(&src->lru, ret);
}

/* Restore the destination folio to the original state upon failure */
static void migrate_folio_undo_dst(struct folio *dst,
+ bool locked,
free_page_t put_new_page,
unsigned long private)
{
- folio_unlock(dst);
+ if (locked)
+ folio_unlock(dst);
if (put_new_page)
put_new_page(&dst->page, private);
else
@@ -1096,13 +1101,42 @@ static void migrate_folio_done(struct folio *src,
folio_put(src);
}

-static int __migrate_folio_unmap(struct folio *src, struct folio *dst, int force,
- bool avoid_force_lock, enum migrate_mode mode)
+/* Obtain the lock on page, remove all ptes. */
+static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page,
+ unsigned long private, struct folio *src,
+ struct folio **dstp, int force, bool avoid_force_lock,
+ enum migrate_mode mode, enum migrate_reason reason,
+ struct list_head *ret)
{
+ struct folio *dst;
int rc = -EAGAIN;
+ struct page *newpage = NULL;
int page_was_mapped = 0;
struct anon_vma *anon_vma = NULL;
bool is_lru = !__PageMovable(&src->page);
+ bool locked = false;
+ bool dst_locked = false;
+
+ if (!thp_migration_supported() && folio_test_transhuge(src))
+ return -ENOSYS;
+
+ if (folio_ref_count(src) == 1) {
+ /* Folio was freed from under us. So we are done. */
+ folio_clear_active(src);
+ folio_clear_unevictable(src);
+ /* free_pages_prepare() will clear PG_isolated. */
+ list_del(&src->lru);
+ migrate_folio_done(src, reason);
+ return MIGRATEPAGE_SUCCESS;
+ }
+
+ newpage = get_new_page(&src->page, private);
+ if (!newpage)
+ return -ENOMEM;
+ dst = page_folio(newpage);
+ *dstp = dst;
+
+ dst->private = NULL;

if (!folio_trylock(src)) {
if (!force || mode == MIGRATE_ASYNC)
@@ -1137,6 +1171,7 @@ static int __migrate_folio_unmap(struct folio *src, struct folio *dst, int force

folio_lock(src);
}
+ locked = true;

if (folio_test_writeback(src)) {
/*
@@ -1151,10 +1186,10 @@ static int __migrate_folio_unmap(struct folio *src, struct folio *dst, int force
break;
default:
rc = -EBUSY;
- goto out_unlock;
+ goto out;
}
if (!force)
- goto out_unlock;
+ goto out;
folio_wait_writeback(src);
}

@@ -1184,7 +1219,8 @@ static int __migrate_folio_unmap(struct folio *src, struct folio *dst, int force
* This is much like races on refcount of oldpage: just don't BUG().
*/
if (unlikely(!folio_trylock(dst)))
- goto out_unlock;
+ goto out;
+ dst_locked = true;

if (unlikely(!is_lru)) {
__migrate_folio_record(dst, page_was_mapped, anon_vma);
@@ -1206,7 +1242,7 @@ static int __migrate_folio_unmap(struct folio *src, struct folio *dst, int force
if (!src->mapping) {
if (folio_test_private(src)) {
try_to_free_buffers(src);
- goto out_unlock_both;
+ goto out;
}
} else if (folio_mapped(src)) {
/* Establish migration ptes */
@@ -1221,73 +1257,25 @@ static int __migrate_folio_unmap(struct folio *src, struct folio *dst, int force
return MIGRATEPAGE_UNMAP;
}

- if (page_was_mapped)
- remove_migration_ptes(src, src, false);
-
-out_unlock_both:
- folio_unlock(dst);
-out_unlock:
- /* Drop an anon_vma reference if we took one */
- if (anon_vma)
- put_anon_vma(anon_vma);
- folio_unlock(src);
out:
-
- return rc;
-}
-
-/* Obtain the lock on page, remove all ptes. */
-static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page,
- unsigned long private, struct folio *src,
- struct folio **dstp, int force, bool avoid_force_lock,
- enum migrate_mode mode, enum migrate_reason reason,
- struct list_head *ret)
-{
- struct folio *dst;
- int rc = MIGRATEPAGE_UNMAP;
- struct page *newpage = NULL;
-
- if (!thp_migration_supported() && folio_test_transhuge(src))
- return -ENOSYS;
-
- if (folio_ref_count(src) == 1) {
- /* Folio was freed from under us. So we are done. */
- folio_clear_active(src);
- folio_clear_unevictable(src);
- /* free_pages_prepare() will clear PG_isolated. */
- list_del(&src->lru);
- migrate_folio_done(src, reason);
- return MIGRATEPAGE_SUCCESS;
- }
-
- newpage = get_new_page(&src->page, private);
- if (!newpage)
- return -ENOMEM;
- dst = page_folio(newpage);
- *dstp = dst;
-
- dst->private = NULL;
- rc = __migrate_folio_unmap(src, dst, force, avoid_force_lock, mode);
- if (rc == MIGRATEPAGE_UNMAP)
- return rc;
-
/*
* A folio that has not been unmapped will be restored to
* right list unless we want to retry.
*/
- if (rc != -EAGAIN && rc != -EDEADLOCK)
- list_move_tail(&src->lru, ret);
+ if (rc == -EAGAIN || rc == -EDEADLOCK)
+ ret = NULL;

- if (put_new_page)
- put_new_page(&dst->page, private);
- else
- folio_put(dst);
+ migrate_folio_undo_src(src, page_was_mapped, anon_vma, locked, ret);
+ migrate_folio_undo_dst(dst, dst_locked, put_new_page, private);

return rc;
}

-static int __migrate_folio_move(struct folio *src, struct folio *dst,
- enum migrate_mode mode)
+/* Migrate the folio to the newly allocated folio in dst. */
+static int migrate_folio_move(free_page_t put_new_page, unsigned long private,
+ struct folio *src, struct folio *dst,
+ enum migrate_mode mode, enum migrate_reason reason,
+ struct list_head *ret)
{
int rc;
int page_was_mapped = 0;
@@ -1300,12 +1288,8 @@ static int __migrate_folio_move(struct folio *src, struct folio *dst,
list_del(&dst->lru);

rc = move_to_new_folio(dst, src, mode);
-
- if (rc == -EAGAIN) {
- list_add(&dst->lru, prev);
- __migrate_folio_record(dst, page_was_mapped, anon_vma);
- return rc;
- }
+ if (rc)
+ goto out;

if (unlikely(!is_lru))
goto out_unlock_both;
@@ -1319,70 +1303,49 @@ static int __migrate_folio_move(struct folio *src, struct folio *dst,
* unsuccessful, and other cases when a page has been temporarily
* isolated from the unevictable LRU: but this case is the easiest.
*/
- if (rc == MIGRATEPAGE_SUCCESS) {
- folio_add_lru(dst);
- if (page_was_mapped)
- lru_add_drain();
- }
+ folio_add_lru(dst);
+ if (page_was_mapped)
+ lru_add_drain();

if (page_was_mapped)
- remove_migration_ptes(src,
- rc == MIGRATEPAGE_SUCCESS ? dst : src, false);
+ remove_migration_ptes(src, dst, false);

out_unlock_both:
folio_unlock(dst);
- /* Drop an anon_vma reference if we took one */
- if (anon_vma)
- put_anon_vma(anon_vma);
- folio_unlock(src);
+ set_page_owner_migrate_reason(&dst->page, reason);
/*
* If migration is successful, decrease refcount of dst,
* which will not free the page because new page owner increased
* refcounter.
*/
- if (rc == MIGRATEPAGE_SUCCESS)
- folio_put(dst);
-
- return rc;
-}
-
-/* Migrate the folio to the newly allocated folio in dst. */
-static int migrate_folio_move(free_page_t put_new_page, unsigned long private,
- struct folio *src, struct folio *dst,
- enum migrate_mode mode, enum migrate_reason reason,
- struct list_head *ret)
-{
- int rc;
-
- rc = __migrate_folio_move(src, dst, mode);
- if (rc == MIGRATEPAGE_SUCCESS)
- set_page_owner_migrate_reason(&dst->page, reason);
-
- if (rc != -EAGAIN) {
- /*
- * A folio that has been migrated has all references
- * removed and will be freed. A folio that has not been
- * migrated will have kept its references and be restored.
- */
- list_del(&src->lru);
- }
+ folio_put(dst);

/*
- * If migration is successful, releases reference grabbed during
- * isolation. Otherwise, restore the folio to right list unless
- * we want to retry.
+ * A folio that has been migrated has all references removed
+ * and will be freed.
*/
- if (rc == MIGRATEPAGE_SUCCESS) {
- migrate_folio_done(src, reason);
- } else if (rc != -EAGAIN) {
- list_add_tail(&src->lru, ret);
+ list_del(&src->lru);
+ /* Drop an anon_vma reference if we took one */
+ if (anon_vma)
+ put_anon_vma(anon_vma);
+ folio_unlock(src);
+ migrate_folio_done(src, reason);

- if (put_new_page)
- put_new_page(&dst->page, private);
- else
- folio_put(dst);
+ return rc;
+out:
+ /*
+ * A folio that has not been migrated will be restored to
+ * right list unless we want to retry.
+ */
+ if (rc == -EAGAIN) {
+ list_add(&dst->lru, prev);
+ __migrate_folio_record(dst, page_was_mapped, anon_vma);
+ return rc;
}

+ migrate_folio_undo_src(src, page_was_mapped, anon_vma, true, ret);
+ migrate_folio_undo_dst(dst, true, put_new_page, private);
+
return rc;
}

@@ -1918,9 +1881,9 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,

__migrate_folio_extract(dst, &page_was_mapped, &anon_vma);
migrate_folio_undo_src(folio, page_was_mapped, anon_vma,
- ret_folios);
+ true, ret_folios);
list_del(&dst->lru);
- migrate_folio_undo_dst(dst, put_new_page, private);
+ migrate_folio_undo_dst(dst, true, put_new_page, private);
dst = dst2;
dst2 = list_next_entry(dst, lru);
}
--
2.35.1


2023-02-13 12:36:14

by Huang, Ying

[permalink] [raw]
Subject: [PATCH -v5 8/9] migrate_pages: batch flushing TLB

TLB flushing can cost quite a few CPU cycles during folio migration
in some situations, for example, when migrating a folio of a process
with multiple active threads that run on multiple CPUs. After
batching the _unmap and _move stages in migrate_pages(), the TLB
flushing can easily be batched with the existing TLB flush batching
mechanism. This patch implements that.

We use the following test case to test the patch.

On a 2-socket Intel server,

- Run pmbench memory accessing benchmark

- Run `migratepages` to migrate pages of pmbench between node 0 and
node 1 back and forth.

With the patch, the TLB flushing IPI reduces 99.1% during the test and
the number of pages migrated successfully per second increases 291.7%.

Haoxin helped to test the patchset on an ARM64 server with 128 cores,
2 NUMA nodes. Test results show that the page migration performance
increases up to 78%.

NOTE: TLB flushing is batched only for normal folios, not for THP
folios, because the overhead of TLB flushing for THP folios is much
lower than that for normal folios (about 1/512 on the x86 platform).
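
Schematically, the batched flow amounts to the sketch below
(illustrative only; error handling, retries and the large folio cases
are omitted, and the real changes are in the diff that follows):

/* Illustrative sketch only, not the actual code. */
static void migrate_folios_batch_sketch(struct list_head *folios)
{
        struct folio *folio;

        /* Pass 1: unmap every folio, deferring the TLB flush. */
        list_for_each_entry(folio, folios, lru)
                try_to_migrate(folio, TTU_BATCH_FLUSH);

        /* Pass 2: a single flush covers all folios unmapped above. */
        try_to_unmap_flush();

        /*
         * Pass 3: copy each folio and restore its migration PTEs,
         * as migrate_folio_move() does for one folio.
         */
}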

Signed-off-by: "Huang, Ying" <[email protected]>
Tested-by: Xin Hao <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Reviewed-by: Xin Hao <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
---
mm/migrate.c | 5 ++++-
mm/rmap.c | 20 +++++++++++++++++---
2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 00713ccb6643..2fa420e4f68c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1248,7 +1248,7 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
/* Establish migration ptes */
VM_BUG_ON_FOLIO(folio_test_anon(src) &&
!folio_test_ksm(src) && !anon_vma, src);
- try_to_migrate(src, 0);
+ try_to_migrate(src, TTU_BATCH_FLUSH);
page_was_mapped = 1;
}

@@ -1806,6 +1806,9 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
stats->nr_thp_failed += thp_retry;
stats->nr_failed_pages += nr_retry_pages;
move:
+ /* Flush TLBs for all unmapped folios */
+ try_to_unmap_flush();
+
retry = 1;
for (pass = 0;
pass < NR_MAX_MIGRATE_PAGES_RETRY && (retry || large_retry);
diff --git a/mm/rmap.c b/mm/rmap.c
index 8287f2cc327d..15ae24585fc4 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1952,7 +1952,21 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
} else {
flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
/* Nuke the page table entry. */
- pteval = ptep_clear_flush(vma, address, pvmw.pte);
+ if (should_defer_flush(mm, flags)) {
+ /*
+ * We clear the PTE but do not flush so potentially
+ * a remote CPU could still be writing to the folio.
+ * If the entry was previously clean then the
+ * architecture must guarantee that a clear->dirty
+ * transition on a cached TLB entry is written through
+ * and traps if the PTE is unmapped.
+ */
+ pteval = ptep_get_and_clear(mm, address, pvmw.pte);
+
+ set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+ } else {
+ pteval = ptep_clear_flush(vma, address, pvmw.pte);
+ }
}

/* Set the dirty flag on the folio now the pte is gone. */
@@ -2124,10 +2138,10 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)

/*
* Migration always ignores mlock and only supports TTU_RMAP_LOCKED and
- * TTU_SPLIT_HUGE_PMD and TTU_SYNC flags.
+ * TTU_SPLIT_HUGE_PMD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
*/
if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
- TTU_SYNC)))
+ TTU_SYNC | TTU_BATCH_FLUSH)))
return;

if (folio_is_zone_device(folio) &&
--
2.35.1


2023-02-13 12:36:28

by Huang, Ying

[permalink] [raw]
Subject: [PATCH -v5 9/9] migrate_pages: move THP/hugetlb migration support check to simplify code

This is a code cleanup patch; no functional change is expected.
After the change, the number of code lines is reduced, especially in
the long migrate_pages_batch().

Signed-off-by: "Huang, Ying" <[email protected]>
Suggested-by: Alistair Popple <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Xin Hao <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
---
mm/migrate.c | 83 +++++++++++++++++++++++-----------------------------
1 file changed, 36 insertions(+), 47 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 2fa420e4f68c..ef68a1aff35c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1117,9 +1117,6 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
bool locked = false;
bool dst_locked = false;

- if (!thp_migration_supported() && folio_test_transhuge(src))
- return -ENOSYS;
-
if (folio_ref_count(src) == 1) {
/* Folio was freed from under us. So we are done. */
folio_clear_active(src);
@@ -1380,16 +1377,6 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
struct anon_vma *anon_vma = NULL;
struct address_space *mapping = NULL;

- /*
- * Migratability of hugepages depends on architectures and their size.
- * This check is necessary because some callers of hugepage migration
- * like soft offline and memory hotremove don't walk through page
- * tables or check whether the hugepage is pmd-based or not before
- * kicking migration.
- */
- if (!hugepage_migration_supported(page_hstate(hpage)))
- return -ENOSYS;
-
if (folio_ref_count(src) == 1) {
/* page was freed from under us. So we are done. */
folio_putback_active_hugetlb(src);
@@ -1556,6 +1543,20 @@ static int migrate_hugetlbs(struct list_head *from, new_page_t get_new_page,

cond_resched();

+ /*
+ * Migratability of hugepages depends on architectures and
+ * their size. This check is necessary because some callers
+ * of hugepage migration like soft offline and memory
+ * hotremove don't walk through page tables or check whether
+ * the hugepage is pmd-based or not before kicking migration.
+ */
+ if (!hugepage_migration_supported(folio_hstate(folio))) {
+ nr_failed++;
+ stats->nr_failed_pages += nr_pages;
+ list_move_tail(&folio->lru, ret_folios);
+ continue;
+ }
+
rc = unmap_and_move_huge_page(get_new_page,
put_new_page, private,
&folio->page, pass > 2, mode,
@@ -1565,16 +1566,9 @@ static int migrate_hugetlbs(struct list_head *from, new_page_t get_new_page,
* Success: hugetlb folio will be put back
* -EAGAIN: stay on the from list
* -ENOMEM: stay on the from list
- * -ENOSYS: stay on the from list
* Other errno: put on ret_folios list
*/
switch(rc) {
- case -ENOSYS:
- /* Hugetlb migration is unsupported */
- nr_failed++;
- stats->nr_failed_pages += nr_pages;
- list_move_tail(&folio->lru, ret_folios);
- break;
case -ENOMEM:
/*
* When memory is low, don't bother to try to migrate
@@ -1664,6 +1658,28 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,

cond_resched();

+ /*
+ * Large folio migration might be unsupported or
+ * the allocation might be failed so we should retry
+ * on the same folio with the large folio split
+ * to normal folios.
+ *
+ * Split folios are put in split_folios, and
+ * we will migrate them after the rest of the
+ * list is processed.
+ */
+ if (!thp_migration_supported() && is_thp) {
+ nr_large_failed++;
+ stats->nr_thp_failed++;
+ if (!try_split_folio(folio, &split_folios)) {
+ stats->nr_thp_split++;
+ continue;
+ }
+ stats->nr_failed_pages += nr_pages;
+ list_move_tail(&folio->lru, ret_folios);
+ continue;
+ }
+
rc = migrate_folio_unmap(get_new_page, put_new_page, private,
folio, &dst, pass > 2, avoid_force_lock,
mode, reason, ret_folios);
@@ -1675,36 +1691,9 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
* -EAGAIN: stay on the from list
* -EDEADLOCK: stay on the from list
* -ENOMEM: stay on the from list
- * -ENOSYS: stay on the from list
* Other errno: put on ret_folios list
*/
switch(rc) {
- /*
- * Large folio migration might be unsupported or
- * the allocation could've failed so we should retry
- * on the same folio with the large folio split
- * to normal folios.
- *
- * Split folios are put in split_folios, and
- * we will migrate them after the rest of the
- * list is processed.
- */
- case -ENOSYS:
- /* Large folio migration is unsupported */
- if (is_large) {
- nr_large_failed++;
- stats->nr_thp_failed += is_thp;
- if (!try_split_folio(folio, &split_folios)) {
- stats->nr_thp_split += is_thp;
- break;
- }
- } else if (!no_split_folio_counting) {
- nr_failed++;
- }
-
- stats->nr_failed_pages += nr_pages;
- list_move_tail(&folio->lru, ret_folios);
- break;
case -ENOMEM:
/*
* When memory is low, don't bother to try to migrate
--
2.35.1


2023-02-17 21:48:05

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH -v5 0/9] migrate_pages(): batch TLB flushing

On Mon, 13 Feb 2023, Huang Ying wrote:

> From: "Huang, Ying" <[email protected]>
>
> Now, migrate_pages() migrate folios one by one, like the fake code as
> follows,
>
> for each folio
> unmap
> flush TLB
> copy
> restore map
>
> If multiple folios are passed to migrate_pages(), there are
> opportunities to batch the TLB flushing and copying. That is, we can
> change the code to something as follows,
>
> for each folio
> unmap
> for each folio
> flush TLB
> for each folio
> copy
> for each folio
> restore map
>
> The total number of TLB flushing IPI can be reduced considerably. And
> we may use some hardware accelerator such as DSA to accelerate the
> folio copying.
>
> So in this patch, we refactor the migrate_pages() implementation and
> implement the TLB flushing batching. Base on this, hardware
> accelerated folio copying can be implemented.
>
> If too many folios are passed to migrate_pages(), in the naive batched
> implementation, we may unmap too many folios at the same time. The
> possibility for a task to wait for the migrated folios to be mapped
> again increases. So the latency may be hurt. To deal with this
> issue, the max number of folios be unmapped in batch is restricted to
> no more than HPAGE_PMD_NR in the unit of page. That is, the influence
> is at the same level of THP migration.
>
> We use the following test to measure the performance impact of the
> patchset,
>
> On a 2-socket Intel server,
>
> - Run pmbench memory accessing benchmark
>
> - Run `migratepages` to migrate pages of pmbench between node 0 and
> node 1 back and forth.
>
> With the patch, the TLB flushing IPI reduces 99.1% during the test and
> the number of pages migrated successfully per second increases 291.7%.
>
> Xin Hao helped to test the patchset on an ARM64 server with 128 cores,
> 2 NUMA nodes. Test results show that the page migration performance
> increases up to 78%.
>
> This patchset is based on mm-unstable 2023-02-10.

And back in linux-next this week: I tried next-20230217 overnight.

There is a deadlock in this patchset (and in previous versions: sorry
it's taken me so long to report), but I think one that's easily solved.

I've not bisected to precisely which patch (load can take several hours
to hit the deadlock), but it doesn't really matter, and I expect that
you can guess.

My root and home filesystems are ext4 (4kB blocks with 4kB PAGE_SIZE),
and so is the filesystem I'm testing, ext4 on /dev/loop0 on tmpfs.
So, plenty of ext4 page cache and buffer_heads.

Again and again, the deadlock is seen with buffer_migrate_folio_norefs(),
either in kcompactd0 or in khugepaged trying to compact, or in both:
it ends up calling __lock_buffer(), and that schedules away, waiting
forever to get BH_lock. I have not identified who is holding BH_lock,
but I imagine a jbd2 journalling thread, and presume that it wants one
of the folio locks which migrate_pages_batch() is already holding; or
maybe it's all more convoluted than that. Other tasks then back up
waiting on those folio locks held in the batch.

Never a problem with buffer_migrate_folio(), always with the "more
careful" buffer_migrate_folio_norefs(). And the patch below fixes
it for me: I've had enough hours with it now, on enough occasions,
to be confident of that.

Cc'ing Jan Kara, who knows buffer_migrate_folio_norefs() and jbd2
very well, and I hope can assure us that there is an understandable
deadlock here, from holding several random folio locks, then trying
to lock buffers. Cc'ing fsdevel, because there's a risk that mm
folk think something is safe, when it's not sufficient to cope with
the diversity of filesystems. I hope nothing more than the below is
needed (and I've had no other problems with the patchset: good job),
but cannot be sure.

[PATCH next] migrate_pages: fix deadlock on buffer heads

When __buffer_migrate_folio() is called from buffer_migrate_folio_norefs(),
force MIGRATE_ASYNC mode so that buffer_migrate_lock_buffers() will only
trylock_buffer(), failing with -EAGAIN as usual if that does not succeed.

Signed-off-by: Hugh Dickins <[email protected]>

--- next-20230217/mm/migrate.c
+++ fixed/mm/migrate.c
@@ -748,7 +748,8 @@ static int __buffer_migrate_folio(struct
if (folio_ref_count(src) != expected_count)
return -EAGAIN;

- if (!buffer_migrate_lock_buffers(head, mode))
+ if (!buffer_migrate_lock_buffers(head,
+ check_refs ? MIGRATE_ASYNC : mode))
return -EAGAIN;

if (check_refs) {

2023-02-20 09:29:09

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH -v5 0/9] migrate_pages(): batch TLB flushing

Hi, Hugh,

Hugh Dickins <[email protected]> writes:

> On Mon, 13 Feb 2023, Huang Ying wrote:
>
>> From: "Huang, Ying" <[email protected]>
>>
>> Now, migrate_pages() migrate folios one by one, like the fake code as
>> follows,
>>
>> for each folio
>> unmap
>> flush TLB
>> copy
>> restore map
>>
>> If multiple folios are passed to migrate_pages(), there are
>> opportunities to batch the TLB flushing and copying. That is, we can
>> change the code to something as follows,
>>
>> for each folio
>> unmap
>> for each folio
>> flush TLB
>> for each folio
>> copy
>> for each folio
>> restore map
>>
>> The total number of TLB flushing IPI can be reduced considerably. And
>> we may use some hardware accelerator such as DSA to accelerate the
>> folio copying.
>>
>> So in this patch, we refactor the migrate_pages() implementation and
>> implement the TLB flushing batching. Base on this, hardware
>> accelerated folio copying can be implemented.
>>
>> If too many folios are passed to migrate_pages(), in the naive batched
>> implementation, we may unmap too many folios at the same time. The
>> possibility for a task to wait for the migrated folios to be mapped
>> again increases. So the latency may be hurt. To deal with this
>> issue, the max number of folios be unmapped in batch is restricted to
>> no more than HPAGE_PMD_NR in the unit of page. That is, the influence
>> is at the same level of THP migration.
>>
>> We use the following test to measure the performance impact of the
>> patchset,
>>
>> On a 2-socket Intel server,
>>
>> - Run pmbench memory accessing benchmark
>>
>> - Run `migratepages` to migrate pages of pmbench between node 0 and
>> node 1 back and forth.
>>
>> With the patch, the TLB flushing IPI reduces 99.1% during the test and
>> the number of pages migrated successfully per second increases 291.7%.
>>
>> Xin Hao helped to test the patchset on an ARM64 server with 128 cores,
>> 2 NUMA nodes. Test results show that the page migration performance
>> increases up to 78%.
>>
>> This patchset is based on mm-unstable 2023-02-10.
>
> And back in linux-next this week: I tried next-20230217 overnight.
>
> There is a deadlock in this patchset (and in previous versions: sorry
> it's taken me so long to report), but I think one that's easily solved.
>
> I've not bisected to precisely which patch (load can take several hours
> to hit the deadlock), but it doesn't really matter, and I expect that
> you can guess.
>
> My root and home filesystems are ext4 (4kB blocks with 4kB PAGE_SIZE),
> and so is the filesystem I'm testing, ext4 on /dev/loop0 on tmpfs.
> So, plenty of ext4 page cache and buffer_heads.
>
> Again and again, the deadlock is seen with buffer_migrate_folio_norefs(),
> either in kcompactd0 or in khugepaged trying to compact, or in both:
> it ends up calling __lock_buffer(), and that schedules away, waiting
> forever to get BH_lock. I have not identified who is holding BH_lock,
> but I imagine a jbd2 journalling thread, and presume that it wants one
> of the folio locks which migrate_pages_batch() is already holding; or
> maybe it's all more convoluted than that. Other tasks then back up
> waiting on those folio locks held in the batch.
>
> Never a problem with buffer_migrate_folio(), always with the "more
> careful" buffer_migrate_folio_norefs(). And the patch below fixes
> it for me: I've had enough hours with it now, on enough occasions,
> to be confident of that.
>
> Cc'ing Jan Kara, who knows buffer_migrate_folio_norefs() and jbd2
> very well, and I hope can assure us that there is an understandable
> deadlock here, from holding several random folio locks, then trying
> to lock buffers. Cc'ing fsdevel, because there's a risk that mm
> folk think something is safe, when it's not sufficient to cope with
> the diversity of filesystems. I hope nothing more than the below is
> needed (and I've had no other problems with the patchset: good job),
> but cannot be sure.
>
> [PATCH next] migrate_pages: fix deadlock on buffer heads
>
> When __buffer_migrate_folio() is called from buffer_migrate_folio_norefs(),
> force MIGRATE_ASYNC mode so that buffer_migrate_lock_buffers() will only
> trylock_buffer(), failing with -EAGAIN as usual if that does not succeed.
>
> Signed-off-by: Hugh Dickins <[email protected]>
>
> --- next-20230217/mm/migrate.c
> +++ fixed/mm/migrate.c
> @@ -748,7 +748,8 @@ static int __buffer_migrate_folio(struct
> if (folio_ref_count(src) != expected_count)
> return -EAGAIN;
>
> - if (!buffer_migrate_lock_buffers(head, mode))
> + if (!buffer_migrate_lock_buffers(head,
> + check_refs ? MIGRATE_ASYNC : mode))
> return -EAGAIN;
>
> if (check_refs) {

Thank you very much for pointing this out and for the fix patch. Today,
my colleague Pengfei reported a deadlock bug to me. It seems that we
cannot wait for writeback to complete when we have already locked some
folios. The patch below can fix that deadlock. I don't know whether it
is related to the deadlock you ran into. It appears that we should
avoid locking/waiting synchronously if we have locked more than one
folio.

Best Regards,
Huang, Ying

------------------------------------8<------------------------------------
From 0699fa2f80a67e863107d49a25909c92b900a9be Mon Sep 17 00:00:00 2001
From: Huang Ying <[email protected]>
Date: Mon, 20 Feb 2023 14:56:34 +0800
Subject: [PATCH] migrate_pages: fix deadlock on waiting writeback

Pengfei reported a system soft lockup issue with Syzkaller. The stack
traces are as follows,

...
[ 300.124933] INFO: task kworker/u4:3:73 blocked for more than 147 seconds.
[ 300.125214] Not tainted 6.2.0-rc4-kvm+ #1314
[ 300.125408] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 300.125736] task:kworker/u4:3 state:D stack:0 pid:73 ppid:2 flags:0x00004000
[ 300.126059] Workqueue: writeback wb_workfn (flush-7:3)
[ 300.126282] Call Trace:
[ 300.126378] <TASK>
[ 300.126464] __schedule+0x43b/0xd00
[ 300.126601] ? __blk_flush_plug+0x142/0x180
[ 300.126765] schedule+0x6a/0xf0
[ 300.126912] io_schedule+0x4a/0x80
[ 300.127051] folio_wait_bit_common+0x1b5/0x4e0
[ 300.127227] ? __pfx_wake_page_function+0x10/0x10
[ 300.127403] __folio_lock+0x27/0x40
[ 300.127541] write_cache_pages+0x350/0x870
[ 300.127699] ? __pfx_iomap_do_writepage+0x10/0x10
[ 300.127889] iomap_writepages+0x3f/0x80
[ 300.128037] xfs_vm_writepages+0x94/0xd0
[ 300.128192] ? __pfx_xfs_vm_writepages+0x10/0x10
[ 300.128370] do_writepages+0x10a/0x240
[ 300.128514] ? lock_is_held_type+0xe6/0x140
[ 300.128675] __writeback_single_inode+0x9f/0xa90
[ 300.128854] writeback_sb_inodes+0x2fb/0x8d0
[ 300.129030] __writeback_inodes_wb+0x68/0x150
[ 300.129212] wb_writeback+0x49c/0x770
[ 300.129357] wb_workfn+0x6fb/0x9d0
[ 300.129500] process_one_work+0x3cc/0x8d0
[ 300.129669] worker_thread+0x66/0x630
[ 300.129824] ? __pfx_worker_thread+0x10/0x10
[ 300.129989] kthread+0x153/0x190
[ 300.130116] ? __pfx_kthread+0x10/0x10
[ 300.130264] ret_from_fork+0x29/0x50
[ 300.130409] </TASK>
[ 300.179347] INFO: task repro:1023 blocked for more than 147 seconds.
[ 300.179905] Not tainted 6.2.0-rc4-kvm+ #1314
[ 300.180317] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 300.180955] task:repro state:D stack:0 pid:1023 ppid:360 flags:0x00004004
[ 300.181660] Call Trace:
[ 300.181879] <TASK>
[ 300.182085] __schedule+0x43b/0xd00
[ 300.182407] schedule+0x6a/0xf0
[ 300.182694] io_schedule+0x4a/0x80
[ 300.183020] folio_wait_bit_common+0x1b5/0x4e0
[ 300.183506] ? compaction_alloc+0x77/0x1150
[ 300.183892] ? __pfx_wake_page_function+0x10/0x10
[ 300.184304] folio_wait_bit+0x30/0x40
[ 300.184640] folio_wait_writeback+0x2e/0x1e0
[ 300.185034] migrate_pages_batch+0x555/0x1ac0
[ 300.185462] ? __pfx_compaction_alloc+0x10/0x10
[ 300.185808] ? __pfx_compaction_free+0x10/0x10
[ 300.186022] ? __this_cpu_preempt_check+0x17/0x20
[ 300.186234] ? lock_is_held_type+0xe6/0x140
[ 300.186423] migrate_pages+0x100e/0x1180
[ 300.186603] ? __pfx_compaction_free+0x10/0x10
[ 300.186800] ? __pfx_compaction_alloc+0x10/0x10
[ 300.187011] compact_zone+0xe10/0x1b50
[ 300.187182] ? lock_is_held_type+0xe6/0x140
[ 300.187374] ? check_preemption_disabled+0x80/0xf0
[ 300.187588] compact_node+0xa3/0x100
[ 300.187755] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[ 300.187993] ? _find_first_bit+0x7b/0x90
[ 300.188171] sysctl_compaction_handler+0x5d/0xb0
[ 300.188376] proc_sys_call_handler+0x29d/0x420
[ 300.188583] proc_sys_write+0x2b/0x40
[ 300.188749] vfs_write+0x3a3/0x780
[ 300.188912] ksys_write+0xb7/0x180
[ 300.189070] __x64_sys_write+0x26/0x30
[ 300.189260] do_syscall_64+0x3b/0x90
[ 300.189424] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 300.189654] RIP: 0033:0x7f3a2471f59d
[ 300.189815] RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001
[ 300.190137] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d
[ 300.190397] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
[ 300.190653] RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010
[ 300.190910] R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0
[ 300.191172] R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000
[ 300.191440] </TASK>
...

To migrate a folio, we may wait for the writeback of a folio to
complete while we already hold the locks of some other folios. But
the writeback code may in turn wait to lock a folio whose lock we
hold. This causes the deadlock. To fix the issue, avoid waiting for
writeback to complete if we have already locked some folios. After
the locked folios have been moved and unlocked, we will retry.

Signed-off-by: "Huang, Ying" <[email protected]>
Reported-by: "Xu, Pengfei" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Stefan Roesch <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Xin Hao <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mike Kravetz <[email protected]>
---
mm/migrate.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index 28b435cdeac8..bc9a8050f1b0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1205,6 +1205,18 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
}
if (!force)
goto out;
+ /*
+ * We have locked some folios and are going to wait the
+ * writeback of this folio to complete. But it's possible for
+ * the writeback to wait to lock the folios we have locked. To
+ * avoid a potential deadlock, let's bail out and not do that.
+ * The locked folios will be moved and unlocked, then we
+ * can wait the writeback of this folio.
+ */
+ if (avoid_force_lock) {
+ rc = -EDEADLOCK;
+ goto out;
+ }
folio_wait_writeback(src);
}

--
2.39.1


2023-02-21 02:50:56

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH -v5 0/9] migrate_pages(): batch TLB flushing

On Mon, 20 Feb 2023, Huang, Ying wrote:

> Hi, Hugh,
>
> Hugh Dickins <[email protected]> writes:
>
> > On Mon, 13 Feb 2023, Huang Ying wrote:
> >
> >> From: "Huang, Ying" <[email protected]>
> >>
> >> Now, migrate_pages() migrate folios one by one, like the fake code as
> >> follows,
> >>
> >> for each folio
> >> unmap
> >> flush TLB
> >> copy
> >> restore map
> >>
> >> If multiple folios are passed to migrate_pages(), there are
> >> opportunities to batch the TLB flushing and copying. That is, we can
> >> change the code to something as follows,
> >>
> >> for each folio
> >> unmap
> >> for each folio
> >> flush TLB
> >> for each folio
> >> copy
> >> for each folio
> >> restore map
> >>
> >> The total number of TLB flushing IPI can be reduced considerably. And
> >> we may use some hardware accelerator such as DSA to accelerate the
> >> folio copying.
> >>
> >> So in this patch, we refactor the migrate_pages() implementation and
> >> implement the TLB flushing batching. Base on this, hardware
> >> accelerated folio copying can be implemented.
> >>
> >> If too many folios are passed to migrate_pages(), in the naive batched
> >> implementation, we may unmap too many folios at the same time. The
> >> possibility for a task to wait for the migrated folios to be mapped
> >> again increases. So the latency may be hurt. To deal with this
> >> issue, the max number of folios be unmapped in batch is restricted to
> >> no more than HPAGE_PMD_NR in the unit of page. That is, the influence
> >> is at the same level of THP migration.
> >>
> >> We use the following test to measure the performance impact of the
> >> patchset,
> >>
> >> On a 2-socket Intel server,
> >>
> >> - Run pmbench memory accessing benchmark
> >>
> >> - Run `migratepages` to migrate pages of pmbench between node 0 and
> >> node 1 back and forth.
> >>
> >> With the patch, the TLB flushing IPI reduces 99.1% during the test and
> >> the number of pages migrated successfully per second increases 291.7%.
> >>
> >> Xin Hao helped to test the patchset on an ARM64 server with 128 cores,
> >> 2 NUMA nodes. Test results show that the page migration performance
> >> increases up to 78%.
> >>
> >> This patchset is based on mm-unstable 2023-02-10.
> >
> > And back in linux-next this week: I tried next-20230217 overnight.
> >
> > There is a deadlock in this patchset (and in previous versions: sorry
> > it's taken me so long to report), but I think one that's easily solved.
> >
> > I've not bisected to precisely which patch (load can take several hours
> > to hit the deadlock), but it doesn't really matter, and I expect that
> > you can guess.
> >
> > My root and home filesystems are ext4 (4kB blocks with 4kB PAGE_SIZE),
> > and so is the filesystem I'm testing, ext4 on /dev/loop0 on tmpfs.
> > So, plenty of ext4 page cache and buffer_heads.
> >
> > Again and again, the deadlock is seen with buffer_migrate_folio_norefs(),
> > either in kcompactd0 or in khugepaged trying to compact, or in both:
> > it ends up calling __lock_buffer(), and that schedules away, waiting
> > forever to get BH_lock. I have not identified who is holding BH_lock,
> > but I imagine a jbd2 journalling thread, and presume that it wants one
> > of the folio locks which migrate_pages_batch() is already holding; or
> > maybe it's all more convoluted than that. Other tasks then back up
> > waiting on those folio locks held in the batch.
> >
> > Never a problem with buffer_migrate_folio(), always with the "more
> > careful" buffer_migrate_folio_norefs(). And the patch below fixes
> > it for me: I've had enough hours with it now, on enough occasions,
> > to be confident of that.
> >
> > Cc'ing Jan Kara, who knows buffer_migrate_folio_norefs() and jbd2
> > very well, and I hope can assure us that there is an understandable
> > deadlock here, from holding several random folio locks, then trying
> > to lock buffers. Cc'ing fsdevel, because there's a risk that mm
> > folk think something is safe, when it's not sufficient to cope with
> > the diversity of filesystems. I hope nothing more than the below is
> > needed (and I've had no other problems with the patchset: good job),
> > but cannot be sure.
> >
> > [PATCH next] migrate_pages: fix deadlock on buffer heads
> >
> > When __buffer_migrate_folio() is called from buffer_migrate_folio_norefs(),
> > force MIGRATE_ASYNC mode so that buffer_migrate_lock_buffers() will only
> > trylock_buffer(), failing with -EAGAIN as usual if that does not succeed.
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
> >
> > --- next-20230217/mm/migrate.c
> > +++ fixed/mm/migrate.c
> > @@ -748,7 +748,8 @@ static int __buffer_migrate_folio(struct
> > if (folio_ref_count(src) != expected_count)
> > return -EAGAIN;
> >
> > - if (!buffer_migrate_lock_buffers(head, mode))
> > + if (!buffer_migrate_lock_buffers(head,
> > + check_refs ? MIGRATE_ASYNC : mode))
> > return -EAGAIN;
> >
> > if (check_refs) {
>
> Thank you very much for pointing this out and for the fix patch. Today,
> my colleague Pengfei reported a deadlock bug to me. It seems that we
> cannot wait for writeback to complete when we have already locked some
> folios. The patch below can fix that deadlock. I don't know whether it
> is related to the deadlock you ran into. It appears that we should
> avoid locking/waiting synchronously if we have locked more than one
> folio.

Thanks, I've checked now, on next-20230217 without my patch but
with your patch below: it took a few hours, but still deadlocks
as I described above, so it's not the same issue.

Yes, that's a good principle, that we should avoid to lock/wait
synchronously once we have locked one folio (hmm, above you say
"more than one": I think we mean the same thing, we're just
stating it differently, given how the code runs at present).

I'm not a great fan of migrate_folio_unmap()'s arguments,
"force" followed by "oh, but don't force" (but applaud the recent
"avoid_force_lock" as much better than the original "force_lock").
I haven't tried, but I wonder if you can avoid both those arguments,
and both of these patches, by passing down an adjusted mode (perhaps
MIGRATE_ASYNC, or perhaps a new mode) to all callees, once the first
folio of a batch has been acquired (then restore to the original mode
when starting a new batch).

(My patch is weak in that it trylocks for buffer_head even on the
first folio of a MIGRATE_SYNC norefs batch, although that has never
given a problem historically: adjusting the mode after acquiring
the first folio would correct that weakness.)
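
For illustration, the adjusted-mode idea might look something like the
sketch below (untested; batch_migrate_mode() is a made-up name):

static enum migrate_mode batch_migrate_mode(enum migrate_mode mode,
                                            int nr_locked_in_batch)
{
        /* First folio of the batch: honour the caller's mode. */
        if (!nr_locked_in_batch)
                return mode;
        /* Folio locks already held: never block synchronously. */
        return MIGRATE_ASYNC;
}

migrate_folio_unmap() and __buffer_migrate_folio() would then take the
adjusted mode instead of the force/avoid_force_lock flags, and the
caller would go back to the original mode when starting a new batch.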

Hugh

>
> Best Regards,
> Huang, Ying
>
> ------------------------------------8<------------------------------------
> From 0699fa2f80a67e863107d49a25909c92b900a9be Mon Sep 17 00:00:00 2001
> From: Huang Ying <[email protected]>
> Date: Mon, 20 Feb 2023 14:56:34 +0800
> Subject: [PATCH] migrate_pages: fix deadlock on waiting writeback
>
> Pengfei reported a system soft lockup issue with Syzkaller. The stack
> traces are as follows,
>
> ...
> [ 300.124933] INFO: task kworker/u4:3:73 blocked for more than 147 seconds.
> [ 300.125214] Not tainted 6.2.0-rc4-kvm+ #1314
> [ 300.125408] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 300.125736] task:kworker/u4:3 state:D stack:0 pid:73 ppid:2 flags:0x00004000
> [ 300.126059] Workqueue: writeback wb_workfn (flush-7:3)
> [ 300.126282] Call Trace:
> [ 300.126378] <TASK>
> [ 300.126464] __schedule+0x43b/0xd00
> [ 300.126601] ? __blk_flush_plug+0x142/0x180
> [ 300.126765] schedule+0x6a/0xf0
> [ 300.126912] io_schedule+0x4a/0x80
> [ 300.127051] folio_wait_bit_common+0x1b5/0x4e0
> [ 300.127227] ? __pfx_wake_page_function+0x10/0x10
> [ 300.127403] __folio_lock+0x27/0x40
> [ 300.127541] write_cache_pages+0x350/0x870
> [ 300.127699] ? __pfx_iomap_do_writepage+0x10/0x10
> [ 300.127889] iomap_writepages+0x3f/0x80
> [ 300.128037] xfs_vm_writepages+0x94/0xd0
> [ 300.128192] ? __pfx_xfs_vm_writepages+0x10/0x10
> [ 300.128370] do_writepages+0x10a/0x240
> [ 300.128514] ? lock_is_held_type+0xe6/0x140
> [ 300.128675] __writeback_single_inode+0x9f/0xa90
> [ 300.128854] writeback_sb_inodes+0x2fb/0x8d0
> [ 300.129030] __writeback_inodes_wb+0x68/0x150
> [ 300.129212] wb_writeback+0x49c/0x770
> [ 300.129357] wb_workfn+0x6fb/0x9d0
> [ 300.129500] process_one_work+0x3cc/0x8d0
> [ 300.129669] worker_thread+0x66/0x630
> [ 300.129824] ? __pfx_worker_thread+0x10/0x10
> [ 300.129989] kthread+0x153/0x190
> [ 300.130116] ? __pfx_kthread+0x10/0x10
> [ 300.130264] ret_from_fork+0x29/0x50
> [ 300.130409] </TASK>
> [ 300.179347] INFO: task repro:1023 blocked for more than 147 seconds.
> [ 300.179905] Not tainted 6.2.0-rc4-kvm+ #1314
> [ 300.180317] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 300.180955] task:repro state:D stack:0 pid:1023 ppid:360 flags:0x00004004
> [ 300.181660] Call Trace:
> [ 300.181879] <TASK>
> [ 300.182085] __schedule+0x43b/0xd00
> [ 300.182407] schedule+0x6a/0xf0
> [ 300.182694] io_schedule+0x4a/0x80
> [ 300.183020] folio_wait_bit_common+0x1b5/0x4e0
> [ 300.183506] ? compaction_alloc+0x77/0x1150
> [ 300.183892] ? __pfx_wake_page_function+0x10/0x10
> [ 300.184304] folio_wait_bit+0x30/0x40
> [ 300.184640] folio_wait_writeback+0x2e/0x1e0
> [ 300.185034] migrate_pages_batch+0x555/0x1ac0
> [ 300.185462] ? __pfx_compaction_alloc+0x10/0x10
> [ 300.185808] ? __pfx_compaction_free+0x10/0x10
> [ 300.186022] ? __this_cpu_preempt_check+0x17/0x20
> [ 300.186234] ? lock_is_held_type+0xe6/0x140
> [ 300.186423] migrate_pages+0x100e/0x1180
> [ 300.186603] ? __pfx_compaction_free+0x10/0x10
> [ 300.186800] ? __pfx_compaction_alloc+0x10/0x10
> [ 300.187011] compact_zone+0xe10/0x1b50
> [ 300.187182] ? lock_is_held_type+0xe6/0x140
> [ 300.187374] ? check_preemption_disabled+0x80/0xf0
> [ 300.187588] compact_node+0xa3/0x100
> [ 300.187755] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
> [ 300.187993] ? _find_first_bit+0x7b/0x90
> [ 300.188171] sysctl_compaction_handler+0x5d/0xb0
> [ 300.188376] proc_sys_call_handler+0x29d/0x420
> [ 300.188583] proc_sys_write+0x2b/0x40
> [ 300.188749] vfs_write+0x3a3/0x780
> [ 300.188912] ksys_write+0xb7/0x180
> [ 300.189070] __x64_sys_write+0x26/0x30
> [ 300.189260] do_syscall_64+0x3b/0x90
> [ 300.189424] entry_SYSCALL_64_after_hwframe+0x72/0xdc
> [ 300.189654] RIP: 0033:0x7f3a2471f59d
> [ 300.189815] RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001
> [ 300.190137] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d
> [ 300.190397] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
> [ 300.190653] RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010
> [ 300.190910] R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0
> [ 300.191172] R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000
> [ 300.191440] </TASK>
> ...
>
> To migrate a folio, we may wait for the writeback of a folio to
> complete while we already hold the locks of some other folios. But
> the writeback code may in turn wait to lock a folio whose lock we
> hold. This causes the deadlock. To fix the issue, avoid waiting for
> writeback to complete if we have already locked some folios. After
> the locked folios have been moved and unlocked, we will retry.
>
> Signed-off-by: "Huang, Ying" <[email protected]>
> Reported-by: "Xu, Pengfei" <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Stefan Roesch <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Xin Hao <[email protected]>
> Cc: Zi Yan <[email protected]>
> Cc: Yang Shi <[email protected]>
> Cc: Baolin Wang <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Mike Kravetz <[email protected]>
> ---
> mm/migrate.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 28b435cdeac8..bc9a8050f1b0 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1205,6 +1205,18 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
> }
> if (!force)
> goto out;
> + /*
> + * We have locked some folios and are going to wait the
> + * writeback of this folio to complete. But it's possible for
> + * the writeback to wait to lock the folios we have locked. To
> + * avoid a potential deadlock, let's bail out and not do that.
> + * The locked folios will be moved and unlocked, then we
> + * can wait the writeback of this folio.
> + */
> + if (avoid_force_lock) {
> + rc = -EDEADLOCK;
> + goto out;
> + }
> folio_wait_writeback(src);
> }
>
> --
> 2.39.1
>
>

2023-02-21 03:35:47

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH -v5 0/9] migrate_pages(): batch TLB flushing

Hugh Dickins <[email protected]> writes:

> On Mon, 20 Feb 2023, Huang, Ying wrote:
>
>> Hi, Hugh,
>>
>> Hugh Dickins <[email protected]> writes:
>>
>> > On Mon, 13 Feb 2023, Huang Ying wrote:
>> >
>> >> From: "Huang, Ying" <[email protected]>
>> >>
>> >> Now, migrate_pages() migrate folios one by one, like the fake code as
>> >> follows,
>> >>
>> >> for each folio
>> >> unmap
>> >> flush TLB
>> >> copy
>> >> restore map
>> >>
>> >> If multiple folios are passed to migrate_pages(), there are
>> >> opportunities to batch the TLB flushing and copying. That is, we can
>> >> change the code to something as follows,
>> >>
>> >> for each folio
>> >> unmap
>> >> for each folio
>> >> flush TLB
>> >> for each folio
>> >> copy
>> >> for each folio
>> >> restore map
>> >>
>> >> The total number of TLB flushing IPI can be reduced considerably. And
>> >> we may use some hardware accelerator such as DSA to accelerate the
>> >> folio copying.
>> >>
>> >> So in this patch, we refactor the migrate_pages() implementation and
>> >> implement the TLB flushing batching. Base on this, hardware
>> >> accelerated folio copying can be implemented.
>> >>
>> >> If too many folios are passed to migrate_pages(), in the naive batched
>> >> implementation, we may unmap too many folios at the same time. The
>> >> possibility for a task to wait for the migrated folios to be mapped
>> >> again increases. So the latency may be hurt. To deal with this
>> >> issue, the max number of folios be unmapped in batch is restricted to
>> >> no more than HPAGE_PMD_NR in the unit of page. That is, the influence
>> >> is at the same level of THP migration.
>> >>
>> >> We use the following test to measure the performance impact of the
>> >> patchset,
>> >>
>> >> On a 2-socket Intel server,
>> >>
>> >> - Run pmbench memory accessing benchmark
>> >>
>> >> - Run `migratepages` to migrate pages of pmbench between node 0 and
>> >> node 1 back and forth.
>> >>
>> >> With the patch, the TLB flushing IPI reduces 99.1% during the test and
>> >> the number of pages migrated successfully per second increases 291.7%.
>> >>
>> >> Xin Hao helped to test the patchset on an ARM64 server with 128 cores,
>> >> 2 NUMA nodes. Test results show that the page migration performance
>> >> increases up to 78%.
>> >>
>> >> This patchset is based on mm-unstable 2023-02-10.
>> >
>> > And back in linux-next this week: I tried next-20230217 overnight.
>> >
>> > There is a deadlock in this patchset (and in previous versions: sorry
>> > it's taken me so long to report), but I think one that's easily solved.
>> >
>> > I've not bisected to precisely which patch (load can take several hours
>> > to hit the deadlock), but it doesn't really matter, and I expect that
>> > you can guess.
>> >
>> > My root and home filesystems are ext4 (4kB blocks with 4kB PAGE_SIZE),
>> > and so is the filesystem I'm testing, ext4 on /dev/loop0 on tmpfs.
>> > So, plenty of ext4 page cache and buffer_heads.
>> >
>> > Again and again, the deadlock is seen with buffer_migrate_folio_norefs(),
>> > either in kcompactd0 or in khugepaged trying to compact, or in both:
>> > it ends up calling __lock_buffer(), and that schedules away, waiting
>> > forever to get BH_lock. I have not identified who is holding BH_lock,
>> > but I imagine a jbd2 journalling thread, and presume that it wants one
>> > of the folio locks which migrate_pages_batch() is already holding; or
>> > maybe it's all more convoluted than that. Other tasks then back up
>> > waiting on those folio locks held in the batch.
>> >
>> > Never a problem with buffer_migrate_folio(), always with the "more
>> > careful" buffer_migrate_folio_norefs(). And the patch below fixes
>> > it for me: I've had enough hours with it now, on enough occasions,
>> > to be confident of that.
>> >
>> > Cc'ing Jan Kara, who knows buffer_migrate_folio_norefs() and jbd2
>> > very well, and I hope can assure us that there is an understandable
>> > deadlock here, from holding several random folio locks, then trying
>> > to lock buffers. Cc'ing fsdevel, because there's a risk that mm
>> > folk think something is safe, when it's not sufficient to cope with
>> > the diversity of filesystems. I hope nothing more than the below is
>> > needed (and I've had no other problems with the patchset: good job),
>> > but cannot be sure.
>> >
>> > [PATCH next] migrate_pages: fix deadlock on buffer heads
>> >
>> > When __buffer_migrate_folio() is called from buffer_migrate_folio_norefs(),
>> > force MIGRATE_ASYNC mode so that buffer_migrate_lock_buffers() will only
>> > trylock_buffer(), failing with -EAGAIN as usual if that does not succeed.
>> >
>> > Signed-off-by: Hugh Dickins <[email protected]>
>> >
>> > --- next-20230217/mm/migrate.c
>> > +++ fixed/mm/migrate.c
>> > @@ -748,7 +748,8 @@ static int __buffer_migrate_folio(struct
>> > if (folio_ref_count(src) != expected_count)
>> > return -EAGAIN;
>> >
>> > - if (!buffer_migrate_lock_buffers(head, mode))
>> > + if (!buffer_migrate_lock_buffers(head,
>> > + check_refs ? MIGRATE_ASYNC : mode))
>> > return -EAGAIN;
>> >
>> > if (check_refs) {
>>
>> Thank you very much for pointing this out and for the fix patch. Today,
>> my colleague Pengfei reported a deadlock bug to me. It seems that we
>> cannot wait for writeback to complete when we have already locked some
>> folios. The patch below can fix that deadlock. I don't know whether it
>> is related to the deadlock you ran into. It appears that we should
>> avoid locking/waiting synchronously if we have locked more than one
>> folio.
>
> Thanks, I've checked now, on next-20230217 without my patch but
> with your patch below: it took a few hours, but still deadlocks
> as I described above, so it's not the same issue.
>
> Yes, that's a good principle, that we should avoid to lock/wait
> synchronously once we have locked one folio (hmm, above you say
> "more than one": I think we mean the same thing, we're just
> stating it differently, given how the code runs at present).
>
> I'm not a great fan of migrate_folio_unmap()'s arguments,
> "force" followed by "oh, but don't force" (but applaud the recent
> "avoid_force_lock" as much better than the original "force_lock").
> I haven't tried, but I wonder if you can avoid both those arguments,
> and both of these patches, by passing down an adjusted mode (perhaps
> MIGRATE_ASYNC, or perhaps a new mode) to all callees, once the first
> folio of a batch has been acquired (then restore to the original mode
> when starting a new batch).

Thanks for the suggestion! I think it's a good idea, but it will take
me some time to think about how to implement it.

> (My patch is weak in that it trylocks for buffer_head even on the
> first folio of a MIGRATE_SYNC norefs batch, although that has never
> given a problem historically: adjusting the mode after acquiring
> the first folio would correct that weakness.)

Further digging shows that this is related to the loop device. That
can be seen in the following trace,

[ 300.109786] INFO: task kworker/u4:0:9 blocked for more than 147 seconds.
[ 300.110616] Not tainted 6.2.0-rc4-kvm+ #1314
[ 300.111143] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 300.111985] task:kworker/u4:0 state:D stack:0 pid:9 ppid:2 flags:0x00004000
[ 300.112964] Workqueue: loop4 loop_rootcg_workfn
[ 300.113658] Call Trace:
[ 300.113996] <TASK>
[ 300.114315] __schedule+0x43b/0xd00
[ 300.114848] schedule+0x6a/0xf0
[ 300.115319] io_schedule+0x4a/0x80
[ 300.115860] folio_wait_bit_common+0x1b5/0x4e0
[ 300.116430] ? __pfx_wake_page_function+0x10/0x10
[ 300.116615] __filemap_get_folio+0x73d/0x770
[ 300.116790] shmem_get_folio_gfp+0x1fd/0xc80
[ 300.116963] shmem_write_begin+0x91/0x220
[ 300.117121] generic_perform_write+0x10e/0x2e0
[ 300.117312] __generic_file_write_iter+0x17e/0x290
[ 300.117498] ? generic_write_checks+0x12b/0x1a0
[ 300.117693] generic_file_write_iter+0x97/0x180
[ 300.117881] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 300.118087] do_iter_readv_writev+0x13c/0x210
[ 300.118256] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 300.118463] do_iter_write+0xf6/0x330
[ 300.118608] vfs_iter_write+0x46/0x70
[ 300.118754] loop_process_work+0x723/0xfe0
[ 300.118922] loop_rootcg_workfn+0x28/0x40
[ 300.119078] process_one_work+0x3cc/0x8d0
[ 300.119240] worker_thread+0x66/0x630
[ 300.119384] ? __pfx_worker_thread+0x10/0x10
[ 300.119551] kthread+0x153/0x190
[ 300.119681] ? __pfx_kthread+0x10/0x10
[ 300.119833] ret_from_fork+0x29/0x50
[ 300.119984] </TASK>

When a folio of the loop device is written back, the underlying shmem
needs to be written back too, which requires acquiring the lock of the
shmem folio, while that folio may already be locked by
migrate_pages_batch().

Your testing involves the loop device too. So is it related to loop?
For example, after the buffer head has been locked, is it possible
that locks on folios of the underlying file system are acquired?

Best Regards,
Huang, Ying

2023-02-21 04:30:51

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH -v5 0/9] migrate_pages(): batch TLB flushing

On Mon, Feb 20, 2023 at 06:48:38PM -0800, Hugh Dickins wrote:
> Yes, that's a good principle, that we should avoid to lock/wait
> synchronously once we have locked one folio (hmm, above you say
> "more than one": I think we mean the same thing, we're just
> stating it differently, given how the code runs at present).

I suspect the migrate page code is disobeying the locking ordering
rules for multiple folios. If two folios belong to the same file,
they must be locked by folio->index order, low to high. If two folios
belong to different files, they must be ordered by folio->mapping, the
mapping lowest in memory first. You can see this locking rule embedded
in vfs_lock_two_folios() in fs/remap_range.c.

I don't know what the locking rules are for two folios which are not file
folios, or for two folios when one is anonymous and the other belongs
to a file. Maybe it's the same; you can lock them ordered by ->mapping
first, then by ->index.

Or you can just trylock multiple folios and skip the ones that don't work.
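
A sketch of that ordering rule, for illustration only (this is not the
actual vfs_lock_two_folios(), and the anonymous case above is still an
open question):

static void lock_two_folios_ordered(struct folio *a, struct folio *b)
{
        /* Order by ->mapping (lowest address first), then by ->index. */
        if (a->mapping > b->mapping ||
            (a->mapping == b->mapping && a->index > b->index))
                swap(a, b);
        folio_lock(a);
        if (a != b)
                folio_lock(b);
}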

2023-02-21 04:39:39

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH -v5 0/9] migrate_pages(): batch TLB flushing

Matthew Wilcox <[email protected]> writes:

> On Mon, Feb 20, 2023 at 06:48:38PM -0800, Hugh Dickins wrote:
>> Yes, that's a good principle, that we should avoid to lock/wait
>> synchronously once we have locked one folio (hmm, above you say
>> "more than one": I think we mean the same thing, we're just
>> stating it differently, given how the code runs at present).
>
> I suspect the migrate page code is disobeying the locking ordering
> rules for multiple folios. If two folios belong to the same file,
> they must be locked by folio->index order, low to high. If two folios
> belong to different files, they must be ordered by folio->mapping, the
> mapping lowest in memory first. You can see this locking rule embedded
> in vfs_lock_two_folios() in fs/remap_range.c.
>
> I don't know what the locking rules are for two folios which are not file
> folios, or for two folios when one is anonymous and the other belongs
> to a file. Maybe it's the same; you can lock them ordered by ->mapping
> first, then by ->index.
>
> Or you can just trylock multiple folios and skip the ones that don't work.

Yes. We will only trylock when we have already locked some folios
(including just one), and retry locking only after the already-locked
folios have been processed and unlocked.
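
Something like the following sketch, for illustration (the helper is
made up; the real change still needs to be worked out):

static int batch_lock_folio(struct folio *folio, int nr_locked_in_batch)
{
        if (folio_trylock(folio))
                return 0;
        /* Folio locks already held in this batch: do not block. */
        if (nr_locked_in_batch)
                return -EAGAIN;
        /* First folio of the batch: blocking is still safe. */
        folio_lock(folio);
        return 0;
}

The -EAGAIN case would be handled like the existing -EDEADLOCK path:
move and unlock the folios already in the batch, then retry the
remaining folios.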

Best Regards,
Huang, Ying

2023-02-21 14:05:56

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH -v5 0/9] migrate_pages(): batch TLB flushing

Hugh Dickins <[email protected]> writes:

> On Mon, 20 Feb 2023, Huang, Ying wrote:
>
>> Hi, Hugh,
>>
>> Hugh Dickins <[email protected]> writes:
>>
>> > On Mon, 13 Feb 2023, Huang Ying wrote:
>> >
>> >> From: "Huang, Ying" <[email protected]>
>> >>
>> >> Now, migrate_pages() migrate folios one by one, like the fake code as
>> >> follows,
>> >>
>> >> for each folio
>> >> unmap
>> >> flush TLB
>> >> copy
>> >> restore map
>> >>
>> >> If multiple folios are passed to migrate_pages(), there are
>> >> opportunities to batch the TLB flushing and copying. That is, we can
>> >> change the code to something as follows,
>> >>
>> >> for each folio
>> >> unmap
>> >> for each folio
>> >> flush TLB
>> >> for each folio
>> >> copy
>> >> for each folio
>> >> restore map
>> >>
>> >> The total number of TLB flushing IPI can be reduced considerably. And
>> >> we may use some hardware accelerator such as DSA to accelerate the
>> >> folio copying.
>> >>
>> >> So in this patch, we refactor the migrate_pages() implementation and
>> >> implement the TLB flushing batching. Base on this, hardware
>> >> accelerated folio copying can be implemented.
>> >>
>> >> If too many folios are passed to migrate_pages(), in the naive batched
>> >> implementation, we may unmap too many folios at the same time. The
>> >> possibility for a task to wait for the migrated folios to be mapped
>> >> again increases. So the latency may be hurt. To deal with this
>> >> issue, the max number of folios be unmapped in batch is restricted to
>> >> no more than HPAGE_PMD_NR in the unit of page. That is, the influence
>> >> is at the same level of THP migration.
>> >>
>> >> We use the following test to measure the performance impact of the
>> >> patchset,
>> >>
>> >> On a 2-socket Intel server,
>> >>
>> >> - Run pmbench memory accessing benchmark
>> >>
>> >> - Run `migratepages` to migrate pages of pmbench between node 0 and
>> >> node 1 back and forth.
>> >>
>> >> With the patch, the TLB flushing IPI reduces 99.1% during the test and
>> >> the number of pages migrated successfully per second increases 291.7%.
>> >>
>> >> Xin Hao helped to test the patchset on an ARM64 server with 128 cores,
>> >> 2 NUMA nodes. Test results show that the page migration performance
>> >> increases up to 78%.
>> >>
>> >> This patchset is based on mm-unstable 2023-02-10.
>> >
>> > And back in linux-next this week: I tried next-20230217 overnight.
>> >
>> > There is a deadlock in this patchset (and in previous versions: sorry
>> > it's taken me so long to report), but I think one that's easily solved.
>> >
>> > I've not bisected to precisely which patch (load can take several hours
>> > to hit the deadlock), but it doesn't really matter, and I expect that
>> > you can guess.
>> >
>> > My root and home filesystems are ext4 (4kB blocks with 4kB PAGE_SIZE),
>> > and so is the filesystem I'm testing, ext4 on /dev/loop0 on tmpfs.
>> > So, plenty of ext4 page cache and buffer_heads.
>> >
>> > Again and again, the deadlock is seen with buffer_migrate_folio_norefs(),
>> > either in kcompactd0 or in khugepaged trying to compact, or in both:
>> > it ends up calling __lock_buffer(), and that schedules away, waiting
>> > forever to get BH_lock. I have not identified who is holding BH_lock,
>> > but I imagine a jbd2 journalling thread, and presume that it wants one
>> > of the folio locks which migrate_pages_batch() is already holding; or
>> > maybe it's all more convoluted than that. Other tasks then back up
>> > waiting on those folio locks held in the batch.
>> >
>> > Never a problem with buffer_migrate_folio(), always with the "more
>> > careful" buffer_migrate_folio_norefs(). And the patch below fixes
>> > it for me: I've had enough hours with it now, on enough occasions,
>> > to be confident of that.
>> >
>> > Cc'ing Jan Kara, who knows buffer_migrate_folio_norefs() and jbd2
>> > very well, and I hope can assure us that there is an understandable
>> > deadlock here, from holding several random folio locks, then trying
>> > to lock buffers. Cc'ing fsdevel, because there's a risk that mm
>> > folk think something is safe, when it's not sufficient to cope with
>> > the diversity of filesystems. I hope nothing more than the below is
>> > needed (and I've had no other problems with the patchset: good job),
>> > but cannot be sure.
>> >
>> > [PATCH next] migrate_pages: fix deadlock on buffer heads
>> >
>> > When __buffer_migrate_folio() is called from buffer_migrate_folio_norefs(),
>> > force MIGRATE_ASYNC mode so that buffer_migrate_lock_buffers() will only
>> > trylock_buffer(), failing with -EAGAIN as usual if that does not succeed.
>> >
>> > Signed-off-by: Hugh Dickins <[email protected]>
>> >
>> > --- next-20230217/mm/migrate.c
>> > +++ fixed/mm/migrate.c
>> > @@ -748,7 +748,8 @@ static int __buffer_migrate_folio(struct
>> > if (folio_ref_count(src) != expected_count)
>> > return -EAGAIN;
>> >
>> > - if (!buffer_migrate_lock_buffers(head, mode))
>> > + if (!buffer_migrate_lock_buffers(head,
>> > + check_refs ? MIGRATE_ASYNC : mode))
>> > return -EAGAIN;
>> >
>> > if (check_refs) {
>>
>> Thank you very much for pointing this out and the fix patch. Today, my
>> colleague Pengfei reported a deadlock bug to me. It seems that we
>> cannot wait the writeback to complete when we have locked some folios.
>> Below patch can fix that deadlock. I don't know whether this is related
>> to the deadlock you run into. It appears that we should avoid to
>> lock/wait synchronously if we have locked more than one folios.
>
> Thanks, I've checked now, on next-20230217 without my patch but
> with your patch below: it took a few hours, but still deadlocks
> as I described above, so it's not the same issue.
>
> Yes, that's a good principle, that we should avoid to lock/wait
> synchronously once we have locked one folio (hmm, above you say
> "more than one": I think we mean the same thing, we're just
> stating it differently, given how the code runs at present).
>
> I'm not a great fan of migrate_folio_unmap()'s arguments,
> "force" followed by "oh, but don't force" (but applaud the recent
> "avoid_force_lock" as much better than the original "force_lock").
> I haven't tried, but I wonder if you can avoid both those arguments,
> and both of these patches, by passing down an adjusted mode (perhaps
> MIGRATE_ASYNC, or perhaps a new mode) to all callees, once the first
> folio of a batch has been acquired (then restore to the original mode
> when starting a new batch).
>
> (My patch is weak in that it trylocks for buffer_head even on the
> first folio of a MIGRATE_SYNC norefs batch, although that has never
> given a problem historically: adjusting the mode after acquiring
> the first folio would correct that weakness.)

On second thought, I think that it may be better to provide a fix that
is as simple as possible first, and then work on the more complex fix we
discussed above. The simple fix is easy to review now, and we will have
more time to test and review the complex fix.

In the following fix, I disable migration batching except in
MIGRATE_ASYNC mode or for the split folios of a THP folio. After that, I
will work on the complex fix to enable migration batching for all modes.

What do you think about that?

Best Regards,
Huang, Ying

-------------------------------8<---------------------------------
From 8e475812eacd9f2eeac76776c2b1a17af3e59b89 Mon Sep 17 00:00:00 2001
From: Huang Ying <[email protected]>
Date: Tue, 21 Feb 2023 16:37:50 +0800
Subject: [PATCH] migrate_pages: fix deadlock in batched migration

Two deadlock bugs were reported for the migrate_pages() batching
series. Thanks Hugh and Pengfei! For example, one of the deadlocks is
shown in the following trace snippets,

INFO: task kworker/u4:0:9 blocked for more than 147 seconds.
Not tainted 6.2.0-rc4-kvm+ #1314
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u4:0 state:D stack:0 pid:9 ppid:2 flags:0x00004000
Workqueue: loop4 loop_rootcg_workfn
Call Trace:
<TASK>
__schedule+0x43b/0xd00
schedule+0x6a/0xf0
io_schedule+0x4a/0x80
folio_wait_bit_common+0x1b5/0x4e0
? __pfx_wake_page_function+0x10/0x10
__filemap_get_folio+0x73d/0x770
shmem_get_folio_gfp+0x1fd/0xc80
shmem_write_begin+0x91/0x220
generic_perform_write+0x10e/0x2e0
__generic_file_write_iter+0x17e/0x290
? generic_write_checks+0x12b/0x1a0
generic_file_write_iter+0x97/0x180
? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
do_iter_readv_writev+0x13c/0x210
? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
do_iter_write+0xf6/0x330
vfs_iter_write+0x46/0x70
loop_process_work+0x723/0xfe0
loop_rootcg_workfn+0x28/0x40
process_one_work+0x3cc/0x8d0
worker_thread+0x66/0x630
? __pfx_worker_thread+0x10/0x10
kthread+0x153/0x190
? __pfx_kthread+0x10/0x10
ret_from_fork+0x29/0x50
</TASK>

INFO: task repro:1023 blocked for more than 147 seconds.
Not tainted 6.2.0-rc4-kvm+ #1314
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:repro state:D stack:0 pid:1023 ppid:360 flags:0x00004004
Call Trace:
<TASK>
__schedule+0x43b/0xd00
schedule+0x6a/0xf0
io_schedule+0x4a/0x80
folio_wait_bit_common+0x1b5/0x4e0
? compaction_alloc+0x77/0x1150
? __pfx_wake_page_function+0x10/0x10
folio_wait_bit+0x30/0x40
folio_wait_writeback+0x2e/0x1e0
migrate_pages_batch+0x555/0x1ac0
? __pfx_compaction_alloc+0x10/0x10
? __pfx_compaction_free+0x10/0x10
? __this_cpu_preempt_check+0x17/0x20
? lock_is_held_type+0xe6/0x140
migrate_pages+0x100e/0x1180
? __pfx_compaction_free+0x10/0x10
? __pfx_compaction_alloc+0x10/0x10
compact_zone+0xe10/0x1b50
? lock_is_held_type+0xe6/0x140
? check_preemption_disabled+0x80/0xf0
compact_node+0xa3/0x100
? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
? _find_first_bit+0x7b/0x90
sysctl_compaction_handler+0x5d/0xb0
proc_sys_call_handler+0x29d/0x420
proc_sys_write+0x2b/0x40
vfs_write+0x3a3/0x780
ksys_write+0xb7/0x180
__x64_sys_write+0x26/0x30
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7f3a2471f59d
RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010
R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0
R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000
</TASK>

The page migration task holds the lock of the shmem folio A, and is
waiting for the writeback of folio B of the file system on the loop
block device to complete. Meanwhile, the loop worker task which writes
back folio B is waiting to lock the shmem folio A, because folio A
backs folio B in the loop device. Thus a deadlock is triggered.

In general, if we have locked any folios other than the one we are
migrating, it is not safe to wait synchronously, for example, to wait
for writeback to complete or to wait to lock a buffer head.

To fix the deadlock, this patch avoids batching page migration except
in MIGRATE_ASYNC mode or for the split folios of a THP folio. In
MIGRATE_ASYNC mode, synchronous waiting is avoided; and there is no
dependency among the split folios of a THP folio.

The fix can be improved by converting the migration mode from
synchronous to asynchronous once we have locked folios other than the
one being migrated. We will do that in the near future.

Link: https://lore.kernel.org/linux-mm/[email protected]/
Link: https://lore.kernel.org/linux-mm/[email protected]/
Signed-off-by: "Huang, Ying" <[email protected]>
Reported-by: Hugh Dickins <[email protected]>
Reported-by: "Xu, Pengfei" <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Stefan Roesch <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Xin Hao <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mike Kravetz <[email protected]>
---
mm/migrate.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index ef68a1aff35c..bc04c34543f3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1937,7 +1937,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
{
int rc, rc_gather;
- int nr_pages;
+ int nr_pages, batch;
struct folio *folio, *folio2;
LIST_HEAD(folios);
LIST_HEAD(ret_folios);
@@ -1951,6 +1951,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
mode, reason, &stats, &ret_folios);
if (rc_gather < 0)
goto out;
+
+ if (mode == MIGRATE_ASYNC)
+ batch = NR_MAX_BATCHED_MIGRATION;
+ else
+ batch = 1;
again:
nr_pages = 0;
list_for_each_entry_safe(folio, folio2, from, lru) {
@@ -1961,11 +1966,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
}

nr_pages += folio_nr_pages(folio);
- if (nr_pages > NR_MAX_BATCHED_MIGRATION)
+ if (nr_pages >= batch)
break;
}
- if (nr_pages > NR_MAX_BATCHED_MIGRATION)
- list_cut_before(&folios, from, &folio->lru);
+ if (nr_pages >= batch)
+ list_cut_before(&folios, from, &folio2->lru);
else
list_splice_init(from, &folios);
rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
--
2.39.1


2023-02-21 22:15:53

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH -v5 0/9] migrate_pages(): batch TLB flushing

On Tue, 21 Feb 2023, Huang, Ying wrote:
> Hugh Dickins <[email protected]> writes:
> > On Mon, 20 Feb 2023, Huang, Ying wrote:
> >
> >> Hi, Hugh,
> >>
> >> Hugh Dickins <[email protected]> writes:
> >>
> >> > And back in linux-next this week: I tried next-20230217 overnight.
> >> >
> >> > There is a deadlock in this patchset (and in previous versions: sorry
> >> > it's taken me so long to report), but I think one that's easily solved.
> >> >
> >> > I've not bisected to precisely which patch (load can take several hours
> >> > to hit the deadlock), but it doesn't really matter, and I expect that
> >> > you can guess.
> >> >
> >> > My root and home filesystems are ext4 (4kB blocks with 4kB PAGE_SIZE),
> >> > and so is the filesystem I'm testing, ext4 on /dev/loop0 on tmpfs.
> >> > So, plenty of ext4 page cache and buffer_heads.
> >> >
> >> > Again and again, the deadlock is seen with buffer_migrate_folio_norefs(),
> >> > either in kcompactd0 or in khugepaged trying to compact, or in both:
> >> > it ends up calling __lock_buffer(), and that schedules away, waiting
> >> > forever to get BH_lock. I have not identified who is holding BH_lock,
> >> > but I imagine a jbd2 journalling thread, and presume that it wants one
> >> > of the folio locks which migrate_pages_batch() is already holding; or
> >> > maybe it's all more convoluted than that. Other tasks then back up
> >> > waiting on those folio locks held in the batch.
> >> >
> >> > Never a problem with buffer_migrate_folio(), always with the "more
> >> > careful" buffer_migrate_folio_norefs(). And the patch below fixes
> >> > it for me: I've had enough hours with it now, on enough occasions,
> >> > to be confident of that.
> >> >
> >> > Cc'ing Jan Kara, who knows buffer_migrate_folio_norefs() and jbd2
> >> > very well, and I hope can assure us that there is an understandable
> >> > deadlock here, from holding several random folio locks, then trying
> >> > to lock buffers. Cc'ing fsdevel, because there's a risk that mm
> >> > folk think something is safe, when it's not sufficient to cope with
> >> > the diversity of filesystems. I hope nothing more than the below is
> >> > needed (and I've had no other problems with the patchset: good job),
> >> > but cannot be sure.
> >> >
> >> > [PATCH next] migrate_pages: fix deadlock on buffer heads
> >> >
> >> > When __buffer_migrate_folio() is called from buffer_migrate_folio_norefs(),
> >> > force MIGRATE_ASYNC mode so that buffer_migrate_lock_buffers() will only
> >> > trylock_buffer(), failing with -EAGAIN as usual if that does not succeed.
> >> >
> >> > Signed-off-by: Hugh Dickins <[email protected]>
> >> >
> >> > --- next-20230217/mm/migrate.c
> >> > +++ fixed/mm/migrate.c
> >> > @@ -748,7 +748,8 @@ static int __buffer_migrate_folio(struct
> >> > if (folio_ref_count(src) != expected_count)
> >> > return -EAGAIN;
> >> >
> >> > - if (!buffer_migrate_lock_buffers(head, mode))
> >> > + if (!buffer_migrate_lock_buffers(head,
> >> > + check_refs ? MIGRATE_ASYNC : mode))
> >> > return -EAGAIN;
> >> >
> >> > if (check_refs) {
> >>
> >> Thank you very much for pointing this out and the fix patch. Today, my
> >> colleague Pengfei reported a deadlock bug to me. It seems that we
> >> cannot wait the writeback to complete when we have locked some folios.
> >> Below patch can fix that deadlock. I don't know whether this is related
> >> to the deadlock you run into. It appears that we should avoid to
> >> lock/wait synchronously if we have locked more than one folios.
> >
> > Thanks, I've checked now, on next-20230217 without my patch but
> > with your patch below: it took a few hours, but still deadlocks
> > as I described above, so it's not the same issue.
> >
> > Yes, that's a good principle, that we should avoid to lock/wait
> > synchronously once we have locked one folio (hmm, above you say
> > "more than one": I think we mean the same thing, we're just
> > stating it differently, given how the code runs at present).
> >
> > I'm not a great fan of migrate_folio_unmap()'s arguments,
> > "force" followed by "oh, but don't force" (but applaud the recent
> > "avoid_force_lock" as much better than the original "force_lock").
> > I haven't tried, but I wonder if you can avoid both those arguments,
> > and both of these patches, by passing down an adjusted mode (perhaps
> > MIGRATE_ASYNC, or perhaps a new mode) to all callees, once the first
> > folio of a batch has been acquired (then restore to the original mode
> > when starting a new batch).
>
> Thanks for the suggestion! I think it's a good idea, but it will take me
> some time to think about how to implement it.
>
> > (My patch is weak in that it trylocks for buffer_head even on the
> > first folio of a MIGRATE_SYNC norefs batch, although that has never
> > given a problem historically: adjusting the mode after acquiring
> > the first folio would correct that weakness.)
>
> Further digging shows that this is related to the loop device. That can
> be seen in the following trace,
>
> [ 300.109786] INFO: task kworker/u4:0:9 blocked for more than 147 seconds.
> [ 300.110616] Not tainted 6.2.0-rc4-kvm+ #1314
> [ 300.111143] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 300.111985] task:kworker/u4:0 state:D stack:0 pid:9 ppid:2 flags:0x00004000
> [ 300.112964] Workqueue: loop4 loop_rootcg_workfn
> [ 300.113658] Call Trace:
> [ 300.113996] <TASK>
> [ 300.114315] __schedule+0x43b/0xd00
> [ 300.114848] schedule+0x6a/0xf0
> [ 300.115319] io_schedule+0x4a/0x80
> [ 300.115860] folio_wait_bit_common+0x1b5/0x4e0
> [ 300.116430] ? __pfx_wake_page_function+0x10/0x10
> [ 300.116615] __filemap_get_folio+0x73d/0x770
> [ 300.116790] shmem_get_folio_gfp+0x1fd/0xc80
> [ 300.116963] shmem_write_begin+0x91/0x220
> [ 300.117121] generic_perform_write+0x10e/0x2e0
> [ 300.117312] __generic_file_write_iter+0x17e/0x290
> [ 300.117498] ? generic_write_checks+0x12b/0x1a0
> [ 300.117693] generic_file_write_iter+0x97/0x180
> [ 300.117881] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> [ 300.118087] do_iter_readv_writev+0x13c/0x210
> [ 300.118256] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> [ 300.118463] do_iter_write+0xf6/0x330
> [ 300.118608] vfs_iter_write+0x46/0x70
> [ 300.118754] loop_process_work+0x723/0xfe0
> [ 300.118922] loop_rootcg_workfn+0x28/0x40
> [ 300.119078] process_one_work+0x3cc/0x8d0
> [ 300.119240] worker_thread+0x66/0x630
> [ 300.119384] ? __pfx_worker_thread+0x10/0x10
> [ 300.119551] kthread+0x153/0x190
> [ 300.119681] ? __pfx_kthread+0x10/0x10
> [ 300.119833] ret_from_fork+0x29/0x50
> [ 300.119984] </TASK>
>
> When a folio of the loop device is written back, the underlying shmem
> folio needs to be written back too, which requires acquiring the lock of
> that shmem folio. But that folio may already be locked by
> migrate_pages_batch().
>
> Your testing involves the loop device too. So is it related to loop?
> For example, after the buffer head has been locked, is it possible to
> acquire the lock of folios of the underlying file system?

To lock (other than trylock) a folio while holding a buffer head lock
would violate lock ordering: they are both bit spin locks, so lockdep
would not notice, but I'm sure any such violation would have been
caught long ago (unless on some very unlikely path).

There was nothing in the stack traces I looked at to implicate loop,
though admittedly I paid much more attention to the kcompactd0 and
khugepaged pattern which was emerging - the few times I looked at the
loop0 thread, IIRC it was either idle or waiting on the lower level.

If it had a stack trace like you show above, I would just have ignored
that as of little interest, as probably waiting on one of the migrate
batch's locked folios, which were themselves waiting on a buffer_head
lock somewhere. (And shmem itself doesn't use buffer_heads.)

But it's still an interesting idea that these deadlocks might be
related to loop: loop by its nature certainly has a propensity for
deadlocks, though I think those have all been ironed out years ago.

I've run my load overnight using a small ext4 partition instead of
ext4 on loop on tmpfs, and that has not deadlocked. I hesitate to
say that confirms your idea that loop is somehow causing the deadlock
- the timings will have changed, so I've no idea how long such a
modified load needs to run to give confidence; but it does suggest
that you're right.

Though I still think your principle, that we should avoid to lock/wait
synchronously once the batch holds one folio, is the safe principle to
follow anyway, whether or not the deadlocks can be pinned on loop.

Hugh

2023-02-21 22:25:54

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH -v5 0/9] migrate_pages(): batch TLB flushing

On Tue, 21 Feb 2023, Huang, Ying wrote:
>
> On second thought, I think that it may be better to provide a fix that
> is as simple as possible first, and then work on the more complex fix we
> discussed above. The simple fix is easy to review now, and we will have
> more time to test and review the complex fix.
>
> In the following fix, I disable migration batching except in
> MIGRATE_ASYNC mode or for the split folios of a THP folio. After that, I
> will work on the complex fix to enable migration batching for all modes.
>
> What do you think about that?

I don't think there's a need to rush in the wrong fix so quickly.
Your series was in (though sometimes out of) linux-next for some
while, without causing any widespread problems. Andrew did send
it to Linus yesterday, I expect he'll be pushing it out later today
or tomorrow, but I don't think it's going to cause big problems.
Aiming for a fix in -rc2 would be good. Why would it be complex?

Hugh


2023-02-22 01:04:31

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH -v5 0/9] migrate_pages(): batch TLB flushing

Hugh Dickins <[email protected]> writes:

> On Tue, 21 Feb 2023, Huang, Ying wrote:
>>
>> On second thought, I think that it may be better to provide a fix that
>> is as simple as possible first, and then work on the more complex fix we
>> discussed above. The simple fix is easy to review now, and we will have
>> more time to test and review the complex fix.
>>
>> In the following fix, I disable migration batching except in
>> MIGRATE_ASYNC mode or for the split folios of a THP folio. After that, I
>> will work on the complex fix to enable migration batching for all modes.
>>
>> What do you think about that?
>
> I don't think there's a need to rush in the wrong fix so quickly.
> Your series was in (though sometimes out of) linux-next for some
> while, without causing any widespread problems. Andrew did send
> it to Linus yesterday, I expect he'll be pushing it out later today
> or tomorrow, but I don't think it's going to cause big problems.
> Aiming for a fix in -rc2 would be good.

Sure, I will target a fix in -rc2. Thanks for the suggestion!

> Why would it be complex?

Now, I think the big picture could be,

	if (MIGRATE_ASYNC) {
		migrate_pages_batch(from,);
	} else {
		migrate_pages_batch(from,, MIGRATE_ASYNC,);
		list_for_each_entry_safe (folio,, from) {
			migrate_pages_batch(one_folio, , MIGRATE_SYNC,);
		}
	}

That is, for synchronous migration, try asynchronous batched migration
first, then fall back to synchronous migration one folio at a time.
This will make the retry logic easier to understand.
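
Spelled out a little more concretely (illustrative pseudocode only, with
elided arguments; the local list 'single' is hypothetical and this is
not the final patch):

	LIST_HEAD(single);

	if (mode == MIGRATE_ASYNC) {
		migrate_pages_batch(from, ..., mode, ...);
	} else {
		/* pass 1: batched, asynchronous, never waits for locks */
		migrate_pages_batch(from, ..., MIGRATE_ASYNC, ...);
		/* pass 2: remaining folios, synchronously, one by one */
		list_for_each_entry_safe(folio, folio2, from, lru) {
			list_move(&folio->lru, &single);
			migrate_pages_batch(&single, ..., MIGRATE_SYNC, ...);
		}
	}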

This needs some code change. Anyway, I will try to do that and show the
code.

Best Regards,
Huang, Ying

2023-02-27 11:06:21

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH -v5 0/9] migrate_pages(): batch TLB flushing

On Fri 17-02-23 13:47:48, Hugh Dickins wrote:
> And back in linux-next this week: I tried next-20230217 overnight.
>
> There is a deadlock in this patchset (and in previous versions: sorry
> it's taken me so long to report), but I think one that's easily solved.
>
> I've not bisected to precisely which patch (load can take several hours
> to hit the deadlock), but it doesn't really matter, and I expect that
> you can guess.
>
> My root and home filesystems are ext4 (4kB blocks with 4kB PAGE_SIZE),
> and so is the filesystem I'm testing, ext4 on /dev/loop0 on tmpfs.
> So, plenty of ext4 page cache and buffer_heads.
>
> Again and again, the deadlock is seen with buffer_migrate_folio_norefs(),
> either in kcompactd0 or in khugepaged trying to compact, or in both:
> it ends up calling __lock_buffer(), and that schedules away, waiting
> forever to get BH_lock. I have not identified who is holding BH_lock,
> but I imagine a jbd2 journalling thread, and presume that it wants one
> of the folio locks which migrate_pages_batch() is already holding; or
> maybe it's all more convoluted than that. Other tasks then back up
> waiting on those folio locks held in the batch.
>
> Never a problem with buffer_migrate_folio(), always with the "more
> careful" buffer_migrate_folio_norefs(). And the patch below fixes
> it for me: I've had enough hours with it now, on enough occasions,
> to be confident of that.
>
> Cc'ing Jan Kara, who knows buffer_migrate_folio_norefs() and jbd2
> very well, and I hope can assure us that there is an understandable
> deadlock here, from holding several random folio locks, then trying
> to lock buffers. Cc'ing fsdevel, because there's a risk that mm
> folk think something is safe, when it's not sufficient to cope with
> the diversity of filesystems. I hope nothing more than the below is
> needed (and I've had no other problems with the patchset: good job),
> but cannot be sure.

I suspect it can indeed be caused by the presence of the loop device as
Huang Ying has suggested. What filesystems using buffer_heads do is a
pattern like:

bh = page_buffers(loop device page cache page);
lock_buffer(bh);
submit_bh(bh);
- now on loop dev this ends up doing:
lo_write_bvec()
vfs_iter_write()
...
folio_lock(backing file folio);

So if migration code holds "backing file folio" lock and at the same time
waits for 'bh' lock (while trying to migrate loop device page cache page), it
is a deadlock.

Proposed solution of never waiting for locks in batched mode looks like a
sensible one to me...
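
Put side by side (a sketch of the inverted lock ordering only, combining
the pattern above with the migration side of Hugh's trace; not actual
code):

	/*
	 *  migration batch                     fs writeback through loop
	 *  ---------------                     -------------------------
	 *  folio_lock(backing file folio)      lock_buffer(bh of loop-dev page)
	 *  ...                                 submit_bh(bh)
	 *  while migrating the loop-dev          lo_write_bvec()
	 *  page cache page:                        vfs_iter_write()
	 *  lock_buffer(same bh)                      folio_lock(same backing folio)
	 *    -> blocks on the writeback task           -> blocks on the batch
	 */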

Honza


--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-02-28 01:14:40

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH -v5 0/9] migrate_pages(): batch TLB flushing

Hi, Honza,

Jan Kara <[email protected]> writes:

> On Fri 17-02-23 13:47:48, Hugh Dickins wrote:
>> And back in linux-next this week: I tried next-20230217 overnight.
>>
>> There is a deadlock in this patchset (and in previous versions: sorry
>> it's taken me so long to report), but I think one that's easily solved.
>>
>> I've not bisected to precisely which patch (load can take several hours
>> to hit the deadlock), but it doesn't really matter, and I expect that
>> you can guess.
>>
>> My root and home filesystems are ext4 (4kB blocks with 4kB PAGE_SIZE),
>> and so is the filesystem I'm testing, ext4 on /dev/loop0 on tmpfs.
>> So, plenty of ext4 page cache and buffer_heads.
>>
>> Again and again, the deadlock is seen with buffer_migrate_folio_norefs(),
>> either in kcompactd0 or in khugepaged trying to compact, or in both:
>> it ends up calling __lock_buffer(), and that schedules away, waiting
>> forever to get BH_lock. I have not identified who is holding BH_lock,
>> but I imagine a jbd2 journalling thread, and presume that it wants one
>> of the folio locks which migrate_pages_batch() is already holding; or
>> maybe it's all more convoluted than that. Other tasks then back up
>> waiting on those folio locks held in the batch.
>>
>> Never a problem with buffer_migrate_folio(), always with the "more
>> careful" buffer_migrate_folio_norefs(). And the patch below fixes
>> it for me: I've had enough hours with it now, on enough occasions,
>> to be confident of that.
>>
>> Cc'ing Jan Kara, who knows buffer_migrate_folio_norefs() and jbd2
>> very well, and I hope can assure us that there is an understandable
>> deadlock here, from holding several random folio locks, then trying
>> to lock buffers. Cc'ing fsdevel, because there's a risk that mm
>> folk think something is safe, when it's not sufficient to cope with
>> the diversity of filesystems. I hope nothing more than the below is
>> needed (and I've had no other problems with the patchset: good job),
>> but cannot be sure.
>
> I suspect it can indeed be caused by the presence of the loop device as
> Huang Ying has suggested. What filesystems using buffer_heads do is a
> pattern like:
>
> bh = page_buffers(loop device page cache page);
> lock_buffer(bh);
> submit_bh(bh);
> - now on loop dev this ends up doing:
> lo_write_bvec()
> vfs_iter_write()
> ...
> folio_lock(backing file folio);
>
> So if migration code holds "backing file folio" lock and at the same time
> waits for 'bh' lock (while trying to migrate loop device page cache page), it
> is a deadlock.
>
> Proposed solution of never waiting for locks in batched mode looks like a
> sensible one to me...

Thank you very much for the detailed explanation!

Best Regards,
Huang, Ying

2023-02-28 05:59:39

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH -v5 0/9] migrate_pages(): batch TLB flushing

On Tue, 28 Feb 2023, Huang, Ying wrote:
> Jan Kara <[email protected]> writes:
> > On Fri 17-02-23 13:47:48, Hugh Dickins wrote:
> >>
> >> Cc'ing Jan Kara, who knows buffer_migrate_folio_norefs() and jbd2
> >> very well, and I hope can assure us that there is an understandable
> >> deadlock here, from holding several random folio locks, then trying
> >> to lock buffers. Cc'ing fsdevel, because there's a risk that mm
> >> folk think something is safe, when it's not sufficient to cope with
> >> the diversity of filesystems. I hope nothing more than the below is
> >> needed (and I've had no other problems with the patchset: good job),
> >> but cannot be sure.
> >
> > I suspect it can indeed be caused by the presence of the loop device as
> > Huang Ying has suggested. What filesystems using buffer_heads do is a
> > pattern like:
> >
> > bh = page_buffers(loop device page cache page);
> > lock_buffer(bh);
> > submit_bh(bh);
> > - now on loop dev this ends up doing:
> > lo_write_bvec()
> > vfs_iter_write()
> > ...
> > folio_lock(backing file folio);
> >
> > So if migration code holds "backing file folio" lock and at the same time
> > waits for 'bh' lock (while trying to migrate loop device page cache page), it
> > is a deadlock.
> >
> > Proposed solution of never waiting for locks in batched mode looks like a
> > sensible one to me...
>
> Thank you very much for the detailed explanation!

Yes, thanks a lot, Jan, for elucidating the deadlocks.

I was running with the 1/3,2/3,3/3 series for 48 hours on two machines
at the weekend, and hit no problems with all of them on.

I did try to review them this evening, but there's too much for me to
take in there to give an Acked-by: but I'll ask a couple of questions.

Hugh